Unicode search is case-sensitive

Things that we don't consider worth fixing at this moment.

Post by troonie »

When searching through technology tree with Russian localization turned on in settings, the search is case-sensitive, which it probably shouldn't be.

Latin alphabet search is already case-insensitive.

Re: Unicode search is case-sensitive

Post by posila »

Unfortunately, doing a case-insensitive compare in Unicode is not a trivial thing to do. There are large multiplatform libraries that can handle that, but we don't feel like this feature is important enough for us to add a dependency on such a library. I am sorry.

Re: Unicode search is case-sensitive

Post by pavlukivan »

Let's see... Currently, Factorio has English, German, French, Italian, Korean, Spanish, Chinese, Russian, Japanese, Polish, Danish, Dutch, Finnish, Norwegian, Swedish, Hungarian, Czech, Romanian, Portuguese, and Ukrainian translations.

Japanese search is super complex. It would indeed require a large library to handle it; in fact, it would require a built-in Japanese dictionary! Most likely the best, or perhaps the only, way to achieve it is to cooperate with translators and have them add pronunciation info to each word (it's fine if it isn't added for every word; those could fall back to the current search logic). I could understand why you're not doing that (personally, I'd do it anyway, as I consider i18n very important and wouldn't want those playing in other languages to have an inferior experience). I assume Chinese is similar to Japanese. English works properly already, and Korean has no letter case, so it should work fine as well.

This leaves alphabets with diacritics, and Cyrillic alphabets. Unicode collation is pretty hard, so it would require a fairly big library to do properly.

However, Cyrillic in particular is super easy to handle. You could use std::locale if that works for you - it won't require any plumbing with ICU, just a few wchar_t conversions. Not doing that is just lazy in my opinion. Alternatively, you could simply iterate over Unicode character boundaries and check for the 37 particular values of Russian and Ukrainian capital letters; that wouldn't even require allocation!
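To illustrate the "iterate and check" idea, here is a minimal sketch, assuming the search text is valid UTF-8 held in a std::string. The names lower_cyrillic and lower_cyrillic_utf8 are made up for this example and are not Factorio code. It maps whole Unicode ranges rather than listing the 37 letters one by one, so it also lowercases a few non-Russian, non-Ukrainian capitals in U+0400..U+040F. Every affected letter is two bytes in UTF-8 both before and after, so the string can be rewritten in place.

```cpp
#include <cstddef>
#include <string>

// Map one code point to lowercase if it is a Cyrillic capital we care about.
// (Hypothetical helper for illustration.)
inline char32_t lower_cyrillic(char32_t cp) {
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20; // А..Я -> а..я
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50; // Ѐ..Џ -> ѐ..џ (covers Ё, Є, І, Ї)
    if (cp == 0x0490) return 0x0491;                    // Ґ -> ґ
    return cp;
}

// Lowercase Cyrillic capitals in a UTF-8 string in place, without allocating.
// Assumes valid UTF-8; bytes outside two-byte sequences are left untouched.
inline void lower_cyrillic_utf8(std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size();) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        // Two-byte sequence: 110xxxxx 10xxxxxx
        if ((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80) {
            char32_t cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
            cp = lower_cyrillic(cp);
            s[i] = static_cast<char>(0xC0 | (cp >> 6));
            s[i + 1] = static_cast<char>(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            ++i; // ASCII, continuation bytes, and longer sequences are skipped
        }
    }
}
```

Running both the query and the candidate name through lower_cyrillic_utf8 before comparing them would make a Cyrillic search case-insensitive; Latin letters are untouched here because, per the first post, they are already handled.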

I'll even go as far as to say not having proper search is a deal-breaker for me, and is one of the main reasons I never play in my native language.

Which is why, when I saw this reply, I implemented a lightweight C++ Unicode collation library that doesn't cover all of Unicode, but will definitely work for Cyrillic and most diacritics (all diacritics currently used in Factorio, if there isn't a bug somewhere).

It's made of two parts - a Python 3 script to generate the Unicode mapping, and a 250-line autogenerated C++ function that actually processes text according to the generated mapping. It uses std::string, but you can easily adapt it for any string type. I licensed it as 0BSD, so you can use it in Factorio without any licensing obligations (if you do end up using it, I'd be grateful if you credited me like you do with MIT libs, but that isn't required, since the library is really small). It's sad Wube doesn't consider it important - but I hope my implementation will make adding it easy enough to do despite it being low on the priority list.
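For readers wondering what a mapping-driven approach can look like in general, here is a rough sketch. It is not the library described above: the table entries, the FoldEntry type, and the fold, fold_utf8 and contains_folded functions are all hypothetical, and the table is a tiny hand-written sample rather than generated output. It assumes valid UTF-8 input and folds each code point through a sorted lookup table before doing a plain substring search.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

struct FoldEntry { char32_t from; char32_t to; };

// Hypothetical sample table, sorted by `from`. A real one would be generated.
constexpr FoldEntry kFoldTable[] = {
    {0x00C9, U'e'}, // É
    {0x00D6, U'o'}, // Ö
    {0x00E9, U'e'}, // é
    {0x00F6, U'o'}, // ö
    {0x0141, U'l'}, // Ł
    {0x0142, U'l'}, // ł
};

// Fold one code point: ASCII capitals by arithmetic, the rest via the table.
inline char32_t fold(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
    const auto* it = std::lower_bound(
        std::begin(kFoldTable), std::end(kFoldTable), cp,
        [](const FoldEntry& e, char32_t v) { return e.from < v; });
    if (it != std::end(kFoldTable) && it->from == cp) return it->to;
    return cp;
}

// Decode UTF-8 (assumed valid) into a sequence of folded code points.
inline std::u32string fold_utf8(std::string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        const int len = b < 0x80 ? 1
                      : (b >> 5) == 0x06 ? 2
                      : (b >> 4) == 0x0E ? 3
                      : (b >> 3) == 0x1E ? 4 : 1;
        char32_t cp = len == 1 ? b : (b & (0xFF >> (len + 1)));
        for (int k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(fold(cp));
        i += len;
    }
    return out;
}

// Case- and diacritic-insensitive substring test built on the folding above.
inline bool contains_folded(std::string_view haystack, std::string_view needle) {
    return fold_utf8(haystack).find(fold_utf8(needle)) != std::u32string::npos;
}
```

Whether diacritics should be stripped, as this sample table does, or only case-folded is a design choice; the sketch picks stripping purely for illustration.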

Re: Unicode search is case-sensitive

Post by KeepResearchinSpoons »

This post has been referenced or duplicated more than once, mostly by Russian (or Cyrillic-family) speakers.

For anyone implementing "lowercase" collation in Lua land, I'm leaving a minimal example of how I did it a few years ago while helping with the map-markers-contents search extension (aka the map tags search module, if you remember that name better :wink:).

[sar:]
ALSO, (on_console_command) > (commands)! since it does not "nil" the empty argument. as it should not. and is expected not to. period.
just use the "commands" for the helpstring and to prevent "UNKNOWN COMMAND, SIR!" printout
[/sar]

Anyways, here's the code.
Note that you can add the missing letters for your language either manually or by running the included "autogen script" with the alphabet extended to your liking.
To run that script you can use the Node.js REPL, or simply the browser console in devtools (yep, that simple).

To test that the .lua file "works", you could either:
install Lua 5.2 (say, `sudo apt install lua5.2`) and run `lua ./ru_lowercase_example.lua` directly,
or
use RCON to test it on a running game map.

Do not attempt to run the code by pasting it into chat, however.
The chat is known to have "broken newlines": it eats them, which makes any --comment "permanent" for the rest of the chunk.
That might NOT be something you really want.
You can either strip all the comments before pasting (any strip-dash-dash script would work), or make sure ALL your comments are in the --[[ ]] form.
^ as of [1.1.80]
And do not attempt to use editor snippets


This might be bumping an old post, but I put it here because this thread is the reference target.

Have fun with the Lua and let's wait for the Unicode-ready factory together! as of [1.1.80]
and as of 1.1.80, we are all doing a great and definitely quite stable job on that! (as of [1.1.80]). ((kappa))