When searching through the technology tree with Russian localization turned on in settings, the search is case-sensitive, which it probably shouldn't be.
Latin alphabet search is already case-insensitive.
Unicode search is case-sensitive
Re: Unicode search is case-sensitive
Unfortunately, doing a case-insensitive compare in Unicode is not a trivial thing to do. There are large multiplatform libraries that can handle that, but we don't feel this feature is important enough for us to add a dependency on such a library. I am sorry.
- Inserter
- Posts: 26
- Joined: Sat May 05, 2018 8:07 am
Re: Unicode search is case-sensitive
Let's see... Currently, Factorio has English, German, French, Italian, Korean, Spanish, Chinese, Russian, Japanese, Polish, Danish, Dutch, Finnish, Norwegian, Swedish, Hungarian, Czech, Romanian, Portuguese, and Ukrainian translations.
Japanese search is super complex - it would indeed require a large library to handle it; in fact, it would require a built-in Japanese dictionary! Most likely the best, or perhaps the only, way to achieve it is by cooperating with translators so that they add pronunciation info to each word (it's fine if it isn't added for every word; those could use the current search logic). I could understand why you're not doing that (personally, I'd do it anyway, as I consider i18n very important and wouldn't want those playing in other languages to have an inferior experience). I assume Chinese is similar to Japanese. English works properly already, and Korean doesn't require any capitalization, so it should work fine as well.
This leaves alphabets with diacritics, and Cyrillic alphabets. Unicode collation is pretty hard, so it would require a fairly big library to do properly.
However, Cyrillic in particular is super easy to handle. You could use std::locale if that works for you - it won't require any plumbing with ICU, just a few wchar_t conversions. Not doing that is just lazy in my opinion. You could simply iterate over Unicode character boundaries and check for the particular 37 values of Russian and Ukrainian capital letters; that wouldn't even require allocation!
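To make the "check the 37 capital letters" idea concrete, here is a minimal sketch. It is only an illustration under a few assumptions: the text is valid UTF-8, the 37 letters are the 33 Russian capitals plus the Ukrainian-only Ґ, Є, І, Ї, and the names `fold`, `next_code_point`, `matches_at` and `icontains` are made up for this example (they are not Factorio functions). It walks code points in place and never allocates.

```cpp
#include <cstddef>
#include <string_view>

// Lowercase ASCII plus the Russian and Ukrainian capital letters.
// Everything else is returned unchanged.
constexpr char32_t fold(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;      // A..Z -> a..z
    if (c >= 0x0410 && c <= 0x042F) return c + 0x20;  // А..Я -> а..я (32 letters)
    switch (c) {
        case 0x0401: return 0x0451;  // Ё -> ё
        case 0x0404: return 0x0454;  // Є -> є
        case 0x0406: return 0x0456;  // І -> і
        case 0x0407: return 0x0457;  // Ї -> ї
        case 0x0490: return 0x0491;  // Ґ -> ґ
        default:     return c;
    }
}

// Decode the code point starting at s[i] and advance i past it.
// Assumes valid UTF-8; a stray byte is simply returned as its own value.
inline char32_t next_code_point(std::string_view s, std::size_t& i) {
    const auto b0 = static_cast<unsigned char>(s[i++]);
    int extra = b0 >= 0xF0 ? 3 : b0 >= 0xE0 ? 2 : b0 >= 0xC0 ? 1 : 0;
    char32_t cp = extra == 0 ? b0 : (b0 & (0x3F >> extra));
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// True if `needle` matches `hay` at byte offset `start`, ignoring case.
inline bool matches_at(std::string_view hay, std::size_t start,
                       std::string_view needle) {
    std::size_t i = start, j = 0;
    while (j < needle.size()) {
        if (i >= hay.size()) return false;
        if (fold(next_code_point(hay, i)) != fold(next_code_point(needle, j))) {
            return false;
        }
    }
    return true;
}

// Case-insensitive substring search without any allocation.
inline bool icontains(std::string_view hay, std::string_view needle) {
    for (std::size_t start = 0; start <= hay.size(); ++start) {
        // Only start a match on a code point boundary, not on a continuation byte.
        if (start < hay.size() &&
            (static_cast<unsigned char>(hay[start]) & 0xC0) == 0x80) {
            continue;
        }
        if (matches_at(hay, start, needle)) return true;
    }
    return false;
}
```

With this, a call such as `icontains(item_name, search_text)` would match "КОНВ" against "Конвейер", provided both strings are UTF-8. Diacritics and other alphabets are deliberately out of scope here.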
I'll even go as far as to say not having proper search is a deal-breaker for me, and is one of the main reasons I never play in my native language.
Which is why, when I saw this reply, I implemented a lightweight C++ Unicode collation library that doesn't cover all of Unicode, but will definitely work for Cyrillic and most diacritics (all diacritics currently used in Factorio, if there isn't a bug somewhere).
It's made of two parts - a Python 3 script to generate the Unicode mapping, and a 250-line autogenerated C++ function that actually processes text according to the generated mapping. It uses std::string, but you can easily adapt it for any string type. I licensed it as 0BSD, so you can use it in Factorio without any licensing obligations (if you do end up using it, I'd be grateful if you credited me like you do with MIT libs, but that isn't required, since the library is really small). It's sad Wube doesn't consider it important - but I hope my implementation will make adding it easy enough to do despite it being low on the priority list.
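For readers who want a feel for the "generated mapping plus a function that applies it" shape, here is a rough sketch. It is not the library's actual code: the table `kCaseTable` is a hand-written, hypothetical excerpt standing in for generated output, and `to_lower`, `decode_one`, `append_utf8` and `fold_utf8` are names invented for this example. It assumes valid UTF-8 input.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

struct CaseMapping { char32_t upper; char32_t lower; };

// Hypothetical excerpt of a generated table, sorted by `upper`.
constexpr std::array<CaseMapping, 8> kCaseTable{{
    {0x00C4, 0x00E4},  // Ä -> ä
    {0x00C9, 0x00E9},  // É -> é
    {0x00D6, 0x00F6},  // Ö -> ö
    {0x00DC, 0x00FC},  // Ü -> ü
    {0x0141, 0x0142},  // Ł -> ł
    {0x0401, 0x0451},  // Ё -> ё
    {0x0416, 0x0436},  // Ж -> ж
    {0x0490, 0x0491},  // Ґ -> ґ
}};

// Binary-search the table; unmapped code points are returned unchanged.
inline char32_t to_lower(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    const auto it = std::lower_bound(
        kCaseTable.begin(), kCaseTable.end(), c,
        [](const CaseMapping& m, char32_t value) { return m.upper < value; });
    return (it != kCaseTable.end() && it->upper == c) ? it->lower : c;
}

// Decode one code point at text[i] and advance i (valid UTF-8 assumed).
inline char32_t decode_one(const std::string& text, std::size_t& i) {
    const auto b0 = static_cast<unsigned char>(text[i++]);
    int extra = b0 >= 0xF0 ? 3 : b0 >= 0xE0 ? 2 : b0 >= 0xC0 ? 1 : 0;
    char32_t cp = extra == 0 ? b0 : (b0 & (0x3F >> extra));
    while (extra-- > 0 && i < text.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(text[i++]) & 0x3F);
    }
    return cp;
}

// Append a code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Return a lowercased copy of `text`, suitable as a search key.
inline std::string fold_utf8(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        append_utf8(out, to_lower(decode_one(text, i)));
    }
    return out;
}
```

A search would then compare `fold_utf8(name)` against `fold_utf8(query)` with an ordinary `std::string::find`. A real generated table would of course list every letter pair the game's translations need, not just these eight.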
- Long Handed Inserter
- Posts: 77
- Joined: Tue Dec 01, 2020 6:57 pm
Re: Unicode search is case-sensitive
This post has been referenced or duplicated more than once, mostly by Russian (or Cyrillic-family) speakers.
For anyone implementing the "lowercase" collation in Lua land, I leave a minimal example of how I did it a few years ago while helping with the map-markers-contents search extension (aka the map tags search module, if you remember that name better).
[sar:]
ALSO, (on_console_command) > (commands)! since it does not "nil" the empty argument. as it should not. and is expected not to. period.
just use the "commands" for the helpstring and to prevent "UNKNOWN COMMAND, SIR!" printout
[/sar]
Anyways, here's the code.
Note that you can add the missing letters for your language either manually, or by running the included "autogen script" with the alphabet extended to your liking. To run such a script you can use the NodeJS REPL, or simply the browser console in devtools (yep, that simple).
To test that the .lua file "works", you could either:
- install Lua 5.2 (say, `sudo apt install lua5.2`) and run `lua ./ru_lowercase_example.lua` directly, or
- do it using RCON to test it on a running game map.
Do not attempt to run the code by pasting it into chat, however.
The chat is known to have "broken newlines": it eats them, which consequently makes any --comment "permanent" for the rest of the chunk.
That might NOT be something you really want.
You can either strip all the comments before pasting (any strip-dash-dash script would work), or make sure ALL your comments are in the form of --[[ ]].
^ as of [1.1.80]
And do not attempt to use editor snippets
Might be some old post bumping, but I put it here as it is the reference target.
Have fun with the Lua and let's wait for the Unicode-ready factory together! as of [1.1.80]
and as of 1.1.80, we are all doing a great and definitely quite stable job on that! (as of [1.1.80]). ((kappa))