Unicode search is case-sensitive

Post by **troonie** » Fri Jun 04, 2021 11:07 pm

When searching the technology tree with the Russian localization enabled in settings, the search is case-sensitive, which it probably shouldn't be.

Latin alphabet search is already case-insensitive.
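
A minimal sketch of how this symptom can arise, assuming (nothing in the thread confirms this) that the search lowercases text one byte at a time. In the default "C" locale, `std::tolower` only maps `A`-`Z`, so UTF-8 encoded Cyrillic passes through unchanged. The names `bytewise_lower` and `contains_ignoring_ascii_case` are hypothetical and are not Factorio code.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercases every byte with std::tolower.
// In the default "C" locale this only changes the ASCII letters A-Z.
std::string bytewise_lower(std::string text) {
    for (char& c : text) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return text;
}

// Hypothetical search: substring match after byte-wise lowercasing.
bool contains_ignoring_ascii_case(const std::string& haystack,
                                  const std::string& needle) {
    return bytewise_lower(haystack).find(bytewise_lower(needle)) != std::string::npos;
}
```

With these helpers, searching "Automation" for "AUTO" matches, while searching "Автоматизация" for "автом" does not, because the capital "А" and the lowercase "а" are different byte sequences that the byte-wise pass never touches.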

Post by **posila** » Sun Jun 06, 2021 7:12 pm

Unfortunately, doing a case-insensitive compare in Unicode is not a trivial thing to do. There are large multiplatform libraries that can handle that, but we don't feel like this feature is important enough for us to add a dependency on such a library. I am sorry.

Post by **pavlukivan** » Fri Jan 28, 2022 2:49 am

Let's see... Currently, Factorio has English, German, French, Italian, Korean, Spanish, Chinese, Russian, Japanese, Polish, Danish, Dutch, Finnish, Norwegian, Swedish, Hungarian, Czech, Romanian, Portuguese, and Ukrainian translations.

Japanese search is super complex - it would indeed require a large library to handle it, in fact it would require a built-in Japanese dictionary! Most likely, the best, or perhaps the only way to achieve it is by cooperating with translators to make them add pronunciation info to each word (It's fine if it isn't added for every word, those could use the current search logic). I could understand why you're not doing that (Personally, I'd do it anyway, as I consider i18n very important and wouldn't want those playing in other languages to have an inferior experience). I assume Chinese is similar to Japanese. English works properly already, Korean doesn't require any capitalization and should work fine as well.

This leaves alphabets with diacritics, and Cyrillic alphabets. Unicode collation is pretty hard, so it would require a fairly big library to do properly.

However, Cyrillic alphabets in particular are super easy to handle. You could use std::locale if that works for you: it won't require any plumbing with ICU, just a few wchar_t conversions. Not doing that is just lazy, in my opinion. You could also simply iterate over Unicode character boundaries and check for the 37 particular values of the Russian and Ukrainian capital letters, which wouldn't even require an allocation!
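
A minimal sketch of that last idea, assuming the text is valid UTF-8 held in a `std::string`. The function name `lowercase_cyrillic_in_place` is hypothetical and is not Factorio code. It relies on the fact that each of the 37 capitals and its lowercase form both take two bytes in UTF-8, so the string can be patched in place without changing its length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not Factorio code. Assumes `text` is valid UTF-8.
// Lowercases ASCII A-Z plus the 37 Russian and Ukrainian capitals in place.
void lowercase_cyrillic_in_place(std::string& text) {
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        // Sequence length from the lead byte; stray continuation bytes count as 1.
        const std::size_t len = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
        if (lead >= 'A' && lead <= 'Z') {
            text[i] = static_cast<char>(lead + 0x20);
        } else if ((lead == 0xD0 || lead == 0xD2) && i + 1 < n) {
            unsigned char trail = static_cast<unsigned char>(text[i + 1]);
            unsigned char new_lead = lead;
            if (lead == 0xD2) {
                if (trail == 0x90) trail = 0x91;      // Ґ -> ґ
            } else if (trail >= 0x90 && trail <= 0x9F) {
                trail += 0x20;                        // А..П -> а..п
            } else if (trail >= 0xA0 && trail <= 0xAF) {
                new_lead = 0xD1;                      // Р..Я -> р..я
                trail -= 0x20;
            } else if (trail == 0x81 || trail == 0x84 || trail == 0x86 || trail == 0x87) {
                new_lead = 0xD1;                      // Ё, Є, І, Ї -> ё, є, і, ї
                trail += 0x10;
            }
            text[i] = static_cast<char>(new_lead);
            text[i + 1] = static_cast<char>(trail);
        }
        i += len;
    }
    assert(text.size() == n);  // upper and lower forms have equal byte length
}
```

Calling it on both the search query and the candidate name before comparing them would give a case-insensitive match for these two alphabets. Any other non-ASCII text is left untouched.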

I'll even go as far as to say not having proper search is a deal-breaker for me, and is one of the main reasons I never play in my native language.

Which is why, when I saw this reply, I implemented a lightweight C++ Unicode collation library that doesn't cover all of Unicode, but will definitely work for Cyrillic and most diacritics (all diacritics currently used in Factorio, if there isn't a bug somewhere).

It's made of two parts - a Python 3 script to generate the Unicode mapping, and a 250-line autogenerated C++ function that actually processes text according to the generated mapping. It uses std::string, but you can easily adapt it for any string type. I licensed it as 0BSD, so you can use it in Factorio without any licensing obligations (if you do end up using it, I'd be grateful if you credited me like you do with MIT libs, but that isn't required, since the library is really small). It's sad Wube doesn't consider it important - but I hope my implementation will make adding it easy enough to do despite it being low on the priority list.
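
To illustrate the general shape of a table-driven approach, here is a minimal sketch. It is not the library described above: the table is a small hand-written stand-in for generated output, the names `CaseRange`, `kRanges`, `fold_code_point` and `fold_case_utf8` are hypothetical, and it only performs case mapping for code points below U+0800 (one- and two-byte UTF-8 sequences), copying everything else verbatim. Valid UTF-8 input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One contiguous run of capitals and the offset to their lowercase forms.
struct CaseRange { char32_t first; char32_t last; char32_t delta; };

// Hand-written stand-in for a generated table (an assumption for illustration).
constexpr CaseRange kRanges[] = {
    {0x0041, 0x005A, 0x20},  // A..Z
    {0x00C0, 0x00D6, 0x20},  // À..Ö
    {0x00D8, 0x00DE, 0x20},  // Ø..Þ
    {0x0400, 0x040F, 0x50},  // Ѐ..Џ (includes Ё, Є, І, Ї)
    {0x0410, 0x042F, 0x20},  // А..Я
    {0x0490, 0x0490, 0x01},  // Ґ
};

char32_t fold_code_point(char32_t cp) {
    for (const CaseRange& r : kRanges) {
        if (cp >= r.first && cp <= r.last) return cp + r.delta;
    }
    return cp;
}

std::string fold_case_utf8(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size();) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        if (lead < 0x80) {  // ASCII
            out += static_cast<char>(fold_code_point(lead));
            i += 1;
        } else if (lead >= 0xC2 && lead <= 0xDF && i + 1 < text.size()) {
            // Two-byte sequence: decode, map, re-encode.
            const unsigned char trail = static_cast<unsigned char>(text[i + 1]);
            const char32_t cp = fold_code_point(((lead & 0x1Fu) << 6) | (trail & 0x3Fu));
            assert(cp >= 0x80 && cp < 0x800);  // table keeps results in two-byte range
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
            i += 2;
        } else {  // longer sequences are copied byte by byte, unchanged
            out += text[i];
            i += 1;
        }
    }
    return out;
}
```

Supporting another alphabet is then a matter of adding rows to the table, which is the part a generator script would produce.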

Post by **KeepResearchinSpoons** » Fri Jul 07, 2023 3:00 am

This post has been referenced or duplicated more than once, mostly by Russian (or Cyrillic-family) speakers.

For anyone implementing "lowercase" collation in Lua land, I leave a minimal example of how I did it a few years ago while helping with the map-markers-contents search extension (aka the map tags search module, if you remember that name better).

[sar:]
ALSO, (on_console_command) > (commands)! since it does not "nil" the empty argument. as it should not. and is expected not to. period.
just use the "commands" for the helpstring and to prevent "UNKNOWN COMMAND, SIR!" printout
[/sar]

Anyways, here's the code:

-- @@2023-07-07 release
-- LICENSED UNDER CC0

-- stylua: ignore start
--[[
js autogen:
var a ="АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
var q = a + a.toLowerCase();
var ret = q.split('').map(L=>`["${L}"]="${L.toLowerCase()}"`).join(", ")
console.log(ret)

second lcased half is not "strictly" required in this example though;
]]
local cheat_sheet_map_RU = {
  ["А"]="а", ["Б"]="б", ["В"]="в", ["Г"]="г", ["Д"]="д", ["Е"]="е", ["Ё"]="ё", ["Ж"]="ж", ["З"]="з", ["И"]="и", ["Й"]="й", ["К"]="к", ["Л"]="л", ["М"]="м", ["Н"]="н", ["О"]="о", ["П"]="п", ["Р"]="р", ["С"]="с", ["Т"]="т", ["У"]="у", ["Ф"]="ф", ["Х"]="х", ["Ц"]="ц", ["Ч"]="ч", ["Ш"]="ш", ["Щ"]="щ", ["Ъ"]="ъ", ["Ы"]="ы", ["Ь"]="ь", ["Э"]="э", ["Ю"]="ю", ["Я"]="я",
  ["а"]="а", ["б"]="б", ["в"]="в", ["г"]="г", ["д"]="д", ["е"]="е", ["ё"]="ё", ["ж"]="ж", ["з"]="з", ["и"]="и", ["й"]="й", ["к"]="к", ["л"]="л", ["м"]="м", ["н"]="н", ["о"]="о", ["п"]="п", ["р"]="р", ["с"]="с", ["т"]="т", ["у"]="у", ["ф"]="ф", ["х"]="х", ["ц"]="ц", ["ч"]="ч", ["ш"]="ш", ["щ"]="щ", ["ъ"]="ъ", ["ы"]="ы", ["ь"]="ь", ["э"]="э", ["ю"]="ю", ["я"]="я",
}
-- stylua: ignore end

local s_lower = string.lower
local s_sub = string.sub
local t_concat = table.concat

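-- Lowercases a UTF-8 string: ASCII via string.lower, Russian letters via the table above.
-- Every letter in the table is 2 bytes long in UTF-8, hence the 2-byte lookahead.
local function manual_lowercase(s)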
  local ret = {}
  local ret_len = 0

  local idx = 1
  local len_s = #s
  local cheat
  while idx <= len_s do
    cheat = cheat_sheet_map_RU[s_sub(s, idx, idx + 1)]
    if cheat then
      ret_len = ret_len + 1
      ret[ret_len] = cheat
      idx = idx + 1 -- since we consumed 2 "bytes"
    else
      ret_len = ret_len + 1
      ret[ret_len] = s_lower(s_sub(s, idx, idx))
    end
    idx = idx + 1
  end
  return t_concat(ret, "")
end

print(manual_lowercase("ЫХАЛО walks into the bar"))
print(manual_lowercase("then 1 ПРЫНЦ Машет Ему Лапкой"))
print(manual_lowercase('and lastly, "THE PARTY" beginz!'))

file: ru_lowercase_example.lua (1.98 KiB)

I am not sure why we went with substring, tbh. I'm also not sure whether gmatch or the other options were ever profiled or had any critical problems.
But anyways, in the end it worked and all the bois were happy.

Note that you can add the missing letters for your language either manually or by running the included "autogen script" with the alphabet extended to your liking.
To run the script, you can use the Node.js REPL or simply the browser console in devtools (yep, that simple).

To test that the .lua file "works", you could either:
install Lua 5.2 (say, `sudo apt install lua5.2`) and run `lua ./ru_lowercase_example.lua` directly,
or
use rcon to test it on a running game map.

Do not attempt to run the code by pasting it into chat, however.
The chat is known to have "broken newlines": it eats them, which makes any --comment "permanent" for the rest of the chunk.
That might NOT be something you really want.
You can either strip all the comments before pasting (any strip-dash-dash script would work), or make sure ALL the comments are in the --[[ ]] form.
^ as of [1.1.80]

And do not attempt to use editor snippets

Might be some old post bumping, but I put it here as it is the reference target.

Have fun with the Lua and let's wait for the Unicode-ready factory together! as of [1.1.80]
and as of 1.1.80, we are all doing a great and definitely quite stable job on that! (as of [1.1.80]). ((kappa))

Factorio Forums
