Collation roughly means being able to compare strings A and B and decide whether they are "the same".
The flip-the-bit option proposed above is an example of the lowercase-it-all approach (or uppercase-it-all, for the fans of Microsoft, haha).
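A minimal Python sketch of what lowercase-it-all via bit flipping amounts to, assuming ASCII input only. The function names are made up for illustration and are not from any existing codebase.

```python
def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters by setting bit 0x20; leave everything else alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            # In ASCII, 'A' (0x41) and 'a' (0x61) differ only in bit 0x20.
            out.append(chr(ord(ch) | 0x20))
        else:
            out.append(ch)
    return "".join(out)


def same_ignoring_ascii_case(a: str, b: str) -> bool:
    """Treat two strings as "the same" if they match after ASCII lowercasing."""
    return ascii_lower(a) == ascii_lower(b)
```

Note that anything outside A-Z passes through untouched, so "Ä" and "ä" still compare as different here.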
A simple example with dates: m/d/y vs d.m.y; that one can be solved by converting everything to a canonical "isodate" form, yyyy-mm-dd.
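As a rough sketch of the isodate idea in Python: parse each input with its known format, then compare the ISO strings. The helper name and the two format constants are assumptions for the example.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"  # m/d/y
EU_FORMAT = "%d.%m.%Y"  # d.m.y


def to_isodate(text: str, fmt: str) -> str:
    """Convert a date string in a known format to yyyy-mm-dd."""
    return datetime.strptime(text, fmt).date().isoformat()


# The same calendar day written two ways collapses to one canonical form.
same_day = to_isodate("03/04/2021", US_FORMAT) == to_isodate("04.03.2021", EU_FORMAT)
```

This only works because the source format is known up front; the canonical form does not help you guess which format a bare "03/04/2021" was written in.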
A harder example with numbers: 1.00 and 1,00 probably look the same to you. But 1_00 and 1,00 can also both mean 100; that can only be resolved with the help of the "context" of what 1,00 *probably* means by default.
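A small Python sketch of that "context": the caller has to say which character is the decimal separator, and everything else from a small set is treated as digit grouping. The function name and the separator set are assumptions, not a real locale API.

```python
from decimal import Decimal

SEPARATORS = ".,_ "  # characters this sketch knows how to interpret


def parse_number(text: str, decimal_sep: str = ".") -> Decimal:
    """Parse a number, given the context of which character marks decimals."""
    cleaned = []
    for ch in text:
        if ch == decimal_sep:
            cleaned.append(".")   # normalize the decimal mark
        elif ch in SEPARATORS:
            continue              # any other separator is just digit grouping
        else:
            cleaned.append(ch)
    return Decimal("".join(cleaned))
```

With this, `parse_number("1,00", decimal_sep=",")` is one, while `parse_number("1,00")` with the default context is one hundred: same string, different answer, depending purely on the assumed default.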
The same kind of thing exists for languages.
English gets you fairly close to what a "canonical form" of a grammar could be. Programming then narrows it down even more, using a subset with a number of strict rules (as with JSON) that make the "context-free parsing" part of it easy.
And that largely shapes what people consider languages to be in the first place.
Yet in, say, Russian, all the words are mutable. Some changes are mandatory, some depend on your taste. You can, say, change "copper-plate" depending on how or where you want to use it ("plateys", "coppah" etc. are kinda allowed, if you are willing). You can even say "iron-gears" in a few different popular ways, and it still means "gears" to any native speaker (same "root", think "gearz"). Maybe not something one would consider for collation, but even the "plurals formula" is fun; look up declension on the wiki for some more fun.
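For the curious, here is a Python sketch of the "plurals formula" as it is commonly written for Russian in gettext-style plural rules (three forms: one, few, many). The function names and the sample word are chosen for illustration.

```python
def russian_plural_index(n: int) -> int:
    """Return 0 (one), 1 (few) or 2 (many) for a whole number n."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1  # 2-4, 22-24, ... but not 12-14
    return 2      # everything else: 0, 5-20, 25-30, ...


def pluralize(n: int, forms: tuple) -> str:
    """Pick the word form matching n from (one, few, many)."""
    return f"{n} {forms[russian_plural_index(n)]}"


GEAR = ("шестерёнка", "шестерёнки", "шестерёнок")  # "gear" in the three forms
```

So `pluralize(21, GEAR)` picks the same form as 1, while 11 goes with 5. And this only covers counting; case declension multiplies the number of forms further.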
But CJK (the "glyphy langs") can take you further than that.
There, you can write things that sound the same and mean the same ...yet are not written the same way.
And there is more than one way to do that!
Not just "capital" letters, but whole "sets" of same-meaning glyphs.
And don't get me started on how they "mutate", compact or extend.
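To give a taste of the easy part only, here is a Python sketch for Japanese kana: Unicode NFKC normalization folds half-width and full-width variants together, and a fixed code point offset maps hiragana onto katakana. The function name is an assumption, and this does nothing for kanji variants or traditional vs simplified Chinese, which need real mapping tables.

```python
import unicodedata

HIRAGANA_START = 0x3041   # small a
HIRAGANA_END = 0x3096     # small ke
KATAKANA_OFFSET = 0x60    # distance from a hiragana to its katakana twin


def fold_kana_width(text: str) -> str:
    """Fold width variants (NFKC), then map hiragana onto katakana."""
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_START <= code <= HIRAGANA_END:
            out.append(chr(code + KATAKANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

With this, "ﾃｽﾄ" (half-width), "てすと" (hiragana) and "テスト" (katakana) all end up with the same key, even though they are three different byte sequences.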
As such, imagine a collation that allows you to have, say:
LDS -> Low density structure, or even GC -> (eh) Electronic-Circuit (yep, no Green in ANY form here, haha).
The ability to have the latter WOULD have been nice, especially if the user could control the mapping.
But I suppose that would have some fun effects on determinism and overall system complexity, hehe.
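A toy Python sketch of what a user-controlled mapping could look like: a table of aliases is applied after case folding, and the result is used as the comparison key. Everything here (`make_collation_key`, the table contents) is hypothetical and only meant to show the shape of the idea.

```python
def make_collation_key(aliases: dict):
    """Build a key function from a user-supplied alias -> meaning table."""
    table = {short.casefold(): full.casefold() for short, full in aliases.items()}

    def key(text: str) -> str:
        # Fold case and collapse runs of whitespace before the lookup.
        folded = " ".join(text.casefold().split())
        return table.get(folded, folded)

    return key


key = make_collation_key({
    "LDS": "Low density structure",
    "GC": "Electronic-Circuit",
})
```

Here `key("lds") == key("Low Density  Structure")` holds, but only for users who share this exact table. Two users with different tables will disagree about what is "the same", which is the determinism cost mentioned above.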
Yet for the CJK languages, having such an extensible vocab would be a killer feature in more ways than one.
And who knows, maybe there is already a go-to way to collate all that stuff in the cpp world; but sorry, I do not speak cpp fluently ;>
Well, even Hebrew lets you "omit" the vowels as you go. German has ẞ/ß (for "ss", see eszett on the wiki).
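Both of these can be approximated with the Python standard library, as a rough sketch: `str.casefold()` turns ß and ẞ into "ss", and dropping combining marks after NFD normalization removes Hebrew vowel points. The function name is made up, and the mark stripping is deliberately crude: it also removes accents in other scripts, which may be wrong for them.

```python
import unicodedata


def fold_marks_and_case(text: str) -> str:
    """Case-fold, decompose, and drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "shalom" with vowel points vs. without, spelled out as code points.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain = "\u05e9\u05dc\u05d5\u05dd"
```

So "Straße", "STRASSE" and "STRAẞE" share one key, and so do the pointed and unpointed Hebrew spellings. The catch is that the same rule also folds, say, "é" into "e", whether you wanted that or not.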
And most languages have some sort of irregular yet popular and widely used forms.
And there is a whole other issue of motors-vs-engines style collation, often seen when players using a non-English locale try to communicate in English.
So in short, making a perfect collation even for a known subset of the world's grammars is tough, even with any "reasonable" limit on what counts as "the same".
Thus a take of "Latin only, and the rest we do not care about" is "valid" and "understandable".
Definitely not a "call me a manager here!" kind of thing, I would say.