(in Python interpreter)
>>> with open('level.dat', 'rb') as f:  # binary mode, since the file contains raw NUL bytes
...     x = f.read()
...
>>> len(x)
12035350
>>> x.count(b'\x00')
3710742
One trick I've seen used in lots of places is "base128" integers. The idea is to take the 8 bits of a single byte, treat 7 of them normally, and use the most significant bit just to say "there is another byte to come". This means that all strings shorter than 128 bytes (e.g. most of them) end up taking only 1 prefix byte instead of 4. That's a big deal when your string itself is only 10 bytes long ('technology'). So instead of:
"\x0a\x00\x00\x00technology" you end up with "\x0atechnology".
https://en.wikipedia.org/wiki/LEB128
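A minimal sketch of the unsigned flavour of that encoding (function names are mine, not from any particular library):

```python
def encode_varint(n):
    """Encode a non-negative int as an unsigned LEB128 ("base128") varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F        # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: another byte follows
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data):
    """Decode a varint from the start of data; return (value, bytes consumed)."""
    n = shift = 0
    for i, byte in enumerate(data):
        n |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return n, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

So a length of 10 encodes as the single byte b'\x0a', while a length of 300 needs two bytes (b'\xac\x02'), and only values of 2**28 and up need as many bytes as a fixed 32-bit prefix.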
Most often the "cost" of decoding a variable-length prefix is more than offset by the benefit of having less data to read. Also, since the strings themselves are variable length, you don't have fixed-size records anyway that you could read straight into memory or skip over if you were making a seekable file.