File Format suggestion (base128 integers)
Posted: Sat Apr 02, 2016 9:41 am
I have one small suggestion about the format of your .dat files. One thing I noticed is that all of your strings appear to be length prefixed, which is generally a great thing. But I also noticed that every prefix is a 32-bit integer. I did a quick check and:
(in Python interpreter)
>>> with open('level.dat', 'rb') as f:
...     x = f.read()
...
>>> len(x)
12035350
>>> x.count(b'\x00')
3710742
So of the ~12MB in a level.dat file, 3.7MB is null bytes.
One trick I've seen used in lots of places is "base128" integers. The idea is to take the 8 bits of a single byte, use 7 of them for the value, and use the most significant bit only to say "there is another byte to be considered". This means that every string shorter than 128 bytes (i.e. most of them) needs only 1 prefix byte. That is a big deal when the string itself is only 10 bytes long ('technology'). So instead of:
"\x0a\x00\x00\x00technology" you end up with "\x0atechnology".
https://en.wikipedia.org/wiki/LEB128
Most often the "cost" of decoding a variable-length prefix is more than offset by the benefit of having less data to read. Also, since the strings themselves are variable length, you don't have fixed-size records that you could read into memory in one go or skip over to make the file seekable.
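To make the base128 idea concrete, here is a rough Python 3 sketch of how such a prefix could be written and read. The function names (encode_base128, decode_base128, pack_string, unpack_string) are made up for illustration and have nothing to do with the game's actual code. It assumes unsigned little-endian base128 as described on the LEB128 page, and that the current prefix is a little-endian 32-bit integer, which is what the "\x0a\x00\x00\x00" example suggests.

```python
import struct


def encode_base128(n):
    """Encode a non-negative int as unsigned little-endian base128 bytes."""
    if n < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # the 7 payload bits of this byte
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: another byte follows
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_base128(buf, pos=0):
    """Decode one base128 integer from buf at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def pack_string(s):
    """Hypothetical helper: base128 length prefix followed by the raw bytes."""
    return encode_base128(len(s)) + s


def unpack_string(buf, pos=0):
    """Hypothetical helper: read one prefixed string; return (bytes, next_pos)."""
    length, pos = decode_base128(buf, pos)
    return buf[pos:pos + length], pos + length


if __name__ == "__main__":
    old = struct.pack("<I", 10) + b"technology"  # 32-bit prefix, as assumed above
    new = pack_string(b"technology")
    print(len(old), len(new))  # 14 11
    print(unpack_string(new))
```

Lengths up to 127 take one prefix byte, lengths up to 16383 take two, and so on, so only very long strings would end up paying as much as the current fixed 4 bytes.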