File Format suggestion (base128 integers)

jameinel · Post by **jameinel** » Sat Apr 02, 2016 9:41 am

I have one small suggestion for how your file format .dat files are. One thing I noticed is that it appears all of your strings are length prefixed, which is generally a great thing. But I also noticed that they are all 32-bit integer length prefixed. I did a quick check and:
(in Python interpreter)

Code: Select all

>>> with open('level.dat') as f:
...   x = f.read()
...
>>> len(x)
12035350
>>> x.count('\x00')
3710742

So of the ~12MB of a level.dat file, 3.7MB of that is null characters.

One trick I've seen used in lots of places is "base128" integers. The idea is to take the 8 bits of a singe byte character, and treat 7 of them normally and the most significant bit is just used to say "and there is another byte to be considered". This means that all strings that are less than 128 bytes long (eg most of them) end up only taking 1 prefix byte. This is a big deal when your string itself is only 10 bytes long ('technology') So instead of:
"\x0a\x00\x00\x00technology" you end up with "\x0atechnology".

https://en.wikipedia.org/wiki/LEB128

Most often the "cost" of variable length prefix is more than offset by the benefit of less data to be read. Also, since strings themselves are variable length, it isn't like you have fixed size records that you can just read all of it into memory/skip over if you are making a seek-able file.

Post by **ssilk** » Sat Apr 02, 2016 2:07 pm

Why? The file is AFAIK zipped before saving and the devs can use more or less standard serialization of C++.

orzelek · Post by **orzelek** » Sat Apr 02, 2016 2:17 pm

ssilk wrote:Why? The file is AFAIK zipped before saving and the devs can use more or less standard serialization of C++.

It would help even if it's zipped. Those 0's are scattered all over so compression can do only so much.
I'm not sure about standard C++ serialization since afaik there is none. There is a boost library for sure.

jameinel · Post by **jameinel** » Sun Apr 03, 2016 5:19 am

2 factors (IMO)

The file is still smaller after compression. Not by a huge amount. If I just remove all of the "\x00\x00\x00" in the file (which would be a small integer followed by 3 null bytes for a 32-bit int), then I end up with
Code: Select all
```
           original   without-000   .zip   .zip-without-000
level.dat  12M        9.6M          4.6M   4.4M
```
So certainly it is less pronounced in the actual DEFLATEd content. But still
about 5% there.
Its less data just to read, compress/uncompress, and have to process.
Which should be slightly faster load and save times, and sync's with other
peers, etc.

There are other methods as well, for example if the integer is <255, use a
single byte, if it is >=255 you put 255 in the first byte, and then follow it
with 4-bytes of the normal 32-bit integer.

Once the string is longer than 255 bytes, the 4-byte overhead of the prefix
isn't a big deal. The main problem is all of them 5-20 byte strings that end
up with a 4-byte overhead.

My personal preference would actually be to use one of the self-describing
formats (protobuf https://developers.google.com/protocol- ... ocs/proto3,
JSON, BSON http://bsonspec.org/spec.html).
(Interestingly BSON also uses 4-bytes for string prefix, but protobuf uses
variable length encoding as standard for integer types with alternatives). But
some sort of standard means it is easy to parse with standard tools. And
Factorio isn't really a game that is worried people will cheat if they can
hack the save file. There is '/c game.player.insert' for all the cheating you
want to do.

Something like Protobuf gives you highly optimized reader/writer for several
different languages, built-in support for cross version compatibility, etc.
And having a format that is easily interoperated with means that 3rd party
tools can thrive. Like the map viewer, etc.

ratchetfreak · Post by **ratchetfreak** » Mon Apr 04, 2016 10:37 am

jameinel wrote:2 factors (IMO)
The file is still smaller after compression. Not by a huge amount. If I just remove all of the "\x00\x00\x00" in the file (which would be a small integer followed by 3 null bytes for a 32-bit int), then I end up with
Code: Select all
           original   without-000   .zip   .zip-without-000
level.dat  12M        9.6M          4.6M   4.4M
So certainly it is less pronounced in the actual DEFLATEd content. But still
about 5% there.
Its less data just to read, compress/uncompress, and have to process.
Which should be slightly faster load and save times, and sync's with other
peers, etc.
There are other methods as well, for example if the integer is <255, use a
single byte, if it is >=255 you put 255 in the first byte, and then follow it
with 4-bytes of the normal 32-bit integer.

Once the string is longer than 255 bytes, the 4-byte overhead of the prefix
isn't a big deal. The main problem is all of them 5-20 byte strings that end
up with a 4-byte overhead.

My personal preference would actually be to use one of the self-describing
formats (protobuf https://developers.google.com/protocol- ... ocs/proto3,
JSON, BSON http://bsonspec.org/spec.html).
(Interestingly BSON also uses 4-bytes for string prefix, but protobuf uses
variable length encoding as standard for integer types with alternatives). But
some sort of standard means it is easy to parse with standard tools. And
Factorio isn't really a game that is worried people will cheat if they can
hack the save file. There is '/c game.player.insert' for all the cheating you
want to do.

Something like Protobuf gives you highly optimized reader/writer for several
different languages, built-in support for cross version compatibility, etc.
And having a format that is easily interoperated with means that 3rd party
tools can thrive. Like the map viewer, etc.

reading 4 bytes and interpreting as an int is cheaper than reading a byte, checking the high bit and then branching to shift and read the next byte, goto check for high bit

Hexicube · Post by **Hexicube** » Mon Apr 04, 2016 4:15 pm

ratchetfreak wrote:reading 4 bytes and interpreting as an int is cheaper than reading a byte, checking the high bit and then branching to shift and read the next byte, goto check for high bit

This actually depends on the player's read speed and CPU speed, though typically it is in fact the opposite. It would take a pretty bad CPU to make it slower, and using the highest bit makes it super simple:

Code: Select all

boolean hasNext = current > 127
byte thisByte = current & 128

This assumes that the bytes are unsigned, if they're signed it's this instead:

Code: Select all

boolean hasNext = current < 0
if(hasNext) current += 256

Reading even a single extra byte takes longer than these operations do.

ratchetfreak · Post by **ratchetfreak** » Mon Apr 04, 2016 4:40 pm

Hexicube wrote:
ratchetfreak wrote:reading 4 bytes and interpreting as an int is cheaper than reading a byte, checking the high bit and then branching to shift and read the next byte, goto check for high bit
This actually depends on the player's read speed and CPU speed, though typically it is in fact the opposite. It would take a pretty bad CPU to make it slower, and using the highest bit makes it super simple:
Code: Select all
boolean hasNext = current > 127
byte thisByte = current & 128
This assumes that the bytes are unsigned, if they're signed it's this instead:
Code: Select all
boolean hasNext = current < 0
if(hasNext) current += 256
Reading even a single extra byte takes longer than these operations do.

There are usually 2 expensive operations on a typical modern day CPU: memory access cache miss and miss-predicted branch. Everything else is faster than those two.

Loading in the length in (conditionally) multiple steps will invite the miss-prediction. Cache miss is not an issue because the next byte to read would be read either way.

The code you suggest would end up being: (p is a pointer to a chunk of the file and the resulting string should be copied into a std::string dest)

Code: Select all

uint length = *p++;//load a byte into a register
boolean hasNext = length &0x80; // move the high bit in another register
while(hasnext){ //test other register
    length <<=7;
    int current = *p++; // load next byte into register
    length |= current &0x7f;
    hasnext  = current &0x80; // move  the high bit in another register
}
dest = std::string(p, length);
p+=length;

while without the var length encoding it would be

Code: Select all

uint length = (uint)*p; (load 4 bytes in a register)
p+=sizeof(uint);

dest = std::string(p, length);(read rest of string)
p+=length;

higher register pressure, more branches. All around worse performing code.

Hexicube · Post by **Hexicube** » Mon Apr 04, 2016 6:01 pm

ratchetfreak wrote:

Code: Select all

uint length = *p++;//load a byte into a register
boolean hasNext = length &0x80; // move the high bit in another register
while(hasnext){ //test other register
    length <<=7;
    int current = *p++; // load next byte into register
    length |= current &0x7f;
    hasnext  = current &0x80; // move  the high bit in another register
}
dest = std::string(p, length);
p+=length;

You don't need to fiddle around that much with storing results of operations, or even do that many:

Code: Select all

uint length = 0;
int current = 0; //Kept outside the loop to avoid repeat allocations.
do //"do...while" used because the loop must run at least once. Keeps it clean, though a regular "while" loop will suffice with no additional variables.
{
    current = *p++;
    length = (length << 7) + current & 0x7F;
}
while(current < 0); //current > 127 if unsigned
dest = std::string(p, length);
p+=length;

No variable creation is done inside the loop, and only 2 variables are created as opposed to 3. There's also one less bitwise operation, as you can sneak around it with a direct comparison instead.

I'd still bet on it being a bit faster, too. Also, since you're expecting many string reads, you can easily move "length" and "current" outside the string read segment, so that they can be used for any varint reads.

ratchetfreak · Post by **ratchetfreak** » Mon Apr 04, 2016 11:30 pm

Allocating locals is a noop especially dumb value types like the int primitive. The compiler will pull out the "allocation" all the way to the top of the function in the preamble where it subtracts the stack pointer with a constant anyway (increasing the constant to the size of all the variables that need spilling to the stack).

Also the "current = *p++;" needs an extra hidden op for the sign extension.

Frankly don't worry that much about local variables because your "length = (length << 7) + current & 0x7F;" statement will need a temporary to hold the value of "current & 0x7f" before it adds it to the length anyway (because you need to keep "current" for the test on the high bit).

Frankly arguing like this is moot without profiling it on -o2.

One last comment I'll make is that 65k length strings (only 2 bytes for the length) should be enough for everyone.

Factorio Forums

File Format suggestion (base128 integers)

File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)

Re: File Format suggestion (base128 integers)