Friday Facts #204 - Another day, another optimisation

Jap2.0 · Post by **Jap2.0** » Tue Aug 22, 2017 1:21 am

DaveMcW wrote:
bobucles wrote:When multiplying and dividing extremely specific floating point decimals there is a very good chance you'll accumulate a "drift" from tiny rounding errors (reason number 907.0000000000000156 to hate floats). Why not use a simple list of zoom values? Or at least snap the zoom level when it should hit an integer.
I think a 99.9999999999% pixel perfect image is good enough.

Then you don't know the Factorio devs.

Patashu · Post by **Patashu** » Tue Aug 22, 2017 1:51 am

You can easily do something like zoom_amount = math.pow(2, zoom_level/7), where zoom_level is an integer and zoom_amount is the proportion you zoom the map by. At 0 you get 1, at 7 you get 2, at -7 you get 0.5 and so on. I am sure the Factorio devs already do something like this as it's very obvious, pow is fast and there's no way errors can creep in.

Zanthra · Post by **Zanthra** » Tue Aug 22, 2017 2:19 am

I had suspected for a while that the reason I could run somewhat larger factories at higher UPS than some other people was due to the large cache of the i7-980X processor I have (12MB), even though the actual performance falls quite far behind some of the more mainstream gaming processors like the i7-7700K (8MB). With the high end Ryzen processors like the R7 1700 having a cache of 16MB they sound like really nice processors if you are willing to spend money for performance in Factorio (although prefetching will alleviate some of that benefit). Still I feel that Intel's current direction of shrinking caches (when compared with the large caches seen on similar core count Xeon chips before they moved to the Xeon Scalable architecture) on their newer chips, with even the Factorio dev's new i9-7900X only having 13.75MB cache is a poor direction to move in without something like eDRAM as an L4 cache, at least in regards to video games where you usually have a certain set of data that needs to be updated every frame.

It would be most interesting to see how the prefetching affects performance on more limited systems such as laptop processors. I would expect to see a higher performance gain there.

TheTom · Post by **TheTom** » Tue Aug 22, 2017 1:23 pm

Zanthra wrote:I had suspected for a while that the reason I could run somewhat larger factories at higher UPS than some other people was due to the large cache of the i7-980X processor I have (12MB), even though the actual performance falls quite far behind some of the more mainstream gaming processors like the i7-7700K (8MB). With the high end Ryzen processors like the R7 1700 having a cache of 16MB they sound like really nice processors if you are willing to spend money for performance in Factorio (although prefetching will alleviate some of that benefit). Still I feel that Intel's current direction of shrinking caches (when compared with the large caches seen on similar core count Xeon chips before they moved to the Xeon Scalable architecture) on their newer chips, with even the Factorio dev's new i9-7900X only having 13.75MB cache is a poor direction to move in without something like eDRAM as an L4 cache, at least in regards to video games where you usually have a certain set of data that needs to be updated every frame.

It would be most interesting to see how the prefetching affects performance on more limited systems such as laptop processors. I would expect to see a higher performance gain there.

Possible. Missing locality and no prefetsch make this very possible. I look forward to see how a high performance processor handles that - just ordering my ThreadRipper build.

Max S. · Post by **Max S.** » Thu Aug 24, 2017 9:55 am

I did not read through the whole thread so this might have been raised already.

If you look into the memory optimization of structures, then you should also look at the padding bytes. E.g.:

Code: Select all

struct Item {double a; int b; };
sizeof(Item) == 16

The size of the data of item is 12 byte and the size of the structure is 16 due to padding bytes. For the structures you have mentioned this will not change much and this optimization is not necessary. But if you have a list of small structures. The memory can be greatly reduced.

I had an array of such structures in one of my applications and a quick and dirty fix was to use the packed attribute instead of an structure of arrays. E.g:

Code: Select all

struct Item {double a; int b; } __attribute__((packed));
sizeof(Item) == 12

This works with the intel and gcc compiler. For microsoft I do not know but there should be a similar option. How this will behave with linked lists I can not tell.

pleegwat · Post by **pleegwat** » Thu Aug 24, 2017 5:08 pm

@Max, you don't want to do that. The cost of the unaligned access of your double member (especially if it happens to cross a cache line boundary) is more expensive than the extra bandwidth used from padding.

Max S. · Post by **Max S.** » Fri Aug 25, 2017 8:37 am

@pleegwat, yes sure an Structure of Arrays is the more natural choice. But as a hack to see how it changes the memory it was quite effective.

mOoEyThEcOw · Post by **mOoEyThEcOw** » Fri Aug 25, 2017 2:57 pm

Max S. wrote:@pleegwat, yes sure an Structure of Arrays is the more natural choice. But as a hack to see how it changes the memory it was quite effective.

Except memory was never the problem; runtime performance is. Packing structs into unaligned values would decrease runtime performance. It's not relevant.

Max S. · Post by **Max S.** » Mon Aug 28, 2017 6:38 am

Ok, I give up. I just wanted to state a different type of memory/bandwidth optimization and therefore performance optimization. In my original post I did not feel it necessary to point out, that a structure of arrays is the most natural choice and after the "quick and dirty fix" has confirmed the bandwidth improvement, then the structure needs to be refactored for proper alignment and more improvements.

I also stated, that this change makes only sense for small structure. I already mentioned, that this makes no sense for the described structures.

As for the unaligned access. If your application is memory bandwidth limited, you can hide a lot inside these latencies.

Straw · Post by **Straw** » Fri Sep 15, 2017 5:45 pm

I was also surprised to see use of a linked list. The problem I see is not that the data for each entity needs to be loaded, but that the out of order execution cannot load ahead because it doesn't know the address of the next entity until the current one has loaded. Additionally, modern CPUs automatically prefetch when they detect sequential access patterns, making them even faster. I can see several approaches to deal with inactivity, and I suspect that they would all perform better than a linked list; of course I can't be sure, as I've never played with the code. What data structure are inactive objects currently stored in?

Activity flags would have lowish overhead, but would incur branch mispredictions while iterating. If you wanted to be clever, you could pack these in a bit vector or in low-order bits of a pointer to save space.

Keeping two separate vectors of objects or probably better pointers to objects, one for active, one for inactive, would remove the branch misprediction penalty due to activity flags, but require shuffling between vectors. However, I don't see an immediate problem with this, as when we are iterating its cheap to remove, and we can cheaply add objects to the end of a vector- do you need to preserve object update order?

As has been mentioned, the big gain in this area is often switching to SoA, getting better locality of reference and removing unpredictable virtual dispatches. However I can understand why you might not want to do this as it would probably require a big rewrite.

Thanks for the interesting info as usual!

Post by **Rseding91** » Fri Sep 15, 2017 7:31 pm

Straw wrote:I was also surprised to see use of a linked list. The problem I see is not that the data for each entity needs to be loaded, but that the out of order execution cannot load ahead because it doesn't know the address of the next entity until the current one has loaded. Additionally, modern CPUs automatically prefetch when they detect sequential access patterns, making them even faster. I can see several approaches to deal with inactivity, and I suspect that they would all perform better than a linked list; of course I can't be sure, as I've never played with the code. What data structure are inactive objects currently stored in?

Activity flags would have lowish overhead, but would incur branch mispredictions while iterating. If you wanted to be clever, you could pack these in a bit vector or in low-order bits of a pointer to save space.

Keeping two separate vectors of objects or probably better pointers to objects, one for active, one for inactive, would remove the branch misprediction penalty due to activity flags, but require shuffling between vectors. However, I don't see an immediate problem with this, as when we are iterating its cheap to remove, and we can cheaply add objects to the end of a vector- do you need to preserve object update order?

As has been mentioned, the big gain in this area is often switching to SoA, getting better locality of reference and removing unpredictable virtual dispatches. However I can understand why you might not want to do this as it would probably require a big rewrite.

Thanks for the interesting info as usual!

Storing entity pointers in a vector wouldn't help anything since now instead of loading memory of the entity you're loading memory of a vector of pointers and still the memory of the entity.

As it is now each node in the list is the entity itself. You load the node you've loaded the entity and the pointer to the next entity.

Entities that aren't active aren't stored in any list - they don't get touched. They simply exist on the map at a given location - like trees.

Straw · Post by **Straw** » Fri Sep 15, 2017 7:58 pm

Rseding91 wrote:
Storing entity pointers in a vector wouldn't help anything since now instead of loading memory of the entity you're loading memory of a vector of pointers and still the memory of the entity.

As it is now each node in the list is the entity itself. You load the node you've loaded the entity and the pointer to the next entity.

Entities that aren't active aren't stored in any list - they don't get touched. They simply exist on the map at a given location - like trees.

I don't see why it wouldn't help anything- we don't have to wait for the data of one entity to load before we can start loading the next one. The cost to load the pointers will be very small and probably hidden by automatic prefetching and cache lines (8 pointers per cache line!). With each entity having a pointer to the next, we have to wait for that address to arrive before we can load, so we will be memory latency bound, regardless of available bandwidth. If the object isn't in cache, this could be hundreds of cycles during which we might be stuck. If we have the addresses of the next few objects available, we can always start loading them while processing the current objects. Of course, by 'we' I really mean the CPU's out of order instruction scheduler, so this isn't something you have to code yourself.

Entities that aren't active must be stored somewhere, if not a list. Are you saying they are on a file? When you say they simply exist on the map, what does that mean in terms of memory layout?

ratchetfreak · Post by **ratchetfreak** » Sat Sep 16, 2017 12:24 pm

Rseding91 wrote: Storing entity pointers in a vector wouldn't help anything since now instead of loading memory of the entity you're loading memory of a vector of pointers and still the memory of the entity.

As it is now each node in the list is the entity itself. You load the node you've loaded the entity and the pointer to the next entity.

Entities that aren't active aren't stored in any list - they don't get touched. They simply exist on the map at a given location - like trees.

Except that pointers in a vector are tightly packed and sequential (few chache misses to the actual vector) and now you can look ahead 2 or 3 entities instead of just the one.

And even entities that just exist often need to be checked against collision, SOAing just the collision boxes per chunk you can check collisions 4 objects at a time and then loop over the hit pointers to the actual entities to handle the actual collision. I stole that idea from this presentation (pdf from page 45 on). You can also include a speed and timestamp offset to avoid needing to update the DB every tick for the logi bots.

sparr · Post by **sparr** » Sun Sep 17, 2017 5:55 pm

Was I the only person using mods to set perfect zoom levels? I started doing it for screenshots, and just kept doing it because it looked better. 25%, 50%, 100%, 150%, 200%, 300%, 400%

Factorio Forums

Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation

Re: Friday Facts #204 - Another day, another optimisation