coppercoil wrote: ↑Sun Dec 08, 2019 11:23 am
Nomnaut wrote: ↑Sun Dec 08, 2019 10:31 am
When talking about parallelization, if the tasks are indeed independent, why would they be competing for memory access?
Because memory access is the bottleneck.
[...]
I have an idea. Entities could have duplicated data: one regular structure for all entity data, and a reduced copy containing only a few compressed attributes (e.g. ID, X, Y, State and nothing more) that is used read-only. This second copy can be collected into a separate RAM chunk and optimized for cache-friendly loads. There can be several compact copies, grouped differently for different uses.
They are independent. The blog post just said that if you have tasks A, B and C, then merely executing them in parallel doesn't get you faster than the slowest of them.
About that "idea", that's what Factorio, and so many other games out there are already doing. Entity-component system, whereby for each behavior of an entity (e.g. liquid behavior of a pipe-entity), the entity itself only holds a pointer onto the "component" simulating that specific behavior. Then you have a "manager" for that component type, which deals with updating that specific system, which only looks at the components it's responsible for. All components of the same type are stored in linear memory, and are updated by the manager in just that order. Simple example, the "liquid" component of the pipe knows that it's adjacent to up to 4 other liquid-components (doesn't matter what they are, might be underground pipes, storage, whatever). So when it's the turn of that component instance to be updated, it doesn't even need to go via the spatial database for the world to find its neighbours via spatial database, it already is linked to exactly the other related components it needs to do its job.
The catch here is that the other "liquid" components it needs to query for the update are not necessarily in "ideal" memory locations either. If you look at component instance N and it's linked to A, B, C and D, then even if all five of them are liquid components, A, B, C and D are likely not in the same area of memory, and thereby also not currently in the cache. Reading from them still hits you with the full memory latency for every single one of them.
On the other hand, the next-to-be-processed entities N+1 and N+2 are likely already pre-loaded into the cache by the processor's own choice, as a trivial optimization for linear memory access patterns.
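To make that layout concrete, here is a minimal sketch of the pattern described above. The names (LiquidComponent, LiquidManager) and the flow formula are my own assumptions for illustration, not Factorio's actual code; the point is the contrast between the linear walk over the component array and the pointer dereferences to neighbours.

[code]
#include <cstdint>
#include <vector>

// Illustrative sketch only -- names and layout are assumptions, not Factorio's code.
// All liquid components live contiguously in one vector owned by their manager.
struct LiquidComponent {
    float amount;
    // Up to 4 adjacent liquid components (underground pipes, tanks, ...),
    // linked directly by pointer so no spatial lookup is needed during update.
    LiquidComponent* neighbours[4];
    uint8_t neighbourCount;
};

struct LiquidManager {
    std::vector<LiquidComponent> components; // linear memory, updated in this order

    void update() {
        for (LiquidComponent& c : components) {
            // Walking `components` front to back is the pattern the hardware
            // prefetcher predicts well: the next few elements are usually in cache.
            float inflow = 0.0f;
            for (uint8_t i = 0; i < c.neighbourCount; ++i) {
                // The neighbours, however, sit at arbitrary positions in memory.
                // Each of these reads is a potential cache miss paying full latency.
                inflow += (c.neighbours[i]->amount - c.amount) * 0.1f;
            }
            c.amount += inflow;
        }
    }
};
[/code]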
Lubricus wrote: ↑Sun Dec 08, 2019 2:51 pm
Much of the problem with memory access in Factorio seems to be cache misses. Maybe several threads touching different parts of the memory is messing up the cache?
You don't just get cache misses when you thrash your caches ("touching too many different things"), but also when your working set simply doesn't fit within the cache, even after you have already optimized down to the point where you only access each memory location once per frame.
Then there is the issue that in Factorio, even if it were possible to update entities exactly in the order they are laid out in memory (which has no correlation whatsoever to how they are laid out in the world!), you can't constrain read accesses to the same memory range. If you update a pipe, you have to query the state of the surrounding pipes as well. The same goes for belts and much else.
At that point the processor already assumes that you will access the next entity in memory order and prefetches that memory (it simply reads ahead in the proximity of what you have already accessed). But in hardware, it is not able to predict that for the just-prefetched entity N you will also need prefetches for entities A, B, C and D, which are linked from N by pointer and will be queried a few dozen CPU cycles later.
At that point, you have to actively issue a prefetch yourself. When you process entity N, you already look at the pointers A+2, B+2, C+2 and D+2 belonging to entity N+2 and trigger a prefetch for each of them. By the time you are done computing N and N+1, the prefetch for A+2, B+2, C+2 and D+2 has hopefully completed, and these entities can be read straight from L1 cache.
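A rough sketch of that manual prefetch, continuing the illustrative LiquidComponent/LiquidManager names from the snippet above (again, not Factorio's actual code). __builtin_prefetch is the GCC/Clang intrinsic; on MSVC you'd use _mm_prefetch instead. The look-ahead distance of 2 entities is an arbitrary choice for the example and would need tuning.

[code]
#include <cstddef>
#include <cstdint>

// Process entity N while requesting the neighbour data of entity N+2,
// so it has (hopefully) arrived in cache by the time we get there.
void LiquidManager::updateWithPrefetch() {
    const std::size_t count = components.size();
    for (std::size_t n = 0; n < count; ++n) {
        if (n + 2 < count) {
            const LiquidComponent& ahead = components[n + 2];
            for (uint8_t i = 0; i < ahead.neighbourCount; ++i)
                __builtin_prefetch(ahead.neighbours[i], /*rw=*/0, /*locality=*/3);
        }

        // Actual work for entity N, identical to the plain update loop.
        LiquidComponent& c = components[n];
        float inflow = 0.0f;
        for (uint8_t i = 0; i < c.neighbourCount; ++i)
            inflow += (c.neighbours[i]->amount - c.amount) * 0.1f;
        c.amount += inflow;
    }
}
[/code]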
Factorio is not bound by memory throughput, neither in terms of memory transactions per second nor in terms of raw bandwidth. Not even remotely.
The bottleneck is pure memory latency, pretty obviously due to active prefetch logic that is still missing.
Hyperthreading helps here (due to memory latency, fewer than the 2-3 instructions per clock cycle the core is capable of are actually being executed, meaning another thread can easily fit in there). But with just one additional thread per core, it doesn't mask that much either. Plus, multi-threaded updates of the same components are complicated.
Necessary software optimizations aside, the answer to the often-asked "what RAM should I buy" is actually: buy a low-core-count Threadripper. No RAM out there provides access times low enough to give you a notable performance boost. A monstrous 128MB L3 cache, on the other hand, is a pretty simple workaround for ever hitting RAM again. That is the amount of cache that is then sufficient to actually fit the whole working set again.