Parallel processing in games & applications

Post all other topics which do not belong to any other category.
ratchetfreak
Filter Inserter
Posts: 952
Joined: Sat May 23, 2015 12:10 pm

Re: Parallel processing in games & applications

Post by ratchetfreak »

Nemo4809 wrote: Wed Feb 19, 2020 5:15 pm
mrvn wrote: Wed Feb 19, 2020 12:46 pm Put the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too)
I think part of the complication is the skipping of processing certain entities - an optimization. This makes memory pre-fetch effectively useless as the next inserter to be updated could be the next in the array or +5 down the array and you are back to waiting for RAM.
But if the cost of skipping some entities exceeds the cost of just processing them, then it's worth processing them all anyway.

For example, a linked list of alive entities is very costly per processed entity because of the cache thrashing. And when processing the sleeping entities is cheap enough, it may be worth just looping over the whole array instead to take advantage of the prefetching that enables.
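
Roughly the comparison I mean, as a sketch (the Inserter type and its fields are made up here for illustration, not Factorio's actual code):

#include <vector>

struct Inserter {
    Inserter* nextActive = nullptr;  // linked-list variant: the next hop is a likely cache miss
    bool awake = false;
    void update() { /* swing the arm, move items, ... */ }
};

// Variant 1: walk the active linked list -- every hop can stall on memory.
void updateList(Inserter* firstActive) {
    for (Inserter* it = firstActive; it != nullptr; it = it->nextActive)
        it->update();
}

// Variant 2: walk the flat array -- the prefetcher streams the data, and
// sleeping entries only cost a well-predicted branch.
void updateArray(std::vector<Inserter>& inserters) {
    for (Inserter& it : inserters) {
        if (!it.awake) continue;
        it.update();
    }
}

Which one wins depends on the awake/asleep ratio and how cheap the skip is, which is why it needs profiling.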
Nemo4809
Long Handed Inserter
Posts: 94
Joined: Thu Jan 16, 2020 10:49 am

Re: Parallel processing in games & applications

Post by Nemo4809 »

ratchetfreak wrote: Thu Feb 20, 2020 11:51 am
Nemo4809 wrote: Wed Feb 19, 2020 5:15 pm
mrvn wrote: Wed Feb 19, 2020 12:46 pm Put the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too)
I think part of the complication is the skipping of processing certain entities - an optimization. This makes memory pre-fetch effectively useless as the next inserter to be updated could be the next in the array or +5 down the array and you are back to waiting for RAM.
But if the cost of skipping some entities exceeds the cost of just processing them, then it's worth processing them all anyway.

For example, a linked list of alive entities is very costly per processed entity because of the cache thrashing. And when processing the sleeping entities is cheap enough, it may be worth just looping over the whole array instead to take advantage of the prefetching that enables.
I think in general it isn't worth processing them. You waste energy and might thrash the cache.

Whether you skip or process, you still have to wait for RAM to deliver data for the entities that do need to be processed. Might as well idle and wait rather than do pointless processing.
Rseding91
Factorio Staff
Posts: 14252
Joined: Wed Jun 11, 2014 5:23 am

Re: Parallel processing in games & applications

Post by Rseding91 »

Oh boy here I go again explaining how active entities work :P

There is no array of entities. Every single entity in the entire game is allocated 1 at a time. When an entity is active it's put into the doubly-linked list of active entities on the chunk its center is located on. When the entities on that chunk are updated that list is iterated and ::update() is called on each entity in the list. When an entity goes inactive it just removes itself from that list so it doesn't get touched in the update loop. Updatable entities have the previous and next pointers stored directly on them so there's no allocation or de-allocation to go-active or go-inactive.

That's how it has worked in the past (as far as I can see since 0.9 which is when I first saw the code) and it's how it works today.
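
In code, that scheme is roughly the following (illustrative names only, not the actual classes):

struct Entity {
    Entity* prevActive = nullptr;  // the links live on the entity itself,
    Entity* nextActive = nullptr;  // so going (in)active allocates nothing
    virtual void update() = 0;
    virtual ~Entity() = default;
};

struct Chunk {
    Entity* firstActive = nullptr;  // doubly-linked list of active entities on this chunk

    void updateActiveEntities() {
        for (Entity* e = firstActive; e != nullptr; ) {
            Entity* next = e->nextActive;  // read first: update() may deactivate e
            e->update();
            e = next;
        }
    }

    void goInactive(Entity* e) {  // O(1) unlink; the entity simply drops out of the update loop
        if (e->prevActive) e->prevActive->nextActive = e->nextActive;
        else firstActive = e->nextActive;
        if (e->nextActive) e->nextActive->prevActive = e->prevActive;
        e->prevActive = e->nextActive = nullptr;
    }
};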
If you want to get ahold of me I'm almost always on Discord.
ratchetfreak
Filter Inserter
Posts: 952
Joined: Sat May 23, 2015 12:10 pm

Re: Parallel processing in games & applications

Post by ratchetfreak »

Nemo4809 wrote: Thu Feb 20, 2020 11:55 am

I think in general it isn't worth processing them. You waste energy and might thrash the cache.

Whether you skip or process, you still have to wait for RAM to deliver data for the entities that do need to be processed. Might as well idle and wait rather than do pointless processing.
But doing array iteration is massively faster than waiting on an L3 cache miss that your linked list will give you. To the point where you can probably skip a handful of entities with an if(!awake) return; before you get enough slowdown from needing to skip them. Of course this is something that needs profiling on representative hardware.

For example, GPUs don't do per-triangle processing to see if a triangle is visible in order to skip processing the non-position attributes. The reason is that the cost of computing those components is negligible compared to the infrastructure required to make that decision.
Rseding91 wrote: Thu Feb 20, 2020 12:43 pm Oh boy here I go again explaining how active entities work :P

There is no array of entities. Every single entity in the entire game is allocated 1 at a time. When an entity is active it's put into the doubly-linked list of active entities on the chunk its center is located on. When the entities on that chunk are updated that list is iterated and ::update() is called on each entity in the list. When an entity goes inactive it just removes itself from that list so it doesn't get touched in the update loop. Updatable entities have the previous and next pointers stored directly on them so there's no allocation or de-allocation to go-active or go-inactive.

That's how it has worked in the past (as far as I can see since 0.9 which is when I first saw the code) and it's how it works today.
I know that; it's the reason I mentioned linked lists. But my argument is that having a std::vector<inserter_entity> (per chunk if need be) that you loop over directly is probably going to be better for a few reasons: 1) not going through a virtual update() means you aren't thrashing your instruction cache; 2) the built-in array prefetch will outperform the manual prefetch described in one of the FFFs any day of the week.
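
Something like this is what I have in mind (Inserter is a stand-in type, and I'm assuming inserters can live in per-chunk contiguous storage):

#include <vector>

struct Inserter {
    bool awake = false;
    // hand position, held item, ...
    void update() { /* one known, non-virtual call target */ }
};

struct InserterChunk {
    std::vector<Inserter> inserters;  // dense, homogeneous, per-chunk storage

    void updateAll() {
        for (Inserter& ins : inserters) {
            if (!ins.awake) continue;  // a sleeping inserter costs only a predicted branch
            ins.update();              // data streams in via the hardware prefetcher,
        }                              // and the instruction cache stays warm
    }
};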
Rseding91
Factorio Staff
Posts: 14252
Joined: Wed Jun 11, 2014 5:23 am

Re: Parallel processing in games & applications

Post by Rseding91 »

Entities require stable memory addresses and can move around anywhere (sometimes between surfaces) so they can't be put into some vector<thing>.
If you want to get ahold of me I'm almost always on Discord.
coppercoil
Filter Inserter
Posts: 500
Joined: Tue Jun 26, 2018 10:14 am

Re: Parallel processing in games & applications

Post by coppercoil »

ratchetfreak wrote: Thu Feb 20, 2020 1:20 pm But doing array iteration is massively faster than waiting on an L3 cache miss that your linked list will give you.
Iterating a list should be cheap IF all the list pointers and awake flags are sitting in L1. If they are not...
IMNSHO there should be a list of awake flags + entity pointers, not a list of the entities themselves, so an entity's data isn't loaded into the cache unless necessary.
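
Sketched out (made-up names), the point is to keep the hot scan data separate from the entities themselves:

#include <cstddef>
#include <cstdint>
#include <vector>

struct Entity { void update() { /* ... */ } };

struct ActiveSet {
    std::vector<std::uint8_t> awake;  // one byte per slot; thousands of flags fit in L1
    std::vector<Entity*> entity;      // parallel array, dereferenced only when needed

    void update() {
        for (std::size_t i = 0; i < awake.size(); ++i)
            if (awake[i])
                entity[i]->update();  // entity data is pulled into cache only for awake slots
    }
};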
ratchetfreak
Filter Inserter
Posts: 952
Joined: Sat May 23, 2015 12:10 pm

Re: Parallel processing in games & applications

Post by ratchetfreak »

Rseding91 wrote: Thu Feb 20, 2020 2:32 pm Entities require stable memory addresses and can move around anywhere (sometimes between surfaces) so they can't be put into some vector<thing>.
Referential consistency can be solved with a chunked array.

From the mod API, only players and vehicles can teleport between surfaces; sure, treat those as special cases.

And some things can't move at all; those don't need to be movable between chunks either.
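
By "chunked array" I mean roughly the following (a sketch, not production code): elements live in fixed-size blocks that are never reallocated, so pointers to them stay valid, while iteration is still mostly linear.

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t BlockSize = 256>
class ChunkedArray {
    std::vector<std::unique_ptr<std::array<T, BlockSize>>> blocks;
    std::size_t count = 0;

public:
    T* push_back(const T& value) {
        if (count % BlockSize == 0)  // current block full (or none yet): add a new one
            blocks.push_back(std::make_unique<std::array<T, BlockSize>>());
        T* slot = &(*blocks.back())[count % BlockSize];
        *slot = value;
        ++count;
        return slot;  // this address stays valid; no block is ever reallocated or moved
    }

    template <typename F>
    void for_each(F&& f) {  // mostly-linear traversal, one extra indirection per block
        for (std::size_t i = 0; i < count; ++i)
            f((*blocks[i / BlockSize])[i % BlockSize]);
    }
};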
coppercoil wrote: Thu Feb 20, 2020 2:45 pm
ratchetfreak wrote: Thu Feb 20, 2020 1:20 pm But doing array iteration is massively faster than waiting on an L3 cache miss that your linked list will give you.
Iterating a list should be cheap IF all the list pointers and awake flags are sitting in L1. If they are not...
IMNSHO there should be a list of awake flags + entity pointers, not a list of the entities themselves, so an entity's data isn't loaded into the cache unless necessary.
array prefetch will mean you end up blasting through the list at memory bandwidth speed instead of cache miss latency speeds.

But yeah, the best strategy is going to depend on the ratio of awake to asleep entities and on how much work each awake entity needs.
SyncViews
Filter Inserter
Posts: 295
Joined: Thu Apr 21, 2016 3:17 pm

Re: Parallel processing in games & applications

Post by SyncViews »

That sounds like a lot of complexity.
ratchetfreak wrote: Thu Feb 20, 2020 3:02 pm array prefetch will mean you end up blasting through the list at memory bandwidth speed instead of cache miss latency speeds.
I can't speak for Factorio's specific case, nor have I really thought about how it is specifically coded. But in many other cases, when given a "look, here is a great dense std::vector<Thing>", something I ran into almost every time is that while it works for some cases, once you start dealing with interactions between entities, access becomes a lot more random and the overall gains can be very minimal.

E.g. turrets need to search the local area for biters, and there are lots of biters on the map. It is unlikely that the nearby biters to consider shooting are densely packed in an array, and once a turret has a target, even if this is saved step to step, it still needs to access that target every step. I ran into this with AI-type code especially, time and again, and it consumed more CPU cycles in the stuff I was playing with than, say, checking each entity's health or movement in an array, pretty much regardless of how unoptimal those checks were. Or, going beyond turrets, pathfinding for moving objects.

You can have structures to keep track of things in a local area, but maintaining these has a cost, and such a structure is probably an array of pointers that likes to cache miss anyway. Or you physically move an entity's memory "into the area", which also has a cost and complicates anything that wants to reference entities that can now be moved in memory; having many arrays makes each one smaller and fragments the data a bit anyway; and you have to tune how big each area is; probably some more concerns...

Similar concerns apply to things like inserters and bots looking for items.
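
For concreteness, the kind of structure mentioned above (keeping track of things in a local area) might look like a coarse uniform grid (toy types; real spatial indexing is more involved); the last comment marks where the random access creeps back in:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Biter { float x, y; /* health, current target, ... */ };

struct BiterGrid {
    static constexpr float CellSize = 32.0f;
    std::unordered_map<std::uint64_t, std::vector<Biter*>> cells;

    static std::uint64_t key(float x, float y) {
        auto cx = static_cast<std::int32_t>(x / CellSize);
        auto cy = static_cast<std::int32_t>(y / CellSize);
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) | static_cast<std::uint32_t>(cy);
    }

    void insert(Biter* b) { cells[key(b->x, b->y)].push_back(b); }

    // A turret only scans the 3x3 cells around itself...
    template <typename F>
    void forEachNear(float x, float y, F&& f) {
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                auto it = cells.find(key(x + dx * CellSize, y + dy * CellSize));
                if (it == cells.end()) continue;
                for (Biter* b : it->second)
                    f(*b);  // ...but each Biter still lives wherever it was allocated: likely cache miss
            }
    }
};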
ratchetfreak
Filter Inserter
Posts: 952
Joined: Sat May 23, 2015 12:10 pm

Re: Parallel processing in games & applications

Post by ratchetfreak »

SyncViews wrote: Thu Feb 20, 2020 4:25 pm That sounds like a lot of complexity.
ratchetfreak wrote: Thu Feb 20, 2020 3:02 pm array prefetch will mean you end up blasting through the list at memory bandwidth speed instead of cache miss latency speeds.
I can't speak for Factorio's specific case, nor have I really thought about how it is specifically coded. But in many other cases, when given a "look, here is a great dense std::vector<Thing>", something I ran into almost every time is that while it works for some cases, once you start dealing with interactions between entities, access becomes a lot more random and the overall gains can be very minimal.

E.g. turrets need to search the local area for biters, and there are lots of biters on the map. It is unlikely that the nearby biters to consider shooting are densely packed in an array, and once a turret has a target, even if this is saved step to step, it still needs to access that target every step. I ran into this with AI-type code especially, time and again, and it consumed more CPU cycles in the stuff I was playing with than, say, checking each entity's health or movement in an array, pretty much regardless of how unoptimal those checks were. Or, going beyond turrets, pathfinding for moving objects.

You can have structures to keep track of things in a local area, but maintaining these has a cost, and such a structure is probably an array of pointers that likes to cache miss anyway. Or you physically move an entity's memory "into the area", which also has a cost and complicates anything that wants to reference entities that can now be moved in memory; having many arrays makes each one smaller and fragments the data a bit anyway; and you have to tune how big each area is; probably some more concerns...

Similar concerns apply to things like inserters and bots looking for items.
I wasn't expecting this to win greatly where a lot of entities need another entity's data.

And removing the pointer hop means you can eliminate the cache miss for accessing the entity itself, instead of hoping out-of-order execution does some of the loads in parallel.
Rseding91
Factorio Staff
Posts: 14252
Joined: Wed Jun 11, 2014 5:23 am

Re: Parallel processing in games & applications

Post by Rseding91 »

ratchetfreak wrote: Thu Feb 20, 2020 4:51 pm I wasn't expecting this to win greatly where a lot of entities need another entity's data.
But those are the only ones that end up being slow :P If an entity doesn't require another set of data outside of itself it's suuuuuper fast and not worth thinking about putting into an array of something.

In fact, I don't think there is any entity I can think of that only touches itself when updating. The closest would be something like a projectile, but even that has to check against the terrain as it moves, to re-link itself with whatever position it's on.
If you want to get ahold of me I'm almost always on Discord.
Nemo4809
Long Handed Inserter
Posts: 94
Joined: Thu Jan 16, 2020 10:49 am

Re: Parallel processing in games & applications

Post by Nemo4809 »

ratchetfreak wrote: Thu Feb 20, 2020 1:20 pm But doing array iteration is massively faster than waiting on an L3 cache miss that your linked list will give you. To the point where you can probably skip a handful of entities with an if(!awake) return; before you get enough slowdown from needing to skip them. Of course this is something that needs profiling on representative hardware.

For example, GPUs don't do per-triangle processing to see if a triangle is visible in order to skip processing the non-position attributes. The reason is that the cost of computing those components is negligible compared to the infrastructure required to make that decision.
All I'm saying is, an array won't save you if the pre-fetch ends up fetching data that's not required.

Assuming the inserters' data is in an array, but many of the inserters following the current one don't need to be processed and the ones that do are further up/down the array, you are back to waiting for memory.
bobucles
Smart Inserter
Posts: 1708
Joined: Wed Jun 10, 2015 10:37 pm

Re: Parallel processing in games & applications

Post by bobucles »

When it comes to programming, everyone is an expert! :lol: Such is the nature of internet forums. It's easy to open up task manager, look at any CPUs not running at 100%, and cry out "Look at those empty clocks! My game can go THAT much faster!". If only optimizing code was that easy.

There was a pretty neat video I saw a while back about devs optimizing some kind of script parsing engine. Their basic process started with removing conditional statements in weird ways. For example space and bracket parsing was done with unconditional commands like AND/OR/Shift logic. With enough bit banging, direct statements can behave in conditional ways I guess. Then they started going into the weird commands, for example there are instructions that can do multiple additions in one clock, or perform multiple lookup table events at once. At the end of the day their scripting engine could nearly parse 1 text character per CPU cycle, instead of the old system using multiple clocks per char. Sadly I lost the link and can't seem to find it.
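
A toy illustration of the flavour of trick described (not taken from that talk): classify characters with a lookup table and accumulate with plain arithmetic, so there is no per-character branch at all.

#include <cstddef>
#include <cstdint>

std::uint8_t isStructural[256];  // 1 for '{', '}', '[', ']', ',', ':' -- 0 for everything else

void initTable() {
    const char structural[] = "{}[],:";
    for (const char* p = structural; *p; ++p)
        isStructural[static_cast<std::uint8_t>(*p)] = 1;
}

std::size_t countStructural(const char* text, std::size_t len) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i)
        count += isStructural[static_cast<std::uint8_t>(text[i])];  // table lookup + add, no if
    return count;
}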
Oktokolo
Filter Inserter
Posts: 884
Joined: Wed Jul 12, 2017 5:45 pm

Re: Parallel processing in games & applications

Post by Oktokolo »

bobucles wrote: Thu Feb 20, 2020 10:53 pm For example space and bracket parsing was done with unconditional commands like AND/OR/Shift logic. With enough bit banging, direct statements can behave in conditional ways I guess. Then they started going into the weird commands, for example there are instructions that can do multiple additions in one clock, or perform multiple lookup table events at once. At the end of the day their scripting engine could nearly parse 1 text character per CPU cycle, instead of the old system using multiple clocks per char. Sadly I lost the link and can't seem to find it.
Replacing branching with arithmetic is an old way of optimizing for architectures with long execution pipelines (like current Intel and AMD x86/x64).
But the obvious catch is that that sort of optimization is at the same time a pretty decent obfuscation too. You don't want to be the one maintaining that code.
So it is normally only done for selected tiny snippets of code which contain local branches and are executed a lot.
Also, compilers as well as CPUs have gotten a lot better at predicting and optimizing branches in the last twenty years. So I'm not sure whether replacing branches with math is still a thing on CPUs (it certainly is on GPUs).
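
The classic shape of that trick, as a tiny example: turn the condition into an all-ones/all-zeros mask and blend the two results with AND/OR instead of jumping.

#include <cstdint>

// Branchy version: a mispredicted jump costs a pipeline flush.
std::int32_t selectBranchy(bool cond, std::int32_t a, std::int32_t b) {
    if (cond) return a;
    return b;
}

// Branchless version: same result, nothing to mispredict.
std::int32_t selectBranchless(bool cond, std::int32_t a, std::int32_t b) {
    std::int32_t mask = -static_cast<std::int32_t>(cond);  // 0x00000000 or 0xFFFFFFFF
    return (a & mask) | (b & ~mask);
}

Compilers will often make this transformation themselves these days (e.g. by emitting a conditional move), which is part of why hand-writing it is usually only worth it in tiny, very hot loops.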
Nemo4809
Long Handed Inserter
Posts: 94
Joined: Thu Jan 16, 2020 10:49 am

Re: Parallel processing in games & applications

Post by Nemo4809 »

Oktokolo wrote: Fri Feb 21, 2020 6:00 am
bobucles wrote: Thu Feb 20, 2020 10:53 pm For example space and bracket parsing was done with unconditional commands like AND/OR/Shift logic. With enough bit banging, direct statements can behave in conditional ways I guess. Then they started going into the weird commands, for example there are instructions that can do multiple additions in one clock, or perform multiple lookup table events at once. At the end of the day their scripting engine could nearly parse 1 text character per CPU cycle, instead of the old system using multiple clocks per char. Sadly I lost the link and can't seem to find it.
Replacing branching with arithmetic is an old way of optimizing for architectures with long execution pipelines (like current Intel and AMD x86/x64).
But the obvious catch is that that sort of optimization is at the same time a pretty decent obfuscation too. You don't want to be the one maintaining that code.
So it is normally only done for selected tiny snippets of code which contain local branches and are executed a lot.
Also, compilers as well as CPUs have gotten a lot better at predicting and optimizing branches in the last twenty years. So I'm not sure whether replacing branches with math is still a thing on CPUs (it certainly is on GPUs).
I won't be surprised if modern compilers already do the "replacing branches with math" trick for you.
hoho
Filter Inserter
Posts: 681
Joined: Sat Jan 18, 2014 11:23 am

Re: Multithreaded performance

Post by hoho »

mrvn wrote: Wed Feb 19, 2020 10:39 am
Rseding91 wrote: Tue Feb 18, 2020 11:33 am
Nemo4809 wrote: Tue Feb 18, 2020 1:21 am
mrvn wrote: Mon Feb 17, 2020 12:57 pm But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...
Then why didn't you push threading more?
...
I might have inferred that you mean memory bandwidth because with memory latency more threads do help.
This is false. When you're latency-bound, adding more threads will hurt your performance: you'll have more different data streaming through the shared cache levels, so each parallel thread will need a slow fetch from main memory more often than a single thread would, because those caches fill up faster.
movax20h
Fast Inserter
Posts: 164
Joined: Fri Mar 08, 2019 7:07 pm

Re: Parallel processing in games & applications

Post by movax20h »

Nemo4809 wrote: Wed Feb 19, 2020 5:15 pm
mrvn wrote: Mon Feb 17, 2020 12:57 pm Each entity would also have an old-state and next-state. In each phase the next-state is computed from old-state and at the end you switch the two for the next phase.
I have toyed with this idea as a thought experiment ... and concluded it doesn't work out in practice.

e.g. 2 mobs. Each has a thread to determine where they move. Based on the old state, they both decide to move to tile X. Except you can't have 2 entities occupying the same space. Old state to new state would allow this to happen unless you put a check when moving to a tile, but that would make the outcome nondeterministic depending on whose thread ends up being processed first based on OS scheduling(?), and the recalculation effectively makes the 2 threads run sequentially.
posila wrote: Wed Feb 19, 2020 12:50 pm I still think "bad memory access patterns" is the most correct way of describing the cause of the problem.
From what I know about PC memory management, a "good" memory access pattern would involve the data you next require be near the data you are currently fetching because modern PC pre-fetch data surrounding the current data being fetched from memory. However this isn't always the case. Sometimes the next set of data required isn't even determined until the current calculation is done and could be anywhere in memory - effectively making pre-fetch as it is useless; and preventing any sort of pre-fetching strategy.

PS: From what I heard, this is a real bottleneck when it comes to raytracing. The memory access pattern is effectively random and this is very bad for GPU memory that is tuned for high bandwidth at the cost of high latency.
mrvn wrote: Wed Feb 19, 2020 12:46 pm Put the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too)
I think part of the complication is the skipping of processing certain entities - an optimization. This makes memory pre-fetch effectively useless as the next inserter to be updated could be the next in the array or +5 down the array and you are back to waiting for RAM.
Nobody claims it is not possible to highly parallelize Factorio, just that it is hard, rather complex and invasive to the code, and would add some overhead in other cases too. There are plenty of techniques known to make it work; it would just make the code very complex (especially if one wants multiplayer to keep working correctly), bigger and harder to debug, and would break many other things (including mods). Factorio is not in a state where it is worth doing right now. There are enough bugs, and a deadline to finish the game, to not mess with it. The Factorio devs are familiar with many techniques, including cache-friendly algorithms and data structures, which are already used in many places.

At this stage it is probably best to leave it as it is. Or write a prototype demonstrating good scaling and the ability to handle all the complexities of the interactions between entities. I think discussing solutions without a prototype is mostly a waste of time for everyone involved.
mrvn
Smart Inserter
Posts: 5848
Joined: Mon Sep 05, 2016 9:10 am

Re: Parallel processing in games & applications

Post by mrvn »

Nemo4809 wrote: Wed Feb 19, 2020 5:15 pm
mrvn wrote: Mon Feb 17, 2020 12:57 pm Each entity would also have an old-state and next-state. In each phase the next-state is computed from old-state and at the end you switch the two for the next phase.
I have toyed with this idea as a thought experiment ... and concluded it doesn't work out in practice.

e.g. 2 mobs. Each has a thread to determine where they move. Based on the old state, they both decide to move to tile X. Except you can't have 2 entities occupying the same space. Old state to new state would allow this to happen unless you put a check when moving to a tile, but that would make the outcome nondeterministic depending on whose thread ends up being processed first based on OS scheduling(?), and the recalculation effectively makes the 2 threads run sequentially.
You didn't split it up into enough phases. For example:

Phase 1: Both mobs decide they want to go to tile X.
Phase 2: Tile X decides which mob can move into it.
Phase 3: Mobs that are allowed to move do move.

But you probably want the second mob to move to a different tile when X is occupied. That would be a harder problem.

The easiest solution to this is to use negative attraction. Mobs do not like to be hugged by other mobs, so they keep their distance. With tile X being reachable by 2 mobs, moving there would put them too near each other, so tile X would never be a suitable destination.

Note: This only works if mobs are suitably large and tiles suitably small, so that keeping a tile of distance between mobs doesn't make them look too far apart. Something I think is not a problem in Factorio.
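
A rough sketch of those three phases (toy types; real movement and collision handling are obviously messier):

#include <unordered_map>
#include <vector>

struct Mob { int x = 0, y = 0; int wantX = 0, wantY = 0; bool mayMove = false; };

static long long tileKey(int x, int y) {
    return (static_cast<long long>(x) << 32) | static_cast<unsigned int>(y);
}

void step(std::vector<Mob>& mobs) {
    // Phase 1: every mob picks a destination from the old state (trivially parallel).
    for (Mob& m : mobs) {
        m.wantX = m.x + 1;  // stand-in for the real "decide where to go" logic
        m.wantY = m.y;
        m.mayMove = false;
    }

    // Phase 2: each contested tile picks exactly one winner.
    // Iterating the mobs in a fixed order keeps the outcome deterministic.
    std::unordered_map<long long, Mob*> claims;
    for (Mob& m : mobs)
        claims.emplace(tileKey(m.wantX, m.wantY), &m);  // first claimant wins
    for (auto& claim : claims)
        claim.second->mayMove = true;

    // Phase 3: only the winners actually move (again trivially parallel).
    for (Mob& m : mobs)
        if (m.mayMove) { m.x = m.wantX; m.y = m.wantY; }
}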
Nemo4809 wrote: Wed Feb 19, 2020 5:15 pm
posila wrote: Wed Feb 19, 2020 12:50 pm I still think "bad memory access patterns" is the most correct way of describing the cause of the problem.
From what I know about PC memory management, a "good" memory access pattern would involve the data you next require be near the data you are currently fetching because modern PC pre-fetch data surrounding the current data being fetched from memory. However this isn't always the case. Sometimes the next set of data required isn't even determined until the current calculation is done and could be anywhere in memory - effectively making pre-fetch as it is useless; and preventing any sort of pre-fetching strategy.

PS: From what I heard, this is a real bottleneck when it comes to raytracing. The memory access pattern is effectively random and this is very bad for GPU memory that is tuned for high bandwidth at the cost of high latency.
mrvn wrote: Wed Feb 19, 2020 12:46 pm Put the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too)
I think part of the complication is the skipping of processing certain entities - an optimization. This makes memory pre-fetch effectively useless as the next inserter to be updated could be the next in the array or +5 down the array and you are back to waiting for RAM.
Yes. If you have entities (like inserters in Factorio) that sleep, then you don't get a nice sequential access pattern. Lots of skips then. You can still store the inserters in an array and process them in order with careful planning. That way any inserters that are active and adjacent still benefit.

Even if putting all the inserters into their own array has no positive effect by itself, it won't make random access patterns any worse. And it could have the beneficial effect that inserters aren't placed in between other entities that could otherwise use sequential access. But I would think that at least one thing would benefit: saving. Serializing all inserters for saving in a batch job with sequential memory access can only improve the process. Data pre-fetch improves, and instruction caching and branch prediction work better. Better than iterating over all entities in random order, randomly placed anywhere in memory.
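
For the saving point, a minimal sketch under a big assumption (a flat, trivially copyable state struct, which real entities are not): the whole array can be streamed out in one sequential pass.

#include <cstddef>
#include <cstdio>
#include <vector>

struct InserterState { float armAngle; int heldItem; bool awake; };  // plain data, purely for the example

void saveInserters(const std::vector<InserterState>& inserters, std::FILE* out) {
    std::size_t count = inserters.size();
    std::fwrite(&count, sizeof(count), 1, out);
    std::fwrite(inserters.data(), sizeof(InserterState), count, out);  // one sequential, prefetch-friendly pass
}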
bobucles
Smart Inserter
Posts: 1708
Joined: Wed Jun 10, 2015 10:37 pm

Re: Parallel processing in games & applications

Post by bobucles »

Simple arrays are fantastic for running super fast code, however most objects in factorio have to be added and removed in arbitrary ways. Working through an array is fast, but adding and removing elements is slow and only gets slower as the array grows. If an exploding item simply leaves the array spot empty then it ends up with a swiss cheese array, and the worst part is that the host's array may not look like a newly joining player's memory array. Defragging an array is its own increasingly slow operation and players hate any kind of observable lag spike. Using the random access pointers is a compromise between both worlds. It's pretty fast for iterating, and it's pretty fast for arbitrarily adding or removing elements, so it doesn't have any glaring weaknesses. It also doesn't matter where the game material ends up in RAM, so anyone can get the same experience.

There are some shortcuts that may help the array experience. For example when most things blow up, the object doesn't truly get removed and remains as a "ghost" entity. The only real deletion of items happens through the deconstruction planner and hand picking. But I imagine those shortcuts aren't enough to make it a good experience.
mrvn
Smart Inserter
Posts: 5848
Joined: Mon Sep 05, 2016 9:10 am

Re: Parallel processing in games & applications

Post by mrvn »

bobucles wrote: Wed Feb 26, 2020 12:56 pm Simple arrays are fantastic for running super fast code, however most objects in factorio have to be added and removed in arbitrary ways. Working through an array is fast, but adding and removing elements is slow and only gets slower as the array grows. If an exploding item simply leaves the array spot empty then it ends up with a swiss cheese array, and the worst part is that the host's array may not look like a newly joining player's memory array. Defragging an array is its own increasingly slow operation and players hate any kind of observable lag spike. Using the random access pointers is a compromise between both worlds. It's pretty fast for iterating, and it's pretty fast for arbitrarily adding or removing elements, so it doesn't have any glaring weaknesses. It also doesn't matter where the game material ends up in RAM, so anyone can get the same experience.

There are some shortcuts that may help the array experience. For example when most things blow up, the object doesn't truly get removed and remains as a "ghost" entity. The only real deletion of items happens through the deconstruction planner and hand picking. But I imagine those shortcuts aren't enough to make it a good experience.
If the order of entries in the array doesn't matter (as long as it's identical on all clients) then deleting an entry can simply swap it with the last entry. Then all the free space is always at the end of the array without any lag spikes.
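
I.e. the classic swap-and-pop (a minimal sketch): O(1), leaves no hole behind, and stays deterministic as long as every peer removes the same indices in the same order.

#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
void swapRemove(std::vector<T>& v, std::size_t index) {
    std::swap(v[index], v.back());  // move the last element into the gap
    v.pop_back();                   // the array stays dense; only the ordering changes
}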

Anyway, this is getting further and further off topic. This is simply not how Factorio was written and it's too late to change it now.
elfstone
Burner Inserter
Posts: 15
Joined: Fri May 26, 2017 7:18 pm

Re: Parallel processing in games & applications

Post by elfstone »

Are there any plans to revisit multithreading for the DLC (or whatever comes next)?
If the DLC is based on Space Exploration (which could be, now that Earendel has joined the team), it might be possible to run different surfaces on different cores, since events on one surface don't interact with things on other surfaces; that should make multithreading a lot easier.
The logic could be changed so that changes on one surface do not interact with other surfaces for 10 ticks or so, so there is plenty of time to sync between the surfaces. (Since the speed of light is actually a thing, it's even more realistic if information and rockets take a few ticks to travel between planets. ;-))

Also, one of the arguments at the beginning of this thread was that normal CPUs don't have many cores, and only high-end machines would profit. Now that even entry-level CPUs like the Ryzen 5600 have 6 cores / 12 threads, that argument won't hold for much longer, and since you're planning on a timescale of about a year, those will have quite some market share.