Re: Parrallel processing in games & applications
Posted: Thu Feb 13, 2020 4:17 pm
[Koub] Merged into the "parallelization" topic
www.factorio.com
https://forums.factorio.com/
I'm not saying it's easy (good multi-threaded code can easily be 10x more complex). I'm pointing out what's possible.movax20h wrote: Thu Feb 13, 2020 4:09 pm You are mostly testing the memory bandwidth here. Not latency. It is possible to hide latency by using more thread and utilizing multithreading or very smart software prefetching, but that would complicate the code base a lot, and still doesn't work fully with all entities easily.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.movax20h wrote: Thu Feb 13, 2020 4:29 pm One of the possible lower hanging fruits for parallelisation is processing disconnected components in parallel, when it make sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs and belts. And chests. But not counting trains, pipes, electric network or bots. It should be possible to make it deterministic, even if these components do share or are connected via bots, by properly designed order in which the results are merged back deterministically. It would help quite a bit, because in many cases one would have many outposts, which now can be run independently. Of course game should detect cases where it doesn't make sense (i.e. predominantly bot based base, which would be seen as a big number of disconnected component). Just some ideas.
Oh. Thanks for the info. 15% is definitively not worth the trouble.Rseding91 wrote: Thu Feb 13, 2020 5:14 pmThere was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.movax20h wrote: Thu Feb 13, 2020 4:29 pm One of the possible lower hanging fruits for parallelisation is processing disconnected components in parallel, when it make sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs and belts. And chests. But not counting trains, pipes, electric network or bots. It should be possible to make it deterministic, even if these components do share or are connected via bots, by properly designed order in which the results are merged back deterministically. It would help quite a bit, because in many cases one would have many outposts, which now can be run independently. Of course game should detect cases where it doesn't make sense (i.e. predominantly bot based base, which would be seen as a big number of disconnected component). Just some ideas.
Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic then it's just not slow anymore and not worth the complexity to try to thread.
If you just break that connection and break the logic of why it existed you remove a large part of what makes Factorio Factorio and now stuff just isn't the same.
The render preparation step is already utilizing all of your CPU cores. You can see how slow it gets by going into graphics settings and setting render threads to 0 or 1 or w/e the minimum is.Dixi wrote: Thu Feb 13, 2020 11:36 pm Calculate game tick in one thread and same time use another cpu core to visualize previous tick calculations? This will give extra 1 tick delay in visualization, but those calculations can be done independently on at least two cpu cores.
As I see on F5 diagnostics, Render takes about 1.5ms, while Update 3.0ms. If I understand those numbers correctly, then I can assume that game world/logic update takes, on my PC, on this map, twice the time of actual frame visualization. Then splitting those actions to different cores might improve performance.
From other side of view I dunno how those numbers will look like on low end PC. Because numbers I mentioned are in background task. While the game runs foreground, even on a big factory (2k+ SPM) max zoom out, there is sleep delay of about a half of frame time. So on modern average price gaming PC it works so good single core than nothing is really needed.
I think this posts sums up pretty quickly the thread (:joke:) : the fact that Factorio updates thousand or millions, even nearly a billion action every tick, is absolutely not compatible with multi-thread. Not with x86 arch at least. With a thousand core ARM PC (which doesn't exists, only a few try at servers), maybe, Factorio could parallelize a lot of things.Rseding91 wrote: Thu Feb 13, 2020 5:14 pmThere was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.movax20h wrote: Thu Feb 13, 2020 4:29 pm One of the possible lower hanging fruits for parallelisation is processing disconnected components in parallel, when it make sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs and belts. And chests. But not counting trains, pipes, electric network or bots. It should be possible to make it deterministic, even if these components do share or are connected via bots, by properly designed order in which the results are merged back deterministically. It would help quite a bit, because in many cases one would have many outposts, which now can be run independently. Of course game should detect cases where it doesn't make sense (i.e. predominantly bot based base, which would be seen as a big number of disconnected component). Just some ideas.
Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic then it's just not slow anymore and not worth the complexity to try to thread.
If you just break that connection and break the logic of why it existed you remove a large part of what makes Factorio Factorio and now stuff just isn't the same.
That’s kind of what GPU computing is. Massive number of calculations in parallel. Mostly simple with few to no branches. Often the same calculation over and over on a ton of data.nafira wrote: Fri Feb 14, 2020 6:12 pmMaybe one day, ARM will make such ASIC machine to run highly parallelized apps (like dozens of hundreds of thread with simple calculations), and maybe they will try a branch of headless (I don't think it's worth it otherwise, even worth it at all).
GPU computing is not an option clearly.
Speaking of Minecraft, it cheats when it comes to its simulation - it only simulate a fix number of blocks around the player; if the player gets too far from whatever they build, the simulation stops.An nearly empty world in Minecraft is not playable on a 1500$ rig so ...Let's be thankful !
And even in your own 15*15 block it limits the number of enemies for example which is annoying.Nemo4809 wrote: Sat Feb 15, 2020 7:31 amSpeaking of Minecraft, it cheats when it comes to its simulation - it only simulate a fix number of blocks around the player; if the player gets too far from whatever they build, the simulation stops.An nearly empty world in Minecraft is not playable on a 1500$ rig so ...Let's be thankful !
The fact that Factorio simulates everything that needs to be simulated on the map is quite the technical feat - to me anyway.
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.movax20h wrote: Thu Feb 13, 2020 4:09 pm You are mostly testing the memory bandwidth here. Not latency. It is possible to hide latency by using more thread and utilizing multithreading or very smart software prefetching, but that would complicate the code base a lot, and still doesn't work fully with all entities easily.
It is more of an open research area. Definitively doable, but requires a lot of work and restructuring / rearchitecting.
I am doing a research on a Factorio-like clone / simulator, that is mostly focused on multithreading, interaction graph partitioning and load balancing, and other tricks (like how to use some shared state asynchronously without locks, but with deterministic outcome) and it is non trivial. I do it in Erlang, which makes it nice to prototype, but it does induce interesting problems on its own (a bit higher memory per-entity). I already discovered quite a bit of bottlenecks in the process scheduler for example, mostly due to non-locality of entities.
It is easy to stamp out a massive number of ALUs on a single chip, but the difficult part is actually filling the ALUs with useful work. As noted, your idea of mega-core count computing is pretty similar to the reality of GPUs. There are a few other architectures with slightly different tradeoffs (e.g., FPGAs and CGRAs), but the moral of the story is going to end up the same.nafira wrote: Fri Feb 14, 2020 6:12 pm I think this posts sums up pretty quickly the thread (:joke:) : the fact that Factorio updates thousand or millions, even nearly a billion action every tick, is absolutely not compatible with multi-thread. Not with x86 arch at least. With a thousand core ARM PC (which doesn't exists, only a few try at servers), maybe, Factorio could parallelize a lot of things.
But under even 16 cores or 48 cores, it's not worse breaking the whole game logic.
Maybe one day, ARM will make such ASIC machine to run highly parallelized apps (like dozens of hundreds of thread with simple calculations), and maybe they will try a branch of headless (I don't think it's worth it otherwise, even worth it at all).
GPU computing is not an option clearly.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.mrvn wrote: Mon Feb 17, 2020 12:57 pmBut that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...Nemo4809 wrote: Tue Feb 18, 2020 1:21 amDon't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.mrvn wrote: Mon Feb 17, 2020 12:57 pmBut that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Then why didn't you push threading more? I can't find the thread right now but I remember statements that factorio is memory bound and not cpu bound and more threads would not help. I might have inferred that you mean memory bandwidth because with memory latency more threads do help. So couldn't have been that.Rseding91 wrote: Tue Feb 18, 2020 11:33 amWe've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...Nemo4809 wrote: Tue Feb 18, 2020 1:21 amDon't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.mrvn wrote: Mon Feb 17, 2020 12:57 pmBut that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Throwing more threads at a problem can't help all problems related to memory latency. Take a linked list where the problem is traversing the list. Clearly the problem is memory latency bound but you can't throw more threads at the problems to speed it up.mrvn wrote: Wed Feb 19, 2020 10:39 amThen why didn't you push threading more? I can't find the thread right now but I remember statements that factorio is memory bound and not cpu bound and more threads would not help. I might have inferred that you mean memory bandwidth because with memory latency more threads do help. So couldn't have been that.Rseding91 wrote: Tue Feb 18, 2020 11:33 amWe've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...Nemo4809 wrote: Tue Feb 18, 2020 1:21 amDon't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.mrvn wrote: Mon Feb 17, 2020 12:57 pmBut that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Yes. That's basically the problem. THE LIST It's not the memory bandwidth or latency that's the problem. It's the list part that causing the problem and also preventing multi threading. Put the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too).keyboardhack wrote: Wed Feb 19, 2020 11:49 amThrowing more threads at a problem can't help all problems related to memory latency. Take a linked list where the problem is traversing the list. Clearly the problem is memory latency bound but you can't throw more threads at the problems to speed it up.mrvn wrote: Wed Feb 19, 2020 10:39 amThen why didn't you push threading more? I can't find the thread right now but I remember statements that factorio is memory bound and not cpu bound and more threads would not help. I might have inferred that you mean memory bandwidth because with memory latency more threads do help. So couldn't have been that.Rseding91 wrote: Tue Feb 18, 2020 11:33 amWe've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...Nemo4809 wrote: Tue Feb 18, 2020 1:21 amDon't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.mrvn wrote: Mon Feb 17, 2020 12:57 pmBut that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Afaik that's exactly what the problem with factorio is. There is a certain order in which the entities are updates and each entity points to the next entity which should be updated, like a linked list. We can see that it's possible to multithread some parts of the program when this updat order is changed like what happened to fluids but that doesn't mean that it's trivial to change it for all entity types.
We could say "just update all inserters, then belts and so on" but we don't know how difficult of a change that would be. From their FFFs it may seem like they sometimes discover mundane things that makes the program faster, but in all probability they are just writing it in a way the average reader can follow.
I have toyed with this idea as a thought experiment ... and concluded it doesn't work out in practice.mrvn wrote: Mon Feb 17, 2020 12:57 pmEach entity would also have an old-state and next-state. In each phase the next-state is computed from old-state and at the end you switch the two for the next phase.
From what I know about PC memory management, a "good" memory access pattern would involve the data you next require be near the data you are currently fetching because modern PC pre-fetch data surrounding the current data being fetched from memory. However this isn't always the case. Sometimes the next set of data required isn't even determined until the current calculation is done and could be anywhere in memory - effectively making pre-fetch as it is useless; and preventing any sort of pre-fetching strategy.posila wrote: Wed Feb 19, 2020 12:50 pmI still think "bad memory access patterns" is the most correct way of describing the cause of the problem.
I think part of the complication is the skipping of processing certain entities - an optimization. This makes memory pre-fetch effectively useless as the next inserter to be updated could be the next in the array or +5 down the array and you are back to waiting for RAM.mrvn wrote: Wed Feb 19, 2020 12:46 pmPut the inserters into an array and you remove the latency for looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too)