
Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 4:17 pm
by Koub
[Koub] Merged into the "parallelization" topic

Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 4:29 pm
by movax20h
One of the possible lower-hanging fruits for parallelisation is processing disconnected components in parallel, when it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to make it deterministic, even if these components share bots or are connected via bots, by a properly designed order in which the results are merged back. It would help quite a bit, because in many cases one would have many outposts, which could then be run independently. Of course the game should detect cases where it doesn't make sense (i.e. a predominantly bot-based base, which would be seen as a big number of disconnected components). Just some ideas.
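
As a rough sketch of the shape this could take (made-up C++ types, nothing to do with the real code): find the components with a union-find, run one task per component, and join the tasks in a fixed order so the merge stays deterministic.

#include <cstddef>
#include <cstdint>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

struct Entity { std::uint32_t id; /* inserter / belt / assembler state ... */ };

// Plain union-find over entity indices; "links" below are the inserter<->belt
// style connections that glue a component together.
struct UnionFind {
    std::vector<std::uint32_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0u);
    }
    std::uint32_t find(std::uint32_t x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    }
    void unite(std::uint32_t a, std::uint32_t b) { parent[find(a)] = find(b); }
};

void updateEntity(Entity&) { /* per-entity tick logic would go here */ }

void parallelTick(std::vector<Entity>& entities,
                  const std::vector<std::pair<std::uint32_t, std::uint32_t>>& links) {
    UnionFind uf(entities.size());
    for (const auto& link : links) uf.unite(link.first, link.second);

    // Bucket entity indices by the root of their component.
    std::vector<std::vector<std::uint32_t>> components(entities.size());
    for (std::uint32_t i = 0; i < entities.size(); ++i)
        components[uf.find(i)].push_back(i);

    // One task per non-empty component; futures are collected and joined in
    // component order, so the point where results meet is deterministic.
    std::vector<std::future<void>> jobs;
    for (const auto& comp : components) {
        if (comp.empty()) continue;
        jobs.push_back(std::async(std::launch::async, [&entities, &comp] {
            for (std::uint32_t i : comp) updateEntity(entities[i]);
        }));
    }
    for (auto& j : jobs) j.get();

    // Anything genuinely shared (bots, electric network, ...) would be applied
    // here, after the parallel part, again in a fixed order.
}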

Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 4:56 pm
by sushi_eater
movax20h wrote:
Thu Feb 13, 2020 4:09 pm
You are mostly testing the memory bandwidth here, not latency. It is possible to hide latency by using more threads and utilizing multithreading or very smart software prefetching, but that would complicate the code base a lot, and it still doesn't work easily with all entities.
I'm not saying it's easy (good multi-threaded code can easily be 10x more complex). I'm pointing out what's possible.

Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 5:14 pm
by Rseding91
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the possible lower-hanging fruits for parallelisation is processing disconnected components in parallel, when it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to make it deterministic, even if these components share bots or are connected via bots, by a properly designed order in which the results are merged back. It would help quite a bit, because in many cases one would have many outposts, which could then be run independently. Of course the game should detect cases where it doesn't make sense (i.e. a predominantly bot-based base, which would be seen as a big number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic then it's just not slow anymore and not worth the complexity to try to thread.

If you just break that connection and break the logic of why it existed you remove a large part of what makes Factorio Factorio and now stuff just isn't the same.

Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 6:07 pm
by movax20h
Rseding91 wrote:
Thu Feb 13, 2020 5:14 pm
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the possible lower-hanging fruits for parallelisation is processing disconnected components in parallel, when it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to make it deterministic, even if these components share bots or are connected via bots, by a properly designed order in which the results are merged back. It would help quite a bit, because in many cases one would have many outposts, which could then be run independently. Of course the game should detect cases where it doesn't make sense (i.e. a predominantly bot-based base, which would be seen as a big number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic then it's just not slow anymore and not worth the complexity to try to thread.

If you just break that connection and break the logic of why it existed you remove a large part of what makes Factorio Factorio and now stuff just isn't the same.
Oh. Thanks for the info. 15% is definitely not worth the trouble.

Do you have a more detailed write-up about these attempts and findings? Any experimental old binaries (headless, or even not fully working) you could share, maybe, to do some testing today?

I hope that after Factorio 1.0 / 1.1 some new attempts will be made to make it work.

Re: Parrallel processing in games & applications

Posted: Thu Feb 13, 2020 11:36 pm
by Dixi
Calculate the game tick in one thread and at the same time use another CPU core to visualize the previous tick's calculations? This would give an extra 1 tick of delay in the visualization, but those calculations can be done independently on at least two CPU cores.

As I see in the F5 diagnostics, Render takes about 1.5 ms, while Update takes 3.0 ms. If I understand those numbers correctly, then I can assume that the game world/logic update takes, on my PC, on this map, twice the time of the actual frame visualization. Then splitting those actions onto different cores might improve performance.

On the other hand, I don't know how those numbers would look on a low-end PC, because the numbers I mentioned are from the game running as a background task. While the game runs in the foreground, even on a big factory (2k+ SPM) at max zoom out, there is a sleep delay of about half the frame time. So on a modern, average-priced gaming PC it works so well on a single core that nothing is really needed.
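
Something like this, as a minimal sketch (GameState / simulate / render are placeholders, not the real engine): the main thread draws tick N while a second thread computes tick N+2 from the already-finished tick N+1, which is where the one-tick display delay comes from.

#include <thread>
#include <utility>

struct GameState { /* entities, belts, ... */ };

GameState simulate(const GameState& prev) {
    GameState next = prev;
    // ... advance the world by one tick ...
    return next;
}

void render(const GameState&) { /* draw the frame */ }

void runPipelined(int frames) {
    GameState current{};                 // tick N, the one being rendered
    GameState next = simulate(current);  // tick N+1, already simulated

    for (int f = 0; f < frames; ++f) {
        GameState upcoming;
        std::thread sim([&] { upcoming = simulate(next); });  // tick N+2 in the background
        render(current);                                      // tick N on the main thread
        sim.join();
        current = std::move(next);       // what gets rendered next frame
        next    = std::move(upcoming);
    }
}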

Re: Parrallel processing in games & applications

Posted: Fri Feb 14, 2020 12:30 am
by Rseding91
Dixi wrote:
Thu Feb 13, 2020 11:36 pm
Calculate the game tick in one thread and at the same time use another CPU core to visualize the previous tick's calculations? This would give an extra 1 tick of delay in the visualization, but those calculations can be done independently on at least two CPU cores.

As I see in the F5 diagnostics, Render takes about 1.5 ms, while Update takes 3.0 ms. If I understand those numbers correctly, then I can assume that the game world/logic update takes, on my PC, on this map, twice the time of the actual frame visualization. Then splitting those actions onto different cores might improve performance.

On the other hand, I don't know how those numbers would look on a low-end PC, because the numbers I mentioned are from the game running as a background task. While the game runs in the foreground, even on a big factory (2k+ SPM) at max zoom out, there is a sleep delay of about half the frame time. So on a modern, average-priced gaming PC it works so well on a single core that nothing is really needed.
The render preparation step is already utilizing all of your CPU cores. You can see how slow it gets by going into graphics settings and setting render threads to 0 or 1 or w/e the minimum is.

Re: Parrallel processing in games & applications

Posted: Fri Feb 14, 2020 6:12 pm
by nafira
Rseding91 wrote:
Thu Feb 13, 2020 5:14 pm
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the possible lower-hanging fruits for parallelisation is processing disconnected components in parallel, when it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to make it deterministic, even if these components share bots or are connected via bots, by a properly designed order in which the results are merged back. It would help quite a bit, because in many cases one would have many outposts, which could then be run independently. Of course the game should detect cases where it doesn't make sense (i.e. a predominantly bot-based base, which would be seen as a big number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic then it's just not slow anymore and not worth the complexity to try to thread.

If you just break that connection and break the logic of why it existed you remove a large part of what makes Factorio Factorio and now stuff just isn't the same.
I think this post sums up the thread pretty quickly (:joke:): the fact that Factorio updates thousands or millions, even nearly a billion actions every tick, is absolutely not compatible with multi-threading. Not with the x86 arch at least. With a thousand-core ARM PC (which doesn't exist, only a few attempts at servers), maybe, Factorio could parallelize a lot of things.
But with even 16 cores or 48 cores, it's not worth breaking the whole game logic.

Maybe one day ARM will make such an ASIC machine to run highly parallelized apps (like dozens of hundreds of threads with simple calculations), and maybe they will try a branch of the headless build (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.

And I think that's pretty much it, sadly.
We can still thank them for making the game playable even with a megabase on a laptop.

A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:

Re: Parrallel processing in games & applications

Posted: Sat Feb 15, 2020 7:31 am
by Nemo4809
nafira wrote:
Fri Feb 14, 2020 6:12 pm
Maybe one day ARM will make such an ASIC machine to run highly parallelized apps (like dozens of hundreds of threads with simple calculations), and maybe they will try a branch of the headless build (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.
That’s kind of what GPU computing is. Massive number of calculations in parallel. Mostly simple with few to no branches. Often the same calculation over and over on a ton of data.
A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:
Speaking of Minecraft, it cheats when it comes to its simulation - it only simulates a fixed number of blocks around the player; if the player gets too far from whatever they built, the simulation stops.

The fact that Factorio simulates everything that needs to be simulated on the map is quite the technical feat - to me anyway.

Re: Parrallel processing in games & applications

Posted: Sun Feb 16, 2020 12:57 pm
by nafira
Nemo4809 wrote:
Sat Feb 15, 2020 7:31 am
A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:
Speaking of Minecraft, it cheats when it comes to its simulation - it only simulates a fixed number of blocks around the player; if the player gets too far from whatever they built, the simulation stops.

The fact that Factorio simulates everything that needs to be simulated on the map is quite the technical feat - to me anyway.
And even in your own 15*15 block it limits the number of enemies, for example, which is annoying.

Re: Parrallel processing in games & applications

Posted: Mon Feb 17, 2020 1:39 am
by bobucles
GPU computing does not have the same abilities as CPU computing. Each new generation gets more stuff, but if it's not up to the whole task then it won't work.

Re: Multithreaded performance

Posted: Mon Feb 17, 2020 12:57 pm
by mrvn
movax20h wrote:
Thu Feb 13, 2020 4:09 pm
You are mostly testing the memory bandwidth here, not latency. It is possible to hide latency by using more threads and utilizing multithreading or very smart software prefetching, but that would complicate the code base a lot, and it still doesn't work easily with all entities.

It is more of an open research area. Definitely doable, but it requires a lot of work and restructuring / rearchitecting.

I am doing research on a Factorio-like clone / simulator that is mostly focused on multithreading, interaction-graph partitioning and load balancing, and other tricks (like how to use some shared state asynchronously without locks, but with a deterministic outcome), and it is non-trivial. I do it in Erlang, which makes it nice to prototype, but it does introduce interesting problems of its own (a bit higher memory per entity). I have already discovered quite a few bottlenecks in the process scheduler, for example, mostly due to non-locality of entities.
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.

The easiest way to break all the interdependencies between entities that make multithreading impossible or very hard is to split actions into phases. Each entity would also have an old-state and a next-state. In each phase the next-state is computed from the old-state, and at the end you swap the two for the next phase.

Now the trick is to split the phases so that each has no dependencies on other entities. Sometimes that means doing something needs 1 phase, sometimes 2, other things could need 10 phases. Let's take an example: inserters taking something from a belt:

Phase 1: belt has item, wake up all interested inserters
Phase 2: inserters check items and set flag if they can accept it
Phase 3: belt picks inserter to receive item from those agreeable, item is transferred

In each phase there is no dependency between entities. All belts can run at the same time in phases 1 and 3. All inserters can run at the same time in phase 2. Inserters don't actually pick up the item or decide among themselves who gets the item. That choice is made in phase 3 by the belt, avoiding the dependency between inserters picking from the same belt.
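
A toy sketch of those three phases (made-up types, not the real data model); phase 3 takes the lowest-index willing inserter, so the outcome is deterministic no matter how the belts were scheduled:

#include <cstdint>
#include <vector>

struct Inserter {
    bool holdsItem = false;
    bool wantsItem = false;                        // written only in phase 2
};

struct Belt {
    bool hasItem  = false;
    bool offering = false;                         // written only in phase 1
    std::vector<std::uint32_t> watchingInserters;  // indices into the inserter pool
};

void tick(std::vector<Belt>& belts, std::vector<Inserter>& inserters) {
    // Phase 1: belts with an item announce it ("wake up interested inserters").
    for (auto& b : belts) b.offering = b.hasItem;

    // Phase 2: inserters declare whether they could accept an item.
    for (auto& ins : inserters) ins.wantsItem = !ins.holdsItem;

    // Phase 3: each offering belt picks one willing inserter (lowest index) and
    // hands the item over. If one inserter could watch two belts, a fourth
    // arbitration phase would be needed; that is the "sometimes 2, sometimes 10
    // phases" part.
    for (auto& b : belts) {
        if (!b.offering) continue;
        for (std::uint32_t idx : b.watchingInserters) {
            if (inserters[idx].wantsItem) {
                b.hasItem = false;
                inserters[idx].holdsItem = true;
                inserters[idx].wantsItem = false;
                break;
            }
        }
    }
}

Each loop only ever runs over one entity type, so each phase could be chunked across threads like any flat array.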

On the negative side, the belts have to run twice. Each phase is probably less work than the whole job, but you need to load the data for each belt, some of it the same in different phases, into the CPU multiple times. So you buy your parallel execution with more bandwidth. But the test above seems to show that there is extra bandwidth to be had.

Note: In such a phased setup it greatly helps to use separate memory pools per entity type so they are all in one nice big array. Then process them, giving each core a contiguous batch. Maybe even split the data for different phases into separate pools. Maximize the amount of useful data per cache line and DRAM page.
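
A sketch of what that layout could look like (field names invented): one flat array per field, with a phase handed out to workers as contiguous slices of it.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct InserterPool {
    std::vector<float> handProgress;   // data touched by the "swing arm" phase
    std::vector<int>   heldItem;       // data touched by the "transfer" phase
    std::size_t size() const { return handProgress.size(); }
};

void swingArmPhase(InserterPool& pool, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        pool.handProgress[i] += 0.05f;             // placeholder per-tick work
}

void runPhaseInBatches(InserterPool& pool, unsigned workers) {
    const std::size_t chunk = (pool.size() + workers - 1) / workers;
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end   = std::min(pool.size(), begin + chunk);
        if (begin >= end) break;
        threads.emplace_back(swingArmPhase, std::ref(pool), begin, end);
    }
    for (auto& t : threads) t.join();
}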

Re: Parrallel processing in games & applications

Posted: Mon Feb 17, 2020 10:21 pm
by jcranmer
nafira wrote:
Fri Feb 14, 2020 6:12 pm
I think this post sums up the thread pretty quickly (:joke:): the fact that Factorio updates thousands or millions, even nearly a billion actions every tick, is absolutely not compatible with multi-threading. Not with the x86 arch at least. With a thousand-core ARM PC (which doesn't exist, only a few attempts at servers), maybe, Factorio could parallelize a lot of things.
But with even 16 cores or 48 cores, it's not worth breaking the whole game logic.

Maybe one day ARM will make such an ASIC machine to run highly parallelized apps (like dozens of hundreds of threads with simple calculations), and maybe they will try a branch of the headless build (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.
It is easy to stamp out a massive number of ALUs on a single chip, but the difficult part is actually filling the ALUs with useful work. As noted, your idea of mega-core count computing is pretty similar to the reality of GPUs. There are a few other architectures with slightly different tradeoffs (e.g., FPGAs and CGRAs), but the moral of the story is going to end up the same.

Scheduling a large number of very fine-grained tasks is inherently difficult, perhaps unsolvable. If you're working on the order of 10s to 1000s of clock cycles (which is where most Factorio entity updates fall, I suspect), then the overhead of any dynamic scheduler will outright kill you. Static scheduling could work, but it has scalability issues of its own (FPGAs are notorious for long compile times), and a game like Factorio is going to hit the issue that the optimization of "don't compute the work of an entity doing nothing" is going to be far more effective than any scheduling algorithm's ability to distribute work, even if you pretend the scheduler costs nothing.
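
To put rough numbers on that (mine, purely illustrative): at 3 GHz an entity update of ~300 cycles is about 100 ns of actual work, while just pushing and popping a task through a well-tuned work-stealing queue typically costs tens of nanoseconds, and a contended lock costs far more. Per-entity tasks would therefore spend a large share of their time on scheduling rather than on work; batching thousands of entities per task amortizes that, but then the batches have to be independent of each other, which is the hard part discussed above.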

Re: Multithreaded performance

Posted: Tue Feb 18, 2020 1:21 am
by Nemo4809
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.

Re: Multithreaded performance

Posted: Tue Feb 18, 2020 11:33 am
by Rseding91
Nemo4809 wrote:
Tue Feb 18, 2020 1:21 am
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...

Re: Multithreaded performance

Posted: Wed Feb 19, 2020 10:39 am
by mrvn
Rseding91 wrote:
Tue Feb 18, 2020 11:33 am
Nemo4809 wrote:
Tue Feb 18, 2020 1:21 am
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...
Then why didn't you push threading more? I can't find the thread right now, but I remember statements that Factorio is memory bound and not CPU bound and that more threads would not help. I might have inferred that you meant memory bandwidth, because with memory latency more threads do help, so it couldn't have been that.

Re: Multithreaded performance

Posted: Wed Feb 19, 2020 11:49 am
by keyboardhack
mrvn wrote:
Wed Feb 19, 2020 10:39 am
Rseding91 wrote:
Tue Feb 18, 2020 11:33 am
Nemo4809 wrote:
Tue Feb 18, 2020 1:21 am
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...
Then why didn't you push threading more? I can't find the thread right now, but I remember statements that Factorio is memory bound and not CPU bound and that more threads would not help. I might have inferred that you meant memory bandwidth, because with memory latency more threads do help, so it couldn't have been that.
Throwing more threads at a problem can't help with all problems related to memory latency. Take a linked list where the problem is traversing the list: it is clearly memory-latency bound, but you can't throw more threads at it to speed it up.
Afaik that's exactly what the problem with Factorio is. There is a certain order in which the entities are updated, and each entity points to the next entity which should be updated, like a linked list. We can see that it's possible to multithread some parts of the program when this update order is changed, like what happened to fluids, but that doesn't mean it's trivial to change it for all entity types.

We could say "just update all inserters, then belts and so on", but we don't know how difficult a change that would be. From their FFFs it may seem like they sometimes discover mundane things that make the program faster, but in all probability they are just writing it in a way the average reader can follow.

Re: Multithreaded performance

Posted: Wed Feb 19, 2020 12:46 pm
by mrvn
keyboardhack wrote:
Wed Feb 19, 2020 11:49 am
mrvn wrote:
Wed Feb 19, 2020 10:39 am
Rseding91 wrote:
Tue Feb 18, 2020 11:33 am
Nemo4809 wrote:
Tue Feb 18, 2020 1:21 am
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
But that's what the devs are claiming. That the memory bandwidth simply isn't there to make multiple threads useful. And this pretty clearly proves them wrong. It seems to be more a problem with latency. That's where threads and hyper threads would really help.
Don't think they ever said that. I don't remember which post, but a dev said that memory throughput isn't the problem. Factorio doesn't use much memory bandwidth. Memory latency on the other hand is - i.e. the CPU is bottlenecked waiting for RAM to deliver data it needs.
We've always said memory latency. It's always latency. I don't know where people get the bandwidth thing from...
Then why didn't you push threading more? I can't find the thread right now, but I remember statements that Factorio is memory bound and not CPU bound and that more threads would not help. I might have inferred that you meant memory bandwidth, because with memory latency more threads do help, so it couldn't have been that.
Throwing more threads at a problem can't help with all problems related to memory latency. Take a linked list where the problem is traversing the list: it is clearly memory-latency bound, but you can't throw more threads at it to speed it up.
Afaik that's exactly what the problem with Factorio is. There is a certain order in which the entities are updated, and each entity points to the next entity which should be updated, like a linked list. We can see that it's possible to multithread some parts of the program when this update order is changed, like what happened to fluids, but that doesn't mean it's trivial to change it for all entity types.

We could say "just update all inserters, then belts and so on", but we don't know how difficult a change that would be. From their FFFs it may seem like they sometimes discover mundane things that make the program faster, but in all probability they are just writing it in a way the average reader can follow.
Yes. That's basically the problem: THE LIST. It's not the memory bandwidth or latency that's the problem, it's the list part that's causing it and also preventing multithreading. Put the inserters into an array and you remove the latency of looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too).

Often the problem is which data structure and algorithm you picked at the start. The problem now is that Factorio is basically finished. Changing the data structures for multithreading would probably set it back years. If you don't design something for threads from the start, it becomes harder and harder to change it later in ways that actually benefit from threads. You get bits and pieces here and there, but usually to really benefit you have to use a different design.

Re: Parrallel processing in games & applications

Posted: Wed Feb 19, 2020 12:50 pm
by posila
I feel like this thread is looping back on itself. On page 4, in 2017, Harkonnen mentioned memory bandwidth being a problem, which was immediately disputed by someone measuring it. I said we had bad memory access patterns. Rseding said we never said memory bandwidth was the problem (despite Harkonnen saying it was on that same page). kovarex might have blamed memory bandwidth in an FFF ... sadly, we don't really take our time to double-check and verify what we write on the internet.

I still think "bad memory access patterns" is the most correct way of describing the cause of the problem. That's where the essence of most of the optimizations has been coming from in the past years ... and from time to time, when we fix a bad memory access pattern, we remove some dependency that allows us to update different subsystems in parallel.

So, why don't we push threading more...
Let's say you have a map that's slow, and we take the entity that's the slowest to update on that map and optimize it so that it takes 0 time to update. If that resulted in a 20% reduction in update time, that would be really good. It is more common for the slowest thing to take around 10% of the update time.

Spreading the thing across multiple threads would not reduce it to 0 time, and in addition it would add some overhead to synchronize the threads.
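
To put numbers on that (mine, purely illustrative): if the slowest subsystem is 10% of a 16 ms update and you spread just that part perfectly over 8 cores, you get 0.9 × 16 + (0.1 × 16) / 8 ≈ 14.6 ms. That is roughly a 9% reduction instead of the 10% you would get by eliminating the subsystem outright, and it is before paying anything for thread synchronization.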

What I am trying to get to is the following (and this is just my personal opinion on how things went): since we released the game on Steam in 2016, we have been lying to ourselves that we would finish 1.0 "next Summer". Parallelization of the game update would be a huge undertaking; it would add more complexity to the update and would have a very uncertain result. It would have been too risky to attempt it for 1.0, and if it can't be done for 1.0, it's better to invest the time in optimizations that are smaller in scope and less risky to do.

EDIT: I wrote this before mrvn's post above this one. Basically, I agree with what mrvn said.

Re: Parrallel processing in games & applications

Posted: Wed Feb 19, 2020 5:15 pm
by Nemo4809
mrvn wrote:
Mon Feb 17, 2020 12:57 pm
Each entity would also have an old-state and a next-state. In each phase the next-state is computed from the old-state, and at the end you swap the two for the next phase.
I have toyed with this idea as a thought experiment ... and concluded it doesn't work out in practice.

E.g. 2 mobs, each with a thread to determine where they move. Based on the old state, they both decide to move to tile X. Except you can't have 2 entities occupying the same space. Going from old state to new state would allow this to happen unless you put a check on moving to a tile, but that would make the outcome nondeterministic, depending on whose thread ends up being processed first based on OS scheduling(?), and the recalculation effectively makes the 2 threads run sequentially.
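
For concreteness, that extra check would have to become its own deterministic resolution pass, something like this (made-up types; real movement rules, with mobs that stay put or swap places, need more than this). Resolving claims in entity-id order keeps the result independent of thread scheduling, but it is exactly the kind of extra, essentially serial step I mean:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tile { int x = 0, y = 0; };
struct Mob  { std::uint32_t id; Tile pos; Tile proposed; };

inline std::uint64_t key(Tile t) {
    return (std::uint64_t(std::uint32_t(t.x)) << 32) | std::uint32_t(t.y);
}

// mobs[i].proposed was filled in by the parallel "decide" phase; mobs are kept
// sorted by id, so the first claim on a tile is always the lowest id.
void resolveMoves(std::vector<Mob>& mobs) {
    std::unordered_map<std::uint64_t, std::uint32_t> claimed;
    for (const Mob& m : mobs)
        claimed.emplace(key(m.proposed), m.id);   // later claims don't overwrite
    for (Mob& m : mobs) {
        if (claimed[key(m.proposed)] == m.id)
            m.pos = m.proposed;                   // winner moves
        // losers simply stay put this tick and can try again next tick
    }
}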
posila wrote:
Wed Feb 19, 2020 12:50 pm
I still think "bad memory access patterns" is the most correct way of describing the cause of the problem.
From what I know about PC memory management, a "good" memory access pattern would involve the data you need next being near the data you are currently fetching, because modern PCs pre-fetch data surrounding the current data being fetched from memory. However, this isn't always the case. Sometimes the next set of data required isn't even determined until the current calculation is done, and it could be anywhere in memory - effectively making pre-fetch as it is useless, and preventing any sort of pre-fetching strategy.

PS: From what I heard, this is a real bottleneck when it comes to raytracing. The memory access pattern is effectively random and this is very bad for GPU memory that is tuned for high bandwidth at the cost of high latency.
mrvn wrote:
Wed Feb 19, 2020 12:46 pm
Put the inserters into an array and you remove the latency of looking up the next pointer. This then also allows splitting the array into threads (provided you do a few other things to make that possible too).
I think part of the complication is the skipping of processing for certain entities - an optimization. This makes memory pre-fetch effectively useless, as the next inserter to be updated could be the next one in the array or +5 down the array, and you are back to waiting for RAM.
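
The "very smart software prefetching" mentioned earlier in the thread is about exactly this case: if the indices of the entities that actually need updating are collected up front, the next few can be prefetched by hand while the current one is processed, even though they are scattered through the pool. A sketch with toy types, using GCC/Clang's __builtin_prefetch:

#include <cstddef>
#include <vector>

struct Inserter { float handProgress = 0.0f; int heldItem = 0; };

void updateActive(std::vector<Inserter>& pool,
                  const std::vector<std::size_t>& active) {
    constexpr std::size_t kLookahead = 4;          // tune per platform
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (i + kLookahead < active.size())
            __builtin_prefetch(&pool[active[i + kLookahead]]);  // hide the miss
        pool[active[i]].handProgress += 0.05f;     // placeholder per-tick work
    }
}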