
Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 4:17 pm
by Koub
[Koub] Merged into the "parallelization" topic

Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 4:29 pm
by movax20h
One of the lower-hanging fruits for parallelisation is processing disconnected components in parallel, where it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to keep it deterministic, even if these components share state or are connected via bots, by properly designing the order in which the results are merged back. It would help quite a bit, because in many cases one has many outposts, which could then run independently. Of course the game should detect cases where it doesn't make sense (e.g. a predominantly bot-based base, which would appear as a large number of disconnected components). Just some ideas.
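
To illustrate the shape of that idea, here is a minimal C++ sketch (hypothetical Connection type and an invented update_component(), not Factorio's actual internals): a union-find groups entities into connected components via their inserter/belt/assembler links, each component is updated on its own thread, and any shared results would be merged back in component-index order so the outcome stays deterministic.

[code]
// Sketch only: all types and the per-component update are made up.
#include <cstdint>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

struct Connection { std::uint32_t a, b; };  // a direct inserter/belt/assembler link

// Union-find with path halving to group entities into connected components.
struct UnionFind {
    std::vector<std::uint32_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0u);
    }
    std::uint32_t find(std::uint32_t x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    }
    void unite(std::uint32_t a, std::uint32_t b) { parent[find(a)] = find(b); }
};

// Stand-in for running the normal per-entity update, restricted to one component.
void update_component(const std::vector<std::uint32_t>& members) { (void)members; }

void parallel_tick(std::size_t entity_count, const std::vector<Connection>& links) {
    UnionFind uf(entity_count);
    for (const Connection& c : links) uf.unite(c.a, c.b);

    // Bucket entities by component root; the bucket order is fixed by entity id,
    // so both the work split and the later merge are deterministic.
    std::vector<std::vector<std::uint32_t>> components(entity_count);
    for (std::uint32_t e = 0; e < entity_count; ++e)
        components[uf.find(e)].push_back(e);

    std::vector<std::thread> workers;   // a real version would use a thread pool
    for (const auto& comp : components)
        if (!comp.empty())
            workers.emplace_back(update_component, std::cref(comp));
    for (std::thread& w : workers) w.join();
    // Shared results (bot networks, statistics, ...) would be merged here,
    // walking the components in index order on a single thread.
}
[/code]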

Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 4:56 pm
by sushi_eater
movax20h wrote:
Thu Feb 13, 2020 4:09 pm
You are mostly testing the memory bandwidth here, not latency. It is possible to hide latency by using more threads and hardware multithreading, or very smart software prefetching, but that would complicate the code base a lot, and it still doesn't work easily with all entities.
I'm not saying it's easy (good multi-threaded code can easily be 10x more complex). I'm pointing out what's possible.
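
As one concrete illustration of the prefetching idea from the quoted post (a sketch with an invented EntityData type, not real engine code): when active entities are reached through pointers, the hardware prefetcher cannot follow them, but the loop can issue a software prefetch a few iterations ahead so the memory fetch overlaps with the current entity's work.

[code]
// Sketch of software prefetching over a pointer-chased active list; EntityData is invented.
#include <cstddef>
#include <vector>

struct EntityData { int energy = 1; int progress = 0; };

void update(EntityData& e) { e.progress += e.energy; }  // stand-in for the real per-entity work

void update_all(const std::vector<EntityData*>& active) {
    constexpr std::size_t lookahead = 8;  // tune to memory latency vs. work per entity
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (i + lookahead < active.size())
            __builtin_prefetch(active[i + lookahead]);  // GCC/Clang builtin: start the load early
        update(*active[i]);  // by now the entity prefetched 8 iterations ago should be in cache
    }
}
[/code]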

Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 5:14 pm
by Rseding91
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the lower-hanging fruits for parallelisation is processing disconnected components in parallel, where it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to keep it deterministic, even if these components share state or are connected via bots, by properly designing the order in which the results are merged back. It would help quite a bit, because in many cases one has many outposts, which could then run independently. Of course the game should detect cases where it doesn't make sense (e.g. a predominantly bot-based base, which would appear as a large number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic, then it's just not slow anymore and not worth the complexity of trying to thread.

If you just break that connection and the logic behind why it existed, you remove a large part of what makes Factorio Factorio, and then things just aren't the same.

Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 6:07 pm
by movax20h
Rseding91 wrote:
Thu Feb 13, 2020 5:14 pm
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the lower-hanging fruits for parallelisation is processing disconnected components in parallel, where it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to keep it deterministic, even if these components share state or are connected via bots, by properly designing the order in which the results are merged back. It would help quite a bit, because in many cases one has many outposts, which could then run independently. Of course the game should detect cases where it doesn't make sense (e.g. a predominantly bot-based base, which would appear as a large number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic, then it's just not slow anymore and not worth the complexity of trying to thread.

If you just break that connection and the logic behind why it existed, you remove a large part of what makes Factorio Factorio, and then things just aren't the same.
Oh. Thanks for the info. 15% is definitely not worth the trouble.

Do you have a more detailed write-up about these attempts and findings? Any experimental old binaries (headless, or even not fully working) you could share for testing today?

I hope that after Factorio 1.0 / 1.1 some new attempts will be made to make it work.

Re: Parallel processing in games & applications

Posted: Thu Feb 13, 2020 11:36 pm
by Dixi
What about calculating the game tick in one thread while another CPU core visualizes the previous tick's results? This would add an extra tick of delay to the visualization, but those calculations could be done independently on at least two CPU cores.

From the F5 diagnostics, Render takes about 1.5 ms, while Update takes 3.0 ms. If I understand those numbers correctly, the game world/logic update takes, on my PC and on this map, twice the time of the actual frame visualization, so splitting those actions across different cores might improve performance.

On the other hand, I don't know how those numbers would look on a low-end PC, because the numbers I mentioned were measured with the game running in the background. While the game runs in the foreground, even on a big factory (2k+ SPM) at maximum zoom-out, there is a sleep delay of about half a frame time. So on a modern, average-priced gaming PC it runs so well on a single core that nothing is really needed.
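
A rough sketch of that pipeline (with invented GameState, simulate_tick and render_frame names; this is not how Factorio is actually structured): tick N is simulated into one buffer while the previous tick is rendered from the other, at the cost of one tick of latency plus a state copy, which is the naive part a real engine would replace with a smaller snapshot of only what the renderer needs.

[code]
// Sketch: simulate tick N while tick N-1 is rendered; all names are made up.
#include <atomic>
#include <functional>
#include <thread>

struct GameState { long tick = 0; /* entities, belts, ... */ };

void simulate_tick(GameState& s) { ++s.tick; /* advance the game logic here */ }
void render_frame(const GameState& s) { (void)s; /* draw a finished tick */ }

void run(GameState buffers[2], std::atomic<bool>& running) {
    int cur = 0;                                 // buffer holding the latest finished tick
    while (running) {
        int next = 1 - cur;
        buffers[next] = buffers[cur];            // copy so tick N can be advanced independently
        std::thread render(render_frame, std::cref(buffers[cur]));  // draw tick N-1
        simulate_tick(buffers[next]);            // compute tick N on this thread meanwhile
        render.join();                           // both must finish before roles swap
        cur = next;                              // tick N is rendered on the next loop
    }
}
[/code]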

Re: Parallel processing in games & applications

Posted: Fri Feb 14, 2020 12:30 am
by Rseding91
Dixi wrote:
Thu Feb 13, 2020 11:36 pm
What about calculating the game tick in one thread while another CPU core visualizes the previous tick's results? This would add an extra tick of delay to the visualization, but those calculations could be done independently on at least two CPU cores.

From the F5 diagnostics, Render takes about 1.5 ms, while Update takes 3.0 ms. If I understand those numbers correctly, the game world/logic update takes, on my PC and on this map, twice the time of the actual frame visualization, so splitting those actions across different cores might improve performance.

On the other hand, I don't know how those numbers would look on a low-end PC, because the numbers I mentioned were measured with the game running in the background. While the game runs in the foreground, even on a big factory (2k+ SPM) at maximum zoom-out, there is a sleep delay of about half a frame time. So on a modern, average-priced gaming PC it runs so well on a single core that nothing is really needed.
The render preparation step is already utilizing all of your CPU cores. You can see how slow it gets by going into the graphics settings and setting render threads to 0 or 1, or whatever the minimum is.

Re: Parallel processing in games & applications

Posted: Fri Feb 14, 2020 6:12 pm
by nafira
Rseding91 wrote:
Thu Feb 13, 2020 5:14 pm
movax20h wrote:
Thu Feb 13, 2020 4:29 pm
One of the lower-hanging fruits for parallelisation is processing disconnected components in parallel, where it makes sense. By connectivity I mean here connected via some path of inserters, assemblers / furnaces / labs, belts and chests, but not counting trains, pipes, the electric network or bots. It should be possible to keep it deterministic, even if these components share state or are connected via bots, by properly designing the order in which the results are merged back. It would help quite a bit, because in many cases one has many outposts, which could then run independently. Of course the game should detect cases where it doesn't make sense (e.g. a predominantly bot-based base, which would appear as a large number of disconnected components). Just some ideas.
There was an attempt at exactly that about 3 years ago and it gave maybe a 15% improvement at the time while causing other issues.

Virtually everything that takes any amount of time is connected to everything else. If you can manage to split that connection without breaking the game logic, then it's just not slow anymore and not worth the complexity of trying to thread.

If you just break that connection and the logic behind why it existed, you remove a large part of what makes Factorio Factorio, and then things just aren't the same.
I think this post pretty much sums up the thread (:joke:): the fact that Factorio updates thousands, millions, or even nearly a billion actions every tick is simply not compatible with multi-threading. Not on the x86 architecture, at least. With a thousand-core ARM PC (which doesn't exist; there are only a few attempts at servers), maybe Factorio could parallelize a lot of things.
But with even 16 or 48 cores, it's not worth breaking the whole game logic.

Maybe one day ARM will make such an ASIC-like machine to run highly parallelized apps (many hundreds of threads doing simple calculations), and maybe they will try a branch of the headless version (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.

And sadly, I think that's pretty much it.
We can still thank them for making the game playable even with a megabase on a laptop.

A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:

Re: Parallel processing in games & applications

Posted: Sat Feb 15, 2020 7:31 am
by Nemo4809
nafira wrote:
Fri Feb 14, 2020 6:12 pm
Maybe one day ARM will make such an ASIC-like machine to run highly parallelized apps (many hundreds of threads doing simple calculations), and maybe they will try a branch of the headless version (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.
That's kind of what GPU computing is: a massive number of calculations in parallel, mostly simple with few to no branches, often the same calculation over and over on a ton of data.
A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:
Speaking of Minecraft, it cheats when it comes to its simulation: it only simulates a fixed number of blocks around the player; if the player gets too far from whatever they built, the simulation stops.

The fact that Factorio simulates everything that needs to be simulated on the map is quite the technical feat - to me anyway.

Re: Parallel processing in games & applications

Posted: Sun Feb 16, 2020 12:57 pm
by nafira
Nemo4809 wrote:
Sat Feb 15, 2020 7:31 am
A nearly empty world in Minecraft is not playable on a $1500 rig, so... let's be thankful! :geek:
Speaking of Minecraft, it cheats when it comes to its simulation: it only simulates a fixed number of blocks around the player; if the player gets too far from whatever they built, the simulation stops.

The fact that Factorio simulates everything that needs to be simulated on the map is quite the technical feat - to me anyway.
And even within your own 15*15 block it limits the number of enemies, for example, which is annoying.

Re: Parallel processing in games & applications

Posted: Mon Feb 17, 2020 1:39 am
by bobucles
GPU computing does not have the same capabilities as CPU computing. Each new generation adds more features, but if it's not up to the whole task, then it won't work.

Re: Multithreaded performance

Posted: Mon Feb 17, 2020 12:57 pm
by mrvn
movax20h wrote:
Thu Feb 13, 2020 4:09 pm
You are mostly testing the memory bandwidth here, not latency. It is possible to hide latency by using more threads and hardware multithreading, or very smart software prefetching, but that would complicate the code base a lot, and it still doesn't work easily with all entities.

It is more of an open research area. Definitely doable, but it requires a lot of work and restructuring / rearchitecting.

I am doing research on a Factorio-like clone / simulator that is mostly focused on multithreading, interaction-graph partitioning, load balancing, and other tricks (like how to use some shared state asynchronously without locks, but with a deterministic outcome), and it is non-trivial. I do it in Erlang, which makes it nice to prototype, but that induces interesting problems of its own (somewhat higher memory use per entity). I have already discovered quite a few bottlenecks in the process scheduler, for example, mostly due to non-locality of entities.
But that's what the devs are claiming: that the memory bandwidth simply isn't there to make multiple threads useful. This pretty clearly proves them wrong. It seems to be more a problem of latency, and that's where threads and hyper-threads would really help.

The easiest way to break all the interdependencies between entities that make multithreading impossible or very hard is to split actions into phases. Each entity would also have an old-state and a next-state. In each phase the next-state is computed from the old-state, and at the end the two are swapped for the next phase.

Now the trick is to split the phases so that each has no dependencies on other entities. Sometimes doing something needs 1 phase, sometimes 2, and other things could need 10 phases. Let's take an example: inserters taking something from a belt:

Phase 1: belt has item, wake up all interested inserters
Phase 2: inserters check items and set flag if they can accept it
Phase 3: belt picks inserter to receive item from those agreeable, item is transferred

In each phase there is no dependency between entities. All belts can run at the same time in phases 1 and 3, and all inserters can run at the same time in phase 2. Inserters don't actually pick up the item or decide among themselves who gets the item; that choice is made in phase 3 by the belt, avoiding the dependency between inserters picking from the same belt.
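
A condensed sketch of those phases (invented Belt and Inserter structs, not the real entity code): every entity keeps cur and next copies of its state, phase 2 reads only other entities' cur state and writes the inserter's own next state, and in phase 3 each belt alone hands the item to one willing inserter, so parallel belts never touch the same inserter as long as each inserter watches a single belt.

[code]
// Sketch of the phased, double-buffered update; all types are invented.
#include <cstdint>
#include <vector>

struct InserterState { bool wants_item = false; bool holds_item = false; };
struct BeltState     { bool has_item = false; };

struct Inserter { InserterState cur, next; std::uint32_t watched_belt = 0; };
struct Belt {
    BeltState cur, next;
    std::vector<std::uint32_t> inserters;    // indices of inserters watching this belt
};

// Phase 1: belts carry their state into the new buffer (the "wake interested
// inserters" step is implicit: phase 2 checks every watching inserter). All belts in parallel.
void phase1(std::vector<Belt>& belts) {
    for (Belt& b : belts) b.next.has_item = b.cur.has_item;
}

// Phase 2: each inserter reads only belt.cur and writes only its own next state.
// All inserters in parallel.
void phase2(std::vector<Inserter>& ins, const std::vector<Belt>& belts) {
    for (Inserter& i : ins) {
        i.next.holds_item = i.cur.holds_item;  // carry state forward into the new buffer
        i.next.wants_item = !i.cur.holds_item && belts[i.watched_belt].cur.has_item;
    }
}

// Phase 3: each belt alone picks which of its willing inserters gets the item.
// Since an inserter watches exactly one belt here, parallel belts never touch
// the same inserter, so there is no race between inserters either.
void phase3(std::vector<Belt>& belts, std::vector<Inserter>& ins) {
    for (Belt& b : belts) {
        if (!b.next.has_item) continue;
        for (std::uint32_t idx : b.inserters) {
            if (!ins[idx].next.wants_item) continue;
            b.next.has_item = false;           // item leaves the belt
            ins[idx].next.holds_item = true;   // and lands in the inserter's hand
            break;                             // first willing inserter wins (deterministic)
        }
    }
}

// End of tick: everyone swaps next into cur for the following tick.
void end_of_tick(std::vector<Belt>& belts, std::vector<Inserter>& ins) {
    for (Belt& b : belts) b.cur = b.next;
    for (Inserter& i : ins) i.cur = i.next;
}
[/code]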

On the negative side, the belts have to run twice. Each phase is probably less work than the whole job, but you need to load each belt's data, some of it the same across phases, into the CPU multiple times. So you buy your parallel execution with more bandwidth. But the test above seems to show that there is extra bandwidth to be had.

Note: in such a phased setup it greatly helps to use separate memory pools per entity type, so they all sit in one nice big array. Then process them by giving each core a contiguous batch. Maybe even split the data for different phases into separate pools, to maximize the amount of useful data per cache line and DRAM page.
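
For the pooling note, something like the following layout (again just a sketch with an invented FurnaceData type): one contiguous vector per entity type, processed in contiguous per-core batches, so each core streams through whole cache lines instead of chasing pointers.

[code]
// Sketch: contiguous per-type storage processed in contiguous per-core batches.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct FurnaceData { float fuel = 1.0f; float progress = 0.0f; };

void update_range(std::vector<FurnaceData>& pool, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        pool[i].progress += pool[i].fuel;        // stand-in for the real per-entity work
}

void update_pool(std::vector<FurnaceData>& pool, unsigned workers) {
    if (workers == 0) workers = 1;
    std::size_t chunk = (pool.size() + workers - 1) / workers;  // one contiguous batch per core
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t begin = std::min<std::size_t>(std::size_t(w) * chunk, pool.size());
        std::size_t end   = std::min<std::size_t>(begin + chunk, pool.size());
        if (begin < end)
            threads.emplace_back(update_range, std::ref(pool), begin, end);
    }
    for (std::thread& t : threads) t.join();
}
[/code]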

Re: Parallel processing in games & applications

Posted: Mon Feb 17, 2020 10:21 pm
by jcranmer
nafira wrote:
Fri Feb 14, 2020 6:12 pm
I think this post pretty much sums up the thread (:joke:): the fact that Factorio updates thousands, millions, or even nearly a billion actions every tick is simply not compatible with multi-threading. Not on the x86 architecture, at least. With a thousand-core ARM PC (which doesn't exist; there are only a few attempts at servers), maybe Factorio could parallelize a lot of things.
But with even 16 or 48 cores, it's not worth breaking the whole game logic.

Maybe one day ARM will make such an ASIC-like machine to run highly parallelized apps (many hundreds of threads doing simple calculations), and maybe they will try a branch of the headless version (I don't think it's worth it otherwise, or even worth it at all).
GPU computing is clearly not an option.
It is easy to stamp out a massive number of ALUs on a single chip, but the difficult part is actually filling the ALUs with useful work. As noted, your idea of mega-core count computing is pretty similar to the reality of GPUs. There are a few other architectures with slightly different tradeoffs (e.g., FPGAs and CGRAs), but the moral of the story is going to end up the same.

Scheduling a large number of very fine-grained tasks is inherently difficult, perhaps unsolvable. If you're working on the order of 10s to 1000s of clock cycles (which is where most Factorio entity updates fall, I suspect), then the overhead of any dynamic scheduler will outright kill you. Static scheduling could work, but it has scalability issues of its own (FPGAs are notorious for long compile times). And a game like Factorio will hit the issue that the optimization of "don't compute the work of an entity doing nothing" is far more effective than any scheduling algorithm's ability to distribute work, even if you pretend the scheduler costs nothing.
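
To make the overhead argument concrete, here is a toy comparison (my own illustration, not a measurement of Factorio or a claim about its scheduler): handing out every tiny task through a shared atomic counter costs one contended atomic operation per task, which at tens of cycles of work can rival the task itself, while static chunking pays the scheduling cost once per thread.

[code]
// Toy illustration of scheduling overhead on very small tasks; the workload is made up.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

constexpr std::size_t kTasks = 1'000'000;
std::vector<int> work(kTasks, 1);

void tiny_update(std::size_t i) { work[i] += 1; }   // stands in for "tens of cycles" of entity work

// Dynamic scheduling: every task pays a contended fetch_add on a shared counter.
void dynamic_worker(std::atomic<std::size_t>& next) {
    for (std::size_t i = next.fetch_add(1); i < kTasks; i = next.fetch_add(1))
        tiny_update(i);
}

// Static scheduling: the distribution decision is made once per thread.
void static_worker(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        tiny_update(i);
}

template <class F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;

    double dyn = time_ms([&] {
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n; ++t) threads.emplace_back(dynamic_worker, std::ref(next));
        for (auto& t : threads) t.join();
    });

    double stat = time_ms([&] {
        std::size_t chunk = (kTasks + n - 1) / n;
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n; ++t)
            threads.emplace_back(static_worker, t * chunk,
                                 std::min<std::size_t>((t + 1) * chunk, kTasks));
        for (auto& t : threads) t.join();
    });

    std::cout << "per-task dynamic: " << dyn << " ms, static chunks: " << stat << " ms\n";
}
[/code]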