Wouldn't blocking on memory be a great reason to multi-thread?

"Single-core performance has been optimized to the point where it's almost always waiting for RAM. Adding more threads increases the RAM demand and as I mentioned above: can cause both threads to run slower for having run at the same time due to contention for memory than if they ran in series on a single thread."
Sure, the CPU waits on memory. But while it's waiting for task 1, it could switch to another thread and issue a request for more memory. Multi-threading isn't just about increasing multi-core performance; for this reason it can improve single-core performance too. Lookahead helps (aided by pre-fetching), but by using a single thread you're limiting the program to (assuming an i7 core) part of the L3 cache and only one set of L1 and L2 caches. The amount of time spent waiting for memory should be reduced (when an appropriate amount of work is assigned to each thread). Memory contention is a difficult problem and requires independent units of work to exist.
The RAM demand increase from threading is only an issue if the units of work are small, as the RAM overhead of a thread is constant.
However, this is only true if we are simply waiting for memory and there's still capacity in the pipe between memory and the CPU (which I believe there is, as Factorio's performance and my machine learning algorithm's performance seem independent of each other).
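A minimal sketch of the "independent units of work" idea (the class name and chunking scheme here are my own, purely illustrative): a sum over a large array split into one contiguous chunk per thread. The chunks share no elements, each thread streams its own region of RAM, and the per-thread overhead is constant regardless of array size, which is the point about RAM demand above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    // Split the array into one contiguous chunk per thread; chunks share no
    // elements, so each is an independent unit of work needing no locks.
    static long parallelSum(long[] data, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads;
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int lo = Math.min(data.length, t * chunk);
                final int hi = Math.min(data.length, lo + chunk);
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += data[i]; // each thread streams its own region of RAM
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> p : parts) total += p.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i % 10;
        System.out.println(parallelSum(data, 4)); // prints 4500000, same as a serial loop
    }
}
```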
"Take that, and then still make sure everything is deterministic regardless of thread count without sacrificing the intricate entity interaction and most things simply can't be done in multiple threads: in order to update entity B entity A must first be updated as it can change what entity B will do during its update."

Order of work does sound like a difficult problem. I wonder where this appears in the game? I thought circuit networks update via a pipeline (that is, if there's an operating entity in the middle, it takes 2 updates for the signal to be sent, updated, and then sent again). Is it the logistics network? Are there potentially massive dependency trees involved? Many small trees?
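The pipelined update I'm describing can be sketched with double buffering (this is my guess at the scheme, not Factorio's actual code): each cell reads only the previous tick's signals, so update order, and therefore thread assignment, can't change the result.

```java
public class PipelineTick {
    // One tick: every cell reads ONLY last tick's values (prev) and writes
    // this tick's (next). Since nothing reads next while it's being built,
    // the cells could be updated in any order, or on any thread, with
    // identical (deterministic) results.
    static int[] tick(int[] prev, int[] inject) {
        int[] next = new int[prev.length];
        for (int i = 0; i < prev.length; i++) {
            int upstream = (i == 0) ? 0 : prev[i - 1]; // signal from one hop back
            next[i] = upstream + inject[i];
        }
        return next;
    }

    public static void main(String[] args) {
        int[] state = new int[3];
        int[] inject = {5, 0, 0}; // say, a constant combinator feeding the chain
        for (int t = 0; t < 3; t++) state = tick(state, inject);
        System.out.println(state[2]); // the signal needed 3 ticks to cross 3 cells: prints 5
    }
}
```

This is exactly the "takes 2 updates to be sent, updated, and sent again" behavior: the signal advances one hop per tick.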
I live in Java/C# land; I hated having to implement multi-threading in C++. I don't envy you.
@BlakeMW
"This has been discussed to death before..."

I know the topic has been touched on before (especially in some Friday Facts), but I hadn't seen any threads on the subject, even with my (albeit weak) search-fu. Could you point me towards some of those threads?
"And the program really has to be designed from the ground up to use this kind of massive-parallelism strategy, with determinism taken into account."

I think this gets blown out of proportion. Sure, there are things that need to be in place for multithreading to work: thread-safe data structures and definable, independent units of work. I don't think making the unit structures thread-safe will be hard; the toughest part, I'd think, is identifying the independent units of work. This may be what is meant by "ground up" design, but I don't think the cost of refactoring is so high that a threaded design from the start is a requirement. It helps, but it certainly isn't requisite (I personally introduce threading into programs not built with threading in mind on a somewhat regular basis).
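As an illustration of the kind of retrofit I mean (a sketch, assuming entities don't reference each other during this particular pass): once the units of work are identified as disjoint index ranges, a serial loop becomes a parallel one with almost no restructuring.

```java
import java.util.stream.IntStream;

public class RetrofitDemo {
    // Original serial pass: regenerate each entity's health.
    static void regenSerial(double[] health) {
        for (int i = 0; i < health.length; i++)
            health[i] = Math.min(100.0, health[i] + 1.0);
    }

    // Threaded retrofit: same loop body; each index is written by exactly
    // one thread, so no locks are needed and the result is deterministic.
    static void regenParallel(double[] health) {
        IntStream.range(0, health.length).parallel()
                 .forEach(i -> health[i] = Math.min(100.0, health[i] + 1.0));
    }

    public static void main(String[] args) {
        double[] a = {99.5, 10.0}, b = {99.5, 10.0};
        regenSerial(a);
        regenParallel(b);
        System.out.println(a[0] + " " + b[0]); // prints 100.0 100.0 -- identical results
    }
}
```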
Though, there are just some problems that cannot be parallelized, such as your hole example.
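The hole example has the same shape as any loop-carried dependence. A sketch (the constants are just a standard 64-bit LCG, chosen for illustration): step i can't start until step i-1 finishes, so adding threads buys nothing without reformulating the problem.

```java
public class SerialChain {
    // Each iteration consumes the previous iteration's output, so the loop
    // can't be split across threads naively: thread 2's starting value IS
    // thread 1's final value.
    static long iterate(long x, int steps) {
        for (int i = 0; i < steps; i++)
            x = x * 6364136223846793005L + 1442695040888963407L;
        return x;
    }

    public static void main(String[] args) {
        // Doing 10 steps is exactly 5 steps, twice, in order; there is no
        // way to compute the second half before the first is done.
        System.out.println(iterate(1L, 10) == iterate(iterate(1L, 5), 5)); // prints true
    }
}
```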
"(by the way for this kind of massive parallelism often GPU is used instead of CPU)"

The GPU is used when vector operations can be applied, but the varying data sizes of entities, and operations that depend on entity type, make that impractical here. While I personally haven't played with GPU programming, that level of multi-threading certainly is a unique beast: not only must the units of work be well defined and independent, they must all have the same operations performed on them. Branching is expensive on GPUs, for instance, so IF, CASE, and ternary operators are bad news.
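To make the "same operation on every unit" constraint concrete, a sketch (the entity types and update rules are invented for illustration): bucket entities by type first, so each inner pass is uniform and branch-free, which is the access pattern SIMD/GPU hardware wants.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchByType {
    enum Type { BELT, INSERTER }
    record Entity(Type type, int state) {}

    // Branch once per BATCH (choosing which uniform pass to run), not once
    // per entity inside the hot loop.
    static Map<Type, List<Integer>> update(List<Entity> entities) {
        Map<Type, List<Entity>> buckets =
            entities.stream().collect(Collectors.groupingBy(Entity::type));
        Map<Type, List<Integer>> out = new EnumMap<>(Type.class);
        buckets.forEach((type, batch) -> {
            switch (type) { // the only branch: picks the pass
                case BELT -> out.put(type,
                    batch.stream().map(e -> e.state() + 1).collect(Collectors.toList()));
                case INSERTER -> out.put(type,
                    batch.stream().map(e -> e.state() * 2).collect(Collectors.toList()));
            }
        });
        return out;
    }

    public static void main(String[] args) {
        var result = update(List.of(
            new Entity(Type.BELT, 1), new Entity(Type.INSERTER, 3)));
        System.out.println(result.get(Type.BELT) + " " + result.get(Type.INSERTER)); // prints [2] [6]
    }
}
```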
TL;DR: There are difficult problems to overcome that would take months to get "just right".