Friday Facts #215 - Multithreading issues

Regular reports on Factorio development.
Pure Ownage
Burner Inserter
Burner Inserter
Posts: 10
Joined: Fri Aug 18, 2017 9:59 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Pure Ownage » Tue Nov 07, 2017 11:06 pm

This is going to break the electric train mods, etc. :P

Paul17041993
Inserter
Inserter
Posts: 35
Joined: Fri Nov 25, 2016 4:26 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Paul17041993 » Wed Nov 08, 2017 1:33 am

Rseding91 wrote:
pleegwat wrote:creating/reaping threads costs time too. I do hope you're not doing that every tick?
Actually the cost of making/collecting a thread is so tiny these days that the cost of a thread pool type system ends up roughly the same. You just lose the benefits of exact start/stop when using the pool system.
Creating a thread involves allocating new memory for the thread's context and stack, and requires that you construct the thread's data before it can do anything, so definitely don't do that repeatedly. Thread pools are also often significantly slower for real-time code like this; it's much better to write your own thread manager and use techniques such as spinlock preparation (wake mutexed threads before scheduling their tasks, then let them spin while you feed them tasks), resonant wake-up (one thread in a group wakes the others when suitable) or self-scheduling (no unnecessary sleeping).
Please be sure you've googled your question before asking me about code... :T

phot
Manual Inserter
Manual Inserter
Posts: 1
Joined: Wed Nov 08, 2017 1:32 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by phot » Wed Nov 08, 2017 1:56 am

I saw other people brought this up, but it still hasn't been addressed.

I may be wrong, but based on the FFF it appears that you have three different processes that each update a single variable which is shared between them. The post strongly implied that the processes don't need each other's updated values. If that's the case and you are running into cache invalidation because you are updating values on the same cache line, simply don't do that: declare the variables constant in the other threads, or copy them by value when computing the results.

If you are running into issues because of actual mutual exclusion and atomic operations, then A: you should immediately update the FFF post, as that is not the message that was conveyed, and B: maybe it isn't worth looking for parallelization down this specific path.

Clearly the stated solution to the problem is, instead of making big changes to the code base, to make sure you *are* doing read-only or write-only operations, and recombine the results at the end. But I suspect that we simply aren't given enough information about why this is as big an issue as it is.

I also suspect that this isn't the correct path for parallelization anyway; you'll likely find that each step itself could be parallelized, rather than trying to run the steps in parallel with each other. You can use an iterate-and-check pattern to independently run steps of the process for trains, the electric network and belts, if you even need that.

ske
Filter Inserter
Filter Inserter
Posts: 380
Joined: Sat Oct 17, 2015 8:00 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ske » Wed Nov 08, 2017 4:52 pm

Paul17041993 wrote:Creating a thread involves allocating new memory for the thread's context and stack, and requires that you construct the thread's data before it can do anything, so definitely don't do that repeatedly. Thread pools are also often significantly slower for real-time code like this; it's much better to write your own thread manager and use techniques such as spinlock preparation (wake mutexed threads before scheduling their tasks, then let them spin while you feed them tasks), resonant wake-up (one thread in a group wakes the others when suitable) or self-scheduling (no unnecessary sleeping).
Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember, thread-creation cost depends heavily on the operating system and libraries, and it can be very small with a good combination. 15.6 microseconds wouldn't bother me too much in a 16,666-microsecond slice.

Rseding91
Factorio Staff
Factorio Staff
Posts: 9634
Joined: Wed Jun 11, 2014 5:23 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Rseding91 » Wed Nov 08, 2017 6:32 pm

Paul17041993 wrote:
Rseding91 wrote:
pleegwat wrote:creating/reaping threads costs time too. I do hope you're not doing that every tick?
Actually the cost of making/collecting a thread is so tiny these days that the cost of a thread pool type system ends up roughly the same. You just lose the benefits of exact start/stop when using the pool system.
Creating a thread involves allocating new memory for the thread's context and stack, and requires that you construct the thread's data before it can do anything, so definitely don't do that repeatedly. Thread pools are also often significantly slower for real-time code like this; it's much better to write your own thread manager and use techniques such as spinlock preparation (wake mutexed threads before scheduling their tasks, then let them spin while you feed them tasks), resonant wake-up (one thread in a group wakes the others when suitable) or self-scheduling (no unnecessary sleeping).
That's not what I'm experiencing when I run the game. Do you have any real code that backs up your statements? Because you can just run Factorio to back up what I'm saying :P
If you want to get ahold of me I'm almost always on Discord.

galibert
Inserter
Inserter
Posts: 37
Joined: Fri Sep 15, 2017 7:42 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by galibert » Thu Nov 09, 2017 8:33 am

phot wrote:I may be wrong, but based on the FFF it appears that you have three different processes that update a single variable, but that variable is shared between each.

(...)
I also suspect that this isn't the correct path for parallelization anyway; you'll likely find that each step itself could be parallelized, rather than trying to run the steps in parallel with each other. You can use an iterate-and-check pattern to independently run steps of the process for trains, the electric network and belts, if you even need that.
Yes and no. What they're suffering from is extreme non-locality in terms of cache, which is obvious in hindsight when you look at the game. The computations themselves are rather simple (moving entities, collisions, creating/destroying entities in assemblers, etc.). But they apply to data that is all over the place (the belt entity, inserter entity, assembler entity, the items themselves, etc.), and that data can't really be kept together (different entities) nor interleaved nicely (inserter 1 doesn't always interact with belt 1, assembler 1 and item 1, at all). So every otherwise-simple operation requires reading data from all over the place. They're not CPU-bound at all; they're memory-subsystem-bandwidth bound because they're blowing the caches. Cores, vector instructions and micro-parallelism only really help when the computation is the limit or the data is very separable. Neither is the case here.

The only thing they can do is do less (especially fewer reads) for a gameplay-equivalent result, which the belt optimizations are a really nice demonstration of.

OG.

TheTom
Fast Inserter
Fast Inserter
Posts: 161
Joined: Tue Feb 09, 2016 8:33 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by TheTom » Thu Nov 09, 2017 8:54 am

Would it not be nice then to organize entities into separate pools?

* One pool per chunk

* Separate pools for large multi chunk elements, such as electric networks, trains

That way chunks could align with pages. Per-chunk processing would then be able to take advantage of more locality, at least down to the L3 or L2 cache. It may be good preparation for multithreading, but at the same time, if it increases locality it may already boost single-threaded performance.

bobingabout
Smart Inserter
Smart Inserter
Posts: 6768
Joined: Fri May 09, 2014 1:01 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by bobingabout » Thu Nov 09, 2017 9:14 am

TheTom wrote:Would it not be nice then to organize entities into separate pools?

* One pool per chunk

* Separate pools for large multi chunk elements, such as electric networks, trains

That way chunks could align with pages. Per-chunk processing would then be able to take advantage of more locality, at least down to the L3 or L2 cache. It may be good preparation for multithreading, but at the same time, if it increases locality it may already boost single-threaded performance.
If you read the FFF post, that's what they're looking into doing... separating the data onto a per-chunk basis.

Personally, I think things that leave one chunk and enter another, like vehicles, could do with their own memory pool rather than being bound to a chunk (and having to be moved when they change chunks).
Creator of Bob's mods. Expanding your gameplay since version 0.9.8.
I also have a Patreon.

galibert
Inserter
Inserter
Posts: 37
Joined: Fri Sep 15, 2017 7:42 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by galibert » Thu Nov 09, 2017 9:43 am

TheTom wrote:That way chunks could align with pages. Per-chunk processing would then be able to take advantage of more locality, at least down to the L3 or L2 cache.
But pages don't matter much, comparatively. Hitting different pages only hits the TLB, which has a cost, but not as high as a cache miss. Cache lines are what matter, and they're 64 bytes. From the FFF, not many structures are smaller than that, so as long as they're aligned, it doesn't really matter in practice where they are allocated.

Of course the TLB impact by itself may be noticeable.

OG.

mrvn
Smart Inserter
Smart Inserter
Posts: 3492
Joined: Mon Sep 05, 2016 9:10 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mrvn » Thu Nov 09, 2017 1:14 pm

ske wrote:
Paul17041993 wrote:Creating a thread involves allocating new memory for the thread's context and stack, and requires that you construct the thread's data before it can do anything, so definitely don't do that repeatedly. Thread pools are also often significantly slower for real-time code like this; it's much better to write your own thread manager and use techniques such as spinlock preparation (wake mutexed threads before scheduling their tasks, then let them spin while you feed them tasks), resonant wake-up (one thread in a group wakes the others when suitable) or self-scheduling (no unnecessary sleeping).
Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember, thread-creation cost depends heavily on the operating system and libraries, and it can be very small with a good combination. 15.6 microseconds wouldn't bother me too much in a 16,666-microsecond slice.
What I remember is that creating a thread under Linux takes ~13,000 CPU cycles. Creating an operating-system thread is a syscall, and that alone is a major overhead. It also needs to lock the process's memory management and file descriptors so other threads can't (s)brk, mmap, open or close, among others. Probably not a problem for Factorio.

A mutex, condition variable or spinlock should all be faster than creating a thread by a landslide, certainly when you have 8 threads waiting on a condition ("prepare next tick") instead of creating 8 new threads every tick. So I'm not sure what thread-pool implementation you have that is slower than creating threads.

Mekronid
Inserter
Inserter
Posts: 21
Joined: Fri Aug 18, 2017 1:32 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Mekronid » Thu Nov 09, 2017 1:58 pm

bobingabout wrote:... for things that leave one chunk and enter another, like vehicles etc, they could do with their own memory pool rather than be bound to a chunk (and have to be moved when they move chunks.)
This has probably been discussed already, but the chunk method creates race conditions in entity/chunk interaction, and possibly also entity/entity interaction, that undermine some of the benefits of multithreading when handled in real time. The best way to avoid this is to be careful about synchronicity only within players' local view while lazily multithreading everything else. Race conditions can then occur, be picked up after the fact, and repaired; since it's outside the player's view, you avoid showing that mess to the player.

That's more work than just tossing things into their own threads and organizing the code to cache more efficiently. Loosely multithreading interacting objects can become volatile if re-synchronization is not handled well. Many games I've seen that implement chunks simply unload the chunks they're not using to save processing time; they don't use chunks to multithread things across infinite chunks, as is being speculated here. Then again, most games don't bother updating chunks outside the player's general vicinity like Factorio does, either, so multithreading may be the only viable option.

One game I know of that uses a hybrid of the two, Skyrim, renders the world using chunks but runs everything else on completely unsynchronized threads. That is, by the time a chunk loads, most of the processing about changed AI locations, quests, destroyed buildings, etc. has already been determined; the synchronization of those changes is only handled on a render tick when the chunk is loaded.

Using this method in Factorio, for example, you would ignore the state of all the factory's belts while off-screen and simply use a counter to determine how "full" a stretch of belt is and which inserters would therefore be able to operate on it. Pick an inserter to jam if too much is loaded onto the belt, possibly by determining where on the belt the inserter is relative to the number of items loaded, but ignore the exact order of the items. Then when the chunk loads, the game determines at that moment how many items of which types are on the belt, places them where they should be, and jams the proper inserter if the chosen one was incorrect.

This kind of threaded abstraction can become volatile as well, but it undoubtedly allows the game to save a lot of resources when processing many things off-screen. This is demonstrated by the fact that Skyrim can use 30+ GB of system memory (via mods) without forming a processing bottleneck. The game only becomes unstable if someone writes badly formed code in their mod or overruns the VRAM buffer. Even then, the game doesn't slow down, because it's simply not loading enough entities into the world to show lag on the render thread.

galibert
Inserter
Inserter
Posts: 37
Joined: Fri Sep 15, 2017 7:42 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by galibert » Thu Nov 09, 2017 3:46 pm

mrvn wrote:
ske wrote:Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember, thread-creation cost depends heavily on the operating system and libraries, and it can be very small with a good combination. 15.6 microseconds wouldn't bother me too much in a 16,666-microsecond slice.
What I remember is that creating a thread under Linux takes ~13,000 CPU cycles. Creating an operating-system thread is a syscall, and that alone is a major overhead. It also needs to lock the process's memory management and file descriptors so other threads can't (s)brk, mmap, open or close, among others. Probably not a problem for Factorio.

A mutex, condition variable or spinlock should all be faster than creating a thread by a landslide, certainly when you have 8 threads waiting on a condition ("prepare next tick") instead of creating 8 new threads every tick. So I'm not sure what thread-pool implementation you have that is slower than creating threads.
I just ran a test on my four-year-old laptop and creating a thread is 33,000 cycles. That may look like a lot, until you realize it means 11 microseconds. With a 16,666-microsecond tick, creating a handful of threads is not noticeable.

OG.

mrvn
Smart Inserter
Smart Inserter
Posts: 3492
Joined: Mon Sep 05, 2016 9:10 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mrvn » Thu Nov 09, 2017 4:23 pm

galibert wrote:
mrvn wrote:
ske wrote:Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember, thread-creation cost depends heavily on the operating system and libraries, and it can be very small with a good combination. 15.6 microseconds wouldn't bother me too much in a 16,666-microsecond slice.
What I remember is that creating a thread under Linux takes ~13,000 CPU cycles. Creating an operating-system thread is a syscall, and that alone is a major overhead. It also needs to lock the process's memory management and file descriptors so other threads can't (s)brk, mmap, open or close, among others. Probably not a problem for Factorio.

A mutex, condition variable or spinlock should all be faster than creating a thread by a landslide, certainly when you have 8 threads waiting on a condition ("prepare next tick") instead of creating 8 new threads every tick. So I'm not sure what thread-pool implementation you have that is slower than creating threads.
I just ran a test on my four-year-old laptop and creating a thread is 33,000 cycles. That may look like a lot, until you realize it means 11 microseconds. With a 16,666-microsecond tick, creating a handful of threads is not noticeable.

OG.
Creating one thread is 33,000 cycles. But above, 8 threads were mentioned. And at 60 ticks per second:

33000 * 8 * 60 = 15,840,000

That's 1% of the CPU time on a 1.6 GHz system.

mordof
Manual Inserter
Manual Inserter
Posts: 2
Joined: Fri Oct 13, 2017 8:45 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mordof » Thu Nov 09, 2017 5:19 pm

@kovarex

Regarding the difficulty you ran into parallelising trains, electricity and belts: I'm not all that familiar with how that would cause slowdowns. I tried to discuss it with a friend, and we're not sure whether, when thread A updates data in the L3 cache, threads B and C end up with cache misses?

pleegwat
Fast Inserter
Fast Inserter
Posts: 162
Joined: Fri May 19, 2017 7:31 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by pleegwat » Thu Nov 09, 2017 5:32 pm

galibert wrote:
phot wrote:I may be wrong, but based on the FFF it appears that you have three different processes that update a single variable, but that variable is shared between each.

(...)
I also suspect that this isn't the correct path for parallelization anyway; you'll likely find that each step itself could be parallelized, rather than trying to run the steps in parallel with each other. You can use an iterate-and-check pattern to independently run steps of the process for trains, the electric network and belts, if you even need that.
Yes and no. What they're suffering from is extreme non-locality in terms of cache, which is obvious in hindsight when you look at the game. The computations themselves are rather simple (moving entities, collisions, creating/destroying entities in assemblers, etc.). But they apply to data that is all over the place (the belt entity, inserter entity, assembler entity, the items themselves, etc.), and that data can't really be kept together (different entities) nor interleaved nicely (inserter 1 doesn't always interact with belt 1, assembler 1 and item 1, at all). So every otherwise-simple operation requires reading data from all over the place. They're not CPU-bound at all; they're memory-subsystem-bandwidth bound because they're blowing the caches. Cores, vector instructions and micro-parallelism only really help when the computation is the limit or the data is very separable. Neither is the case here.

The only thing they can do is do less (especially fewer reads) for a gameplay-equivalent result, which the belt optimizations are a really nice demonstration of.

OG.
Actually, with one thread they'd hit memory latency first. Prefetching allows additional bandwidth use, but somehow I doubt that's enough to actually make it a bandwidth problem. If multiple threads don't help at all, I strongly suspect there's something else going on.

Lubricus
Fast Inserter
Fast Inserter
Posts: 219
Joined: Sun Jun 04, 2017 12:13 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Lubricus » Thu Nov 09, 2017 5:36 pm

Mekronid wrote:Using this method in Factorio, for example, you would ignore the state of all the factory's belts while off-screen and simply use a counter to determine how "full" a stretch of belt is and which inserters would therefore be able to operate on it. Pick an inserter to jam if too much is loaded onto the belt, possibly by determining where on the belt the inserter is relative to the number of items loaded, but ignore the exact order of the items.
Factorio is more subtle than that: inserters don't just have a pick-up speed, they have rotation and elongation speeds, so the order and placement of items and how clumped they are on the belt all affect the speed of the inserters. Everything also has to be perfectly deterministic, so simplifications "on the fly" would be problematic.

Qweesdy
Manual Inserter
Manual Inserter
Posts: 1
Joined: Fri Nov 10, 2017 12:33 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Qweesdy » Fri Nov 10, 2017 1:24 pm

We were always saying that we keep the full update multithreading as the last ace in the sleeve, to be used after the normal optimisations become too difficult.
I think that's very much the wrong approach. "Multi-threading done properly" affects the design of everything (the game loop, all data structures, how locking and synchronisation are done, etc.). Retro-fitting support for multiple threads is like replacing the chassis of your car while you're driving on the highway: you can get some small benefits in a few specific places (e.g. splitting graphics off into its own thread), but getting real gains involves changing far too much. Retro-fitting multiple threads into a game that's already been heavily optimised for a single CPU is even worse (it's like replacing the chassis while driving on the highway, juggling chainsaws and drinking a keg of beer); it becomes so hard that it's probably faster to start "version 2.0" from scratch (which leads to inevitable scope creep, which leads to "OMG, 5 years without an update?"). ;)

ratchetfreak
Filter Inserter
Filter Inserter
Posts: 936
Joined: Sat May 23, 2015 12:10 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ratchetfreak » Fri Nov 10, 2017 2:02 pm

Qweesdy wrote:
We were always saying that we keep the full update multithreading as the last ace in the sleeve, to be used after the normal optimisations become too difficult.
I think that's very much the wrong approach. "Multi-threading done properly" affects the design of everything (the game loop, all data structures, how locking and synchronisation are done, etc.). Retro-fitting support for multiple threads is like replacing the chassis of your car while you're driving on the highway: you can get some small benefits in a few specific places (e.g. splitting graphics off into its own thread), but getting real gains involves changing far too much. Retro-fitting multiple threads into a game that's already been heavily optimised for a single CPU is even worse (it's like replacing the chassis while driving on the highway, juggling chainsaws and drinking a keg of beer); it becomes so hard that it's probably faster to start "version 2.0" from scratch (which leads to inevitable scope creep, which leads to "OMG, 5 years without an update?"). ;)
Yeah, there are some fundamental design decisions that can be made to make multithreading easier.

One is to put a hard limit on how far each entity has to reach to do its update.

For example, an inserter dropping an item becomes a two-tick event. First it sends a message to the target that it is going to drop an item. Then (the next tick) the target looks at all entities trying to drop something, selects who is allowed to drop, and sends replies. A tick later again, the inserter can rotate back or continue trying to drop.

There is only read-only access to the previous state (completely fixed during the update), so no locking is required. Any state change is only visible on the next tick, so the update order of entities is unimportant; only the actions each entity takes need to be deterministic.

However, currently Factorio doesn't follow that message-style communication; instead, every entity can reach in and change anything about another entity. This is not conducive to parallelizing. Changing Factorio to this model is a major rewrite and may turn things that are easy now into impossible ones, and vice versa.

PunkSkeleton
Long Handed Inserter
Long Handed Inserter
Posts: 60
Joined: Sun Oct 09, 2016 2:10 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by PunkSkeleton » Fri Nov 10, 2017 7:15 pm

mrvn wrote:
galibert wrote:
mrvn wrote:
ske wrote:Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember, thread-creation cost depends heavily on the operating system and libraries, and it can be very small with a good combination. 15.6 microseconds wouldn't bother me too much in a 16,666-microsecond slice.
What I remember is that creating a thread under Linux takes ~13,000 CPU cycles. Creating an operating-system thread is a syscall, and that alone is a major overhead. It also needs to lock the process's memory management and file descriptors so other threads can't (s)brk, mmap, open or close, among others. Probably not a problem for Factorio.

A mutex, condition variable or spinlock should all be faster than creating a thread by a landslide, certainly when you have 8 threads waiting on a condition ("prepare next tick") instead of creating 8 new threads every tick. So I'm not sure what thread-pool implementation you have that is slower than creating threads.
I just ran a test on my four-year-old laptop and creating a thread is 33,000 cycles. That may look like a lot, until you realize it means 11 microseconds. With a 16,666-microsecond tick, creating a handful of threads is not noticeable.

OG.
Creating one thread is 33,000 cycles. But above, 8 threads were mentioned. And at 60 ticks per second:

33000 * 8 * 60 = 15,840,000

That's 1% of the CPU time on a 1.6 GHz system.
It doesn't matter how long it takes to create a thread; creating it throws away the L1 and L2 caches. I was recently able to almost quadruple the throughput of a multi-threaded system by getting rid of thread migrations from core to core.

Mekronid
Inserter
Inserter
Posts: 21
Joined: Fri Aug 18, 2017 1:32 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Mekronid » Mon Nov 13, 2017 3:53 pm

Lubricus wrote:Factorio is more subtle than that: inserters don't just have a pick-up speed, they have rotation and elongation speeds, so the order and placement of items and how clumped they are on the belt all affect the speed of the inserters. Everything also has to be perfectly deterministic, so simplifications "on the fly" would be problematic.
I'm aware, but to be frank, subtlety costs CPU time. Look at it from a mathematical standpoint: a belt's throughput is a fixed value, production is a fixed value, and an inserter's throughput is a fixed value. Rotation and elongation speeds should be fixed values as well. Any variation in the above comes from three things: power (predictable), boosts (predictable) and random chance (random). Because there is only one source of randomness, the difference over time caused by any subtleties should show reversion to the mean: the rule that any series of random events with a fixed probability will eventually converge on an average distribution.

I've found there's often a complicated way that looks nice on screen and an elegant way that looks nice to the computer's hardware. If the player isn't going to see it, and we know the math averages out, then the little subtleties only serve to bog down the CPU. Separating the probabilistic math from the synchronous calculations this way allows for more effective multithreading.

Some related trivia: removing a progress bar from some resource-intensive applications can drastically speed up execution. Positioning and rendering take up a huge amount of processing time when used synchronously from backend code, which is why some programs nowadays (notably several operating systems, especially when no video drivers are loaded) don't have functional progress bars.
