Re: Friday Facts #215 - Multithreading issues
Posted: Tue Nov 07, 2017 11:06 pm
This is going to break the electric train mods, etc. 

pleegwat wrote: creating/reaping threads costs time too. I do hope you're not doing that every tick?

Rseding91 wrote: Actually the cost of making/collecting a thread is so tiny these days that the cost of a thread pool type system ends up roughly the same. You just lose the benefits of exact start/stop when using the pool system.

Creating a thread involves allocating new memory for the context and the thread's stack, requires you construct the thread's required data and then have it do something, definitely don't ever do that repeatedly. Thread pools are also often significantly slow for real-time code like this, it's much better to write your own thread manager and use things such as spinlock preparation (wake mutex'ed threads before scheduling their tasks, let them spinlock while you feed them tasks) and resonant (one thread in a group wakes others when suitable) or self scheduling techniques (no unnecessary sleeping).
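As an illustration of the "wake the workers once, then let them spin while you feed them tasks" approach described above, here is a minimal C++ sketch (my own, not Factorio's code): the threads are created once, and each tick the main thread bumps an atomic generation counter that the spinning workers watch.

Code:
// A minimal sketch (not Factorio's code) of the "keep workers alive and let them
// spin while you feed them work" idea: threads are created once, then each tick
// the main thread bumps a generation counter instead of creating new threads.
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct SpinWorkers {
    explicit SpinWorkers(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([this, i] { run(i); });
    }
    ~SpinWorkers() {
        stop.store(true, std::memory_order_release);
        for (auto& t : threads) t.join();
    }
    // Publish one task per worker, then wait until all of them have finished.
    void run_tick(const std::function<void(unsigned)>& task) {
        tick_task = &task;
        done.store(0, std::memory_order_release);
        generation.fetch_add(1, std::memory_order_release);   // "go" signal for the spinners
        while (done.load(std::memory_order_acquire) != threads.size())
            std::this_thread::yield();
    }
private:
    void run(unsigned id) {
        unsigned long seen = 0;
        while (!stop.load(std::memory_order_acquire)) {
            unsigned long g = generation.load(std::memory_order_acquire);
            if (g == seen) continue;          // busy-wait between ticks, by design
            seen = g;
            (*tick_task)(id);                 // this tick's slice of work
            done.fetch_add(1, std::memory_order_release);
        }
    }
    std::vector<std::thread> threads;
    std::atomic<unsigned long> generation{0};
    std::atomic<unsigned> done{0};
    std::atomic<bool> stop{false};
    const std::function<void(unsigned)>* tick_task = nullptr;
};

int main() {
    SpinWorkers pool(8);
    for (int tick = 0; tick < 3; ++tick)
        pool.run_tick([&](unsigned id) { std::printf("tick %d, worker %u\n", tick, id); });
}

In practice you would cap how long the workers are allowed to spin between ticks and fall back to a blocking wait, otherwise idle cores burn power for the whole 16.6 ms frame.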
Paul17041993 wrote: Creating a thread involves allocating new memory for the context and the thread's stack, requires you construct the thread's required data and then have it do something, definitely don't ever do that repeatedly. Thread pools are also often significantly slow for real-time code like this, it's much better to write your own thread manager and use things such as spinlock preparation (wake mutex'ed threads before scheduling their tasks, let them spinlock while you feed them tasks) and resonant (one thread in a group wakes others when suitable) or self scheduling techniques (no unnecessary sleeping).

Do you have numbers to support your argument (like microseconds per thread creation)?

One thing I found here from 2013:

Milliseconds to create thread: 0.015625

As far as I remember thread creation cost highly depends on the operating system and libraries and it can be very little when you have a good combination. 15 microseconds wouldn't bother me too much in a 16666 microsecond slice.
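For anyone who wants their own numbers rather than a 2013 forum figure, a crude std::thread measurement looks like this. It times create-plus-join, so it is an upper bound on creation alone, and results vary a lot with OS and C library:

Code:
// Quick-and-dirty measurement of thread creation cost, in the spirit of the
// numbers quoted above. Prints an average over many create+join cycles.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    constexpr int kThreads = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kThreads; ++i) {
        std::thread t([] {});   // create a thread that does nothing
        t.join();               // and immediately reap it
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    double us = std::chrono::duration<double, std::micro>(elapsed).count();
    std::printf("%.2f microseconds per create+join\n", us / kThreads);
}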
pleegwat wrote: creating/reaping threads costs time too. I do hope you're not doing that every tick?

Rseding91 wrote: Actually the cost of making/collecting a thread is so tiny these days that the cost of a thread pool type system ends up roughly the same. You just lose the benefits of exact start/stop when using the pool system.

Paul17041993 wrote: Creating a thread involves allocating new memory for the context and the thread's stack, requires you construct the thread's required data and then have it do something, definitely don't ever do that repeatedly. Thread pools are also often significantly slow for real-time code like this, it's much better to write your own thread manager and use things such as spinlock preparation (wake mutex'ed threads before scheduling their tasks, let them spinlock while you feed them tasks) and resonant (one thread in a group wakes others when suitable) or self scheduling techniques (no unnecessary sleeping).

That's not what I'm experiencing when I run the game. Do you have any real code that backs up your statements? Because you can just run Factorio to back up what I'm saying.
phot wrote: I may be wrong, but based on the FFF it appears that you have three different processes that update a single variable, but that variable is shared between each.
(...)
I also suspect that this isn't the correct path for parallelization anyway, you'll likely find that each step itself could be parallelized instead of trying to run each step in parallel. You can use an iterate-and-check pattern to independently run steps of the process for trains, electric network, and belts, if you even need that.

Yes and no. What they're suffering from is extreme non-locality in terms of cache. Which is obvious-in-hindsight when you look at the game. The computations themselves are rather simple (moving entities, collisions, creating/destroying entities in assemblers, etc). But they apply on data which is all over the place (belt entity, inserter entity, assembler entity, items themselves, etc). And they can't really be together (different entities) nor interleaved nicely (inserter 1 doesn't always interact with belt 1 and assembler 1 and item 1, at all). So every otherwise simple operation requires reading data from all over the place. They're not CPU-bound at all, they're memory-subsystem-bandwidth bound because they're blowing the caches. Cores, vector instructions, micro-parallelism, these only really work when the computation is the limit or if the data is very separable. Neither is the case here.

The only thing they can do is do less (especially including reads) for a gameplay-equivalent result, which the belt optimizations are a really nice demonstration of.

OG.
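A toy demonstration of the cache argument, with made-up entity data rather than anything from the game: the same trivial update runs once over a contiguous array and once over the same number of heap objects visited in shuffled order. The arithmetic is identical, only the memory layout differs, and the pointer-chasing pass is typically several times slower.

Code:
// Toy illustration (not Factorio code) of the cache argument: identical work per
// element, but one pass is contiguous and the other chases pointers all over the heap.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>

struct Entity { double progress; char padding[56]; };  // ~one cache line per entity

int main() {
    constexpr int kCount = 1'000'000;
    std::vector<Entity> contiguous(kCount);
    std::vector<std::unique_ptr<Entity>> scattered;
    for (int i = 0; i < kCount; ++i) scattered.push_back(std::make_unique<Entity>());
    std::shuffle(scattered.begin(), scattered.end(), std::mt19937{42});  // randomize visit order

    auto time = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
    };

    double a = time([&] { for (auto& e : contiguous) e.progress += 1.0; });
    double b = time([&] { for (auto& p : scattered)  p->progress += 1.0; });
    std::printf("contiguous: %.2f ms, pointer-chasing: %.2f ms\n", a, b);
    std::printf("(checksum %.1f)\n", contiguous[0].progress + scattered[0]->progress);
}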
TheTom wrote: Would it not be nice then to organize entities into separate pools?
* One pool per chunk
* Separate pools for large multi chunk elements, such as electric networks, trains
That way chunks could align with pages. Per chunk processing would then be able to take advantage of more locality, at least down to L3 or L2 cache. May be good preparation for multi threading, but at the same time - if it increases locality then it may already boost single threaded performance.

If you read the FFF post, that's what they're looking into doing... separating the data to be on a chunk-per-chunk basis.
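A sketch of what "one pool per chunk" could look like, assuming the hot per-tick state can be split out into a small flat struct; all names here are hypothetical, not Factorio's:

Code:
// Sketch of a per-chunk entity pool: entities that belong to a chunk live in that
// chunk's own contiguous vector, so a per-chunk update walks one array instead of
// chasing pointers across the whole map. All types here are made up for illustration.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct HotEntityState {        // hypothetical per-tick data, kept small and flat
    float progress;
    float speed;
    std::uint32_t target;      // index of the entity it interacts with, within the chunk
};

struct Chunk {
    std::vector<HotEntityState> entities;   // the per-chunk pool
    void update() {
        for (auto& e : entities)            // linear, cache-friendly walk
            e.progress += e.speed;
    }
};

struct Map {
    // Chunks keyed by packed (x, y); large cross-chunk systems (trains, electric
    // networks) would live in their own separate pools, as suggested above.
    std::unordered_map<std::uint64_t, Chunk> chunks;
    void update() { for (auto& [key, chunk] : chunks) chunk.update(); }
};

int main() {
    Map map;
    map.chunks[0].entities.push_back({0.f, 0.25f, 0});
    map.update();
}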
TheTom wrote: That way chunks could align with pages. Per chunk processing would then be able to take advantage of more locality, at least down to L3 or L2 cache.

But pages don't matter much, comparatively. Hitting different pages only hits the TLB, which has a cost, but not as high as the cache one. Cachelines matter, and they're 64 bytes. From the FFF, not many structures are smaller than that. So as long as they're aligned, it doesn't really matter where they are allocated in practice.
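For reference, the cache-line part is one line of C++; the hypothetical struct below is padded and aligned with alignas(64) so two hot objects never share a 64-byte line (which also avoids false sharing if two threads touch neighbouring entities):

Code:
// Small check of the cache-line point above: what matters is whether a structure
// shares a 64-byte line with unrelated data, not which page it lands on.
#include <cstdio>

struct alignas(64) BeltSegmentHot {   // hypothetical hot data, one cache line each
    float movement;
    int   item_count;
};

int main() {
    static_assert(sizeof(BeltSegmentHot) == 64, "padded to exactly one cache line");
    static_assert(alignof(BeltSegmentHot) == 64, "aligned to a cache-line boundary");
    BeltSegmentHot segments[2];
    std::printf("%p %p\n", static_cast<void*>(&segments[0]), static_cast<void*>(&segments[1]));
}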
Paul17041993 wrote: Creating a thread involves allocating new memory for the context and the thread's stack, requires you construct the thread's required data and then have it do something, definitely don't ever do that repeatedly. Thread pools are also often significantly slow for real-time code like this, it's much better to write your own thread manager and use things such as spinlock preparation (wake mutex'ed threads before scheduling their tasks, let them spinlock while you feed them tasks) and resonant (one thread in a group wakes others when suitable) or self scheduling techniques (no unnecessary sleeping).

ske wrote: Do you have numbers to support your argument (like microseconds per thread creation)?
One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember thread creation cost highly depends on the operating system and libraries and it can be very little when you have a good combination. 15 microseconds wouldn't bother me too much in a 16666 microsecond slice.

What I remember is that creating a thread under linux takes ~13000 cpu cycles. Creating an operating system thread is a syscall and that alone is a major overhead. It also needs to lock the memory management and file descriptors for the process so other threads can't (s)brk, mmap, open or close among others. Probably not a problem for factorio.

Mutex, conditional or spinlock should all be faster than creating a thread by a landslide. Certainly when you have 8 threads waiting on a condition (prepare next tick) instead of creating 8 new threads every tick. So I'm not sure what thread pool implementation you have that is slower than creating threads.
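A minimal sketch of that "threads waiting on a condition" setup, not tied to any particular game code: eight workers are created once, woken with notify_all when the next tick is prepared, and the main thread blocks until all of them have reported back.

Code:
// Sketch of a persistent per-tick worker pool: threads are created once and wait on
// a condition variable for the next tick, instead of being created and reaped per tick.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class TickPool {
public:
    explicit TickPool(unsigned n) : workers_left(0) {
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([this, i] { worker(i); });
    }
    ~TickPool() {
        { std::lock_guard<std::mutex> lk(m); stopping = true; }
        cv.notify_all();
        for (auto& t : threads) t.join();
    }
    void run_tick() {
        std::unique_lock<std::mutex> lk(m);
        ++tick;                              // "prepare next tick"
        workers_left = threads.size();
        cv.notify_all();                     // wake the waiting workers
        done_cv.wait(lk, [&] { return workers_left == 0; });
    }
private:
    void worker(unsigned id) {
        std::unique_lock<std::mutex> lk(m);
        unsigned long seen = 0;
        for (;;) {
            cv.wait(lk, [&] { return stopping || tick != seen; });
            if (stopping) return;
            seen = tick;
            lk.unlock();
            std::printf("worker %u updating its slice of tick %lu\n", id, seen);  // real work here
            lk.lock();
            if (--workers_left == 0) done_cv.notify_one();
        }
    }
    std::vector<std::thread> threads;
    std::mutex m;
    std::condition_variable cv, done_cv;
    unsigned long tick = 0;
    std::size_t workers_left;
    bool stopping = false;
};

int main() {
    TickPool pool(8);
    for (int i = 0; i < 3; ++i) pool.run_tick();   // a game loop would do 60 of these per second
}

Compared with creating threads every tick, the per-tick cost here is one notify_all plus scheduler wake-up latency, at the price of keeping the workers around between ticks.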
bobingabout wrote: ... for things that leave one chunk and enter another, like vehicles etc, they could do with their own memory pool rather than be bound to a chunk (and have to be moved when they move chunks.)

This has probably been discussed already, but using the chunk method creates race conditions in entity/chunk interaction, and possibly also entity/entity interaction, that undermine some of the benefits of multithreading when handled in realtime. The best way to avoid this is being careful with synchronicity only within players' local view while lazily multithreading everything else. Then race conditions can occur, be picked up after the fact, and repaired. But since it's outside of the player's view, you avoid showing that mess to the player.
ske wrote: Do you have numbers to support your argument (like microseconds per thread creation)?
One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember thread creation cost highly depends on the operating system and libraries and it can be very little when you have a good combination. 15 microseconds wouldn't bother me too much in a 16666 microsecond slice.

mrvn wrote: What I remember is that creating a thread under linux takes ~13000 cpu cycles. Creating an operating system thread is a syscall and that alone is a major overhead. It also needs to lock the memory management and file descriptors for the process so other threads can't (s)brk, mmap, open or close among others. Probably not a problem for factorio.
Mutex, conditional or spinlock should all be faster than creating a thread by a landslide. Certainly when you have 8 threads waiting on a condition (prepare next tick) instead of creating 8 new threads every tick. So I'm not sure what thread pool implementation you have that is slower than creating threads.

I just ran a test on my four years old laptop and creating a thread is 33000 cycles. Which may look like a lot, until you realize that means 11 microseconds. With a 16666 microseconds tick, creating a handful of threads is not noticeable.

OG.

galibert wrote: I just ran a test on my four years old laptop and creating a thread is 33000 cycles. Which may look like a lot, until you realize that means 11 microseconds. With a 16666 microseconds tick, creating a handful of threads is not noticeable.
OG.

Creating one thread is 33000 cycles. But above 8 threads were mentioned. And at 60 ticks per second:
33000*8*60 = 15840000
That's 1% cpu time on a 1.6GHz system.
phot wrote: I may be wrong, but based on the FFF it appears that you have three different processes that update a single variable, but that variable is shared between each.
(...)
I also suspect that this isn't the correct path for parallelization anyway, you'll likely find that each step itself could be parallelized instead of trying to run each step in parallel. You can use an iterate-and-check pattern to independently run steps of the process for trains, electric network, and belts, if you even need that.

galibert wrote: Yes and no. What they're suffering from is extreme non-locality in terms of cache. Which is obvious-in-hindsight when you look at the game. The computations themselves are rather simple (moving entities, collisions, creating/destroying entities in assemblers, etc). But they apply on data which is all over the place (belt entity, inserter entity, assembler entity, items themselves, etc). And they can't really be together (different entities) nor interleaved nicely (inserter 1 doesn't always interact with belt 1 and assembler 1 and item 1, at all). So every otherwise simple operation requires reading data from all over the place. They're not CPU-bound at all, they're memory-subsystem-bandwidth bound because they're blowing the caches. Cores, vector instructions, micro-parallelism, these only really work when the computation is the limit or if the data is very separable. Neither is the case here.
The only thing they can do is do less (especially including reads) for a gameplay-equivalent result, which the belt optimizations are a really nice demonstration of.
OG.

Actually, in one thread they'd hit memory latency first. Prefetching allows additional bandwidth use, but somehow I doubt that's enough to actually make it a bandwidth problem. If multiple threads don't work at all, I strongly suspect there's something else going on.
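A toy version of the "do less" idea credited to the belt optimizations above (my own sketch, not the game's data structure): if each item stores the gap to whatever is in front of it instead of an absolute position, a freely moving line costs one update per tick no matter how many items are on it.

Code:
// Toy gap-based belt line: gap[0] is the lead item's distance to the end of the line,
// gap[i] is the space between item i and item i-1. Advancing the line only touches
// entries until some item moves at full belt speed; everything behind it is unchanged.
#include <algorithm>
#include <cstdio>
#include <vector>

struct BeltLine {
    std::vector<double> gap;

    void advance(double d) {
        double moved_ahead = 0.0;                      // how far the item in front moved
        for (double& g : gap) {
            double m = std::min(d, g + moved_ahead);   // can't overtake the item in front
            g += moved_ahead - m;
            moved_ahead = m;
            if (m == d) break;  // everything behind moves at full speed, gaps unchanged
        }
    }
};

int main() {
    BeltLine line{{2.0, 0.0, 0.0, 1.5}};   // a compressed block of three items, one trailing
    for (int tick = 0; tick < 4; ++tick) {
        line.advance(1.0);
        std::printf("tick %d: gaps %.1f %.1f %.1f %.1f\n", tick,
                    line.gap[0], line.gap[1], line.gap[2], line.gap[3]);
    }
}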
Mekronid wrote: Using this method in Factorio, for example, you would ignore the state of all the factory's belts while off-screen and simply use a counter to determine how "full" a stretch of belt is and which inserters would therefore be able to operate on that belt. Pick an inserter to jam if crap is loaded onto the belt, possibly by determining where on the belt the inserter is relative to the amount of items loaded into the belt, but ignore the exact order of the items.

Factorio is more subtle than that: inserters don't just have a pick-up speed, they have rotation and elongation speeds, so the order, placement and how clumped the items are on the belt will have an effect on the speed of the inserters. Then everything also has to be perfectly deterministic, so simplifications "on the fly" would be problematic.
FFF #215 wrote: We were always saying, that we keep the full update multithreading as the last ace in the sleeve to be used after the normal optimisations become too difficult.

I think that's very much the wrong approach. "Multi-threading done properly" affects the design of everything (the game loop, all data structures, how locking and synchronisation are done, etc). Retro-fitting support for multiple threads is like replacing the chassis in your car while you're driving on the highway - you can get some small benefits in a few specific places (e.g. splitting graphics off into its own thread), but getting real gains involves changing far too much. Retro-fitting support for multiple threads into a game that's already been heavily optimised for single-CPU is even worse (it's like replacing the chassis in your car while you're driving on the highway, juggling chainsaws and drinking a keg of beer) - it becomes so hard that it's probably faster to start "version 2.0" from scratch (which leads to inevitable scope creep, which leads to "OMG, 5 years without an update?").
Qweesdy wrote: I think that's very much the wrong approach. "Multi-threading done properly" affects the design of everything (the game loop, all data structures, how locking and synchronisation are done, etc). Retro-fitting support for multiple threads is like replacing the chassis in your car while you're driving on the highway - you can get some small benefits in a few specific places (e.g. splitting graphics off into its own thread), but getting real gains involves changing far too much. Retro-fitting support for multiple threads into a game that's already been heavily optimised for single-CPU is even worse (it's like replacing the chassis in your car while you're driving on the highway, juggling chainsaws and drinking a keg of beer) - it becomes so hard that it's probably faster to start "version 2.0" from scratch (which leads to inevitable scope creep, which leads to "OMG, 5 years without an update?").

Yeah, there are some fundamental design decisions that can happen to make multithreading easier.
ske wrote: Do you have numbers to support your argument (like microseconds per thread creation)?
One thing I found here from 2013:
Milliseconds to create thread: 0.015625
As far as I remember thread creation cost highly depends on the operating system and libraries and it can be very little when you have a good combination. 15 microseconds wouldn't bother me too much in a 16666 microsecond slice.

mrvn wrote: What I remember is that creating a thread under linux takes ~13000 cpu cycles. Creating an operating system thread is a syscall and that alone is a major overhead. It also needs to lock the memory management and file descriptors for the process so other threads can't (s)brk, mmap, open or close among others. Probably not a problem for factorio.
Mutex, conditional or spinlock should all be faster than creating a thread by a landslide. Certainly when you have 8 threads waiting on a condition (prepare next tick) instead of creating 8 new threads every tick. So I'm not sure what thread pool implementation you have that is slower than creating threads.

galibert wrote: I just ran a test on my four years old laptop and creating a thread is 33000 cycles. Which may look like a lot, until you realize that means 11 microseconds. With a 16666 microseconds tick, creating a handful of threads is not noticeable.
OG.

mrvn wrote: Creating one thread is 33000 cycles. But above 8 threads were mentioned. And at 60 ticks per second:
33000*8*60 = 15840000
That's 1% cpu time on a 1.6GHz system.

It doesn't matter how long it takes to create a thread. It just throws away the L1 and L2 caches. I was recently able to almost quadruple the throughput of a multi-threaded system by getting rid of thread migrations from core to core.
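A sketch of the fix being described: pin each worker thread to one core so the scheduler cannot migrate it and throw away its L1/L2 working set. This uses the Linux-specific pthread_setaffinity_np on the std::thread native handle; Windows would use SetThreadAffinityMask instead.

Code:
// Pin each worker to one core so the scheduler can't migrate it between ticks.
// Linux-specific (pthread_setaffinity_np); shown only as an illustration.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) != 0)
        std::fprintf(stderr, "failed to pin thread to core %d\n", core);
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([i] {
            // ... the per-core slice of the update would run here; the thread keeps
            // whatever it has pulled into this core's private caches between ticks.
            std::printf("worker %u running on its pinned core\n", i);
        });
        pin_to_core(workers.back(), static_cast<int>(i));
    }
    for (auto& t : workers) t.join();
}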
Lubricus wrote: Factorio is more subtle than that: inserters don't just have a pick-up speed, they have rotation and elongation speeds, so the order, placement and how clumped the items are on the belt will have an effect on the speed of the inserters. Then everything also has to be perfectly deterministic, so simplifications "on the fly" would be problematic.

I'm aware, but to be frank, subtlety costs CPU time. Look at it from a mathematical standpoint: belts' throughput is a fixed value, production is a fixed value, and inserters' throughput is a fixed value. Rotation and elongation speeds should be fixed values as well. Any variation in the above should be due to three things: power (predictable), boosts (predictable), and random chance (random). Because there is only one source of randomness, the difference over time caused by any subtleties should result in a reversion to the mean. Reversion to the mean is the rule that says any series of random events with a fixed probability will eventually converge on an average distribution.