Friday Facts #215 - Multithreading issues

Regular reports on Factorio development.
ske
Filter Inserter
Filter Inserter
Posts: 412
Joined: Sat Oct 17, 2015 8:00 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ske »

Lubricus wrote:
player8472 wrote:Sad to hear, that there probably won't be any real multithreading in the "near" future.

It should give an overview on what to buy for more or less single-threaded processes.
And yes, if you don't do heavy multithreading you probably shouldn't get an i9.
Haven't the i9 double the memory bandwidth? Wube have stated Factorio is more memory latency/bandwidth limited than core speed limited? Then there is also different cache sizes on different CPU's
Of course I wouldn't get a 10-core CPU just for factorio. But for image processing and compiling it makes perfect sense. Then again it would be nice to also play factorio on such a machine and not get a second one just for factorio.

As far as I know the i9 and i7 don't really differ memory-wise, both offer 4 channels (not sure if this applies to all i7 models). Intel changes their own definition of i3/5/7/9 from time to time anyways in detail they mean nothing.

AMD offers an 8-core EPYC processor with 8 memory channels for about 500$. The Threadripper CPU offers two wholly different modes for memory access. For pure memory bandwidth these processors are king but for latency I don't know. Comparing factorio performance across these architectures would be interesting. The huge difference between Ryzen 1800X and i7 7700K begs the question if the AMD cache/memory implementation just isn't well suited for factorio or if factorio is just optimized for the peculiaries of Intel processors.
mrvn
Smart Inserter
Smart Inserter
Posts: 5969
Joined: Mon Sep 05, 2016 9:10 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mrvn »

Zulan wrote:Don't want to nitpick, but maybe this helps to find better information. The piece of memory cache in question is a cache line, it's size is usually 64 byte (a page on OS level is 4096 byte). The effect in question is called false sharing is a common issue for multi-threaded programs.

However, I am quite surprised it happens here. Entities are multiple cache lines in size - so the figure actually looks different - I would not expect to have so much interference in this case. Also the L1/L2 cache is quite small compared to the typical Factorio working set, so the active cache lines change alot during the an update phase, reducing the probability of a conflict between threads. But parallel performance is often surprisingly different in reality versus expectation.

That makes me wonder, by what observation have you figured out the reason for the performance issue?
One would think trains and belts would exceed cache lines. But they probably have things like smart pointers and lists allocated dynamically. And then you have structures with just a bool and a pointer (16 byte with padding) or a prev, next and data link (32 byte with padding). And suddenly the list item for a train and list item for a belt share a cache line. If factorio is using a standard C++ STL then it is full of tiny structures.
Koub
Global Moderator
Global Moderator
Posts: 7955
Joined: Fri May 30, 2014 8:54 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Koub »

ske wrote:The huge difference between Ryzen 1800X and i7 7700K begs the question if the AMD cache/memory implementation just isn't well suited for factorio or if factorio is just optimized for the peculiaries of Intel processors.
Rseding91 wrote:We don't optimize for specific CPUs (or any hardware configuration for that matter). We optimize the algorithms we write to be more efficient (just do less work) and the data we use to be more cache-friendly - none of the things we do care about what hardware you have - it will all run faster regardless.

AMD has always slower slower than Intel in single threaded performance for a long time - which is what games (and the majority of applications) care about.
Source

I'd be interested to see a comparaison between Ryzen and an older AMD CPU, like a FX-8370, or a FX-4350 for the same map.
Koub - Please consider English is not my native language.
Zulan
Long Handed Inserter
Long Handed Inserter
Posts: 56
Joined: Mon Jan 25, 2016 5:55 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Zulan »

mrvn wrote:One would think trains and belts would exceed cache lines. But they probably have things like smart pointers and lists allocated dynamically. And then you have structures with just a bool and a pointer (16 byte with padding) or a prev, next and data link (32 byte with padding). And suddenly the list item for a train and list item for a belt share a cache line. If factorio is using a standard C++ STL then it is full of tiny structures.
Any non stack pointer - smart or not - is part of bigger data structures. Non-intrusive lists are horribly inefficient anyway. That only leaves tiny vectors, but there's small size optimized containers for that. Also smart pointers are absolutely not implemented as bool + pointer...
Jap2.0
Smart Inserter
Smart Inserter
Posts: 2423
Joined: Tue Jun 20, 2017 12:02 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Jap2.0 »

bobingabout wrote:Those maps look totally different. and I'm not talking about the multiple starting areas being seamless or not.
It might have something to do with this and/or this.
There are 10 types of people: those who get this joke and those who don't.
shadetheartist
Inserter
Inserter
Posts: 23
Joined: Mon Sep 18, 2017 10:14 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by shadetheartist »

basementjack wrote:
shadetheartist wrote:the devs should really consider open sourcing the code.
Heck No.

I don't recall where exactly, but it was actually stated by the devs at some point that open sourcing the code may be considered, probably once the game was quite mature.
d3x0r
Filter Inserter
Filter Inserter
Posts: 316
Joined: Sun Jun 04, 2017 8:56 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by d3x0r »

I really would have thought you would have already created SLAB allocators for things.
https://en.wikipedia.org/wiki/Slab_allocation

And you know, memory is cashed in 4096 byte chunks? I noticed in the FFF that you said 'some amount of bytes'
It's the same size that page map table tracks.


I've been using this code for a long long time...

https://github.com/d3x0r/SACK/blob/mast ... lib/sets.c
The header section ... _SETS_NAMESPACE
https://github.com/d3x0r/SACK/blob/mast ... lib.h#L859

This allocates slabs with a small header bit that remembers how big elements are, and how many are in a block. This is followed by a bit array that tracks whether a structure is used or not. When elements are allocated from the set, they are zero cleared.

It's actually pretty easy to use; although it is somewhat cryptic....

say you have

Code: Select all

class SOmething {
  /* content */ }
};
typedef class Something SomeThing,
typedef class Something *PSomeThing,
// should be a multiple of 32, and/or multiples of 4096/sizeof(class/struct)
#define MAXSomeThingSPERSET 256
DeclareSet( SomeThing, PSomeThings );


PSomeThings *set;  // initilaize to NULL.

PSomeThing *thing = GetFromSet( SomeThing, &set );



Of course since you're C++ based... really you'd have to either break the constructor out to an initilaizer instead... or do a new override that takes a set.

(C++ Sucks for custom allocations :) )

Ahh well... I'm sure you'll figure out something.
Paul17041993
Inserter
Inserter
Posts: 37
Joined: Fri Nov 25, 2016 4:26 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Paul17041993 »

@klonan are you using data-oriented memory allocation for each object? DOP is ideal for both single and multi-threaded code as it allows the CPU to load and store the data way faster and more efficiently. When it comes to thread management as well, you ideally want to pre-wake the threads into a spinlock while scheduling their tasks as this allows the OS to bring the threads into a running state without wasting valuable clock cycles, upon doing so keeping the threads fed by allowing them to pull more tasks as they go. Upon cycle completion you can then send them back into a mutex lock so they can sleep, this last step is important not just for power saving but for giving back context to system and driver threads, leaving threads in a spinlock when they're not needed will drastically kill the performance.
User avatar
bobingabout
Smart Inserter
Smart Inserter
Posts: 7352
Joined: Fri May 09, 2014 1:01 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by bobingabout »

Jap2.0 wrote:
bobingabout wrote:Those maps look totally different. and I'm not talking about the multiple starting areas being seamless or not.
It might have something to do with this and/or this.
I'd probably say the first.
Creator of Bob's mods. Expanding your gameplay since version 0.9.8.
I also have a Patreon.
galibert
Inserter
Inserter
Posts: 43
Joined: Fri Sep 15, 2017 7:42 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by galibert »

d3x0r wrote:Of course since you're C++ based... really you'd have to either break the constructor out to an initilaizer instead... or do a new override that takes a set.

(C++ Sucks for custom allocations :) )
With modern C++ argpack forwarding and placement new works well to do syntaxes that are both sane and efficient. std::make_unique is a nice example of what can be done. In the end it depends on what you want to do, but it's usually doable now.

OG.
ratchetfreak
Filter Inserter
Filter Inserter
Posts: 952
Joined: Sat May 23, 2015 12:10 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ratchetfreak »

galibert wrote:
d3x0r wrote:Of course since you're C++ based... really you'd have to either break the constructor out to an initilaizer instead... or do a new override that takes a set.

(C++ Sucks for custom allocations :) )
With modern C++ argpack forwarding and placement new works well to do syntaxes that are both sane and efficient. std::make_unique is a nice example of what can be done. In the end it depends on what you want to do, but it's usually doable now.

OG.
Except make unique doesn't have a alloc_unique counterpart like make_shared has. Mostly because unique_ptr requires that the deleter is part of the type. Whereas the shared_ptr's deleter is virtualized and the allocated block already hold extra state (the ref counts at the very least) so adding a chunk of state for the allocator and the deletion function is not a big deal.
mrvn
Smart Inserter
Smart Inserter
Posts: 5969
Joined: Mon Sep 05, 2016 9:10 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mrvn »

Zulan wrote:
mrvn wrote:One would think trains and belts would exceed cache lines. But they probably have things like smart pointers and lists allocated dynamically. And then you have structures with just a bool and a pointer (16 byte with padding) or a prev, next and data link (32 byte with padding). And suddenly the list item for a train and list item for a belt share a cache line. If factorio is using a standard C++ STL then it is full of tiny structures.
Any non stack pointer - smart or not - is part of bigger data structures. Non-intrusive lists are horribly inefficient anyway. That only leaves tiny vectors, but there's small size optimized containers for that. Also smart pointers are absolutely not implemented as bool + pointer...
Sorry, I mend refcount + pointer. And since the refcount must be shared with all copies of the smart pointer I don't think there is any way to make that intrusive.

As for non-intrusive being horrible inefficient: Guess why a lot of the time the STL is so slow.

But you can do really nice intrusive data structures with C++17 templates (actually since C++11 or C++15) if you spend the time making your own list and stuff.
TheGegger
Burner Inserter
Burner Inserter
Posts: 10
Joined: Tue Nov 07, 2017 11:14 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by TheGegger »

Forgive me for stating simple and obvious facts which you may have already looked at and overlooked, I'm hoping some of it may trigger an Idea that hasn't been thought of.

My understanding from reading the post is that you are trying to thread three serial activities. Each of these activities have prep work which prepares large areas of memory and is already threaded (I believe 8 threads per were mentioned).

My experiences in multi-threaded are in a different language (sorry), and was made with code which isn't as memory intensive as this, but I quickly noticed that thread over-creation decreased performance. My application was process and memory intensive and took minutes and hours, not milliseconds, so again, may be different issues.

i.e. on a 4 core 8 threads processor, there seemed to be a magic number of 6 threads which were faster than 4 or 8, which were faster than 10 threads, etc.
So my understanding is that if threads are not starving and ready to proceed, the time it takes to switch threads eventually becomes the bottleneck. I've always thought that the 2 missing threads were used by common OS, VM and OS background operations, but It cold also be that the extra 2 threads caused the CPU to throttle and decrease its 'turbo' mode, though this was fairly constant across multiple CPU architectures from newer Xeon to older Core2 Q6600 which I don't remember as having much turbo.

So if there are three activities, requiring more threads initially (lets say 8 for 6 segments, then 1 for another 6), then you may have something like this:
: A:888888111111 B:8888881111111 C:888888111111:

Code: Select all

88888811111118888881111118888888111111
I know I'm over simplifying things, but trying to make these activities paralell would result in:

Code: Select all

88888888888811111111
88888888888888888111111
8888888888888111111111
Now, of course, in most cases this would still be faster than running them serially, because the processor can magically optimise things on a micro-level that we can't at a macro level, but this isn't always the case.

Would it be possible to separate the activities into prep and work segments where Activity 2 waites until prep of activity one ends, etc, so by delaying the start of some threads to limit total thread count to an amount easier to orchestrate for the page file?

Code: Select all

888888111111
      888888111111
            888888111111
This would change the number of current thread attempts from 24 to 9 in this scenario, decrease total memory bandwidth of prepping to a single activity at a time, and reduce resource (e.g. cache) locking due to writes in time.

Of course, if you're implementing look-ahead caching as you have stated in other posts, I can only estimate that you thought about this already and I'm just spewing white noise.
mrvn
Smart Inserter
Smart Inserter
Posts: 5969
Joined: Mon Sep 05, 2016 9:10 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by mrvn »

I read the posts as it being

888888888 111111 1111111 11111

and now it is

888888888 333333

Where 8 is initially set to the number of cores you have (and I assume is number of threads in the settings).
Selvek
Fast Inserter
Fast Inserter
Posts: 238
Joined: Fri May 06, 2016 4:04 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Selvek »

The whole idea behind the "copy and paste" starting area generation was fairness. The new map example doesn't look very fair - i.e. the Southern spawn point is sitting on massive oil reserves.
ske
Filter Inserter
Filter Inserter
Posts: 412
Joined: Sat Oct 17, 2015 8:00 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ske »

Koub wrote:
Rseding91 wrote:We don't optimize for specific CPUs (or any hardware configuration for that matter). We optimize the algorithms we write to be more efficient (just do less work) and the data we use to be more cache-friendly - none of the things we do care about what hardware you have - it will all run faster regardless.

AMD has always slower slower than Intel in single threaded performance for a long time - which is what games (and the majority of applications) care about.
Source

I'd be interested to see a comparaison between Ryzen and an older AMD CPU, like a FX-8370, or a FX-4350 for the same map.
I don't question your intentions. I don't question that AMD CPUs are less performant in single core tasks. The only thing I suspect is selective perception when measuring performance. Do you use AMD processors to evaluate your performance numbers? I'm not saying that you should specifically optimize for AMD processors. Most gamers use Intel processors anyways and you have limited resources. It would be nice to know that you look at a range of hardware when doing your optimizations.

The thing I was surprised about is the difference given of 35 UPS on a 7700K vs. 22.5 UPS on an overclocked 1800X. This is a ratio of 1.55. In the PassMark we get a ratio of 1.25 in favor of the 7700K vs. the 1800X. This leaves half of the difference unexplained.

One thing that could be of wider interest is making the game faster on slower processors. That way you would lower the level of entry for big multiplayer maps and widen your audience. Having a "reference map" for performance testing would be nice. I'd surely contribute my numbers!
Rseding91
Factorio Staff
Factorio Staff
Posts: 14817
Joined: Wed Jun 11, 2014 5:23 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Rseding91 »

ske wrote:The thing I was surprised about is the difference given of 35 UPS on a 7700K vs. 22.5 UPS on an overclocked 1800X. This is a ratio of 1.55. In the PassMark we get a ratio of 1.25 in favor of the 7700K vs. the 1800X. This leaves half of the difference unexplained.
That's easy, the benchmark test is not the same workload as a Factorio save game. Which is why benchmarks aren't a true measure of hardware performance.

But again, we don't optimize for any specific hardware. We optimize so the code is more efficient - doing less work - simple things like: it took 100 checks to do this logic and I can make it work in only 50 checks so now it runs twice as fast.
If you want to get ahold of me I'm almost always on Discord.
ske
Filter Inserter
Filter Inserter
Posts: 412
Joined: Sat Oct 17, 2015 8:00 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by ske »

Rseding91 wrote:
ske wrote:The thing I was surprised about is the difference given of 35 UPS on a 7700K vs. 22.5 UPS on an overclocked 1800X. This is a ratio of 1.55. In the PassMark we get a ratio of 1.25 in favor of the 7700K vs. the 1800X. This leaves half of the difference unexplained.
That's easy, the benchmark test is not the same workload as a Factorio save game. Which is why benchmarks aren't a true measure of hardware performance.

But again, we don't optimize for any specific hardware. We optimize so the code is more efficient - doing less work - simple things like: it took 100 checks to do this logic and I can make it work in only 50 checks so now it runs twice as fast.
True, the PassMark benchmark probably tests for other bottlenecks. But it shows that - depending on the actual workload - those CPUs perform differently. In comparison, the 1800X can handle PassMark better than factorio.

I think I understand what you mean. By removing cruft your code should get faster on any architecture.
pleegwat
Filter Inserter
Filter Inserter
Posts: 278
Joined: Fri May 19, 2017 7:31 pm
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by pleegwat »

creating/reaping threads costs time too. I do hope you're not doing that every tick?
Rseding91
Factorio Staff
Factorio Staff
Posts: 14817
Joined: Wed Jun 11, 2014 5:23 am
Contact:

Re: Friday Facts #215 - Multithreading issues

Post by Rseding91 »

pleegwat wrote:creating/reaping threads costs time too. I do hope you're not doing that every tick?
Actually the cost of making/collecting a thread is so tiny these days that the cost of a thread pool type system ends up roughly the same. You just lose the benefits of exact start/stop when using the pool system.
If you want to get ahold of me I'm almost always on Discord.
Post Reply

Return to “News”