
Re: Performance optimization - post your saves

Posted: Tue Aug 24, 2021 7:19 am
by SoShootMe
mrvn wrote:
Tue Aug 24, 2021 1:59 am
SoShootMe wrote:
Fri May 07, 2021 4:07 pm
ptx0 wrote:
Fri May 07, 2021 2:38 pm
too bad we can't pin threads to cores
AFAIK you can, and I've often thought it may offer some (small) performance benefit due to caches. But you have to make sure other things are excluded from running on those cores too, so in most cases it seems like micro-management overkill.
There can be some gain from improved cache hits. If you have two sockets/cores with separate caches, then pinning threads to each set of cores/threads that share a cache can be beneficial, even if they get interrupted every now and then.
The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.
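
To make the "pin threads to cores" part concrete, here is a minimal sketch, assuming Linux with glibc (pthread_setaffinity_np is non-portable, and the core number is just an example):

```cpp
// Pin the calling thread to a single core so it keeps reusing the same
// core-local caches. Needs _GNU_SOURCE (g++ defines it by default); on
// Windows SetThreadAffinityMask plays the same role.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static bool pin_current_thread_to(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);            // allow only this one core
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (!pin_current_thread_to(2))  // core 2 is an arbitrary example
        std::fprintf(stderr, "failed to pin thread\n");
}
```

As noted above, the scheduler will still run other things on that core unless they are excluded (cpusets, isolcpus and so on), so the pin by itself only buys the cache-reuse side.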

Re: Performance optimization - post your saves

Posted: Tue Aug 24, 2021 10:49 am
by mrvn
SoShootMe wrote:
Tue Aug 24, 2021 7:19 am
The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.
I was kind of hinting at a middle ground there.

For example, you pin some threads to cores {0,1,2,3} and some to {4,5,6,7} because each set of four CPU cores/threads shares L1/L2 caches. But there are two L3 caches, so you don't want threads to jump between those two. L3 caches are also the largest, so they take the longest to fill back up after a switch.
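
If anyone wants to try grouping threads that way, the cache-sharing sets are visible in sysfs. A small sketch (Linux assumed) that prints which logical CPUs share each cache level with cpu0:

```cpp
// Print which logical CPUs share each cache level with cpu0, so thread
// groups can be chosen to stay inside one L3 domain. The sysfs layout is
// Linux-specific; "shared_cpu_list" holds ranges like "0-3".
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int index = 0; ; ++index) {
        const std::string base =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(index) + "/";
        std::ifstream level(base + "level");
        std::ifstream shared(base + "shared_cpu_list");
        if (!level || !shared) break;   // ran out of cache indices
        std::string lvl, cpus;
        std::getline(level, lvl);
        std::getline(shared, cpus);
        std::cout << "L" << lvl << " shared by CPUs " << cpus << '\n';
    }
}
```

Pinning a whole thread group then just means putting every core from one L3's shared_cpu_list into a single cpu_set_t instead of only one core.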

Re: Performance optimization - post your saves

Posted: Fri Jan 14, 2022 7:05 pm
by OvermindDL1
As requested by posila at viewtopic.php?p=559889#p559889, posting this save to look at Surface::listFriendlyEntities performance issues:
https://overminddl1.com/Factorio/perf/OverSE-E1-22.zip

Re: Performance optimization - post your saves

Posted: Sun Jan 16, 2022 9:07 pm
by ptx0
mrvn wrote:
Tue Aug 24, 2021 1:59 am
After that, it makes a huge difference with NUMA, but only if your memory allocation and usage are also NUMA-aware. For example, you would allocate all inserters in a memory segment in one NUMA domain and pin the thread doing inserter updates to cores in the same NUMA domain. The thread would run a lot faster there. Or split the map into quadrants, place each quadrant into one NUMA domain, and pin a thread there. Then each assembler, belt and inserter needs to be allocated in the right memory block and processed by the right thread.

Factorio isn't really designed for that. The core design was single-threaded, and since then it has only gotten a few patch-ups for special things that could be made multi-threaded after the fact, like the fluid system, which could be broken into independent parts. But there isn't an inserter thread that runs in parallel with, for example, an assembler thread. AFAIK nothing in Factorio would be able to optimize for NUMA and pinned threads without a major overhaul. It's just not designed that way and maybe doesn't fit that memory model at all.

The best approach might be to split things by geography: split the map into largish tiles, and everything in a tile gets pinned to one core and its closest memory. Of course that requires extra work at the tile boundaries where threads would collide, but given large enough tiles that can be minimized (overhead-wise). But that would probably only help systems with NUMA, as the working set is far too big for cache effects to be relevant in that optimization.
One of the devs mentioned that they'd done a small-scale experiment where they refactored memory allocation for chunks to use more localised storage. Either they didn't do it very well or they missed something, but it had only very mild gains in performance.

Then again, I've suggested they use mimalloc before, and rseding91's system only sees a 6% gain because he insists on testing on Windows.
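
For what it's worth, the quadrant-per-NUMA-domain idea from the quoted post boils down to two calls if you use libnuma. A hypothetical sketch (Linux, link with -lnuma; the Inserter struct and node choice are made up and have nothing to do with Factorio's real internals):

```cpp
// Allocate a block of entity data on one NUMA node and keep the thread
// that updates it on that same node, so the accesses stay node-local.
#include <numa.h>
#include <cstdio>

struct Inserter { float x, y; int state; };   // hypothetical entity record

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    const int node = 0;                        // example domain
    const size_t count = 1'000'000;
    auto* inserters = static_cast<Inserter*>(
        numa_alloc_onnode(count * sizeof(Inserter), node));  // node-local memory
    if (!inserters) { std::puts("allocation failed"); return 1; }

    numa_run_on_node(node);                    // run this thread on that node's CPUs
    for (size_t i = 0; i < count; ++i)
        inserters[i].state = 0;                // stand-in for the real update pass

    numa_free(inserters, count * sizeof(Inserter));
}
```

The mimalloc side needs even less: as far as I know it can be tried on Linux without recompiling by preloading its shared library with LD_PRELOAD, which is presumably why it keeps coming up as a cheap experiment.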

Re: Performance optimization - post your saves

Posted: Mon Jan 17, 2022 6:06 pm
by quyxkh
ptx0 wrote:
Sun Jan 16, 2022 9:07 pm
One of the devs mentioned that they'd done a small-scale experiment where they refactored memory allocation for chunks to use more localised storage. Either they didn't do it very well or they missed something, but it had only very mild gains in performance.

Then again, I've suggested they use mimalloc before, and rseding91's system only sees a 6% gain because he insists on testing on Windows.
"Only" sees 6% gain.

If they'd additionally link with libhugetlbfs on Linux, that'd ... I think the only fair characterization here is "skyrocket". I'd expect a >25% speed boost on big maps.

edit: I've seen the effects https://www.reddit.com/r/factorio/comme ... uge_pages/ reports
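
For anyone wanting to see the effect without relinking: a rough sketch (Linux assumed) of backing a large allocation with huge pages, either explicitly via MAP_HUGETLB (which needs pages reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages) or by hinting transparent huge pages with madvise:

```cpp
// Map a large region with 2 MiB pages so the TLB covers far more of the
// working set; fall back to normal pages plus a THP hint if no huge pages
// are reserved.
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t len = 64UL << 20;   // 64 MiB, a multiple of the 2 MiB huge-page size
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // No reserved huge pages: take normal pages and ask for THP instead.
        p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
        madvise(p, len, MADV_HUGEPAGE);
    }
    // ... place the hot data structures here ...
    munmap(p, len);
}
```

As I understand it, libhugetlbfs itself can also be preloaded with its morecore override, which gives the existing allocator huge-page backing without touching the code at all.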

Re: Performance optimization - post your saves

Posted: Tue Jan 18, 2022 4:13 am
by ptx0
viewtopic.php?p=537779#p537779

It was in this thread, too :)