Performance optimization - post your saves

SoShootMe
Filter Inserter
Posts: 287
Joined: Mon Aug 03, 2020 4:16 pm

Post by SoShootMe »

mrvn wrote:
Tue Aug 24, 2021 1:59 am
SoShootMe wrote:
Fri May 07, 2021 4:07 pm
ptx0 wrote:
Fri May 07, 2021 2:38 pm
too bad we can't pin threads to cores
AFAIK you can, and I've often thought it may offer some (small) performance benefit due to caches. But you have to make sure other things are excluded from running on those cores too, so in most cases it seems like micro-management overkill.
There can be some gain with improved cache hits. If you have 2 sockets/cores with separate caches then pinning threads to each set of cores/threads that share caches can be beneficial, even if they get interrupted every now and then.
The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.
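
The "small fraction" argument above can be put into rough numbers. This is only a minimal sketch in Python: the working-set size, refill rate, and time slice are purely hypothetical inputs (not measurements of Factorio or of any real machine), and the model crudely assumes the whole working set is re-fetched once at a fixed bandwidth.

```python
def warmup_fraction(working_set_bytes, refill_bytes_per_s, timeslice_s):
    """Fraction of one time slice spent refilling the cache after a migration.

    Crude model: the whole working set is fetched once at a fixed bandwidth.
    """
    warmup_s = working_set_bytes / refill_bytes_per_s
    return warmup_s / timeslice_s


# Hypothetical inputs: 1 MB working set, 10 GB/s refill rate, 4 ms time slice.
fraction = warmup_fraction(1_000_000, 10_000_000_000, 0.004)
print(f"{fraction:.1%} of the slice goes to warming the cache")
```

With inputs in that range the lost fraction is a few percent at most, which is the upper bound on what pinning could win back under this model.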

mrvn
Smart Inserter
Posts: 5112
Joined: Mon Sep 05, 2016 9:10 am

Post by mrvn »

SoShootMe wrote:
Tue Aug 24, 2021 7:19 am
mrvn wrote:
Tue Aug 24, 2021 1:59 am
SoShootMe wrote:
Fri May 07, 2021 4:07 pm
ptx0 wrote:
Fri May 07, 2021 2:38 pm
too bad we can't pin threads to cores
AFAIK you can, and I've often thought it may offer some (small) performance benefit due to caches. But you have to make sure other things are excluded from running on those cores too, so in most cases it seems like micro-management overkill.
There can be some gain with improved cache hits. If you have 2 sockets/cores with separate caches then pinning threads to each set of cores/threads that share caches can be beneficial, even if they get interrupted every now and then.
The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.
I was kind of hinting at a middle ground there.

For example, you pin some threads to cores {0,1,2,3} and some to {4,5,6,7} because sets of 4 CPU cores/threads share L1/L2 caches. But there are 2 L3 caches, so you don't want threads to jump between those two. L3 caches are also the largest, so they take the longest to fill back up after a switch.
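
A minimal sketch of that kind of pinning, assuming Linux and Python's `os.sched_setaffinity` (on Linux the underlying call applies to the calling thread when given 0). The two core groups are hypothetical and stand in for whichever cores share a cache on a given machine; this says nothing about how Factorio schedules its own threads.

```python
import os
import threading

# Hypothetical core groups: each set is assumed to share a cache.
CORE_GROUPS = [{0, 1, 2, 3}, {4, 5, 6, 7}]


def pin_current_thread(cores):
    """Pin the calling thread to `cores`, limited to the cores it may use.

    Returns the affinity set that was applied, or None if pinning is not
    available on this platform or is refused by the OS.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        allowed = os.sched_getaffinity(0)
        # Fall back to the full allowed set if the group does not exist here.
        target = (set(cores) & allowed) or allowed
        os.sched_setaffinity(0, target)
        return target
    except OSError:
        return None


def worker(group_index, results):
    results[group_index] = pin_current_thread(CORE_GROUPS[group_index])
    # ... the actual per-group work would run here ...


def run_pinned_workers():
    results = {}
    threads = [
        threading.Thread(target=worker, args=(i, results))
        for i in range(len(CORE_GROUPS))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pinned_workers())
```

Note that this only keeps the pinned threads on their cores; it does nothing to keep other threads off them.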

OvermindDL1
Fast Inserter
Posts: 190
Joined: Sun Oct 05, 2014 6:12 am

Post by OvermindDL1 »

As requested by posila at viewtopic.php?p=559889#p559889, I'm posting this save to look at Surface::listFriendlyEntities performance issues:
https://overminddl1.com/Factorio/perf/OverSE-E1-22.zip

ptx0
Smart Inserter
Posts: 1358
Joined: Wed Jan 01, 2020 7:16 pm

Post by ptx0 »

mrvn wrote:
Tue Aug 24, 2021 1:59 am
After that it makes a huge difference with NUMA but only if your memory allocation and usage is also NUMA aware. For example you would allocate all inserters in a memory segment in one NUMA domain and pin the thread doing inserter updates to cores in the same NUMA domain. The thread would run a lot faster there. Or split the map into quadrants and place each quadrant into one NUMA domain and pin a thread there. Then each assembler, belt, inserter needs to be allocated in the right memory block and processed by the right thread.

Factorio isn't really designed for that. The core design was really single threaded and since then it only got a few patch ups for special things that could be made multithreaded after the fact, like the fluid system, which could be broken into independent parts. But there isn't an inserter thread that runs in parallel with, for example, an assembler thread. Afaik nothing in Factorio would be able to optimize for NUMA and pinned threads without a major overhaul. It's just not designed that way and maybe doesn't fit that memory model at all.

Best approach might be to split things by geography. Split the map into largish tiles and everything in that tile gets pinned to one core and its closest memory. Of course that requires extra work at the tile boundaries where threads would collide, but given large enough tiles that can be minimized (overhead wise). But that would probably only help systems with NUMA, as the working set is far too big to make cache effects relevant in that optimization.
one of the devs mentioned that they'd done a small scale experiment where they refactored memory allocation for chunks to use more localised storage. either they didn't do it very well, or missed something, but it had very mild gains in performance.

then again, i've suggested they use mimalloc before and rseding91's system only sees 6% gain, because he insists on testing on Windows.
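
For the "split by geography" idea in the quote above, a rough sketch of the bookkeeping might look like this. Everything here is hypothetical: the tile size, the border width, and the `(name, x, y)` entity tuples are made up for illustration and are not Factorio's chunk or entity layout.

```python
from collections import defaultdict

TILE_SIZE = 256  # hypothetical tile edge length, in map tiles
BORDER = 2       # hypothetical distance within which an entity may touch a neighbouring tile


def tile_of(x, y):
    """Tile coordinates of a map position (floor division also handles negatives)."""
    return (x // TILE_SIZE, y // TILE_SIZE)


def partition(entities):
    """Split (name, x, y) entities into per-tile interior lists and one boundary list.

    Interior entities could be updated by the thread that owns their tile;
    boundary entities are the "extra work" that needs separate handling.
    """
    interior = defaultdict(list)
    boundary = []
    for name, x, y in entities:
        lx, ly = x % TILE_SIZE, y % TILE_SIZE
        distance_to_edge = min(lx, ly, TILE_SIZE - 1 - lx, TILE_SIZE - 1 - ly)
        if distance_to_edge < BORDER:
            boundary.append(name)
        else:
            interior[tile_of(x, y)].append(name)
    return dict(interior), boundary


entities = [
    ("inserter-a", 10, 10),
    ("belt-b", 255, 40),
    ("assembler-c", 300, 700),
    ("inserter-d", -5, 12),
]
interior, boundary = partition(entities)
print(interior)
print(boundary)
```

The larger the tiles are relative to the border, the smaller the share of entities that ends up in the boundary list.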

quyxkh
Filter Inserter
Posts: 964
Joined: Sun May 08, 2016 9:01 am

Post by quyxkh »

ptx0 wrote:
Sun Jan 16, 2022 9:07 pm
one of the devs mentioned that they'd done a small scale experiment where they refactored memory allocation for chunks to use more localised storage. either they didn't do it very well, or missed something, but it had very mild gains in performance.

then again, i've suggested they use mimalloc before and rseding91's system only sees 6% gain, because he insists on testing on Windows.
"Only" sees 6% gain.

If they'd additionally link with libhugetlbfs on linux that'd ... I think the only fair characterization here is "skyrocket". I'd expect >25% speed boost on big maps.

edit: I've seen the effects that https://www.reddit.com/r/factorio/comme ... uge_pages/ reports


MartinG
Burner Inserter
Posts: 6
Joined: Sun Jan 23, 2022 11:50 pm

Post by MartinG »

Hello, I described a significant performance drop in this post: viewtopic.php?f=49&t=101353&p=560747#p560747

Anyone care to check it out?

deep_remi
Burner Inserter
Posts: 12
Joined: Wed Mar 23, 2022 9:17 pm

Post by deep_remi »

This circuit is a good stress test for combinators (deep convolutional neural network with ~150k combinators):
viewtopic.php?f=193&p=565099
I feel it could be made faster when it is doing nothing. I tried to power it off when it is not used, but it made no significant difference. Any optimization trick is welcome.

Sopel
Long Handed Inserter
Posts: 58
Joined: Mon Sep 24, 2018 8:30 pm

Post by Sopel »

I got curious and tried to profile Factorio on a 10k SPM megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they've been helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% of runtime each, which is reasonable; a lot of the railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter updates take 50% of the main thread's runtime! I always thought that the crafting entities would be a bigger offender, but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so surely some parallelism is possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

Image

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?
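
To make the "dependency forest" idea a bit more concrete, here is a minimal sketch using union-find to group inserters that touch a common entity. It rests on the assumption that an inserter only interacts with its source and destination; the entity ids are made up, and whether the resulting components are small enough to be useful in a real base is an open question.

```python
def find(parent, x):
    """Return the representative of x's set, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def independent_components(inserters):
    """Group inserter indices so that no two groups share an entity.

    inserters: list of (source_id, destination_id) pairs (hypothetical ids).
    """
    parent = {}
    for src, dst in inserters:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        a, b = find(parent, src), find(parent, dst)
        if a != b:
            parent[a] = b  # merge the two sets
    groups = {}
    for index, (src, _dst) in enumerate(inserters):
        groups.setdefault(find(parent, src), []).append(index)
    return sorted(groups.values())


inserters = [
    ("chest1", "belt1"),
    ("belt1", "asm1"),
    ("chest2", "asm2"),
    ("asm2", "belt2"),
]
print(independent_components(inserters))
```

Each returned group could in principle be updated on its own thread, since by construction groups do not share a source or destination.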

Rseding91
Factorio Staff
Posts: 12216
Joined: Wed Jun 11, 2014 5:23 am

Post by Rseding91 »

Sopel wrote:
Thu Apr 14, 2022 3:07 pm
I got curious and tried to profile Factorio on a 10k SPM megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they've been helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% of runtime each, which is reasonable; a lot of the railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter updates take 50% of the main thread's runtime! I always thought that the crafting entities would be a bigger offender, but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so surely some parallelism is possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?
You'll notice that the majority of the time is spent inspecting/chasing items on belts. Inserters have 2 purposes: moving items between a belt and something else, and moving items between 2 inventories. If, as a player, you use belts, the majority of the time will be spent on inserters trying to remove/add an item from/to a belt. This is unavoidable.

Since multiple inserters can interact with belts, circuit networks, and inventories, destroy item-entities, create item-entities, go active/inactive, and so on, running them in multiple threads is not viable. The part of the inserter update where it *doesn't* mutate other entities/state is the part you don't see in the profiler because it takes near zero time.
If you want to get ahold of me I'm almost always on Discord.

Sopel
Long Handed Inserter
Posts: 58
Joined: Mon Sep 24, 2018 8:30 pm

Post by Sopel »

Rseding91 wrote:
Fri Apr 15, 2022 2:04 pm
Sopel wrote:
Thu Apr 14, 2022 3:07 pm
I got curious and tried to profile Factorio on a 10k SPM megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they've been helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% of runtime each, which is reasonable; a lot of the railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter updates take 50% of the main thread's runtime! I always thought that the crafting entities would be a bigger offender, but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so surely some parallelism is possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?
You'll notice that the majority of the time is spent inspecting/chasing items on belts. Inserters have 2 purposes: moving items between a belt and something else, and moving items between 2 inventories. If, as a player, you use belts, the majority of the time will be spent on inserters trying to remove/add an item from/to a belt. This is unavoidable.

Since multiple inserters can interact with belts, circuit networks, and inventories, destroy item-entities, create item-entities, go active/inactive, and so on, running them in multiple threads is not viable. The part of the inserter update where it *doesn't* mutate other entities/state is the part you don't see in the profiler because it takes near zero time.
Creation/destruction of item entities is not a common occurrence, so I think it could be tracked per thread and then processed sequentially afterwards. As for the dependencies/interactions, I still believe they can be resolved dynamically pretty well. I coded up a simplified demonstration in Python with a naive algorithm. It randomizes inserter connections (which is the worst case) and then attempts to split the inserters into groups that are independent.

Code:

import random

class InserterEntity:
    '''
    Each state would have a virtual "mutex" that identifies it.
    I presume it would be enough to have one per inventory/belt/circuit.
    Could just use a pointer to the state?
    '''
    def __init__(self, source_state_mutex, destination_state_mutex, circuit_state_mutex):
        self.source_state_mutex = source_state_mutex
        self.destination_state_mutex = destination_state_mutex
        self.circuit_state_mutex = circuit_state_mutex

    def get_dependent_states(self):
        if self.circuit_state_mutex is not None:
            return (self.source_state_mutex, self.destination_state_mutex, self.circuit_state_mutex)
        else:
            return (self.source_state_mutex, self.destination_state_mutex)

def create_some_inserter_entities(num_inserters, num_other_entities, num_circuits, circuit_chance):
    entities = []
    for i in range(num_inserters):
        source_id = random.randrange(num_other_entities)
        destination_id = num_other_entities + random.randrange(num_other_entities)
        circuit_id = num_other_entities * 2 + random.randrange(num_circuits) if random.random() < circuit_chance else None
        entities.append(InserterEntity(source_id, destination_id, circuit_id))
    return entities

def split_into_independent_groups(entities, num_groups):
    '''
    Split into num_groups independent groups and one more group
    that contains items that are always dependent on more than 1 other group.
    '''
    groups = [[] for i in range(num_groups)]
    mutexes_in_groups = [set() for i in range(num_groups)]
    seq_after = []

    for entity in entities:
        deps = entity.get_dependent_states()
        has_dep = [False] * num_groups
        for i, mutexes_in_group in enumerate(mutexes_in_groups):
            for dep in deps:
                if dep in mutexes_in_group:
                    has_dep[i] = True
                    break
        num_deps = has_dep.count(True)
        if num_deps > 1:
            seq_after.append(entity)
        elif num_deps == 1:
            for i in range(num_groups):
                if has_dep[i]:
                    groups[i].append(entity)
                    mutexes_in_groups[i].update(deps)
                    break
        elif num_deps == 0:
            best_group = 0
            for i in range(1, num_groups):
                if len(groups[i]) < len(groups[best_group]):
                    best_group = i
            groups[best_group].append(entity)
            mutexes_in_groups[best_group].update(deps)

    return groups, seq_after

num_groups = 8
num_inserters = 10000
num_unique_connectibles = 3000
num_circuit_networks = 20
inserter_with_circuit_rate = 0.01
entities = create_some_inserter_entities(num_inserters, num_unique_connectibles, num_circuit_networks, inserter_with_circuit_rate)
print('Total: ', len(entities))
groups, seq_after = split_into_independent_groups(entities, num_groups)
for i, group in enumerate(groups):
    print('Group {}: '.format(i), len(group))
print('Seq group: ', len(seq_after))

print('Second step maybe to parallelize leftover?')
groups, seq_after = split_into_independent_groups(seq_after, num_groups)
for i, group in enumerate(groups):
    print('Group {}: '.format(i), len(group))
print('Seq group: ', len(seq_after))
I'll leave it at that. I don't think there's more I can say: I only know the in-game mechanics as far as playing allows, so I don't have enough knowledge to discuss this further.

ssilk
Global Moderator
Posts: 12717
Joined: Tue Apr 16, 2013 10:35 pm

Post by ssilk »

I want to point to this subject
viewtopic.php?f=6&t=102385 [Optimization] Improvement for container insertion logic

Edit: discussion was merged and is now at viewtopic.php?p=567987#p567987


Because it seems to be a problem in megabases:
… the fact that insertion/removal logic behaves exactly as if it's iterating over every slot, and very large containers (warehouse mods, etc) are notorious for causing lag unless used sparingly, suggests it.
There have been threads discussing this and talking about adding indexing of some sort before, though I decided to make a new thread because i'm suggesting a specific indexing method.
It describes a way to add an index to the chest:
if you're just adding or removing the same items from a chest, and you've already found the slot to do so with, there's no reason to iterate over the slots again
Nevertheless it’s worth reading, because it describes the problem very well.
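
The idea of remembering the slot that was used last time can be sketched roughly as follows. This is only an illustration of the general technique, not the indexing method proposed in the linked thread and not Factorio's inventory code; the `Chest` class, the single `STACK_SIZE`, and the `scans` counter are all made up for the example.

```python
STACK_SIZE = 100  # hypothetical stack limit, the same for every item


class Chest:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # each slot is None or [item_name, count]
        self.last_slot = {}              # item_name -> index of the slot last used for it
        self.scans = 0                   # counts full scans, for demonstration only

    def insert(self, item):
        """Insert one item; return True on success, False if the chest is full."""
        i = self.last_slot.get(item)
        if i is not None:
            slot = self.slots[i]
            if slot is not None and slot[0] == item and slot[1] < STACK_SIZE:
                slot[1] += 1  # fast path: remembered slot still has room
                return True
        # Slow path: iterate over every slot, as in the behaviour described above.
        self.scans += 1
        first_empty = None
        for i, slot in enumerate(self.slots):
            if slot is None:
                if first_empty is None:
                    first_empty = i
            elif slot[0] == item and slot[1] < STACK_SIZE:
                slot[1] += 1
                self.last_slot[item] = i
                return True
        if first_empty is None:
            return False
        self.slots[first_empty] = [item, 1]
        self.last_slot[item] = first_empty
        return True


chest = Chest(4)
results = [chest.insert("iron-plate") for _ in range(250)]
print(chest.scans, chest.slots)
```

In this toy run only the inserts that start a new stack pay for a full scan; all the others go straight to the remembered slot.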

Taneeda
Long Handed Inserter
Posts: 56
Joined: Tue May 30, 2017 9:25 am

Post by Taneeda »

Here is a save file of my spaghetti megabase, not the biggest but pretty big (~2.9k SPM), so I thought it might be interesting to use for researching possible performance improvements. Let me know how I can help...

The link contains several iterations of the same base. https://drive.google.com/drive/folders/ ... sp=sharing
Shit happens, don't worry, keep happy

dasiro
Fast Inserter
Posts: 114
Joined: Fri Jun 03, 2016 5:55 pm

Post by dasiro »

Sopel wrote:
Thu Apr 14, 2022 3:07 pm

Image
how did you make this?

I've got a seablock save that's 200MB COMPRESSED (2.2GB uncompressed) with a few extra mods, but really mega-base sized imho. my UPS is horrible and the game takes up about 30GB+ while running (half is pagefile on my ssd). It takes minutes to load/save and I'm more and more thinking about abandoning it, just because of the bad performance.

my trusty old i5-6600K, 16GB ought to be enough, but my train updates are running wild at around 16 most of the time causing my UPS to drop to 30 :/

If anyone could help me salvage this 960+h save it would be very appreciated
https://www.transferxl.com/download/08jk9yGYQtqvY7
(retention = 1 week so msg me if it's expired)

Rseding91
Factorio Staff
Posts: 12216
Joined: Wed Jun 11, 2014 5:23 am

Post by Rseding91 »

dasiro wrote:
Sun Jun 05, 2022 3:13 pm
... I've got a seablock save that's 200MB COMPRESSED (2.2GB uncompressed) with a few extra mods ...
Just so you know, the mod train-log is what's taking 98% of your save file size and runtime memory usage.

dasiro
Fast Inserter
Posts: 114
Joined: Fri Jun 03, 2016 5:55 pm

Post by dasiro »

:shock: :shock: :shock:

I suspected LTN but I couldn't find anything that pointed in its direction. Certainly since the UPS returned to 60 regularly, but it happened both with a lot of trains in depot and in movement.

I've cleared the history and now my compressed size is only 24MB so the loading and saving is already a HUGE win.

What I suspected to be 1 problem appears to be 2, since the UPS is still unchanged, with Factorio only taking up 35% CPU and 50% usage in total.

Is there any way I could diagnose this myself, since the saves are serialized?

thanks a lot already and I guess that memory upgrade can wait a bit longer now so I'll get some more merch :D

Rseding91
Factorio Staff
Posts: 12216
Joined: Wed Jun 11, 2014 5:23 am

Post by Rseding91 »

F4 -> show-entity-time-usage for per-entity timings
F4 -> show-time-usage for other timings

Qon
Smart Inserter
Posts: 1744
Joined: Thu Mar 17, 2016 6:27 am

Post by Qon »

Issue 1
I get bad performance (9 UPS) with this somewhat large combinator contraption with ~800 000 combinators (~500 000 not counting constants). I expected that; obviously there's a limit to how many combinators the game can simulate per update. If it can be sped up in some way anyway, that would be neat. I get 11 UPS when all the combinators are in a fixed state with no input updates. It might just be the switches and lamps, which don't toggle when the circuits don't update, that give those extra UPS. Only updating the output of combinators that get updated inputs would speed my circuit up immensely. I guess it hasn't been done already because it would reduce performance for circuits that are always changing their inputs?
(Editor Extensions mod is used for power but can be replaced with vanilla interface for testing, Recursive Blueprints+ is needed for the circuit to do anything but isn't needed for performance testing. Other mods are not needed either and don't really affect UPS.)
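
A rough sketch of what "only updating combinators that get updated inputs" could look like, as a dirty-set scheme. This is a toy model under my own assumptions (each combinator reads the previous tick's outputs and has a single numeric output); the `Combinator` class and its `func` are hypothetical and have nothing to do with how the game actually stores circuit networks.

```python
class Combinator:
    def __init__(self, name, inputs, func):
        self.name = name
        self.inputs = inputs  # names of the combinators feeding this one
        self.func = func      # maps a list of input values to an output value
        self.output = 0


def build_consumers(combinators):
    """Map each combinator name to the names of the combinators reading it."""
    consumers = {name: [] for name in combinators}
    for c in combinators.values():
        for src in c.inputs:
            consumers[src].append(c.name)
    return consumers


def tick(combinators, consumers, dirty):
    """Advance one tick, recomputing only the combinators in `dirty`.

    Returns the set of combinators that must be recomputed next tick.
    """
    # Read phase: new outputs are computed from the previous tick's outputs.
    new_outputs = {}
    for name in dirty:
        c = combinators[name]
        new_outputs[name] = c.func([combinators[s].output for s in c.inputs])
    # Write phase: commit, and mark the consumers of anything that changed.
    next_dirty = set()
    for name, value in new_outputs.items():
        if value != combinators[name].output:
            combinators[name].output = value
            next_dirty.update(consumers[name])
    return next_dirty


combinators = {
    "a": Combinator("a", [], lambda values: 5),
    "b": Combinator("b", ["a"], lambda values: sum(values) + 1),
    "c": Combinator("c", ["b"], lambda values: sum(values) * 2),
}
consumers = build_consumers(combinators)
dirty = set(combinators)  # everything is recomputed on the first tick
ticks = 0
while dirty:
    dirty = tick(combinators, consumers, dirty)
    ticks += 1
print(ticks, {name: c.output for name, c in combinators.items()})
```

Once the network settles, the dirty set is empty and a tick costs nothing. The trade-off guessed at above shows up here too: a circuit whose inputs change every tick pays for the extra bookkeeping without ever skipping any work.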
Issue 2, Maybe this would better fit as a bug report:
Well, the reason I report this is that deconstruction of just a handful of combinators completely freezes the game (marking is fine without instant deconstruction, it freezes when bots start removing or if you have /editor deconstruction). It takes like 10 seconds to remove ~100 (one cell) entities with /editor instant deconstruction where the game is completely frozen. But only from the big connected block, things not connected to the big contraption are removed quickly. This is the same even if mods are disabled. Trying to remove more at once just makes the game freeze longer or permanently, which makes it practically impossible to remove the thousands of cells that exist now. Construction is instant still though. If you manage to remove a handful of cells after waiting for minutes and press Ctrl+Z the entities are back instantly. Is this performance issue fixable?
Attachments
cell command large.zip
(35.36 MiB) Downloaded 63 times
