Friday Facts #421 - Optimizations 2.0

Tertius · Post by **Tertius** » Fri Jul 26, 2024 9:28 pm

eternaleye wrote: ↑
Fri Jul 26, 2024 6:42 pm
Speaking of lateral thinking in optimizations, I'm curious as to both why electric network updates do have such a heavy memory bandwidth impact, and whether that impact could be avoided in the common case.

This is not about programming, this is about CPU and memory hardware. To be able to speed things up with multithreading, the multithreaded stuff must run simultaneously. Stuff runs simultaneously, if one thread runs on one CPU core and another thread on a different CPU core at the same time.

The in-memory structure that holds the electric network info is some array. It's an array, so items are adjacent to each other in memory, so multiple items fit into one cache line, which speeds things up, since they are loaded into the cache just once, then stay over the course of the tick update.

Some part of that array belongs to one electric network and some other part belongs to another electric network. One thread for one electric network iterates over one part of that array, and the other thread for the other electric network iterates over an other part of that array. This means the same array is accessed at the same time by 2 different threads, so it is accessed by 2 different CPU cores.

Memory access is fast, but to speed it up even more, there is a memory cache that can be accessed faster than the base memory. There are even multiple tiers of cache - there are 3 tiers. The L1 cache is extremely fast, but very small, about 32KB. It's not shared among CPU cores. It's the L1 cache that makes modern CPUs really fast. It can be so fast, because it isn't shared among CPU cores. The next tier is the L2 cache. It's a buffer between the L1 and the last tier of cache, the L3 cache. The L2 cache is also not shared between CPU cores. Only the L3 cache is cached between CPU cores, but due to that sharing, this cache is considerably slower than the L1 cache.

Items are cached in blocks of 64 bytes (called the cache line size). Now, if 2 threads are working in parallel on the same array, the 1st CPU core executing the 1st thread loads the data into its L1 cache through the L2 cache from the L3 cache. Now, if the other thread on our 2nd CPU core wants to access some part of that array less than 64 bytes away, the memory management hardware invalidates the L1 cache from the 1st core, then loads the data from L3 to L2 to the L1 cache of the 2nd core.

The real impact starts, if core 1 wants to access the next item. It is probably less than 64 bytes away, so it's usually already in the L1 cache, and this is the fastest possible access. However, it must be fetched from the l3 cache again due to the cache line was invalidated due to the access from the other core. This again invalidates the L1 cache from the other core. In the end, all memory access for this kind of data is fetched with the speed of the L3 cache, not with the really fast speed of the L1 cache.

Of course, many items in that array are more than 64 bytes away, so this kind of cache thrashing isn't always taking place. But it seems to take place often enough to destroy the performance gain of the parallelization.

You might think that you just need to make items bigger than the cache line size. However, this means more memory access in general, because the data is bigger, so this also slows everything down, because more stuff has to be purged from the cache to free space for the more data to load into the L3 cache. This is usually what defines the limiting speed for bigger data structures (the L3 cache of Intel consumer CPUs is 6-30 MB).

Also see: https://igoro.com/archive/gallery-of-pr ... e-effects/

numzero · Post by **numzero** » Fri Jul 26, 2024 10:06 pm

j16sdiz wrote: ↑
Fri Jul 26, 2024 5:25 pm

Dev-iL wrote: ↑
Fri Jul 26, 2024 12:07 pm
I always wondered, and may have raised it on IRC in the past:
1) whether utilizing GPUs for computations instead of (or in addition to) graphic rendering could speed things up. It's obviously no small effort to convert parts of the code to work on a GPU, but since the team now has experience with porting the game to different platforms, including a console, perhaps this is more doable.
2) Use linear algebra to update states - (sparse) matrix operations such as multiplications, additions, etc, utilizing BLAS and MAGMA. It would be conceptually simple to create state transition matrices to update the states on each tick. The real question is which parts of the game loop code can benefit from moving away from explicit sub-looping...
Different video card do math slightly differently...
Trust me, you won't want to debug desync between different video card / drive version combinations.

And yet bitcoin mining works somehow. I suppose they only use integer operations and those are exact and thus the same everywhere. Factorio could do the same for bulkiest updates like fluid handling. That would be a radical move though, and possibly a problem for dedicated servers (do they even have GPUs?).

Polish_healthcare · Post by **Polish_healthcare** » Fri Jul 26, 2024 10:30 pm

Hello guys, long time no see (though I am playing this game for ~7 years and I have 2800 hrs, I rarely come to the forums)

Any devs here?

(Disclaimer: I am a software dev for 20+ years)

I think I may have a potential solution to fix the "one device connected to multiple electric networks" which apparently wrecks multithreading.

Basically, what I am thinking, instead of having 1 machine interact with multiple electric networks, each machine could have a set of multiple virtual internal accumulators, instead of just 1 accumulator like now. These accumulators do not actually exist, they are virtual constructs for the purpose of performance.

They could be reduced to few simple variables and functions, no virtual object has to actually exist.

The number of virtual accumulators would be equal to how many networks is the machine connected to. Each virtual accumulator would only draw power from related electric network. Such a scheme could be made multi-threaded, because essentially the machine is no longer connected to multiple electric networks. It's only connected to its internal accumulators.

Then the machine would draw power, either equally or using some simple logic, from these accumulators, this can be done on yet another thread.

Potential advantages:
- Whole scheme can be completely multi-threaded
- It will be possible to divide electric networks between multiple CPU threads, each network can run independently
- Possible performance improvements?

Potential disadvantages:
- There will be more total objects in game, so total performance impact is uncertain, would have to be benchmarked ultimately
- Moderately increased complexity of processing electric networks
- Slightly increased complexity of machines in general
- Slightly increased memory usage

I am not saying I am absolutely certain this fixes the problem yet, without benchmarks it's really hard to tell, but I think the concept is sound.

Related screenshot:

Trific · Post by **Trific** » Fri Jul 26, 2024 11:11 pm

With regard to the lighting array working on a space platform, I'd like to suggest a (not cheap) tiling option for all surfaces that would power things placed on it that draw no more than a lamp. It would have to either have a power source also placed on it contiguous with the tiles the entities drawing power are, or a pole/substation. It should be more expensive than an equivalent setup with power poles, so the case for using it is convenience/aesthetics rather than materials efficiency. A quick check shows it could power lamps, combinators, speakers, and not much more. Kind of like a low power underground option.

crj · Post by **crj** » Fri Jul 26, 2024 11:14 pm

With the exception of selector combinator (which could no longer use a global random generator)

You seem like smart people so you've probably already realised this, but there are at least two efficient ways to keep RNGs deterministic in a multi-threaded environment:

Give each selector combinator its own PRNG (which for a LCG is one word of state)
Create a global random number each tick, and hash that with each selector combinator's ID

EmperorArthur · Post by **EmperorArthur** » Fri Jul 26, 2024 11:21 pm

numzero wrote: ↑
Fri Jul 26, 2024 10:06 pm

j16sdiz wrote: ↑
Fri Jul 26, 2024 5:25 pm

Dev-iL wrote: ↑
Fri Jul 26, 2024 12:07 pm
I always wondered, and may have raised it on IRC in the past:
1) whether utilizing GPUs for computations instead of (or in addition to) graphic rendering could speed things up. It's obviously no small effort to convert parts of the code to work on a GPU, but since the team now has experience with porting the game to different platforms, including a console, perhaps this is more doable.
2) Use linear algebra to update states - (sparse) matrix operations such as multiplications, additions, etc, utilizing BLAS and MAGMA. It would be conceptually simple to create state transition matrices to update the states on each tick. The real question is which parts of the game loop code can benefit from moving away from explicit sub-looping...
Different video card do math slightly differently...
Trust me, you won't want to debug desync between different video card / drive version combinations.
And yet bitcoin mining works somehow. I suppose they only use integer operations and those are exact and thus the same everywhere. Factorio could do the same for bulkiest updates like fluid handling. That would be a radical move though, and possibly a problem for dedicated servers (do they even have GPUs?).

The problem for Factorio isn't just the answer, it's the speed and determinism.

Bitcoin mining is literally brute forcing a hash. Which is doing guess and check to solve a math problem. It doesn't matter if thread a sometimes finishes after thread b. All that matters is if the end result is correct or not.

raven2k3 · Post by **raven2k3** » Fri Jul 26, 2024 11:24 pm

I think one main issue/bug that causes long travels of robots is that logistic robots always try to stack things up first, instead of using the next "possible" storage.

Imagine there are 2 storage chests (yellow) with an active filter on wood.
One beside you and one at the end of the map with 1 wood stored already.
If you then dumb 1 wood from the inventory, the next logistic drone will always fly to the one chest (with the one wood to stack up) at the end of the map, which is ridiculous far away, compared to the chest next to you.

I saw that behaviour in many of my games and I am pretty sure this still happens.
This was always a problem when you have huge bases.

raven2k3 · Post by **raven2k3** » Fri Jul 26, 2024 11:56 pm

Related screenshot:

[/quote]

Hi,
how about to force a split of the electric network by vanilla?

E.g.: You need transformers which can only provide an specific amount of electric power.
Therefore you have to build a network of transformers connected by big power poles, so each network then can be mulithreaded?
It would be also a little more realstic ^^

Something like this mod https://mods.factorio.com/mod/transformer and https://mods.factorio.com/mod/PowerOverload.
If a player don't have to split the network he won't do it. ^^

eternaleye · Post by **eternaleye** » Sat Jul 27, 2024 12:00 am

Tertius wrote: ↑
Fri Jul 26, 2024 9:28 pm

eternaleye wrote: ↑
Fri Jul 26, 2024 6:42 pm
Speaking of lateral thinking in optimizations, I'm curious as to both why electric network updates do have such a heavy memory bandwidth impact, and whether that impact could be avoided in the common case.
This is not about programming, this is about CPU and memory hardware. To be able to speed things up with multithreading, the multithreaded stuff must run simultaneously. Stuff runs simultaneously, if one thread runs on one CPU core and another thread on a different CPU core at the same time.

I think you misunderstood me; I'm not proposing a way to make multithreading work. I'm saying that if the bottleneck is memory bandwidth, then a system that can classify large numbers of memory accesses as skippable can reduce the memory bandwidth, directly reducing the cost of electric network updates in the common case. This is an independent optimization from multithreading, aimed at a different bottleneck.

Tertius wrote: ↑
Fri Jul 26, 2024 9:28 pm
The in-memory structure that holds the electric network info is some array. It's an array, so items are adjacent to each other in memory, so multiple items fit into one cache line, which speeds things up, since they are loaded into the cache just once, then stay over the course of the tick update.

Er, no. If the electric network update is memory-bandwidth-bound, then that indicates that the working set it is operating on is larger than the cache, or at very least is pushed out of the cache between ticks. As a result, it'd be performing a linear scan - which is prefetchable, so you won't get cache misses, but still benefits from reducing memory accesses.

Tertius wrote: ↑
Fri Jul 26, 2024 9:28 pm
Some part of that array belongs to one electric network and some other part belongs to another electric network. One thread for one electric network iterates over one part of that array, and the other thread for the other electric network iterates over an other part of that array. This means the same array is accessed at the same time by 2 different threads, so it is accessed by 2 different CPU cores.

Again, I'm not proposing a change on top of multithreading, I'm asking questions based on the information about the current code's bottleneck that was revealed by the multithreading experiment.

The rest of your post covers information that I'm quite deeply familiar with already, but isn't germane to what I was saying.

Fundamentally, the post lays out that the current electric network update logic is memory bandwidth bound. This has the consequence that parallelizing it is unhelpful (and in fact detrimental) because it pessimizes the reasonably-performant-due-to-prefetching linear scan. However, a linear scan that only writes to a smaller subset of all entries is going to be more performant, because fewer cachelines will require writeback. This is especially true if the array can be ordered by priority and mode, causing there to be contiguous sequences of cachelines that do or do not require writeback. The reduction in memory bandwidth would directly improve performance, in those cases.

Blaster · Post by **Blaster** » Sat Jul 27, 2024 3:20 am

Okay, so the thing you all did with fluid pipes in the previous FFF - can you do that with heatpipes?

Me want glorious nuclear fission reactor with good cpu that doesn't have the janky old pipe-like black magic physics

dustyvilla · Post by **dustyvilla** » Sat Jul 27, 2024 3:34 am

raven2k3 wrote: ↑
Fri Jul 26, 2024 11:24 pm
I think one main issue/bug that causes long travels of robots is that logistic robots always try to stack things up first, instead of using the next "possible" storage.

i wonder if a possible solution to this, would be to have bots just dump things in whichever chest is closest, but have a background process where free logistics bots periodically try to consolidate items into stacks within the same storage chests, essentially allowing the stacking to be deferred for later

raven2k3 · Post by **raven2k3** » Sat Jul 27, 2024 5:00 am

dustyvilla wrote: ↑
Sat Jul 27, 2024 3:34 am

raven2k3 wrote: ↑
Fri Jul 26, 2024 11:24 pm
I think one main issue/bug that causes long travels of robots is that logistic robots always try to stack things up first, instead of using the next "possible" storage.
i wonder if a possible solution to this, would be to have bots just dump things in whichever chest is closest, but have a background process where free logistics bots periodically try to consolidate items into stacks within the same storage chests, essentially allowing the stacking to be deferred for later

I personally would prefer that robots will choose the closest storage chest with the same filter, otherwise you won't be able to optimze your base. Therefore robots then will stack the nearest storage chest up and after it's full they search for the next nearest available..

E.g.:
Imagine you want to harvest a lot of trees and you want to store it fast so you drop a storage chest with wood filter next to it,
but for some reason all robots fly all over the base, to the other chest which has one stack of wood in it. ..
If you want tide / stack things up you can use request or buffer chest I guess.

Current priority (https://wiki.factorio.com/Storage_chest) is :
1. Storage chests with a matching inventory
2. Storage chests with a matching filter
3. Storage chests with no filter and no inventory
4. Storage chests with no filter and a mismatched inventory

Maybe something like this would be better:
1. Storage chests with nearest matching filter
2. Storage chests with no filter but matching inventory
3. Storage chests with no filter and no inventory
4. Storage chests with no filter and a mismatched inventory

Some automatic triggered "background process" to consolidate items will end up in the original use case which is also not always wanted,
so I would not recommend that..
If you want to tide things up, maybe implement a "priority" / "cold storage" checkbox at the storage box
which is then rule number one "Storage chest with matching filter and has priority set" and robots will start to rearrange.

You also have the same behaviour / issue if you build something over the robo network.. it's almost never an item available in the nearest chest xD

Ironmonk · Post by **Ironmonk** » Sat Jul 27, 2024 5:47 am

So, it just occurred to me that if you zoom out the image with the lamps array, it kinda looks like a monster... probably a stealthy preview of new monsters?

Locane · Post by **Locane** » Sat Jul 27, 2024 6:52 am

I really hope this focus on performance extends to a little help with making it easier for the Global Tick Time Scaling mod to look and run right for people who want that!

ThePiachu · Post by **ThePiachu** » Sat Jul 27, 2024 7:53 am

Speaking of roboports with large coverage, there is one aspect I'd love to see improved - smarter large-scale deconstruction.

So I've been playing Space Exploration and I've explored my home planet by pushing out a tiled Roboport + radar + power blueprint. But once biters have been cleared out of the whole planet, I want to scale back my roboport and radar coverage to regain some UPS. I can order the deconstruction of everything I'm not using, but there is one problem - I need the robots to deconstruct everything from outside in so I don't end up with a mess of disconnected, half-deconstructed networks. Unfortunately for the time being I'm stuck manually marking the whole network for deconstruction one layer at a time, which on an uneven planet means many many hours of work...

Serenity · Post by **Serenity** » Sat Jul 27, 2024 10:34 am

With the roboports I have long wanted some features from Bob's mods: network extension poles and charge pads. Not as a replacement but to supplement the normal roboport.

I wonder if something like that could help. With an entity that only extends the range and doesn't have storage, charging, or logistics coverage there would be less to calculate. And charge pads don't have to worry about robots wanting to be stored there.

xpozec · Post by **xpozec** » Sat Jul 27, 2024 10:39 am

Why was there no optimisation related pun at the end, the part where you ask us to leave your thoughts? Or was it optimised perhaps already?

j16sdiz · Post by **j16sdiz** » Sat Jul 27, 2024 11:27 am

numzero wrote: ↑
Fri Jul 26, 2024 10:06 pm

j16sdiz wrote: ↑
Fri Jul 26, 2024 5:25 pm

Dev-iL wrote: ↑
Fri Jul 26, 2024 12:07 pm
I always wondered, and may have raised it on IRC in the past:
1) whether utilizing GPUs for computations instead of (or in addition to) graphic rendering could speed things up. It's obviously no small effort to convert parts of the code to work on a GPU, but since the team now has experience with porting the game to different platforms, including a console, perhaps this is more doable.
2) Use linear algebra to update states - (sparse) matrix operations such as multiplications, additions, etc, utilizing BLAS and MAGMA. It would be conceptually simple to create state transition matrices to update the states on each tick. The real question is which parts of the game loop code can benefit from moving away from explicit sub-looping...
Different video card do math slightly differently...
Trust me, you won't want to debug desync between different video card / drive version combinations.
And yet bitcoin mining works somehow. I suppose they only use integer operations and those are exact and thus the same everywhere. Factorio could do the same for bulkiest updates like fluid handling. That would be a radical move though, and possibly a problem for dedicated servers (do they even have GPUs?).

Bitcoin is interesting. The GPU in that mode can to crunch a large amount of data on video memory in parallel.
The memory bandwidth between cpu and gpu will likely become bottleneck.
Bitcoin mining basically repeats the same operation indefinitely without checking with cpu, so it can pay off..

For factorio, unless we can do majority of work on gpu (which is unlikely because of mod), we will saturate the memory bus very quickly

The Phoenixian · Post by **The Phoenixian** » Sat Jul 27, 2024 12:17 pm

samcday wrote: ↑
Fri Jul 26, 2024 2:33 pm

Agamemnon wrote: ↑
Fri Jul 26, 2024 2:22 pm
Ok first: Nice job!

Second: I wonder if we see some similar improvements for Biters. Their pathing and swarm mechanics are still something that has potential for improvement - most notably in the cases where they (for whatever reason) just accumulate over time and eat your UPS instead of your ammo.
Don't forget that the original Space Age announcement included that tantalizing sneak peek at some kind of new life form (and other than our engineer, those things exist purely for us to gleefully tear apart with laser energy, bullets, missiles, poison gas, nuclear warheads and who knows what else?). Since then we've heard exactly NOTHING about any form of organic life in any subsequent FFF, besides the non-sentient Fulgora flora

So my prediction is our heroes at Wube are saving some of the best/juiciest updates for last, right before launch

At this point, with all the hints on various planets, I feel like Military tech/new enemy developments in Space Age are a safe bet for one of the coming FFFs. Given how much active development is going on as the FFFs come out, though, it just might be something they're still polishing for prime time.

Given FFF #404 also rebrands the Spidertron remote to the "RTS Tool" it wouldn't surprise me if we also get to see other remote control entities. Certainly more entity remote control options would fit with being off-planet and thus unable to deal with a situation on the perimeter of one planet.

All that said, I could also see that having been part of the plan that was later dropped at some point in development, or focused to just Spidertrons since Kovarex's design philosophy seems to prefer one tool for many closely related jobs over a proliferation of niche tools.

(Still, being able to command tank swarms in vanilla would be fun, and the mod maker for RTS control and automation is on the staff now.)

SirGouki · Post by **SirGouki** » Sat Jul 27, 2024 12:37 pm

Would allowing robots to do more than 1 task at a time (I.E. you have a blueprint down for a solar array of, say ~400 accumulators, ~600 panels, and the required power poles/substations), using up their storage limit to perform more than one build, not tremendously improve their efficiency as far as optimization goes?

It seems to me (and I do have quite a bit of coding experience in relation to games) that, by having each robot only build one thing, then immediately having to decide on its next task, this adds to the overhead any time there are blueprints active, as opposed to say having 1 construction robot account for its current carrying capacity worth of build orders (in which then it'd only have to grab a new task when it's completely finished, and only have to check to see if it needs to recharge while it's performing builds).

Factorio Forums

Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0

Re: Friday Facts #421 - Optimizations 2.0