"A lot of testing later and the results were correct; it was just that much faster. The underlying algorithm didn't change, but it now ran more than 3x faster by touching less memory. This is another nice example of 'Factorio is not CPU bound, it's memory latency bound'. More cores weren't going to make this faster, because it was never limited by how fast the CPU ran."

There is potentially another approach here which could be of interest: transforming the iteration loop from sequential order to partially overlapping, cooperative execution, using coroutines with a prefetch+yield operation pair wherever you expect an L1 cache miss.
It doesn't change the logic, and it doesn't change the code (much) either, but it completely changes how sensitive your code is to memory latency.
Coroutines are really the key feature here: a prefetch without parallel execution is usually hard to use properly (you don't know early enough what you will need to prefetch), and manual interleaving is just a pain in terms of maintainability of the resulting code.
Check https://www.youtube.com/watch?v=j9tlJAqMV7U for a further in-depth explanation of how this works (and an example showcasing exactly this prefetch-yield pattern). Once you get your head wrapped around the concept of coroutines in C++, they are a surprisingly helpful feature for a threading-free iteration strategy.