Friday Facts #271 - Fluid optimisations & GUI Style inspector

bobingabout
Smart Inserter
Posts: 7352
Joined: Fri May 09, 2014 1:01 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by bobingabout »

Dominik wrote: ↑
Thu Dec 06, 2018 11:44 am
I have that running right now. But no promises that I will keep it, as it is yet another added computation.
Dominik wrote: ↑
Thu Dec 06, 2018 10:57 am
I don't want to spoil stuff, it will eventually all be in FFF. But yes.
Well, I'm not really fussed about the system remaining as it is, but some sort of viscosity control in the data stage would be nice.
I look forward to hearing more about the new fluid system.
Creator of Bob's mods. Expanding your gameplay since version 0.9.8.
I also have a Patreon.

weaknespase
Long Handed Inserter
Posts: 59
Joined: Sat Mar 24, 2018 8:19 am

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by weaknespase »

Most of the time, simulated physics follows simplified models that don't make much sense in places. For example, in one other game with gases inside pipes, they follow an ideal-gas model that propagates and diffuses through pipes at infinite speed. In that game pumps are barriers between pipe networks, so a cooling loop lowers the temperature of the gas in the entire network, and thus any pump splitting the system instantly lowers the efficiency of the heat exchange, contrary to what happens in real life in most cases.

Factorio has a different set of quirks, and the most jarring, in my opinion, is that it doesn't simulate pressure and/or flow rate in the fluid system, leading to every fluid behaving like a shear-thickening fluid.

christian
Long Handed Inserter
Posts: 64
Joined: Fri Jun 08, 2018 12:44 am

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by christian »

omg, does this mean Factorio will be able to use all 32 cores of a TR2? At least when it comes to fluids? Do the belts work this way as well?

Jap2.0
Smart Inserter
Posts: 2339
Joined: Tue Jun 20, 2017 12:02 am

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Jap2.0 »

christian wrote: ↑
Mon Dec 10, 2018 3:55 am
omg, does this mean Factorio will be able to use all 32 cores of a TR2? At least when it comes to fluids? Do the belts work this way as well?
Theoretically yes, although at that point at least 14 of those cores would be mostly idle. Fluids don't take up that much time compared to the rest of the game (especially when divided several times), and the fluid update can't run constantly (there are times it has to be finished, or can't start yet, because of determinism and such). Another 16 (I believe that's the max) could be used for render preparation, although that isn't a major use of CPU. One core would be used for the main game processes, and one for OS stuff (which wouldn't be anywhere near as intensive as Factorio, and could perhaps spill over into the 14 remaining cores too). Belts are not threaded and most likely never will be.
Note: this is all just to the best of my knowledge; I may be wrong in some areas.
There are 10 types of people: those who get this joke and those who don't.

pleegwat
Filter Inserter
Posts: 258
Joined: Fri May 19, 2017 7:31 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by pleegwat »

Jap2.0 wrote: ↑
Mon Dec 10, 2018 9:16 pm
Theoretically yes, although at that point at least 14 of those cores would be mostly idle. [...] Belts are not threaded and most likely never will be.
They could try a similar 'belt systems' trick, but determining whether entities need to be linked may be trickier, and the systems are probably much larger.

The main considerations I'm thinking of:
- Can two inserters insert into the same assembler on the same tick from different threads?
- Can two threads manipulate different locations in the same belt run at the same time (ring buffer)?

I suspect the first is false if the two inserters are holding the same item, and hence multiple belts feeding a single assembler would be locked to the same system.
If the second is false as well, then main-bus factories probably cannot multithread at all. Only outposts and train-based factories could benefit. Then again, most megabases seem to be train-based.
I don't think bots would have any impact, since I assume a bot only interacts with a single chest in a tick, so that transfer can be handled in that chest's system.
There could be additional restrictions I'm not thinking of.

Nightinggale
Fast Inserter
Posts: 120
Joined: Sun May 14, 2017 12:01 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Nightinggale »

pleegwat wrote: ↑
Mon Dec 10, 2018 9:56 pm
They could try a similar 'belt systems' trick, but determining whether entities need to be linked may be trickier, and the systems are probably much larger.

The main considerations I'm thinking of:
- Can two inserters insert into the same assembler on the same tick from different threads?
- Can two threads manipulate different locations in the same belt run at the same time (ring buffer)?
I disagree with that statement. You are mixing inserters and belts. Imagine taking pure belts and treating them like pipes: you make one "belt group". Think of it this way: each belt starts in its own group. If a belt moves items onto another belt, the two groups are joined. Follow this idea around the entire belt grid. The result is that if two belts aren't in the same group, then we know whatever is on one belt can't affect the other belt. Belt groups expand through belts, splitters, undergrounds, loaders and so on: basically anything that can act as a belt.
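To make the grouping concrete, here is a minimal sketch of that merge logic using a disjoint-set (union-find) structure. This is not Factorio's actual code; all names (BeltGroups, connect, and so on) are invented for illustration:

#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: every belt-like entity gets an index, and two
// entities end up in the same group exactly when items can flow between them.
class BeltGroups {
public:
    explicit BeltGroups(std::size_t belt_count) : parent(belt_count) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});  // every belt starts in its own group
    }

    // Called when belt `a` can move items onto belt `b` (or vice versa);
    // splitters, undergrounds and loaders would call this too.
    void connect(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }

    // Two belts share a group exactly when their roots match.
    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving keeps the trees flat
            x = parent[x];
        }
        return x;
    }

private:
    std::vector<std::size_t> parent;
};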

With belt groups like that, execute all groups in parallel. This execution moves items and only that; it should work just fine multithreaded.

Once that is done, only one thread stays active and the other entities can join in on the action: loaders take items to/from storage, inserters pick up or place items, and so on.
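The two-phase structure just described (parallel belt movement, then a serial pass for everything that crosses group boundaries) could look roughly like this sketch, assuming a C++17 standard library with parallel algorithms; BeltGroup and Inserter are invented types:

#include <algorithm>
#include <execution>
#include <vector>

struct BeltGroup { void move_items_one_tick() { /* belt-only item movement */ } };
struct Inserter  { void act()                 { /* may touch belts, assemblers, chests */ } };

// Hypothetical tick: phase 1 runs every independent belt group in parallel,
// phase 2 runs everything that crosses group boundaries on one thread.
void tick(std::vector<BeltGroup>& groups, std::vector<Inserter>& inserters) {
    std::for_each(std::execution::par, groups.begin(), groups.end(),
                  [](BeltGroup& g) { g.move_items_one_tick(); });

    for (Inserter& i : inserters)  // serial phase: may touch any group
        i.act();
}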


If we want to be really aggressive about multithreading, allow splitters to be part of two groups. Belt movement then becomes a two-step process. The first step is moving items forward, which is done in parallel with the belt item movements. The next is moving items left/right within the splitter; this step can also be done in parallel, as moving items in one splitter will not affect movements in another splitter.

This approach lets belts move items using a near-unlimited number of cores. Think about your bus layout: will all belts be in the same group? You will have plenty of groups, particularly if you use splitters to branch items out, as each branch could be made a different group.

We can move on and do the same with inserters. An inserter has a pickup and a drop tile (if the target is an entity, like an assembler, use the entity as the tile in this context). Any other inserter picking up from or dropping onto any of those tiles joins the same group, and the groups spread like that. This means two inserters picking up from the same tile of belt run on a single core, hence no conflict; the same goes for figuring out which inserter should pick up from or drop into an assembler, and so on. Loaders should act like inserters in this context: they pick up from themselves and drop into storage, or vice versa. You still end up with plenty of groups that can work independently of each other, so the groups can run on multiple cores.
Issue: what about inserters and vehicles, like unloading trains?

We could also consider entity groups, where inserters group themselves with the entities they work with. If we do this, the modding API should allow modded entities to join groups, e.g. a crafting combinator joins the group of whatever assembler it is controlling. That way assemblers can be multithreaded too.

We are just getting started. Assign groups to train tracks and let each track system run multithreaded. Maybe you have one big network, but having multiple isn't far-fetched; there is a mod that adds trains on water, meaning at least two networks (land and water). Belt movement can run in parallel with trains. Power plants can run in parallel with both trains and belts (remember, loaders and inserters are not active here). Power plants will, however, not be able to run in parallel with pipes, as they require the steam to be fixed first.

There are plenty of options for multithreading; it's just about getting ideas on how to split the work into completely independent tasks. Two years ago I would have questioned the workload vs. the benefit of this, but the new CPUs change everything. Now I'm fairly certain that my next CPU upgrade (whenever that might be) will have at least 8 real cores.

Jap2.0
Smart Inserter
Posts: 2339
Joined: Tue Jun 20, 2017 12:02 am

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Jap2.0 »

Nightinggale wrote: ↑
Mon Dec 10, 2018 11:35 pm
I disagree with that statement. You are mixing inserters and belts. Imagine taking pure belts and treating them like pipes: you make one "belt group". [...] There are plenty of options for multithreading; it's just about getting ideas on how to split the work into completely independent tasks.
Sure, it might be possible, but you have to consider how to transfer everything between those threads, and memory latency/bandwidth as bottlenecks. Another large factor I've heard mentioned is that it will make the code much more difficult to maintain and likely cause a decent number of bugs. I think you're underestimating how difficult it is to implement while keeping it as stable, deterministic, and maintainable as it currently is.
There are 10 types of people: those who get this joke and those who don't.

Cadde
Fast Inserter
Posts: 149
Joined: Tue Oct 02, 2018 5:44 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Cadde »

It all boils down to complexity of each operation.

Imagine you try to multithread every item everywhere. This means assigning tasks to run each item in its own thread.
Doing so means you DOUBLE the amount of execution per item, and you still have to merge the results of those operations somewhere in the end. You lose performance.

Multithreading only works if you would otherwise need to do more operations in series than you have to do once you split the work across separate cores. As Nightinggale points out, you would have to group belts that are unrelated to each other to get enough serial execution to gain any benefit.
You'd have to find a balance where the time added for setup is won back by the time saved running the belts in parallel.
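For a back-of-the-envelope version of that balance, Amdahl's law gives the ceiling on what parallelizing one stage can buy; the numbers below are made up for illustration:

S(n) = 1 / ((1 - p) + p / n)

If the belt update is p = 0.3 of the tick and parallelizes perfectly across n = 8 cores, the whole tick speeds up by only 1 / (0.7 + 0.3/8) ≈ 1.36x, and any per-tick setup cost eats directly into that margin.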

It's not the parallelization that's difficult; it's finding MEANINGFUL parallelization.
And that's where metrics come in: you measure the amount of time already spent in each stage, identify the bottlenecks, and attack the bottlenecks.
Perhaps it's not the belt simulation that's the problem; perhaps it's all the updating of visuals?

...

My solution to this isn't about multithreading, but rather about merging items into blocks, sort of like how the wide chests mod works.
Instead of having x instances of items on the belt, as long as they are all moving unaffected (no bends, no adds/subtracts, and no separation from belt speed changes), merge them into a single block of 14 items (because 7 fit in one straight belt lane and there are 2 lanes, right?) and move those blocks as one sprite.

As soon as a block enters a changing environment, for example it encounters a bend, split the block up and only re-form it when the items are on an unaffected belt again.

None of the multithreading issues arise from doing so; the downside is an added number of sprites being loaded into VRAM. Or, if they aren't using sprites in the traditional sense but rather geometry, make a block of items into a single bigger object using different materials per quad.
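A sketch of what such a merged representation could look like; this is an invented data structure, not the game's (FFF #176 describes the real approach in terms of item runs and gaps):

#include <cstdint>
#include <vector>

// Hypothetical compressed belt lane: identical items moving in lockstep are
// stored as one segment instead of one record per item.
struct ItemSegment {
    std::uint16_t item_id;    // what the segment holds
    std::uint16_t count;      // how many items move as one block
    std::int32_t  gap_ahead;  // distance to the segment in front, fixed-point
};

struct BeltLane {
    std::vector<ItemSegment> segments;  // ordered front to back

    // Moving the whole lane is O(1): only the front gap changes, because every
    // segment behind it is stored relative to its predecessor.
    void advance(std::int32_t distance) {
        if (!segments.empty() && segments.front().gap_ahead >= distance)
            segments.front().gap_ahead -= distance;
        // else: the front block hit an obstacle and has to be split up.
    }
};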

Nightinggale
Fast Inserter
Posts: 120
Joined: Sun May 14, 2017 12:01 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Nightinggale »

Jap2.0 wrote: ↑
Tue Dec 11, 2018 1:05 am
Sure, it might be possible, but you have to consider how to transfer everything between those threads, and memory latency/bandwidth as bottlenecks. Another large factor I've heard mentioned is that it will make the code much more difficult to maintain and likely cause a decent number of bugs. I think you're underestimating how difficult it is to implement while keeping it as stable, deterministic, and maintainable as it currently is.
This is a really weird situation, and a post that wasn't easy to start writing. I wrote a post based on correct facts. You object to what I wrote. The weird part is that your post is also based on (mostly) correct facts. That sounds impossible, yet here we are, and the explanation lies in the tiny details of how the hardware operates; this situation actually highlights why using more than one CPU core can be problematic.

I will try to explain the thinking behind my post in more detail, though it will mainly be about the CPU<->memory interface. Factorio is a game that attracts people with an inner engineer, and it certainly attracted the attention of this electrical engineer. Let's just say I have a fairly decent understanding of what goes on in a computer at the hardware level.
Jap2.0 wrote: ↑
Tue Dec 11, 2018 1:05 am
you have to consider how to transfer everything between those threads
I did, and the answer is simple: it works without communication. As I mentioned earlier, communication between threads is slow and can make multithreaded software slower than a single core. Because of this, the whole goal is to find code that doesn't need to communicate between threads.

A good candidate is belt1 and belt2, if those two aren't connected. The belts are able to write to items (position) and to the belt they are on (what is on me); at least I assume so (I obviously don't have the source code). Since no belt or item on belt1 will ever come in contact with belt2, and vice versa, they can read/write memory without ever writing to a shared variable, meaning no communication is needed.

This means there is no communication between threads and no need to "transfer everything".
Jap2.0 wrote: ↑
Tue Dec 11, 2018 1:05 am
memory latency/bandwidth as bottlenecks
Single-core Factorio is bottlenecked by memory latency, but not by memory bandwidth. I know this because a single CPU core is simply not fast enough to come near the bandwidth limits of memory. Bandwidth could become an issue on a 64-core system, but let's be real: nobody plays Factorio on such a system. Bandwidth shouldn't even be an issue with 8 real cores.
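A rough sanity check of that claim, using illustrative numbers rather than measurements: a latency-bound core that misses the cache once every 100 ns pulls in one 64-byte cache line per miss, i.e. 64 B / 100 ns = 0.64 GB/s per core. Dual-channel DDR4-3200 offers about 2 × 25.6 GB/s = 51.2 GB/s, so even 8 such cores would consume only around 5 GB/s of it.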

(long explanation of how memory latency became an issue and how multiple cores play a role in the solution)
Memory latency is a real issue, and it is indeed slowing down Factorio. Again, I can say that with 100% certainty, because the only things not slowed down by memory latency are those specially designed CPU benchmarking tools that push the CPU as hard as possible.

To understand memory latency, let's start at the beginning. Back in the 80s, memory was instant: the CPU would read a memory cell and have the result within the same cycle. In fact, the famous Amiga 500 had memory twice as fast as the CPU, meaning the CPU used the memory only half the time, leaving it free for other hardware the other half, and that without slowing the CPU down with a queue.

CPUs increased in speed and so did memory, but CPUs increased faster. This introduced memory latency, and CPU caches were introduced to combat it. Still, CPUs kept outpacing memory, and the latency got worse and worse when counted in CPU cycles. The crazy race for the highest-MHz CPU certainly didn't help here. It also didn't result in much faster CPUs, because while they provided more cycles per second, they also did less in each cycle.

Around 2005 the world of CPUs changed. The race for MHz was over, and instead we got a single CPU chip with 2 cores. Why the change? The answer is memory latency. The projected path of the MHz craze meant CPUs were heading towards killing their own performance by always waiting for memory. Lower clock speeds and more cores made CPUs more efficient. This, by the way, is also when the goal moved from the most performance possible to the most performance per watt, because CPUs were overheating at the crazy clock speeds. 2006 was when Apple switched from IBM to Intel CPUs: they went from a 2 GHz G5 (single core) to a 2 GHz dual core, yet the core temperature dropped significantly. The main reason for the switch was that the PowerPCs used too much power and had temperature issues. Low power usage and temperature control were the future from this point on.

So is a dual-core 2 GHz meant to deliver the same as a single-core 4 GHz? No. The truth is that adding cores is a way to hide memory latency. A core stalls when waiting for memory, and the answer was to have more than one core, so a stalled core wouldn't stop everything. Hyperthreading is a clear indication of what is going on here: it adds a 'fake' core to a hardware core. Call the two cores A and B; when A stalls on memory latency, B uses the real core, and vice versa. So if a single-core CPU delivers 100% and a dual core 200% (assuming perfect multithreading), one core with hyperthreading can deliver around 130% in common use cases. It's just one core, but it wastes less time on memory latency.

In other words, the fact that we have a bunch of cores is mainly there to hide the effect of memory latency.

How does memory work with regard to bandwidth and latency?
Think about an office building. There are two offices connected by a corridor: one office holds the workers, the other is a storage room. One person fetches what the workers need: he is given the number of a paper to fetch, walks to the storage, picks it up, walks back, and then starts over with another number, and so on. Now add a second person to pick up papers. Since the corridor is wide enough for them to pass each other, we get double the number of papers per hour, but the waiting time for each paper is unchanged. With 4 people, the throughput is 4 times that of one person, yet the waiting time for a paper is still unchanged. With 100 people, they queue up and jam everything.

It's essentially the same with the CPU requesting data from memory. The CPU transmits a request, and it can transmit multiple requests before the memory replies. This means that as long as the number of cores is low enough to avoid bandwidth issues, the cores can request data independently of each other without problems. Yes, they have to share the level 3 cache, which in some cases can be a problem, but in most cases more cores, and hence more memory requests in flight at the same time, is more beneficial than giving one core the entire level 3 cache.

Even without using multiple cores, the fluid boxes mentioned in the FFF will benefit from this. Why? Because hardware in the CPU outside the core detects patterns in memory reads, assumes it's a list that will all be needed, and starts generating requests by itself. If it guesses right, then by the time the CPU needs more memory it's already in the level 3 cache, or at least on its way. This is another way to combat memory latency.
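The difference is easy to demonstrate in code. In the sketch below (invented, self-contained), the prefetcher can stay ahead of the linear scan, while the pointer chase defeats it because the next address depends on the previous load; both loops touch the same amount of data:

#include <cstddef>
#include <numeric>
#include <vector>

// Linear scan: the next address is predictable, so the prefetcher can have
// each cache line ready before the core asks for it.
long sum_sequential(const std::vector<long>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Pointer chase: the next index depends on the current load, so once the
// array outgrows the caches, every step pays the full memory latency.
// Assumes `next` is a permutation of [0, data.size()).
long sum_chased(const std::vector<std::size_t>& next, const std::vector<long>& data) {
    long sum = 0;
    std::size_t i = 0;
    for (std::size_t steps = 0; steps < data.size(); ++steps) {
        sum += data[i];
        i = next[i];  // unpredictable jump: the prefetcher can't follow this
    }
    return sum;
}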

We can tell modern CPUs are designed with memory latency in mind. Take, for instance, the Core i7-8700B. It has a base clock of 3.2 GHz but can boost to 4.6. This means it can handle 3.2 when not (heavily) affected by memory latency. However, say it loses 50% of its time to memory latency (a realistic number!): it will then not do 3.2 billion cycles of work per second, but 4.6/2 = 2.3 billion. While the CPU waits for memory it executes nops (no operation), which have very low power consumption and hence produce little heat. So despite running faster, it has almost 30% fewer power-hungry cycles per second. This allows the extra speed without overheating, and the extra speed means it works faster while it is working, so it reaches the next memory request sooner.
In short: yes, memory is a bottleneck for Factorio, but it's latency-related, not bandwidth-related. Adding more cores will not change the latency issue for each core, and there is bandwidth to spare for extra cores.
Jap2.0 wrote: ↑
Tue Dec 11, 2018 1:05 am
Another large factor I've heard mentioned is that it will make the code much more difficult to maintain and likely cause a decent number of bugs.
That depends on the design, and what I'm proposing should avoid this issue. The key is that the result should be completely independent of the order in which the cores decide to work on the tasks. This is not only key to keeping multiplayer in sync, it also allows the programmer to add an on/off switch for splitting the work into multiple threads.

That allows automated tests in which the single-threaded and multithreaded runs must produce the same results, which by itself should catch issues caused by the multithreading.

Since the gameplay result is the same single-threaded and multithreaded, debugging can be done on just a single core, which takes care of the debugging-related issues.

As for a design that can hide bugs: that is true of bad designs, both single- and multithreaded. Multithreaded designs only make this harder if they fail the part about allowing multithreading to be switched off.
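A sketch of the kind of switch and test being described, with invented names (World, update_world, world_checksum); the point is only that both paths must hash to the same state:

#include <cassert>
#include <cstdint>

// Minimal stand-ins for the real game state and update; the names are invented.
struct World { std::uint64_t state = 0; };

std::uint64_t world_checksum(const World& w) { return w.state; }  // would hash all game state

void update_world(World& w, bool use_threads) {
    // Real code would dispatch to the serial or the parallel tick here; the
    // requirement is that both paths leave the state bit-identical.
    (void)use_threads;
    w.state = w.state * 6364136223846793005ULL + 1442695040888963407ULL;
}

// Any order-dependence in the threaded path shows up as a checksum mismatch.
void check_determinism(World a, World b) {
    update_world(a, /*use_threads=*/false);
    update_world(b, /*use_threads=*/true);
    assert(world_checksum(a) == world_checksum(b));
}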
Jap2.0 wrote: ↑
Tue Dec 11, 2018 1:05 am
I think you're underestimating how difficult it is to implement while keeping it as stable, deterministic, and maintainable as it currently is.
Did I say it would be easy? No, because I can't say that: it depends highly on the current implementation. I'm replying to a post saying that multithreading belts is a problem. My reply is that the assumed difficulty rests on the false assumption that belts can't run in parallel without inserters also running in parallel. I then go on to write about how game entities can be split up from a gameplay-logic point of view, i.e. identifying parts without shared memory. I can't tell how hard it would be to implement, but I can tell that these are candidates worth looking into.
Nightinggale wrote: ↑
Mon Dec 10, 2018 11:35 pm
Issue: what about inserters and vehicles, like unloading trains?
That line doesn't say it's easy. In fact, it states that I know the proposal has at least one unsolved issue, which could potentially kill it off.
Nightinggale wrote: ↑
Mon Dec 10, 2018 11:35 pm
There are plenty of options for multithreading; it's just about getting ideas on how to split the work into completely independent tasks. Two years ago I would have questioned the workload vs. the benefit of this, but the new CPUs change everything. Now I'm fairly certain that my next CPU upgrade (whenever that might be) will have at least 8 real cores.
Another hint that I'm saying implementing this might not be trivial.

That quote also says something else: this might not be worth it in a world where players have 2 or 4 cores. A world where players have 8 or 16 cores is a completely different case, and going from 1 to 2 or 4 cores is nothing compared to going from 1 to 16.
Cadde wrote: ↑
Tue Dec 11, 2018 2:46 am
It all boils down to complexity of each operation.

Imagine you try to multithread every item everywhere. This means assigning tasks to run each item in its own thread.
Doing so means you DOUBLE the amount of execution per item, and you still have to merge the results of those operations somewhere in the end. You lose performance.
Previously I mentioned using TBB. It only makes one thread per core and then adds a queue of tasks to each thread. This reduces the overhead of going multithreaded, but it doesn't remove it entirely.

While important in general, this matters particularly if we have, say, 500 entities that benefit from running in parallel, but 50% of them won't actually do anything. Since each queue lives in a thread, adding tasks that just get discarded doesn't carry the massive overhead it would if each had its own thread.

Also, Windows (or macOS/Linux) tries to assign equal CPU time to every thread that wants 100% CPU. Opening 500 threads because you have 500 entities means the OS swaps the active thread rather frequently, and swapping the active thread on a CPU has a massive overhead. By having only one 100%-CPU thread per core, the OS doesn't feel the need to swap active threads as much, which removes yet another source of overhead.
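For reference, a minimal sketch of that pattern using TBB's work-stealing scheduler; BeltGroup is an invented type, and tbb::parallel_for_each hands the groups to a fixed pool of roughly one worker per core rather than one OS thread per task:

#include <vector>
#include <tbb/parallel_for_each.h>

struct BeltGroup {
    bool has_moving_items = false;
    void move_items_one_tick() { /* belt-only item movement */ }
};

// TBB schedules these as tasks on its per-core worker threads; a group with
// nothing to do is a cheap task to discard, not a wasted OS thread.
void update_belts(std::vector<BeltGroup>& groups) {
    tbb::parallel_for_each(groups.begin(), groups.end(), [](BeltGroup& g) {
        if (g.has_moving_items)
            g.move_items_one_tick();
    });
}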

There is great potential in using all CPU cores, but there are also a lot of pitfalls that can kill performance.
Cadde wrote: ↑
Tue Dec 11, 2018 2:46 am
It's not the parallelization that's difficult; it's finding MEANINGFUL parallelization.
And that's where metrics come in: you measure the amount of time already spent in each stage, identify the bottlenecks, and attack the bottlenecks.
Perhaps it's not the belt simulation that's the problem; perhaps it's all the updating of visuals?
The only way to truly know whether a specific part is slow enough to be worth converting to multiple threads is to profile. Again, since we don't have the source code, we can only make guesses; sure, they can be based on a lot of insight, but the final verdict is a profiling measurement.

What I did was essentially try to remember what people have mentioned as slowdowns (a sort of profiling test) and then consider how to split those tasks so they don't share memory, allowing them to run in parallel. I can't tell in advance whether it will really work, because, as you mention, the graphical updates might be the culprit. But what I wrote serves two purposes: one is to find candidates for further investigation, the other is to reply to a statement about the difficulties of multithreading.
Cadde wrote: ↑
Tue Dec 11, 2018 2:46 am
My solution to this isn't about multithreading, but rather about merging items into blocks, sort of like how the wide chests mod works.
Instead of having x instances of items on the belt, as long as they are all moving unaffected (no bends, no adds/subtracts, and no separation from belt speed changes), merge them into a single block of 14 items (because 7 fit in one straight belt lane and there are 2 lanes, right?) and move those blocks as one sprite.
Check this against FFF #176 Belts optimization for 0.15.

Oktokolo
Filter Inserter
Posts: 883
Joined: Wed Jul 12, 2017 5:45 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Oktokolo »

Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
Single-core Factorio is bottlenecked by memory latency, but not by memory bandwidth. [...] In short: yes, memory is a bottleneck for Factorio, but it's latency-related, not bandwidth-related. Adding more cores will not change the latency issue for each core, and there is bandwidth to spare for extra cores.
The real question is whether parts of the logic could be moved to the GPU to profit from its typically lower memory latency (for some obscure reason, GPUs tend to always be one step ahead when it comes to RAM).

Nightinggale
Fast Inserter
Posts: 120
Joined: Sun May 14, 2017 12:01 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Nightinggale »

Oktokolo wrote: ↑
Tue Dec 11, 2018 5:34 am
The real question is whether parts of the logic could be moved to the GPU to profit from its typically lower memory latency (for some obscure reason, GPUs tend to always be one step ahead when it comes to RAM).
Very unlikely. The CPU moves small amounts of data in a (mostly) unpredictable pattern. GPUs draw big images, like 4K, which makes them move giant pieces of memory in a much more predictable pattern.

As a result, each of them has memory optimized for its type of usage: CPUs have memory optimized for low latency, while dedicated GPU memory is optimized for bandwidth. The CPU would not benefit from using GPU memory; in fact, since it sacrifices latency for bandwidth, the CPU would slow down by using GPU memory.

Another reason for not using the GPU for game logic is that the GPU is optimized for extreme speed at graphics, while the CPU has sacrificed speed for being able to do everything. As a result, some tasks are only possible on the CPU, which is why some FPS games are throttled by the CPU rather than the GPU. In essence, the GPU needs to send tasks to the CPU because it doesn't have the hardware to handle them itself.

So what kind of tasks can the GPU not handle? Essentially anything unpredictable, most strikingly conditional code. The simple task of "if some condition, then do something" is impossible. "If belt not blocked, move item on belt at the rate of the belt speed" is a CPU task because the GPU can't handle it. This makes the GPU useless for game logic. The kind of task it can do is "draw this object": it reads the size and location of the object and places it on the screen, a step-by-step task with no branching.

posila
Factorio Staff
Posts: 5201
Joined: Thu Jun 11, 2015 1:35 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by posila »

Oktokolo wrote: ↑
Tue Dec 11, 2018 5:34 am
The real question is, whether parts of the logic could be moved to the GPU to profit from its typically lower memory latency (for some obscure reason, GPUs tend to always be one step ahead when it comes to RAM).
GPUs have much worse memory latency than CPUs, in addition to lower clock speeds and much worse instruction latency. But they have huge memory bandwidth and are massively parallel, so they can crunch massively parallelizable tasks at an insane rate.

keyboardhack
Filter Inserter
Posts: 478
Joined: Sat Aug 23, 2014 11:43 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by keyboardhack »

Nightinggale wrote: ↑
Tue Dec 11, 2018 6:15 am
So what kind of tasks can the GPU not handle? Essentially anything unpredictable, most strikingly conditional code. [...] This makes the GPU useless for game logic.
It seems like your information about GPUs is 10 years old.
Modern graphics cards are so-called GPGPUs, which can absolutely execute branching code. Today's GPUs can essentially do whatever a CPU can.
Waste of bytes : P

Nightinggale
Fast Inserter
Posts: 120
Joined: Sun May 14, 2017 12:01 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Nightinggale »

keyboardhack wrote: ↑
Tue Dec 11, 2018 9:28 am
Seems like your information about GPUs is 10 years old.
Modern graphics cards are so called GPGPU which can absolutely execute branching code. Todays GPUs can essentially do whatever a CPU can.
Have you read the article you linked to?
The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally back from the GPU to the CPU
In short, GPGPU is the ability for the GPU to send tasks to the CPU. This means the programmer can write GPU code that contains a full set of CPU instructions. Not having to write both CPU and GPU code, plus the communication between them, is obviously much easier to work with.

However, my statement is still true: the GPU hardware itself can't branch and has to rely on the CPU to do that. Whether it's coded manually for each device or handled by GPGPU software doesn't matter; it's still a cooperation between two chips at the hardware level. It's highly likely that GPGPU produces more efficient code than hand-coding the interface each time, but it won't change the fact that the CPU hardware can do certain tasks which GPU cores can't.

If we turn this into a car analogy, it would be something like this: early cars had the driver control when to fire the spark plugs, and this had to be adjusted constantly because the timing depends on the engine RPM. Today the driver can ignore the issue. That doesn't mean modern cars run without spark plugs; instead, a computer measures data from the engine and fires the spark plugs with the best possible timing, giving better fuel efficiency than any manually controlled spark timing and freeing the driver to concentrate on driving instead of engine management.

posila
Factorio Staff
Posts: 5201
Joined: Thu Jun 11, 2015 1:35 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by posila »

Nightinggale wrote: ↑
Tue Dec 11, 2018 1:07 pm
The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally back from the GPU to the CPU
In short, GPGPU is the ability for the GPU to send tasks to the CPU. This means the programmer can write GPU code that contains a full set of CPU instructions.
That's not at all what that means. It means the CPU is able to read back the results of computation done on the GPU. GPUs can't cooperate with the CPU the way you described.

GPUs can branch, but branching on a GPU has more performance considerations than on a CPU. I'll write about it in an FFF someday.

pleegwat
Filter Inserter
Posts: 258
Joined: Fri May 19, 2017 7:31 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by pleegwat »

The reason I relate inserters and belts is that I assumed they would be in the same pass. If moving items forward on a belt is in a different pass, the two don't influence each other as strongly. Based on earlier Friday Facts on belt optimisation, however, I assumed moving items forward on belts is mostly free anyway. The question remains whether any two inserters grabbing off the same section of belt (same tile, one tile apart, etc.) need to modify the same belt memory.

Another consideration is that inserters (and also splitters) might act as latches: their input side would be processed in a different task than their output side. However, this would require a separate pass to move items from the input side to the output side.

I definitely agree that picking correct task sizes is a problem of its own. If tasks are too small, the overhead of doling them out is too large; if they are too large, then due to the unpredictability of task cost, one thread may end up blocking the whole update.

For performance, threads should work on the game state directly, not on copies handed out to them. This means, for example, that all actions affecting a single chest's inventory need to be in the same task, so that no further synchronization between threads is ever necessary when accessing it.

And because the process of assigning entities to systems is also slow, the systems need to be relatively stable; you don't want to re-split the base into systems every tick.

Cadde
Fast Inserter
Posts: 149
Joined: Tue Oct 02, 2018 5:44 pm

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Cadde »

Regarding memory latency, this is how modern CPUs do their best to avoid it:
https://www.youtube.com/watch?v=_qvOlL8nhN4
Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
...

Hyperthreading is a clear indication of what is going on here: it adds a 'fake' core to a hardware core. Call the two cores A and B; when A stalls on memory latency, B uses the real core, and vice versa. So if a single-core CPU delivers 100% and a dual core 200% (assuming perfect multithreading), one core with hyperthreading can deliver around 130% in common use cases. It's just one core, but it wastes less time on memory latency.
From my understanding that's not how hyperthreading works.
Hyperthreading operates regardless of what the instruction is, be it using the load/store unit (memory) or one of the ALUs or DIV/MUL units or floating point units etc.
What matters is which ports are available at the moment. A port can have an ALU, a DIV unit and a Vector unit in it. You can't use both the ALU and the div unit under the same port at the same time. But you have more ports that may also have an ALU, a LOAD and a STORE unit etc etc. It all depends on how the CPU is laid out.

So if not all ports are being used by the current thread, the CPU can let the other hyperthread's instructions use the free ports (the OS merely puts two threads on the sibling logical cores).
It's not that the core is "blocked" waiting for a memory response; it's that it isn't using all of its execution units at the same time for the same thread.
And that's where the extra performance (the 130% total mentioned above) comes from, IF you are running more than one thread.
Of course, there's also the limitation that two threads shouldn't write (STORE) to the same memory location at the same time, or you get data races that may or may not be handled gracefully by your operating system and/or your executable.

Essentially, a core has more than one ALU plus other execution units, and not all of them are used simultaneously, so other threads can use the idle ones in parallel. But not every idle execution unit is available either, because there's only a limited number of ports through which to issue work to them. A toy illustration follows.
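
A toy illustration of that (not a rigorous benchmark), under the assumption that the OS happens to co-schedule the two threads on sibling hyperthreads; pinning threads to specific logical cores is platform-specific and omitted here. One thread is a chain of dependent divisions that keeps the long-latency divider port busy, the other is a stream of independent adds that can fill the ALU ports the first thread leaves idle.

Code: Select all

#include <cstdio>
#include <thread>

volatile long long sinkA, sinkB; // keep the optimizer from deleting the loops

void divChain() {                 // dependent divisions: long-latency port work
    long long x = 1'000'000'007;
    for (int i = 0; i < 50'000'000; ++i) x = x / 3 + 1'000'000'007;
    sinkA = x;
}

void addLoop() {                  // independent adds: cheap ALU port work
    long long s = 0;
    for (int i = 0; i < 50'000'000; ++i) s += i;
    sinkB = s;
}

int main() {
    std::thread a(divChain), b(addLoop);
    a.join();
    b.join();
    std::printf("%lld %lld\n", (long long)sinkA, (long long)sinkB);
}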
Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
In other words the fact that we have a bunch of cores is mainly to hide the effect of memory latency.
It's not about hiding latency; that's what the first link is all about.
Here's a related video on hyperthreading: https://www.youtube.com/watch?v=k6PzjGwyKuY
You can ignore all the stuff about "PortSmash"; that is basically a way to probe the core for which units are being used by other threads, so an attacker can discern exactly which code paths are being executed, say, to determine the private key in some encryption scheme.

Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
How does the memory work regarding bandwidth and latency?
Think about an office building. There are 2 offices connected by a corridor: one office has the workers, the other is a storage room. There is one person fetching what the workers need. He is told the number of a paper to fetch, walks to the storage, picks it up, walks back, and then starts over with another number, and so on. We add another person to pick up papers. Since the corridor is wide enough for them to pass each other, we get double the number of papers per hour, but the waiting time for each paper is unchanged. Next we have 4 people, and the throughput is 4 times that of one person, yet the waiting time for a paper is unchanged. Next we have 100 people, and they queue up and jam everything.
I would describe it slightly differently. You are correct that the travel time is unchanged, but the waiting time depends entirely on whether the papers are in storage room 1, 2, 3, 15, 166, 1337... or already in the office, where they just need to land on the requester's desk.
It also depends on whether the papers needed are all in one box or spread across several boxes.

RAM (Random Access Memory) has one major defining latency: CAS (Column Address Strobe) latency. That is the time it takes the memory to point at a different address, or storage room as it may be.
Imagine your paper-fetching person (the paperboy) needing to move a bridge to reach a particular storage room. There's only one bridge, and only one person can cross it and return at a time.
The good thing is that the storage rooms are divided into sectors, so one bridge covers storage rooms 1-10, another takes you to storage rooms 11-20, and so on. But only ONE paperboy at a time can go to storage rooms 1 through 10 by moving that bridge.
If the bridge is already aligned with the correct storage room, the CAS latency is only a matter of verifying that the bridge is indeed pointing at the correct room, and off the paperboy goes to fetch the next papers.
But it's rarely that simple, as there are always other offices that need papers from storage rooms 1, 2 and 3, so the bridge has to be moved; the offices rarely (but could) cooperate to say "let all the stuff from room 1 be collected first, then room 2, and finally room 3".

Another thing I would change in your example is to add the office's local storage, or the IN/OUT box if you will.
Once a paper has been fetched (copied), it doesn't just land on the desk (a CPU register) but also in the IN/OUT box (the CPU cache), so should that same paper be requested again, it is taken directly from the IN/OUT box, and the paperboy only has to walk a few feet rather than down the corridor, align the bridge, and walk back again.

And then I would like to point out that the paperboy isn't stupid. In each storage room the papers aren't just lying around willy-nilly; they are neatly stored in boxes. Each storage room has a copy machine (because that's what you do when you read from memory: you make a copy, you don't move the data) that can copy an entire box in an instant. So instead of fetching just one paper (a byte, short, long, long long), the paperboy takes the entire box with him. There's a good chance that the other papers the office needs next are in that same box. The paperboy doesn't care whether he's carrying one paper or a whole box of papers; weight (bandwidth) is not an issue.
So the entire box lands in the office (cache memory), and any subsequent requests for those papers are served from cache, straight out of the IN/OUT box.

And that's the gist of the memory-related fluid optimizations so far: everything relating to one line of pipes is stored in sequential memory (the same box), so when the request is fulfilled, it's all there right away.
Finally, once the data is no longer needed in cache, the paperboy takes the box back to the storage room for archiving. And that's where the issues with multithreading come in: whenever a box is copied from memory, the paperboy hangs a "LOCKED" sign on the original box, which is only removed when the altered copy is returned to its original location. The little experiment below shows how big the difference between box-by-box and room-by-room fetching can get.
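
A small experiment matching the analogy (a sketch; exact numbers depend on the machine): the same N "papers" are fetched twice, once walking box-by-box through sequential memory and once sending the paperboy to a random room every time. The amount of data is identical; the wall time is not.

Code: Select all

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = 1u << 23; // ~8M links (64 MB), far bigger than any cache
    std::vector<size_t> next(N);

    auto walk = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        size_t i = 0;
        for (size_t step = 0; step < N; ++step) i = next[i]; // chase the chain
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: %lld ms (end=%zu)\n", label, (long long)ms, i);
    };

    // Sequential "boxes": link i points to i+1, prefetcher-friendly.
    std::iota(next.begin(), next.end(), size_t{1});
    next[N - 1] = 0;
    walk("sequential");

    // One random cycle over all links: every hop is a new "storage room".
    std::vector<size_t> perm(N);
    std::iota(perm.begin(), perm.end(), size_t{0});
    std::shuffle(perm.begin(), perm.end(), std::mt19937_64{42});
    for (size_t i = 0; i < N; ++i) next[perm[i]] = perm[(i + 1) % N];
    walk("random");
}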
Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
In short: yes memory is a bottleneck for Factorio, but it's only latency related, not bandwidth. Adding more cores will not affect the latency issue for each core and there is bandwidth to spare for extra cores.
Not really; you can hit memory bandwidth limits even with a single core if you really want to, or do so by happy accident. But we are talking about 40+ GB/s here. I find it very unlikely that you would reach those speeds in Factorio, as it would mean moving about 666 MB of data between the memory modules and the CPU every tick (40 GB/s divided by 60 ticks). Devilish as that number may sound, I really don't think there's that much data to process in a single tick.
But it is possible, because the memory controller fetches BOXES, not single papers, as you said yourself in another way.
I said bandwidth isn't an issue... that is, until you start doing SETI (Search for ExtraTerrestrial Intelligence), bitcoin mining, software rendering, or any other memory-intensive work that streams through memory sequentially with the layout in mind.
Factorio isn't that kind of complicated, so bandwidth isn't an issue for Factorio.
Nightinggale wrote: ↑
Tue Dec 11, 2018 4:56 am
Check this against FFF #176 Belts optimization for 0.15.
Almost. It's a good start for sure; I wasn't aware of it. But each item on a belt is still fed to the renderer, right?
Either way, I am sure the Factorio devs are on top of this way more than we are at this point in time.
They didn't used to be, but this "close" to a 1.0 release, it's all about optimizations. In such a logic-heavy game, even a single operation in the right place can make all the difference.

-------------------------------------------------------

As for the discussion on GPUs: GPUs are AWESOME at what they do, which isn't so much logic...
GPUs are MASSIVE floating-point math units. They can do logic, but the CPU is actually faster than the GPU in this regard. The GPU is WAAAAAY faster at floating-point operations, matrix transforms, and filling pixels on screen.
All of this is HIGHLY parallelized, because graphics is rarely serial in nature.
You feed the GPU massive amounts of data (textures and geometry) and those (ideally) stay there forever. Then, every frame, you move, scale and rotate that geometry and nothing more. The GPU takes care of filling the geometry with texture/surface data and doing all the necessary calculations for what goes on screen when, and what is obscured by what else.

GPU = MATH
CPU = LOGIC

Just because you can offload some computations to the GPU doesn't mean it's always faster to do so. And there's another major bottleneck here that really depends on the system: the PCI-E bus. If you expect the GPU to "remember" things between ticks, then you are kind of out of luck. Besides, the GPU is already quite busy holding all the sprite graphics.

GPUs are also excellent at SETI (see above) and bitcoin mining (though there's special hardware for JUST that task too, because GPUs are limited in this regard), as well as at guessing trillions of password combinations, etc.
But that's again because such tasks are heavy on MATH and light on LOGIC.
GPUs do physics calculations too, BTW, again because physics is a MATH problem, not a LOGIC one.

"But what about GoL (Game of Life) on GPUs?" (https://nullprogram.com/blog/2014/06/10/)
Again, it's down the the GPU's superior MATH capabilities. And it's superior texture capabilities (fill rate). And it's highly parallelizable nature.

And don't forget that sending data to and from the GPU is slow compared to RAM, and you would still have to do that every tick to sync everything up. Even if the GPU could do some tasks faster, you'd be better off running them on the CPU, because the time to transfer all the data back and forth would become the new bottleneck.
As I said, you send all your geometry and textures ONCE to the GPU and then send simple draw calls from there on out. Between GPU and CPU, bandwidth is the real problem.

Nightinggale
Fast Inserter
Fast Inserter
Posts: 120
Joined: Sun May 14, 2017 12:01 pm
Contact:

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Nightinggale »

I give up investigating how the GPU hardware handles conditional branching. There are statements like "OpenCL allows workloads to be shared by CPU and GPU", but can both do everything, or how do they split the work? It's really not clear, and since I'm not actually going to use it personally, at least not in the near future, I won't spend more time on this. It seems like it could take a while.
Cadde wrote: ↑
Tue Dec 11, 2018 5:35 pm
"But what about GoL (Game of Life) on GPUs?" (https://nullprogram.com/blog/2014/06/10/)
Looks like it is running an if-else structure in shaders. Does that mean shaders can handle conditional branching? If so, then different shader invocations might end up in different places in the code. Does that mean shaders are not SIMD? Has GPU hardware left SIMD behind? That doesn't make sense, because both CUDA and OpenCL claim to be SIMD.

Somehow I feel like this talk about GPUs and particularly the research following it has left me with more questions than answers.
posila wrote: ↑
Tue Dec 11, 2018 3:28 pm
GPU can branch, but branching on GPU has more performance considerations than on CPU. I'll write about it in FFF someday.
Please do. That would be interesting to read.
Cadde wrote: ↑
Tue Dec 11, 2018 5:35 pm
It's not about hiding latency; that's what the first link is all about.
Here's a related video on hyperthreading: https://www.youtube.com/watch?v=k6PzjGwyKuY
Rather oddly I encountered that video today before you posted the link.

As for the hiding-latency part, I got that from a university professor. Now a random forum user says something else. This forced me to stop and think it through, and weirdly enough the random guy from the internet makes the most sense. I have a really odd feeling right now... It reminds me of a first-year lecture where the professor said something like: always evaluate all information given to you, and never trust any source 100%, not even him.

It makes perfect sense to merge this with out-of-order execution (OOOE). However, talking about how OOOE works is actually a bit problematic if you want to go into details. The thing is that the CPU is sold on an instruction set, and how it handles those instructions internally is up to the vendor. This means you can buy two different Intel CPUs and they can easily have different OOOE implementations. In fact, the original Intel Atom doesn't have OOOE at all (it was aimed at low power rather than performance).

As for OOOE and memory latency: yes, reading something from memory into a register shouldn't stall the core unless the next part of the code causes a stall. The problem is that it can. Say we have "if the contents of a pointer > 0, then some function call". The core issues the read of that address and then encounters the if statement. It doesn't know the answer before the variable arrives from memory, so it uses a dirty trick: it takes a guess. Say it assumes the condition is true. It then executes whatever code is in the true branch, with the catch that it must be able to revert this in case the guess was wrong. If that branch starts with a function call, the core stalls, because (as far as I can tell) OOOE will not speculate through function calls: once a call is encountered, it waits until all the previous instructions have executed, which also ensures it never makes a function call based on an incorrect guess at a conditional branch.

This means that even with OOOE it is possible to stall a core completely due to memory latency. Even worse, it's really likely there are plenty of cases where you have a pointer to something and then "if (getWhatever()) { call something }". In other words, "normal" code will likely cause CPU cores to stall completely on memory latency.

Considering OOOE doesn't follow function calls, I can only assume simple get functions should be inlined. If they are not, they remain real function calls; but if properly inlined, the compiled code makes no call at all. Instead it becomes the class pointer plus a constant offset, followed by a read of that address from memory, and that is something which should work with out-of-order execution. I wonder which is worse about a lack of inlining: the function call overhead, or the fact that it can stop OOOE from looking into the rest of the calling function. A minimal example of the difference is sketched below.
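
A minimal sketch of the difference (the noinline attribute is GCC/Clang-specific and just stands in for a getter the optimizer can't see, e.g. one defined in another translation unit):

Code: Select all

struct Entity {
    int value;
    int getValue() const { return value; } // visible in the header: inlinable
};

// Stand-in for a getter compiled elsewhere; the optimizer can't see through it.
__attribute__((noinline)) int opaqueGetValue(const Entity* e) { return e->value; }

int updateInlined(const Entity* e) {
    // Folds down to "load e->value, compare": speculation can keep going.
    return e->getValue() > 0 ? 1 : 0;
}

int updateOpaque(const Entity* e) {
    // A real call instruction the core has to deal with before the compare.
    return opaqueGetValue(e) > 0 ? 1 : 0;
}

int main() {
    Entity e{42};
    return updateInlined(&e) + updateOpaque(&e); // 2
}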


Somehow I feel like this thread gets less to do with Factorio for each new post.... :oops:

Cadde
Fast Inserter
Fast Inserter
Posts: 149
Joined: Tue Oct 02, 2018 5:44 pm
Contact:

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by Cadde »

Nightinggale wrote: ↑
Wed Dec 12, 2018 1:39 am
I give up investigating how the GPU hardware handles conditional branching.
I am of the firm belief that unless Factorio goes into solving the answer to life, the universe and everything, it should stay out of the GPU for any of its simulation and rely on the GPU only for what is drawn on screen. I don't recall if you read about my issues with FPS (not actual UPS).
If you can offload a lot of render preparation to the GPU, then that's all good and dandy. The GPU can easily handle animations, and belts can be animated for a certain timeframe if you can find repeating patterns over long stretches.

So instead of sending each belt item's position, you send a job of the kind "this block of items will move this far without disruption, so keep doing that for the next X iterations", until it's interrupted by a user action or by biters destroying a belt. The simulation stays in the background, completely unchanged in this regard; it's only what you see that gets offloaded entirely to the GPU. A rough sketch of what such a record could look like follows below.
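
A rough sketch with entirely hypothetical names: one record describes a whole run of items, and the renderer extrapolates each item's position from the tick counter until the record expires.

Code: Select all

// Hypothetical record the CPU would hand to the renderer once per belt run.
struct BeltRun {
    float startX, startY;   // position of the first item at startTick
    float dirX, dirY;       // movement direction (unit vector)
    float tilesPerTick;     // belt speed
    float itemSpacing;      // gap between consecutive items, in tiles
    int   itemCount;        // items in the run, evenly spaced
    int   startTick;        // tick the record was issued
    int   validUntilTick;   // resent only when something disrupts the run
};

// Where item `i` of run `r` sits at tick `t` (while t <= r.validUntilTick).
inline float itemX(const BeltRun& r, int i, int t) {
    return r.startX + r.dirX * (r.tilesPerTick * float(t - r.startTick)
                                - r.itemSpacing * float(i));
}
inline float itemY(const BeltRun& r, int i, int t) {
    return r.startY + r.dirY * (r.tilesPerTick * float(t - r.startTick)
                                - r.itemSpacing * float(i));
}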
Nightinggale wrote: ↑
Wed Dec 12, 2018 1:39 am
Somehow I feel like this talk about GPUs and particularly the research following it has left me with more questions than answers.
As they always do. People easily mistake a GPU for just another multicore CPU. It's not suited for what the CPU does; it's suited for massively parallel math problems.

Saying everything would run better on the GPU is like saying a quantum computer can run Crysis. Technically it can, but you'd get all the answers you needed at the same time, in a single frame. You both win and lose the game in one frame, and all you need to do is unravel how it got there, unless you are only interested in whether the cat is in the box at all, and whether it is dead or alive.
Nightinggale wrote: ↑
Wed Dec 12, 2018 1:39 am
It makes perfect sense to merge this with out-of-order execution (OOOE). However, talking about how OOOE works is actually a bit problematic if you want to go into details.

...
I don't recall saying OOOE was the end-all solution to memory latency, just that it addresses the order in which things are executed, thus letting the CPU "multithread" the actual code even when it's serial in nature.

Whereas hyperthreading wasn't built for that purpose at all. It was so that other programs could use the available resources of each core simultaneously.
OOOE helps ALL programs run faster.
Hyperthreading only helps programs that run simultaneously with others; single-threaded programs gain no advantage from it.

Users of 4-core systems always wonder why their CPU usage is pegged at 25% and not 100%. Are their CPUs broken?
No, there are 4 cores and only one is used: 1/4 = 25%.
But each core has a "virtual" sibling... does that mean the virtual core is also used 100% (12.5% of the total)?
No, it means any use of the physical core is counted as both the physical and the virtual core being used. If you want to measure hyperthreaded usage, you need software that does this separately for the virtual cores.

[Image: screenshot of a per-core CPU usage graph from the poster's system]

As you can see in that picture, hyperthreading isn't being used at all at that moment, as the save I am playing is very lightweight; the load instead spreads across my 4 physical cores.
Hyperthreading can help with latency issues through a happy side effect: there are still execution units available while the load/store units are waiting for memory. But it cannot help when memory is busy for all threads, nor can it help within a single thread; hyperthreading only works across separate threads.
And separate threads don't tend to work on the same memory.

You do bring up a good point, though: speculative execution is indeed helpful. Except you said the CPU reverses its results. It doesn't reverse them, it simply doesn't use the wrong results. This is at the heart of what Spectre is: it exploits speculative execution to create side channels into memory it shouldn't have access to.

Since the blog post is about optimizations, this still has plenty to do with Factorio. Anyone else reading these posts may realize that it isn't always as simple as just using all cores at 100%. Sometimes you simply can't, because you can't find a good balance. If you manage to split everything into neat chunks that all execute in the same time N, then cool, you can use 100%, since you know exactly how long each chunk will take and that they will all finish simultaneously.
But in reality it's never that simple. Some parts will always take longer than the rest, and that's the weak link in the chain, the one thing that slows everyone else down. And it all comes down to how one built their base in Factorio... let's not go into the math of how many possible combinations there are for building a base. But one thing that can help a user's UPS is staying within chunks, since each chunk is bound to have all its data in the same box. You could gain 10% (number pulled out of a smelly body hole) performance this way.

posila
Factorio Staff
Factorio Staff
Posts: 5201
Joined: Thu Jun 11, 2015 1:35 pm
Contact:

Re: Friday Facts #271 - Fluid optimisations & GUI Style inspector

Post by posila »

Nightinggale wrote: ↑
Wed Dec 12, 2018 1:39 am
Looks like it is running an if-else structure in shaders. Does that mean shaders can handle conditional branching? If so, then different shader invocations might end up in different places in the code. Does that mean shaders are not SIMD? Has GPU hardware left SIMD behind? That doesn't make sense, because both CUDA and OpenCL claim to be SIMD.
A more accurate name for the GPU architecture is SIMT (Single Instruction, Multiple Threads). The article describes control-flow handling the following way:
A downside of SIMT execution is the fact that thread-specific control-flow is performed using "masking", leading to poor utilization where a processor's threads follow different control-flow paths. For instance, to handle an IF-ELSE block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.
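
In code terms, here is a toy model of that masking (plain C++ standing in for what one SIMT processor does with its lock-stepped threads): both sides of the if/else are executed by every lane, and the mask decides which lanes commit their results.

Code: Select all

#include <array>
#include <cstdio>

int main() {
    // Eight lock-stepped "threads" evaluating: if (x > 0) out = x * 2; else out = -x;
    std::array<int, 8> x = {3, -1, 4, -1, 5, -9, 2, -6};
    std::array<int, 8> out{};
    std::array<bool, 8> mask;

    for (int i = 0; i < 8; ++i) mask[i] = x[i] > 0;

    // THEN path: all lanes compute it, only masked lanes commit.
    for (int i = 0; i < 8; ++i) { int r = x[i] * 2; if (mask[i]) out[i] = r; }

    // ELSE path: all lanes compute it too, with the mask inverted.
    for (int i = 0; i < 8; ++i) { int r = -x[i]; if (!mask[i]) out[i] = r; }

    for (int v : out) std::printf("%d ", v); // 6 1 8 1 10 9 4 6
    std::printf("\n");
}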

Post Reply

Return to “News”