Well, I'm not really fussed about the system remaining as it is, but some sort of viscosity control in data would be nice.
I look forward to hearing more about the new fluid system.
Theoretically yes, although at that point at least 14 of those cores would be mostly idle, because fluids don't take up much time compared to the rest of the game (especially when divided several times), and fluids can't constantly be running (there are times they have to be done, or can't start yet, due to determinism and such). Another 16 (I believe that's the max) could be used for render preparation, although that's not a major use of CPU; one core would be used for the main game processes, and one for OS stuff (which wouldn't be anywhere near as intensive as Factorio, and could perhaps spill over into the 14 remaining cores too). Belts are not threaded and most likely never will be.
They could try a similar 'belt systems' trick, but determining whether entities need to be linked may be trickier, and the systems are probably much larger.

Jap2.0 wrote: ↑Mon Dec 10, 2018 9:16 pm
"Theoretically yes, although at that point at least 14 of those cores would be mostly idle […]"
Note: this is all just to the best of my knowledge; I may be wrong in some areas.
I disagree with that statement. You are mixing inserters and belts. Imagine pure belts and treat them like pipes: you make one "belt group". Think of it this way: each belt starts in a unique group. If a belt moves items onto another belt, it joins that belt's group instead. Follow this idea around the entire belt grid. The result is that if two belts aren't in the same group, we know whatever is on one belt can't affect the other belt. Belt groups expand through belts, splitters, undergrounds, loaders etc.: basically anything that can act as a belt.

pleegwat wrote: ↑Mon Dec 10, 2018 9:56 pm
"They could try a similar 'belt systems' trick […]"
The main considerations I'm thinking of:
- Can two inserters insert into the same assembler on the same tick from different threads?
- Can two threads manipulate different locations in the same belt run at the same time (ring buffer)?
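The grouping rule described above (each belt starts in its own group; groups merge whenever one belt feeds another) is essentially a union-find (disjoint-set) structure. A minimal sketch of how group membership could be computed, with made-up belt IDs — this is an illustration of the idea, not Factorio's actual code:

```python
class BeltGroups:
    """Disjoint-set (union-find) over belt IDs, with path compression."""
    def __init__(self):
        self.parent = {}

    def find(self, belt):
        # Each belt starts in its own group.
        self.parent.setdefault(belt, belt)
        root = belt
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[belt] != root:   # path compression
            self.parent[belt], belt = root, self.parent[belt]
        return root

    def connect(self, src, dst):
        # "If a belt moves items on to another belt, join that group."
        self.parent[self.find(src)] = self.find(dst)

    def same_group(self, a, b):
        return self.find(a) == self.find(b)

groups = BeltGroups()
groups.connect("belt_1", "belt_2")      # belt_1 feeds belt_2
groups.connect("belt_2", "splitter_1")  # splitters join groups too
groups.connect("belt_9", "belt_10")     # an unrelated belt run

assert groups.same_group("belt_1", "splitter_1")
assert not groups.same_group("belt_1", "belt_9")  # safe to update in parallel
```

If two belts end up with different roots, nothing on one can ever reach the other, which is exactly the property that makes per-group threading safe.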
Sure, it might be possible, but you have to consider how to transfer everything between those threads, and memory latency/bandwidth as bottlenecks. Another large factor I've heard mentioned is that it will make the code much more difficult to maintain and likely cause a decent number of bugs. I think you're underestimating how difficult it is to implement while keeping it as stable, deterministic, and maintainable as it currently is.

Nightinggale wrote: ↑Mon Dec 10, 2018 11:35 pm
"I disagree with that statement. You are mixing inserters and belts. […]"
With belt groups like that, execute all groups in parallel. This execution will move items and only that. This should work just fine multithreaded.
Once done, only one thread is active and then other entities can join in on the action, like loaders will take items to/from storage, inserters can pickup or place items etc.
If we want to be really aggressive about multithreading, allow splitters to be part of two groups. Splitter handling then becomes a two-step process. First is the movement of items, done in parallel together with belt item movements. Next is moving items left/right. This step can also be done in parallel, as moving items in one splitter will not affect movements in another splitter.
This approach lets belts move items using a nearly unlimited number of cores. Think about your bus layout. Will all belts be in the same group? You will have plenty of groups, particularly if you use splitters to branch out items, as each branch could be made a different group.
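The scheme above implies a two-phase tick: phase 1 moves items within each group in parallel (no group touches another group's memory), then phase 2 runs single-threaded so inserters and loaders can transfer across groups. A sketch under assumptions — the group names, belt speed, and pickup rule are invented, and Python threads stand in for what a C++ engine would do with real workers:

```python
from concurrent.futures import ThreadPoolExecutor

# Each group owns its own item positions; groups never share memory.
groups = {
    "bus_north": [0.0, 1.0, 2.0],
    "bus_south": [0.5, 1.5],
    "mall_feed": [3.0],
}
BELT_SPEED = 0.03125  # tiles per tick, an illustrative rate

def move_group(items):
    # Phase 1: pure item movement, writes only to this group's data.
    for i in range(len(items)):
        items[i] += BELT_SPEED

def tick(groups):
    # Phase 1: all groups advanced in parallel, no locks needed.
    with ThreadPoolExecutor() as pool:
        list(pool.map(move_group, groups.values()))
    # Phase 2: single-threaded; inserters/loaders may now cross groups.
    # (Illustrative rule: an inserter grabs the front item of mall_feed.)
    if groups["mall_feed"] and groups["mall_feed"][0] >= 3.0:
        groups["mall_feed"].pop(0)

tick(groups)
assert groups["bus_north"][0] == 0.03125
assert groups["mall_feed"] == []   # picked up in the sequential phase
```

Because phase 1 never writes outside its own group and phase 2 is sequential, the end state cannot depend on thread scheduling.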
We can move on and do the same with inserters. An inserter has a pickup and a drop tile. If a tile is an entity (like an assembler), use the entity as the tile in this context. Any other inserter picking up from or dropping to any of those tiles joins the same group, and the groups spread like that. This means two inserters picking up from the same tile of belt will run on a single core, hence no conflict. The same goes for figuring out which inserter should move items to/from assemblers and so on. Loaders should act like inserters in this context: they pick up from themselves and drop into storage, or vice versa. You will, however, end up with plenty of groups which can work independently from each other, hence running groups on multiple cores.
Issue: what about inserters and vehicles, like unloading trains?
We could consider entity groups, like inserters can group themselves with the entities they work with. If we do this, then the modding API should allow assigning modded entities to join groups, like a crafting combinator joins the group of whatever assembler it is controlling. This way assemblers can be multithreaded too.
We are just getting started. Assign groups to train tracks and allow each track system to run multithreaded. Maybe you have one big network, but having multiple isn't far-fetched. There is a mod which adds trains on water, meaning there will be at least two networks (land and water). Belt movements can run in parallel with trains. Power plants can run in parallel with both trains and belts (remember, loaders and inserters will not be active here). Power plants will, however, not be able to run in parallel with pipes, as they require steam to be fixed.
There are plenty of options for multithreading. It's just about getting the ideas on how to split the work into completely independent tasks. Two years ago I would have questioned the workload vs. the benefit of this, but the new CPUs change everything. Now I'm fairly certain that my next CPU upgrade (whenever that might be) will have at least 8 real cores.
This is a really weird situation and a post which wasn't easy to start writing. I wrote a post with (or based on) correct facts. You object to what I wrote. The weird part is that you also based your post on correct facts (or at least mostly correct). This sounds impossible, yet it's the case, and the explanation is in the tiny details of how hardware operates. This situation actually highlights why using more than one CPU core can be problematic.

Jap2.0 wrote: ↑Tue Dec 11, 2018 1:05 am
"Sure, it might be possible, but you have to consider how to transfer everything between those threads […]"
The real question is whether parts of the logic could be moved to the GPU to profit from its typically lower memory latency (for some obscure reason, GPUs tend to always be one step ahead when it comes to RAM).

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
"This is a really weird situation and a post, which wasn't easy to start writing. […]"
I will try to explain in more detail the thinking behind my post, though it will mainly be about the CPU-memory interface. Factorio is a game which attracts people with an inner engineer, and it certainly attracted the attention of this electrical engineer. Let's just say I have a fairly decent understanding of what goes on in a computer at the hardware level.
I did and the answer is simple: it works without communication. As I mentioned earlier, communication between threads is slow and can cause multi threaded software to be slower than single core. Because of this, the whole goal is to find code, which doesn't need to communicate between threads.
A good candidate is belt1 and belt2, if those two aren't connected. The belts are able to write to items (position) and the belt they are on (what is on me); at least I assume so (I obviously don't have the source code). Since no belt or item on belt1 will come in contact with belt2 and vice versa, they can read/write to memory without ever writing to a shared variable, meaning no communication is needed.
This means there is no communication between threads and no need to "transfer everything".
Single core Factorio is bottlenecked by memory latency, but not memory bandwidth. I know this because a single CPU core is simply not fast enough to come near the bandwidth limitations of memory. Bandwidth could become an issue on a 64 core system, but let's be real: nobody plays Factorio on such a system. Bandwidth shouldn't even be an issue with 8 real cores.
(long explanation of how memory latency became an issue and how multiple cores play a role in the solution)

Memory latency is a real issue and it is indeed slowing down Factorio. Again, I can say that with 100% certainty, because the only thing not slowed down by memory latency is those specially designed CPU benchmarking tools which push the CPU as hard as possible.

In short: yes, memory is a bottleneck for Factorio, but it's only latency related, not bandwidth. Adding more cores will not affect the latency issue for each core, and there is bandwidth to spare for extra cores.
To understand memory latency, let's start at the beginning. Back in the 80s, memory was instant: the CPU would read a memory cell and have the result during the same cycle. In fact the famous Amiga 500 had memory twice as fast as the CPU, meaning the CPU would only use the memory half the time, leaving the memory free for other hardware the remaining half of the time, and that without slowing down the CPU due to a queue.
CPUs increased in speed and so did memory, but CPUs increased faster. This introduced memory latency. CPU caches were introduced to combat this. Still CPUs kept increasing faster than the memory and the latency got worse and worse when counted in CPU cycles. That crazy race to get the most MHz CPU certainly didn't help here. It also didn't result in much faster CPUs because while they provided more cycles per second, they also did less for each cycle.
Around 2005 the world of CPUs changed. The race for MHz was over and instead we had a single CPU chip with 2 cores. Why this change? The answer is memory latency. The projected path of the MHz craze meant CPUs were heading towards killing their performance by always waiting for memory. Lower clock speeds and more cores made CPUs more efficient. This, by the way, is also when we saw a move from most performance possible to most performance per watt, because CPUs were overheating at the crazy clock speeds. 2006 was when Apple switched from IBM to Intel CPUs. They went from a 2 GHz G5 (single core) to a 2 GHz dual core, yet the core temperature dropped significantly. The main reason for the switch was that PowerPCs were using too much power and had temperature issues. Low power usage and temperature control were the future from this point.
So is a dual core at 2 GHz meant to deliver the same as a single core at 4 GHz? No. The truth is that adding cores is a way to hide memory latency. A core stalls when waiting for memory, and the answer was to have more than one core, so a stalled core wouldn't stop everything. Hyperthreading is a clear indication of what is going on here. It adds a fake core to a hardware core. Let's call the two cores A and B. When A stalls on memory latency, B will use the real core and vice versa. This means if a single core CPU delivers 100% and a dual core 200% (assuming perfect multithreading), one core with hyperthreading can deliver around 130% in common use cases. It's just one core, but it wastes less time on memory latency.
In other words the fact that we have a bunch of cores is mainly to hide the effect of memory latency.
How does the memory work regarding bandwidth and latency?
Think about an office building. There are two offices connected by a corridor; one office has the workers, the other is a storage room. One person fetches what the workers need: he is given the number of a paper to fetch, walks to the storage, picks it up, walks back, and then starts over with another number, and so on. Now add another person to pick up papers. Since the corridor is wide enough for them to pass, we get double the number of papers per hour, but the waiting time for each paper is unchanged. With 4 people, the throughput is 4 times that of one person, yet the waiting time for a paper is still unchanged. With 100 people, they queue up and jam everything.
It's essentially the same with the CPU requesting data from memory. The CPU transmits a request and can transmit multiple requests before the memory replies. This means that as long as the number of cores is low enough not to run into bandwidth issues, the cores can request data independently from each other without problems. Yes, they have to share the level 3 cache, which in some cases could be a problem, but in most cases added cores (hence more memory requests in flight) are more beneficial than one core having the entire level 3 cache to itself.
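The office analogy is Little's law: requests in flight = bandwidth × latency. Plugging in round ballpark numbers (assumed here: ~40 GB/s of memory bandwidth, ~80 ns per uncached load, 64-byte cache lines — not measured figures) shows how many independent requests must be outstanding before bandwidth, rather than latency, becomes the limit:

```python
BANDWIDTH = 40e9     # bytes/s, assumed for a dual-channel DDR4-class system
LATENCY = 80e-9      # seconds per uncached load, a typical ballpark
CACHE_LINE = 64      # bytes fetched per memory request

# One request at a time (a fully serial pointer chase):
serial_throughput = CACHE_LINE / LATENCY             # ~0.8 GB/s
# Requests that must be in flight to saturate the bus (Little's law):
needed_in_flight = BANDWIDTH * LATENCY / CACHE_LINE  # ~50 requests

assert round(serial_throughput) == 800_000_000
assert round(needed_in_flight) == 50
```

A single latency-bound core chasing pointers uses about 2% of that bandwidth, which is why extra cores (or prefetching) can issue their own requests without stepping on each other.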
Even without using multiple cores, the fluid boxes mentioned in the FFF will benefit from this. Why? Because hardware in the CPU outside the core (the prefetcher) will detect a pattern in the memory reads, assume it's a list where everything is needed, and start generating requests by itself. If it guesses right, then by the time the core needs more memory it's already in the level 3 cache, or at least underway. This is another way to combat memory latency.
We can tell modern CPUs are designed with memory latency in mind. Take for instance the Core i7-8700B. It has a base clock of 3.2 GHz but can boost to 4.6. This means it can sustain 3.2 when not (heavily) affected by memory latency. However, say it's losing 50% of its time to memory latency (a realistic number!): it will then not do 3.2 billion calculations/second, but 4.6/2 = 2.3 billion. When the CPU is waiting for memory, it executes no-ops, which have very low power consumption, hence heat. This means that despite running faster, it has almost 30% fewer power-hungry cycles every second. That allows the extra speed without overheating, and the extra speed means it works faster while it is working, so it reaches the next memory request sooner.
This depends on the design. What I'm proposing should avoid this issue. The key is that the result should be completely independent of which order the cores decide to work on the tasks. This is not only key to keeping multiplayer in sync, it also allows the programmer to add some on/off switch for splitting into multiple threads.
This allows automated tests where the single threaded and multi threaded paths should produce identical results, which by itself should catch cases where the multithreading introduces issues.
Since the gameplay result is the same for single threaded and multithreaded, debugging can be done in just a single core, which takes care of the issues related to debugging.
As for a design which can hide bugs: that is true of bad designs, both single and multi threaded. Multi threaded designs make this harder if they fail the part about allowing multithreading to be switched off.
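The on/off switch plus equality test described here can be sketched directly: run the same ticks once with the parallel path enabled and once without, and assert the world states match. The world layout and update function are invented for illustration; the point is only the test harness shape:

```python
from concurrent.futures import ThreadPoolExecutor

def advance_group(group):
    # Placeholder per-group update: move every item one position.
    for i in range(len(group)):
        group[i] += 1

def run_ticks(world, ticks, threaded):
    """Advance independent groups; the result must not depend on `threaded`."""
    for _ in range(ticks):
        if threaded:
            with ThreadPoolExecutor(max_workers=4) as pool:
                list(pool.map(advance_group, world))
        else:
            for group in world:
                advance_group(group)
    return world

single = run_ticks([[0, 5], [10], [20, 25]], ticks=3, threaded=False)
multi  = run_ticks([[0, 5], [10], [20, 25]], ticks=3, threaded=True)
assert single == multi == [[3, 8], [13], [23, 28]]
```

Because the groups share no memory, any divergence between the two runs would indicate a threading bug rather than legitimate nondeterminism.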
Did I say it would be easy? No, because I can't say that; it highly depends on the current implementation. I'm replying to a post saying multithreading belts is an issue. My reply is that the assumed difficulty is based on the false assumption that belts can't be run in parallel without running inserters in parallel. I then go on to write about how game items can be split from a gameplay-logic point of view, as in identifying parts without shared memory. I can't tell how hard it is to implement, but I can tell they are candidates worth looking at.
This line doesn't say it's easy. In fact it states that I know it has at least one unsolved issue, which could potentially kill off that proposal for how to multithread the code.

Nightinggale wrote: ↑Mon Dec 10, 2018 11:35 pm
"Issue: what about inserters and vehicles, like unloading trains?"
Another hint that I'm saying that implementing this might not be trivial.

Nightinggale wrote: ↑Mon Dec 10, 2018 11:35 pm
"There are plenty of options for multithreading. It's just about getting the ideas on how to split the work into completely independent tasks. […]"
This quote also states something else: it might not be worth it in a world where players have 2 or 4 cores. A world where players have 8 or 16 cores is a completely different case, and going from 1 to 2 or 4 cores is nothing compared to going from 1 to 16.
Previously I mentioned using TBB. It will only make one thread per core and then add a queue of tasks to each thread. This will reduce the overhead of going multithreaded, but it won't remove it entirely.

Cadde wrote: ↑Tue Dec 11, 2018 2:46 am
"It all boils down to complexity of each operation. Imagine you try to multithread every item everywhere. This means assigning tasks to run each item in its own thread. Doing so means you DOUBLE the amount of execution per item and you still have to merge the results of those operations somewhere in the end. You lose performance."
While important in general, this is particularly important if we have, say, 500 entities that benefit from running in parallel, but 50% of them will not really do anything. Since each queue lives in a thread, adding tasks that are quickly discarded will not have the massive overhead it would have if each entity had its own thread.
Also, Windows (or macOS/Linux) will try to assign equal CPU time to each thread that wants 100% CPU time. Opening 500 threads because you have 500 entities means the OS will swap which thread to work on rather frequently, and there is a massive overhead to swapping the active thread on a CPU core. By having only one 100%-CPU-time thread per core, the OS will not feel the need to swap active threads as much, removing yet another source of overhead.
There is great potential in using all CPU cores. However, there are also a lot of pitfalls which can kill performance.
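The one-worker-per-core point can be illustrated with a fixed-size pool draining a shared task queue. The entity list and work function are made up; the structural point is that 500 tasks reuse a handful of threads instead of forcing the OS to schedule 500 of them, and a queued no-op task is far cheaper than an idle OS thread:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def update_entity(entity_id):
    # Placeholder per-entity work; half the entities have nothing to do.
    if entity_id % 2 == 0:
        return entity_id * 2
    return None

entities = range(500)

# One thread per core; each worker drains tasks from the pool's queue.
workers = os.cpu_count() or 4
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = [r for r in pool.map(update_entity, entities) if r is not None]

assert len(results) == 250
assert results[0] == 0 and results[-1] == 996
```

Note that `pool.map` returns results in submission order regardless of which worker ran each task, which is one way to keep a threaded pass deterministic.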
The only way to truly know whether a specific location is slow enough to consider converting it to multiple threads is to profile. Again, since we do not have the source code, we can only make guesses. Sure, they can be based on a lot of insight, but the final verdict is a profiling measurement.

Cadde wrote: ↑Tue Dec 11, 2018 2:46 am
"It's not the parallelization that's difficult. It's finding MEANINGFUL parallelization that is. And that's where metrics come in: you measure the amount of time spent in each stage and you identify the bottlenecks and attack them. Perhaps it's not the belt simulation that's the problem; perhaps it's all the updating of visuals that is?"
What I did was essentially try to remember what people have mentioned as slowdowns (a sort of informal profile) and then consider how to split those tasks so they don't share memory, allowing them to run in parallel. I can't tell in advance whether it will really work because, as you mention, it might be the graphical updates that are the culprit. However, what I wrote serves two purposes: one is to locate candidates for further investigation, the other is a reply to a statement about the difficulties of multithreading.
Check this against FFF #176, Belts optimization for 0.15.

Cadde wrote: ↑Tue Dec 11, 2018 2:46 am
"My solution to this isn't about multithreading, but rather about merging items into blocks. Sort of like how the wide chests mod works. Instead of having x instances of items on the belt, as long as they are all moving unaffected (no bends, no adds/subtracts and no separation from belt speed changes), merge them into a single block of 14 items (because 7 can fit in one straight belt lane and there's 2 lanes, right?) and move those blocks as one sprite."
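Cadde's block idea (and the FFF #176 optimization it resembles) can be sketched by storing a run of evenly spaced items as one record and moving the whole run by updating a single offset. The numbers and record layout are illustrative, not the game's actual data structures:

```python
# A compressed lane: instead of N item positions, store runs of
# contiguous, evenly spaced items as (position_of_first_item, item_count).
lane = [(0.0, 7), (10.0, 3)]   # 10 items stored as 2 records

def move_lane(lane, distance):
    # Moving every item is now one addition per run, not one per item.
    return [(pos + distance, count) for pos, count in lane]

lane = move_lane(lane, 0.25)
assert lane == [(0.25, 7), (10.25, 3)]

# Expanding a run back to individual positions, e.g. when a bend or an
# inserter breaks the "moving unaffected" condition:
first_run = [lane[0][0] + i for i in range(lane[0][1])]
assert first_run == [0.25, 1.25, 2.25, 3.25, 4.25, 5.25, 6.25]
```

The trade-off is the same one the FFF describes: movement becomes nearly free, but anything that disturbs a run (bends, insertions, removals) has to split it back into smaller pieces.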
Very unlikely. The CPU is moving small amounts of data in an unpredictable pattern (well, mostly unpredictable). GPUs draw big images, like 4K, which makes them move giant pieces of memory in a much more predictable pattern.
GPUs have much worse memory latency than CPUs, in addition to lower clock speeds and much worse instruction latency. But they have huge memory bandwidth and are massively parallel, so they can crunch massively parallelizable tasks at an insane rate.
Seems like your information about GPUs is 10 years old.

Nightinggale wrote: ↑Tue Dec 11, 2018 6:15 am
"Very unlikely. The CPU is moving small amount of data in an unpredictable pattern […]"
As a result, each of those will have memory optimized for the type of usage they have. CPUs have memory optimized for low latency while dedicated GPU memory is optimized for bandwidth. The CPU will not benefit from using GPU memory. In fact if it sacrificed latency for bandwidth, the CPU would slow down by using GPU memory.
Another reason for not using the GPU for game logic is that the GPU is optimized for extreme speed for graphics while the CPU has sacrificed speed for being able to do everything. As a result, some tasks are only possible on the CPU, which is why some FPS games are throttled by the CPU rather than the GPU. In essence the GPU needs to send tasks to the CPU because the GPU doesn't have the hardware to handle them itself.
So what kind of tasks is the GPU unable to handle? Essentially anything unpredictable. Most striking is conditional code: the simple task of "if some condition, then do something" is impossible. "If belt not blocked, move item on belt at the rate of the belt speed" is a CPU task because the GPU can't handle it. This makes the GPU useless for game logic. The type of task it can do is like "draw this object": it reads the size and location of the object and places it on the screen, a step-by-step task with no branching.
Have you read the article you link to?

keyboardhack wrote: ↑Tue Dec 11, 2018 9:28 am
"Seems like your information about GPUs is 10 years old. Modern graphics cards are so-called GPGPUs, which can absolutely execute branching code. Today's GPUs can essentially do whatever a CPU can."
In short, GPGPU is the ability for the GPU to send tasks to the CPU. This means the programmer can write GPU code which contains a full set of CPU instructions. Not having to write both CPU and GPU code, plus the communication between them, will obviously be much easier to work with.

"The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally back from the GPU to the CPU"
That's not at all what that means. It means the CPU is able to read back results of computation done on the GPU. GPUs can't cooperate with the CPU the way you described.

Nightinggale wrote: ↑Tue Dec 11, 2018 1:07 pm
"In short GPGPU is the ability for the GPU to send tasks to the CPU. […]"
From my understanding that's not how hyperthreading works.

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
"Hyperthreading is a clear indication of what is going here. It adds a fake core to a hardware core. […]"
It's not about hiding latency; that's what the first link is all about.

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
"In other words the fact that we have a bunch of cores is mainly to hide the effect of memory latency."
I would describe it slightly differently. You are correct that the time to travel is unchanged, but the waiting time depends entirely on whether the papers are in storage room 1, 2, 3, 15, 166, 1337 ... or already in the office and just need to land on the requester's desk.

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
"How does the memory work regarding bandwidth and latency? Think about an office building. […]"
Not really; you can reach memory bandwidth limitations even with a single core if you really wanted to, or did so by happy accident. But we are talking about 40+ GB/s here. I find it very unlikely that you would reach those speeds in Factorio, as each tick you would be moving 666 MB of data between the memory modules and the CPU. Devilish as that may sound, I really don't think there's that much data to be processed in a single tick.

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
In short: yes memory is a bottleneck for Factorio, but it's only latency related, not bandwidth. Adding more cores will not affect the latency issue for each core and there is bandwidth to spare for extra cores.
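Sanity-checking the 666 MB figure: at the assumed 40 GB/s of memory bandwidth and Factorio's 60 updates per second, saturating the bus would mean moving roughly this much per tick:

```python
# Back-of-envelope check on the per-tick bandwidth figure quoted above.
BANDWIDTH_GBPS = 40   # assumed memory bandwidth in GB/s (from the post)
UPS = 60              # Factorio's fixed update rate, ticks per second

per_tick_mb = BANDWIDTH_GBPS * 1000 / UPS  # MB that fit in one tick
print(f"{per_tick_mb:.0f} MB per tick")    # ~667 MB
```

Which supports the post's point: a tick's working set would have to be on the order of two thirds of a gigabyte before bandwidth, rather than latency, became the wall.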
Almost. It's a good start for sure; I wasn't aware of it. But each item on the belt is still fed to the renderer, right?

Nightinggale wrote: ↑Tue Dec 11, 2018 4:56 am
Check this against FFF #176 Belts optimization for 0.15.
Looks like it is running an if-else structure on shaders. Does that mean shaders can handle conditional branching? If so, then the shaders might end up in different places in the code. Does that mean shaders are not SIMD? Has the GPU left the SIMD model behind? That doesn't make sense, because both CUDA and OpenCL claim to be SIMD.

Cadde wrote: ↑Tue Dec 11, 2018 5:35 pm
"But what about GoL (Game of Life) on GPUs?" (https://nullprogram.com/blog/2014/06/10/)
Please do. That would be interesting to read.
Rather oddly, I encountered that video today before you posted the link.

Cadde wrote: ↑Tue Dec 11, 2018 5:35 pm
It's not about hiding latency, that's what the first link is all about.
Here's a related video on hyperthreading: https://www.youtube.com/watch?v=k6PzjGwyKuY
I am of the firm belief that unless Factorio starts solving the answer to life, the universe and everything, it should stay off the GPU for any of its simulation and rely on the GPU only for what is drawn on screen. I don't recall if you read about it, but my issues were with FPS, not actual UPS.

Nightinggale wrote: ↑Wed Dec 12, 2018 1:39 am
I give up investigating how the GPU hardware handles conditional branching.
As they always do. People easily mistake GPUs for just another multicore CPU. A GPU is not suited for what the CPU does; it's suited for massively parallel math problems.

Nightinggale wrote: ↑Wed Dec 12, 2018 1:39 am
Somehow I feel like this talk about GPUs and particularly the research following it has left me with more questions than answers.
I don't recall saying OoOE was the end-all solution to memory latency, just that it addresses the issue of the order in which things are executed, thus allowing the hardware to "multithread" the actual code even when it's serial in nature.

Nightinggale wrote: ↑Wed Dec 12, 2018 1:39 am
It makes perfect sense to merge it together with out of order execution (OoOE). However talking about how OoOE works is actually a bit problematic if you want to go into details.
...
A more accurate name for the GPU architecture is SIMT (single instruction, multiple threads). The article describes control flow handling as follows:

Nightinggale wrote: ↑Wed Dec 12, 2018 1:39 am
Looks like it is running an if-else structure on shaders. Does that mean shaders can handle conditional branching? If so then the shaders might end up in different places in the code. Does that mean shaders are not SIMD? Has the GPU system left the SIMD system? That doesn't make sense because both CUDA and OpenCL claim to be SIMD.
A downside of SIMT execution is the fact that thread-specific control-flow is performed using "masking", leading to poor utilization where a processor's threads follow different control-flow paths. For instance, to handle an IF-ELSE block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.
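The masking scheme described above can be sketched in a few lines. This is a software model of a hypothetical 4-lane SIMT processor, not how a real GPU is programmed: every lane steps through both sides of the if/else in lock-step, and a mask decides which lanes actually commit results.

```python
# Software model of SIMT masking for: if x >= 0: out = 2*x else: out = -x
data = [3, -1, 4, -5]              # one input value per lane
out  = [0, 0, 0, 0]

mask = [x >= 0 for x in data]      # lanes that take the IF path

# IF path: every lane computes, but only masked-on lanes commit.
for lane, on in enumerate(mask):
    result = data[lane] * 2        # executed by all lanes in lock-step
    if on:
        out[lane] = result

# ELSE path: the mask is inverted and the remaining lanes commit.
for lane, on in enumerate(mask):
    result = -data[lane]           # again executed by all lanes
    if not on:
        out[lane] = result

print(out)  # [6, 1, 8, 5]
```

This shows why divergent branches hurt GPU utilization: the processor spends cycles on both paths regardless, so a lane does useful work on at most one of them. It also answers the earlier question: shaders do "handle" branching, but via masking rather than by letting lanes run different instruction streams, which is why CUDA and OpenCL can still describe the hardware as SIMD-like.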