Regarding memory latency, this is how modern CPUs do their best to avoid it:
https://www.youtube.com/watch?v=_qvOlL8nhN4
Nightinggale wrote: Tue Dec 11, 2018 4:56 am
...
Hyperthreading is a clear indication of what is going here. It adds a fake core to a hardware core. Let's call the two cores A and B. When A stalls on memory latency, B will use the real core and vice versa. This means if a single core CPU delivers 100%, dualcore 200% (because multithreading is perfect), one core with hyperthreading can deliver 130% in common use cases. It's just one core, but it waste less time on memory latency.
From my understanding, that's not how hyperthreading works.
Hyperthreading operates regardless of what the instruction is, be it one that uses the load/store unit (memory), one of the ALUs, the DIV/MUL units, the floating point units, etc.
What matters is which ports are available at the moment. A port can have an ALU, a DIV unit and a vector unit behind it. You can't use both the ALU and the DIV unit on the same port at the same time, but there are more ports that may also have an ALU, a LOAD unit, a STORE unit and so on. It all depends on how the CPU is laid out.
So if not all ports are being used by the current thread, the OS can schedule another thread onto the sibling logical core and the CPU will use the free ports for its execution.
It's not that the CPU is "blocked" waiting for a memory response; it's that the CPU isn't using all its execution units at the same time for the same thread.
And that's where the ~130% figure comes from, IF you are running more than one thread.
Of course there's also the limitation that two threads shouldn't write (STORE) to the same memory location at the same time without synchronization, or you get a data race that may or may not be handled gracefully by your program.
Essentially, a core has more than one ALU and other execution units, and not all of them are used simultaneously, so other threads can be allowed to use them in parallel. And even an idle execution unit isn't always available, because there's only a limited number of ports (channels) to dispatch work to the different execution units.
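The port picture above can be illustrated from the software side. Here's a toy Go sketch (my own example, nothing to do with Factorio's code): summing with one accumulator forms a serial chain of dependent adds through one ALU, while two independent accumulators give the core two chains it can dispatch to separate ALU ports in the same cycle, the same kind of resource sharing hyperthreading exploits between two threads.

```go
package main

import "fmt"

// sumSingle uses one accumulator: each add depends on the previous
// one, forming a serial dependency chain through a single ALU.
func sumSingle(xs []int64) int64 {
	var s int64
	for _, x := range xs {
		s += x
	}
	return s
}

// sumDual splits the work across two independent accumulators.
// The two add chains have no data dependency on each other, so an
// out-of-order core can dispatch them to two ALU ports in parallel.
func sumDual(xs []int64) int64 {
	var a, b int64
	for i := 0; i+1 < len(xs); i += 2 {
		a += xs[i]
		b += xs[i+1]
	}
	if len(xs)%2 == 1 {
		a += xs[len(xs)-1]
	}
	return a + b
}

func main() {
	xs := make([]int64, 1000)
	for i := range xs {
		xs[i] = int64(i)
	}
	fmt.Println(sumSingle(xs), sumDual(xs)) // both print 499500
}
```

Both functions return the same sum; whether the dual version is actually faster depends on the compiler and the core's port layout.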
Nightinggale wrote: Tue Dec 11, 2018 4:56 am
In other words the fact that we have a bunch of cores is mainly to hide the effect of memory latency.
It's not about hiding latency; that's what the first link is all about.
Here's a related video on hyperthreading:
https://www.youtube.com/watch?v=k6PzjGwyKuY
And you can ignore all the stuff about "PortSmash"; it's basically a technique to probe the core for which units are being used by other threads, so an attacker can discern exactly which code paths are being executed, say, to recover the private key in some encryption scheme.
Nightinggale wrote: Tue Dec 11, 2018 4:56 am
How does the memory work regarding bandwidth and latency?
Think about an office building. There are 2 offices and they are connected by a corridor. One office has the workers, the other is a storage room. There is one person fetching what the workers need. He is informed of the number on a paper to fetch. He walks to the storage, picks it up and walk back and then he starts over with another number and so on. We add another person to pick up papers. Since the corridor is wide enough for them to pass, we get double the amount of papers per hour, but the waiting time for each paper is unchanged. Next we have 4 people and the throughput is 4 times that of one person, yet the waiting time for a paper is unchanged. Next we have 100 people and they queue up and jam everything.
I would describe it slightly differently. You are correct in that the time to travel is unchanged, but the waiting time depends entirely on whether the papers are in storage room 1, 2, 3, 15, 166, 1337 ... or in the office already and just need to land on the requester's desk.
And it also depends on whether the papers needed are all in one box or in several boxes.
RAM (Random Access Memory) has one major defining latency: CAS (Column Address Strobe) latency. That is the time it takes the memory to point to a different address, or storage room as it may be.
Imagine your paper fetching person (paperboy) needing to move a bridge to reach a particular storage room. There's only one bridge and only one person can cross it and return at a time.
Good thing is that the storage rooms are divided into sectors so you can move one bridge to point to storage rooms 1 - 10 and another bridge takes you to storage rooms 11 - 20 and so on. But only ONE paperboy can go to storage rooms 1 through 10 at a time by moving the bridge.
If the bridge is already aligned with the correct storage room, the CAS latency is only a matter of verifying that the bridge indeed is already pointing at the correct room and off the paperboy goes to fetch the next papers.
But it's rarely that simple, as there are always other offices that need papers from storage rooms 1, 2 and 3, and so the bridge has to be moved. The offices rarely (but could) cooperate to say "let all the stuff from room 1 be collected first, then room 2 and finally room 3".
Another thing I would change in your example is the existence of the office's local storage. Or the IN/OUT box, if you will.
Once a paper has been fetched (copied), it will not just land on the desk (a CPU register) but also in the IN/OUT box (the CPU cache), so should that same paper be requested again, it will be fetched directly from the IN/OUT box, and the paperboy will only have to walk a few feet rather than down the corridor, align the bridge and walk back again.
And then I would like to point out that the paperboy isn't stupid. In each storage room the papers aren't just lying around willy-nilly; they are neatly stored in boxes. Each storage room has a copy machine (because that's what you do when you read from memory: you make a copy, you don't move the data) that can copy an entire box in an instant. So instead of getting just one paper (byte, short, long, long long), the paperboy takes the entire box with him. There's a big chance the other papers that office needs are right next to the requested paper in that box. The paperboy doesn't care whether he's grabbing one paper or one box of papers; weight (bandwidth) is not an issue.
So the entire box lands in the office (cache memory), and any sequential requests are served from cache, or directly from the IN/OUT box in the office.
And that's the gist of the memory-related fluid optimizations so far. They store everything related to one line of pipes in sequential memory (the same box), so when the request is fulfilled, it's all there right away.
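The "whole box" behaviour is easy to act on in code. A toy Go sketch (illustrative only, not Factorio's code): summing a matrix row by row walks memory in the order it's stored, so every fetched cache line ("box") is fully used, while column-by-column traversal jumps a whole row per step and wastes most of each box. Both give the same answer; only the access pattern differs.

```go
package main

import "fmt"

const n = 512

// sumRowMajor walks the matrix in memory order: consecutive elements
// share a cache line (the "box"), so most accesses hit cache.
func sumRowMajor(m *[n][n]int32) int64 {
	var s int64
	for r := 0; r < n; r++ {
		for c := 0; c < n; c++ {
			s += int64(m[r][c])
		}
	}
	return s
}

// sumColMajor touches one element per row before moving on, jumping
// n*4 bytes each step, so nearly every access pulls a fresh cache line.
func sumColMajor(m *[n][n]int32) int64 {
	var s int64
	for c := 0; c < n; c++ {
		for r := 0; r < n; r++ {
			s += int64(m[r][c])
		}
	}
	return s
}

func main() {
	var m [n][n]int32
	for r := 0; r < n; r++ {
		for c := 0; c < n; c++ {
			m[r][c] = int32(r + c)
		}
	}
	// Same result either way; only the memory access pattern differs.
	fmt.Println(sumRowMajor(&m) == sumColMajor(&m)) // prints true
}
```

On large matrices the row-major version is typically several times faster, for exactly the reasons the box analogy describes.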
Finally, once the data is no longer needed in cache, the paperboy takes the box and brings it back to the storage room for archiving. And that's where issues with multithreading come in: whenever a box is copied from memory, the paperboy hangs a "LOCKED" sign on the original box, which is only removed when the altered copy is returned to its original location.
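The "LOCKED sign" has a direct software counterpart: an atomic read-modify-write. A small Go sketch (my own example, not Factorio's code): several workers incrementing one shared counter need atomic.AddInt64 (or a lock) so no update is lost while another worker holds the "box".

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runCounters has `workers` goroutines each add `iters` increments to
// one shared counter and returns the final total. atomic.AddInt64 is
// the "LOCKED sign": the read-modify-write of the shared location
// happens as one indivisible step, so no increment from another
// worker can be lost in between.
func runCounters(workers, iters int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddInt64(&counter, 1) // a plain counter++ here would race
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(runCounters(4, 10000)) // always 40000 with atomics
}
```

With a plain `counter++` instead of the atomic add, the result would come out short on most runs, because two workers can copy the same "box", modify it, and return it, with one overwriting the other.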
Nightinggale wrote: Tue Dec 11, 2018 4:56 am
In short: yes memory is a bottleneck for Factorio, but it's only latency related, not bandwidth. Adding more cores will not affect the latency issue for each core and there is bandwidth to spare for extra cores.
Not really; you can reach memory bandwidth limitations even with a single core if you really wanted to, or did so by happy accident. But we are talking about 40+ GB/s here. I find it very unlikely that you would reach those speeds in Factorio, as at 60 ticks per second you would have to move roughly 666 MB of data between the memory modules and the CPU each tick. Devilish as it may sound, I really don't think there's that much data to process in a single tick.
But it is possible, because the memory controller fetches BOXES, not single papers. As you said yourself, in another way.
I said bandwidth isn't an issue... that is, until you start doing SETI (Search for ExtraTerrestrial Intelligence) number crunching, bitcoin mining, software rendering or any other memory-intensive task that works through memory sequentially with the layout in mind.
Factorio isn't that kind of complicated, so bandwidth for Factorio isn't an issue.
Almost. It's a good start for sure; I wasn't aware. But each item on the belt is still fed to the renderer, right?
Either way, I am sure the Factorio devs are on top of this way more than we are at this point in time.
They didn't use to be, but this "close" to a 1.0 release, it's all about optimization. In such a logic-heavy game, even a single operation in the right place can make all the difference.
-------------------------------------------------------
As for the discussion on GPUs, GPUs are AWESOME for what they do. Which isn't so much logic...
GPUs are MASSIVE floating point math units. They can do logic, but the CPU is actually faster than the GPU in that regard. The GPU is WAAAAAY faster at floating point operations, matrix transforms and filling pixels on screen.
All of this is HIGHLY parallelized, because graphics work is rarely serial in nature.
You feed the GPU massive amounts of data (textures and geometry), and those (ideally) stay there forever. Then, every frame, you move, scale and rotate that geometry and nothing more. The GPU takes care of filling said geometry with texture/surface data and doing all the calculations needed to decide what goes on screen when, and what is obscured by what else.
GPU = MATH
CPU = LOGIC
Just because you can offload some computations to the GPU doesn't mean it's always faster to do so. And there's another major bottleneck here that really depends on the system: the PCI-E bus. If you expect the GPU to "remember" simulation state between ticks, you are kind of out of luck. Besides, the GPU is already quite busy holding all the sprite graphics.
GPUs are also excellent at SETI (see above) and bitcoin mining (though there's special hardware for JUST that task too because GPUs are limited in this regard) as well as guessing trillions of password combinations etc.
But that's again because such tasks are heavy on MATH and light on LOGIC.
GPUs do physics calculations too BTW, again because it's a MATH problem, not a LOGIC one.
"But what about GoL (Game of Life) on GPUs?" (https://nullprogram.com/blog/2014/06/10/)
Again, it comes down to the GPU's superior MATH capabilities. And its superior texture capabilities (fill rate). And its highly parallel nature.
And don't forget that sending data to and from the GPU is slow compared to RAM, and you'd still have to do that every tick to sync everything up. Even if the GPU could do some tasks faster, you'd be better off running them on the CPU, as the time to transfer all the data back and forth would become the new bottleneck.
As I said, you send all your geometry and textures ONCE to the GPU and send simple draw calls from there on out. Between CPU and GPU, bandwidth is the real problem.