Parrallel processing in games & applications

MeduSalem · Post by **MeduSalem** » Sat Jan 21, 2017 8:45 am

NotABiter wrote:Anyways, if you want to see a much better direction that processors might go, check out some of the Mill CPU talks. That thing is insanely better than x86/ARM/MIPS/etc. Much better performance, much better ILP, with less "overhead" hardware (it gets better ILP than an x86 using dynamic instruction scheduling, and does so using static instruction scheduling, so it needs no dynamic instruction scheduling hardware - how they do that is explained in the videos), and they claim to need far less energy/operation. (All claims except the last are "obviously true" just from the information in their jaw-dropping videos. The energy efficiency claim will be proven one way or the other once they're running on real silicon. These dudes have some serious brains and are making Intel/AMD/ARM all look kind of stupid by comparison.)
Do you know what the best part is? With their top end being able to dispatch 32 "instructions" (which they call "operations", but they are similar in work-amount to instructions on "normal" CPUs) every single cycle per core, they will make sticking with single-threaded applications viable even longer into the future.

Funny that the Mill shows up here.

While "belt machines" (which is how it would be categorized correctly) have some interesting ideas... the Mill suffers from several fatal design flaws in my opinion:

Basically VLIW instruction set architecture with too much dependence on Compilers being smart

That sucks really, really hard... and has been among one of the reasons why IA64 (Itanium etc) failed (apart from the lack of proper x86 support and pragmatic software developers not feeling like they really wanted to port their software there).

While looking really good on paper (so did IA64 too) they just don't scale very well in practice because they are so hard-designed around certain constraints. Like the amount of instructions that are encoded within them. If you widen the VLIW pipeline, older software often cannot really profit from that and everything needs to be re-compiled or otherwise the pipeline becomes ill-utilized on newer micro architectures. Now go and tell software developers that they have to re-compile every time you decide to release a new processor line with a wider pipeline or otherwise they face a truckload of consumers complaining that their legacy software doesn't perform any better on the new processors. Most companies just won't do it because it is cost uneconomical. Also think about what happens when a company goes out of business... then there is nobody to update/recompile the software. Imagine a game from 10 years ago... it wouldn't perform any better than it did 10 years ago even if the micro architectures advanced further. Fugly.

The only way to overcome that VLIW limitation is to double, triple, quadruple, etc the pipeline from the original width so that the fetcher can pick 2 or 3 of your original VLIWs per cycle. But that again means you have also to do dependency check on-the-fly again with everything against everything to avoid issuing two instructions that need to be done in a particular order... and that's when we are back to how classic fixed-/variable-length ISAs are doing it already. So what's the profit of having VLIW then? It's a point where VLIW is just fundamentally flawed and gets its ass handed to itself by fixed/variable length ISAs.

Another problem that comes from VLIW is the tremendous amount of code bloat due to having "headers" and having to place stupid NOPs everywhere whenever the Compiler thinks that there is nothing in particular that could be done in parallel. (And as research has shown even the best VLIW compilers are really bad at finding Instruction Level Parallelism in general purpose programs for some reason if they are not obviously parallel or helped by software developers knowing their way around assembler) On the other hand the instruction fetcher within the CPU only has a limited window they can look forward so if there is a huge NOP slide encoded then the next "real" instruction might be out of scope of the window and there is nothing you can do about it. The bottom line is... you don't want NOPs in the pipeline... they are inherently a bad design choice and even to think about having it should have been illegal because it turns the entire concept of VLIW upside down.

In my opinion the space they waste with encoding NOPs could have been used for the next instructions following afterwards (even if you can't schedule it yet for execution) so that when the current instruction completes the pipeline already has them ready... with a much smaller prefetching window and increased code density which in return also profits the Cache structure and overall bandwidth.

After watching all of Ivan Godard's videos and reading into most of his papers I just don't get why he insists on that VLIW stuff. It's basically proven in praxis that it doesn't work out as well as it seems in theory and that it is the major reason why almost everyone, including GPU companies like Nvidia/AMD who also used VLIW in older graphics chips eventually moved away from VLIW to RISC for their newer chips. The design constraints eventually outweighed the gains (especially since they are also targeting GPGPU stuff).

I know in theory Compilers for the Mill would be able encode up to 33 instructions into one VLIW, which is much more than anyone else ever did before, so there probably wouldn't be the need for widening the pipeline for quite a long time... But in reality compilers would probably have a hard time filling all the 33 slots up most of the time (especially in short loops and branches which is what probably the most general purpose code consists of)... so it ends up with a lot of overhead introduced through the fixed size "header" of the VLIW which always has to be there and all the NOPs that are in between the actual instructions (even if the NOPs are all combined into one single NOP they are still there and reduce throughput).

On the bottom line I think a VLIW architecture also can't really do anything much better than a classic fixed-length RISC architecture can do... all that is done with a VLIW ISA can also be achieved with a fixed-length RISC ISA, just that the RISC ISA doesn't suffer from above described problems like having to re-compile and infamous NOP sleds (at least modern RISC ISAs try to avoid NOP Sleds, older one suffer from the same problem due to the strictness of the classic RISC pipeline).

And if one really wants to go massively parallel then one can't avoid having some sort of SIMD extension anyways... but leave the old fashioned General Purpose instructions in peace for the sake of simplicity and flexibility.
Static, unchanging cycle-times in the Execution Units for instructions

From what I have also read in the papers the Mill basically assumes that a specific instruction always takes exactly the same amount of time, independently of the actual micro architecture it is running on. I don't remember exactly why it required that but it did for some reason.

That makes changes to the execution units between micro architectures pretty much impossible. But sometimes there need to be changes... some instructions might become obsolete/less useful over time so you might increase the cycles it takes for them to complete on purpose if it allows you to make another instruction perform faster instead. It really depends from micro architecture to micro architecture and how everything is interconnected and how important several instructions are.

So basically the Mill hard codes the execution units and that's awful because one of the better aspects of an instruction set architecture should be the possibility to implement it with different micro architectures instead of one that is totally fixed to work in a particular way.
The belt-like mechanic itself isn't inherently better than Out-of-Order with Register Renaming

I get that the basic concept of the Mill is treating the Registers like a belt where the oldest item on the belt eventually drops off and vanishes and if you want to preserve it you have to put it back to the beginning of the belt... and that Ivan reasons this because 80% of the values inserted into the registers are only used once, while only 15% get used more than once.

But what makes that actually better than a classic Register machine where the Compiler decides when it is legit to overwrite a particular ISA register with a new value?

He reasons that you need less Multiplexers between the Belt Registers and the Execution units but I think that this can't really be avoided no matter what. For the inputs of the Execution units you still need to wire every Belt position to every execution unit or otherwise you can't access the data from every belt position once you need it. So on the EU input side you again have a huge multiplexer tree... because how otherwise would you do it? Wait until the proper item passes by the Execution unit in the belt-like fashion? Urgh... the latency until a single instruction gets finished... and how about branch mispredictions... pipeline flushing and whatnot... would take forever to recover from that.

The only savings you might be able to have is on the data output of an execution unit because you always drop it at the beginning of the belt. But if you have multiple outputs per cycle you have to re-introduce multiplexers or limit the amount of items that can be dropped onto the belt in one go. Also you have to introduce a logic that puts the items in the right order on the belt so that the outputs of various execution units don't end up in the wrong order if they finnish at the same cycle.

And that's also where the complexity doesn't really differ that much anymore from having out-of-order execution with register renaming and a re-order buffer.

The really bad part about the belt-like structure is that you can't even think about implementing an out-of-order execution of stuff because the instructions are encoded in a way that they reference a relative belt-position that changes every time another item gets input on the beginning of the belt. That in turn makes it almost impossible to ever implement Out-of-Order execution on the Mill because the hardware required to keep track where on the belt a specific item currently is would be monstrous compared to current RISC/CISC register renaming.

OoO becomes more and more profitable if not necessary once there are Cache misses or other long latency/multi-cycle instructions... and I see no real way that the Mill can avoid that problem... on the contrary the way it works with belt-relative addressing of registers it makes it a whole lot harder to implement micro architecture improvements.

Also what about extending the belt-length? Probably the same problem as having only the Instruction Set Architecture Registers... if you don't have enough of them you end up with a lot of Loads/Stores and adding more Registers is impossible without a new ISA (which then requires recompiling everything to take advantage). And that's why Out-of-Order eventually became an important thing.

There are some other things that I think may not turn out all that good on the Mill too... but the entire thing is almost worth its own topic.

More realistically seems that VISC stuff from Soft Machines since they at least have something to show in Silicon... but sadly they got bought by Intel recently so it probably won't see a commercial implementation for a long, long time... and only if Intel is really cornered and needs to pull a rabbit out of their hat. But the basic concept would be that a single threaded application could take advantage of unused resources from multiple cores, which is achieved by having a global fronted end that fetches instructions from various threads at once and then schedules and distributes them among the cores that are free.

Also when it comes to Out-of-Order stuff then there are also different new concepts that haven't been implemented in any commercial processors yet... like for example out-of-order retirement of instructions when it becomes obvious that they don't depend on anything else anymore. That in turn would allow for a much smaller Reorder buffer... or to increase the out of order window in situations where a lot of instructions are parallel and as long as they don't require atomicity.

If you ask me then I think that the best approach was Berkely RISC II and follow-ups like SPARC which had register windows to speed up loops and subroutine calls, thereby minimizing the need to access Caches/RAM... and that's where they should have gone with the development. Generalizing the register windows so that the software can decide which registers are "punched through" from window to window and instructions to move the contents of a window directly onto a stack buffer on window switch... stuff like that.

NotABiter · Post by **NotABiter** » Sat Jan 21, 2017 11:06 am

hoho wrote:one has to do a couple of hundred of computations per byte to max out CPU usage *and* memory bandwidth.

It depends on your memory access pattern. If you are streaming through memory serially, using every byte, then the number of CPU cycles available per byte of memory is actually very low. If you are hitting random memory addresses and only looking at one byte at each address, then yeah, then you're looking at 100s of CPU cycles per byte. Good code tries very hard to not do that though, because it's really slow. (See, for example, Data-oriented design and AOS and SOA.) But depending on the problem, random accesses can be various degrees of difficult to avoid.

While this sounds awesome, I don't think it's really possible in regular apps to have *that* much possible parallelism in code.

Except it is. For example, simply comparing two large strings will max out ILP on a Mill. (He explains how that works in one of the videos.)

it sounds vaguely similar to what Intel promised with Itanium - compiler-side optimizations instead of hardware-side.

It's nothing like Itanium (or VLIW in general). With VLIW you need some mythical "sufficiently smart compiler" to figure out how to turn source programs into good binaries, and that's a problem that still hasn't really been solved. Ivan Godard is an experienced compiler writer and he's specifically stated that the Mill is designed so the compiler is simple to write - nothing fancy required. It's been some time since I watched their videos, but IIRC a big part of that was their ability to do per-operation speculative execution (which also allows some very cool loop optimizations that x86 et al can't do).

So, yeah. Hypothetical 32 instructions in parallel, practical "up to 6".

No, practical to 32... on the right work load (i.e. not highly branchy code). And not "up to 6" but "ILP of six or more on common programs". (I'm not sure why you felt compelled to take ">= 6" and turn it into "<= 6".) They've made a point of being able to take existing C code and just recompile it and run it, and just doing that with common programs gets them to 6+, which is 2-3 times better than what you tend to see on x86.

In reality that means effectively double the power usage of executing the code that "normal" CPU would not run speculatively. Sure, it will be better performance but at the cost of significantly higher power usage.

Like I said, their claim is lower power, not higher. (In fact, the specific claim he makes many times is 1/10th the power of a conventional CPU, but usually in the context of DSP-like workloads. My own guess is they are looking at 75%+/-25% of the power of a normal CPU on common branchy stuff.) I know they're running this thing in simulation, doing things like running a full Linux kernel on it and then running applications on Linux. But I don't know how detailed their simulation is - e.g. if it's purely logic + timing or if it's logic + timing + power. So while I reserve some skepticism on their power claims, like I said, all of their other claims are "obviously true" (at least to someone like myself who has a background in computer architecture and VLSI design) which highly suggests that their power claims are more likely "mostly correct" rather than "mostly wrong".

Also, from what I remember, those parts of the circuit dealing with HW-level optimizations and scheduling in modern CPUs are relatively power efficient. Vast majority of power is still spent in execution units and just moving data to and from the CPU.

Your memory is incorrect. Decode and dispatch for out-of-order superscaler architectures (like x86) are major power (and silicon) hogs, and they do not scale well at all. (Power and silicon costs go up something like quadratically with the number of instructions you try to handle at once.)

Have they actually created any CPUs using that architecture or is it just simulated

Last I saw (maybe a year ago) it was still simulation.

Way too many times have CPU manufacturers promised magical performance increases that almost never play out in reality.

But the thing is, the architectural improvements they describe (and they describe most of them well enough that I could actually implement them myself -- they're not just hand-waving) essentially guarantee the performance increases. It's like hearing for the first time an explanation of how to make a carry-lookahead adder -- it's just immediately clear to anyone "practiced in the art" that the speed of that circuit is going to be way better than a ripple-carry adder.

MeduSalem wrote:1. Basically VLIW instruction set architecture with too much dependence on Compilers being smart

This just shows that you understood nothing. The Mill compiler is dumb. It has no need to be smart. (Yes, VLIW compilers have to be smart - and at present they're still not smart enough. Mill is not VLIW.)

2. Static, unchanging cycle-times in the Execution Units for instructions

Meh. IIRC aren't most of them single cycle? I think the multiplier has an extra cycle of delay or two beyond that?

some instructions might become obsolete/less useful over time so you might increase the cycles it takes for them to complete on purpose if it allows you to make another instruction perform faster instead.

I won't even argue this point, because it's completely irrelevant to how their 1.0 release would perform.

3. The belt-like mechanic itself isn't inherently better than Out-of-Order with Register Renaming

Except it is. If you are looking at only instructions/cycle, then no, it's not better - it's just about the same. But as soon as you look at power and silicon costs it's way better because they don't need to have all of that out-of-order superscaler decode and dispatch hardware. (There's a picture of a Nehalem core here. Note that about 1/6th of the entire core is spent on instruction scheduling alone. That's an expense that Mill doesn't have to pay. In power terms it's even worse than it looks because, unlike cache and execution units which mostly sit idle (in terms of "active area"), the scheduler is running full tilt every (core-not-fully-stalled) cycle trying to find ILP. Nehalem spends another 1/6th of its core just on instruction decode.

He reasons that you need less Multiplexers between the Belt Registers and the Execution units but I think that this can't really be avoided no matter what.

Dude, he's not just "reason"ing that you need less -- they've got this thing working in simulation. They've got it designed to at least an HDL level of detail. They know exactly how many muxes their current designs require.

The really bad part about the belt-like structure is that you can't even think about implementing an out-of-order execution

You're already getting all of the benefits of out-of-order execution without paying the power and silicon costs of out-of-order. Why the hell would you then want to even consider having out-of-order? You want to waste more power and silicon?

the hardware required to keep track where on the belt a specific item currently is would be monstrous compared to current RISC/CISC register renaming

Now you're just demonstrating that you know nothing of computer architecture. Tracking belt position (if they needed to do so, and they don't) would be way cheaper than register renaming (because it requires almost zero communication - only the belt-stall line is needed to know when to not bump a local incrementer). Besides which, there is no belt. It's a logical abstraction. They're not actually shifting data each cycle down some FIFO. (And you would know that if you watched/understood the videos.)

Also what about extending the belt-length?

What about it? They already support various belt lengths on different members of the Mill family. (Higher end Mill CPUs have larger belts.) You load the same code on all of them and it just works. He explained that in the videos.

More realistically seems that VISC stuff from Soft Machines...

I might read about that later.

out-of-order retirement

Doesn't seem very interesting. The main cost is on the front end of scheduling, not the back.

I think that the best approach was Berkely RISC II and follow-ups like SPARC which had register windows to speed up loops and subroutine calls

Mill function calls (which work just like mill instructions as far as the caller is concerned) kick both their asses, both in coolness factor and in performance. The belt means no register window is needed. (EDIT: I think you must have missed the part in the videos where it explains how each function call effectively gets its own logical belt. And IIRC the Mill also handles loops similar to its function calls, though I don't remember the details of that.) And those architectures lack all of the other features that make Mill better than x86. Really the only thing that has kept SPARC alive at all is that they have a crap ton of hardware threads (which lets them hide tons of memory latency which your typical poorly written server code suffers rather badly from). And even there they often play 2nd fiddle to x86 (because their caches are just too small).

MeduSalem · Post by **MeduSalem** » Sat Jan 21, 2017 2:01 pm

NotABiter wrote:This just shows that you understood nothing. The Mill compiler is dumb. It has no need to be smart. (Yes, VLIW compilers have to be smart - and at present they're still not smart enough. Mill is not VLIW.)

It is VLIW or at least very VLIW-alike. He said so himself multiple times, even in the Video you linked yourself, but it is also stated in a lot of other places. Maybe other sources are wrong, maybe his own videos are not enirely correct or not specific enough about of what kind of ISA category it exactly falls into, who knows.

If the compiler is dumb I cannot say though, but it better wouldn't be or otherwise being able to encode up to 33 instructions in one word makes not that much sense in the first place. Because the main advantage of VLIW stuff (and similar ISAs) is offloading the dependency check to the compiler so that you don't have to do all the complex dynamic dependency checks on the fly anymore... which are mostly required for Super Scalar/Out-of-Order Systems to work and which the Mill obviously doesn't use and which it claims to be one of it's main advantages since it doesn't depend on all the elaborate circuitry for that.

If it had that kind of complex circuitry then it might as well make use of out-of-order too because that's what the logical next step would be: Register renaming... because that comes at almost no cost anymore once you already have dynamic dependency checking. Yeah okay, it takes more silicon to have the additional physical registers, RAT and ROB, etc, but I'm speaking of additional complexity rather than Die space... Die space isn't really a concern anyways anymore nowadays or otherwise they wouldn't pack 24mb of L3 cache on a chip or produce 500mm² GPU chips.

Which is also addressing some other things of your reply, because if the compiler doesn't exploit the Instruction Level Parallelism... the Mill probably can't exploit it... since the processor itself has to be told what is parallel and what isn't. And as you said yourself... Most VLIW compilers from the past sucked at what they should have been good at... and if you say the Mill's compiler is dumber than that it will face a rough time.

That is also why I think that without Out of Order Execution nothing is really able to maintain high performance nowadays, not even RISC CPUs with their leight weight design, even most of them have it if they want to be cutting-edge and able to compete. You might alleviate a lot of things like latency etc with SMT like you mentioned with SPARC, but in the end if you suck at exploiting ILP then the IPC will go down and that's it...

NotABiter wrote:Meh. IIRC aren't most of them single cycle? I think the multiplier has an extra cycle of delay or two beyond that?

Some simple stuff like comparisons or easy Integer adders/bitshift stuff and other arithmetic or control flow stuff maybe... but other stuff like SIMD/FPU/Encryption/Address Generation/etc often take several cycles, even on the most recent x86 CPUs... but it keeps on changing from Micro architecture to micro architecture. There are even PDFs documenting each processor line of Intel/AMD and how many cycles each instruction takes ... and for some it is only written as average because some of them aren't fully deterministic or depend on the exact input that they don't know in advance.

So I guess that if the Mill requires instructions to be that specific and deterministic they might face some problems... like where the execution can only be as fast as the slowest instruction to prevent race conditions since there's no complex dependency check/out-of-order logic that prevents certain things from happening... If I had to take a wild guess that's also the main reason why the instructions need to have such deterministic cycle times or otherwise the entire thing goes haywire and might end up in the wrong state just because a single instruction took longer than expected.

My bet is that this will be the reason for why the Mill won't perform as well for general purpose stuff as they expect. If there's heavy FPU/SIMD computation or lots of random Load/Stores with cache misses the entire thing will always slow down to that speed because it can't progress until these problems are out of the way since it can't do another instruction out of order while resolving the long latency instruction... it probably depends on the Compiler to expect that there might be a cache miss or something and to re-arrange instructions in a way to hide all the latency... and I don't know if they will really succeed with that. But let's see how it eventually turns out if they ever make it to Silicon and into real applications rather than simulations.

NotABiter wrote:I won't even argue this point, because it's completely irrelevant to how their 1.0 release would perform.

True, might not be all that important since they haven't even realized anything in Silicon yet... but eventually they should consider the possibility. Other companies made the mistake not to consider such things and now their ISAs and Micro architectures are stuck with a lot of crap that can't be done in any other way than it already is for compatibility reasons or because it would require a fundamental rework of absolutely everything, pretty much like dumping everything they have and starting over from scratch.

NotABiter wrote:Dude, he's not just "reason"ing that you need less -- they've got this thing working in simulation. They've got it designed to at least an HDL level of detail. They know exactly how many muxes their current designs require.

Well, I can't think of it requiring less muxes than a compareable in-order ARM or x86 CPU (like Intel Atom for example) does if it tries to be comparable or competitive with the amount of Registers and EUs. The only reason the Mill realistically doesn't require that many muxes as a highend x86-64 CPU or whatever is because it doesn't have out-of-order execution, hence less physical registers that need to be wired up to the EUs, reducing the multiplexer tree.

But that also means it has less physical registers to work with in general, meaning it has to access RAM more often since it can't sidestep to another (renamed) register. Which is probably going to be a downside in certain CPU intensive scenarios.

Also if the thing doesn't really do any complex runtime dependency checks (and the register renaming which comes at almost no additional price) then that possible instruction level parallelism needs to be offloaded to the compiler... and if the compiler doesn't do it well... then you probably end up wasting possible throughput.

NotABiter wrote:Now you're just demonstrating that you know nothing of computer architecture. Tracking belt position (if they needed to do so, and they don't) would be way cheaper than register renaming (because it requires almost zero communication - only the belt-stall line is needed to know when to not bump a local incrementer). Besides which, there is no belt. It's a logical abstraction. They're not actually shifting data each cycle down some FIFO. (And you would know that if you watched/understood the videos.)

I know that it is an abstraction and I never wrote that within the micro architecture the Registers really move down in a FIFO like style. I know that the data is physically staying in place because that's requiring less circuitry and probably also less power than moving the stuff around constantly... and instead it's just the connections to the registers which rotate around so that the illusion arises that the "oldest belt position" eventually becomes the beginning again... so that when a new item gets placed on the "belt" it basically takes (overwrites) the place the "oldest" item had on the "belt" before.

That said the assembler code does still make relative references due to the way an item on the "belt" constantly changes it's relative position and with it its address whenever a new item gets pushed onto the "belt". So an address to a belt position is only temporarily valid. That makes it necessary to keep track of which position you are actually on the "belt" or otherwise an instruction might reference the wrong belt position that is no longer valid. The compiler probably has no problem keeping track of it, but I wouldn't want to be the one who has to write an OS with Assembler parts for that machine.

NotABiter wrote:What about it? They already support various belt lengths on different members of the Mill family. (Higher end Mill CPUs have larger belts.) You load the same code on all of them and it just works. He explained that in the videos.

I really wonder how that is exactly going to work out with the "belt"-relative addressing of Registers, I mean really, I want to see technical papers on it, not just some powerpoint abstractions. Because if the compiler assumes that the "belt" is only like for example 32 registers long ... (and it has to know a length because otherwise it wouldn't really know when the processor exactly starts overwriting the "oldest" register with the "newest" item) ... then it can only ever address 32 registers at one given point, which is basically fixed in the ISA like any other Architectural Registers from other ISAs.

If you would want the code to profit from a longer "belt" in a newer processor the code would have to be recompiled for that machine because otherwise the relative addresses wouldn't go any further because the compiler wrote it that way. Either that or the compiler would have to write code in a way that allows the code to scale onto longer "belts" by default. And that's seems to me like it is something rather problematic with relative addresses since it would have to insert conditional stores/loads for the case that a "belt" is not as long as on another processor.

NotABiter wrote:Mill function calls (which work just like mill instructions as far as the caller is concerned) kick both their asses, both in coolness factor and in performance. The belt means no register window is needed. (EDIT: I think you must have missed the part in the videos where it explains how each function call effectively gets its own logical belt. And IIRC the Mill also handles loops similar to its function calls, though I don't remember the details of that.)

Nope, I haven't missed that part... but how do you think that the mill function call works? How do you think that a new "belt instance" is created in particular? ... As far as I know he never really exactly explained it in any of his videos/papers how exactly it is working in the background... only in an abstract way that says "a new belt instance is created" (I remember that powerpoint sheet exactly) ... but we can agree on that he can't pull a new "belt" out of thin air. So there are two possible ways I can think of how he is doing it in the background:

Creating a new belt instance is like creating another Register Window... similar to RISC II or SPARC... and you can only have a limited amount of belt instances... but eventually you will have to move something out to the Stack once you reach the bottom of the barrel.
Automatically dumping the contents of the "Caller belt" temporarily onto the Stack in the background and then automatically putting it back on return from the Callee, the way it is done in pretty much all other instruction sets like ARM, x86 and so on, with the only difference being that ARM, x86 etc require the software to decide itself what to put on the stack and nothing happens automated, hence they require calling conventions to decide if the caller or the callee is responsible for saving the state.

If he does the second then it might only be slightly faster than ARM/x86 calling conventions... if at all, which also require everything to be dumped on the stack if the callee needs to work with the registers and most definitely a lot slower than Register Windows where the stuff never really leaves the CPU Core but rather gets rotated out of the way to make room.

Also don't tell me constantly that I know nothing. I'm not Jon Snow.

I may not be entirely up to date with some things, or may have misinterpreted something... but that I know nothing would mean I couldn't even differentiate between Bits and Bytes.

Harkonnen · Post by **Harkonnen** » Sat Jan 21, 2017 4:02 pm

TL;DR: Factorio has multithreaded update since around October 2016. The code of Factorio is so clean that I managed to inject threading in roughly 2 months with seeing quite big codebase for the first time

That implementation scales almost linearly from 1 to 2 cores. Gives an extra of 15% when jumping from 2 to 4 cores. And some extra 2% when jumping to 8 threads on 4-core CPU with HT.

It is not applied to all entities in the game though, so it does not mean doubling UPS. Some rare savegames showed up 70% overall UPS increase, but usual gains are around 10-30% - the real hogs are not that easily threadable. The reasons it was not included in 0.14 and why including it in 0.15 is still a question are described below. Main of them is that I started to see any considerable gains only when I "cheated" by adding belts to multithreading, but they are going to be sped up like 10-20 times using another single-threaded tech which is not quite multithreading-friendly.

In modern times memory bandwidth is the main limiter which is around 20Gb/s in hardware where factorio targets itself. This means that you have roughly 300 Mb per frame if you want to stay at 60 ups. The funny thing - when I first time came to office with my i7-2670QM (mobile processor) with 1333Mhz RAM and ran inserter-heavy map which I later used for testing multithreading - I got around 25 ups. Kovarex with his 12-core Xeon and I believe 1600Mhz at least showed around 30 ups. It was nearly the same! while if we would run something ALU-heavy like Terragen rendering, I would be beaten like 3 times I guess, even with same number of threads. That means Factorio is memory-bandwidth limited, not by ALU power which multicore solutions actually give. You want to go threading when you need more ALU. So threading it is same as threading network code when you are at 64kbit connection.

In my opinion multithreading would really start taking over this world only when hardware becomes NUMA - that is personal memory bank for each core, so on 4-core you will get 80 gb/sec - like that of GPU. Afaik AMD had such things in some of their processors, and I believe Intel has something like this also, but unless you have portable API from OS onto which memory bank you allocate, it doesn't add much. Also OS tends to throw your threads all round cores to make CPU heating and throttling equally. Overally - unless you are doing some offline 3D rendering or some other ALU-heavy thing, it's better to write good packed cache-friendly single-threaded code that will utilize full potential of 20gb/sec from one single core than to complicate codebase with multithreaded approaches. Keep those other cores for quick Alt-Tab to browser or farm some bitcoins if you don't like those extra ALUs sitting idle.

Still, I mentioned there were some noticeable gains when I added threading. That happened because many times CPU gets stalled waiting for some data from RAM upon cache miss, and hardware out-of-order execution is not that deep to do non-dependent ALU things ahead, especially when there's branch or CALL right after a cache miss, so multithreading fill such gaps by allowing other cores to do their ALU stuff while memory controller is busy filling cache line of some other core, sorta out-of-order execution, sorta what hyperthreading is after. You also get more L1/L2 cache when you utilize several cores. Many things in Factorio are heap-allocated without caring a lot about their locality (at least between entities), so I believe organizing such things in better form using custom allocators will lead to fewer cache misses, and gains from multi-threading will be not that noticeable if at all.

Another thing - game already runs perfectly at 60 ups at "normal" factories. You start getting to 30-40 ups when you build a Super-Gigafactory with like 4Gb gamestate (I speak of RAM usage by task manager, not savegame size). The thing here is that commodity hardware usually has 4-6Gb RAM if we speak of laptops, so such factories are already not something everybody can load. And that was one of reasons we abandoned 32bit version. So going optimizing and optimizing it all forward can be for like forever. It's like with multiplayer when we made 200 players, then 400, then thought about 1000. There should be some point to stop at and look at other aspects of the game. So for now my vision is to optimize things which can be optimized like x2-10 times in single-threaded form. Belts being one of such things - that will come soon in friday facts.

The way that multithreading worked was briefly explained in FFF mentioned. It was done not just lock-free, it was done atomic-free

well, almost, except chunks scheduler. Many entities have their side effects spanning at most several tiles apart, so every chunk (32x32 tiles part of the map) may have read/write effects only on itself and 8 chunks around it. With this in mind if two chunks are at least 2 tiles apart on both X and Y axis, these chunks (their multithreading-friendly entities) can be updated in arbitrary order without causing read/write or write/write collisions. Another things to notice - if some chunk gets affected by updating entities in nearby chunks - it must get side-effects of this processing in some deterministic order. For example, if some chest in corner of the chunk gets affected by two inserters in two nearby chunks - order of those inserters affecting it must be deterministic.

So all chunks were put over 3x3 grid, sorta updateGeneration = (X%3)*3 + (Y%3). After that all these generations updated in a sweep by worker threads. There is a thing that when G0 finishes updating, some chunks from G1 already may start updating without waiting for the whole G0 wave to finish. But every chunk depends on up to 24 chunks around it, so that would put a big strain on scheduler because many chunks contain just some 3-4 pipes in them so that sync would cost more than updating entities themselves. I had idea to shrink this dependency graph because some edges imply other edges as well - that has led to at most 4 direct dependencies per chunk, but brought another problem - necessity to put artificial chunks which are not active actually to make these implied edges work as intended.

Since all that multithreading starts mattering only on huge factories with a lot of active chunks, I abandoned the idea of smart scheduling - in current implementation all threads wait for all G0 to be processed, then entire G1 is processed and so on. These are sorta sub-clock signals BenSeidel was talking about. Since there are quite a lot of chunks for every generation, granularity is big enough for that stalled waiting time to be witin acceptable limits. Another thing I will probably add is distributing chunks between threads in some balanced form, so every thread has about same number of entities to update, so they all finish updating at about same time.

The entities that were affected are inserters, assembly-machines, chem-plants, refineries, pipes, but even in logistic-free map it was not enough to see considerable performance gains. Then I added belts and it gave something. Main hogs are currently electric networks (already much better, rseding91 made a ton of updates in there), biters and trains. So in our push for final performance will probably be after biters and trains. Trains span all around the map, but their gamestate is probably double-bufferable, so that can make it threadable. Biters are the nastiest thing here because they run in packs steering from each other within a single chunk - a lot of collisions for threaded code. I sometimes even think to accelerate them on GPU with a compute shader

Overally it works like that:
- Update electric networks, trains, circuit-networks etc... (all single-threaded)
- Update single-threaded entities (biters for example)
- Run 9 sweeps to update multi-threaded entities
- Consolidate what was postponed during multi-threaded sweep

The postponed stuff is when for example inserters touches a logistic chest. Logistic chests have a lot of stuff bound to them, like a list of all chests with iron so bots can search for iron quickly. Heck, we even have voroni triangulation in some places of logistic network. This is definately not the stuff you want to do lock-free or whatever-free. Another thing - a train. You drop to cargo wagon, it increases overall intentory for entire train, and train can be longer than 2 chunks. So all changes that need to be done there are dropped into list of actions to postpone (one such list per chunk), and then performed single-threadedly after all sweeps are done. Fortunately this list is not that big per tick, so I never saw it in profiler. And adding to these postponing lists is the only place where critical sections were used.

Still it makes code quite complicated in some places. Factorio codebase is very OOP-heavy, we use a lot of things from std, boost, multiple inheritance, etc... So you frequently have some method like LogisticChest::transfer which could be called from literally everywhere - an inserter, belt loader, logistic bot, player dropping stuff, etc... and you never know from where actually it will happen the next time. You update belt, belt moves a player, a player has personal roboport - woops, entire logistic network is now affected. Now, if threading is added, whenever somebody in Wube works on such method of some entity has to care if it can be called in multithreaded context or not, and not all things are easy to postpone. So if threading is kept in codebase, at the end of the day everybody will have Einstein hair, 5 cups of coffee and smokers like me will get two extra empty packs of cigarattes while they were temporarily boosting L1 cache size of their brain to get grip on all those possible dependencies when they wanted to add just one single line of code.

L1 cache of the brain is very important thing. It allows to care less and still be on the safe side. That's why OOP was invented in the first place. That's why people like garbage collectors in performance-relaxed environments. That's how everything works. That's what makes development team scale from 2 coders to 10 for example. All factorio developers get special adamantium gear-shaped tattoo on their neck inscribed by Albert during intiation process. That tattoo extends brain L1 cache to about x1.5 of healthy people already - that's the secret behind this godlike update loop that made game famous, but still they have their families and thus need good dreams at night, so things which makes life complicated every day are better be avoided. Threading cannot be encapsulated all that transparently, you have to keep remembering of it every day you modify something. Also that tattoo gives trouble at airport metal detectors, so we want to remove it at some day, for that we are currently trying to simplify things, not make them harder, so the project stays manageable, crash-free and desync-free.

Another thing - floating bugs especially desyncs will be much harder to reproduce when stuff gets multithreaded. It is much harder to find black cat in a dark room when you are not 100% sure it is really there - i.e. if desync was caused by game logics or by heap layout on all those 'new' operators, or now add threading here - just one more monster under the sofa

Taking in mind that without my belts cheat threading would add only 10-15%, and only on uber-factories with 3-4Gb gamestate which not everybody can afford, we don't know if it gets merged in master branch or not. At least in that form. Heck, I even had a desync due to threaded pollution was accumulated in form of 0.5+0.5+100000 instead of 100000+0.5+0.5 (actual numbers were bigger).

Now on what BenSeidel proposes. I guess misunderstanding in this thread comes from two false assumptions some of which were noted along pages. 1st one is that entities are quite decoupled. This is not true - situation with chest and two inserters is the best example, so every time you drop something into "WRITE" state, you have to actually add to a buffer of changes. We understand that collisions will not happen all that frequently, but you never know if collision will happen or not, so you much put a lock every time you drop something. All those XCHG might kill performance very quickly, so per-entity (or more precisely per-activity) locks are not an option. Getting some common inserter-chest thing will quickly amass your entire factory in one such connected monster because inserter depends on chest and on assembly machine nearby which depends on another inserter which depend on belt which depend on belt which depend on belt which depend on belt which depend on belt which depend on belt which depend on train which depends on mining drill on the other end of the factory. Well, not quite like that of course but you get the idea.

As for clock - you are probably making analogies with fluid processing and stuff like that. These are situations which probably may be applied to biters in some form, but all those automatic crafting cases which factorio is about mostly require sequential processing during one single tick. The best thing that could be done with them is decopling thir processing at chunk bassis, which already was done long ago. Even here - with belts the cheating was not only in that belts are gonna be rewritten into O(1) algo not dependent on number of items on them, the cheating was also in that I had to limit belt recursion depth to like 30 tiles instead of 1000, so they do not span a chunk away. This causes sporadic belt decompression at chunk boundaries when forward piece has moved, after previous one has already been updated.

If you compare Factorio to VHDL - think of it as of getting a sum of two 16384-bit integers. It will take quite a lot of time I suppose for all latches to stabilize, so you'll want to do that in ticks. Or better example - take a scheme for division of such long numbers.

2nd assumption - about old/new state doublebuffering - that scheme could work if we would update every entity every tick, but we don't

Remember what I said in the beginning - we roll 4Gb of gamestate at 30-60 ups. The only way to achieve that is to keep track of active entities, wakeup lists and such stuff - so only part of gamestate gets actually updated. Remember that usual case - assembly machine sopped working - why? oops, out of red circuits. Wtf here? out of plastics. Wtf here? (200 belts away) no oil. Wtf here? (200 pipes away) refinery got ful. Wtf here? Train does not come. Wtf here? it's out of fuel ...... some belt was broken and not rebuilt. Interesting thing about such production dependencies is that they resemble the way factorio tracks what should be updated and what shouldn't be. So all of this starts rolling and ticking only when you put that belt piece, it gets coal, coal gets to inserter, locomotive gets coal, moves, etc... finally that inserter with plastics starts dropping stuff into assembly machine, and it gets its first update tick to produce red circuit. And it estimates that it will finish producing in like 1000 ticks, so it puts itself into another timed-way sleep-list, and for another 999 ticks it's not updated, and bars you see and animation are applied only for rendering. This codebase is like Google Chrome or JVM or Linux Kernel - you name it. It's not "gamedev" in its sense

It's another marvel of entire software industry that puts core i7 to a good use while some shitty messenger takes two seconds just to show me cached picture.

With that activity tracking in mind getting any form of double-buffering would require almost every variable to be double-buffered individually. While it makes sense at some places (for example in circuit networks it works exactly that way), we have many places with dynamic lists for example, especially in logistic network. Now imagine you have list A-B-C. Now you modify it so that B is removed and D is added, so now you get A-C-D. That would require every prev/next pointer to be double-buffered, and explicitly copied from old to new state before tick starts. Or tracking those variables that changed, and making copy just over them, which eliminates performance of fast-flip and requires some interlocked operations to fill these lists from different threads.

As for copy-on-write - it's quite hard to implement when you have tons of stuff referring each other with random pointers. We're not dealing here with particle dynamics, we're dealing with RB trees, lists, voroni triangulation and things like that. We are thinking sometimes about using copy-on-write for latency hinding to have two gamestates at once with difference of a few megabytes and updating only a little around the player, but I'm afraid OS won't be able to handle that at 60 ups. Regarding threading instead of copy-on-write I was sometimes thinking about consolidate-on-read, but that would require frequent locking and even longer Einstein hair on programmer heads. It's not something that could be done once, it's something that gives hard to maintain contract on everyday coding.

So, to conclude - we'll get back to threading when it's attempted for biters and trains, and when we will get assured that we cannot suck it all from RAM at 20gb/sec from single core by reordering data with smart prefetches. So far it saturates at 1.5-2.5 cores which is not so far from bringing it to single core. Think of incomplete task manager green bar as of because we don't do stupid things on ALU

as a bonus you get free alt-tab and non-lagging browser. For those who will still want full-green taskmanager with 100% cpu usage we may add "connect SETI@home" checkbox.

Adil · Post by **Adil** » Sat Jan 21, 2017 5:31 pm

Harkonnen wrote:That means Factorio is memory-bandwidth limited, not by ALU power which multicore solutions actually give.

Guess it's time to add reaction balance calculations to nuclear reactors.

BenSeidel · Post by **BenSeidel** » Sat Jan 21, 2017 7:13 pm

Well "NotABiter", you bring up some interesting points, and in general, I agree with them. (Or are you being sarcastic?)

NotABiter wrote:6 years of researching and yet...

yet what? we are currently implementing the last few features and gearing up for the commercialisation of the software. Is that what you are asking? please clarify.

NotABiter wrote:You might want to try to learn how to use google.

Java has a class dedicated to "clocks" (cyclical barriers):
https://docs.oracle.com/javase/7/docs/a ... rrier.html

Make that two classes:
https://docs.oracle.com/javase/7/docs/a ... haser.html
(paper with benchmarks here: http://www.cs.rice.edu/%7Evs3/PDF/SPSS08-phasers.pdf)

I am vaguely aware of them, yes. IIRC they fill specific roles in the synchronisation of the timing of the threads, not the data, much in the same way as the fork-join framework operates. If this is the case then neither class offers a way to allowing the reading and writing of shared state without an additional locking mechanism. The technique I have described is not about the synchronisation of the threads, but the access of the values using the double buffer. Think of it as inter-process vs intra-process. Also, I have not claimed that this is my technique, I simply borrowed it from the hardware designers as this is what they have been doing for decades now. I have only claimed that I have not seen anything like this in my travels.

NotABiter wrote:I'm not sure I can wrap my head around your level of delusion.

Thanks!
But essentially it is the same process that GPUs go through. The Vertex shader functionality comes out, some games support it, some don't. Only when games start requiring it do you see that card's aren't released without it. (And yes, this is the way it goes, I had this issue with a brand new card and took it back because of this error). Same with CPUs. Once games REQUIRE 16 cores, Intel will STOP making 8 cores. There has always been a lockstep between software requirements and hardware standard features. If there was not then you would never see a feature die out because it was not adopted by the industry. You seem to have read my post something along the lines of "If you make a game require 16+ cores then CPU manufactures will start making 16 core CPUs".

NotABiter wrote:Intel didn't start making multi-core systems because "software developers lead", they did it despite the fact that software developers were (by and large) not even ready for it

Is that a joke? Both Unix and Linux have been multicore since the dawn of time, not to mention that WindowsNT was out back in 1993. Programmers have been multithreading for a very long time. Intel made dual-core CPUs standard because that pesky OS kept using a thread for itself... darn thing. So if they could keep the OS's threads from running on the same core that the program is running on then you will see an improvement in performance. So now having two cores as standard gives your game 1 core at 100%. But now you have a second core sitting idle, so you pull the renderer and some AI tasks off to it's own thread, now you are using 2 cores, with that pesky OS still taking one from you every now and then. So Intel jumped to 4 cores as standard...

CPU manufacturers have pretty much have been reacting to the software available on the market because speeding that software up will sell chips. If they see that the CPU is doing a large amount of floating point matrix multiplications then they will produce instructions that cater directly to that task. I'm not saying that they have not done CPUs with a large number of cores, I'm saying that their consumer grade gear is driven by consumer grade software. So if we start producing games that run on 8 or 16 cores then Intel won't be producing any more dual-core chips. Or does supply and demand not play a part in the market place?

Now look at it from the other side and game developers started using two cores because some (as opposed to all) hardware had two cores, just the same way you are seeing games that now use 4 core because some hardware has 4 cores.

NotABiter wrote:With today's tools

So we agree that writing multithreaded code is not hard, but writing it with the currently available tools & techniques is?

NotABiter wrote:Optimized code of this nature really requires language/tool support to be feasible at all for most "mainstream" programmers

Yep, it's a tooling issue.

NotABiter wrote:Right. It couldn't possibly be that we make rational decisions based on actual knowledge (including first-hand experience)

Once bitten, twice shy?

NotABiter wrote:Programmers are all just irrational lazy cowards

Yes, all humans are. I thought this was common knowledge.

NotABiter wrote:modern CPUs provide instruction-level parallelism not because programmers are too lazy but because there is no other way in existing CPUs for programmers to efficiently express/implement such micro-parallelism

I agree this is an issue. I once attended a talk by a guest lecturer at my uni who was advocating the introduction of a micro-parallelism instruction that essentially would cause a light-weight fork-join in your application that was entirely based in hardware. IIRC it did it by allowing one core to start up another core and set it to begin execution at the specified instruction. I don't know where it got to or how far the research went, nor do I even remember the name of the guest lecturer, it was well over a decade ago now. It's quite interesting to think how such an instruction would synergise with hyper-threading or if it could be implemented using an instruction-interleaving methodology similar to what the out-of-order processor tries to do (in so much as the OoO processor tries to determine what instructions can be reordered to allow the interleaving).

And my point was that if the software community wrote software that used 100's or 1000's of threads then you would see less effort on the current types of CPU optimisations, instead more effort on things like increasing memory bandwidth or cross-core communication algorithms. Are you aware of the research done into multicore optimisations that are specifically targeted at algorithms used to determine if it's faster to move the required data from another core to the requesting core vs moving the current threads state & instructions over to the core that has the data? It's these types of algorithms that are not currently considered viable simply because of the way we write software.

NotABiter wrote: (Are you offering to rework all of the code that benefits from that hardware so it's no longer needed? Didn't think so.)

It's on my list of things to do, well in a roundabout way. I believe that a static transformation exists so that you can get currently single threaded code and run it across multiple threads. Once I have some time (and more importantly money) I'll be giving it a crack as it's something I am really interested in. (I realise it may not be possible. If anyone has a proof that it's impossible then let me know as it will save me alot of time).

The Mill concept is an interesting one, thanks for the link. When I have had time to digest it I may make a comment. MeduSalem seems to have some valid points, so it will be interesting to see how it all turns out.

Harkonnen, thanks for the massive dump of info! The things that you guys have done with the game are amazing!
But, I'm only asking these questions because you begin the paragraph with:

Harkonnen wrote:Now on what BenSeidel proposes.

So I am assuming that you are speaking directly about the clock-controlled double buffer. If not and you are talking about the system you use, then disregard my comments.
In that paragraph you say:
[/quote="Harkonnen"]We understand that collisions will not happen all that frequently, but you never know if collision will happen or not, so you much put a lock every time you drop something.[/quote]
The entire idea of this method is that you don't need variable-level locks because the buffer pointers & offsets are constant for the duration of the clock-cycle. The only time you should need a variable level lock is when you want to access mutable data across the memory bounds. This should only occur because including the data in your "execution context" would cause too many to merge into one. An example of this would be the "logistics chest list" where you store a list of all the chests that have iron in them. If you included that list in the "execution context" of one chest then you would end up having one "execution context" containing all the chests, causing you to processes them sequentially.

Harkonnen wrote:All those XCHG might kill performance very quickly

as previously said, no locking synchronisation, so no XCHG instruction. No any need for any atomic instructions such as any of the compare-and-somethings. The offset and pointers to the read and write buffers are constant over the duration of the clock controlled process so there is no need to synchronise the values over the cores.

And one question about this:

Harkonnen wrote:Getting some common inserter-chest thing will quickly amass your entire factory in one such connected monster because inserter depends on chest and on assembly machine nearby which depends[...]

I was under the impression that the action of an inserter within a frame only depends upon the entity (or ground) that the inserter is trying to pick up/drop off on. There are a series of frames where the inserter has to swing from one to another, so it cannot affect or be affected by both the pickup and drop off entity at the same time?

Ok, you have the case where an inserter won't pick up until the assembly machine has an opening for the item, but that is based on the current levels of goods in the machine, not the future amount of goods in the machine right? I don't really want to get too bogged down on exact details, I just think you have misinterpreted where the context bound would be or that you didn't see the part about the inserter being able to move from one "execution context" to another between update cycles. "Execution contexts" don't have to be static, they can be created and deleted as needed and various tasks can move between contexts when the need dictates.

Harkonnen wrote:so every time you drop something into "WRITE" state, you have to actually add to a buffer of changes

The buffer should always be there, hence the idea that a COW style may be a performance enhancement for some scenarios.

It's 3am here and this post has kept me busy for way too long. If there are any issues with it I'll fix it tomorrow (or more accuratly, today)
p.s. I love it when a new perspective comes into a conversation

hansinator · Post by **hansinator** » Sat Jan 21, 2017 9:08 pm

Harkonnen wrote:In modern times memory bandwidth is the main limiter which is around 20Gb/s in hardware where factorio targets itself.

I've heard that before from a dev that memory bandwidth is the limiting factor. I believe you are just guessing, because I have numbers that show a different situation. I ran some benchmarks and used Intel's PCM tools to measure memory bandwidth untilization. PCM tools directly query performance counters of the memory controller, so it is very precise.

This is what SiSoft Sandra measures what is possible on my system on average:

Along with the PCM tools output (peak performance I've seen):

Code: Select all

|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  23934.27                  --|
|--                  System Write Throughput(MB/s):  26305.86                  --|
|--                 System Memory Throughput(MB/s):  50240.13                  --|
|---------------------------------------||---------------------------------------|

That's 50GB/s memory bandwidth.

Now here's what Factorio looks like on the same system:

It doesn't even come close. Factorio uses roughly 6% (3GB/s) of my available memory bandwidth and roughly 8% CPU. So it can not be the memory bandwidth & not the CPU integer/float performance. The game itself just seems to be unable to make efficient use of the available resources.

Just for comparison, this is what the 7-zip benchmark, a real-world workload, achieves:

Code: Select all

|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):   9802.64                  --|
|--                  System Write Throughput(MB/s):   3953.46                  --|
|--                 System Memory Throughput(MB/s):  13756.10                  --|
|---------------------------------------||---------------------------------------|

Yoyobuae · Post by **Yoyobuae** » Sat Jan 21, 2017 9:32 pm

hansinator wrote:It doesn't even come close. Factorio uses roughly 6% (3GB/s) of my available memory bandwidth and roughly 8% CPU.

Not all bandwidths are made equal.

Sequential read/writes are always going to be much higher than random read/writes. Random memory access patterns can in fact make memory 10x slower.

But you know how the world works. Sky high sequential read/write benchmarks look great in marketing material.

As for CPU utilization, Factorio is fully utilizing one core. Exactly as expected from a (mostly) single threaded program.

Harkonnen · Post by **Harkonnen** » Sat Jan 21, 2017 9:35 pm

That tool appears to be summing read throughput with write throughput. So it's around 25Gb/sec for read or for write, which is what I meant. As for Win10 task manager - it shows amount of memory used, not throughput

Which means you have there around 2GB gamestate (1-1.5Gb is taken by DirectX). CPU usage - it shows % of all logical cores. So if you have 4-core with hyper-threading, or 8-core without one, 12.5% would mean 100% usage of one single core. Not just ALU of course, waiting on cache misses is also included here.

As for 3gb/sec while factorio is running - I wonder if it accounts for actual reads or for whole 64-byte cache lines that were loaded. If that's rate of actual data used - I would believe that number.

hansinator · Post by **hansinator** » Sat Jan 21, 2017 10:12 pm

Harkonnen wrote:That tool appears to be summing read throughput with write throughput. So it's around 25Gb/sec for read or for write, which is what I meant.

You are interpreting that wrong. Yes it sums up reads & writes, because the sum reflects the total memory bus bandwidth that is currently used by the system. The tool (and the output that I've copied) actually lists the read and write throughput separately.

Here's the peak write performance measured with Aida64:

Code: Select all

CPU	CPU Clock	Motherboard	Chipset	Memory	CL-RCD-RP-RAS	Write Speed
6x Core i7-6850K HT	3800 MHz	[ TRIAL VERSION ]	X99	Quad DDR4-2133	15-15-15-35 CR2	60138 MB/s

60GB/s

And the peak read performance as measured by Aida64

Code: Select all

CPU	CPU Clock	Motherboard	Chipset	Memory	CL-RCD-RP-RAS	Read Speed
6x Core i7-6850K HT	3800 MHz	[ TRIAL VERSION ]	X99	Quad DDR4-2133	15-15-15-35 CR2	48279 MB/s

48GB/s

As you can see the bandwidth is around the same for read only, write only or read/write (copy). So the Intel tool is doing nothing wrong when it sums up read/write. The theoretical maximum is around 67GB/s.

Harkonnen wrote:As for Win10 task manager - it shows amount of memory used, not throughput Which means you have there around 2GB gamestate (1-1.5Gb is taken by DirectX).

And that is why I included it, to show how much memory Factorio currently uses. Edit: for reference of course, because the numbers wouldn't say anything if Factorio just used a couple 100 megs of RAM..

Harkonnen wrote:CPU usage - it shows % of all logical cores. So if you have 4-core with hyper-threading, or 8-core without one, 12.5% would mean 100% usage of one single core.

And that is why I included the plots of all 6 cores (12 with HT). You can see that one plot reaches almost 100% - for the time that Factorio has been running. However, you can also see that the CPU frequency is not at it's maximum. It's around 2,3GHz but it can reach 4GHz with turbo-boost, so the game could make even more use of that one core. Edit: the CPU freq is being auto-scaled there, it's not that I manually under-clock it..

Harkonnen wrote:As for 3gb/sec while factorio is running - I wonder if it accounts for actual reads or for whole 64-byte cache lines that were loaded. If that's rate of actual data used - I would believe that number.

It accounts for the actual bandwidth that is used by the memory controller. That includes all memory accesses but excludes data that is fetched from the caches.

It still seems that the game does not make efficient use of the available resources.

Edit: this page explains the tool https://software.intel.com/en-us/articl ... er-monitor
Edit2: The Intel tool is no benchmark, it precisely measures what the current system memory bandwidth utilization is!
Edit3: Just to clarify it more.. the sum of read/write is the sum of reads and writes taking place at the same time. They were not measured separately.

Post by **posila** » Sat Jan 21, 2017 10:36 pm

That to me just confirms how badly we need to optimize memory access patterns. Your CPU is chilling on 2.3GHz because it has nothing better to do than to wait on memory fetch. If we managed to make memory accesses more sequential, you'd see higher memory throughput and possibly higher CPU clock speed.

EDIT: Intel PCM looks like really cool tool, we should check it out. Thanks for mentioning it.

Harkonnen · Post by **Harkonnen** » Sat Jan 21, 2017 10:52 pm

As for 60 Gb/sec - well, XEON is not common consumer product

it's server die. They may even have semi-NUMA-alike tricks I mentioned above. 20 Gb/sec is kind of low-end for now, probably I should update my expectations to 30 Gb/sec. But so I need to do about average Ghz of ALU which has shifted from 2Ghz to 3Ghz in past years. Another anomaly I see - Windows does not try to spread the load between different cores which I frequently saw on many laptops and on regular desktop processors. Important digit here is 3Gb/sec which makes me thinking that threading I mentioned earlier will scale much better on XEONs, scaling linearly for up to 4 cores perhaps.

hansinator · Post by **hansinator** » Sat Jan 21, 2017 10:56 pm

posila wrote:That to me just confirms how badly we need to optimize memory access patterns. Your CPU is chilling on 2.3GHz because it has nothing better to do than to wait on memory fetch. If we managed to make memory accesses more sequential, you'd see higher memory throughput and possibly higher CPU clock speed.

EDIT: Intel PCM looks like really cool tool, we should check it out. Thanks for mentioning it.

I'm glad that someone sees that memory bandwidth can't be the culprit in all cases. But optimizing such stuff is very hard. It's not that I would expect Factorio to make optimal use of the available resources. For most people it seems to be okay as it is. I'm just someone who likes to measure stuff and see if claims hold up. For me it is okay that way and I'm already thrilled by all the optimization work that Rseding91 does. Though a little extra performance is always a good thing, so I won't say anything against optimizing memory access patterns

Harkonnen wrote:Important digit here is 3Gb/sec which makes me thinking that threading I mentioned earlier will scale much better on XEONs, scaling linearly for up to 4 cores perhaps.

There we can only speculate until it would be done but I assume the same.

BenSeidel · Post by **BenSeidel** » Sun Jan 22, 2017 12:41 am

hansinator wrote:Important digit here is 3Gb/sec which makes me thinking that threading I mentioned earlier will scale much better on XEONs, scaling linearly for up to 4 cores perhaps.
There we can only speculate until it would be done but I assume the same.

I can confirm that Xeons offer a better scaling with memory throughput. While there is not a 1:1 ratio between cores and memory controllers, there is a x:nx ratio. (I believe that 4 cores share one memory controller, with about a 90% chance I'm wrong). Anyway, page 5 of https://sp.ts.fujitsu.com/dmsp/Publicat ... -ww-en.pdf has a good diagram of the memory arciteture of higher-core CPU's while the next page has a lower core count memory arciteture diagram.

Post by **Rseding91** » Sun Jan 22, 2017 3:15 am

hansinator wrote:
Harkonnen wrote:In modern times memory bandwidth is the main limiter which is around 20Gb/s in hardware where factorio targets itself.
I've heard that before from a dev that memory bandwidth is the limiting factor. I believe you are just guessing, because I have numbers that show a different situation. I ran some benchmarks and used Intel's PCM tools to measure memory bandwidth untilization. PCM tools directly query performance counters of the memory controller, so it is very precise.

No guessing involved at all: https://www.reddit.com/r/factorio/comme ... ed_fpsups/ http://imgur.com/a/QJiHE

justarandomgeek · Post by **justarandomgeek** » Sun Jan 22, 2017 4:59 am

Harkonnen wrote:Another anomaly I see - Windows does not try to spread the load between different cores which I frequently saw on many laptops and on regular desktop processors.

The Intel/Asus utilites that came with my current board have options to do this when OC'd because they clock the individual cores differently when it's primarily serving low-threaded loads - it tries to push the heavy threads to the higher clocked cores, and lighter threads to slower cores. Possible there's something like that at play here.

NotABiter · Post by **NotABiter** » Sun Jan 22, 2017 5:20 am

MeduSalem wrote:It is VLIW or at least very VLIW-alike.

Mill is not a VLIW architecture. I don't think Godard can state it any more clearly or emphatically:

Ivan Godard wrote:ops from one instruction are not issued all together the way they would be in a VLIW. That VLIW, if it said 'const add store', all of those are happening all at once in the same cycle. That's what a VLIW does, it's what it is. Not true in a Mill.

Yes, Mill has rather big instructions, so even wikipedia says "Mill uses a very long instruction word (VLIW)-style encoding" (though even there I would take exception because even Mill's encoding has significant differences than any VLIW I know of), but instruction encoding is not architecture. If you were to change a modern x86's instruction encoding without changing how it issues and executes instructions (but updating its decoding logic for the new encoding of course), it would still be an out-of-order superscaler architecture, not a VLIW, not a stack machine, not a DSP, etc.

MeduSalem wrote:Die space isn't really a concern anyways anymore nowadays or otherwise they wouldn't pack 24mb of L3 cache on a chip or produce 500mm² GPU chips.

Die space is an issue because die space costs money (both by using more silicon and by decreasing yield) and because taking up space on the critical path (instruction issue is on the critical path) potentially adds latency (costing clock speed and/or cycles). That the impact of L3 cache on performance (and sales) is worth the die space does not alter these facts. Furthermore, L3 cache does not incur the same costs -- it's not on the critical path so doesn't affect clock speed (and whatever its cycle time is, it's still faster than main memory). And L3 cache likely has little/no impact on yield because L3 cache can be easily made redundant due to its regularity (i.e. with defective subblocks being disabled rather than having to toss out the entire core/die).

Most VLIW compilers from the past sucked at what they should have been good at... and if you say the Mill's compiler is dumber than that it will face a rough time.

Maybe try (again) to watch/understand the videos? (Maybe you need coffee? Smokes?) Mill isn't VLIW. Subsequently, the Mill compiler isn't faced with the horrible non-options that VLIW compilers have to deal with. One example of the differences is covered in the "phasing" section (8:54..40:21) of the "Instruction" video. He explains in detail, with side-by-side VLIW vs Mill examples, how a VLIW architecture, anytime you're near a branch, basically says "Screw you compiler! Screw you performance!" whereas the Mill architecture does not. Good scheduling for VLIW becomes impossibly hard in that case - you have to (if it's even possible) do things to programs that compilers shouldn't ever have to do to programs. It's like some dummy wrote a bubble sort and now the only way the compiler can fix it is to recognize the crap code for what it is and completely rewrite it as a quicksort. In the case at hand it's the VLIW architect that's the dummy - they're the ones that made hardware that's impossibly hard to compile to. Like I said, the Mill designers are crazy smart, not dummies like those that sunk billions into "The Itanic".

And "it will face a rough time"? How so? VLIW compilers only get p0wned by hand-written code because making the code efficient for VLIW is just too complex. That's not true for Mill. The Mill compiler (which already exists/works) seems to be doing just fine:

Ivan Godard wrote:What I have done is I have compared this to optimal hand schedules, and you really got to work to find something better than what this will produce.

MeduSalem wrote:That is also why I think that without Out of Order Execution nothing is really able to maintain high performance nowadays

Which you say even though the Mill, in simulation, running actual code (not stuff they made up), compiled with their compiler, completely blows away x86 in terms of operations per cycle.

My bet is that this will be the reason for why the Mill won't perform as well for general purpose stuff as they expect.

They're already running general purpose code, compiled with their compiler, in simulation. They know exactly what the performance is for "general purpose stuff". (And they've told us that it's 6+ ops/cycle, which is really quite good.)

But that also means it has less physical registers to work with in general, meaning it has to access RAM more often since it can't sidestep to another (renamed) register.

If you actually watched/understood/remembered the videos... (There's more memories in the Mill than just "the belt" and caches, and they are doing clever things with them.)

If you would want the code to profit from a longer "belt" in a newer processor the code would have to be recompiled for that machine

Well, golly gee whiz! You don't say!

Maybe if you had watched/understood/remembered the videos you would know that they do exactly that, already. Compilation is two phase. The first phase takes C code and generates a generic Mill binary for distribution. That binary can be installed on any Mill machine. When you actually install it on a given machine, that generic Mill binary is compiled (they call it "specialization") to the specific Mill chip in the machine. (Each Mill machine not only has differences like belt size, they have completely different instruction encodings - from day 1. They do that to get maximal efficiency, not wasting encoding space on instructions/resources a given Mill cpu doesn't even have. That also means when they add new instructions in the future they will have zero issues with "making room" in the instruction encoding space or ending up with monstrous encodings like x86. BTW, their instruction encodings are machine created/optimized, not hand-made. Their compiler automatically adjusts to whatever instruction encoding a given machine has.)

So there are two possible ways I can think of how he is doing it in the background:
1. ...
2. ...

Haven't you learned yet that, with the Mill, there is always a third (better) way?
He covers this in one of the videos. Just going from my (slightly fuzzy) memory: They move such memory around in the background with a sort of "always running hardware task". Such management operations are not near top-of-"stack" (they just let that memory sit there - with any luck they never have to move that) and normally have no direct linkage to call/return. Instead memory at the "interface" between "what's on chip (dedicated memory, not cache)" and "what's in memory (cache)" is moved around in a fashion to try to avoid ever having to stall, and making use of idle memory-system cycles whenever possible.

EDIT: I forgot to mention that I read that article on VISC. Their results have (literally) been adjusted nine ways from Sunday which even anandtech says "means that the data should be thrown out entirely", so I'll just ignore those. I'll also ignore the instruction rewriting (simply because it's not interesting and is neither good nor bad). The main thing they've done is make a larger superscaler by ganging a bunch of smaller superscalers together and adding a front-end (if I read that write). They're kind of light on the details, but it looks like they had to add a lot of pipeline stages to make that happen. (Doing it in one stage would of course be too expensive - like I mentioned before that imposes quadratic costs.) So in less-branchy code I would expect them to be able to engage more cores (perform better) and in more branchy code I would expect them to suffer some P4-like mispredict costs (perform worse). It looks like their goal is low power? That's certainly achievable with more effective parallelism. (It's common knowledge that more effective parallelism allows same performance at lower clocks allows lower voltage resulting in reduced 0.5*f*v^2 = power.) Can't say I'm too excited about it though (mostly because it's old tech and not very exciting, plus doesn't seem aimed at "higher end consumers" aka me).

NotABiter · Post by **NotABiter** » Sun Jan 22, 2017 5:41 am

Rseding91 wrote:
hansinator wrote:
Harkonnen wrote:In modern times memory bandwidth is the main limiter which is around 20Gb/s in hardware where factorio targets itself.
I've heard that before from a dev that memory bandwidth is the limiting factor. I believe you are just guessing, because I have numbers that show a different situation. I ran some benchmarks and used Intel's PCM tools to measure memory bandwidth untilization. PCM tools directly query performance counters of the memory controller, so it is very precise.
No guessing involved at all: https://www.reddit.com/r/factorio/comme ... ed_fpsups/ http://imgur.com/a/QJiHE

But what you're linking sure looks like a guess to me. I mean, yes, you're seeing a performance difference when you change memory speed, but you're still guessing about why. When you change memory speed you are changing both your memory latency and your memory bandwidth, so you don't know which (or what mix) of them is responsible for the change in performance.

The tool hansinator is using clearly (no guesswork) shows that you're only using a tiny fraction of total available memory bandwidth. Now it's still possible that for some small fraction of the frame time you are memory bandwidth bound, but for at least the vast majority of the frame time you're clearly not.

That is to say, it looks like your performance goes up with faster memory entirely or almost entirely due to decreased memory latency, and not due to increased memory bandwidth. That (combined with hansinator's CPU core not being maxed) indicates you are doing a lot of non-sequential memory accesses, and the resulting memory stalls are eating a lot of your time.

NotABiter · Post by **NotABiter** » Sun Jan 22, 2017 7:39 am

BenSeidel wrote:Or are you being sarcastic?

Mostly serious. (Paragraph starting "Right." was sarcasm, hopefully obviously so.)

NotABiter wrote:6 years of researching and yet...
yet what?

1. Note that "..." means "continues".
2. Go to my post.
3. Take a look at what comes immediately after the "...".
I.e., I was pointing out that you claimed 6 years of research in this area, yet still made the claim:

But anyway, Clocks (and clock boundaries) are an extremely powerful technique that I have never seen applied in software.

which, well, seems somewhat odd in the face of Java having built-in library classes whose sole purpose is to implement clocks (aka barriers aka phases).

I am vaguely aware of them, yes. IIRC they fill specific roles in the synchronisation of the timing of the threads, not the data, much in the same way as the fork-join framework operates.

I was just pointing out that while claiming you haven't in 6 years of research seen others using clocks, and while implementing "from scratch" clocks/barriers in your example code (using .wait() and .notifyAll() on lock objects), there already existed, in the very language you were using, built-in classes for doing barriers. I.e. barriers seem to be a lot more widely used than your "I have never seen applied in software" statement would seem to suggest. (I was not saying that such classes deal with the isolation part of the solution, just the barrier part.)

And yes, this is the way it goes, I had this issue with a brand new card and took it back because of this error

So, you buy low-end crap. You freely admit you're part of the problem!

Once games REQUIRE 16 cores, Intel will STOP making 8 cores.

So you used the word "lead" when you were really talking about "the trailing edge of technology". (That's why it was unclear to me.)

Is that a joke? Both Unix and Linux have been multicore since the dawn of time, not to mention that WindowsNT was out back in 1993.

No joke. Systems developers being able to do things like multithreading (and write in assembly, and write to bare metal, etc.) does not mean you can generalize to non-OS software developers. And you can't both argue that software devs aren't ready to do parallelism because they don't understand it or are afraid of it or whatever AND simultaneously argue that they are ready for it. (If you wish to do that, please feel free to argue directly with yourself and leave me out of it.)

CPU manufacturers have pretty much ... now use 4 core because some hardware has 4 cores.

OK, but now you're no longer talking about "leading the rear" but "leading the front", which is fine, but is confusing when you just switch subjects mid-post. This kind of leading is what was talking about with the RPG example.

So we agree that writing multithreaded code is not hard, but writing it with the currently available tools & techniques is?

Yes, if you're talking about non-trivial and efficient cases. (Trivial and inefficient is of course pretty easy right now without any additional tools, but usually not particularly useful.)
Even with the compiler I'm working on, I'm really only shooting for "good enough" parallelization. I'll still be leaving a lot on the table. (I.e. even with my compiler, creating really efficient parallelized code will still be in the "difficult" category, at least until such time that its capabilities are extended well beyond what I currently have in mind.)

And my point was that if the software community wrote software that used 100's or 1000's of threads then you would see less effort on the current types of CPU optimisations

(I'm assuming you don't mean applications just go using threads haphazardly (i.e. lots and lots) because that will kill performance just as bad or worse than not using enough threads.)
What you're talking about depends on what the "mix" is. I.e., even if there are many programs that are properly parallelized, if there are still enough important programs that are not then the concern about single-threaded performance will continue to some extent.
And what you're asking is kind of really difficult, even with what many would call "proper tools". That's because even proper tools can (at best) only generate "known good code" for hardware that actually exists. Just because something runs really well with 8 or 16 threads on an existing 8 or 16 core system, that's no guarantee that when it fires-up on a 1,000 core box and launches its 1,000 threads that it's going to work well at all (performance wise).

Are you aware of the research done into multicore optimisations that are specifically targeted at algorithms used to determine if it's faster to move the required data from another core to the requesting core vs moving the current threads state & instructions over to the core that has the data?

I've done some work in that area in my previous job (except there it was not from one core to another but from one machine in a cluster to another).
For my current endeavors, that's beyond what I'm trying to accomplish - I'm doing physical-type modeling which means I've got good locality and it always makes sense for any (commonly touched) data belonging to objects moving in the model-space to also move in physical memory (when a boundary is crossed).

I believe that a static transformation exists so that you can get currently single threaded code and run it across multiple threads.

"a static transformation"? Or "many different static transformations, depending on the code"?
And certainly not all code (some just isn't parallelizable), right?

Anyways, very superficially it sounds similar to what I'm doing. But my goals are, um, more sane, and I'm already working on it. (Basically it's about letting the developer indicate where isolation should exist and what kind of isolation should exist, and then the compiler uses that knowledge to safely parallelize the code, with isolation checked/proven as required to ensure the parallelized code has the same semantics as the original non-parallelized code. In some very easy cases the compiler may be able to do it without any help, with the number of such cases likely expanding over time.)

Post by **Rseding91** » Sun Jan 22, 2017 8:00 am

NotABiter wrote:But what you're linking sure looks like a guess to me. I mean, yes, you're seeing a performance difference when you change memory speed, but you're still guessing about why. When you change memory speed you are changing both your memory latency and your memory bandwidth, so you don't know which (or what mix) of them is responsible for the change in performance.

The tool hansinator is using clearly (no guesswork) shows that you're only using a tiny fraction of total available memory bandwidth. Now it's still possible that for some small fraction of the frame time you are memory bandwidth bound, but for at least the vast majority of the frame time you're clearly not.

That is to say, it looks like your performance goes up with faster memory entirely or almost entirely due to decreased memory latency, and not due to increased memory bandwidth. That (combined with hansinator's CPU core not being maxed) indicates you are doing a lot of non-sequential memory accesses, and the resulting memory stalls are eating a lot of your time.

... yes. That's what I've been saying. I don't know where you got the idea that when I say "We're limited by RAM speeds" is memory throughput. It's both: tons of random access of little bits (latency) and a very small amount of throughput. Additionally that throughput listing for a given processor is the *maximum* throughput of the processor - not the throughput of a single thread doing random access - which is what the vast majority of Factorios update cycle is doing.

As Harkonnen said virtually all of the update cycle is linked to the rest in some small way which is why we can't just drop threads onto some portion of it without having to fundamentally change the update cycle with a bunch of small hacks to prevent concurrent modifications.

For instance: inserters.

If we group all objects that can be changed by running update() on a single inserter you essentially get the entire list of all updatable entities because:

Adding/removing an item from any entity can trigger the alarm() method if the entity is wakeable (virtually everything an inserter interacts with is)
Triggering alarm() on an entity will trigger alarm() on any other inserters waiting on that entity for some reason
Triggering alarm() re-activates the entity which puts it back in the updatable entities list
Moving items in or out of an entity that's part of the logistic network alters the logistic network item counts which all entities part of that network use
Moving items in or out of an entity that's part of a circuit network alters the circuit network item counts which all entities part of that network use

And the final "nail" for doing inserters is the "slow" parts are those interactions with other entities because they're *always* on some other cache line (if they're even in the cache) and the CPU has to wait while the required piece of RAM is loaded. The rotation between pickup and dropoff simply rotates a vector towards a pre-calculated position (as of 0.15) and takes trivial amounts of time to run compared to something like "drop the item on the ground" which has to check if it collides with anything at that location and allocate memory for the new item entity + place it on the ground and link it in with the rest of the entities on the chunk.

Factorio Forums

Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications