Funny that the Mill shows up here.

NotABiter wrote:
> Anyways, if you want to see a much better direction that processors might go, check out some of the Mill CPU talks. That thing is insanely better than x86/ARM/MIPS/etc. Much better performance, much better ILP, with less "overhead" hardware (it gets better ILP than an x86 using dynamic instruction scheduling, and does so using static instruction scheduling, so it needs no dynamic instruction scheduling hardware - how they do that is explained in the videos), and they claim to need far less energy/operation. (All claims except the last are "obviously true" just from the information in their jaw-dropping videos. The energy efficiency claim will be proven one way or the other once they're running on real silicon. These dudes have some serious brains and are making Intel/AMD/ARM all look kind of stupid by comparison.)
>
> Do you know what the best part is? With their top end being able to dispatch 32 "instructions" (which they call "operations", but they are similar in work-amount to instructions on "normal" CPUs) every single cycle per core, they will make sticking with single-threaded applications viable even longer into the future.
While "belt machines" (which is how it would be categorized correctly) have some interesting ideas... the Mill suffers from several fatal design flaws in my opinion:
- Basically a VLIW instruction set architecture with too much dependence on Compilers being smart
That sucks really, really hard... and was among the reasons why IA-64 (Itanium etc.) failed (apart from the lack of proper x86 support and pragmatic software developers not feeling like porting their software over).
While looking really good on paper (so did IA-64) VLIW designs just don't scale very well in practice because they are hard-designed around certain constraints, like the number of instructions encoded in each bundle. If you widen the VLIW pipeline, older software cannot really profit from it; everything needs to be re-compiled, or the pipeline sits ill-utilized on the newer micro architecture. Now go and tell software developers that they have to re-compile every time you release a new processor line with a wider pipeline, or face a truckload of consumers complaining that their legacy software doesn't perform any better on the new processors. Most companies just won't do it because it isn't cost effective. Also think about what happens when a company goes out of business... then there is nobody left to update/recompile the software. Imagine a game from 10 years ago... it wouldn't perform any better than it did back then even if the micro architecture advanced further. Fugly.
The only way to overcome that VLIW limitation is to double, triple, quadruple, etc. the pipeline from the original width so that the fetcher can pick up 2 or 3 of your original VLIWs per cycle. But that again means you have to do dependency checks on the fly, everything against everything, to avoid issuing two instructions that need to be done in a particular order... and that's exactly how classic fixed-/variable-length ISAs are doing it already. So what's the profit of having VLIW then? It's a point where VLIW is just fundamentally flawed and gets its ass handed to it by fixed-/variable-length ISAs.
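Just to make concrete what "everything against everything" means in hardware terms, here is a toy sketch in C (my own made-up operation format, nothing to do with the actual Mill or any real VLIW encoding) of the hazard check a wider fetcher would have to do between the slots of two legacy-width bundles before it could issue them in the same cycle:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy operation format: one destination register, two source registers.
 * A made-up encoding purely for illustration, not any real VLIW's.     */
typedef struct {
    int dst;        /* destination register, -1 if none   */
    int src[2];     /* source registers,     -1 if unused */
} op_t;

#define SLOTS 4     /* hypothetical legacy bundle width   */

/* Returns true if the two bundles can be issued in the same cycle,
 * i.e. no RAW, WAR or WAW hazard between any slot of a and any slot
 * of b.  Note the everything-against-everything comparison:
 * SLOTS * SLOTS checks, growing quadratically with the width.        */
static bool bundles_independent(const op_t a[SLOTS], const op_t b[SLOTS])
{
    for (int i = 0; i < SLOTS; i++) {
        for (int j = 0; j < SLOTS; j++) {
            /* RAW: b reads what a writes */
            if (a[i].dst != -1 &&
                (a[i].dst == b[j].src[0] || a[i].dst == b[j].src[1]))
                return false;
            /* WAW: both write the same register */
            if (a[i].dst != -1 && a[i].dst == b[j].dst)
                return false;
            /* WAR: b writes what a reads */
            if (b[j].dst != -1 &&
                (b[j].dst == a[i].src[0] || b[j].dst == a[i].src[1]))
                return false;
        }
    }
    return true;
}

int main(void)
{
    op_t a[SLOTS] = { {1,{2,3}}, {4,{5,6}}, {-1,{-1,-1}}, {-1,{-1,-1}} };
    op_t b[SLOTS] = { {7,{1,8}}, {9,{10,11}}, {-1,{-1,-1}}, {-1,{-1,-1}} };
    /* b slot 0 reads r1, which a slot 0 writes -> RAW hazard, no dual-issue */
    printf("can dual-issue: %s\n", bundles_independent(a, b) ? "yes" : "no");
    return 0;
}
```

And that comparison matrix is exactly the kind of dynamic scheduling hardware that VLIW was supposed to make unnecessary in the first place.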
Another problem that comes with VLIW is the tremendous amount of code bloat due to having "headers" and having to place stupid NOPs everywhere whenever the Compiler thinks that there is nothing in particular that could be done in parallel. (And as research has shown, even the best VLIW compilers are really bad at finding Instruction Level Parallelism in general purpose programs if the code isn't obviously parallel or helped along by software developers who know their way around assembler.) On top of that, the instruction fetcher within the CPU only has a limited window it can look ahead, so if a huge NOP sled is encoded then the next "real" instruction might fall outside that window and there is nothing you can do about it. The bottom line is... you don't want NOPs in the pipeline... they are inherently a bad design choice, and even thinking about having them should have been illegal because they turn the entire concept of VLIW upside down.
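Just to put a rough number on that bloat, a back-of-the-envelope estimate (all parameters - slot count, op size, header size, average filled slots - are assumptions I picked for illustration, not real figures from any VLIW):

```c
#include <stdio.h>

/* Back-of-the-envelope code density estimate for a hypothetical VLIW
 * bundle format.  All numbers are made up for illustration only.      */
int main(void)
{
    const double slots_per_bundle = 8.0;   /* assumed issue width            */
    const double bytes_per_slot   = 4.0;   /* assumed op encoding size       */
    const double header_bytes     = 4.0;   /* assumed fixed bundle header    */
    const double avg_filled_slots = 2.5;   /* what the compiler finds on
                                              average in general purpose
                                              code - also an assumption      */

    double bundle_bytes = header_bytes + slots_per_bundle * bytes_per_slot;
    double useful_bytes = avg_filled_slots * bytes_per_slot;
    double risc_bytes   = avg_filled_slots * 4.0;  /* same work encoded as
                                                      plain 32-bit RISC
                                                      instructions           */

    printf("bytes fetched per bundle : %.1f\n", bundle_bytes);
    printf("bytes doing real work    : %.1f\n", useful_bytes);
    printf("utilization              : %.0f%%\n",
           100.0 * useful_bytes / bundle_bytes);
    printf("same work as RISC        : %.1f bytes\n", risc_bytes);
    return 0;
}
```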
In my opinion the space wasted on encoding NOPs could have been used for the instructions following afterwards (even if they can't be scheduled for execution yet), so that when the current instruction completes the pipeline already has them ready... with a much smaller prefetching window and increased code density, which in turn also benefits the Cache structure and overall bandwidth.
After watching all of Ivan Godard's videos and reading most of his papers I just don't get why he insists on that VLIW stuff. It's basically proven in practice that it doesn't work out as well as it seems in theory, and that is the major reason why almost everyone, including GPU companies like Nvidia/AMD who used VLIW in older graphics chips, eventually moved away from VLIW to RISC-like designs for their newer chips. The design constraints eventually outweighed the gains (especially since they are also targeting GPGPU stuff).
I know that in theory Compilers for the Mill would be able to encode up to 33 instructions into one VLIW, which is much more than anyone else ever did before, so there probably wouldn't be a need to widen the pipeline for quite a long time... But in reality compilers would have a hard time filling all 33 slots most of the time (especially in the short loops and branchy code that most general purpose code probably consists of)... so it ends up with a lot of overhead introduced through the fixed-size "header" of the VLIW which always has to be there, plus all the NOPs that sit between the actual instructions (even if the NOPs are all combined into one single NOP they are still there and reduce throughput).
The bottom line is that I think a VLIW architecture can't really do anything much better than a classic fixed-length RISC architecture... everything done with a VLIW ISA can also be achieved with a fixed-length RISC ISA, just that the RISC ISA doesn't suffer from the problems described above like having to re-compile and the infamous NOP sleds (at least modern RISC ISAs try to avoid NOP sleds; older ones suffered from the same problem due to the strictness of the classic RISC pipeline).
And if one really wants to go massively parallel then one can't avoid having some sort of SIMD extension anyways... but leave the old fashioned General Purpose instructions in peace for the sake of simplicity and flexibility.

- Static, unchanging cycle-times in the Execution Units for instructions
From what I have read in the papers, the Mill basically assumes that a specific instruction always takes exactly the same number of cycles, independently of the actual micro architecture it is running on. I don't remember exactly why it requires that, but it does for some reason.
That makes changes to the execution units between micro architectures pretty much impossible. But sometimes there need to be changes... some instructions might become obsolete/less useful over time, so you might increase the number of cycles they take on purpose if that allows you to make another instruction perform faster instead. It really differs from micro architecture to micro architecture, depending on how everything is interconnected and how important particular instructions are.
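To make the problem concrete, here is a little simulation of statically scheduled code on an exposed-latency machine (my own toy model, not actual Mill semantics): the compiler scheduled the consumer exactly three cycles after the multiply, and the moment a newer micro architecture takes four cycles instead, the old binary silently reads a stale value:

```c
#include <stdio.h>

/* Toy model of an exposed-latency, statically scheduled machine.
 * The compiler baked in the assumption that a multiply takes
 * COMPILED_MUL_LATENCY cycles; the hardware actually takes
 * hw_mul_latency cycles.  Not real Mill semantics - just my own
 * illustration of why baked-in latencies hurt.                   */

#define COMPILED_MUL_LATENCY 3

static int run(int hw_mul_latency)
{
    int reg[4] = {0, 6, 7, 0};      /* r1 = 6, r2 = 7                 */
    int pending_value = 0;
    int pending_cycle = -1;         /* cycle at which r3 gets written */

    for (int cycle = 0; cycle < 10; cycle++) {
        /* write back a finished multiply */
        if (cycle == pending_cycle)
            reg[3] = pending_value;

        if (cycle == 0) {           /* mul r3, r1, r2                 */
            pending_value = reg[1] * reg[2];
            pending_cycle = cycle + hw_mul_latency;
        }
        if (cycle == COMPILED_MUL_LATENCY)  /* consumer scheduled here */
            return reg[3];          /* use of r3                      */
    }
    return reg[3];
}

int main(void)
{
    printf("hw latency 3 (what the compiler assumed): r3 = %d\n", run(3));
    printf("hw latency 4 (newer micro architecture) : r3 = %d\n", run(4));
    return 0;
}
```

The only ways out I can see are keeping the latency fixed forever, adding interlock hardware (which is extra hardware again), or recompiling - the same recompile-for-every-new-chip problem as with the VLIW width above.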
So basically the Mill hard-codes the execution units, and that's awful because one of the better aspects of an instruction set architecture should be the possibility to implement it with different micro architectures instead of being totally fixed to work in one particular way.

- The belt-like mechanic itself isn't inherently better than Out-of-Order with Register Renaming
I get that the basic concept of the Mill is treating the Registers like a belt where the oldest item eventually drops off and vanishes, and if you want to preserve it you have to put it back onto the beginning of the belt... and that Ivan justifies this with the observation that 80% of the values inserted into the registers are only used once, while only 15% get used more than once.
But what makes that actually better than a classic Register machine where the Compiler decides when it is legit to overwrite a particular ISA register with a new value?
He reasons that you need fewer Multiplexers between the Belt Registers and the Execution units, but I think that can't really be avoided no matter what. For the inputs of the Execution units you still need to wire every Belt position to every execution unit, or otherwise you can't access the data from every belt position when you need it. So on the EU input side you again have a huge multiplexer tree... because how else would you do it? Wait until the proper item passes by the Execution unit in belt-like fashion? Urgh... the latency until a single instruction gets finished... and what about branch mispredictions... pipeline flushing and whatnot... it would take forever to recover from that.
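To put a rough number on it (belt length, number of execution units and operands per op are just values I assumed, and I'm treating a register file read port as one big mux, which is a simplification):

```c
#include <stdio.h>

/* Rough count of operand-side multiplexing for a belt machine vs. a
 * classic register file.  All parameters are assumptions I picked for
 * illustration, not numbers from the Mill documentation.             */
int main(void)
{
    const int belt_positions  = 32;  /* assumed belt length           */
    const int execution_units = 8;   /* assumed number of EUs         */
    const int operands_per_op = 2;   /* typical two-source operations */

    /* Every EU operand input must be able to select any belt position,
     * so each input needs a belt_positions-to-1 mux.                  */
    int mux_legs_belt = execution_units * operands_per_op * belt_positions;

    /* A classic design with 32 physical registers needs exactly the
     * same fan-in on its read ports.                                  */
    const int phys_registers = 32;
    int mux_legs_regs = execution_units * operands_per_op * phys_registers;

    printf("belt machine : %d mux input legs on the EU operand side\n",
           mux_legs_belt);
    printf("register file: %d mux input legs on the EU operand side\n",
           mux_legs_regs);
    return 0;
}
```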
The only savings you might get are on the data output side of an execution unit, because you always drop the result at the beginning of the belt. But if you have multiple outputs per cycle you have to re-introduce multiplexers, or limit the number of items that can be dropped onto the belt in one go. Also you have to introduce logic that puts the items onto the belt in the right order, so that the outputs of the various execution units don't end up in the wrong order if they finish in the same cycle.
And that's also where the complexity doesn't really differ that much anymore from having out-of-order execution with register renaming and a re-order buffer.
The really bad part about the belt-like structure is that you can't even think about implementing out-of-order execution, because the instructions are encoded to reference a relative belt position that changes every time another item gets dropped onto the beginning of the belt. That in turn makes it almost impossible to ever implement Out-of-Order execution on the Mill, because the hardware required to keep track of where on the belt a specific item currently is would be monstrous compared to current RISC/CISC register renaming.
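Here is a small simulation of what I mean with the shifting positions (again my own toy model, not the real Mill implementation): the same value sits at b0 right after it is produced and at b1 after the next drop, so any hardware that wanted to execute around program order would have to re-map every in-flight belt reference on the fly:

```c
#include <stdio.h>
#include <string.h>

#define BELT_LEN 8

/* Toy belt model: b0 is always the newest value; dropping a new result
 * shifts every older value one position further down the belt.  My own
 * simplified model, not the actual Mill implementation.               */
typedef struct {
    int slot[BELT_LEN];
} belt_t;

static void drop(belt_t *b, int value)
{
    memmove(&b->slot[1], &b->slot[0], (BELT_LEN - 1) * sizeof(int));
    b->slot[0] = value;
}

int main(void)
{
    belt_t b = {{0}};

    drop(&b, 100);              /* some earlier result                  */
    drop(&b, 42);               /* the value we care about -> now at b0 */
    printf("before: our value is at b0 -> %d\n", b.slot[0]);

    drop(&b, 7);                /* one more result dropped              */
    printf("after : the SAME value is now at b1 -> %d\n", b.slot[1]);

    /* An instruction encoded as "use b0" before that drop would now pick
     * up the wrong value; executing anything out of program order means
     * the hardware has to rename every belt reference on the fly.       */
    return 0;
}
```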
OoO becomes more and more profitable, if not necessary, once there are Cache misses or other long-latency/multi-cycle instructions... and I see no real way the Mill can avoid that problem... on the contrary, the belt-relative addressing of registers makes micro architecture improvements a whole lot harder to implement.
Also, what about extending the belt length? Probably the same problem as with having only the Instruction Set Architecture Registers... if you don't have enough of them you end up with a lot of Loads/Stores, and adding more Registers is impossible without a new ISA (which then requires recompiling everything to take advantage of them). And that's exactly why Out-of-Order with register renaming eventually became an important thing.
There are some other things that I think may not turn out all that good on the Mill too... but the entire thing is almost worth its own topic.
The VISC stuff from Soft Machines seems more realistic, since they at least have something to show in silicon... but sadly they got bought by Intel recently, so it probably won't see a commercial implementation for a long, long time... and only if Intel is really cornered and needs to pull a rabbit out of their hat. The basic concept is that a single-threaded application can take advantage of unused resources from multiple cores, which is achieved by having a global front end that fetches instructions from various threads at once and then schedules and distributes them among the cores that are free.
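A very rough toy of that idea as I understand it from the public material (my own simplification, not Soft Machines' actual design): a global front end hands fetched instruction groups to whichever physical core is free, so a single hot thread can temporarily spread over several cores:

```c
#include <stdio.h>

#define CORES 4

int main(void)
{
    /* groups of instructions the global front end fetched, tagged with
     * the thread they came from (thread 0 is the busy one)             */
    int group_thread[] = {0, 0, 0, 1, 0, 0};
    int groups = sizeof(group_thread) / sizeof(group_thread[0]);

    int core_busy_until[CORES] = {0};
    int cycle = 0;

    for (int g = 0; g < groups; g++) {
        /* pick the first core that is free this cycle */
        int core = -1;
        while (core < 0) {
            for (int c = 0; c < CORES && core < 0; c++)
                if (core_busy_until[c] <= cycle)
                    core = c;
            if (core < 0)
                cycle++;            /* everything busy, wait a cycle    */
        }
        core_busy_until[core] = cycle + 2;  /* assume 2 cycles per group */
        printf("cycle %d: group %d (thread %d) -> core %d\n",
               cycle, g, group_thread[g], core);
    }
    return 0;
}
```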
Also, when it comes to Out-of-Order there are different new concepts that haven't been implemented in any commercial processor yet... like for example out-of-order retirement of instructions once it becomes obvious that they no longer depend on anything else and can't be rolled back anymore. That in turn would allow for a much smaller Reorder buffer... or a larger out-of-order window in situations where a lot of instructions are parallel, as long as they don't require atomicity.
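A toy sketch of what such out-of-order retirement could look like (my own simplification, not a description of any shipping or announced design): a completed entry may leave the reorder buffer early as long as no older entry can still fault or turn out to be a mispredicted branch:

```c
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

/* Toy reorder buffer with out-of-order retirement.  Index 0 is the
 * oldest entry.  This is only an illustration of the idea.           */
typedef struct {
    bool valid;
    bool done;        /* result produced                              */
    bool may_fault;   /* could still raise an exception / mispredict  */
} rob_entry_t;

static int retire(rob_entry_t rob[ROB_SIZE])
{
    int freed = 0;
    bool older_may_squash = false;  /* an older entry could still fault
                                       or be a mispredicted branch     */

    for (int i = 0; i < ROB_SIZE; i++) {
        if (!rob[i].valid)
            continue;
        if (rob[i].may_fault)
            older_may_squash = true;     /* younger entries must wait   */
        else if (rob[i].done && !older_may_squash) {
            rob[i].valid = false;        /* retire, even though an older,
                                            slower entry is still busy  */
            freed++;
        }
    }
    return freed;
}

int main(void)
{
    rob_entry_t rob[ROB_SIZE] = {
        {true, false, false},  /* 0: long multiply still executing      */
        {true, true,  false},  /* 1: finished add                       */
        {true, true,  false},  /* 2: finished add                       */
        {true, false, true },  /* 3: unresolved branch                  */
        {true, true,  false},  /* 4: finished add, stuck behind branch  */
    };
    printf("retired out of order this cycle: %d\n", retire(rob));
    /* Strict in-order retirement would retire 0 entries here, because
     * the oldest entry (the multiply) has not finished yet.            */
    return 0;
}
```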
If you ask me, I think the best approach was Berkeley RISC II and follow-ups like SPARC, which had register windows to speed up loops and subroutine calls, thereby minimizing the need to access Caches/RAM... and that's where development should have continued. Generalizing the register windows so that the software can decide which registers get "punched through" from window to window, and instructions to move the contents of a window directly onto a stack buffer on a window switch... stuff like that.
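For reference, a little model of the SPARC-style overlapping windows I mean, with the classic 8 in / 8 local / 8 out layout (the "punch through" generalization would sit on top of this; globals and window overflow/underflow handling are left out of the sketch):

```c
#include <stdio.h>

/* Minimal model of SPARC-style overlapping register windows: each
 * window has 8 in, 8 local and 8 out registers, and the caller's out
 * registers become the callee's in registers on a call, so arguments
 * pass without touching memory.                                       */

#define WINDOWS   8
#define PHYS_REGS (WINDOWS * 16)     /* 8 shared in/out + 8 locals each */

static int phys[PHYS_REGS];
static int cwp = 0;                  /* current window pointer          */

/* Map an architectural register of the current window onto the
 * physical file.  kind: 0 = out, 1 = local, 2 = in.                    */
static int map(int kind, int n)
{
    int base = (cwp * 16) % PHYS_REGS;
    switch (kind) {
    case 0:  return (base + 16 + n) % PHYS_REGS;   /* %o0..%o7, shared
                                                      with next window  */
    case 1:  return (base + 8 + n) % PHYS_REGS;    /* %l0..%l7          */
    default: return (base + n) % PHYS_REGS;        /* %i0..%i7          */
    }
}

static void call(void) { cwp = (cwp + 1) % WINDOWS; }           /* SAVE    */
static void ret_(void) { cwp = (cwp + WINDOWS - 1) % WINDOWS; } /* RESTORE */

int main(void)
{
    /* caller puts an argument into its out register %o0 ...             */
    phys[map(0, 0)] = 1234;
    call();
    /* ... and the callee sees it in its in register %i0, no memory used */
    printf("callee %%i0 = %d\n", phys[map(2, 0)]);

    /* callee leaves a return value in %i0, caller reads it from %o0     */
    phys[map(2, 0)] = 5678;
    ret_();
    printf("caller %%o0 = %d\n", phys[map(0, 0)]);
    return 0;
}
```

Because the caller's out registers and the callee's in registers are literally the same physical registers, arguments and return values pass without a single load or store - that's the property I'd want to generalize.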