hoho wrote:Yes, he has stated that. Has it been shown to be like that in reality?
Yes. They've written their own "specializer" for the second compilation phase (covered extensively in video #10). For parsing and the "in the middle" optimization passes they're just using the LLVM toolset. They can compile and run real programs and get good results doing so. They're still working on porting some remaining bits of the C standard libraries.
x86 is extremely efficient on not highly branchy code as well.
That's like scoffing at some 7-foot-tall person's standing height and saying "a midget is tall when he stands up as well". Intel just bumped their absolute maximum (best-case) dispatch rate to 8 instructions per cycle (per core) as of Haswell. (It was 6 before that.) That's not exactly close to the Mill's 33. As mentioned earlier, widening that further on an x86-type architecture isn't really practical unless you're willing to eat an even deeper pipeline, which then hurts your performance on branchy code. Obviously Intel has decided that, for their processors, that would not be a worthwhile trade-off to make. And there are major architectural differences between x86 and the Mill that mean the price of giving an x86 core a given issue width is always going to be significantly higher than it is for the Mill. This isn't a problem Intel can solve with clever chip design - they'd need to abandon the x86 ISA to solve it (and then they'd still have to deal with Mill's patents).
Throw in 256/512bit SIMD and you'll get some insane throughput. "on the right workload"

You do realize that using SIMD on x86 doesn't improve its instruction issue rate, right? And that the Mill also has vector processing? In fact, the Mill's vector processing is leagues ahead of x86's and lacks all of x86's ugliness. (On the Mill, data size and vector size move down the belt right along with the data, so they don't have to be encoded in the instruction, and scalars and vectors can coexist side by side in the same "register space"/belt. You can use the exact same 'add' machine instruction whether you're adding two 8-bit numbers, two 64-bit numbers, or two vectors of 32-bit numbers, etc. In fact, the Mill lets you use that very same 'add' instruction to add numbers that aren't even the same size as each other. That's a massive improvement over Intel's bazillion seemingly random and endless sea of MMX, SSE, SSE2, AVX, VEX-prefixed, REX/classical integer, and stack-based x87 FPU instructions.)
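To make that concrete, here's a rough sketch in C - purely my own illustration, not real Mill code, encoding, or its exact widening rules - of the idea that operand metadata (width and lane count traveling with the data), rather than the opcode, decides what a single generic 'add' does:

    /* Illustrative sketch only - not real Mill code. Width and lane-count
       metadata travel with each operand, so one generic "add" covers scalars
       and vectors of any element size, even mixed sizes. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int      width_bits; /* element width, carried with the data */
        int      lanes;      /* 1 = scalar, >1 = vector              */
        uint64_t lane[8];    /* payload                              */
    } BeltOperand;

    /* One "add": behavior comes from the operands, not from the opcode. */
    static BeltOperand add(BeltOperand a, BeltOperand b) {
        BeltOperand r = {0};
        r.width_bits = a.width_bits > b.width_bits ? a.width_bits : b.width_bits;
        r.lanes = a.lanes;
        uint64_t mask = r.width_bits >= 64 ? ~0ULL : ((1ULL << r.width_bits) - 1);
        for (int i = 0; i < r.lanes; i++)
            r.lane[i] = (a.lane[i] + b.lane[i]) & mask; /* wrap at result width */
        return r;
    }

    int main(void) {
        /* the same "add" for mixed-size scalars (8-bit + 64-bit)... */
        BeltOperand s = add((BeltOperand){ 8, 1, {200}},
                            (BeltOperand){64, 1, {100}});
        /* ...and for two vectors of four 32-bit elements */
        BeltOperand v = add((BeltOperand){32, 4, {1, 2, 3, 4}},
                            (BeltOperand){32, 4, {10, 20, 30, 40}});
        printf("%llu %llu\n", (unsigned long long)s.lane[0],
                              (unsigned long long)v.lane[3]); /* prints: 300 44 */
        return 0;
    }

Contrast that with x86, where the element width and vector-ness are baked into which instruction you pick (ADD vs. PADDB vs. VPADDD and so on).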
Again, have they actually built any of their CPUs or is it all just hypothetical?
That's a false dichotomy. "Hypothetical" means no evidence. They have a working simulator, a working compiler, and various applications running on top of that simulator, with measured (granted, simulated, but very good) results. They plan on switching from simulation to an FPGA test platform this year. Their ultimate goal is to be a fabless core provider (like ARM).
I can't see how that is possible when ALUs are the most power hungry parts in modern CPUs.
CITATION REQUIRED
And even if that were true (and it's not true for common code, i.e. branchy and scalar code), the Mill would still consume less power than x86. That is, if you ignore all other power savings - e.g. ignore that the Mill doesn't have to pay the huge "out-of-order scheduling power tax" - and look at ALU power only, the Mill still wins. The reason comes down to the basic physics of CMOS. Say you're running your typical branchy code on both an x86 and a Mill, the x86 is achieving 3 operations/cycle, and the Mill is achieving 6 operations/cycle. Hold performance constant and look at power. What's the result? The higher operations/cycle on the Mill means you can clock the Mill at half the frequency of the x86 and still get the same performance. With a lower clock, you can also then run the Mill at a lower voltage than the x86. (For any given process and design, higher clocks require higher voltages - that's just how CMOS works.) The dynamic power consumed by CMOS is:

power = capacitance * frequency * voltage * voltage

Now for the Mill the frequency is cut in half, but there is twice the capacitance in play (twice as many active ALUs), so capacitance * frequency is a wash (2 * 0.5 = 1). But since the Mill only needs to operate at half the frequency, you can reduce the voltage by, say (just making up a reasonable-ish number here), 20%. That means the dynamic power consumption of the Mill's ALUs in this case is 0.8 * 0.8, or 64%, of the dynamic power consumption of the x86's ALUs.
Note that the point here isn't to provide any sort of definitive answer about exactly how much power the Mill will consume compared to x86, but simply to show "how that is possible": as long as the Mill can provide the same performance at a lower frequency, it can also run at a lower voltage and consequently use less power.
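If you want to fiddle with the numbers yourself, here's that arithmetic as a trivial C snippet. The 2x capacitance, half frequency, and 20% voltage reduction are just the illustrative figures from above, not measurements of any real chip:

    /* Dynamic CMOS power: P = C * f * V^2 (ignoring leakage and activity
       factor). Inputs are this post's illustrative ratios, not real data. */
    #include <stdio.h>

    static double dyn_power(double capacitance, double frequency, double voltage) {
        return capacitance * frequency * voltage * voltage;
    }

    int main(void) {
        double x86  = dyn_power(1.0, 1.0, 1.0); /* baseline x86 ALUs               */
        double mill = dyn_power(2.0, 0.5, 0.8); /* 2x ALUs, half clock, -20% volts */
        printf("Mill/x86 ALU dynamic power: %.2f\n", mill / x86); /* prints 0.64 */
        return 0;
    }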
Sure, in specific workloads it could be since their CPU effectively seems to want to be a DSP.
It's not that it "wants to be a DSP" - it's that they've taken the architectural properties that allow DSPs to be more power efficient than general-purpose CPUs and figured out how to extend them so they can also execute general-purpose code. (Or at least that's their story. I don't really care how/why they came up with such great architectural ideas. I'm just happy they did. Well, mostly happy -- why didn't I come up with those ideas!?!?)
Just because Intel is retardedly putting an entire "low-end graphics card" on their chips doesn't change the fact that x86 instruction decode and scheduling consumes around 1/3rd of every single CPU core they make. Next you'll be arguing that the entire CPU core isn't that important for CPUs because, after all, even with all of the cores added together they still make up a minority of the die area on such Intel chips.
Plus, that graphics crap doesn't count towards power when I (or any other self-respecting gamer) am using such a chip (if stuck using one), because we're using real graphics cards and the whole integrated-graphics section of the chip will be powered down (or at least I'd hope so).
Mill simply uses that transistor budget to implement things that "normal" CPUs don't have a need for.
CITATION REQUIRED
Btw, is there any information on how Mill would handle thread/context switches? Do they require any additional magic or will it "just work" and would be indistinguishable from running a single thread?
Yes, there's information on it:
https://www.youtube.com/watch?v=5osiYZV8n3U#t=21m25s
But you'd probably have to watch that video from the start, and maybe some of the earlier videos, to really understand it.
That video is mostly about security, and the Mill has some really good security features. For example, a common attack on x86 is overflowing an array on the stack to overwrite a return address (and thereby make the processor jump wherever you want) - see the sketch below for the textbook shape of that attack. Such attacks are strictly impossible on the Mill. New stack data is also automatically zeroed (for free - they've got tricks for doing that, i.e. they don't actually write a bunch of zeroes to memory), which not only increases performance (because zeroing at least some local data is pretty common in programs) but also prevents data (e.g. passwords) from being seen on the stack by code that has no business seeing it. It also supports (non-OS) services having their own stacks, so a user of a service can't just fill their own stack up, invoke the service, and cause it to fail due to stack overflow.

There's just lots and lots of stuff there - watch the videos. If you care about computer architecture then you owe it to yourself to watch them, in full, in order, because Godard isn't just listing/describing features of the Mill architecture (though he does plenty of that); he's describing fundamental problems in computer architecture, how/why previous approaches fail, and how they can be overcome. One after another. Every single video (except #10, the compiler one) has one or more of these "breakthrough bombs" in it. (The reason the compiler one doesn't is that they don't have to do anything revolutionary in their compiler.)
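For reference, here's that stack-smashing attack as a deliberately broken C toy. Nothing in it is Mill-specific, and the comment about where the Mill keeps return addresses is just my reading of the talks:

    /* Classic return-address smash on a conventional call stack. The saved
       return address lives in the same addressable memory as buf, so a long
       enough input overwrites it and redirects control flow. On the Mill (as
       I understand the talks), return addresses aren't kept in
       program-addressable stack memory, so this class of attack can't work. */
    #include <string.h>

    static void vulnerable(const char *input) {
        char buf[16];
        strcpy(buf, input); /* no bounds check: anything longer than 16 bytes */
                            /* overruns buf and tramples the stack frame      */
    }

    int main(void) {
        /* far more input than the 16 bytes buf can hold: expect a crash
           (or a stack-protector abort) on a conventional machine */
        vulnerable("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
        return 0;
    }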
Also, how long is the "pipeline" for the Mill CPU?
I don't think that's been covered in the videos. Pipeline depth is likely to vary from one Mill chip to another.
Anyway, in a sense it doesn't matter, because regardless of their pipeline depth they've already measured 6+ operations/cycle on common branchy code - i.e. the pipeline depth is already "baked into" that result. (Of course pipeline depth matters: if they could magically have a shorter pipeline without hurting anything else, they could push operations/cycle to 7+ or more. My point is just that 6+ is already very good, and a lot better than what x86 achieves.)
E.g if they have perfect code and manage to schedule 32 instructions per cycle and there is a branch, they have to halve that to 16 to run both of the branches in parallel.
I think there's been some misunderstanding. Mill does support speculative execution (and the details of that are covered in the videos), but as far as I know it doesn't do "both branch directions at the same time".
How does Mill handle SIMD? Are all it's registers general purpose or are some SIMD?
I already answered this earlier in this post. (Its vector support is excellent, well integrated, highly regular and powerful, and is covered in video #5.)
What about int vs float?
Like, to the death? My money's on int because he's a more solid wall of bits.
Not sure what you're asking. Like int, float is supported for both scalar and vector data. The architecture (though not necessarily every single Mill chip) supports 16/32/64/128-bit IEEE binary floating point, 32/64/128-bit IEEE decimal floating point, and 8/16/32/64/128-bit ISO C fractions (whatever those are). Supported integer sizes are 8/16/32/64/128 bits. This is covered in video #5.