
Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 9:03 am
by NotABiter
I was thinking about how I would speed up serial code like some of Factorio's without having to parallelize it. It didn't take long to think of using a "prefetch thread": while one thread runs the actual code, the other prefetches the data the first thread is about to use. Searching on that revealed these are commonly known, referred to in the literature as "helper thread prefetchers", and typically used with SMT aka hyper-threading (which makes sense, because otherwise they can't pull data all the way into the L1 cache).
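To make the idea concrete, here's a minimal sketch of a helper-thread prefetcher. The entity type, the update loop, and the lead distance are all invented for illustration; __builtin_prefetch is the GCC/Clang builtin, and a real implementation would pin both threads to sibling SMT contexts so the prefetches land in the shared L1:

    // Helper-thread prefetcher sketch: the helper walks the same index array
    // slightly ahead of the worker and touches each element so it is
    // (hopefully) cache-resident by the time the worker needs it.
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Entity { double state[8]; };

    void worker(std::vector<Entity>& pool, const std::vector<std::size_t>& order,
                std::atomic<std::size_t>& progress) {
        for (std::size_t i = 0; i < order.size(); ++i) {
            Entity& e = pool[order[i]];
            for (double& s : e.state) s *= 1.001;        // the "real" serial update
            progress.store(i, std::memory_order_relaxed);
        }
    }

    void prefetcher(const std::vector<Entity>& pool,
                    const std::vector<std::size_t>& order,
                    const std::atomic<std::size_t>& progress, std::size_t lead) {
        for (std::size_t i = 0; i < order.size(); ++i) {
            // Stay at most `lead` elements ahead of the worker.
            while (i > progress.load(std::memory_order_relaxed) + lead)
                std::this_thread::yield();
            __builtin_prefetch(&pool[order[i]], 0 /*read*/, 3 /*high locality*/);
        }
    }

    int main() {
        std::vector<Entity> pool(1 << 20);
        std::vector<std::size_t> order(pool.size());
        for (std::size_t i = 0; i < order.size(); ++i)
            order[i] = (i * 2654435761u) % pool.size();  // pseudo-random walk
        std::atomic<std::size_t> progress{0};
        std::thread helper(prefetcher, std::cref(pool), std::cref(order),
                           std::cref(progress), std::size_t{64});
        worker(pool, order, progress);
        helper.join();
    }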

In the process I came across another interesting idea:
This paper describes and evaluates helper threads that run on separate cores of a multicore and/or multi-socket computer system. Multiple threads running on separate cores can significantly improve overall performance by aggressively prefetching data into the cache of one core while the main thread executes on another core. We call this technique inter-core prefetching. When the prefetcher has prefetched a cache's worth of data, it moves on to another core and continues prefetching. Meanwhile, when the main thread arrives at the point where it will access the prefetched data, it migrates to the core that the prefetcher recently vacated. It arrives and finds most of the data it will need is waiting for it, and memory accesses that would have been misses to main memory become cache hits.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 9:57 am
by hansinator
Rseding91 wrote:
hansinator wrote:
Harkonnen wrote:In modern times memory bandwidth is the main limiter, which is around 20 GB/s on the hardware Factorio targets.
I've heard before from a dev that memory bandwidth is the limiting factor. I believe you are just guessing, because I have numbers that show a different situation. I ran some benchmarks and used Intel's PCM tools to measure memory bandwidth utilization. The PCM tools directly query performance counters of the memory controller, so they are very precise.
No guessing involved at all: https://www.reddit.com/r/factorio/comme ... ed_fpsups/ http://imgur.com/a/QJiHE
Halving the RAM speed does more than just halve the available bandwidth. According to the graphs, a 50% reduction in memory clock speed gives a reduction in FPS/UPS of
  • ~32% @ 4.6GHz CPU / 8 threads
  • ~30% @ 3.2GHz - 4.2GHz CPU / 8 threads
So decreasing memory access speed & bandwidth by half reduces the FPS by roughly 1/3. This makes me wonder, because I would expect a larger drop if the memory speed were at its limit most of the time. It looks like it is not at its limit.

Now for bandwidth. Whether I use quad-channel memory with a theoretical peak bandwidth of 67GB/s or dual-channel (resulting in half of the bandwidth of course) makes no noticeable difference in FPS/UPS.

This, to me, proves that it is not the bandwidth but the memory access time. Which in turn hints at non-optimal memory access patterns and many cache misses.

Unfortunately the graph does not show how the CPU cache speeds vary, if at all, when he varies the multiplier. It would be nice to know how that affects FPS/UPS.

But I can conclude that you probably cannot just plug game-state size, FPS/UPS & memory bandwidth into a simple calculation and say more is not possible. That being said, I have to repeat that I'm not crying for more. Edit: and I further conclude that overall memory speed is the limit, given the way you use it. There's plenty of headroom to get more out of it.

Edit:
The problem probably arises from heavy use of linked lists, with lots of small disjoint memory areas being allocated. Traversing them causes lots of cache misses; linked lists are cache-line unfriendly. My experience shows that arrays holding small data structures or pointers are way faster on modern hardware, even if you often copy around dozens or a few hundred megabytes to resize them. In C++ it is also faster to embed the linked-list pointers into a class instead of having a list that holds references, because it improves the locality of memory accesses.
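A toy illustration of the point, with invented names (this is not Factorio's code): the vector walk is one dense stream the hardware prefetcher can follow, while every std::list node is a separate heap allocation and usually a separate cache miss.

    #include <list>
    #include <vector>

    struct Item { int type; float x, y; };

    // Contiguous storage: sequential, prefetch-friendly access.
    float sum_vector(const std::vector<Item>& v) {
        float s = 0;
        for (const Item& it : v) s += it.x;
        return s;
    }

    // Node-per-allocation storage: a pointer chase, likely a miss per node.
    float sum_list(const std::list<Item>& l) {
        float s = 0;
        for (const Item& it : l) s += it.x;
        return s;
    }

    // The "embed the links in the class" variant: node and payload share a
    // cache line instead of being one indirection apart.
    struct Inserter {
        Inserter* next_active = nullptr;  // intrusive list pointer
        float rotation = 0;
    };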

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 10:52 am
by hoho
NotABiter wrote:It depends on your memory access pattern. If you are streaming through memory serially, using every byte, then the number of CPU cycles available per byte of memory is actually very low. If you are hitting random memory addresses and only looking at one byte at each address, then yeah, you're looking at 100s of CPU cycles per byte.
I have an i5-6600 with a theoretical memory bandwidth of 34.1 GB/s. Cumulatively the four cores run ~3.4 × 4 ≈ 13.6 billion cycles per second. In each cycle the CPU can perform several operations per core, up to a couple dozen with SIMD stuff. So, yeah, I misremembered my numbers; it's much closer to a byte per cycle than 100 bytes.
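Spelled out (my arithmetic, from the figures above):

    34.1×10⁹ bytes/s ÷ 13.6×10⁹ core-cycles/s ≈ 2.5 bytes per core-cycle at the theoretical streaming peak.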

Though with CPUs with more parallel cores/threads, things get worse. It also doesn't help that each thread generally has its own dataset and thus puts a higher load on the caches.
NotABiter wrote:Good code tries very hard to not do that though, because it's really slow. (See, for example, Data-oriented design and AOS and SOA.) But depending on the problem, random accesses can be various degrees of difficult to avoid.
I'm well aware of that. Improving memory locality and the predictability of access patterns is one of the best ways to improve computing efficiency, after using proper algorithms.
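For readers who haven't seen it, a minimal AoS vs. SoA sketch (names invented for the example): if an update pass only touches x and vx, the SoA layout streams two dense arrays instead of striding through whole structs.

    #include <cstddef>
    #include <vector>

    // Array of Structures: every pass drags whole 24-byte structs through cache.
    struct ParticleAoS { float x, y, vx, vy, mass, age; };

    // Structure of Arrays: the same data, one contiguous array per field.
    struct ParticlesSoA {
        std::vector<float> x, y, vx, vy, mass, age;
    };

    void advance_x(ParticlesSoA& p, float dt) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += p.vx[i] * dt;   // two dense streams, trivially vectorizable
    }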
NotABiter wrote:Ivan Godard is an experienced compiler writer and he's specifically stated that the Mill is designed so the compiler is simple to write - nothing fancy required.
Yes, he has stated that. Has it been shown to be like that in reality?
NotABiter wrote:No, practical to 32... on the right workload (i.e. not highly branchy code).
x86 is extremely efficient on not highly branchy code as well. Throw in 256/512bit SIMD and you'll get some insane throughput. "on the right workload" :)

NotABiter wrote:They've made a point of being able to take existing C code and just recompile it and run it, and just doing that with common programs gets them to 6+, which is 2-3 times better than what you tend to see on x86.
Again, have they actually built any of their CPUs or is it all just hypothetical?
NotABiter wrote:Like I said, their claim is lower power, not higher
I can't see how that is possible when ALUs are the most power hungry parts in modern CPUs.
NotABiter wrote:(In fact, the specific claim he makes many times is 1/10th the power of a conventional CPU, but usually in the context of DSP-like workloads.)
Sure, in specific workloads it could be since their CPU effectively seems to want to be a DSP.
NotABiter wrote:Your memory is incorrect. Decode and dispatch for out-of-order superscalar architectures (like x86) are major power (and silicon) hogs, and they do not scale well at all. (Power and silicon costs go up something like quadratically with the number of instructions you try to handle at once.)
In modern x86 CPUs, that part of the CPU is tiny compared to the entire die size.
http://wccftech.com/haswell-die-configu ... -revealed/
Way too many times have CPU manufacturers promised magical performance increases that almost never play out in reality.
NotABiter wrote:But the thing is, the architectural improvements they describe (and they describe most of them well enough that I could actually implement them myself -- they're not just hand-waving) essentially guarantee the performance increases. It's like hearing for the first time an explanation of how to make a carry-lookahead adder -- it's just immediately clear to anyone "practiced in the art" that the speed of that circuit is going to be way better than a ripple-carry adder.
(There's a picture of a Nehalem core here. Note that about 1/6th of the entire core is spent on instruction scheduling alone. That's an expense that Mill doesn't have to pay.)
Mill simply uses that transistor budget to implement things that "normal" CPUs don't have a need for. It's a tradeoff.


Btw, is there any information on how Mill would handle thread/context switches? Do they require any additional magic, or will it "just work" and be indistinguishable from running a single thread?


Also, how long is the "pipeline" for the Mill CPU? Modern CPUs have it around 12-18 cycles IIRC, leading to stalls of at most that many cycles on a branch misprediction. In Mill, it'd require doubling the computations the CPU does for that period, possibly more if you have a branch within a branch in fewer cycles than the length of the "pipeline". Another thing to note is that while they're performing that speculative execution, they are using up ALUs that could have been spent on computations that wouldn't be discarded. E.g. if they have perfect code and manage to schedule 32 instructions per cycle and there is a branch, they have to halve that to 16 to run both of the branches in parallel.


How does Mill handle SIMD? Are all its registers general purpose or are some SIMD? What about int vs float?

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 11:22 am
by hoho
Harkonnen wrote:TL;DR: Factorio has multithreaded update since around October 2016.

[a wall of text made of pure awesomeness]
You should put that stuff you wrote there in a separate article, possibly even a non-FFF one. I'm sure a LOT of people would be interested in reading it, but only a few ever see it due to it being hidden in a thread full of mental masturbation :)

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 11:33 am
by Harkonnen
A few notes:
1. The number of threads in Factorio's options affects only the number of prepare-render threads. It has nothing to do with the update cycle. So zoom in maximally, ideally in a window at something like 720p, to diminish the effect of rendering on UPS completely.
2. Ah, I see now that the 60 GB/s comes from quad-channel memory :) My expectations were based on dual-channel sticks.
3. So, we got somewhere now. Thanks to BenSeidel for starting the thread :) Even though it's on a different topic, it had a very good side-effect. I'll tune in a bit later for the main discussion of this topic.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 11:37 am
by Harkonnen
hoho wrote:You should put that stuff you wrote there in a separate article, possibly even a non-FFF one. I'm sure a LOT of people would be interested in reading it, but only a few ever see it due to it being hidden in a thread full of mental masturbation :)
Yeah, we're thinking of that. These questions have been in the air for a very long time.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 11:39 am
by hansinator
Harkonnen wrote:
hoho wrote:You should put that stuff you wrote there in a separate article, possibly even a non-FFF one. I'm sure a LOT of people would be interested in reading it, but only a few ever see it due to it being hidden in a thread full of mental masturbation :)
Yeah, we're thinking of that. These questions have been in the air for a very long time.
OT: It must be quite a burden to make a game that is enjoyed by a big crowd of computer specialists... everybody knows better :roll: ;)

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 11:41 am
by hoho
Harkonnen wrote:
hoho wrote:You should put that stuff you wrote there in a separate article, possibly even a non-FFF one. I'm sure a LOT of people would be interested in reading it, but only a few ever see it due to it being hidden in a thread full of mental masturbation :)
Yeah, we're thinking of that. These questions have been in the air for a very long time.
I've not been to Gamasutra for a couple of years, but I used to enjoy the gamedev articles there. Publishing it there would also be awesome if their site still deals with that kind of material.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 4:58 pm
by Yoyobuae
Rseding91 wrote:... yes. That's what I've been saying. I don't know where you got the idea that when I say "we're limited by RAM speeds" I mean memory throughput. It's both: tons of random access of little bits (latency) and a very small amount of throughput. Additionally, the throughput listed for a given processor is the *maximum* throughput of the processor - not the throughput of a single thread doing random access - which is what the vast majority of Factorio's update cycle is doing.

As Harkonnen said, virtually all of the update cycle is linked to the rest in some small way, which is why we can't just drop threads onto some portion of it without fundamentally changing the update cycle with a bunch of small hacks to prevent concurrent modifications.

For instance: inserters.

If we group all objects that can be changed by running update() on a single inserter, you essentially get the entire list of all updatable entities, because:
  • Adding/removing an item from any entity can trigger the alarm() method if the entity is wakeable (virtually everything an inserter interacts with is)
  • Triggering alarm() on an entity will trigger alarm() on any other inserters waiting on that entity for some reason
  • Triggering alarm() re-activates the entity which puts it back in the updatable entities list
  • Moving items in or out of an entity that's part of the logistic network alters the logistic network item counts which all entities part of that network use
  • Moving items in or out of an entity that's part of a circuit network alters the circuit network item counts which all entities part of that network use
And the final "nail" for doing inserters is that the "slow" parts are those interactions with other entities, because they're *always* on some other cache line (if they're even in the cache) and the CPU has to wait while the required piece of RAM is loaded. The rotation between pickup and dropoff simply rotates a vector towards a pre-calculated position (as of 0.15) and takes a trivial amount of time to run compared to something like "drop the item on the ground", which has to check if it collides with anything at that location, allocate memory for the new item entity, place it on the ground, and link it in with the rest of the entities on the chunk.
Assuming there are two surfaces with zero interaction between them, could their update cycles happen in parallel?

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 5:02 pm
by Rseding91
Yoyobuae wrote:Assuming there are two surfaces with zero interaction between them, could their update cycles happen in parallel?
That's far more plausible, but things like script events would ruin it, since they have access to everything in any event.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 5:11 pm
by justarandomgeek
Yoyobuae wrote:Assuming there are two surfaces with zero interaction between them, could their update cycles happen in parallel?
Related question: suppose you make a surface that is used for spaced-out, isolated blocks of factory (like Factorissimo, but all on one surface, with building interiors on a grid; let's assume each block fits entirely within one chunk, with empty chunks between them - the optimal case for parallel chunks, basically). Would the spacing of these blocks affect how well they can be parallelized by the current scheme (the 3x3 grid)? Would it be better for them to fall in the same position on that 3x3 grid, or to be roughly evenly spread among positions?

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 5:39 pm
by Harkonnen
Well, it's not "current" yet :) but better to spread them evenly. As for per-surface threading - yes, we would go that way if it were natural for Factorio to run multiple surfaces, all with a decent factory. Currently surfaces exist only in mods, and the main surface still dominates the computation required. Also, the current implementation applies that 3x3 algorithm to every surface individually, so if you have 10 surfaces you get 90 sweeps, with the internal sweeps almost useless because Factorissimo interiors are quite small. If that threading goes to production, we'll have to rethink chunk processing so that the 3x3 grid is applied to all chunks of all surfaces at once, not on a surface-after-surface basis. Lua stuff will stay single-threaded forever, I guess. As for Factorissimo again - we plan a C++-side virtual connection between pipes/transport-lines, so items are not rethrown by scripts on a per-item basis; instead the virtual connection between transport lines is set up once.
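As a rough sketch of one way to read that 3x3 scheme (invented names, not the real code): nine phases keyed by chunk coordinates mod 3, so no two chunks processed in the same phase are adjacent, and updates that only reach into neighbouring chunks can't race. A real engine would use a thread pool rather than a thread per chunk:

    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Chunk { std::int32_t x, y; /* entities... */ };

    void update_chunk(Chunk& c) { /* per-chunk entity updates */ }

    void sweep(std::vector<Chunk>& chunks) {
        for (int px = 0; px < 3; ++px) {
            for (int py = 0; py < 3; ++py) {
                std::vector<std::thread> workers;
                for (Chunk& c : chunks) {
                    // Select the chunks in this phase; the mod-3 dance
                    // handles negative chunk coordinates.
                    if (((c.x % 3 + 3) % 3) == px && ((c.y % 3 + 3) % 3) == py)
                        workers.emplace_back(update_chunk, std::ref(c));
                }
                for (std::thread& t : workers) t.join();  // barrier between phases
            }
        }
    }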

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 5:59 pm
by Yoyobuae
Harkonnen wrote:Well, it's not "current" yet :) but better to spread them evenly. As for per-surface threading - yes, we would go that way if it were natural for Factorio to run multiple surfaces, all with a decent factory.
I was just thinking that players already change their factories to optimize UPS. Splitting the factory across surfaces would just be another option for megafactory builders (like using bots instead of belts and such).

Like, when the game is started, one could choose to run two parallel surfaces, with maybe some buildings to teleport back and forth and a limited set of virtual connections between the surfaces (for items, pipes, the electric network). Those connections would need to be designed with parallel processing in mind, though (otherwise it would be no better than the single-surface case).

Similarly with scripts: each surface would need its own Lua environment, with maybe interactions via a remote interface.

Re: Parallel processing in games & applications

Posted: Sun Jan 22, 2017 6:11 pm
by justarandomgeek
Harkonnen wrote:Well, it's not "current" yet :)
Sorry, yes, I used 'current' to mean what you had discussed earlier :) I should have been more precise and said something to the effect of the implemented-but-not-deployed scheme!
Harkonnen wrote: but better to spread them evenly.
That's sort of what I was thinking. If I were to do such a thing, I'd probably tile on every 4th or 5th chunk or so, such that they'd tend (mathematically) to be roughly evenly spread over your grid.
Harkonnen wrote: Also, the current implementation applies that 3x3 algorithm to every surface individually, so if you have 10 surfaces you get 90 sweeps, with the internal sweeps almost useless because Factorissimo interiors are quite small.
Well, I was thinking specifically of a case with the one main surface and one surface with a big grid of small factory bits in it, so you'd only get 18 sweeps in that case at least, but I see your point! Not sure I understand the "almost useless" part though, unless that's just because the set of things that can be updated in a grid that way is relatively small.
Harkonnen wrote: Lua stuff will stay single-threaded forever, I guess.
I pretty much expected that, for the sake of everyone's sanity. But the way I see it, every gain in the main update time also frees up more time for Lua at the end of the cycle, so everyone still wins! :)
Harkonnen wrote: As for Factorissimo again - we plan a C++-side virtual connection between pipes/transport-lines, so items are not rethrown by scripts on a per-item basis; instead the virtual connection between transport lines is set up once.
Looking forward to native item/fluid-movers, that'll be great! (and I humbly request a native signal-mover too, to get circuit signals between surfaces! The script way is pretty slow :( )

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 3:22 am
by NotABiter
hoho wrote:Yes, he has stated that. Has it been shown to be like that in reality?
Yes. They've written their own "specializer" for the 2nd compilation phase (which is covered extensively in video #10). For parsing and "in the middle" optimization they're just using the LLVM toolset. They can compile and run real programs and get good results doing so. They're still working on getting some remaining bits of the C standard libraries ported.
hoho wrote:x86 is extremely efficient on not highly branchy code as well.
That's like scoffing at some 7-foot-tall person's standing height and saying "a midget is tall when he stands up as well". Intel just bumped their absolute maximum (best-case) dispatch rate to 8 instructions per cycle (per core) as of Haswell. (It was 6 before that.) That's not exactly close to the Mill's 33. As mentioned earlier, widening that further on an x86-type architecture isn't really practical unless you're willing to eat an even deeper pipeline, which then hurts your performance on branchy code. Obviously Intel has decided that for their processor that would not be a worthwhile trade-off to make. And there are major architectural differences between x86 and Mill that mean the price of making an x86 core with a given issue width is always going to be significantly higher than it is for the Mill. This isn't a problem Intel can solve with clever chip design - they'd need to abandon the x86 ISA to solve it (and then they'd still have to deal with Mill's patents).
hoho wrote:Throw in 256/512bit SIMD and you'll get some insane throughput. "on the right workload" :)
You do realize that using SIMD on x86 doesn't improve their instruction issue rate, right? And that Mill also has vector processing? In fact, the Mill's vector processing is leagues ahead of x86's and lacks all of x86's ugliness. (On the Mill, data size and vector size move down the belt right along with the data, so they don't have to be encoded in the instruction, and scalars and vectors can coexist side by side in the same "register space"/belt. So you can use the exact same 'add' machine instruction whether you are adding two 8-bit numbers or two 64-bit numbers or two vectors of 32-bit numbers, etc. In fact, the Mill lets you use that very same 'add' instruction to add numbers that aren't even the same size as each other. That's a massive improvement over Intel's bazillion "seemingly random and endless sea of" SSE and SSE2 and AVX and MMX and VEX and REX/classical integer and stack-based FPU instructions.)
hoho wrote:Again, have they actually built any of their CPUs or is it all just hypothetical?
That's a false dichotomy. "hypothetical" means no evidence. They have a working simulation, a working compiler, and various applications running on top of their simulation, with (very good) measured (granted, simulated) results of that. They plan on switching from simulation to an FPGA test platform this year. Their ultimate goal is to be a fabless core provider (like ARM).
hoho wrote:I can't see how that is possible when ALUs are the most power hungry parts in modern CPUs.
CITATION REQUIRED

And even if that were true (and it's not true for common code, i.e. branchy and scalar), Mill would still consume less power than x86. I.e. if you ignore all other power savings - e.g. ignore that Mill doesn't have to pay the huge "out-of-order scheduling power tax" - and look at ALU power only, Mill still wins. The reason comes down to the basic physics of CMOS.

Say you're running your typical branchy code on both x86 and Mill, and the x86 is achieving 3 operations/cycle while the Mill is achieving 6 operations/cycle. Hold performance constant and look at power. What's the result? The higher operations/cycle on the Mill means you can clock the Mill at half the frequency of the x86 and still get the same performance. With a lower clock, you can then also run the Mill at a lower voltage than the x86. (For any given process and design, higher clocks require higher voltages - that's just how CMOS works.)

The dynamic power consumed by CMOS is capacitance × frequency × voltage². For the Mill the frequency is cut in half, but there is twice the capacitance in play (twice as many active ALUs), so capacitance × frequency is a wash (2 × 0.5 = 1). But since the Mill only needs to operate at half the frequency, you can reduce the voltage by, say (just making up a reasonable-ish number here), 20%. That means the dynamic power consumption of the Mill ALUs in this case is 0.8 × 0.8, or 64% of the dynamic power consumption of the x86 ALUs.
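Written out (same numbers; the 20% voltage reduction is the made-up figure from above):

    P ∝ C · f · V²
    P_Mill / P_x86 = (2C · f/2 · (0.8V)²) / (C · f · V²) = 1 × 0.64 = 0.64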

Note that the point here isn't to provide any sort of definitive answer about exactly how much power Mill will consume compared to x86, but to simply show "how that is possible": As long as Mill can provide the same performance at a lower frequency, then it can also run at a lower voltage and subsequently use less power.
hoho wrote:Sure, in specific workloads it could be since their CPU effectively seems to want to be a DSP.
It's not that it "wants to be a DSP" - it's that they've taken the architectural properties that allow DSPs to be more power efficient than general purpose CPUs and figured out how to extend them so they can also execute general purpose code. (Or at least that's their story. I don't really care how/why they came up with such great architecture ideas. I'm just happy they did. Well, mostly happy -- why didn't I come up with those ideas!?!?)
hoho wrote:In modern x86 CPUs, that part of the CPU is tiny compared to the entire die size.
http://wccftech.com/haswell-die-configu ... -revealed/
Just because Intel is retardedly putting an entire "low-end graphics card" on their chips doesn't change the fact that x86 instruction decode and scheduling consume around 1/3rd of every single CPU core they make. Next you'll be arguing that the entire CPU core isn't that important for CPUs because, after all, even with all of the cores added together they still make up a minority of the die area on such Intel chips.

Plus, that graphics crap doesn't count towards power when I (or any other self-respecting gamer) am using that chip (if stuck with such a chip), because we're using real graphics cards and the whole integrated-graphics section of the chip will be powered down (or at least I would hope so).
hoho wrote:Mill simply uses that transistor budget to implement things that "normal" CPUs don't have a need for.
CITATION REQUIRED
hoho wrote:Btw, is there any information on how Mill would handle thread/context switches? Do they require any additional magic, or will it "just work" and be indistinguishable from running a single thread?
Yes, there's information on it: https://www.youtube.com/watch?v=5osiYZV8n3U#t=21m25s
But you'd probably have to watch that video from the start, and maybe some of the earlier videos, to really understand it.
That video is mostly about security, and Mill has some really good security features. For example, a common attack on x86 is overflowing an array on the stack to overwrite a return address (and thereby cause the processor to jump wherever you want). Such attacks are strictly impossible on the Mill. New stack data is also automatically zeroed (for free - they've got tricks for doing that, i.e. they don't actually write a bunch of zeroes to memory), which not only increases performance (because zeroing at least some local data is pretty common in programs) but also prevents data (e.g. passwords) from being seen on the stack by code that has no business seeing it. It also supports (non-OS) services having their own stacks (so a user of a service can't just fill their own stack up, invoke the service, and cause it to fail with a stack overflow).

There's just lots and lots of stuff there - watch the videos. If you care about computer architecture then you owe it to yourself to watch them, in full, in order, because Godard isn't just listing/describing features of the Mill architecture (though he does plenty of that); he's describing fundamental problems in computer architecture, how/why previous approaches fail, and how they can be overcome. One after another. Every single video (except #10, the compiler one) has one or more of these "breakthrough bombs" in it. (The reason the compiler one doesn't is because they don't have to do anything revolutionary in their compiler.)
hoho wrote:Also, how long is the "pipeline" for the Mill CPU?
I don't think that's been covered in the videos. Pipeline depth is likely to vary from one Mill chip to another.
Anyways, in a sense it doesn't matter because, regardless of what their pipeline size is, they've already measured 6+ operations/cycle on common branchy code. I.e. their pipeline size is already "baked in" to that result. (Of course, obviously pipeline size matters - i.e. if they could magically have a shorter pipeline without hurting anything else they could then get their operations/cycle up to 7+ or more. My point is just that 6+ is already very good, and a lot better than what x86 achieves.)
hoho wrote:E.g. if they have perfect code and manage to schedule 32 instructions per cycle and there is a branch, they have to halve that to 16 to run both of the branches in parallel.
I think there's been some misunderstanding. Mill does support speculative execution (and the details of that are covered in the videos), but as far as I know it doesn't do "both branch directions at the same time".
hoho wrote:How does Mill handle SIMD? Are all its registers general purpose or are some SIMD?
I already answered this earlier in this post. (Its vector support is excellent, well integrated, highly regular and powerful, and is covered in video #5.)
hoho wrote:What about int vs float?
Like, to the death? My money's on int because he's a more solid wall of bits. :-)

Not sure what you're asking. Like int, float is supported for both scalar and vector data. The architecture (though not necessarily every single Mill chip) supports 16/32/64/128-bit IEEE binary, 32/64/128-bit IEEE decimal, and 8/16/32/64/128-bit ISO C fractions (whatever those are). Supported int sizes are 8/16/32/64/128 bits. This is covered in video #5.

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 9:30 am
by ratchetfreak
I found a video of a GDC talk about multithreading Destiny (GDC has been uploading a lot of talks that were previously behind their paywall to YouTube lately).

Granted, it's not the same engine architecture as Factorio's, but their way of preventing data races is interesting: they use jobs, annotate what data each job needs, and declare dependencies between jobs. Then with static analysis you can prove that there are no data races (a toy version of the idea is sketched below).

It requires a good job scheduler, though, and dev buy-in to properly annotate the jobs.
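A toy version of the annotation idea (entirely illustrative - the real Destiny system is far richer): each job declares what it reads and writes, and a checker flags pairs that could race if run concurrently.

    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    struct Job {
        std::string name;
        std::set<std::string> reads, writes;
    };

    bool may_race(const Job& a, const Job& b) {
        auto overlaps = [](const std::set<std::string>& s,
                           const std::set<std::string>& t) {
            for (const auto& x : s) if (t.count(x)) return true;
            return false;
        };
        // Write/write or read/write overlap on any resource is a potential race.
        return overlaps(a.writes, b.writes) ||
               overlaps(a.writes, b.reads) ||
               overlaps(a.reads, b.writes);
    }

    int main() {
        Job move  {"move_units", {"positions", "paths"}, {"positions"}};
        Job render{"build_draw", {"positions"},          {"draw_list"}};
        Job path  {"repath",     {"map"},                {"paths"}};
        std::vector<Job> jobs{move, render, path};
        for (std::size_t i = 0; i < jobs.size(); ++i)
            for (std::size_t j = i + 1; j < jobs.size(); ++j)
                if (may_race(jobs[i], jobs[j]))
                    std::printf("%s and %s must not overlap\n",
                                jobs[i].name.c_str(), jobs[j].name.c_str());
    }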

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 9:42 am
by hoho
NotABiter wrote:They can compile and run real programs and get good results doing so
How good?
NotABiter wrote:Intel just bumped their absolute maximum (best case) dispatch rate to 8 instructions per cycle (per core) as of Haswell. (It was 6 before that.) That's not exactly close to the Mill's 33.
There is little point in increasing the dispatch rate when the times you can actually execute that many instructions in parallel are rare. Getting 33 instructions executed in parallel for an extended period of time in a regular application is effectively a once-in-a-lifetime event.
NotABiter wrote:As mentioned earlier, widening that further on an x86-type architecture isn't really practical unless you're willing to eat an even deeper pipeline, which then hurts your performance on branchy code.
How deep is the effective pipeline on Mill?
NotABiter wrote:You do realize that using SIMD on x86 doesn't improve their instruction issue rate, right?
Of course not, but it does significantly increase computing power.
NotABiter wrote:And that Mill also has vector processing?
I assumed as much, but I'm still waiting to learn how they have solved their register layout. From what you said here it seems like its registers are all effectively general purpose, and SIMD instructions just take a bunch of those registers as one chunk and operate on them.
NotABiter wrote:"hypothetical" means no evidence. They have a working simulation, a working compiler, and various applications running on top of their simulation, with (very good) measured (granted, simulated) results of that
Is there a place where someone could look at their actual results instead of just vague "it has good performance" assertions?
NotABiter wrote:
hoho wrote:I can't see how that is possible when ALUs are the most power hungry parts in modern CPUs.
CITATION REQUIRED
http://www.cs.virginia.edu/~skadron/Pap ... ot_mej.pdf
Table 3 there gives some hints. If you have any better data, I'd love to see it.
NotABiter wrote:And even if that were true (and it's not true for common code, i.e. branchy and scalar), Mill would still consume less power than x86.
Based on what data can you say that Mill consumes less power than x86, when its speculative execution effectively does twice the amount of calculation compared to x86 on that branchy scalar code?
NotABiter wrote:I.e. if you ignore all other power savings, e.g. ignore that Mill doesn't have to pay the huge "out-of-order scheduling power tax", and just look at ALU power only, Mill still wins
Again, where is your data? How much power does the OoO scheduling part of the CPU take compared to other parts?

Transistor budgets in CPUs have become large enough that modern CPUs are adding a whole bunch of "dark" silicon nowadays - pieces of specialized hardware-accelerated functionality that are rarely used but, when needed, increase performance significantly (e.g. random number generators, encryption, video acceleration). Just cramming in extra cores isn't cost-effective when you don't have enough cache and memory bandwidth to feed them properly.
NotABiter wrote:The reason comes down to the basic physics of CMOS. Say you're running your typical branchy code on both x86 and Mill, and the x86 is achieving 3 operations/cycle and the mill is achieving 6 operations/cycle.
So you assert. I've not seen any evidence besides assertions from the people working on it. I saw similar assertions from Intel talking about how IA-64 would be the messiah that saves computing from the evil x86.

There is also the issue of speculative execution requiring more data to be pulled into the caches to run the code path that eventually gets discarded. Memory latency and cache misses are among the biggest reasons why CPUs aren't able to keep their pipelines filled.
NotABiter wrote:Hold performance constant and look at power. What's the result?
I'd love to know it, but I've not seen any data on the power usage of Mill. If you have any, I'd love to see it.
NotABiter wrote:As long as Mill can provide the same performance at a lower frequency, then it can also run at a lower voltage and subsequently use less power.
Yes, I'd agree with that. The only problem I have is that I've not seen any evidence that Mill can provide the same performance at a lower frequency on regular code. Speculations aren't evidence.
NotABiter wrote:they've taken the architectural properties that allow DSPs to be more power efficient than general purpose CPUs and figured out how to extend them so they can also execute general purpose code
Just because they're capable of running general code doesn't mean they're efficient at doing it. Itanium was also "capable" of running x86 code, and modern GPUs are Turing-complete. There aren't many things that benefit from being run on GPUs over CPUs.
NotABiter wrote:Just because Intel is retardedly putting an entire "low-end graphics card" on their chips doesn't change the fact that x86 instruction decode and scheduling consumes around 1/3rd of every single CPU core they make
Look up what I said above about transistor budgets.

The vast majority of the non-GPU part of the CPU die is spent on improving memory access and latency. It's useless to cram a couple of extra cores onto it when you won't have enough memory bandwidth to feed them and will get constant stalls waiting for data.
NotABiter wrote:Plus, that graphics crap doesn't count towards power when I (or any other self-respecting gamer) is using that chip (if stuck using such a chip) because we're using real graphics cards and the whole integrated graphics section of the chip will be powered down (or at least I would hope so).
You can use OpenCL to tap all that massive parallel computing power in the IGP from your general-purpose code if you wish, drastically improving the theoretical computing power of your CPU.
NotABiter wrote:
hoho wrote:Mill simply uses that transistor budget to implement things that "normal" CPUs don't have a need for.
CITATION REQUIRED
For starters, normal CPUs don't contain several dozen parallel ALUs for running general-purpose code, as it's rare to ever be able to use them all in parallel effectively.

Also, I'd love to see the breakdown of the transistor budget for Mill. Something tells me no one has a clue about it.
NotABiter wrote:Anyways, in a sense it doesn't matter because, regardless of what their pipeline size is, they've already measured 6+ operations/cycle on common branchy code
Do you have any idea what that "common branchy code" was? Some sort of SPEC benchmark?

Also, pipeline length most definitely matters, no matter the CPU architecture. The deeper it is, the more power and memory performance gets wasted on branches, context switches and cache misses.
NotABiter wrote:I think there's been some misunderstanding. Mill does support speculative execution (and the details of that are covered in the videos), but as far as I know it doesn't do "both branch directions at the same time".
Interesting. For me, speculative execution is effectively defined as running both branches in parallel and discarding the wrong one. If they're not doing that to improve their performance on branchy code, the length of their pipeline is crucial for figuring out how effectively it can execute "normal" workloads.
NotABiter wrote:
hoho wrote:What about int vs float?
Like, to the death? My money's on int because he's a more solid wall of bits. :-)
I meant whether they have separate registers for them, as x86 does.


Trying to find information on their pipeline, I found this article:
http://www.linleygroup.com/newsletters/ ... p?num=5038

Apparently, their numbers are 3x perf/watt, at a hypothetical peak performance of 33 instructions per cycle, compared to a quad-core Haswell.

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 10:23 am
by mrvn
hoho wrote:
mrvn wrote:Try a different solution. Instead of each of the 108 inserters checking the warehouse to see if they can take some items, go the other way and add more clock states.

Clock state 0: all inserters figure out how many items they can potentially grab (evaluate signals, set filters)
Clock state 1: the warehouse goes through its list of inserters in a deterministic order, offering items to each one that can take them.
Clock state 2: all inserters update their state to accept the offered items
Clock state 3: data goes live, the GUI can draw the new state
repeat
So, in other words, you iterate over the inserters three times? That'll be a lot of memory/cache bandwidth used, with an awful computations-per-byte ratio.
You go over them 3 times but with 8 cores.

There are also tricks you can do. For example, for clock states 2 and 3 you can use a ring buffer holding multiple sets of data. Then you only need one global variable saying which of the states is the active one. So instead of copying all computed state to GUI state on a clock tick, you just increment one int and EVERY item in the game will have updated its state.
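A minimal sketch of that trick, with made-up names (a real engine would shard the update loop across worker threads): each entity keeps a few copies of its mutable state, writers fill the next slot, readers see the current one, and publishing a whole tick is a single increment.

    #include <array>
    #include <atomic>
    #include <vector>

    constexpr int kSlots = 3;               // compute / publish / read headroom
    std::atomic<unsigned> g_tick{0};        // the one global "which slot" int

    struct InserterState { float rotation; int held_item; };

    struct Inserter {
        std::array<InserterState, kSlots> state;
        InserterState& next() { return state[(g_tick + 1) % kSlots]; }
        const InserterState& current() const { return state[g_tick % kSlots]; }
    };

    void update_all(std::vector<Inserter>& v) {
        for (Inserter& ins : v)             // writers only touch slot tick+1
            ins.next().rotation = ins.current().rotation + 0.1f;
        g_tick.fetch_add(1);                // flip everything live at once
    }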

Now you only go over the inserters twice, doing different things each time. But now you are doing it on 8 cores. Could be 4-8 times as fast.
Note: 8 cores also have 8 times as much L1 cache and 2-4 times as much L2 cache as a single core.

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 10:42 am
by hoho
mrvn wrote:Note: 8 cores also have 8 times as much L1 cache and 2-4 times as much L2 cache as a single core.
Running the same data through several cores at different times in different threads also means more memory accesses to fetch the same data. The devs already stated that this is currently one of the biggest problems with making Factorio scale well.




I stumbled upon this in the Mill page:
http://millcomputing.com/topic/introduc ... g-model-2/

Apparently, their 33 execution units aren't really as impressive as I originally thought.
The pipelines in the Mill Gold CPU are:

  • 8 pipes that do single-output "reader" operations
  • 4 pipes that can do either integer (including multiply) or binary floating-point (also including multiply) operations
  • 4 pipes that can only do integer (not including multiply) operations
  • 4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations
  • 4 pipes that can do either immediate constant or load/store operations
  • 4 pipes that can do pick operations
  • 5 pipes that can do "writer" operations
That's not much more than what we see in modern x86 CPUs. The difference is that they can issue instructions to all of them in parallel, in the hypothetical situation where there are that many independent instructions to run.

For example, when using SIMD, Haswell can run 32 single-precision FMA FLOPs per cycle, in addition to all the stuff it can do with other registers and memory accesses. The information on Mill is somewhat cryptic, but I don't think they have *that* much more vector throughput than that per core. Of course, code that can actually feed the SIMD units with that amount of computation and data is a rare sight in regular applications.
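For reference, that Haswell figure decomposes as:

    2 FMA units × 8 single-precision lanes (256-bit ÷ 32-bit) × 2 FLOPs per FMA (multiply + add) = 32 FLOPs per cycle per core.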

https://www.extremetech.com/computing/1 ... nvidia-amd

Re: Parallel processing in games & applications

Posted: Mon Jan 23, 2017 10:56 am
by mrvn
You aren't running over the same data at the same time. That's the big NO-NO in multithreading that would require expensive locking all over the place. You are running over different data at the same time.

What does happen is that you run over the same data multiple times at different times. That can't be avoided. And since the work per tick is split into lots of little chunks, the same data that would be accessed together in a single thread might now be accessed at widely separated times. This can be a huge killer of cache locality. But usually the gain from more cores outweighs the loss. There is no magic solution in computer science; you have to try, measure, and design things to work best.