Parallel processing in games & applications

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

shadetheartist wrote:You say that this "clock based multi-threading" design pattern is undiscovered by the subscribers of popular programming culture. Is that true? If so, is this post a revealing of cataclysmic changes to the meta of software as we know it? Or is there a backwater wiki page somewhere you're referencing?
The post isn't supposed to be about some cataclysmic change to the current programming meta, unless you would class the introduction of structured programming and object-oriented programming as cataclysmic changes. AFAIK there is no wiki anywhere with information on it: it occurred naturally (along with two other techniques for creating fine-grained parallelism) within a code base I have been working on. It's also not new: the technique is used anywhere that asynchronous code is forced upon the programmer (usually because of "physics"). That said, I have never seen any references to any form of structured multi-threading within the user-level coding community, with the exception of "See, we functional guys have it right: DON'T MUTATE YOUR DATA".
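In its simplest form the pattern looks something like this (a toy sketch in C++, my illustration rather than our actual code): every thread reads only the state that was published at the previous clock tick and writes only its own slot of the next state, and a barrier acts as the clock edge that publishes all the writes at once.

Code: Select all

#include <barrier>
#include <thread>
#include <vector>

int main() {
    const int kThreads = 4, kTicks = 3;
    std::vector<int> current(kThreads, 1), next(kThreads, 0);

    // The completion function runs once when every thread has arrived:
    // this is the "clock edge" that publishes all writes at once.
    std::barrier clock_edge(kThreads, [&current, &next]() noexcept {
        current.swap(next);
    });

    auto worker = [&](int id) {
        for (int tick = 0; tick < kTicks; ++tick) {
            // Read phase: only last tick's published state is read, so
            // no locks are needed even though all state is shared.
            int left = current[(id + kThreads - 1) % kThreads];
            // Write phase: each thread owns exactly one slot of "next".
            next[id] = current[id] + left;
            clock_edge.arrive_and_wait();  // block until the next tick
        }
    };

    std::vector<std::jthread> pool;
    for (int i = 0; i < kThreads; ++i) pool.emplace_back(worker, i);
    return 0;  // jthreads join on destruction
}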
shadetheartist wrote:Your argument comparing 90's game programming where "the code's fine, your computer is crap" to a prediction of the future of development has sparked my curiosity. It seems like if one wants to get a foot in the door to development jobs in the future one might want to consider your view on parallelism.
I believe this is an inevitability. The thing to remember is that Moore's law in the form of "the number of transistors that can be put into a device at the lowest cost per transistor increases exponentially with time" is still holding and will continue to hold for the foreseeable future, not because the size of the transistors will shrink but because other processes will get better. We don't have much wiggle room on transistor size any more (I think they are about 70 atoms wide IIRC) but we have a massive amount of room to move on power consumption. If we enable our applications to decentralise their processing requirements by running concurrently on 20,000+ cores then we may open up new avenues for chip designers to create more usable FLOPS for us.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:If we enable our applications to decentralise their processing requirements by running concurrently on 20,000+ cores then we may open up new avenues for chip designers to create more usable FLOPS for us.
Problem with that would be the rate and latency of data transfer between all those cores.

Computing isn't really limited by raw computing power. It's limited by the ability to get the required data into the proper registers in time. With 20k cores, you'd have to be able to move around terabytes of data every second. As you should be well aware, memory bandwidth and latency have been improving at WAY lower rates than Moore's law, and we are already pretty close to the physical limits there.
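For scale, a back-of-envelope figure (assuming 3 GHz cores that each consume just one byte per cycle): 20,000 × 3×10^9 B/s is 60 TB/s of demand, against the few tens of GB/s a desktop memory controller actually delivers. Caches and data locality would have to absorb a gap of roughly three orders of magnitude.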


Btw, by "fine grained", how fine a granularity do you have in mind? How many cycles of work can each of those thousands of cores have before it has to synchronize something with others or get data from others?

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

hoho wrote:Btw, by "fine grained", how fine granulation do you have in mind? How many cycles of work can each of those thousands of cores have before it has to synchronize something with others or get data from others?
It's an interesting question and one that I can't answer except to say that it will depend on the application being run, so I will instead talk about the results I have been able to get during our experimentation.

So, how granular are we at the moment? I'm not sure how you would describe the granularity exactly, as it varies. It's more dependent on what is easiest to express in code, or more specifically, what's easiest for us to read in code. For example, the execution of an SQL query on a connection would be "one instruction" because the interfaces to the connection libraries aren't written by us and supply single-threaded iterators, while there are places where an add/increment instruction, e.g. "x += 1", is "one instruction" simply because it makes the code look "nicer".
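As a toy illustration (hypothetical names, not our real scheduler API), both of those end up as the same kind of indivisible unit:

Code: Select all

#include <functional>
#include <queue>

// Hypothetical scheduler, for illustration only.
struct Scheduler {
    std::queue<std::function<void()>> work;
    void submit(std::function<void()> task) { work.push(std::move(task)); }
};

void example(Scheduler& s, int& x) {
    // A whole query is one unit: the connection library only hands us
    // a single-threaded iterator, so it can't be split any further.
    s.submit([] { /* run the SQL query, walk the result set */ });
    // A single increment is also one unit, purely for readability.
    s.submit([&x] { x += 1; });
}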
hoho wrote:Problem with that would be the rate and latency of data transfer between all those cores.
So far we have not had (read "noticed") any issues caused by the transfer rate between CPUs or memory even on larger core count machines. "Instructions" (single-threaded workloads) are generally queued on the core/NUMA node that created them and therefore have high data locality; they are only executed on a non-local core if that core is completely idle. You will always have data transfer between regions of the motherboard, be it RAM->CPU or CPU->CPU, but all in all, it appears to be a non-issue due to that locality. I believe that it's a boon as we are able to use more of the CPU's cache more often, but have no measurements to prove it.
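The queueing policy is roughly this (a reconstruction with invented names, not the real code): each core pushes the work it creates onto its own queue, and an idle core only steals from elsewhere as a last resort.

Code: Select all

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct CoreQueue {
    std::mutex m;
    std::deque<Task> q;
};

struct Pool {
    std::vector<CoreQueue> queues;  // one queue per core/NUMA node
    explicit Pool(std::size_t cores) : queues(cores) {}

    // Work stays on the queue of the core that created it, so the data
    // it touches is usually still in that core's cache.
    void submit(std::size_t core, Task t) {
        std::lock_guard<std::mutex> lk(queues[core].m);
        queues[core].q.push_back(std::move(t));
    }

    std::optional<Task> take(std::size_t core) {
        {   // Prefer local work: hot caches, same NUMA node.
            std::lock_guard<std::mutex> lk(queues[core].m);
            if (!queues[core].q.empty()) {
                Task t = std::move(queues[core].q.front());
                queues[core].q.pop_front();
                return t;
            }
        }
        // Completely idle: steal from the back of another core's queue.
        for (std::size_t i = 0; i < queues.size(); ++i) {
            if (i == core) continue;
            std::lock_guard<std::mutex> lk(queues[i].m);
            if (!queues[i].q.empty()) {
                Task t = std::move(queues[i].q.back());
                queues[i].q.pop_back();
                return t;
            }
        }
        return std::nullopt;
    }
};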

As it stands, during the bootup of the software we generally fluctuate between being able to use 7,500 - 27,000 cores, depending on a bunch of factors. We can get it much, much higher but haven't bothered because there is no hardware (reasonably available) with more than about 60 cores/threads. Just so that you are aware, the application is a desktop application, not some data-center number crunching system. At the last timing measurements, the bootup process (time to show the main window, fully functional) takes about 2700ms when single threaded and about 250ms when on a 60 core machine. There are a bunch of factors that affect that time, but they were the averages.


That is what we have seen during our development, but in theory there is no reason you couldn't take it all the way to the instruction level, where each x86/x64 instruction can be run on any core. There is no reason for us to do that yet as the underlying hardware is optimised to run single-threaded code. Sometimes I wish I could have a CPU packed full of 486 cores instead of the i7's & Xeons we have at the moment just to see how it would go.

Tigga
Manual Inserter
Posts: 4
Joined: Sat Apr 16, 2016 8:02 pm

Re: Parallel processing in games & applications

Post by Tigga »

BenSeidel wrote: The thing to remember is that Moore's law in the form of "the number of transistors that can be put into a device at the lowest cost per transistor increases exponentially with time" is still holding and will continue to hold for the foreseeable future
We've been behind Moore's law for a few years now.
BenSeidel wrote: As it stands, during the bootup of the software we generally fluctuate between being able to use 7,500 - 27,000 cores, depending on a bunch of factors. We can get it much, much higher but haven't bothered because there is no hardware (reasonably available) with more than about 60 cores/threads.
I've not been following this thread very closely. What about GPUs? 7.5k might be a bit low thread count, but if you can get "much, much higher" then GPUs seem like a fairly solid bet.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:It's an interesting question and one that I can't answer except to say that it will depend on the application being run, so I will instead talk about the results I have been able to get during our experimentation.
Feel free to go as deep as you need, pretty sure I can handle it :)
BenSeidel wrote:For example, the execution of an SQL query on a connection would be "one instruction" because the interfaces to the connection libraries aren't written by us and supply single-threaded iterators, while there are places where an add/increment instruction, e.g. "x += 1", is "one instruction" simply because it makes the code look "nicer".
SQL queries take ages compared to stuff in Factorio, not to mention the code using the queries is generally built to handle delays of, if not seconds, then dozens of milliseconds. Databases also generally aren't exactly great at constantly modifying the same fields and are more for reading from one place while writing to another.

In other words, that's not exactly a good comparison to bring up with a game like Factorio, where you have to perform tens of millions of read-modify-write operations per second while still having time left over to run scripting-language-based mods within the same tick.
BenSeidel wrote:So far we have not had (read "noticed") any issues caused by the transfer rate between CPUs or memory even on larger core count machines
That's because your workload is almost certainly massively different to games.
BenSeidel wrote:You will always have data transfer between regions of the motherboard, be it RAM->CPU or CPU->CPU, but all in all, it appears to be a non-issue due to that locality. I believe that it's a boon as we are able to use more of the CPU's cache more often, but have no measurements to prove it.
The per-tick working set for Factorio is a LOT bigger than the CPU caches, and that dataset has to be put through the CPU at minimum 60x per second. In reality it's much more than that, due to data from different parts of the working set having to be accessed depending on what gets calculated.

Before you say "just split it into chunks and work on each separately!", Amdahl's law will get you in the form of having to complete the chunks individually, syncing up, and then working on the chunk edges. Remember, stuff is calculated per tick in several iterations, with different iterations affecting each other: e.g. belt movement, power for an inserter pickup, room left at the inserter's target, enough items inserted to start manufacturing. Each of those depends in some form on the other parts. Sure, you could possibly split those interdependent checks into duplicated datasets and double-buffer them, but that would balloon the working set even bigger.
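Roughly what that double-buffered chunk scheme would look like (a simplified sketch with invented types, nothing like Factorio's actual code):

Code: Select all

#include <algorithm>
#include <execution>
#include <vector>

struct Chunk { /* belts, inserters, assemblers ... */ };

struct World {
    // Double buffer: last tick's published state plus the one being
    // built - which by itself already doubles the working set.
    std::vector<Chunk> read, write;

    void tick() {
        // Parallel phase: each chunk interior is updated against last
        // tick's snapshot, independently of its neighbours.
        std::for_each(std::execution::par, write.begin(), write.end(),
                      [this](Chunk& c) { /* update interior from read */ });
        // Serial phase - the part Amdahl charges for: reconcile edges
        // where belts/inserters cross chunk boundaries, then publish.
        for (Chunk& c : write) { /* fix up edges against neighbours */ }
        read.swap(write);
    }
};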
BenSeidel wrote:the bootup process (time to show the main window, fully functional) takes about 2700ms when single threaded and about 250ms when on a 60 core machine
So, adding 60x the computing power and god knows how much memory bandwidth, you only got things about 10x faster?
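To put that in Amdahl's terms (rough math, assuming the parallel part scales perfectly): speedup S(N) = 1 / ((1 - p) + p/N), where p is the parallelizable fraction. Your own numbers give S = 2700/250 ≈ 10.8 at N = 60, which solves to p ≈ 0.92. In other words about 8% of that boot is effectively serial, which caps you at roughly 13x no matter how many cores you throw at it.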

How much of a speedup do you think a Factorio-like thingy would get on an average gaming PC that won't get any extra memory bandwidth or latency with multithreading and has to crunch through the entire dataset 60x per second?

For the record, going by the Steam HW survey, that's either 2 or 4 cores:
[Attachment: cpu_cores.PNG - Steam hardware survey, distribution of CPU core counts]
BenSeidel wrote:Sometimes I wish I could have a CPU packed full of 486 cores instead of the i7's & Xeons we have at the moment just to see how it would go.
https://www.intel.com/content/www/us/en ... rack=1&amp
You're welcome :)

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

Tigga wrote:I've not been following this thread very closely. What about GPUs? 7.5k might be a bit low thread count, but if you can get "much, much higher" then GPUs seem like a fairly solid bet.
GPUs are absolutely horrible for running general-purpose code. They have both HUGE latencies and horrible efficiency when not running the exact same code on all the cores in parallel. Hell, just having a branch in one of the cores go the other way than in the others is a major hiccup (e.g. on code like if x == 0 then y else z, where on some cores x is zero and on others not).

Tigga
Manual Inserter
Posts: 4
Joined: Sat Apr 16, 2016 8:02 pm

Re: Parallel processing in games & applications

Post by Tigga »

hoho wrote:
Tigga wrote:I've not been following this thread very closely. What about GPUs? 7.5k might be a bit low thread count, but if you can get "much, much higher" then GPUs seem like a fairly solid bet.
GPUs are absolutely horrible for running general-purpose code. They have both HUGE latencies and horrible efficiency when not running the exact same code on all the cores in parallel. Hell, just having a branch in one of the cores go the other way than in the others is a major hiccup (e.g. on code like if x == 0 then y else z, where on some cores x is zero and on others not).
His upper end was 27k cores. Assuming that means 27k pieces of independent work, that's just about enough for most GPUs to hide the latencies. "Much much higher" might be 100k-1million+, which is plenty enough.

As for running the same code on all cores and branch costs: I'm mostly familiar with NVIDIA GPUs and CUDA. On those you need groups of 32 threads executing the same code (and taking the same branches) for best performance. A bit of branching is usually OK; a lot can be problematic. I don't think you could offload game logic (e.g. for Factorio) and get any meaningful speedup, but that doesn't mean the app the other guy is developing couldn't benefit.
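To illustrate what divergence costs, here's a toy scalar model of how a warp executes hoho's "if x == 0 then y else z" example (an illustration only, not real CUDA): when the 32 lanes disagree, the hardware runs both sides and masks the results.

Code: Select all

#include <array>
#include <cstdint>

constexpr int kWarp = 32;

std::array<int, kWarp> warp_select(const std::array<int, kWarp>& x) {
    std::array<int, kWarp> out{};
    std::uint32_t mask = 0;  // lanes where x == 0
    for (int lane = 0; lane < kWarp; ++lane)
        if (x[lane] == 0) mask |= 1u << lane;

    if (mask != 0)           // "then" pass, masked to matching lanes
        for (int lane = 0; lane < kWarp; ++lane)
            if (mask & (1u << lane)) out[lane] = 1;    // stand-in for y
    if (mask != 0xFFFFFFFFu) // "else" pass, masked to the other lanes
        for (int lane = 0; lane < kWarp; ++lane)
            if (!(mask & (1u << lane))) out[lane] = 2; // stand-in for z

    // A pass is only skipped when all 32 lanes agree on the branch;
    // otherwise the warp pays for both sides.
    return out;
}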

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

Tigga wrote:His upper end was 27k cores. Assuming that means 27k pieces of independent work, that's just about enough for most GPUs to hide the latencies.
Sure but his code is nothing like what games run. I'm willing to bet those thousands of cores aren't syncing up their work 60x per second.

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

@Tigga
GPUs aren't CPUs. If I were doing 3D drawing or even touched floating-point math then maybe, but I'm not forcing a specific problem set into another problem set, I'm showing a general-purpose solution.

@Hoho,
Most of what you have said is counter-productive. Pulling out single lines from a post only causes confusion. If you have a question then ask.

I will however talk about this part:
hoho wrote:Amdahl's law will get you
Amdahl's law, while mathematically correct, is irrelevant in modern computing. Bringing up Amdahl's law as the reason why "multi-threading is hard" is like saying that Einstein's general relativity is the reason why cars really haven't increased in speed over the last 20 years. While we know undoubtedly what c (the speed of light) is, we have no idea, even theoretically, what the upper value of P is. For all we know Factorio may have a P value of 0.9997^{# of active chunks}.

If anyone knows of any research on what the upper bounds of P could be then I would be very interested.
hoho wrote:
BenSeidel wrote:Sometimes I wish I could have a CPU packed full of 486 cores instead of the i7's & Xeons we have at the moment just to see how it would go.
https://www.intel.com/content/www/us/en ... rack=1&amp
You're welcome :)
ROFL... Do you have any understanding of what Moore's law has done in the 28 years since the 486 was released?
This would have been a better link:
https://www.parallella.org/2016/10/05/e ... processor/

Tigga
Manual Inserter
Posts: 4
Joined: Sat Apr 16, 2016 8:02 pm

Re: Parallel processing in games & applications

Post by Tigga »

BenSeidel wrote: GPUs aren't CPUs. If I were doing 3D drawing or even touched floating-point math then maybe, but I'm not forcing a specific problem set into another problem set, I'm showing a general-purpose solution.
Very true, but I figure most applications that can make use of 100k+ parallel cores can probably benefit from a throughput machine if performance is a concern. Not all of them of course...
BenSeidel wrote: I will however talk about this part:
hoho wrote:Amdahl's law will get you
Amdahl's law, while mathematically correct, is irrelevant in modern computing.
Wow. I guess the field of modern computing I work in is very different from the field of modern computing you work in. Reducing sequential bottlenecks is critical for performance on many workloads.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:Most of what you have said is counter-productive. Pulling out single lines from a post only causes confusion. If you have a question then ask.
I was countering specific assertions. I can't really see how it could have been confusing.

Amdahl's law was shown perfectly in your own example where 60x more computing power barely gave you 10x speed increase.

I know enough about 486 that you won't be able to run it anywhere near the clock speed of that 70+ core machine. Not to mention lack of FPU, let alone SIMD. Hell, you'd be lucky to get a cache on it. I hope you are aware how big a part of the die is made up of various caches in modern CPUs.

That Epiphany thingy is effectively a glorified GPU. It's great if you have a MASSIVE amount of computation per byte moved and sucks horribly when you have a data load similar to games. Take a guess how much time it takes to send data from one end of the CPU to the other and how many such packets can be in flight at a time. Its clock speed is around 7x lower than modern CPUs and memory bandwidth isn't really much better. More than half the die is spent on scratch memory; there is NO cache. FLOPS-wise, it's only around 3x more per mm^2 of die compared to Broadwell.

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

Tigga wrote:Very true, but I figure most applications that can make use of 100k+ parallel cores can probably benefit from a throughput machine if performance is a concern. Not all of them of course...
I think we are agreeing then: there are some programs that just won't work on a GPU. Those are the programs that I am trying to target.
Tigga wrote:Wow. I guess the field of modern computing I work in is very different from the field of modern computing you work in. Reducing sequential bottlenecks is critical for performance on many workloads.
I'm not saying that at all. Sequential bottlenecks are an extremely large issue and clearing them out is extremely important. I'm simply saying that your bottlenecks aren't caused by Amdahl's law, in the same way a blocked injector or a flat tire isn't caused by the theory of relativity. Amdahl's law simply states that only the parts of a program that can benefit from additional resources actually benefit from those additional resources. There is nowhere in his law that states what that upper bound is, only that one exists. I am also stating that I have never seen a study of the application of his law to code in anything but trivial cases such as "how parallel can you make a binary sort"... well, it's a merge sort, isn't it? So we have no idea if we are right up against the parallel wall or if Factorio could be run on 1 billion CPUs. For these reasons I find any Amdahl-based argument a pointless endeavour.

If you feel that my assessment of the law is incorrect then please let me know.
hoho wrote:I was countering specific assertions. I can't really see how it could have been confusing.
You were not. That entire post was about the execution characteristics of my application and the performance characteristics of running a single threaded application over many cores. It has absolutely nothing to do with Factorio and I fail to understand why you proceeded to try to compare the incomparable. As the entire post was a summary of my observations about a desktop application I have been writing, how is what you did constructive?
hoho wrote:Amdahl's law was shown perfectly in your own example where 60x more computing power barely gave you 10x speed increase.
Firstly, see above for the explanation of why Amdahl's law isn't the underlying cause.
Secondly, what you are talking about here are the processing characteristics of the CPU architecture I am testing the software on, one that is purpose-built for running a single thread as fast as possible. On a CPU architecture designed to run as many threads as quickly as possible, we would see a very different story.
hoho wrote:I know enough about 486 that you won't be able to run it anywhere near the clock speed of that 70+ core machine.
I am indeed curious as to why that would be the case. Could you please expand on this, as I can't see any reason why a 486 fab'd using today's technology can't run at the frequency of the other CPUs produced today. I am really interested in knowing if this is indeed true and, if so, what the underlying cause is.
hoho wrote:Not to mention lack of FPU, let alone SIMD. Hell, you'd be lucky to get a cache on it. I hope you are aware how big part of the die is made up of various caches in modern CPUs.
Well, not having FPU or SIMD instructions wouldn't bother me at all as I personally HATE floating points and believe that they should die a slow and horrible death. As for the cache, I am painfully aware of what the lack of Dennard scaling has been doing to the CPU over the last decade.
hoho wrote:That epiphany thingy is effectively glorified GPU. It's great if you have MASSIVE amount of computations per byte moved and sucks horribly when you have data load similar to games. Take a guess how much time it takes to send data from one end of the CPU to the other and how many such packets can be in flight at a time. Its clock speed is around 7x lower than modern CPUs and memory bandwidth isn't really much better. More than half the die is spent on scratch memory, there is NO cache. FLOPS wise, it's only around 3x more per mm^2 of die compared to Broadwell.
That is true, but it's still closer to what I originally asked for: a die with as many 486s on it as possible. Massively-cored CPUs are currently a research thing, or a bespoke system build. They are not good for general-purpose computing. This is only the case because there is not enough research into how to build them, mainly because there is not enough general-purpose massively multi-threaded code being written, and that is only the case because there is not enough research into how to write general-purpose massively multi-threaded code.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:I am also stating that I have never seen a study of the application of his law to code in anything but trivial cases such as "how parallel can you make a binary sort"... well, it's a merge sort, isn't it? So we have no idea if we are right up against the parallel wall or if Factorio could be run on 1 billion CPUs. For these reasons I find any Amdahl-based argument a pointless endeavour.
You got the law down more or less correctly. What you don't seem to realize is that games have far more interdependent computations than sorting, and thus a lower ceiling on how parallel they can be made.
BenSeidel wrote:
hoho wrote:I was countering specific assertions. I can't really see how it could have been confusing.
You were not. That entire post was about the execution characteristics of my application and the performance characteristics of running a single threaded application over many cores.
What I tried to achieve was to explain how your examples are irrelevant to games, due to being a radically different workload with radically different latency requirements.
BenSeidel wrote:
hoho wrote:I know enough about 486 that you won't be able to run it anywhere near the clock speed of that 70+ core machine.
I am indeed curious as to why that would be the case. Could you please expand on this, as I can't see any reason why a 486 fab'd using today's technology can't run at the frequency of the other CPUs produced today. I am really interested in knowing if this is indeed true and, if so, what the underlying cause is.
Long story short, pipeline length. At minimum you'd have to increase it quite a bit to avoid hitting hard limits of transistor switching speed to get a 486 running at 3 GHz+.
For example, do you know how many transistors have to switch *in sequence*, not in parallel, to complete a single multiplication of two 32-bit integers?

A second issue that isn't exactly a limitation, but just something to keep in mind, is IPC. The 486 did around 0.3 instructions per cycle vs a 10-core i7 at over 100 per cycle, so the i7 is around 300x faster than the 486 clock-for-clock. Transistor-wise, the 486 has about a million of them vs ~3.4 billion for that i7. Take out the "fluff" from the i7, like IO and cache, and you'll cut the transistor count by around 5x. Now add in the fact that the same instructions run at faster speed on the i7 than on the 486, look at the cache-less 486, and one begins to wonder what would be the point of having a ton of near-useless cores. You'll also get the "tiny" problem that you'd have to completely redesign the 486 to work with modern 13+ layer die designs. In the end, you wouldn't really have something that even remotely resembles a 486.

I couldn't find an annotated die shot of a 486, but this should give a rough idea of how much of the die area is spent on the "extras":
[Attachment: Intel-Broadwell-E-Die-Chipshot.png - annotated Broadwell-E die shot]
The 486 didn't have the vast majority of the stuff there.

Though note that we couldn't just rip all that stuff out and replace it all with ALUs. Big parts of the chip aren't really producing all that much heat (e.g. cache) while ALUs would generate a ton of it. Just have a look at how GPUs have a comparable number of transistors at significantly lower clock speeds and still consume 3-4x more power.

Long story short, you'd not get comparable IPC for the 486-cluster if you take all the limitations into account.
BenSeidel wrote:
hoho wrote:Not to mention lack of FPU, let alone SIMD. Hell, you'd be lucky to get a cache on it. I hope you are aware how big a part of the die is made up of various caches in modern CPUs.
Well, not having FPU or SIMD instructions wouldn't bother me at all as I personally HATE floating points and believe that they should die a slow and horrible death.
You may not like them but you will not want to run game code on fixed point math.
BenSeidel wrote:As for the cache, I am painfully aware of what the lack of Dennard scaling has been doing to the CPU over the last decade.
Also have a look at how little increase there has been in how much data can be moved from RAM to caches/registers per cycle. You do not want to make things worse by hammering the memory bus with a crapton of cores.


Another absolutely *massive* thing you don't seem to acknowledge is that writing parallel code is always more error-prone and complex than sequential. Sure, if people did more of it, it'd be easier, but it can not ever be anywhere near as simple as sequential. There is no business sense in jumping through hoops to get a game to use 2-4 cores when that makes every future code alteration that much harder and bugs that much harder to track down.

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

@hoho.
Once again you have not read what I have posted, instead trying to justify what you have said as being relevant.
I was going to respond, but I feel that anyone reading the thread should have an understanding of what I am saying.

If anyone has any questions please ask them.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

I'm not justifying, I'm explaining what I thought was plain obvious to everyone.

I remember the last time we had a similar discussion you also bowed out in a similar way when I started putting up hard data that showed your claims to not apply to real life.

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

hoho wrote:I remember the last time we had a similar discussion you also bowed out in a similar way when I started putting up hard data that showed your claims to not apply to real life.
No, I bow out because you are impossible to discuss things with as you don't seem to understand that different words have different meanings.

Here are some examples (I have highlighted, where appropriate, the words whose meaning you seem to have failed to understand).

I say:
I can't see any reason why a 486 fab'd using today's technology can't run at the frequency of the other CPUs produced today.
you reply starts with:
Long story short, pipeline length.
and ends with:
Long story short, you'd not get comparable IPC for the 486-cluster if you take all the limitations into account.
Issue:
There is a difference between the term frequency, used to refer to the clock-rate (the exact term YOU used in your original claim) of a CPU, and its pipeline length or its instructions per cycle.

I say:
It's an interesting question and one that I can't answer except to say that it will depend on the application being run, so I will instead talk about the results I have been able to get during our experimentation.
you say:
SQL queries take ages compared to stuff in Factorio,[...]
Issue:
I was talking about experimental results in one problem domain. A domain that, just like games, has not been made to run using parallel code. You cite the differences between two applications, in two different domains, at two stages of development, as a reason why my general-purpose solution cannot be applied in any other domain. You also offer no explanation as to why the interdependencies in a game far outweigh the interdependencies in my desktop application. Considering that you need to give 27,000 reasons, you have a lot of work to do.

This issue really permeates that entire post: you are comparing results from an experimental desktop application written by 2 people over 2 years to an almost feature-complete game that has had up to 7?? full time programmers working on it over a 5 year period.

I say:
Well, not having FPU or SIMD instructions wouldn't bother me at all as I personally HATE floating points and believe that they should die a slow and horrible death.
you say:
You may not like them but you will not want to run game code on fixed point math.
Issue:
Do you wish to school me on Catholicism while you are at it?
Just in case you didn't get it: I am telling you that you are unable to tell me what I want and don't want.

I am bowing out because I cannot continue to respond to your posts if every time you respond you choose to ignore key words or previous sentences that give what I have said context.

Edit:
OMG... Just realised that I have said this before... you suckered me in again....

ratchetfreak
Filter Inserter
Posts: 950
Joined: Sat May 23, 2015 12:10 pm

Re: Parallel processing in games & applications

Post by ratchetfreak »

BenSeidel wrote:
hoho wrote:I remember the last time we had a similar discussion you also bowed out in a similar way when I started putting up hard data that showed your claims to not apply to real life.
No, I bow out because you are impossible to discuss things with as you don't seem to understand that different words have different meanings.

Here are some examples (I have highlighted, where appropriate, the words whose meaning you seem to have failed to understand).

I say:
I can't see any reason why a 486 fab'd using today's technology can't run at the frequency of the other CPUs produced today.
you reply starts with:
Long story short, pipeline length.
and ends with:
Long story short, you'd not get comparable IPC for the 486-cluster if you take all the limitations into account.
Issue:
There is a difference between the term frequency, used to refer to the clock-rate (the exact term YOU used in your original claim) of a CPU, and its pipeline length or its instructions per cycle.
But his point was that the pipeline length means that equivalent clocks won't achieve equivalent performance. In fact there is easily a 10x difference single-core, even without accounting for how to interact with the rest of the motherboard.

Also, a single-chip cluster of individual CPUs exists: it's called the Cell processor and can be found in the PS3.
BenSeidel wrote:
I say:
Well, not having FPU or SIMD instructions wouldn't bother me at all as I personally HATE floating points and believe that they should die a slow and horrible death.
you say:
You may not like them but you will not want to run game code on fixed point math.
Issue:
Do you wish to school me on Catholicism while you are at it?
Just in case you didn't get it: I am telling you that you are unable to tell me what I want and don't want.
Doing fixed-point math means you need to account for where the point is at every single operation, and then you possibly need to do a shift. And you only get 8 bits of extra precision out of it for all your trouble in single precision. If you get it wrong you suddenly get completely wrong results.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:I can't see any reason why a 486 fab'd using today's technology can't run at the frequency of the other CPUs produced today.
you reply starts with:
Long story short, pipeline length.
and ends with:
Long story short, you'd not get comparable IPC for the 486-cluster if you take all the limitations into account.
Issue:
There is a difference between the term frequency, used to refer to the clock-rate (the exact term YOU used in your original claim) of a CPU, and its pipeline length or its instructions per cycle.
As I said, pipeline length is why a 486 can't be clocked anywhere near the frequency of modern CPUs. Lousy IPC was added to explain why it wouldn't make much sense to run them at that frequency even if you could get their frequency up.
BenSeidel wrote:I say:
It's an interesting question and one that I can't answer except to say that it will depend on the application being run, so I will instead talk about the results I have been able to get during our experimentation.
you say:
SQL queries take ages compared to stuff in Factorio,[...]
Issue:
I was talking about experimental results in one problem domain. A domain that, just like games, has not been made to run using parallel code. You site the differences between two applications, in two different domains, in two stages of development as a reason why my general purpose solution cannot be applied in any other domain. You also offer no explanation as to why the interdependencies in a game far outweigh the interdependencies in my desktop application. Considering that you need to give 27,000 reasons, you have a lot of work to do.
I already explained the main differences - your app doesn't need to sync everything up 60x per second with latencies no longer than 1/60th of a second.
Also, SQL servers have been designed from the ground up to allow running stuff in parallel and pretty much presume a long latency between querying and getting a result.

In other words, sure, you can get stuff to run much faster in specific domains, but those domains have little to nothing to do with games, and the results you are seeing aren't applicable to the workloads games see.
BenSeidel wrote:This issue really permeates that entire post: you are comparing results from an experimental desktop application written by 2 people over 2 years to an almost feature-complete game that has had up to 7?? full time programmers working on it over a 5 year period.
So, we agree that your example is meaningless since it's simply way too different to games?
BenSeidel wrote:I say:
Well, not having FPU or SIMD instructions wouldn't bother me at all as I personally HATE floating points and believe that they should die a slow and horrible death.
you say:
You may not like them but you will not want to run game code on fixed point math.
Issue:
Do you wish to school me on Catholicism while you are at it?
Just in case you didn't get it: I am telling you that you are unable to tell me what I want and don't want.
The FPU "problem" came up in when 486's had horrible support for them therefore they'd suck horribly for running games. It was continuation of the "1000x 486 per chip" discussion.

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Re: Parallel processing in games & applications

Post by hoho »

ratchetfreak wrote:But his point was that the pipeline length means that equivalent clocks won't achieve equivalent performance
Not quite.
Longer pipelines allow for higher clock frequencies, but at the same clock speed a shorter pipeline is almost always faster/more effective than a longer one. Mostly that's due to branch mispredictions being much more expensive, since you have to flush the entire pipeline to be able to continue with calculations. In the newest P4s, the pipeline was some 30 cycles long, meaning any time you had a branch misprediction you effectively lost 30+ clock cycles of work, plus you had to wait 30 cycles to start getting results from the newly filled pipeline.

Modern i7 and AMD stuff should have pipeline lengths a bit under 20 cycles.
ratchetfreak wrote:Also, a single-chip cluster of individual CPUs exists: it's called the Cell processor and can be found in the PS3.
Cell was an interesting beast. It lacked cache and compensated with fully software-controlled local memory that had the access speed of cache. External memory access was relatively high-latency and low-bandwidth, sadly, and you could only ever do calculations with the data that was within the 256 KiB local memory. The same memory also had to fit the piece of code you were running.

Though if you managed to get your workload chopped into small enough pieces to run on those cores, you could get some rather impressive results. By the end of the PS3's life, bigger developers were offloading parts of rendering to the Cell, as it had computing power effectively comparable to the GPU in the machine.

In other words, it had a TON of limitations but if you had the resources to make stuff work for it, you could get some pretty good results out of it.

Funny thing is, the generation after PS3/XB360 had CPUs with *less* computing power than Cell did.

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Re: Parallel processing in games & applications

Post by BenSeidel »

ratchetfreak wrote:But his point was that the pipeline length means that equivalent clocks won't achieve equivalent performance. In fact there is easily a 10x difference single-core, even without accounting for how to interact with the rest of the motherboard.
Yeah, I know what his point was, and everything he said in that area is true. The issue is that he continually ignores key terms in my statements and uses incorrect terms in his own. When I ask him about it he starts on one of those posts that doesn't address his original statement.

My response would have been extremely different if he had not used the term "clock rate" but instead used either of his later terms, "pipeline length" or, as his point was really about effective calculations, "IPC". In that case I could have talked about how it's not the effective instructions per second I could get out of my "wish machine", as I realise that today's processors are far more effective per cycle at executing instructions. It's the idea that I could have 20,000+ x86 cores. The Epiphany is a very different system: you can't just take an existing program, run it, and expect it to work as well as if you had written your application specifically for its architecture (same as the PS3 CPU). What I want out of all those cores is to see how an application that can run on 20,000+ cores behaves on 20,000+ cores, without having to rewrite the application specifically for an experimental CPU design. Of course I am assuming that Intel (seeing as I said 486) was able to create such a massive processor that had "uniform cores". I say "uniform cores" because the correct term escapes me. Essentially I mean that any core has the same characteristics as any other core, not something that the Epiphany or similar architectures have. As uniformity is a prerequisite to general-purpose, I was sort of assuming that it was a given and that I could just wish it into existence.
ratchetfreak wrote:Doing fixed-point math means you need to account for where the point is at every single operation, and then you possibly need to do a shift. And you only get 8 bits of extra precision out of it for all your trouble in single precision. If you get it wrong you suddenly get completely wrong results.
It's computer programming: if you get anything wrong you get completely wrong results, so I'm not sure what your point is. :lol:

Accounting for the binary (or decimal) point is a far more interesting discussion. BTW, I would not recommend using a decimal representation. The issue isn't with the representation, it's an issue with the tools we use. The thing with a fixed-point representation is that you must know at all times what your precision is. This should be the case anyway, because you should know what your program is doing. It's essentially type checking and should be handled the same way: the checks should be offloaded to your compiler, but the information should be there and alarms sounded when you do something inexact. When you say x = 0.1 your compiler should tell you that it can't store 0.1 exactly and that you have to perform a rounding operation. While this seems arduous to begin with, in this case it can be inferred from the precision of x.

As for shifting: yes, you need to do a shift, but you need to do a shift in floating point anyway; it's just handled by the hardware, so fixed point is at a speed disadvantage on current architectures. Additionally, you need to do less shifting, as you know what precision you are working in and only have to shift when you convert to another precision. With floating point that shift is done every time you perform any operation.
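To make that concrete, here is a toy sketch (my illustration only, not a real library) of fixed point where the precision lives in the type, so the compiler does the bookkeeping and the shift only appears at an explicit conversion:

Code: Select all

#include <cstdint>

template <int Frac>            // value = raw / 2^Frac
struct Fixed {
    std::int64_t raw;
};

// Same precision in, same precision out: no shift at all.
template <int F>
Fixed<F> operator+(Fixed<F> a, Fixed<F> b) { return {a.raw + b.raw}; }

// Multiplication changes the precision, and the result type says so;
// converting back is an explicit, visible operation.
template <int A, int B>
Fixed<A + B> operator*(Fixed<A> a, Fixed<B> b) { return {a.raw * b.raw}; }

template <int To, int From>
Fixed<To> rescale(Fixed<From> v) {  // the one place a shift happens
    if constexpr (To >= From) return {v.raw << (To - From)};
    else                      return {v.raw >> (From - To)};
}

int main() {
    Fixed<16> a{3 << 16};            // 3.0 with 16 fractional bits
    Fixed<16> b{1 << 15};            // 0.5 - exactly representable
    Fixed<32> p = a * b;             // 1.5 with 32 fractional bits
    Fixed<16> q = rescale<16>(p);    // shift only at the conversion
    return q.raw == (3 << 15) ? 0 : 1;  // q is 1.5 at 16 bits
}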

Look, like any system there are intricacies. Personally I feel that it's much "better" than floating point, in the same way that type checking is "better" than dynamic languages (now that is a rage-inducing flame war right there).
