Parrallel processing in games & applications

justarandomgeek · Post by **justarandomgeek** » Thu Jan 19, 2017 6:59 am

BenSeidel wrote:
justarandomgeek wrote:This is Factorio, so really you ought to go a little further - make it a Warehouse (from the mod) with all 24 possible inserters and logbots going both ways (storage). And you've got a several thousand of them.

kinnom wrote:Or better, Add bobinserters and have 108 inserters
I think I covered this, in any case I am not writing psudocode/describing how you would handle 108 inserters when you should understand that a while-loop can do any number you wish.

Of course. But the way you described this earlier, you now have 108 inserters each all calculating what all 107 of the others would have to do, to be sure they can all do what they want to do. So you've taken those 108 inserters and done 11664 inserters worth of work for them. I don't care what chip you've got, this won't be faster than simple doing them in order on a single thread, as they are done now.

And that's just for one warehouse.

BenSeidel · Post by **BenSeidel** » Thu Jan 19, 2017 7:54 am

greep wrote:Oh sure, it's just as simple as that to parallelize. That must be why you're writing whole essays to described some inserters interacting with a chest Anyways, we seem to be drifting into nerd rage territory, so I'm off to blow up aliens.

I don't think this is an essay. The core idea is actually a very small part of the posts. Most of my responses are to specific examples of how you could apply the technique to some scenario. As they are all made up examples I have put forward a made up solution. We could really go on forever about one inserter going into one chest, we have causality requirements, memory layouts, and list structures to go over to name a few aspects before we reach anything close to a complete discussion.

Also, from your argument that complex explanations only accompany complex ideas means that you should not have any understanding about gravity unless you have read "Philosophiæ Naturalis Principia Mathematica".

If, after blowing up some aliens (might I suggest orbital ion cannon), you have any issues with the idea let me know. Only once an idea is understood can you comment on its complexity.

justarandomgeek wrote:
BenSeidel wrote:
justarandomgeek wrote:This is Factorio, so really you ought to go a little further - make it a Warehouse (from the mod) with all 24 possible inserters and logbots going both ways (storage). And you've got a several thousand of them.

kinnom wrote:Or better, Add bobinserters and have 108 inserters
I think I covered this, in any case I am not writing psudocode/describing how you would handle 108 inserters when you should understand that a while-loop can do any number you wish.

Of course. But the way you described this earlier, you now have 108 inserters each all calculating what all 107 of the others would have to do, to be sure they can all do what they want to do. So you've taken those 108 inserters and done 11664 inserters worth of work for them. I don't care what chip you've got, this won't be faster than simple doing them in order on a single thread, as they are done now.

And that's just for one warehouse.

I thought I had discussed the issue with each inserter having their own "execution context" here:

BenSeidel wrote:Sure, because I chose a poor memory bound. You need to pick "execution contexts" that have tightly bound data. Processing two closely related things at two separate locations is stupid, but these mistakes are essentially the same ones that can be made in single-threaded code. It's a case of not understanding what parts of your code are accessing the same memory, hence not knowing that they should be processed at the same time to prevent memory transfers.

You can tell that you have incorrect memory bounds in your execution contexts because you get those artifacts. It's a similar concept to an incorrectly designed Object, instead there you end up making all of your variables public. In both cases there is a lack of understanding in the memory access cohesion of your application.

As all the inserters are closely bound and share a large amount of data, merge them into one execution context. I have already talked about how this could be implemented.

If you are having an issue with the idea that you would group your 108 inserters into one "execution context" then let me know what exactly it is that you don't understand. Or is it the mechanism of merging the contexts? Or is it more that you don't see what the context is there for? I only chose the one-entity-per-context because it highlights the extent at which you can multi thread, not because it's an appropriate level of multithreading.

ratchetfreak · Post by **ratchetfreak** » Thu Jan 19, 2017 11:38 am

BenSeidel wrote:
ratchetfreak wrote:This forces mods to be a fully serial part of the game loop with no recourse to even attempt to multi-thread unless you restrict what the lua code can do. Or in other word only let it read the game state and send events to be processed in game_update() along with all other changing events. After you do that you can run multiple mods at the same time because they only need read-access and an outgoing message queue.

However it would be much better if the mod authors get access to do multithreading themselves so the event handling code for each mod doesn't have to be serial.
I have been thinking about this, and you should be able to get the LUA code multithreaded using this technique, you just need each callback to it's own execution context. The only requirement for it to be doable is on how information is moved through the global variable to other executing sections of the code. Any modification to the game state can be done through the tick-tock memory buffer, possibly the same with the global table. BUT AGAIN I HAVE TO SAY: I have no idea of the causality constraints in the API in factorio. So without seeing any code or having a detailed explanation of these constraints from the developers this conversation is a he-said she-said waste of time. The only thing that can be said for certain is that if process A does not depend on process B, and process B does not depend on Process A then they can executed at the same time. I offer a technique to ensure that these process don't every rely on each other.

you can check the mod forums or some mod code to divine the current mod architecture. Baseline is that mods can access any entity in a read/write way at any time during their execution. This puts a hard dependency between any 2 mods.

BenSeidel wrote:
ratchetfreak wrote:A FIFO would not be deterministic because the OS scheduler will affect the order the access come in.
a FIFO system has nothing to do with the OS scheduler. I cannot respond to this because I have absolutely no idea what you are thinking.

FIFO depends on who gets in first, if 2 threads are vying for the first spot they depend on the OS scheduler

BenSeidel wrote:
ratchetfreak wrote:This is a tiny execution context and you don't need complex grouping algorithms to get the most out of the performance
You will always need relatively large execution contexts because if you don't you can't effectively reduce your cache-misses. The only way to ensure that some section of memory is all processed at once is to have it processed in the same thread (aka in the same execution context). Extremely small execution contexts ARE A REALLY BAD IDEA on the current CPU architecture. I repeat: A REALLY BAD IDEA. You need to have high memory cohesion in any application you write, multithreaded or not.

But you can group these tiny execution contexts much more easily than you can with grouping inserters that may access the same inventories. The only significant cache misses you'll get is when you are pushing a message to another entity's queue. Everything else can be grouped in memory per context to avoid false sharing.

BenSeidel · Post by **BenSeidel** » Thu Jan 19, 2017 1:55 pm

ratchetfreak wrote:you can check the mod forums or some mod code to divine the current mod architecture. Baseline is that mods can access any entity in a read/write way at any time during their execution. This puts a hard dependency between any 2 mods.

Why would I check that, I am not developing the code.

ratchetfreak wrote:you can check the mod forums or some mod code to divine the current mod architecture. Baseline is that mods can access any entity in a read/write way at any time during their execution. This puts a hard dependency between any 2 mods.

Again, FIFO has nothing to do with wall-clock time, only order they are inserted into the queue. There is no reason that two items cannot be inserted into a queue at the same time. I believe that your are assuming an absolute linearity to the list when it is not required, only that the condition that no elements are out-of-order based upon their insertion time.

ratchetfreak wrote:But you can group these tiny execution contexts much more easily than you can with grouping inserters that may access the same inventories. The only significant cache misses you'll get is when you are pushing a message to another entity's queue. Everything else can be grouped in memory per context to avoid false sharing.

What do you mean by "group execution contexts"? I don't understand how you can group them as putting any structure to the execution order of the contexts goes against everything the technique is designed to achieve. I am not pushing any messages between execution contexts as that requires instruction-level synchronisation, so I really am at a loss as to what you are trying to say. I you could clarify I could respond.

zx64 · Post by **zx64** » Thu Jan 19, 2017 3:00 pm

Any chance you could link some code samples showing the concepts you're talking about?

I'm curious where it fits into the concepts of data parallelism and task parallelism.

For example this:

A clock is simply some point at which you say "I am done, here is my result", except that everything running must say it at the same time. Clocks allow separate, asynchronous pieces of code the ability to synchronise. It's the time where one thread publishes its results of its previous task and takes the input it needs to do its next task. Clocks are the key to high-level asynchronous programming.

sounds a bit like .NET's Dataflow library, but also a bit like futures and promises.

You might also be interested in this 2009 blog post from the Arma developers about their initial work improving parallelism as part of making Arma 2.
Nothing ground breaking but it's a good summary of the trade-offs they faced when multithreading a simulation with significant data dependencies between stages.

ratchetfreak · Post by **ratchetfreak** » Thu Jan 19, 2017 3:55 pm

BenSeidel wrote:
ratchetfreak wrote:you can check the mod forums or some mod code to divine the current mod architecture. Baseline is that mods can access any entity in a read/write way at any time during their execution. This puts a hard dependency between any 2 mods.
Why would I check that, I am not developing the code.

if you are talking about how mods would fit into you parallel process then you need to know what mods can do.

BenSeidel wrote:
ratchetfreak wrote:But you can group these tiny execution contexts much more easily than you can with grouping inserters that may access the same inventories. The only significant cache misses you'll get is when you are pushing a message to another entity's queue. Everything else can be grouped in memory per context to avoid false sharing.
What do you mean by "group execution contexts"? I don't understand how you can group them as putting any structure to the execution order of the contexts goes against everything the technique is designed to achieve. I am not pushing any messages between execution contexts as that requires instruction-level synchronisation, so I really am at a loss as to what you are trying to say. I you could clarify I could respond.

what I mean is making little a bunch of std::array<Inserter,32> and each job then processes one of those at a time. Same with the other entity types. How they are grouped doesn't matter. They could all be stack inserters on the same cargo wagon or they could be scattered across the map and not access the same stuff and there would be little difference in performance.

On the other hand if you go with the pulling style then it does matter how the inserters are grouped. The stack inserters on the same cargo wagon should all be in the same group. As they'll be looking at each other often. Finding the optimal grouping in this case is far from trivial.

BenSeidel · Post by **BenSeidel** » Fri Jan 20, 2017 3:11 am

zx64 wrote:Any chance you could link some code samples showing the concepts you're talking about?

I have attached a quick & dirty example of both the library and use case. Took about an hour to write, so if I have missed any synchronisation parts in the library I am sorry. There are many things that could be improved. I tried to keep it as bare-bone as possible so that the mechanism is easy to see. If you have an idea that could be implemented in a short time let me know and I'll give it a crack. The library should be functional & reliable, if not fast or pretty. It also only allows for a single clock in a system when there is no reason why you cant have clocks under clocks or clocks next to clocks etc. The implementation of the library also should be tailored to the application it's being used in. Different implementations are pretty much as different as the difference between a memory-probe dictionary vs a list-collision dictionary.

zx64 wrote:I'm curious where it fits into the concepts of data parallelism and task parallelism.

I'd say it fits into task parallelism. Data parallelism is a fairly easy problem to solve, once you have some mathematical constraints on the operations you are performing (ie the Commutative, Associative & Distributive Laws). In my opinion, task parallelism is the far harder one to solve (from a synchronisation point of view). Saying that, the two categories are not mutually exclusive. You can view one as the other and still get the outcome you need, so this could possibly be applied to the sequential sections of the data parallelism algorithms, but I think it would be more of a square peg vs round hole scenario.

zx64 wrote:sounds a bit like .NET's Dataflow library, but also a bit like futures and promises.

Nope. Nothing to do with those.

Futures and promises are nasty (I have never had any good experiences with them). The whole idea of futures and promises is basically an extension to the observer model, except without the boilerplate. So it's like the difference between a Class in C++ and a function pointer in C. Yes you can do OOP in C, but it's just not as nice as doing it in C++. Anyway, my point is that the whole future/promises thing still suffers from all the issue you have in the observer model. The issues basically extend from the uncontrolled ordering of events and the mutability of the languages they are written in. Or to put it another way, it has about as much to do with synchronisation as the Class construct does.

The Dataflow part is based on an actor model, and it literally says that it passes messages in the article. I am not passing any messages. If I have misinterpreted what the data flow library is, then sorry, I have had no experience with it except the article you linked.

zx64 wrote:You might also be interested in this 2009 blog post from the Arma developers about their initial work improving parallelism as part of making Arma 2.
Nothing ground breaking but it's a good summary of the trade-offs they faced when multithreading a simulation with significant data dependencies between stages.

Thanks, I'll give it a read when I am able. While I am interested in that stuff, I'm of the opinion that the study of the tradeoffs is still in the academia phase. As there is basically no use of multithreading in the real world, any use will be an improvement. Not until fine-grained multithreading is the norm does the idea of a tradeoff become a real issue. (I know this is a glaring oversimplification, so there is no reason to point out a counter example)

ratchetfreak wrote:if you are talking about how mods would fit into you parallel process then you need to know what mods can do.

Nope, not at all. I can talk about my THEORETICAL mod system in my THEORETICAL implementation of a multithreaded mod system for Factorio. I believe you think that I am talking about ACTUAL implementations. Unless you are a developer with ACTUAL code, could we please keep this in the realm of the THEORETICAL? Otherwise you could start spouting off various issues with the synchronisation on some specific CPU that has some specific, obtuse bug in it that needs some specific work around that is too expensive to code. I am only here to try to show that there is a relatively simple (compared to current techniques) method of synchronising "micro-multithreaded" systems. (I still don't have a term to specify the difference between running some known number of tasks in parallel and some unknown number of tasks in parallel).

ratchetfreak wrote:what I mean is making little a bunch of std::array<Inserter,32> and each job then processes one of those at a time. Same with the other entity types. How they are grouped doesn't matter. They could all be stack inserters on the same cargo wagon or they could be scattered across the map and not access the same stuff and there would be little difference in performance.

It makes a massive difference what inserters are included in what execution contexts. If you process closely related entities at the same time you keep your cache hot. As I have said, you need to understand the way your memory is accessed and access it all together, preferably in a sequential manner. Your idea of just lumping them together in an array, especially when they are across the entire map, makes poor use of locality.

ratchetfreak wrote:On the other hand if you go with the pulling style then it does matter how the inserters are grouped

There is no "pulling style" as there is not "pushing style". There is only memory access and the clock to control its access. If you are still trying to talk about a message system then that would explain why I am not understand what you are saying. As I have said, I'm not going to talk about another method or if another method is better than this one as that is a subjective conversation. I wish to keep this thread about the presented idea so that when someone sees an opportunity to use it (because it fits their current issue, not because I am telling you that it is the only method) then they can. I am sure that there are sections of the Factorio code that a message system is better, but that is not the topic of the thread.

I apologise if I am coming across as hostile, your responses just don't seem constructive or about the idea I am trying to present.

Edit: Moved the main method to a reasonable location.

BenSeidel · Post by **BenSeidel** » Fri Jan 20, 2017 4:29 am

Here is an even better example of how "easy" it is to make things multithreaded.
I found the source for a Conway's game of life implemention at http://math.hws.edu/javanotes/source/chapter7
I have absolutely no idea if it's under copywrite.

Anyway, attached is a copy of the game source file, both single and multithreaded. It's using the same library as the one that just prints out pointless integers. The two versions are not exactly the same because I had to change the memory structures to better "fit" my OOP style. But I hope that the synchronisation mechanics become more obvious in this example. The change took about 20 minutes, but now you can run the game of life on as many cores as you have!

It's also worth noting that the BoardLine variables are used outside of the clock-controlled area. This is because this section is single-threaded, so there is no need for synchronisation.

zx64, Thanks for the push to write a demo, The game of life thing is really cool

ratchetfreak · Post by **ratchetfreak** » Fri Jan 20, 2017 11:14 am

BenSeidel wrote:
ratchetfreak wrote:if you are talking about how mods would fit into you parallel process then you need to know what mods can do.
Nope, not at all. I can talk about my THEORETICAL mod system in my THEORETICAL implementation of a multithreaded mod system for Factorio. I believe you think that I am talking about ACTUAL implementations. Unless you are a developer with ACTUAL code, could we please keep this in the realm of the THEORETICAL? Otherwise you could start spouting off various issues with the synchronisation on some specific CPU that has some specific, obtuse bug in it that needs some specific work around that is too expensive to code. I am only here to try to show that there is a relatively simple (compared to current techniques) method of synchronising "micro-multithreaded" systems. (I still don't have a term to specify the difference between running some known number of tasks in parallel and some unknown number of tasks in parallel).

regardless fitting mods into your system will require changing the current mod api into something that isn't arbitrary read/write of game state.

BenSeidel wrote:
ratchetfreak wrote:what I mean is making little a bunch of std::array<Inserter,32> and each job then processes one of those at a time. Same with the other entity types. How they are grouped doesn't matter. They could all be stack inserters on the same cargo wagon or they could be scattered across the map and not access the same stuff and there would be little difference in performance.
It makes a massive difference what inserters are included in what execution contexts. If you process closely related entities at the same time you keep your cache hot. As I have said, you need to understand the way your memory is accessed and access it all together, preferably in a sequential manner. Your idea of just lumping them together in an array, especially when they are across the entire map, makes poor use of locality.

Locality of data depends on how you organize your data. You seem to assume a per chunk list of entities so map locality results in data locality but I'm assuming a single unsorted (chunked) list. But because each inserter doesn't need data from the neighboring entities (in my communication model) the map locality doesn't matter for the data required to run each entity.

BenSeidel wrote:
ratchetfreak wrote:On the other hand if you go with the pulling style then it does matter how the inserters are grouped
There is no "pulling style" as there is not "pushing style". There is only memory access and the clock to control its access. If you are still trying to talk about a message system then that would explain why I am not understand what you are saying. As I have said, I'm not going to talk about another method or if another method is better than this one as that is a subjective conversation. I wish to keep this thread about the presented idea so that when someone sees an opportunity to use it (because it fits their current issue, not because I am telling you that it is the only method) then they can. I am sure that there are sections of the Factorio code that a message system is better, but that is not the topic of the thread.

I apologise if I am coming across as hostile, your responses just don't seem constructive or about the idea I am trying to present.

There are 2 styles of possible communication between entities in a highly parallel environment, one is the one you proposed which is each entity gets the previous tick's data from neighboring entities and runs the code that that entity also runs for itself to get the current tick's data, which may require data from its own neighbours, which may require running code that those run, which may require data from those neighbours, etc. This style requires careful limitation of what data actions can depend on, because otherwise each entity may end up needing to simulate the entire map.

While my message style has each entity only look at its own data and its own message queue and if the entity needs data or it want to perform and action that requires cooperation from another entity it sends a message to the other entity and waits on the result next tick. By default this results in much better data locality because the entities don't need to get data from their neighbours in the common case. Pushing a message to another entity's queue is a derefence to get the location of the queue and the queue counter of the receiving entity, an atomic increment and a copy.

mrvn · Post by **mrvn** » Fri Jan 20, 2017 11:58 am

justarandomgeek wrote:
BenSeidel wrote:
justarandomgeek wrote:This is Factorio, so really you ought to go a little further - make it a Warehouse (from the mod) with all 24 possible inserters and logbots going both ways (storage). And you've got a several thousand of them.

kinnom wrote:Or better, Add bobinserters and have 108 inserters
I think I covered this, in any case I am not writing psudocode/describing how you would handle 108 inserters when you should understand that a while-loop can do any number you wish.

Of course. But the way you described this earlier, you now have 108 inserters each all calculating what all 107 of the others would have to do, to be sure they can all do what they want to do. So you've taken those 108 inserters and done 11664 inserters worth of work for them. I don't care what chip you've got, this won't be faster than simple doing them in order on a single thread, as they are done now.

And that's just for one warehouse.

Try a different solution. Instead of each of the 108 inserters checking the warehouse if they can take some items go the other way and add more clock states.

Clock state 0: all inserters figure out how many items they can potentially grab (evaluate signals, set filters)
Clock state 1: the warehouse goes through it's list of inserters in a deterministic order offering each items if they can take it.
Clock state 2: all inserters update their state to accept the offered items
Clock state 3: data goes live, the GUI can draw the new state
repeat

Similar for inserters droping of items. In complex code you can easily end up with 100 clock states without any problems, as long as each state has lots and lots of little things to do on all cores (or is very quick anyway). What you don't want is a clock state where you have 10000 jobs taking no time and one jobs takiung 10s. Then all cores would have to wait 10s for that one job. You want to break down operations into micro steps that each takes no noticable time.

mrvn · Post by **mrvn** » Fri Jan 20, 2017 12:12 pm

Daid wrote:For the TL;DR version for people:
Making factorio multi threaded capable is easy, it just has to be rewritten in a way that 90% of the software engineers in the world are not trained in. But this happens to be my specialized field, so it is easy.

I've met you. Well, not you, but people like you. And I can only say one simple think to you. Get out into the world. You have no real software experience. You never worked with the 80% of the software engineers that are out there. You've only worked with the 10% in your specific field, which are really smart people. The rest of the people that actually build the software in this world are not as smart as you are. Feel good about that part, but remember that the rest of the world also needs software. You cannot build everything for the whole world with those 10% people.

Who am I? I'm just a simple software engineer with ~15 years of experience. Who has done everything from maintenance on horrible projects, to starting new products from scratch. I've worked with 1 man teams, to 10 man teams. People that work with me generally say that I'm smart. I'm not. I'm just practical. I know how to get from A to B in a decent way that most people will understand. I'm pretty decent a practical implementations of technology.
(My biggest visible achievement is that I did a revolution on 3D printing software as a 1 man team a few years ago. Which was a big contribution to our company growing from 15 people to 250 in 3 years)

And that is the problem we face today: We can't do that. We didn't do it the last 15 years despite having known for 10 that we should have done it.

You don't have to be particularly smart and multithreading is not that hard. What it is is different. You have to rethink concepts that are so ingrained in the game industry that it is near impossible to go against. Luckily the time of faster and faster cores is over and having multi cores is standard. The core count keeps climbing. Some years back I had to buy an expensive dual socket board to get SMP. Nowadays my laptop has 8 cores. And leaving 7 cores idle in a game where (at least some of) the competition is using them becomes a game changer.

Here is another example of things the game industry is still doing wrong: Player input is checked ones per frame, the reaction sets in the next frame and lasts for a multiple of frames.

Now with a fast FPS you can step halt a tile forward by hitting a key lightly. With a slow FPS you have to keep the key pressed and then the player jumps forward two tiles before you can lift your finger. Or not at all if you don't press long enough. That problem was solved over a decade ago by using timed events. Check the input very often (or have the kernel do that) and record the time of each change. Then in between frames (if you can't do it multithreaded) you can check the events queue and handle the action. You will know the forward was pressed for 0.1s so the player should move this many pixels foward. The actual FPS does not matter. The game will be as reactive in 5fps as in 60fps. You might not see it but you will notice that you can still take small steps.

hoho · Post by **hoho** » Fri Jan 20, 2017 3:32 pm

mrvn wrote:Try a different solution. Instead of each of the 108 inserters checking the warehouse if they can take some items go the other way and add more clock states.

Clock state 0: all inserters figure out how many items they can potentially grab (evaluate signals, set filters)
Clock state 1: the warehouse goes through it's list of inserters in a deterministic order offering each items if they can take it.
Clock state 2: all inserters update their state to accept the offered items
Clock state 3: data goes live, the GUI can draw the new state
repeat

So, in other words, you iterate over the inserters three times? That'll be a lot of memory/cache bandwidh used with awful computations-per-byte ratio.

hoho · Post by **hoho** » Fri Jan 20, 2017 3:45 pm

mrvn wrote:The core count keeps climbing. Some years back I had to buy an expensive dual socket board to get SMP. Nowadays my laptop has 8 cores.

I got my first "dualcore" in around 2005 in form of hyperthreaded Pentium4. Next was quadcore q6600 about 2 years later. The next CPU I got was less than a year ago in form of i5 6600, also with 4 cores. Sure, I could have spent about 30% more on my CPU to get the "hyperthreading" one with 8 logical cores but it woudln't have been anywhere near twice as fast as the one without HT. Even in ideal situations the difference would have topped out at around 10-15%.

VAST majority of mainstream CPUs that are sold are still at 4 cores or less. Of the CPUs that are used in gaming machines, less than 2% have more than four cores. http://store.steampowered.com/hwsurvey/cpus/

In servers, sure, you can get beasts with over 20 "cores" per socket if you want at the cost of drastically lower clock speed. Thanks to Amdahl's law, that's not exactly practical thing to do for vast majority of people that aren't running "embarrasingly parallel" stuff.

It's a kind of a chicken vs egg problem. People don't get multicore because software doesn't use it. People don't make multicore software because it's not cost-efficient for most things. Pragmatically, most of the stuff runs in single-tread "good enough" and as it's vastly easier to think how to code like that, it's much cheaper to find programmers to code that stuff.

ratchetfreak · Post by **ratchetfreak** » Fri Jan 20, 2017 5:02 pm

mrvn wrote:Try a different solution. Instead of each of the 108 inserters checking the warehouse if they can take some items go the other way and add more clock states.

Clock state 0: all inserters figure out how many items they can potentially grab (evaluate signals, set filters)
Clock state 1: the warehouse goes through it's list of inserters in a deterministic order offering each items if they can take it.
Clock state 2: all inserters update their state to accept the offered items
Clock state 3: data goes live, the GUI can draw the new state
repeat

Similar for inserters droping of items. In complex code you can easily end up with 100 clock states without any problems, as long as each state has lots and lots of little things to do on all cores (or is very quick anyway). What you don't want is a clock state where you have 10000 jobs taking no time and one jobs takiung 10s. Then all cores would have to wait 10s for that one job. You want to break down operations into micro steps that each takes no noticable time.

100 clock states requires 100 syncs, synchronization isn't free. Each time you sync you need to wait on the slowest thread. There is also a non-zero time between the last thread entering the end of tick and all threads being awake at the start of the next tick.

dee- · Post by **dee-** » Fri Jan 20, 2017 5:07 pm

hoho wrote:
mrvn wrote:The core count keeps climbing. Some years back I had to buy an expensive dual socket board to get SMP. Nowadays my laptop has 8 cores.
I got my first "dualcore" in around 2005 in form of hyperthreaded Pentium4. Next was quadcore q6600 about 2 years later. The next CPU I got was less than a year ago in form of i5 6600, also with 4 cores. Sure, I could have spent about 30% more on my CPU to get the "hyperthreading" one with 8 logical cores but it woudln't have been anywhere near twice as fast as the one without HT. Even in ideal situations the difference would have topped out at around 10-15%.

VAST majority of mainstream CPUs that are sold are still at 4 cores or less. Of the CPUs that are used in gaming machines, less than 2% have more than four cores. http://store.steampowered.com/hwsurvey/cpus/

In servers, sure, you can get beasts with over 20 "cores" per socket if you want at the cost of drastically lower clock speed. Thanks to Amdahl's law, that's not exactly practical thing to do for vast majority of people that aren't running "embarrasingly parallel" stuff.

It's a kind of a chicken vs egg problem. People don't get multicore because software doesn't use it. People don't make multicore software because it's not cost-efficient for most things. Pragmatically, most of the stuff runs in single-tread "good enough" and as it's vastly easier to think how to code like that, it's much cheaper to find programmers to code that stuff.

OT: I've got my true 6-core Phenom II X6 1090T around 2012, that's 5 years ago, for 145,00 EUR including 19% tax. If people buy the wrong brand (crastated CPUs) for the wrong price, they've got only themselves to blame.

Back on-topic: the dual-buffering concept sounds really nice and the Game Of Life Example was a pretty good fit because both GoL and Factorio have to be deterministic. Factorio OTOH has a more "wide" area, e.g. power network calculation, but that can be another parallelized step after the "physical" calculation of items, belts and inserters, just like logistics and combinators.

torne · Post by **torne** » Fri Jan 20, 2017 5:54 pm

Game of Life is a really poor example to compare to Factorio, though, because Game of Life is literally the optimal case for double buffered state: each cell of the new state depends on only data from the old state, and so it's trivially parallelisable using this method to any number of threads with no synchronisation at all required other than waiting for all the threads to be done at the end of the frame. Various people have already pointed out in this thread that this is not at all the case for Factorio, and that in many cases the actions of a particular entity in a particular tick *do* depend on the actions of other entities in the same tick (the most trivial one being two inserters grabbing from the same source), and not just the state of the system from the previous tick.

BenSeidel · Post by **BenSeidel** » Sat Jan 21, 2017 12:54 am

hoho wrote: Thanks to Amdahl's law, that's not exactly practical thing to do for vast majority of people that aren't running "embarrasingly parallel" stuff.

Amdhal's law is like the theory of relativity, yes it says there is an upper bound to the speed a process may be run, one that no amount of computing power or coding can increase, but it does not yet have any use except the theoretical. People don't realise that the P could be close to .99995, if not higher. So bringing up Amdahl's law is like bringing up the theory of relativity at a drag race and saying it's the reason why the speeds of the dragsters haven't been increasing.

dee- wrote:Factorio OTOH has a more "wide" area, e.g. power network calculation, but that can be another parallelized step after the "physical" calculation of items, belts and inserters, just like logistics and combinators.

Exactly. While the entire game update may not be doable within one parallel step, many of the sequential steps can be made parallel.

ratchetfreak wrote:100 clock states requires 100 syncs, synchronization isn't free. Each time you sync you need to wait on the slowest thread. There is also a non-zero time between the last thread entering the end of tick and all threads being awake at the start of the next tick.

Yes 100 clock states require 100 synchronisations. This is still better than each shared variable having a synchronisation. Even if you compare it to having a lock at the entity level you can easily see that even 1000 clock states is absolutely nothing in terms of the synchronisation. Additionally, pausing and then re-enabling a thread is not expensive. While it does have a non-zero time (and seriously, saying that something has a non-zero time is saying that it exists. WTF good does it do?) it's not like putting it to sleep for 10ms. It's practically instantaneous. Unless there is another thread waiting to be run on the core there is no reason for you to pay an OS context swap as the OS can literally turn off the core. If someone knows that OS's can do this then please let me know, but (again in THEORY) it could be done.

torne wrote:Game of Life is a really poor example to compare to Factorio, though, because Game of Life is literally the optimal case for double buffered state

Do you understand what an example is for?

torne wrote:Various people have already pointed out in this thread that this is not at all the case for Factorio

The only ones that have pointed it out are the ones either
1) Saying that the developers have said it's hard
2) The ones that believe there is a serial aspect to the entity update that can't be worked around.
3) The ones that haven't even looked at the idea because they believe that another method is always superior.

Even Klonan said that "Multithreading isn't impossible".

torne wrote:, and that in many cases the actions of a particular entity in a particular tick *do* depend on the actions of other entities in the same tick (the most trivial one being two inserters grabbing from the same source), and not just the state of the system from the previous tick.

The only thing the action of an inserter is based upon is the state of the game at the end of the previous tick. If you don't believe this is the case then save the game and re-load it over and over and over and see if you get different outcomes from your inserters, because loading a game literally SETS THE STATE FOR THE PREVIOUS FRAME. This notion of "can't beat the serial" is the distilled essence as to why games and applications are still single-threaded. Programmers have had no experience programming things in parallel and believe that it can't be done.

BenSeidel · Post by **BenSeidel** » Sat Jan 21, 2017 1:35 am

So I think I have said just about all I could say on this technique for managing the synchronisation of data between parallel processes. I keep getting bogged down with arguments about how hard multithreading is and how multithreadable Factorio is or is not. As the goal was to put forward an idea that makes multithreading simple, and I feel I have done that, I can't see any reason to continue the discussion. Saying that though, if anyone has an question about the actual technique or how they should be thinking to make the most of the ever increasing cores then don't hesitate to ask.

Thanks.

NotABiter · Post by **NotABiter** » Sat Jan 21, 2017 4:07 am

BenSeidel wrote:I have spent the last 6 years researching and writing a dynamic application whose structure is not known at compile time.

6 years of researching and yet...

But anyway, Clocks (and clock boundaries) are an extremely powerful technique that I have never seen applied in software.

You might want to try to learn how to use google.

Java has a class dedicated to "clocks" (cyclical barriers):
https://docs.oracle.com/javase/7/docs/a ... rrier.html

Make that two classes:
https://docs.oracle.com/javase/7/docs/a ... haser.html
(paper with benchmarks here: http://www.cs.rice.edu/%7Evs3/PDF/SPSS08-phasers.pdf)

And later in this thread you provide a Java program demonstrating "your" technique... but don't even bother to use either of the above two built-in Java classes designed specifically for this purpose? Oversights like that suggest that you really don't have any idea what many other programmers are or are not using in their software.

Not until mainstream games require at least 16+ cores will you see Intel making them standard.

I'm not sure I can wrap my head around your level of delusion. Please reread what you wrote and see if you now understand why it makes absolutely zero sense.
No software that requires N cores can possibly be "mainstream" until after N cores are "standard" (i.e. most mainstream gamers have >= N cores in their box). You're essentially asking for causality to be violated. If you want to push the envelope by requiring a certain number of cores, it has to be with niche software, not mainstream software.
Luckily, requiring N cores is not, um, required. You can push the envelope simply by having games (and other popular applications) that don't require N cores to run, but offer significant advantages to those players who do have N cores. For example, a role playing game might run on standard hardware and support X NPCs over Z square miles of land. But if the player has four times as much computing power and memory, maybe that same game can support 4*X NPCs over 4*Z square miles of land (supporting more kingdoms, races, quests,etc.), giving the player a much bigger world to play in. (In theory a "properly parallelized" Factorio could do the same, but frankly I don't find making Factorio mega-factories for the sake of making mega-factories to be compelling. IMO the Factorio devs should stop wasting any of their limited time making stuff go faster and instead fix their completely boring/broken gameplay, but I have no false hope of that actually happening. I.e., vanilla Factorio is a decent sandbox, and even a pretty good sandbox with mods, but it's a really poor game.)

They follow where software developers lead.

More nonsense, and especially obnoxious nonsense in a thread about multi-core systems. Intel didn't start making multi-core systems because "software developers lead", they did it despite the fact that software developers were (by and large) not even ready for it. And in the consumer space they've gone from (mostly) 2 core systems to mostly 4 core systems without software really "leading the way" or even catching up. And when they start pumping out 8 core consumer systems en masse, software will still likely be lagging. (And how does it make sense for you to contradict the whole point of your own thread? Oh, that's right, because your entire argument is self-contradictory.)
Software developers, hardware developers, users - it's all an ecosystem. Anyone in that ecosystem can try to lead (even users by expressing demand loudly enough), but that means investing in the effort, and people don't invest in things unless they believe that investment will pay off. In terms of how many cores CPUs have, Intel is a bit of a "reluctant leader" - they didn't really want to lead this way, but they needed to make performance numbers keep growing and single-core was out of gas.

If you think that it can't be done or that software is just "too complex" then I implore you to learn VHDL and see what chip designers have been doing since the the 1980's.

With today's tools, writing good (efficient) parallel code in anything other than trivial (little/no fine-grain interaction) cases does tend to be "too complex". And I rebuff your imploring by noting that I've known VHDL for many years (and used it both on and off the job). Note that the fact that you can make something run in parallel does not mean you have created good (highly optimized) code. And lack of optimization (e.g. just copying all of memory, only handling the minimum "unit of work" size rather than special-casing center vs edge and tiling for best cache use) is contributing significantly to the simplicity of your specific take on this general principle (barriers/phases + per-phase isolation). Optimized code of this nature really requires language/tool support to be feasible at all for most "mainstream" programmers, or to be considered practical for complex projects even for those of us who don't technically need such support (which is one of the reasons I happen to be working on my own language+compiler that supports semi-automatic parallelization of code of this sort). "Unoptimized" (simple) code might be good enough in some cases (maybe even optimal), but unacceptable or not worth it in others (potentially even worse than not parallelizing at all, especially if some users are on low-end hardware).

They are micro-parallelisms, allowing sections of your code to run in parallel where possible, because simply put, software developers are too lazy to do it themselves. Who remembers the days of having to put no-ops after a branch because the CPUs didn't do it.

Programmers have had no experience programming things in parallel and hence fear it.

Right. It couldn't possibly be that we make rational decisions based on actual knowledge (including first-hand experience), based on the difficulty and other properties of the specific problem domain we're working on, based on the opportunity-cost of not doing other things, and based on what customers are willing to accept (pay money for). Programmers are all just irrational lazy cowards. Very convincing argument.
And, of course, nevermind that modern CPUs provide instruction-level parallelism not because programmers are too lazy but because there is no other way in existing CPUs for programmers to efficiently express/implement such micro-parallelism. And nevermind that the need for no-ops after branches was dropped not because programmers were lazy but because there was often nothing useful you could put there (thus resulting in poorer instruction-cache, fetch, and decode performance than an auto-inserted stall) and because of its very limited usefulness (fixed single slot) which meant high performance CPUs had to do dynamic instruction scheduling anyways as branch mispredict costs greatly exceeded the fraction of a cycle a single branch slot could fill. You're just rewriting history in a way that doesn't reflect reality.

To me it's all wasted real estate.

Tragedy of commons. (Are you offering to rework all of the code that benefits from that hardware so it's no longer needed? Didn't think so.)
Anyways, if you want to see a much better direction that processors might go, check out some of the Mill CPU talks. That thing is insanely better than x86/ARM/MIPS/etc. Much better performance, much better ILP, with less "overhead" hardware (it gets better ILP than an x86 using dynamic instruction scheduling, and does so using static instruction scheduling, so it needs no dynamic instruction scheduling hardware - how they do that is explained in the videos), and they claim to need far less energy/operation. (All claims except the last are "obviously true" just from the information in their jaw-dropping videos. The energy efficiency claim will be proven one way or the other once they're running on real silicon. These dudes have some serious brains and are making Intel/AMD/ARM all look kind of stupid by comparison.)
Do you know what the best part is? With their top end being able to dispatch 32 "instructions" (which they call "operations", but they are similar in work-amount to instructions on "normal" CPUs) every single cycle per core, they will make sticking with single-threaded applications viable even longer into the future.

hoho · Post by **hoho** » Sat Jan 21, 2017 7:29 am

NotABiter wrote:Note that the fact that you can make something run in parallel does not mean you have created good (highly optimized) code. And lack of optimization (e.g. just copying all of memory, only handling the minimum "unit of work" size rather than special-casing center vs edge and tiling for best cache use) is contributing significantly to the simplicity of your specific take on this general principle (barriers/phases + per-phase isolation).

Good point.

I'm almost certain that the working set size of a large base in Factorio is greater than cache size of most CPUs. That means you'll have to stream it through the CPU every tick and do as many calculations as possible on each piece of that so that you won't have to access it more in same tick.

CPU computing powers are FAR greater than memory bandwidth. IIRC, one has to do a couple of hundred of computations per byte to max out CPU usage *and* memory bandwidth. If one splits up that working set in a way that would require accessing same parts several times per tick, it'll mean less computations per byte streamed through the CPU. Add in context switches eating up cache/register bandwidth and things will only get worse.

NotABiter wrote:Do you know what the best part is? With their top end being able to dispatch 32 "instructions" (which they call "operations", but they are similar in work-amount to instructions on "normal" CPUs) every single cycle per core, they will make sticking with single-threaded applications viable even longer into the future.

While this sounds awesome, I don't think it's really possible in regular apps to have *that* much possible parallelism in code. Hell, even with our x86 CPUs having 4 instructions dispatched per cycle, they more often than not won't be using as many.

I've not seen the video yet but it sounds vaguely similar to what Intel promised with Itanium - compiler-side optimizations instead of hardware-side. Main problems with Itanium were that it effectively meant you need different executables for different CPU generations and different type of writing your code to be able to use their larger instruction dispatch rate effectively.

[edit]

From their Wiki page:
"Ivan Godard claims that a sufficiently-wide mill can find and use an ILP of six or more on common programs."

So, yeah. Hypothetical 32 instructions in parallel, practical "up to 6". Also, part of how they improve their performance is speculative execution of branches. In other words, they run both (or more in case of switch?) branches in parallel to not get stalls. In reality that means effectively double the power usage of executing the code that "normal" CPU would not run speculatively. Sure, it will be better performance but at the cost of signficantly higher power usage.

Also, from what I remember, those parts of the circuit dealing with HW-level optimizations and scheduling in modern CPUs are relatively power efficient. Vast majority of power is still spent in execution units and just moving data to and from the CPU.

Have they actually created any CPUs using that architecture or is it just simulated and in FPGAs? I'd love to see how it works in real-life. Way too many times have CPU manufacturers promised magical performance increases that almost never play out in reality.

Factorio Forums

Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications