Parallel processing in games & applications

ratchetfreak
Filter Inserter
Posts: 950
Joined: Sat May 23, 2015 12:10 pm

Post by ratchetfreak »

hoho wrote:
NotABiter wrote:And that Mill also has vector processing?
I assumed as much, but I'm still waiting to hear how they have solved their register layout. From what you said here it seems like its registers are all effectively general purpose, and SIMD instructions just take a bunch of those registers as one chunk and operate on them.
The belt is really just a front for a bunch of renamed registers. It's very likely that the data doesn't need to be put into a register bank but can instead stay inside the output latches of the execution units and get fed to the input latches of the next execution unit that needs it.

Decoding an instruction will involve looking at previous instructions (or updating a cache as things get decoded) to find where belt position n is at the moment, and then setting what they call a "crossbar" (effectively an n-to-n multiplexer with decent parallel capacity) to move the data around. This may also include inserting moves to the general register bank for long-lived values.
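A toy model may help picture the belt semantics being described (purely illustrative; the names and the fixed-length FIFO are my own simplification, not the Mill's actual hardware): every result is dropped onto the front of a fixed-length belt, operands are named by their current belt position, and the oldest value falls off the end unless it was explicitly spilled.

```rust
use std::collections::VecDeque;

// Toy model of belt semantics: results go on the front, operands are
// named by belt position (0 = newest), the oldest value falls off.
struct Belt {
    slots: VecDeque<i64>,
    len: usize,
}

impl Belt {
    fn new(len: usize) -> Self {
        Belt { slots: VecDeque::with_capacity(len), len }
    }

    // Read the value at belt position `pos` (0 = newest).
    fn get(&self, pos: usize) -> i64 {
        self.slots[pos]
    }

    // Drop a new result on the front; the oldest value falls off unused.
    // The "moves to the general register bank" mentioned above would be
    // inserted here for values that must outlive the belt.
    fn drop_result(&mut self, value: i64) {
        if self.slots.len() == self.len {
            self.slots.pop_back();
        }
        self.slots.push_front(value);
    }
}

fn main() {
    let mut belt = Belt::new(4);
    belt.drop_result(2); // b0 = 2
    belt.drop_result(3); // b0 = 3, b1 = 2
    // "add b0, b1" names its operands by position; the result goes on front.
    let sum = belt.get(0) + belt.get(1);
    belt.drop_result(sum);
    println!("b0 = {}", belt.get(0)); // b0 = 5
}
```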

hoho
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am

Post by hoho »

mrvn wrote:You aren't running over the same data at the same time
I explicitly said same data at *different* times.
mrvn wrote:That can't be avoided.
You can if you don't want to force double-buffering of everything :)
Devs have already talked about their solution a little bit that avoids doing that by calculating a few chunks in a single go and having special logic at the edges where they may influence stuff in other chunks. That way, you trade better data locality for more complex parallelism.
mrvn wrote:There is no magic solution in computer science. You have to try and see and design things to work best.
That I definitely agree with.

mrvn
Smart Inserter
Posts: 5646
Joined: Mon Sep 05, 2016 9:10 am

Post by mrvn »

hoho wrote:
mrvn wrote:That can't be avoided.
You can if you don't want to force double-buffering of everything :)
Devs have already talked about their solution a little bit that avoids doing that by calculating a few chunks in a single go and having special logic at the edges where they may influence stuff in other chunks. That way, you trade better data locality for more complex parallelism.
That's the other way to go (or a combination of the two). By splitting the world into blocks and doing a block at a time you improve locality. But that wasn't the method suggested at the start of the thread.

The drawback of splitting the world into blocks to compute in parallel is that the overlap needed grows with the size of the block and the reach of units (long inserters reach farther). Luckily, for a 2D world like Factorio's, the overlap grows linearly with the block size while the area (and therefore the potential number of inserters/assemblers/furnaces/...) grows quadratically. So the larger the block, the smaller the relative overlap. On the other hand, you want a fair number of blocks so that blocks with much work and blocks with little work balance out better.
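That scaling argument is easy to check with back-of-the-envelope numbers (block sizes and the reach value below are made up for illustration): for a square block of side n and an interaction reach r, the cells that can touch a neighbouring block form a border band of width r, which grows linearly in n while the block area grows quadratically.

```rust
// Fraction of a block's cells that lie within reach `r` of its edge,
// i.e. the cells whose update may touch a neighbouring block.
fn overlap_fraction(n: u64, r: u64) -> f64 {
    let area = n * n;
    let interior = n.saturating_sub(2 * r).pow(2); // cells out of reach of any edge
    let border = area - interior;
    border as f64 / area as f64
}

fn main() {
    // With reach 2, the share of edge cells shrinks as blocks grow.
    for n in [8u64, 32, 128, 512] {
        println!("n = {:4}: {:5.1}% of cells near an edge", n, 100.0 * overlap_fraction(n, 2));
    }
}
```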

BenSeidel
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am

Post by BenSeidel »

hoho wrote:
mrvn wrote: You aren't running over the same data at the same time
I explicitly said same data at *different* times.
mrvn wrote: That can't be avoided.
You can if you don't want to force double-buffering of everything
The number of times you run over the data has nothing to do with double-buffered value synchronisation; it has to do with the update algorithm you have chosen. It's perfectly feasible for the devs to keep the current chunk-aligned update algorithm, but instead of using an explicit lock on each shared variable they can use a buffer-backed variable, getting the same result with no instruction-level synchronisation cost. "Each inserter gets processed individually" is a BAD ALGORITHM; that has little to do with the synchronisation. You could write the same bad algorithm using semaphores or any other of your favourite locking mechanisms.

I believe the whole idea has been missed by a lot of people. The method has no bearing on your algorithm. If you have multithreaded code that uses instruction-level locks, you can use a double buffer instead and pay the cost of the additional cache line(s) rather than the cost of the instruction-level synchronisation. Theoretically, the cost of the extra cache is less than the cost of instruction synchronisation when done correctly and under the right workload.

The thing to note is that this is a GENERAL PURPOSE method of synchronising variable values across multiple threads. There will be cases where it is slower than some other synchronisation method and cases where it is faster; that is what makes it general purpose. That said, it is generally better in systems that have high memory coherence.

Also, locking in this style should be easier than using other forms of locks because it's possible to get the compiler to check whether you are writing to a variable outside of its lock (in multithreaded code). AFAIK this is not possible with instruction-level locks because there is no way to encode that kind of ad-hoc information into the type system (namely, when it is safe to write without the lock). It's also easier because it comes with a mental model that is extremely similar to the OOP model (the OOP model restricts access points; the double-buffer system restricts update points). Debugging is much easier as well because you can unit test sections of your code: you know that from clock A to clock B some action will occur. No ifs, buts or maybes. There is no issue with the ordering of the updates, or the thread that an update occurs on. Yes, if you circumvent the model you will get issues, but that is the same in any model.
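A minimal single-threaded sketch of the double-buffered update being described (the struct and names are mine, not from Factorio's code or any real library): all reads during a tick see the front buffer, all writes go to the back buffer, and the two are swapped at the clock edge. Because nothing reads what the current tick writes, the per-cell updates could be farmed out to threads without any instruction-level locks.

```rust
struct DoubleBuffered {
    front: Vec<i32>, // state everyone reads this tick
    back: Vec<i32>,  // state being written for the next tick
}

impl DoubleBuffered {
    fn new(init: Vec<i32>) -> Self {
        let back = init.clone();
        DoubleBuffered { front: init, back }
    }

    // One tick: each cell becomes the sum of itself and its left neighbour.
    // Every read goes to `front`, so update order (or thread) is irrelevant.
    fn tick(&mut self) {
        for i in 0..self.front.len() {
            let left = if i == 0 { 0 } else { self.front[i - 1] };
            self.back[i] = self.front[i] + left;
        }
        std::mem::swap(&mut self.front, &mut self.back); // the "clock edge"
    }
}

fn main() {
    let mut state = DoubleBuffered::new(vec![1, 1, 1, 1]);
    state.tick();
    println!("{:?}", state.front); // [1, 2, 2, 2]
}
```

The price is exactly the one named above: a second copy of the state (extra cache lines) in exchange for zero locking instructions on the hot path.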

mrvn wrote:That's the other way to go (or a combination of the two). By splitting the world into blocks and doing a block at a time you improve locality. But that wasn't the method suggested at the start of the thread.
Again people, the update algorithms discussed are purely THEORETICAL and are not the way I would ACTUALLY code Factorio. This thread is about a method to synchronise variables across threads in a safe and predictable way. NOT about the way you would update an inserter arm picking up from a chest.

There are practically unlimited boundaries you could choose for splitting Factorio up into small, multithreadable chunks. Discussing the merits of each of them would not be constructive. The examples given (including the game of life example) are for illustration purposes ONLY.


Now for the off topic stuff.
@mrvn,
The Mill architecture, while interesting, isn't going to be as ground-breaking as it should be. Most of the performance gains I have seen them talk about are optimisations that any chip designer is able to make once they throw out the 32-bit constraints. The caching mechanisms, paging mechanisms, IO etc. are nothing new. The belt mechanism is an interesting model, but ratchetfreak said it best with "the belt is really just a front for a bunch of renamed registers". All the real power gains they get are from offloading the out-of-order processing that current x86 chips do. Admittedly the 33 ops per cycle is impressive, but it's the theoretical maximum for specific workloads. I would love to see a high-performance chip that only uses 40-ish watts, but I doubt you will see any widespread use of it without x86 support. I will be watching to see what comes in the next few years.

Post by mrvn »

BenSeidel wrote: Also, locking in this style should be easier than using other forms of locks because it's possible to get the compiler to check if you are writing to a variable outside of it's lock (in multithreaded code). AFAIK this is not possible with the instruction level locks because there is no way to encode that type of adhoc information into the type system (when is it safe to write without the lock that is). It's also easier because there is a mental model associated with it that is extremely similar to the OOP model (The OOP model restricts access points, the double-buffer system restricts update points). In this way debugging is much easier as well because you can unit test sections of your code. You know that from clock A to clock B some action will occur. No ifs, buts or maybes. There is no issue with the ordering of the updates, or the thread that the update occurs on. Yes if you circumvent the model you will get issues, but that is the same in any model.
Actually that is not true. Linear type systems https://en.wikipedia.org/wiki/Substructural_type_system can track the state of variables and make those kinds of observations at compile time. Have a look at Rust https://www.rust-lang.org/en-US/ if you want these kinds of assurances from your compiler.

Post by BenSeidel »

mrvn wrote:Actually that is not true. Linear type systems https://en.wikipedia.org/wiki/Substructural_type_system can track the state of variables and make those kinds of observations at compile time. Have a look at Rust https://www.rust-lang.org/en-US/ if you want these kinds of assurances from your compiler.
Neither of them stops race conditions or deadlocks, or prevents the user from running code they expect to be atomic but that turns out not to be. concurrentlist.remove(concurrentlist.indexof(item)) will not produce the expected outcome in Rust or with any type system described in your article.
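The compound call above is the classic check-then-act shape, and a sketch makes the failure mode concrete (the function names and the Mutex-wrapped Vec are mine, purely for illustration, not any real concurrent-list API): even if the lookup and the removal are each individually synchronised, another thread can mutate the list between the two calls, so the index is stale by the time the removal runs. Holding one lock across the whole compound operation is what actually makes it atomic.

```rust
use std::sync::{Arc, Mutex};

// Racy composition: each call locks, but nothing protects the gap
// between them, so the index can be stale when `remove` runs.
fn remove_item_racy(list: &Arc<Mutex<Vec<i32>>>, item: i32) {
    let index = list.lock().unwrap().iter().position(|&x| x == item);
    // ...window here: another thread may insert/remove and shift indices...
    if let Some(i) = index {
        let mut guard = list.lock().unwrap();
        if i < guard.len() {
            guard.remove(i); // may now remove the wrong element
        }
    }
}

// Atomic composition: one lock spans the check and the act.
fn remove_item_atomic(list: &Arc<Mutex<Vec<i32>>>, item: i32) {
    let mut guard = list.lock().unwrap();
    if let Some(i) = guard.iter().position(|&x| x == item) {
        guard.remove(i);
    }
}

fn main() {
    let list = Arc::new(Mutex::new(vec![10, 20, 30]));
    remove_item_atomic(&list, 20);
    remove_item_racy(&list, 10); // fine single-threaded; unsafe under contention
    println!("{:?}", list.lock().unwrap()); // [30]
}
```

Note that Rust happily compiles both versions: the type system rules out data races on the Vec itself, but it cannot see that the two lock acquisitions were meant to be one critical section.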

When you allow the programmer to use instruction level synchronisation you introduce a form of ad-hoc information into the codebase. It's essentially the same as casting. There is no way to encode this in the type system as any part of the code can run the synchronising code in any order, including any instruction interleaves across the threads.

You need to be able to ensure that there is absolutely no overlapping of state update concerns across the running processes because that is the source of all your concurrency issues. It's impossible to encode this into your type system when you use instruction level synchronisation.

In fact I have just realised that they are, by definition, opposites. When you encode locking information into your type system you are removing the locking mechanism from the lines of your code and placing it into your types, therefore they must be mutually exclusive.

The technique I have put forward does not rely on instruction-level synchronisation and therefore can be encoded into a type system, with the added advantage of being a mental model that people can understand.

Post by ratchetfreak »

BenSeidel wrote:
mrvn wrote:That's the other way to go (or a combination of the two). By splitting the world into blocks and doing a block at a time you improve locality. But that wasn't the method suggested at the start of the thread.
Again people, the update algorithms discussed are purely THEORETICAL and are not the way I would ACTUALLY code Factorio. This thread is about a method to synchronise variables across threads in a safe and predictable way. NOT about the way you would update an inserter arm picking up from a chest.

There are practically unlimited boundaries you could choose for splitting Factorio up into small, multithreadable chunks. Discussing the merits of each of them would not be constructive. The examples given (including the game of life example) are for illustration purposes ONLY.
But those details people are hammering on are exactly the details that will make or break the parallel setup. Too-small chunks and the caches don't get fully used; too-large chunks and you run the danger of a slow thread blocking everything else; a bad distribution of chunks carries the same danger. Communication overhead (there will always be some) can easily kill performance.

From the devs' comments, the biggest bottleneck is currently data locality; until that is fixed I very much doubt that any multithreading improvement will make much difference.
BenSeidel wrote: Now for the off topic stuff.
@mrvn,
The Mill architecture, while interesting, isn't going to be as ground-breaking as it should be. Most of the performance gains I have seen them talk about are optimisations that any chip designer is able to make once they throw out the 32-bit constraints. The caching mechanisms, paging mechanisms, IO etc. are nothing new. The belt mechanism is an interesting model, but ratchetfreak said it best with "the belt is really just a front for a bunch of renamed registers". All the real power gains they get are from offloading the out-of-order processing that current x86 chips do. Admittedly the 33 ops per cycle is impressive, but it's the theoretical maximum for specific workloads. I would love to see a high-performance chip that only uses 40-ish watts, but I doubt you will see any widespread use of it without x86 support. I will be watching to see what comes in the next few years.
What the belt really gains is in the dependency analysis. Even without the static scheduling of execution units, you only need to look ahead a fixed upper bound of instructions to find out whether the result of an operation is still needed (or needs to be saved) or whether it will fall off the belt unused.

Post by hoho »

ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on the compiler side? If so, what stops it from being used on regular x86?

Post by ratchetfreak »

hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
The instruction decoder needs the dependencies to know when to stall for results and when it can throw away a result because its logical register has been overwritten. But because a register can stay unreferenced for a very long time, the default action is to save it in the register bank unless the register is overwritten within the instruction window.

hansinator
Fast Inserter
Posts: 160
Joined: Sat Sep 10, 2016 10:42 pm

Post by hansinator »

hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
On the compiler level the limiting factor is the programming language. Dependencies are not clear if you support side effects (e.g. global variables). A proper dependency analysis (one that covers 100% of the code) can only be done for purely functional languages. But it should be possible to build a dependency graph for entity interactions in Factorio. Is it worth the effort? Probably not.

Post by ratchetfreak »

hansinator wrote:
hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
On compiler-level it is the programming language. Dependencies are not clear if you support side-effects (e.g. global variables). A proper dependency analysis (one that covers 100% of the code) can only be done for purely functional languages. But it should be possible to do a dependency graph for entity interactions in Factorio. Is it worth the effort? Probably not.
I was talking about a different level of dependencies, namely the dependency between a machine instruction's result and the point where it is next used.

In a register machine that analysis is required to adjust for instructions that take more than one cycle and to be able to do ILP.

In a belt machine a similar analysis is still needed to know which operation each belt position refers to.

Post by hansinator »

Yes, I understand. But that level isn't compiler-related, and hoho asked about the compiler. Or am I missing something?

Post by BenSeidel »

hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
There is absolutely nothing preventing this from occurring, as this is what the hardware out-of-order unit does. There is no reason why current x86 code could not be run through a static dependency-analysis tool to produce Mill instructions. The issues with global state have already been dealt with in current x86 systems, as they do global-state renaming as well (the global state is just a register).

Post by hoho »

BenSeidel wrote:
hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
There is absolutely nothing preventing this from occurring as this is what the hardware out-of-order unit does.
That's done in hardware in x86, correct?
If so, then considering that the Mill lacks such complex HW OoO functionality, it has to rely on the compiler to do the analysis for it. So, again, why can't the same analysis be done for x86 to further improve how efficiently code runs there?
BenSeidel wrote:The issues about global states have already been dealt with in the current x86 systems as they do the global state renaming as well (The global state is just a register).
Yes, the rename registers in x86 seem to fill pretty much the same role as the belt FIFO registers.

Post by ratchetfreak »

hoho wrote:
BenSeidel wrote:
hoho wrote:
ratchetfreak wrote:What the belt really gains is in the dependency analysis.
Isn't that analysis done on compiler side? If so, what stops from using it on regular x86?
There is absolutely nothing preventing this from occurring as this is what the hardware out-of-order unit does.
That's done in hardware in x86, correct?
If so then considering Mill lacks such complex HW OoO functionality, that has to rely on compiler to do the analysis for it. So, again, why can't the same analysis be done on x86 to further improve the efficiency of how well the code runs there?
BenSeidel wrote:The issues about global states have already been dealt with in the current x86 systems as they do the global state renaming as well (The global state is just a register).
Yes, the rename registers in x86 seem to fill pretty much the same role as the belt FIFO registers.
Compilers will try to separate dependent instructions to aid OoO; however, that's not always possible (the Mill instruction set will suffer from the same problem, and the compiler will need to insert nop slides as needed, mimicking the stalls of the x86), and it requires that the compiler know the latency of the instructions, which depends on a lot of factors.

It's also perfectly possible to have much of the same OoO functionality in the Mill, but the static scheduling in the ISA (results are only available after a fixed number of cycles) is designed to minimize the need for it. The static latency of an instruction will very likely be a lower bound on the actual latency of the execution units. For example, a load can take anywhere from 3 cycles to 200 depending on the cache. But that doesn't mean the Mill will stall exactly 3 cycles after the load; instead it will go ahead until an instruction actually uses the result and only then initiate a pipeline stall, while it can still execute unrelated instructions (until the instruction window is exhausted, after which it needs to stall anyway).

Post by mrvn »

BenSeidel wrote:
mrvn wrote:Actually that is not true. Linear type systems https://en.wikipedia.org/wiki/Substructural_type_system can track the state of variables and make those kinds of observations at compile time. Have a look at Rust https://www.rust-lang.org/en-US/ if you want these kinds of assurances from your compiler.
Neither of them stops race conditions or deadlocks, or prevents the user from running code they expect to be atomic but that turns out not to be. concurrentlist.remove(concurrentlist.indexof(item)) will not produce the expected outcome in Rust or with any type system described in your article.
1) indexof on a list is already a MAJOR fail. You never ever do that or your performance is dead.

2) I believe in Rust this would work just fine. You can only call .remove and .indexof on the concurrentlist if you own it, which means no other thread can be accessing it. So it does what you want.

You can't share the concurrentlist between threads without explicitly making provisions for it (which I assume don't exist for the above), and then .indexof would have to work differently and lock the list for the lifetime of the index, or something. So even if you made it shared it would still work, just slowly.
BenSeidel wrote:When you allow the programmer to use instruction level synchronisation you introduce a form of ad-hoc information into the codebase. It's essentially the same as casting. There is no way to encode this in the type system as any part of the code can run the synchronising code in any order, including any instruction interleaves across the threads.

You need to be able to ensure that there is absolutely no overlapping of state update concerns across the running processes because that is the source of all your concurrency issues. It's impossible to encode this into your type system when you use instruction level synchronisation.

In fact I have just realised that they are, by definition, opposites. When you encode locking information into your type system you are removing the locking mechanism from the lines of your code and placing it into your types, therefore they must be mutually exclusive.

The technique I have put forward does not rely on instruction level synchronisation therefore is possible to encode into a type system as well as having the added advantage of being a mental model that people can understand.
The point of having it in the type system is that the compiler can verify you locked everything correctly, or even lock things automatically as needed and only as needed. The type system in Rust lets the compiler know where locking is needed, where you are using unlocked data structures in places you should be locking, and so on. It also ensures that things get unlocked after use.

Sure, it isn't a magic bullet, but it does a hell of a lot of checks that you are missing in, for example, C.
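A concrete version of that point (a standard-library sketch, not anything Factorio-specific): in Rust the data lives *inside* the Mutex type, so the only access path is through `lock()`, sharing a plain `&mut` across threads is a compile error rather than a runtime race, and the guard's destructor guarantees the unlock.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Several threads increment one counter; the type system forces every
// access through the lock, and dropping the guard unlocks automatically.
fn parallel_count(threads: usize, per_thread: u32) -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..per_thread {
                // Forgetting `lock()` here would not race: it would not compile.
                *counter.lock().unwrap() += 1;
            } // guard dropped each iteration => unlock is automatic
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("total = {}", parallel_count(4, 1000)); // total = 4000
}
```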

Post by mrvn »

Oh, and don't forget there is another multithreading "solution" out there:

Minecraft's "I don't care" solution. If you simply do not care about the odd glitch here and there, you can avoid all the locking. But then you get players using odd contraptions to duplicate items, because two threads will both take the same item if you craft things just right, and you end up with two identical items. You can also do the reverse and make one item out of two, but somehow that is less useful. :)
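That duplication glitch is the same check-then-act race as before: two threads both observe "one item in the slot" and both take it. A sketch of the fix (illustrative only; the function name is mine) is a compare-exchange loop, where the decrement only succeeds if the slot still holds the value the thread saw:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Racy version (don't do this): `if stock > 0 { stock -= 1 }` lets two
// threads pass the check before either decrements => item duplication.
// The compare-exchange loop below closes that window.
fn try_take(stock: &AtomicI64) -> bool {
    loop {
        let seen = stock.load(Ordering::SeqCst);
        if seen == 0 {
            return false; // nothing left to take
        }
        // Succeeds only if no other thread took an item in between.
        if stock
            .compare_exchange(seen, seen - 1, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
        {
            return true;
        }
        // Someone beat us to it; re-read and retry.
    }
}

fn main() {
    let stock = AtomicI64::new(1);
    let first = try_take(&stock);
    let second = try_take(&stock);
    println!("first = {}, second = {}", first, second); // first = true, second = false
}
```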

ssilk
Global Moderator
Posts: 12888
Joined: Tue Apr 16, 2013 10:35 pm

Post by ssilk »

TL;DR
The most interesting thing about multi-threading etc. is making even bigger worlds. But instead of targeting multi-threading, this can also be achieved by concentrating more on interconnecting game worlds (servers), because, if done cleverly, that is similar: the size of a world can be doubled by connecting a second server to the first.



This thread was a great read. :)

When I read this thread now, I think multiprocessing is something which runs into a limit one way or another. It will make things too complicated, and the argument about RAM speed is something I also saw some time ago. So for me it is nothing that should have too much effort put into it; instead, I think we (uhm, the devs :) ) should concentrate on connecting worlds together. What I mean I wrote a long time ago in this post: viewtopic.php?f=5&t=8670&start=10#p69213
At that time I thought that instead of multiprocessing in threads it might be possible to multiprocess in OS processes, so that the game doesn't need to run 100% in parallel.
Now, after reading this, I think that is also no real solution, because of the RAM bottleneck.
And what I also think is this: nice to have, but does it bring more players to spend money on Factorio? I think not so much. Technically it's super interesting. From the gameplay side, doubling the speed from 10 UPS to 20 UPS is great, but really not THAT important.

So how to make the "dream" I sketched in the linked thread come true? What, I thought, if we have multiple computers? I mean, we already have servers that run Factorio 24/7, and the community tries to connect those worlds together.
So this is the vision: a grid of "worlds" (World A, World B, C, D, etc...) connected by "exchange points".

What is an exchange point? Think of a bridge, for example:
The player/train etc. enters the bridge (which has an exchange point in World A), thereby leaves World A, and after some "transfer time" appears at the connected exchange point in World B.


From the point of view of World A, World B is nothing but a "player": some source of events. For World B, World A is the same: just a source of events like "a new player appears, now take over control from this IP".

There is no difference whether an event comes from a player or from another connected world. In both cases the events don't need to be deterministic, and just as for a player, the events for a world don't need to be real-time.
Let's take that a bit further: a player builds a belt in World B to the bridge, and over the bridge into World A. Every entity on this belt is then also nothing but an event for World A. The worst that can happen is that, if the connection breaks, the stream of items stops in World A. Nothing important.
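The "world as an event source" idea can be sketched with a channel standing in for the bridge (the event type and names are mine, purely for illustration): World B pushes items onto the bridge, World A drains whatever has arrived whenever it likes, and if the connection stalls the only symptom is that the stream of items stops.

```rust
use std::sync::mpsc;
use std::thread;

// The only thing that crosses the bridge is an event.
#[derive(Debug, PartialEq)]
enum Event {
    ItemArrived(&'static str),
}

// World B runs as its own thread/process and sends `items` events over
// the bridge; World A later drains them without any lockstep with B.
fn run_bridge(items: usize) -> usize {
    let (bridge_tx, bridge_rx) = mpsc::channel();

    let world_b = thread::spawn(move || {
        for _ in 0..items {
            bridge_tx.send(Event::ItemArrived("iron-plate")).unwrap();
        }
        // Sender dropped here: the "connection" closes, the stream stops.
    });
    world_b.join().unwrap();

    // World A: non-blocking drain of whatever has arrived this tick.
    bridge_rx.try_iter().count()
}

fn main() {
    println!("World A received {} items", run_bridge(3));
}
```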

So with this "trick" the size of any Factorio world can be doubled without losing UPS. And it offers such a great number of gameplay variants that I cannot count them.

[Some of the simplest, for example: three worlds, A, B and C. A and B are forced to deliver some amount of item X to C. Whoever finishes first wins.
Two worlds: Positive and Negative. Positive needs to send items to Negative, Negative to Positive. Both worlds are connected by just a belt. Negative items move backwards on belts, and if a negative item touches a positive one it will be destroyed. And now the players in Positive need to bring some items to Negative and vice versa. (My picture of this is that it will look similar to Harry Potter fighting Lord Voldemort :)).]

orzelek
Smart Inserter
Posts: 3911
Joined: Fri Apr 03, 2015 10:20 am

Post by orzelek »

One possible solution to allow for overall bigger factories would be making them multi-surface. Factorissimo already does that on a small scale.
It would allow each surface to run on a separate thread (it should also help with the memory bottleneck: each core can have its own cache, and the memory controller can handle more requests, judging by the stats given in this thread).
Then we would need some nice in-game abstraction for this :)

Something like caves or varying levels of terrain, and then entities that share power and transport items/fluids between them.

Quite plausible would be making the terrain semi-3D: when you go up to a higher level, you switch surface. With the ability to dig through lower levels and build ramps up/down, plus pillars for the levels above, this would make the game both 3D and nicely scalable for multithreading. Seems like a win-win :D

Post by hoho »

The biggest problem with connected worlds is how to ensure they all stay synchronized.

As it stands, each player simulates the actions of all other players. If you have several different worlds running in a grid, how would you verify that they're all still in the same overall state? If one server slows down for whatever reason and its tick rate drops below 60, would the others also have to drop lower? What if items get sent from the faster server to the slower one, and due to the lower tick rate it can't accept them all?
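One conventional answer to the "faster server outruns the slower one" question is backpressure: make the exchange point a bounded buffer, and once it is full the sending world must either wait or keep the items on its own side (e.g. the belt simply backs up at the bridge). A bounded channel makes this concrete (illustrative sketch only, not a proposal for how Factorio would do it):

```rust
use std::sync::mpsc;

fn main() {
    // The "bridge" buffer holds at most 2 items.
    let (tx, rx) = mpsc::sync_channel::<u32>(2);

    assert!(tx.try_send(1).is_ok());
    assert!(tx.try_send(2).is_ok());
    // The slow world hasn't drained anything: the buffer is full, so the
    // fast world gets an error instead of silently losing the item.
    assert!(tx.try_send(3).is_err());

    // Once the slow world catches up, sending works again.
    assert_eq!(rx.recv().unwrap(), 1);
    assert!(tx.try_send(3).is_ok());
    println!("backpressure demo ok");
}
```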

The idea is nice, but it would require a metric ton of work, and considering that perhaps at best 0.01% of the player base would get any use out of it, I'm not sure it's reasonable to put much effort into it :)
