Parallel processing in games & applications

Post all other topics which do not belong to any other category.
User avatar
Deadly-Bagel
Smart Inserter
Smart Inserter
Posts: 1498
Joined: Wed Jul 13, 2016 10:12 am
Contact:

Re: Parallel processing in games & applications

Post by Deadly-Bagel »

BenSeidel wrote:It is as simple as I am making it out, at least in terms of the technique. You seem to be confusing the effort required for a complete rewrite with the effort to learn and use the technique for new code. At no point have I ever said that all the code in Factorio should be rewritten, far from it. They are running a business, they need to make money, and for that they need to get the game finished. All I have said is that IF Factorio were to be rewritten to be truly multithreaded, then it's not as hard as everyone thinks it is. It's still hard, but not climb-Mt-Everest hard, more learning-to-read hard.
I disagree with your analogy, Factorio is a massive game with a lot of interactions, rewriting it probably would be "climb Mt Everest" hard (as much as a physical task can be compared to an intellectual task). Not only would much of the game logic basically have to be written from scratch, but then they need to go through testing and bugfixing all that, then redo multiplayer (again). When I say it can't happen in Factorio, I'm speaking of feasibility. Even if cost weren't an issue this update would take an absurd amount of time. Just the multiplayer update took, what, four months to develop and test? This would be delaying release by at least a year. The player base would go mental.
BenSeidel wrote:The changes to the main loop are minimal.
Remember that Factorio doesn't use a standard main loop, and from the limited information provided about it we can't really know how that would affect this suggestion.

And anyway, we don't know exactly what the problem areas are. As has been said, you could probably optimise 90% of the code a hundredfold and see zero performance increase, because the remaining 10% is where all the processing is getting held up, and we don't know what that 10% is. It may very well be simple for them to multithread inserters or whatever, but pointless, as those threads would only end up waiting for other threads.

Maybe it can be done, maybe it can't, I don't know, but I would avoid coming out with statements like "Factorio can be multithreaded easily" because you don't know either. You have educated theories and ideas, but without seeing and understanding the Factorio code itself you can't say for certain. Regardless of your achievements I highly doubt you've written something so complex, intricate and still efficient enough to run 99% of factories smoothly on relatively low-end gear, which I mean only as credit to the Factorio developers.
Money might be the root of all evil, but ignorance is the heart.

hoho
Filter Inserter
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am
Contact:

Re: Parallel processing in games & applications

Post by hoho »

Forgot one thing:
BenSeidel wrote:
hoho wrote:The final merge of the results from different threads also can't be parallelised.
That is based upon the only currently available "stable" style of multithreading. "I have a list, I want to add it all together", or "I have a list, I want to transform it in some way". If you do that then yes, you have to have one thread merging the results. However, this is not what I am suggesting. I am suggesting that the application be broken down into areas, each with a clock-controlled memory bound.
Problem with that is that Factorio multiplayer is built around sending the server only the actions of a player, and the server simulates all of those on its own side. That means at each tick, the game has to finish computing all those "areas" before it can begin the calculations for the next tick. In other words, the game can't begin calculating the next tick before everything is complete in the current one.

ratchetfreak
Filter Inserter
Filter Inserter
Posts: 950
Joined: Sat May 23, 2015 12:10 pm
Contact:

Re: Parallel processing in games & applications

Post by ratchetfreak »

The method of having each entity check for possibly interacting entities and then doing the logic will spiral out of control pretty quickly unless you put limits on interaction. For example, a provider chest would need to check for players, inserters and bots, and each of those would also need to check the chest, the inserters, the player and the bots to see if their "pull item out" interaction succeeded.

Bots are especially problematic because a decent factory can have 10k in the air at any one time, with only a handful actually wanting to interact with a given chest. So unless you have a way of getting the bots destined for the chest in O(1) time, you end up scanning 10k bots every time an inserter tries to pick up an item from a logistics chest.
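One way to get that O(1) lookup is a reverse index from each chest to the bots currently targeting it, updated when bots are dispatched or retasked. A minimal sketch of the idea; all names (`BotIndex`, `assign`, and so on) are illustrative, not Factorio's actual structures:

```python
# Sketch: index bots by their target chest so an inserter never has to scan
# the whole list of airborne bots. Names are illustrative only.
from collections import defaultdict

class BotIndex:
    def __init__(self):
        self._by_target = defaultdict(set)  # chest_id -> set of bot_ids

    def assign(self, bot_id, chest_id):
        # called when a bot is dispatched toward a chest
        self._by_target[chest_id].add(bot_id)

    def unassign(self, bot_id, chest_id):
        # called when the bot arrives or its order is cancelled
        self._by_target[chest_id].discard(bot_id)

    def bots_for(self, chest_id):
        # O(1) lookup instead of scanning every bot in the air
        return self._by_target[chest_id]

index = BotIndex()
index.assign("bot_1", "chest_A")
index.assign("bot_2", "chest_B")
index.assign("bot_3", "chest_A")
```

With something like this, an inserter (or the chest itself) only ever touches the handful of bots bound for that chest, however many thousands are airborne elsewhere.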

A simpler interaction model is to use messages: each entity can send messages to other entities, and at the start of each frame it sorts its incoming messages deterministically and then processes them.

The inserter case would work out to be

Code: Select all

inserter->chest1: request_pick_up(filter)
chest1->inserter: items_picked_up(item)
(inserter waits until rotation complete)
inserter->chest2: request_drop_off(item)
chest2->inserter: items_received(item) -- may be only a partial stack, after which the inserter would send request_drop_off(item) again until the hand is empty
the code for a chest would be

Code: Select all

sort_queue(queue)
while(!queue.empty() && queue.peek().type == drop_off){
     msg = (drop_off_msg)queue.pop();
     inserted, remaining = inventory.insert(msg.item);
     if(inserted) msg.sender.send(drop_off_ack(inserted))
     else if(msg.block) this.send(msg) -- self-message so the inserter doesn't have to keep re-sending the drop_off message while the chest is full
                                       -- could also be implemented with an internal drop_off queue that takes priority over the main queue, to avoid starvation
}
while(!queue.empty() && queue.peek().type == pick_up){
     msg = (pick_up_msg)queue.pop();
     picked = inventory.remove(msg.filter);
     if(picked) msg.sender.send(pick_up_ack(picked))
     else if(msg.block) this.send(msg)
}
if(circuit_connection) circuit_connection.send(circuit_msg(inventory.contents));
This has a few downsides though:
  • Number one is that interactions now have a minimum delay before they are complete (2 ticks minimum for a request/ack pair).
  • Second, some entities will need to keep a bit more state to record which messages they are expecting and which have been sent out.
  • Third, the interactions aren't as readable as a simple "item = chest.extract_item(filter)", though with coroutines you may be able to get something nice out of it.
  • Related to that, the mod API would need to be redesigned from scratch. Currently it's a monolithic design where each mod gets invoked one by one and gets full read/write access to the game world during its time chunk. This you cannot multithread. Having a mod API that is more amenable to multithreading would be a very nice first step in general.
What I mean with coroutines is that the inserter code could, for example, be:

Code: Select all

while(true){
    pickup = get_entity_front()
    if(!pickup) pickup = yield_until_entity_front();
    pickup.send(pick_up_msg(filter, true))
    ack = yield_until_msg(pick_up_ack)
    hand = ack.item

    yield_until_rotation_finished()

    drop = get_entity_back()
    if(!drop) drop = yield_until_entity_back();
    while(hand){
        drop.send(drop_off_msg(hand, true))
        drop_ack = yield_until_msg(drop_off_ack)
        hand = drop_ack.remaining
    }
}

BenSeidel
Filter Inserter
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am
Contact:

Re: Parallel processing in games & applications

Post by BenSeidel »

@Deadly-Bagel
I have no response to that except what I have already said. I can't reply to "You haven't seen the code, so how do you know your technique works for them?". All I am saying is that unless they process things in a purely random order, and rely on the fact that it's in this uncontrollable random order, it should be easy enough to convert ONE aspect of the game to use multiple threads without adversely affecting the others. Unless you have some concrete evidence or a specific "how do you handle this case" question, this "should"/"could" game is not helping the discussion.
hoho wrote:Problem with that is that Factorio multiplayer is built around sending the server only the actions of a player, and the server simulates all of those on its own side. That means at each tick, the game has to finish computing all those "areas" before it can begin the calculations for the next tick. In other words, the game can't begin calculating the next tick before everything is complete in the current one.
The multiplayer aspect should not have any part in the update of an inserter. The rewrite is a client-server scenario: a player's input is sent to the server, and the server then processes it essentially as if it were a local player's action. The processed action is then sent back to the clients, who update the game state appropriately. In fact, I would not touch the input processing at all. Why would you? The game would spend less than 0.0000000000000001% of its time processing user input.
The suggestion is that all the actions in a frame get processed entirely before the next frame is started. It's just a question of how many threads are able to update all the inserters. At the moment every inserter is processed on the one thread. I am offering a technique that removes the multithreading headaches caused by "inserter A must be processed before inserter B".
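The core of the technique described in this thread is double-buffered state with a clock-controlled flip: every entity reads only last tick's values and writes only next tick's, so the order in which updates run within a tick cannot matter. A minimal sketch of that idea, with illustrative names:

```python
# Minimal sketch of the clock-controlled double-buffer idea from this thread.
# Each entity reads only last tick's state and writes only next tick's state,
# so the order updates run in within a tick cannot change the result.

class Buffered:
    def __init__(self, value):
        self.read = value   # the value every thread sees this tick
        self.write = value  # the value being built for next tick

    def flip(self):
        # the "clock tick": the pending write becomes the visible value
        self.read = self.write

def update_inserter(angle: Buffered, speed: float):
    # reads only the tick-stable value, writes only the output buffer
    angle.write = angle.read + speed

a = Buffered(0.0)
b = Buffered(10.0)
# these two updates could run on any threads, in any order:
update_inserter(a, 1.5)
update_inserter(b, 1.5)
for buf in (a, b):  # end of tick: flip all buffers at once
    buf.flip()
```

Because nothing reads `write` and nothing writes `read` during the tick, no locks are needed on the entity state itself; the only synchronisation point is the barrier before the flip.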

hoho wrote:
BenSeidel wrote:
hoho wrote:@OP, you didn't address the need of perfect determinism in your response to criticism of your original post.
Hmm... because I don't see that there is any indeterminism. Your question about the two inserters removing things from the same chest was addressed in my first post, so I'm not sure what you are asking.
It did talk a little bit about it but I was specifically interested in the scenario I described:
fewer items in the chest than inserters try to pick out. To make things more fun, make it a provider chest and have bots try to take things from it as well.

Or, another alternative: instead of a chest, have a transport belt with one item running on it but several inserters trying to take it off. Also add those inserters and the belt to a wire network that connects to machines in half the factory.


I'd be interested to hear the logic you'd propose for grouping up such vaguely connected things to be dealt with by a single thread instead of breaking them up over several threads.
So, to address the case where there are fewer items in the chest than are being picked up by two inserters (btw, this is essentially the same as a pipe trying to fill the orders of two connected chemical plants): I will choose the most fine-grained solution, as it gives the most detail, but there are other levels of "units of work", such as the "merged" context I have talked about.

So, let's take it from the point where two inserter arms are swinging round to the chest. I will say that the two inserter arms arrive at exactly the same frame, and that they both try to pick up 5 iron plates when there are only 3 iron plates in the chest.

So, frame 1: both inserters get processed by an individual thread. On each independent thread they:
read the current inserter arm location, calculate the new one and write it to the "output" buffered variable;
calculate whether the next frame would put them over the chest (if you want it instantly), OR check whether their current position is over the chest.
Either way, if the check succeeds, "attach" the inserter to a list of inserters that are eligible to pick up an item from the chest (in the next frame, not this one). This can be done using many methods, including a lockable list, as collisions should be extremely rare. If a lockable list is chosen, then the current timestamp or some other mechanism must be used to ensure that any subsequent executions will always give the same result. You could double-buffer the list, or the list values, or use many other solutions that would work; if this is the part that is unclear, let me know and I can go into more detail on the list. In fact, this list may only need to be modified during the user-input processing phase, as that (up till now) is the only way an inserter's input and output locations can be swapped. In this way the chest should always know all the inserters that could possibly be picking up from it, so every entity with a link to the chest can do any calculation required for its next state.
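One workable version of that "lockable list with deterministic ordering" is to let threads append in whatever order they happen to run, but sort by a stable key (arrival tick, then entity ID) before anyone reads it. A sketch under those assumptions; the names are illustrative:

```python
# Sketch of a lockable "requesting inserters" list whose observed order is
# deterministic no matter which thread happened to attach first.
import threading

class RequestList:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def attach(self, entity_id, arrival_tick):
        # threads may call this in any interleaving; the lock only protects
        # the append itself, not the ordering
        with self._lock:
            self._pending.append((arrival_tick, entity_id))

    def snapshot(self):
        # sort by (arrival tick, entity id): the racy insertion order never
        # influences what readers see
        return sorted(self._pending)

reqs = RequestList()
# simulate two threads racing to attach in arbitrary order
t1 = threading.Thread(target=reqs.attach, args=("inserter_B", 7))
t2 = threading.Thread(target=reqs.attach, args=("inserter_A", 7))
t1.start(); t2.start(); t1.join(); t2.join()
```

Whichever thread wins the race to append first, `snapshot()` returns the same order, which is what keeps the simulation deterministic (and multiplayer in sync).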

We'll say that the check succeeded and the inserters were added, ready for frame 2.

In yet another independent process/thread, the chest checks if any inserters are trying to remove an item. Seeing none, it does nothing. It should see none regardless of whether the inserters have yet run their code. If they have, and you have a synchronised list (not a double-buffered one), then it should be able to identify that the inserters are there for the next frame and not the current one.

The clock ticks and we start processing frame 2. What was previously written to the output buffers is now "flipped" to become the values in the entities.

The inserters, ready to pick up an item, both check (again in separate threads) to see who gets to pick up an item out of the chest. They can do this because the "requesting list" is clock-safe (discussed above). Each inserter figures out how many items it can take. In this case one is able to collect 3 while the other sits and does nothing. The decision as to which inserter gets to pick up first can be based on some ID, such as an EntityID, or possibly some physical location like "top left goes first". It could also be a FIFO system based on how long an inserter has been waiting (just note its arrival time in the list), or any other DETERMINISTIC method. You CANNOT do "I processed it first, so I get the items".

So, as all 3 entities (chest and two inserters) have access to the same clock-controlled information, and as they all run the same code to determine who takes what, they can all update their state based on the changes, independent of the ordering or timing of the other executions.
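The "everyone runs the same code" step can be made concrete as a pure function over the clock-stable state: the chest and both inserters each call it independently and, because its only inputs are last tick's values and a deterministic ordering, they all compute the identical answer. A sketch with illustrative names:

```python
def allocate(chest_count, requests):
    # Pure, deterministic allocation: given last tick's chest contents and the
    # requesting inserters (entity_id -> items wanted), decide what each one
    # receives. Any thread (chest or inserter) calling this gets the same result.
    grants = {}
    remaining = chest_count
    for entity_id in sorted(requests):  # deterministic priority: lowest ID first
        take = min(requests[entity_id], remaining)
        grants[entity_id] = take
        remaining -= take
    return grants, remaining

# 3 plates in the chest, two inserters each want 5:
grants, left = allocate(3, {"inserter_1": 5, "inserter_2": 5})
```

Running this once per interested entity is redundant work, but it buys the freedom to run the three updates on any threads, in any order, with no communication between them during the tick.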

When it comes round to frames 3, 4, etc., the same code is run; at some point the timeout for the first inserter arm is reached, and the inserter removes itself from the "requesting list" and starts its swinging code.

If you wish to include robots, then the frame before a robot arrives, attach it to the list of requesting entities in the same manner as the inserter arm. If at any point the chest goes to zero, during the update phase of the robot you are able to check the item count in the chest and cancel the order if need be, or do whatever logic you wish to do with the robots.

I hope I have addressed your question. If I have not let me know what I have missed and I will attempt further explanations.
hoho wrote:Or, another alternative: instead of a chest, have a transport belt with one item running on it but several inserters trying to take it off. Also add those inserters and the belt to a wire network that connects to machines in half the factory.
The circuit network is especially easy, as you can see the "value becomes available in the next frame" behaviour with an addition combinator attached to itself. As the output of a combinator has no bearing on its behaviour during its frame, but only on the input values of the calculation done on the next frame, you should be able to see reasonably easily that you could process each network independently of any other processes in the factory. Values read from the belt are the contents of the belt during that frame, not the next, so you can just read the state of the belt and add it to the network signal for the next frame. In fact you can read the state of any entity during that frame without worrying whether that entity has been updated. The double buffer ensures it.
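That self-wired addition combinator makes a tidy test case: since the output computed in frame N is only readable as input in frame N+1, wiring it to its own input simply yields a counter. A sketch of the frame rule, not Factorio's actual combinator code:

```python
# Sketch of the "output becomes next frame's input" rule, using an addition
# combinator wired to its own input. The value visible on the network each
# frame was computed entirely from the previous frame's signals.

def combinator_step(signal_in, constant):
    return signal_in + constant  # readable by others only next frame

visible = []  # the network value seen during each frame
signal = 0
for _frame in range(4):
    visible.append(signal)
    signal = combinator_step(signal, 1)
```

Because each frame's visible value depends only on the previous frame, the whole network can be updated on its own thread without looking at what any other entity did this frame.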

The inserters being enabled and disabled is essentially the same: read the current value of the circuit network (the value for that frame) and decide if that inserter is able to pick up an item. As all threads should see exactly the same "current value", they will all come to the same conclusion.
greep wrote:FYI, you really should read that link smarty posted, it's pretty obvious they've considered everything you've said:
I did, both when it came out and when it was linked.
I don't dispute that what you quoted is the feeling they have towards multithreaded code. It's the feeling that the entire industry has. What amazes me is that you missed the part where they say
The main problem when dealing with parallelism is access to shared resources, eg. when two different threads write to the same piece of memory. This is either solved by locks, which is not really an option on the entity level, or by strategy that prevents the threads to write to the same piece of memory.
So here I am trying to put forward a strategy (that as far as I know is not well known) that does exactly what they are looking for: preventing threads from writing to the same memory location. It's also a strategy that should not need significant changes to their update code. I think that they may even find that their compiler will type-check the places that need to be updated to be clock-controlled. And by significant I mean "major change in logic" not "I have to add .get() to all these errors?".

hoho
Filter Inserter
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am
Contact:

Re: Parallel processing in games & applications

Post by hoho »

ratchetfreak wrote:For example a provider chest would need to check for players, inserters and bots, each of those would also need to check the chest, the inserters, the player and the bots to see if their "pull item out" interaction succeeded.
Don't forget mods being able to insert stuff randomly as well ;)
ratchetfreak wrote:A simpler interaction model is to use messages, each entity can send messages to other entities, at the start of each frame it sorts the incoming messages deterministically and then processes them.
Considering how there can easily be hundreds of thousands of entities (bots + inserters) in a megabase, I don't think message queues are such a good idea when you have to generate a LOT of them every tick.

hoho
Filter Inserter
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am
Contact:

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:I am offering a technique that removes the multithreading headaches that are caused by "inserter A must be processed before inserter B".
You claim that the ordering wouldn't pose an issue, but instead of explaining exactly how, so far you've essentially said some sort of magic is required to determine what order is used :)

Though, I see in this comment you did go into more detail.

Also, sorry if I come off somewhat incoherent. I commented as I was reading it.
BenSeidel wrote:Calculate if the next frame would put them over the chest (if you want it instantly) OR check whether their current position is over the chest.
Wait, does that mean every time an inserter is processed, a check is made to see if it's over the chest in the *next* tick? That doesn't sound reasonable to me, as it'd mean that if it takes, say, 30 ticks to perform one movement, you'll have 30 extra checks. Instead, that check should only be done when the inserter's movement is complete.
BenSeidel wrote:Either way, if the check succeeds, "attach" the inserter to a list of inserters for the chest that are eligible to pick up an item (in the next frame, not this one).
Ah, now I think I understand why you wanted to check whether the next tick would put the inserter over the chest, since with my proposed method you'd have an extra tick of delay between the inserter stopping over the chest and it picking things up.
BenSeidel wrote:If a lockable list is chosen, then the current timestamp or some other mechanism must be used to ensure that any subsequent executions will always give the same result.
Using timestamps for ensuring determinism seems horribly fragile, especially in multithreaded environments. Using current tick number isn't much better. I'd say something like entity ID might be more useful, assuming factorio has something like that for each individual entity in game. If not, x/y position might also work.
BenSeidel wrote:In this way the chest should always know all the inserters that can possibly be picking up from it, so every entity with a link to the chest can do any calculation they require for their next state.
Does that mean every time an entity tries to interact with something, it has to check that list to see if other entities have already done their part?

BenSeidel wrote:The decision as to which inserter gets to pick it up can be based on some ID, such as an EntityID or possibly some physical location like "Top left goes first". It could also be a fifo system based on how long an inserter has been waiting (just note its arrival time in the list) or any other of DETERMINISTIC methods. You CANNOT do "I processed it first, so I get the items".
Let's assume there are two inserters i1 and i2 with i1 having the higher priority and both running in their separate threads.

How would i2 know if i1 has already processed the items in the chest or not?

It's the same scenario - 3 plates in chest and both try to take out 5. We're in the second "tick" and both inserters are trying to pull things out.

BenSeidel
Filter Inserter
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am
Contact:

Re: Parallel processing in games & applications

Post by BenSeidel »

@ratchetfreak
I only saw your post after I tried to submit mine, and as I had a child to put to bed I couldn't modify my answer at the time. Suffice to say that there may or may not be merit to your suggestion. As I don't want to get into a discussion about which method is best, but instead focus on why this method should be used and how it works, I won't be commenting on it here.
hoho wrote:Wait, does that mean every time an inserter is processed, a check is made to see if it's over the chest in the *next* tick? That doesn't sound reasonable to me, as it'd mean that if it takes, say, 30 ticks to perform one movement, you'll have 30 extra checks. Instead, that check should only be done when the inserter's movement is complete.
Agreed, it's an extra 30 checks. This could be optimised out, though. I only suggested it to remove the 1-tick delay between the inserter getting over the chest and it becoming eligible to be handed an item, as you rightly pointed out in the next part of your post (I'm replying as I read).
hoho wrote:Using timestamps for ensuring determinism seems horribly fragile, especially in multithreaded environments. Using current tick number isn't much better. I'd say something like entity ID might be more useful, assuming factorio has something like that for each individual entity in game. If not, x/y position might also work.
The tick# is a timestamp. It should not be any more fragile than any other method of keeping order, because it is never altered in the multithreaded area: that area is all about the update within the tick, which makes the tick# a constant there. It is also extremely fast to access, being in user space, and it is synchronised in multiplayer as well.
hoho wrote:Does that mean every time an entity tries to interact with something, it has to check that list to see if other entities have already done their part?
No, it's not about "seeing if it's done its part"; it's about not caring, because you have all the information required to know what the other entities will do. Essentially you have to check the list to see what everything else will do that tick.
hoho wrote:Let's assume there are two inserters i1 and i2 with i1 having the higher priority and both running in their separate threads.

How would i2 know if i1 has already processed the items in the chest or not?

It's the same scenario - 3 plates in chest and both try to take out 5. We're in the second "tick" and both inserters are trying to pull things out.
If, due to a butterfly flapping its wings, i2 gets processed before i1, it checks whether i1 WILL take out any plates, not whether it has. Because it can run the same code that i1 will run, it can know for certain that i1 will take all 3 plates. This way it knows that after (in game logic, not processing sequence) i1 takes its plates out, there will be no plate left for it to take. Likewise, the chest has to check whether any of the inserters WILL be taking out plates, and if so, how many they will be taking out in total. In the chest's thread it can know the total that will be taken out by running the same code the inserters will be running, and it adjusts its figures appropriately: not by writing to its "current" values stored in the "read buffer", but by writing the new values to its "write buffer". This way, when the buffers are flipped, what the chest & inserters wrote to their output buffers becomes the "read buffers" for the next tick/update cycle.
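Put together, the scheme has each thread rerun the same deterministic allocation over the read buffer and then write only its own slot of the write buffer, with the flip happening at the tick boundary. A sketch under those assumptions (run single-threaded here for clarity; the names are illustrative):

```python
# Sketch of the redundant-computation scheme: the chest thread and both
# inserter threads would each rerun the same deterministic allocation over
# the shared read buffer, then each write only its own disjoint location in
# the write buffer. Here one call stands in for all three recomputations.

def allocation(chest_count, requests):
    grants, remaining = {}, chest_count
    for ent in sorted(requests):  # deterministic order
        grants[ent] = min(requests[ent], remaining)
        remaining -= grants[ent]
    return grants, remaining

read = {"chest": 3, "requests": {"i1": 5, "i2": 5}}
write = {}

grants, remaining = allocation(read["chest"], read["requests"])
# each line below writes the slot that one thread owns exclusively:
write["chest"] = remaining       # the chest's thread
write["i1_hand"] = grants["i1"]  # inserter i1's thread
write["i2_hand"] = grants["i2"]  # inserter i2's thread

read_next = write  # the buffer flip at the tick boundary
```

No thread ever writes memory another thread writes, so no locks are needed in the update itself; determinism comes entirely from `allocation` being a pure function of the read buffer.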

So long as you get the general mechanism and how the update cycle would work (i.e. that it works), then we can start discussing how appalling this example is in terms of performance and how it can be improved immensely. For example, there is no real reason for the chest to calculate the next state of the "attached" inserter arms WITHOUT writing that state anywhere. If you wish to go through how it works in a lock-free scenario, let me know, as it's quite interesting even in the presence of multithreading and even the weakest of memory-order guarantees.

BenSeidel
Filter Inserter
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am
Contact:

Re: Parallel processing in games & applications

Post by BenSeidel »

I have just realised that I missed a possible question about how the bot network would interact with such an inserter/chest implementation. I hope I showed how a bot travelling towards, or trying to collect an item from, a chest would interact with the system, but how the bot gets allocated has not been discussed.

So, if there is some reason why the allocation logic cannot be written in this style, to be executed in parallel with the updates to the inserters/chests, then don't make it parallel. Keep it sequential. For example, the bots may only be allocated AFTER the inserters have had a chance to pick up an item out of a chest. In that case, run the chest & inserter updates, wait for all threads to complete, flip the buffer, and run the current code as per normal. It will be able to iterate over the list of chests and bots, allocating and updating any state associated with its process. If there was a reason not to put the bots in with the inserters & chests, then they could also be run here as they currently are. It really does not depend on how the inserters are being updated, only that they are not being processed in some interleaved manner.

If the bot network were also to be made into a multithreaded process, it could still run sequentially with respect to the processing of the inserters. Process the first update section, wait for all threads to finish, flip the buffer associated with the first update process, start the second update process, wait for all threads to finish, flip that buffer, and so on. This technique is not limited to just one multithreaded phase; it is used to turn some sequential phase into a parallel, asynchronous one. It is even possible to have clock-based updates within clock-based updates.
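That phase-barrier-flip structure can be sketched directly: each phase fans its entity updates out to worker threads, joins them all, flips that phase's buffers, and only then does the next phase start. The entity shape and thread pool here are illustrative:

```python
# Sketch of phased parallel updates: run a phase's entity updates on worker
# threads, wait for all of them (the barrier), flip that phase's buffers,
# then start the next phase.
from concurrent.futures import ThreadPoolExecutor

def run_phase(pool, entities, update):
    list(pool.map(update, entities))  # parallel: each writes only its own slot
    for e in entities:                # barrier passed: flip write -> read
        e["read"] = e["write"]

inserters = [{"read": i, "write": i} for i in range(4)]
bots = [{"read": 10 * i, "write": 10 * i} for i in range(2)]

with ThreadPoolExecutor() as pool:
    # phase 1: inserters & chests
    run_phase(pool, inserters, lambda e: e.__setitem__("write", e["read"] + 1))
    # phase 2: bot network, sequential with respect to phase 1
    run_phase(pool, bots, lambda e: e.__setitem__("write", e["read"] + 5))
```

Each phase is internally parallel while the phases themselves stay strictly ordered, which is the "sequential phases of parallel work" arrangement described above.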

hoho
Filter Inserter
Filter Inserter
Posts: 677
Joined: Sat Jan 18, 2014 11:23 am
Contact:

Re: Parallel processing in games & applications

Post by hoho »

BenSeidel wrote:The tick# is a timestamp. It should not be any more fragile than any other method of keeping order, because it is never altered in the multithreaded area: that area is all about the update within the tick, which makes it a constant.
I must be misunderstanding something here. I assumed that the timestamp/tick# here is used to figure out the order in which the inserters should access the chest within a single tick. Apparently it's not. What is it used for, exactly?
BenSeidel wrote:No, it's not about "seeing if it's done its part"; it's about not caring, because you have all the information required to know what the other entities will do. Essentially you have to check the list to see what everything else will do that tick. If, due to a butterfly flapping its wings, i2 gets processed before i1, it checks whether i1 WILL take out any plates, not whether it has. Because it can run the same code that i1 will run, it can know for certain that it will take all 3 plates.
So, am I understanding this correctly:
i2 sees that i1 wants to take plates out, and as i1 picks up at most 5 plates there won't be any left, so i2 won't try to pick up anything itself.

Essentially, each inserter that interacts with the chest at a given tick will calculate how many items each higher-priority inserter wants to take out, and only then do its own calculations.

That sounds like something that is bound to get massively complex once you count in filter inserters, inserters with different stack sizes and inserters with configurable stack sizes (coming in 0.15). Sounds like a LOT of random memory access to gather up all that data.

BenSeidel wrote:In the chest's thread it can know how much in total will be taken out by running the same code the inserters will be running, and adjust its figures appropriately, not by writing to its "current" values that are stored in the "read buffer" but by writing these new values to its "write buffer".
So in the chest's thread, the inserter's removal of items has to be run again, almost the same way each inserter itself ran it?

Basically, what I think Factorio currently does is handle an inserter taking stuff out of a chest as a single atomic operation, where the inserter gets the items in the same moment as the chest's item count is reduced. Your proposal seems to require doing a LOT more calculation per inserter operation, plus it decouples the inserter and chest operations, probably making duplication of items more likely to happen (due to coding errors).
BenSeidel wrote:So long as you get the general mechanism and how the update cycle would work (ie -that it works) then we can start discussing how appalling this example is in terms of performance and how it can be improved immensely.
Yeah, assuming I did understand your proposal correctly, the overall amount of computation would increase quite a bit. It's just that it would possibly run in a shorter amount of wall-clock time due to multithreading. Of course, laptop users won't be particularly happy :)
BenSeidel wrote:If you wish to go through how it works in a lock-free scenario let me know as it's quite interesting even in the presence of multithreading and even the weakest of memory order guarantees.
If you have the time and willingness to describe it, I'd love to read it. I'm a programmer myself and I love parallel code; I just haven't really had much chance to write much of it myself :)

ratchetfreak
Filter Inserter
Filter Inserter
Posts: 950
Joined: Sat May 23, 2015 12:10 pm
Contact:

Re: Parallel processing in games & applications

Post by ratchetfreak »

hoho wrote:
ratchetfreak wrote:For example a provider chest would need to check for players, inserters and bots, each of those would also need to check the chest, the inserters, the player and the bots to see if their "pull item out" interaction succeeded.
Don't forget mods being able to insert stuff randomly as well ;)
I abstracted mods out of the equation there, as I assumed that the mod API would be revamped to fit better into this model.
hoho wrote:
ratchetfreak wrote:A simpler interaction model is to use messages, each entity can send messages to other entities, at the start of each frame it sorts the incoming messages deterministically and then processes them.
Considering how there can easily be hundreds of thousands of entities (bots + inserters) in a megabase, I don't think message queues are such a good idea when you have to generate a LOT of them every tick.
Each entity would only get a handful of messages per frame, which limits the potential memory overhead (less than everything else an entity needs), and an atomic increment per message push isn't that expensive. There would need to be a mechanism to deal with overflow, but a pool of queuechunk{message msg[32]; queuechunk *next;} will deal with that nicely.
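ratchetfreak's chunk idea can be sketched in a few lines. This is an illustrative guess at the shape only: the Message fields and the overflow policy are assumptions, not anything from Factorio's code.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical message type for illustration.
struct Message {
    int sender_id;
    int item_count;
};

// A fixed chunk of 32 message slots; each push claims a slot with a
// single atomic increment, so concurrent producers never collide.
struct QueueChunk {
    std::array<Message, 32> msgs;
    std::atomic<std::size_t> count{0};

    // Returns false when the chunk is full; a fuller version would then
    // chain to the next QueueChunk from a pool (the *next pointer idea).
    bool push(const Message& m) {
        std::size_t slot = count.fetch_add(1, std::memory_order_relaxed);
        if (slot >= msgs.size()) return false;
        msgs[slot] = m;  // the producer owns its claimed slot exclusively
        return true;
    }
};
```

At the start of the next frame the consumer reads `msgs[0..min(count, 32))` single-threadedly, so no further synchronisation is needed on the read side.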

There could be things that bypass the normal message queue because they always need to be sent or read, like circuit network output, which could still use the visible-front & modifiable-back buffer strategy.

User avatar
Klonan
Factorio Staff
Factorio Staff
Posts: 5148
Joined: Sun Jan 11, 2015 2:09 pm
Contact:

Re: Parrallel processing in games & applications

Post by Klonan »

Multithreading isn't impossible; it can be implemented in maybe a month for some systems and might take longer for others, but in general it's not too difficult.

But multithreading isn't some amazing performance multiplier, and it only really has an impact if you have more cores, so it's great if you have a shiny modern 8-core i7 or something, but for people on laptops or low-end CPUs with maybe only a dual core, the performance improvement might not even be there.

Right now there are a lot of single-threaded improvements that can be made, and have been made. For 0.15, performance has been roughly doubled just from a slew of single-threaded optimizations, and that's not even counting the massive belt optimization Harkonnen is working on.

Multithreading brings with it a big bag of complications, and doesn't necessarily benefit a lot of players, while general single-thread improvements will help all players of the game while being (relatively) simple.
Until we reach some sort of ceiling with our current optimizations, it isn't strictly better to start work on multithreading.

BenSeidel
Filter Inserter
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am
Contact:

Re: Parrallel processing in games & applications

Post by BenSeidel »

ratchetfreak wrote:I abstracted mods out of the equation there as I assumed that the mod api would be revamped to fit better into this model
No, the API should remain the same. I don't know how or where the modded code is run in the game loop. It could be interleaved (observer style) or run after the state updates (a message-queue system). From all the indications I have seen, the modded code is run after all the game updates for the frame have occurred, as I have not seen any indication that the read state of an entity changes between API calls due to ordering of the updates. If that is the case then no changes should be required to the Lua code.

As it would be a pain to multithread the Lua code, especially when better gains could be made through a JIT, just leaving it as it is should be sufficient. That said, you could get the JIT to create a DAG of dependent code, running each node independently while still maintaining code dependencies. Anyway, I have tried applying this clock-based memory-bound technique to a compiler. As a compiler does not have a natural clock bound, it does not work - at all. It's like applying object-oriented code to a simulation of fluid mechanics; they just don't work well together. Any other method would probably be better for getting the Lua code multithreaded.
hoho wrote:I must be misunderstanding something here. I assumed that the timestamp/tick# here is used to figure out the order in which the inserters should access the chest within a single tick. Apparently, it's not. What is it used for exactly?
Yes, to determine the priority of the inserter pickups, implemented as a FIFO system. The tick is well known, consistent across multiplayer instances, and constant over the duration of a game update. I'm not sure how it would introduce any amount of indeterminism, as each machine would know without any doubt when each inserter started requesting items.
hoho wrote: So, am I understanding this correctly:
i2 sees that i1 wants to take plates out, and as i1 picks up at most 5 plates there won't be any left, so i2 won't try to pick up anything itself.

Essentially, each inserter that interacts with the chest at a given tick will calculate how many items each higher-priority inserter before it wants to take out, and only then does its own calculations.

That sounds like something that is bound to get massively complex once you count in filter inserters, inserters with different stack sizes and inserters with configurable stack sizes (coming in 0.15). Sounds like a LOT of random memory access to gather up all that data.
Sure, because I chose a poor memory bound. You need to pick "execution contexts" that have tightly bound data. Processing two closely related things at two separate locations is stupid, but these mistakes are essentially the same ones that can be made in single-threaded code. It's a case of not understanding what parts of your code are accessing the same memory, hence not knowing that they should be processed at the same time to prevent memory transfers.

You can tell that you have incorrect memory bounds in your execution contexts because you get those artifacts. It's a similar concept to an incorrectly designed object, except there you end up making all of your variables public. In both cases there is a lack of understanding of the memory-access cohesion of your application.

As all the inserters are closely bound and share a large amount of data, merge them into one execution context. I have already talked about how this could be implemented.
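As a toy sketch of what "one execution context per chest plus its inserters" might look like: everything that shares the chest's data is updated together, serially, inside one task, so no per-item cross-thread coordination is needed. All names and the scarcity policy here are illustrative assumptions, not Factorio's code.

```cpp
#include <algorithm>
#include <vector>

struct Chest { int items = 0; };
struct Inserter { int wants = 0; int held = 0; };

// One execution context: a chest and every inserter bound to it.
// Independent contexts can run on different threads; within one context
// the pickups are resolved serially in a fixed (deterministic) order, so
// scarcity (fewer items than requests) is handled without races.
struct ChestContext {
    Chest chest;
    std::vector<Inserter> inserters;  // fixed, deterministic order

    void update() {
        for (auto& ins : inserters) {
            int take = std::min(ins.wants, chest.items);
            ins.held += take;
            chest.items -= take;
        }
    }
};
```

With 7 items and two inserters each wanting 5, the first gets 5 and the second gets the remaining 2, on every machine, with no locking inside the context.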
Klonan wrote:Multithreading isn't impossible, it can be implemented in maybe a month for some systems, might take longer for others, but in general its not too difficult

But multithreading isn't some amazing performance multiplier, and it only really has an impact if you have more cores, so its great if you have a shiny modern 8 core i7 or something, but for people on laptop or low end CPUs with maybe only a dual core, the performance improvement might not even be there.

Right now there is a lot of singlethreaded improvements that can be made, and have been made. For 0.15 performance has been roughly doubled just from a slew of single-threaded optimizations, and thats not even counting the massive belt-optimization Harkonnen is working on.

Multi-threading brings with it a big bag of complications, and doesn't necessarily benefit a lot of players, while general single-thread improvements will help all players of the game, while being (relatively) simple.
Until we reach some sort of ceiling with our current optimizations, it isn't strictly better to start work on multi-threading
Hi Klonan,
At least we agree that making something multithreaded isn't difficult.

Sadly, the rest of your post is essentially how the industry is operating at the moment, both from the chip designers point of view and from the software designers point of view. I have had discussions with CPU designers and they literally say the same thing except from the other side of the fence. "There is no demand for more than 4 cores unless you are running a server farm, and then they are network bound so they usually top out at about 32-64".

See what I am saying?
Software developers focus on single-threaded throughput because the chips only offer good single-threaded throughput, and the chips only offer good single-threaded throughput because the software designers focus on it. If the software industry stopped thinking that multithreading is mutually exclusive with single-threaded optimisations, this cycle of self-destruction would end.

These posts are about a simple technique that will allow multithreading support in existing single-threaded code with relatively minimal amounts of code change and probably less than a 10-20% overhead. I say probably because every piece of software is different and it depends on the "execution context" bounds that are selected. While Factorio SEEMS to be "everything interacts with everything", it really isn't. Factorio has extremely tight locality bounds, so you may even find that you could push the overhead much, much further down.

Also, at some point you have to stop worrying about the laptop with 2 cores. Yes, they should be able to launch a rocket, but really we are talking about them running at 30 UPS vs 36-ish UPS on a large map. Any single-threaded optimisations will absorb that difference. I realise that you are a business, so "excluding" some of your customer base does sound harsh, but it's a short-term issue that will extend the life of your game indefinitely. I have a 64-core machine sitting not 3 meters away from me. It's over 3 years old now, so it's nothing special. I could host RDP sessions to Factorio allowing truly massive factories. People would seriously think about getting the 8-core Xeon or a dual-socket motherboard to run larger and larger factories with more and more complex mods.

Anyway, it's just my opinion about how and why things are the way they are. I am trying to change it and have based my software on the hope that maybe one day the software industry will just accept that they should multithread, even though they don't have to right at this moment.

P.S.
12-core Xeon, $200-$400. 16-core, $600-$800. Not out of my or any of my friends' price brackets.

jcranmer
Long Handed Inserter
Long Handed Inserter
Posts: 90
Joined: Wed Jun 29, 2016 9:59 pm
Contact:

Re: Parrallel processing in games & applications

Post by jcranmer »

BenSeidel wrote:To answer all the issues with memory bandwidth and the double buffering: I don't think it would make a difference. Psorek had it correct, the copy-on-write functionality makes the entire system relatively cost-free. The copying is not done until a write is done. I think the CPUs are even smart enough to integrate the COW functionality into the cache writing if you are able to avoid calling a main-memory synchronising function. I'm not an x86 instruction expert, so if anyone actually knows the real cost of copying a memory page then let me know. I do know that two virtual memory addresses can actually point to the same physical memory address until one thread writes.
Copy-on-write implemented in hardware operates only at page-level granularity. Basically, the page is set to not allow writes, triggering a page fault if you attempt to write it, and the kernel then comes along and makes a separate page that's writable. Page faults are expensive--Linus reports about 1050 cycles on modern hardware. (L1 cache hit is 4 cycles latency). If you do it in software (basically, all writes to a data-structure first check a semaphore count), the cost of the check for the copy is ~300 cycles. Neither of these numbers include the cost of actually copying memory. The page fault scenario might also involve a partial TLB flush (and cache flush), which is easily several µs of lost time.
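The software-side check jcranmer describes can be illustrated with a toy copy-on-write buffer. Here `shared_ptr::use_count` stands in for his "semaphore count"; the type and its API are assumptions for illustration only.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Software copy-on-write: every write first checks whether the buffer is
// shared, and copies it only on the first write after a snapshot. This
// is the per-write check whose cost jcranmer estimates at ~300 cycles.
struct CowBuffer {
    std::shared_ptr<std::vector<int>> data =
        std::make_shared<std::vector<int>>();

    void write(std::size_t i, int v) {
        if (data.use_count() > 1)  // shared with a snapshot: copy first
            data = std::make_shared<std::vector<int>>(*data);
        if (i >= data->size()) data->resize(i + 1);
        (*data)[i] = v;            // now safe to mutate in place
    }
};
```

Copying the `CowBuffer` itself is cheap (one shared_ptr copy); the expensive vector copy is deferred until someone actually writes to a shared buffer, which is the whole point of COW.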

The greatest cost of parallel code is the overhead of synchronization and communication. Finer-grained locks allow for more parallelism, but at the cost of increased overhead acquiring and releasing those locks (CAS being the limit: it's a lock per cache-line, but every access takes as long as a lock acquisition). The base consumer hardware to assume is still a 4-core machine (although dual-core is still common in bottom-line computers), so arguing for a 32-core software design is not cost-effective for game developers.
Most of the issues facing the industry at the moment are self-imposed. There is little library support or development effort going into getting multithreading up to scratch. It's been decades now since we learned of the upper limit of CPU speeds, and almost a full decade since that limit was reached, but a multithreaded program is still the exception, not the norm.
If you think single-core performance has stalled in the past ~15 years, you're really blind about computer performance. Sure, clock speeds haven't increased, but clock speed isn't everything. There have been advances in branch prediction technology, better pipeline organization, more effective port utilization for superscalar designs. Hell, Haswell makes unaligned accesses that cross cache-line boundaries free.

greep
Fast Inserter
Fast Inserter
Posts: 108
Joined: Mon Apr 25, 2016 3:12 am
Contact:

Re: Parrallel processing in games & applications

Post by greep »

BenSeidel wrote: Software developers focus on single-threaded throughput because the chips only offer good single-threaded throughput, and the chips only offer good single-threaded throughput because the software designers focus on it. If the software industry stopped thinking that multithreading is mutually exclusive with single-threaded optimisations, this cycle of self-destruction would end.
Well, I think you're overestimating the average ability of a software developer when you say that the reason they don't use multithreading is because the chips only offer good single-threaded throughput. Your average game developer has a BS in computer science, has "exposure" to parallel processing, and maybe took a course on it. S/he doesn't have 6 years' expertise in the field, and neither would they be able to write Factorio at anywhere near its efficiency. Not to say you can't pick it up if needed, but it's certainly one of the more academic parts of CS. And I don't say this to insult the average developer, since that would just be insulting myself. You're just not going to write a multithreaded game when it'd make your bus factor like one guy xD

ratchetfreak
Filter Inserter
Filter Inserter
Posts: 950
Joined: Sat May 23, 2015 12:10 pm
Contact:

Re: Parrallel processing in games & applications

Post by ratchetfreak »

BenSeidel wrote:
ratchetfreak wrote:I abstracted mods out of the equation there as I assumed that the mod api would be revamped to fit better into this model
No, the API should remain the same. I don't know how or where the modded code is run in the game loop. It could be interleaved (observer style) or run after the state updates (a message-queue system). From all the indications I have seen, the modded code is run after all the game updates for the frame have occurred, as I have not seen any indication that the read state of an entity changes between API calls due to ordering of the updates. If that is the case then no changes should be required to the Lua code.

As it would be a pain to multithread the Lua code, especially when better gains could be made through a JIT, just leaving it as it is should be sufficient. That said, you could get the JIT to create a DAG of dependent code, running each node independently while still maintaining code dependencies. Anyway, I have tried applying this clock-based memory-bound technique to a compiler. As a compiler does not have a natural clock bound, it does not work - at all. It's like applying object-oriented code to a simulation of fluid mechanics; they just don't work well together. Any other method would probably be better for getting the Lua code multithreaded.
The current place where mods are run is basically (or near enough)

Code:

game_update();
for (auto& mod : mods) {
    mod.run_events();
}
where run_events can read and write the game state. This forces mods to be a fully serial part of the game loop, with no recourse to even attempt to multithread unless you restrict what the Lua code can do. Or, in other words, only let it read the game state and send events to be processed in game_update() along with all other changing events. After you do that, you can run multiple mods at the same time because they only need read access and an outgoing message queue.
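A minimal sketch of this read-only-state-plus-outgoing-queue model. All the types and names here are illustrative assumptions, not Factorio's actual mod API.

```cpp
#include <vector>

// Hypothetical slice of game state and mod event, for illustration.
struct GameState { int iron_plates = 0; };
struct ModEvent { int delta_plates; };

// Each mod gets a read-only view of the state plus a private outbox, so
// several mods can run concurrently without touching shared state.
struct ModContext {
    const GameState& read;         // read-only view of the game state
    std::vector<ModEvent> outbox;  // events applied later by game_update()
};

void run_mod(ModContext& ctx) {
    // A mod may only inspect state and queue changes, never mutate directly.
    if (ctx.read.iron_plates < 10)
        ctx.outbox.push_back(ModEvent{5});
}

// game_update() drains all outboxes in a deterministic mod order.
void apply_events(GameState& state, const std::vector<ModEvent>& events) {
    for (const auto& e : events) state.iron_plates += e.delta_plates;
}
```

Determinism is preserved because the outboxes are applied in a fixed order, even though the mods themselves ran in parallel.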

However, it would be much better if mod authors got access to multithreading themselves, so the event-handling code for each mod doesn't have to be serial.
BenSeidel wrote:
hoho wrote:I must be misunderstanding something here. I assumed that the timestamp/tick# here is used to figure out the order in which the inserters should access the chest within a single tick. Apparently, it's not. What is it used for exactly?
Yes, to determine the priority of the inserter pickups, implemented as a FIFO system. The tick is well known, consistent across multiplayer instances, and constant over the duration of a game update. I'm not sure how it would introduce any amount of indeterminism, as each machine would know without any doubt when each inserter started requesting items.
Two inserters wanting to pick up an item in the same tick means the pickups get the same tick number, which then requires something else to order them deterministically. A FIFO would not be deterministic because the OS scheduler will affect the order the accesses come in.
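The usual fix ratchetfreak is hinting at is to break same-tick ties with a stable key, such as an entity id, so every machine orders the pickups identically regardless of thread scheduling. A sketch, with illustrative types:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical pickup record, for illustration.
struct Pickup { unsigned tick; unsigned entity_id; int amount; };

// Sort by tick first, then entity id: the arrival order (which the OS
// scheduler controls) no longer matters, only the stable keys do.
void sort_deterministically(std::vector<Pickup>& v) {
    std::sort(v.begin(), v.end(), [](const Pickup& a, const Pickup& b) {
        if (a.tick != b.tick) return a.tick < b.tick;
        return a.entity_id < b.entity_id;  // stable tie-break within a tick
    });
}
```

After this sort, the same requests produce the same chest-access order on every multiplayer peer, which is the determinism the FIFO alone cannot give.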
BenSeidel wrote:
hoho wrote: So, am I understanding this correctly:
i2 sees that i1 wants to take plates out, and as i1 picks up at most 5 plates there won't be any left, so i2 won't try to pick up anything itself.

Essentially, each inserter that interacts with the chest at a given tick will calculate how many items each higher-priority inserter before it wants to take out, and only then does its own calculations.

That sounds like something that is bound to get massively complex once you count in filter inserters, inserters with different stack sizes and inserters with configurable stack sizes (coming in 0.15). Sounds like a LOT of random memory access to gather up all that data.
Sure, because I chose a poor memory bound. You need to pick "execution contexts" that have tightly bound data. Processing two closely related things at two separate locations is stupid, but these mistakes are essentially the same ones that can be made in single-threaded code. It's a case of not understanding what parts of your code are accessing the same memory, hence not knowing that they should be processed at the same time to prevent memory transfers.

You can tell that you have incorrect memory bounds in your execution contexts because you get those artifacts. It's a similar concept to an incorrectly designed object, except there you end up making all of your variables public. In both cases there is a lack of understanding of the memory-access cohesion of your application.

As all the inserters are closely bound and share a large amount of data, merge them into one execution context. I have already talked about how this could be implemented.
Or you use the message style I proposed, which restricts what each entity will operate on to just itself, its own queue, the queues of a few other entities, and a limited number of public properties that would otherwise be too expensive to emit events for. This is a tiny execution context, and you don't need complex grouping algorithms to get the most out of the performance.

Lee_newsum
Filter Inserter
Filter Inserter
Posts: 436
Joined: Wed Jan 15, 2014 9:41 am
Contact:

Re: Parrallel processing in games & applications

Post by Lee_newsum »

BenSeidel wrote: P.S.
12-core Xeon, $200-$400. 16-core, $600-$800. Not out of my or any of my friends' price brackets.
I have a i7-5820k 6-core 12-threads @3.3GHz and on game out there can max out my cpu, they can max out a threads or 3.

justarandomgeek
Filter Inserter
Filter Inserter
Posts: 300
Joined: Fri Mar 18, 2016 4:34 pm
Contact:

Re: Parrallel processing in games & applications

Post by justarandomgeek »

hoho wrote:
BenSeidel wrote:
hoho wrote:@OP, you didn't address the need of perfect determinism in your response to criticism of your original post.
Hmm... because I don't see that there is any indeterminism. Your question about the two inserters removing things from the same chest was addressed in my first post, so I'm not sure what you are asking.
It did talk a little bit about it but I was specifically interested in the scenario I described:
fewer items in the chest than inserters try to pick out. To make things more fun, make it a provider chest and have bots try to take things from it as well.
This is Factorio, so really you ought to go a little further - make it a Warehouse (from the mod) with all 24 possible inserters and logbots going both ways (storage). And you've got several thousand of them.

kinnom
Filter Inserter
Filter Inserter
Posts: 706
Joined: Fri Dec 26, 2014 4:20 pm
Contact:

Re: Parrallel processing in games & applications

Post by kinnom »

justarandomgeek wrote: with all 24 possible inserters
Or better, add bobinserters and have 108 inserters.
no yes yes no yes no yes yes

BenSeidel
Filter Inserter
Filter Inserter
Posts: 584
Joined: Tue Jun 28, 2016 1:44 am
Contact:

Re: Parrallel processing in games & applications

Post by BenSeidel »

jcranmer wrote:Copy-on-write implemented in hardware operates only at page-level granularity. Basically, the page is set to not allow writes, triggering a page fault if you attempt to write it, and the kernel then comes along and makes a separate page that's writable. Page faults are expensive--Linus reports about 1050 cycles on modern hardware. (L1 cache hit is 4 cycles latency). If you do it in software (basically, all writes to a data-structure first check a semaphore count), the cost of the check for the copy is ~300 cycles. Neither of these numbers include the cost of actually copying memory. The page fault scenario might also involve a partial TLB flush (and cache flush), which is easily several µs of lost time.
1050 cycles is not that much when it can occur at most ONCE per update cycle for each page that is clock-controlled. It means that if you are able to get your concurrent data structures at or below ~1000 bytes then it's cheaper to spend the COW cost as opposed to doing a semaphore check (4 x 300 cycles = 1200 cycles). It becomes even better when there ends up being no writes to the output buffer, so it will not trigger an update. Also, if you do take into account the time to copy the page you should find that the bulk "array-style" copy that the OS page manager is able to do will be significantly faster than the ad-hoc value-by-value, essentially random order value copying that the code would do. Of course, if your program is writing every value anyway, then don't use COW.

But again, this is an optimisation, and as I am no page-mechanics expert I just don't know what would be best in which scenarios. I also live by the mantra "don't optimise until you profile".
jcranmer wrote:The greatest cost of parallel code is the overhead of synchronization and communication.
Using this technique there should be no synchronisation costs during the update. As the synchronisation is done at a single point, you only have to pay for it once. This is the key point. Because you don't use any synchronisation methods within a clock cycle, you get all the benefits of the single-threaded optimisations that modern CPUs have, without paying for the memory bandwidth limitations. The only additional cost to making it multithreaded is the extra memory needed to store the clock-controlled sections. If there is a chance that you can utilise even one additional memory channel, you should find an increase in performance. Also, cache misses on writes aren't as expensive as cache misses on reads (in theory; I can't speak for EVERY CPU architecture) because you don't have to pull the memory down into your cache to get the write written.
jcranmer wrote:If you think single-core performance has stalled in the past ~15 years, you're really blind about computer performance. Sure, clock speeds haven't increased, but clock speed isn't everything. There have been advances in branch prediction technology, better pipeline organization, more effective port utilization for superscalar designs. Hell, Haswell makes unaligned accesses that cross cache-line boundaries free.
I am fully aware of the "advancements" made in CPU design that have increased the throughput of modern CPUs, having already spoken about them. They are micro-parallelisms, allowing sections of your code to run in parallel where possible because, simply put, software developers are too lazy to do it themselves. Who remembers the days of having to put no-ops after a branch because the CPUs didn't do it? Go have a look at the footprint of a modern CPU: look at how large the branch prediction, instruction reordering and vector processing (all the versions) units are, and compare them to the size of the FPU and ALU. To me it's all wasted real estate. Have a read of http://www-cs.intel.com/pressroom/kits/ ... final2.pdf and tell me that we are going in the correct direction. I really wish I could find the performance-per-watt figures of CPUs leading up to the 90's, as it was decreasing. The cost per operation in the 60's was immense.
greep wrote:Well, I think you're overestimating the average ability of a software developer when you say that the reason they don't use multithreading is because the chips only offer good single-threaded throughput. Your average game developer has a BS in computer science, has "exposure" to parallel processing, and maybe took a course on it. S/he doesn't have 6 years' expertise in the field, and neither would they be able to write Factorio at anywhere near its efficiency. Not to say you can't pick it up if needed, but it's certainly one of the more academic parts of CS. And I don't say this to insult the average developer, since that would just be insulting myself.
That really is a load of bull. If you honestly believe that someone can't understand this technique then I am glad I am not working next to you. Sure, if you want to get into the full extent of the multithreaded system I have developed then you may be correct, but simply having two global variables, one called READ_BLOCK and one called WRITE_BLOCK, extending each clock-controlled variable to be an array of two, and controlling their access by using object.variable[WRITE_BLOCK] = object.variable[READ_BLOCK] + 1; seems to me quite an easy change. If you cannot visualise this in your head then how the hell did you learn to program? Unfortunately I WILL NOT BE ABLE TO CONVINCE YOU because you are not trying to learn it, probably because you believe it's too hard to learn. This is essentially the reason why the couple of "web guys" got the boot, they simply did not want to learn it.
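For the curious, the READ_BLOCK/WRITE_BLOCK idiom described above can be sketched as follows. Only the two index names and the `variable[WRITE_BLOCK] = variable[READ_BLOCK] + 1` line come from the post; the Counter type and the swap point are illustrative assumptions.

```cpp
#include <cstddef>
#include <utility>

// Tick-tock double buffering: every clock-controlled variable becomes a
// two-element array. Readers use READ_BLOCK, writers use WRITE_BLOCK,
// and the indices swap once per global clock tick.
std::size_t READ_BLOCK = 0;
std::size_t WRITE_BLOCK = 1;

struct Counter {
    int value[2] = {0, 0};

    // Inside a tick, every thread reads last tick's state and writes next
    // tick's; no locks are needed because the two blocks never alias.
    void update() { value[WRITE_BLOCK] = value[READ_BLOCK] + 1; }
};

// Called once per tick, at the single synchronisation point, after all
// worker threads have finished updating.
void swap_blocks() { std::swap(READ_BLOCK, WRITE_BLOCK); }
```

Because writers never touch the block readers are looking at, any number of threads can call `update()` on disjoint objects concurrently within a tick, and the only synchronisation cost is the one swap between ticks.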
ratchetfreak wrote:This forces mods to be a fully serial part of the game loop, with no recourse to even attempt to multithread unless you restrict what the Lua code can do. Or, in other words, only let it read the game state and send events to be processed in game_update() along with all other changing events. After you do that, you can run multiple mods at the same time because they only need read access and an outgoing message queue.

However, it would be much better if mod authors got access to multithreading themselves, so the event-handling code for each mod doesn't have to be serial.
I have been thinking about this, and you should be able to get the Lua code multithreaded using this technique; you just need each callback in its own execution context. The only requirement for it to be doable is how information is moved through the global variable to other executing sections of the code. Any modification to the game state can be done through the tick-tock memory buffer, possibly the same with the global table. BUT AGAIN I HAVE TO SAY: I have no idea of the causality constraints in the API in Factorio. So without seeing any code or having a detailed explanation of these constraints from the developers, this conversation is a he-said she-said waste of time. The only thing that can be said for certain is that if process A does not depend on process B, and process B does not depend on process A, then they can be executed at the same time. I offer a technique to ensure that these processes don't ever rely on each other.
ratchetfreak wrote:A FIFO would not be deterministic because the OS scheduler will affect the order the access come in.
A FIFO system has nothing to do with the OS scheduler. I cannot respond to this because I have absolutely no idea what you are thinking.
ratchetfreak wrote:This is a tiny execution context and you don't need complex grouping algorithms to get the most out of the performance
You will always need relatively large execution contexts because, if you don't, you can't effectively reduce your cache misses. The only way to ensure that some section of memory is all processed at once is to have it processed in the same thread (aka in the same execution context). Extremely small execution contexts ARE A REALLY BAD IDEA on the current CPU architecture. I repeat: A REALLY BAD IDEA. You need high memory cohesion in any application you write, multithreaded or not.
Lee_newsum wrote:I have a i7-5820k 6-core 12-threads @3.3GHz and on game out there can max out my cpu, they can max out a threads or 3.
I don't know what you are saying. Are you saying that there are only games that max out 3 threads, or are you saying your 6 cores can only max out 3 threads? Either way, I have not found this in any of my tests, but I have only been working with Xeons for anything above 4 cores. AFAIK, a CPU vendor gets its butt kicked when they falsely advertise what a chip can do, so I would think it's more to do with what you are running, not what you are running it on. Anyway, please clarify.
justarandomgeek wrote:This is Factorio, so really you ought to go a little further - make it a Warehouse (from the mod) with all 24 possible inserters and logbots going both ways (storage). And you've got a several thousand of them.
kinnom wrote:Or better, Add bobinserters and have 108 inserters
I think I covered this; in any case, I am not writing pseudocode describing how you would handle 108 inserters when you should understand that a while-loop can do any number you wish.

*edit: realised I linked a flops-per-watt, not the energy per instruction. Can't find (at the moment) any better papers/sources detailing the increase of the energy cost per instruction.
Last edited by BenSeidel on Thu Jan 19, 2017 2:11 pm, edited 1 time in total.

greep
Fast Inserter
Fast Inserter
Posts: 108
Joined: Mon Apr 25, 2016 3:12 am
Contact:

Re: Parrallel processing in games & applications

Post by greep »

BenSeidel wrote:
greep wrote:Well, I think you're overestimating the average ability of a software developer when you say that the reason they don't use multithreading is because the chips only offer good single-threaded throughput. Your average game developer has a BS in computer science, has "exposure" to parallel processing, and maybe took a course on it. S/he doesn't have 6 years' expertise in the field, and neither would they be able to write Factorio at anywhere near its efficiency. Not to say you can't pick it up if needed, but it's certainly one of the more academic parts of CS. And I don't say this to insult the average developer, since that would just be insulting myself.
That really is a load of bull. If you honestly believe that someone can't understand this technique then I am glad I am not working next to you. Sure, if you want to get into the full extent of the multithreaded system I have developed then you may be correct, but simply having two global variables, one called READ_BLOCK and one called WRITE_BLOCK, extending each clock-controlled variable to be an array of two, and controlling their access by using object.variable[WRITE_BLOCK] = object.variable[READ_BLOCK] + 1; seems to me quite an easy change. If you cannot visualise this in your head then how the hell did you learn to program? Unfortunately I WILL NOT BE ABLE TO CONVINCE YOU because you are not trying to learn it, probably because you believe it's too hard to learn. This is essentially the reason why the couple of "web guys" got the boot, they simply did not want to learn it.
Oh sure, it's just as simple as that to parallelize. That must be why you're writing whole essays to describe some inserters interacting with a chest :lol: Anyways, we seem to be drifting into nerd rage territory, so I'm off to blow up aliens.

Post Reply

Return to “General discussion”