Linked chests connect surfaces. And can't mods add circuit wires between entities on different surfaces too?Rseding91 wrote: Mon Feb 08, 2021 8:30 pmEvents are global; Lua events even more so. So they do interact with everything regardless of surfaces. Surfaces in fact are nothing special except more chunks. Very little is actually seperate per-surface. The entirety of force-related anything is global, events are global, player stuff is global.elfstone wrote: Mon Feb 08, 2021 6:03 pm ... it might be possible to run different surfaces on different cores, since events on one surface don't interact with things on other surfaces, which should make multi threading a lot easier.
Parrallel processing in games & applications
Re: Parrallel processing in games & applications
Re: Parrallel processing in games & applications
And that is probably the biggest misconception of why most multithreading code shows no improvements.Trific wrote: Tue Feb 09, 2021 8:07 pm Programming for multithreading is a bit like designing a train network in Factorio. In both cases you have to avoid deadlocks. Except in a program, you may have millions or billions of trains each carrying a unique cargo, and you're up against the constraints that you can't have two trains carrying the same cargo, and no amount of throwing extra trains or fancy signaling at the problem is going to get a single cargo somewhere any faster than the max speed of the train.
You don't put 1000 trains on a single rail line and expect them to be faster. That just remains sequential no matter how you multithread it. And the extra signals will slow everything down so your trains will actually take far longer to arrive.
You build 1000 rail lines with one train each and no cross overs. Now all trains run in parallel at 1000 times the throughput.
The way factorio is designed everything depends on everything else and every entity has to be evaluated in the correct (or at least same on all clients) order or you desync. That makes multithreading basically impossible any more than it is. You first have to isolate groups of entities that don't interact at al. Like the pipes are split into fluid systems that do not connect anywhere. Those things that are easily split like that wube has split by now.
But that is a fundamental design problem of how entity updates work in factorio. The action of one entity affects other entities in the same tick so everything depends on everything else and thus you would need locking or clever dependency analysis to multithreading, both of which add more overhead than you gain from more cores. And with updates not being broken down into small chunks that fit in caches the memory is the true limiting factor as the devs have stated frequently.
Going back to the train analogy: Once you put down your tracks all you can do is mitigate the track design with signals to gain some parallelity. But ultimately the way you designed the tracks will limit you. Only way to improve at a certain point (which factorio seem to have reached) is to deconstruct it all and design a better track layout.
Re: Parrallel processing in games & applications
That's not asynchronous, just plain multithreaded. Asynchronous, multithreaded would be one cook putting the eggs on the stove and the toast in the toaster and setting timers and another taking them out when the timers go off. Or Multithreaded, asynchronous would be one cook putting eggs on the stove and setting a timer for all the order and taking them off then each timer goes off. And another cook doing the same with toasts.NotoriousPyro wrote: Sun Feb 27, 2022 9:44 pmhttps://stackoverflow.com/a/34681101/3812928Asynchronous, multithreaded: you hire two more cooks, one to cook eggs and one to cook toast. Now you have the problem of coordinating the cooks so that they do not conflict with each other in the kitchen when sharing resources. And you have to pay them.
If you can do it so easily then why don't you apply for a job and do it? If you want it so bad you would do it for free if it's so easy.
Where the analogy breaks down is having to pay them. The customers already bought the CPUs so they are free to use. It's like every customer brings their own cooks into the restaurant.
-
- Smart Inserter
- Posts: 2768
- Joined: Tue Apr 25, 2017 2:01 pm
- Contact:
Re: Parrallel processing in games & applications
My understanding may be off, but that does still seem like asynchronous, multithreaded to me. In that example you've actually got 3 cooks: the original cook who is the one taking in the main task and then assigning sub-tasks for the order to the other two (eggs & toast). The original cook can then do other needed tasks for that order (prepping the plate or anything else needed for it), but ultimately needs to wait until both the eggs and toast are done before being able to send it out the door.mrvn wrote: Tue Oct 04, 2022 3:52 pmThat's not asynchronous, just plain multithreaded.NotoriousPyro wrote: Sun Feb 27, 2022 9:44 pmhttps://stackoverflow.com/a/34681101/3812928Asynchronous, multithreaded: you hire two more cooks, one to cook eggs and one to cook toast. Now you have the problem of coordinating the cooks so that they do not conflict with each other in the kitchen when sharing resources. And you have to pay them.
If you can do it so easily then why don't you apply for a job and do it? If you want it so bad you would do it for free if it's so easy.
My Mods: Classic Factorio Basic Oil Processing | Sulfur Production from Oils | Wood to Oil Processing | Infinite Resources - Normal Yield | Tree Saplings (Redux) | Alien Biomes Tweaked | Restrictions on Artificial Tiles | New Gear Girl & HR Graphics
Re: Parrallel processing in games & applications
Maybe. Maybe better described as the worker-threads or thread pool model. You have a bunch of cooks and one chef that orders them around and coordinates the kitchen.FuryoftheStars wrote: Tue Oct 04, 2022 4:27 pmMy understanding may be off, but that does still seem like asynchronous, multithreaded to me. In that example you've actually got 3 cooks: the original cook who is the one taking in the main task and then assigning sub-tasks for the order to the other two (eggs & toast). The original cook can then do other needed tasks for that order (prepping the plate or anything else needed for it), but ultimately needs to wait until both the eggs and toast are done before being able to send it out the door.mrvn wrote: Tue Oct 04, 2022 3:52 pmThat's not asynchronous, just plain multithreaded.NotoriousPyro wrote: Sun Feb 27, 2022 9:44 pmhttps://stackoverflow.com/a/34681101/3812928Asynchronous, multithreaded: you hire two more cooks, one to cook eggs and one to cook toast. Now you have the problem of coordinating the cooks so that they do not conflict with each other in the kitchen when sharing resources. And you have to pay them.
If you can do it so easily then why don't you apply for a job and do it? If you want it so bad you would do it for free if it's so easy.
Note: only the chef is working asynchronous. All the cooks work sequentially in the above, that's why I said it's not asynchronous.
-
- Smart Inserter
- Posts: 2768
- Joined: Tue Apr 25, 2017 2:01 pm
- Contact:
Re: Parrallel processing in games & applications
Yes, but I think that was the point of the examples that were given in that post. The "synchronous", "asynchronous, single threaded", etc were supposed to be from the viewpoint of "you" (the original cook), not the additional cooks or all aspects of the given examples.mrvn wrote: Tue Oct 04, 2022 5:17 pm Note: only the chef is working asynchronous. All the cooks work sequentially in the above, that's why I said it's not asynchronous.
My Mods: Classic Factorio Basic Oil Processing | Sulfur Production from Oils | Wood to Oil Processing | Infinite Resources - Normal Yield | Tree Saplings (Redux) | Alien Biomes Tweaked | Restrictions on Artificial Tiles | New Gear Girl & HR Graphics
-
- Filter Inserter
- Posts: 665
- Joined: Wed Sep 16, 2020 12:45 pm
- Contact:
Re: Parrallel processing in games & applications
There's a misconception. Multithreading does exist (belts, one of the biggest mainstays of the game). The choice seemed to be to do it tactically rather than architecturally. My guess is that it was all built with a single core in mind in order to increase stability, which paid off I think for speedrunning (noticeable lack of crazy hacks you find in other speedruns). And then people built these crazy bases, and so they shoe-horned some stuff in...mrvn wrote: Tue Oct 04, 2022 3:45 pm And that is probably the biggest misconception of why most multithreading code shows no improvements.
In the end though, there is nothing magic about 'determinism' and there is nothing all that end-of-the-world with parallelism. It's just a series of trade offs and choices that are made. I'm skeptical it makes sense to fight mod authors on this, mods will always be flakey regardless, but it is what it is.
OptimaUPS Mod, pm for info.
-
- Manual Inserter
- Posts: 2
- Joined: Tue Sep 06, 2022 5:30 pm
- Contact:
Re: Parrallel processing in games & applications
This thread has had me curious about using GPU's for game development.
I've worked on a top-down simulation game that uses OpenCL as the primary language for ingame logic. So far it has lead to some interesting design challenges.
1. All OpenCL game logic code uses fixed-point numbers for determinism. primarily to make the addition of numbers across all threads deterministic. The result of A = B + C is the same as A = C + B.
2. All OpenCL game structures use pointer offsets integers instead of native pointers to keeping gamestate coherent for transfer across network and saving/loading to disk.
3. The Transfers TO/FROM CPU<->GPU are minimized by issuing rendering calls on the CPU and altering OpenGL VBO's on the GPU (OpenGL shared graphics objects) The only need for communication back to the CPU is for sound effects and other metadata.
4. For lockstep networking - commands are sent to the GPU and then processed there.
I have been able to simulate 35k units with simple fixedpoint physics and collision detection at < 32fps all determistically in sync across the network. With A* Pathfinding and such as well.
The Prototype can be downloaded here for anyone wishing to try it out:
https://gameprototypes.itch.io/astroid-miner
https://www.youtube.com/watch?v=8bW2YIXNxsA
Should work on NVIDIA and AMD graphics cards with at least 4gb ram. (tested on Radion RX 460, and NVIDIA RTX 2080)
I've worked on a top-down simulation game that uses OpenCL as the primary language for ingame logic. So far it has lead to some interesting design challenges.
1. All OpenCL game logic code uses fixed-point numbers for determinism. primarily to make the addition of numbers across all threads deterministic. The result of A = B + C is the same as A = C + B.
2. All OpenCL game structures use pointer offsets integers instead of native pointers to keeping gamestate coherent for transfer across network and saving/loading to disk.
3. The Transfers TO/FROM CPU<->GPU are minimized by issuing rendering calls on the CPU and altering OpenGL VBO's on the GPU (OpenGL shared graphics objects) The only need for communication back to the CPU is for sound effects and other metadata.
4. For lockstep networking - commands are sent to the GPU and then processed there.
I have been able to simulate 35k units with simple fixedpoint physics and collision detection at < 32fps all determistically in sync across the network. With A* Pathfinding and such as well.
The Prototype can be downloaded here for anyone wishing to try it out:
https://gameprototypes.itch.io/astroid-miner
https://www.youtube.com/watch?v=8bW2YIXNxsA
Should work on NVIDIA and AMD graphics cards with at least 4gb ram. (tested on Radion RX 460, and NVIDIA RTX 2080)
-
- Smart Inserter
- Posts: 2768
- Joined: Tue Apr 25, 2017 2:01 pm
- Contact:
Re: Parrallel processing in games & applications
Here's one that I happened to be looking at the other day that does pathfinding through the GPU: https://store.steampowered.com/app/1468 ... mulator_2/cashtrevor wrote: Tue Nov 08, 2022 6:13 pm This thread has had me curious about using GPU's for game development.
[...]
I have been able to simulate 35k units with simple fixedpoint physics and collision detection at < 32fps all determistically in sync across the network. With A* Pathfinding and such as well.
Course, none of that helps this game now.
My Mods: Classic Factorio Basic Oil Processing | Sulfur Production from Oils | Wood to Oil Processing | Infinite Resources - Normal Yield | Tree Saplings (Redux) | Alien Biomes Tweaked | Restrictions on Artificial Tiles | New Gear Girl & HR Graphics
Re: Parrallel processing in games & applications
To the amazing devs,
Have you tried running SwapBuffers in parallel, with or without multiple contexts? The basic idea is to run the event loop, window handling and SwapBuffers on the main thread, then do updates and rendering on thread #2. When it's time to submit a frame after drawing, signal the main thread to wake up. When it's time to draw again (on thread #2), make sure the previous SwapBuffers call on the main thread has already completed.
It used to be the case that most render time (up to 7 ms) was spent inside that call, not on drawing arrays or binding textures as I'd initially suspect on my 1060 and 270X's.
Have you tried running SwapBuffers in parallel, with or without multiple contexts? The basic idea is to run the event loop, window handling and SwapBuffers on the main thread, then do updates and rendering on thread #2. When it's time to submit a frame after drawing, signal the main thread to wake up. When it's time to draw again (on thread #2), make sure the previous SwapBuffers call on the main thread has already completed.
It used to be the case that most render time (up to 7 ms) was spent inside that call, not on drawing arrays or binding textures as I'd initially suspect on my 1060 and 270X's.
- TheKillerChicken
- Long Handed Inserter
- Posts: 80
- Joined: Sat Mar 02, 2019 7:06 am
- Contact:
Re: Parrallel processing in games & applications
I use the NUMA setup to get the max out of my memory. My system can support 8-channel DDR3-1866 LR3DS ram. But I would like to see SMT technology integrated into this game at some point. Right now I just funnel all my bandwidth through my 4 sticks of DRAM whereas my server uses 2 sticks. My problem is I like to put yotta-scale values and 4 billion stacks with 10 million+ construction robots, so I am bound to feel the wrath of DRAM bottlenecking to the ultra-extreme. The funny thing is, my server UPS stays at 60. My gamer and my server are the exact same hardware specs CPU-wise. I have 128G on my gamer, and 64G on my server so I can catch up easily. I am not familiar with paraellisation (I can't spell it) but perhaps SMT is possible later on. Great thing about my gamer is that I have a mainframe motherboard, so I can manipulate where the bandwidth can go within my QPI. I wonder how this game will run off of DDR XD-RAM? Fastest I have seen is 5.6TB/s bandwidth on XD-RAM. Either way, this is an awesome game and Wube is an incredible company. Thanks for making such a game.
-
- Manual Inserter
- Posts: 2
- Joined: Sat Apr 30, 2022 8:25 pm
- Contact:
Re: Parrallel processing in games & applications
At this point, we should just say "rewrite Factorio in ASM". C++ is already highly performant, unless the devs want to use C, ASM is the only option.
mov eax, ebx
int 0x80
push eax
add eax, ecx
pop eax
...etc. For Factorio to emerge, there would have to be hundreds of thousands of ASM instructions.
mov eax, ebx
int 0x80
push eax
add eax, ecx
pop eax
...etc. For Factorio to emerge, there would have to be hundreds of thousands of ASM instructions.
-
- Manual Inserter
- Posts: 2
- Joined: Tue Sep 06, 2022 5:30 pm
- Contact:
Re: Parrallel processing in games & applications
I wanted to follow up with my experiments with parrellism from above.
While using OpenCL does work. I came to the conclusion that the tooling is just not there yet for game development completely on GPU. (as compared with compilers etc for x86_64, arm, etc). one starts to miss C++ features. CUDA uses c++ but there is no good way to support AMD as well. (the clang compiler may get there at some point).
porting game code to CPU (while still using an arbritrary amount of threads) was not too bad, and runs comparitively fast - (faster single threaded performance). Entity updates are split evenly between all threads so 1000 entities and 10 threads would be 100 entities updating in each thread.
So I am still using a massively multithreaded approach. where a game update consists of N number of stages where threads must hit a thread barrier and synchronize after each stage). Then fitting game mechanics inside those stages.
https://gameprototypes.itch.io/astroid-miner
While using OpenCL does work. I came to the conclusion that the tooling is just not there yet for game development completely on GPU. (as compared with compilers etc for x86_64, arm, etc). one starts to miss C++ features. CUDA uses c++ but there is no good way to support AMD as well. (the clang compiler may get there at some point).
porting game code to CPU (while still using an arbritrary amount of threads) was not too bad, and runs comparitively fast - (faster single threaded performance). Entity updates are split evenly between all threads so 1000 entities and 10 threads would be 100 entities updating in each thread.
So I am still using a massively multithreaded approach. where a game update consists of N number of stages where threads must hit a thread barrier and synchronize after each stage). Then fitting game mechanics inside those stages.
https://gameprototypes.itch.io/astroid-miner
Re: Parrallel processing in games & applications
Good luck with that. There's a reason more abstract programming languages were developped, and compilers take care about optimizations on machine level. In most cases, any attempt of a programmer to outdo the compiler backfire.Ondrashek06 wrote: Sat Sep 09, 2023 7:43 am At this point, we should just say "rewrite Factorio in ASM". C++ is already highly performant, unless the devs want to use C, ASM is the only option.
mov eax, ebx
int 0x80
push eax
add eax, ecx
pop eax
...etc. For Factorio to emerge, there would have to be hundreds of thousands of ASM instructions.
Besides that, the issue with parallel computing is to identify which parts of the software can be run in parallel while staying deterministic. You don't simplify that by making your code far harder to understand and maintain.
Multithreading shower thoughts
Dear developers,
I need to apologize in advance as I am very bad in putting my thoughts into a half decent order, so I'll just put them here as they come and hope you can get something useful out of it. I am also not sure if the forum is the correct one.
Inspired to this thoughts I got by the FFF #421, in particular the multithreading discussion (possible that there was another multithreading post I don't recall now). Anyway... first I got a question: if you play multiplayer, does every player needs to compute the entire game, or is sufficient to have a super beefy host to do all the computations and the other players can be "thin clients" who just get the results and need to compute only graphics. I can imagine that it would be very bandwidth intense, if you stream the position of every single item to all clients... on the other hand, he needs only stuff he can actually "see". So in particularly with different planets it would be rather easy to "outsource" computations.
This brought me to an idea: because planets are decoupled quiet much, if should be possible (if above is implemented of course), to distribute the computational power among several hosts... so I was imagining an insanely huge map, big enough that each planet would need its own bare metal host, probably this has no "practical usecase", but... we are not playing this game for practical usecase, rather then the possibility of going completely insane in it...
To put some background... I work as theoretical physicist, and even though I am not working on simulations myself (I do more blackboard physics and low performance computations), I know a lot of people doing simulations and your blog posts often remind me on problems they face... So I wonder if there is any active collaboration between game developers and developers in science, which face similar problems (I guess that would be particularly useful for the scientist, as they are not really known as good programmers xD, but they might have good ideas anyway)
Best regards
I need to apologize in advance as I am very bad in putting my thoughts into a half decent order, so I'll just put them here as they come and hope you can get something useful out of it. I am also not sure if the forum is the correct one.
Inspired to this thoughts I got by the FFF #421, in particular the multithreading discussion (possible that there was another multithreading post I don't recall now). Anyway... first I got a question: if you play multiplayer, does every player needs to compute the entire game, or is sufficient to have a super beefy host to do all the computations and the other players can be "thin clients" who just get the results and need to compute only graphics. I can imagine that it would be very bandwidth intense, if you stream the position of every single item to all clients... on the other hand, he needs only stuff he can actually "see". So in particularly with different planets it would be rather easy to "outsource" computations.
This brought me to an idea: because planets are decoupled quiet much, if should be possible (if above is implemented of course), to distribute the computational power among several hosts... so I was imagining an insanely huge map, big enough that each planet would need its own bare metal host, probably this has no "practical usecase", but... we are not playing this game for practical usecase, rather then the possibility of going completely insane in it...
To put some background... I work as theoretical physicist, and even though I am not working on simulations myself (I do more blackboard physics and low performance computations), I know a lot of people doing simulations and your blog posts often remind me on problems they face... So I wonder if there is any active collaboration between game developers and developers in science, which face similar problems (I guess that would be particularly useful for the scientist, as they are not really known as good programmers xD, but they might have good ideas anyway)
Best regards
Re: Multithreading shower thoughts
Every player computes the entire game. When you connect to a server, you get send a saved copy of the entire game at that point, and then use that as the basis to simulate everything that happens while you are connected. Only player input has to be sent over the network: everything in the game is deterministic and so every client calculates the same results given the same input (unless there is a bug in the game code or in a mod, causing a desync, which is detected by checksums and causes you to get disconnected).NUF wrote: Tue Aug 13, 2024 12:58 pm Anyway... first I got a question: if you play multiplayer, does every player needs to compute the entire game, or is sufficient to have a super beefy host to do all the computations and the other players can be "thin clients" who just get the results and need to compute only graphics.
Re: Multithreading shower thoughts
Exactly this client/server concept can be found with many (probably all) modern MMO games. It's efficient, works well and scales reasonably good to support 100 and more players on the same map. Network bandwidth isn't an issue.NUF wrote: Tue Aug 13, 2024 12:58 pm if you play multiplayer, does every player needs to compute the entire game, or is sufficient to have a super beefy host to do all the computations and the other players can be "thin clients" who just get the results and need to compute only graphics.
However, Factorio runs a completely different concept. Every client computes the same game state and exchanges just interactions such as keypresses. The most important reason for this is probably development time. Essentially, you develop a single player game, which is much less complex than a client/server game. For multiplayer, add sync user interactions across the net and ensure deterministic behavior, so exactly the same game state is computed with the same user interactions. You don't need to develop a client/server architecture.
This is also the reason why it isn't possible to change to a client/server design: this requires a complete redesign and rewrite of almost the whole game, which isn't feasible for obvious reasons.
Do you want 5 years development time for an expansion with new challenges, item and game mechanics, or do you want 5 years development time for a client/server architecture where, when it is being released, the game will look the same from a player perspective as the peer2peer design as we have now?
Re: Multithreading shower thoughts
Every client knows entire game state. This is because amount of data changing from tick to tick and frequency of updates would make it not possible to work in any other architecture over a real network.NUF wrote: Tue Aug 13, 2024 12:58 pm if you play multiplayer, does every player needs to compute the entire game, or is sufficient to have a super beefy host to do all the computations and the other players can be "thin clients" who just get the results and need to compute only graphics. I can imagine that it would be very bandwidth intense, if you stream the position of every single item to all clients...
Surfaces are not "decoupled quite much". Entities can raise lua events, lua can mutate anything on any other surface. Entities created on one surface will request allocation of unique unit numbers, items created will request allocation of unique item numbers, there is so much shared state that blindly saying "surfaces are decoupled" is just wrong.NUF wrote: Tue Aug 13, 2024 12:58 pmThis brought me to an idea: because planets are decoupled quiet much,
Re: Parrallel processing in games & applications
@BenSeldel only read the initial post, did you post a link to what you work on, it looks interesting outside of Factorio applications.
Re: Parrallel processing in games & applications
Unfortunately I didn't, and now I can't as the projects been shut down.BlueTrin wrote: Mon Sep 02, 2024 7:21 am @BenSeldel only read the initial post, did you post a link to what you work on, it looks interesting outside of Factorio applications.
If your interested, it was an experimental ERP system with a laundry list of features. The main one was that all the system functionality was defined in a database and were editable while the system was running. Because those definitions could be changed at any time, it meant that the entire application structure had to be fully dynamic so that all behaviour and state was consistent. This included any outstanding transactions as we were guaranteeing sequential consistency (there is an ordering of atomic transactions that will produce the observed state). Eg, if someone created a new field and marked it as required, all running applications would have to put that field on the screen and enforce that the value is there (as well as it being enforced by the database server).
This had to happen on both the client and the server, as we never wanted anything to be restarted (for minor logic changes).
The system had a couple of evolutions in its internal data propagation systems: from callbacks, to observables, to a monadic implementation of the observables (map & bind), and then finally to "signals" - An Implementation of monadic observables that guaranteed glichless propagation of updates in O(N) time, where N is the number of observables that needed to be updated. At the time it was a big break-through as there was only one glitchless implementation that we were aware of and it had O(U) update time (The ELM programming language). Since then there have been a couple of other glitchless implementations, but this was 15+ years ago now and I've not stayed up-to-date.
The data propagation system was called "Signals" because it was based on the ideas in the VHDL programming language and the wires (interconnects) that are inside a chip. The rules were simple: Each wire carries a signal. It must have a signal at all times (even if it's random, such as startup). There is a system in place to prevent invalid or incomplete signals from propagating to other components before it becomes stable/valid. Components were only executed when all its inputs were stable. (Stable in this context means fully executed).
While the Signals propagation system sat as the main execution engine of the ERP, its major benefit wasn't for its multi-threaded capabilities but for it's code understand ability. It massively simplified the entire code base because it removed all the interconnection logic. For example, instead of creating a Textbox then hooking up the observables, you would pass in a Signal<Integer> for its width, Signal<Text> for its text content, etc, to a constructing function that would create the control and the connection logic would be entirely self-contained within the Signals system.
Aside - I remember logging the maximum number of outstanding updates, even though we only ran one thread at the time, and it was about 10,000 for a simple form. So we (in theory) had 10,000 code blocks that had their inputs ready to be run. This was the test that showed me that we (as computer engineers) don't have any understanding of the level of concurrency that potentially exists in even the most simple of application.
It wasn't until later that we tried rewriting the execution engine to use multi threading. We did a bunch of stuff, such as writing out values on each thread sequentially to a thread only buffer, and at the end of each clock, discarded what was the previous clock's buffer(s) that were being read from (we called them the "input slate" and the "output slate"). We also marked nodes as "deterministic" and just ran nodes 2,3,or more times on each thread instead of latching the value as all the inputs were latched anyway. We also merged nodes together so their intermediate values were never latched, but stored in the running context and discarded when that context was completed. This node merging technique was part of my original post where I said you could "move" the inserter from being attached to the box to being attached to the belt.
But most of that was just for fun (read: we didn't get any significant time to write any of these implementations so there were always bugs). We did eventually replace the single threaded execution system with a multi threaded one, but that just used a basic work-stealing algorithm as it was really simple to implement. Unfortunately, even that was hard to get in, as the single threaded version was "good enough" for our requirements. Not to mention there was a lot of code that was hand-optimised to run well on a single core that slowly had to be removed.
In the end we could run some extremely large computations on many cores and the system would "just work". I did have a graph of the performance characteristics back in the day (I don't know if it's in this thread), but IIRC, after about 4-6 cores, the system would be faster than running on the one core and would continue to raise in performance up to the max we could test (64 cores on a 4 socket NUMA server). *This doesn't sound right as I don't recall each socket having 16 cores in it, but I'm 99% certain we tested it up to 64 cores*. This graph was the only reason the work stealing multi threaded version got in.
While the application was an ERP and not a game, the lessons learnt showed me that multi threading is possible and that most of the rhetoric out there is absolutely wrong. Hence the almost sarcastic listing of the different multi threaded "techniques" in the original post.