Factorio Forums

Posted: **Sun Jan 29, 2017 3:36 pm**

orzelek wrote:One possible solution to allow for overall bigger Factories would be making them multi-surface. Factorissimo already does that on small scale.
It would allow for each surface to run on separate thread (should also help with memory bottlenecks - each core can have own cache and memory controller can handle more requests looking at stats given in this thread).
Then we would need some nice in game abstraction for this

Exactly that I explained in viewtopic.php?f=5&t=8670&start=10#p69213
But now I think RAM is the limiting bottleneck and so it is not possible/useful to go that way (increasing world by using more CPUs but everything on one computer) much further.

Something like caves/varying levels of terrain and then entities that would share power and transport items/fluids between those.

Yes, but imagine how it would be if the game can spawn a new surface, which is not on this host but computed ON ANOTHER HOST! A remote process call.

Quite plausible would be making terrain semi-3d. And when you go over to higher level you would switch surface. With ability to dig through lower levels and build ramps up/down plus pillars for levels above this would make game both 3d and nicely scalable for multithreading. Seems like win-win

Yes, that is one possibility of hundreds, no thousands!

hoho wrote:Biggest problem with connected worlds is how to ensure they all stay synchronized.

No. I thought through this very deep. We can follow here Einstein's relativity theory: Everything is relative.

Which means for Factorio: Two worlds/surfaces can have completely different speed. It doesn't matter, cause the exchange between the two worlds/surfaces is (must be!) event-based.

So it doesn't matter when I get an item from "the other side". If it's there, it's there. No need to synchronize both.

As it stands, each player simulates actions of all other players. If you have several different worlds running in a grid, how would you verify that they're all still running in same overall state? If one server slows down for whatever reason and tick rate drops <60, would others also have to drop lower? What if some items get sent from the faster server to slower one and due to lower tick rate it won't be able to accept them all?

Well, it's of course useful to have both worlds/surfaces synchronized, so that both run more or less with the same speed. But that is just an option.

Posted: **Mon Jan 30, 2017 11:23 am**

orzelek wrote:One possible solution to allow for overall bigger Factories would be making them multi-surface. Factorissimo already does that on small scale.
It would allow for each surface to run on separate thread (should also help with memory bottlenecks - each core can have own cache and memory controller can handle more requests looking at stats given in this thread).
Then we would need some nice in game abstraction for this

Something like caves/varying levels of terrain and then entities that would share power and transport items/fluids between those.

Quite plausible would be making terrain semi-3d. And when you go over to higher level you would switch surface. With ability to dig through lower levels and build ramps up/down plus pillars for levels above this would make game both 3d and nicely scalable for multithreading. Seems like win-win

That is just an extension of the chunks into the 3rd dimension. Splitting up the work to threads by hight is just as difficult doing it in X/Y. You need to synchonise things that cross the boundaries. Same problem, same solutions.

Posted: **Mon Jan 30, 2017 3:07 pm**

ssilk wrote:TL;DR
The most interesting part about multi-threading etc. is to make even bigger worlds. But instead of targeting to multi-threading this can be achieved also, by concentrating more into interconnecting game-worlds (servers), cause that is similar - if made clever: The size of a world can be doubled by connecting a second server to the first.

I can see that being a mod, with supporting communication software. Put an item in a box and it gets moved from server A to server B.

I was thinking of an intergalactic trading mod whereby you could by and sell items on a rng stock market, but do it in a meaningful way. Up till now the "but I can make everything" was in the way and I was thinking of having some form of resource sink such as machine degradation or maybe a mining tax. The whole linking of servers really is the last piece of the puzzle. If it was possible to link servers then this idea could be a lot of fun. You could "log on" and start trading all sorts of stuff, playing the market.

ssilk wrote:But now I think RAM is the limiting bottleneck and so it is not possible/useful to go that way (increasing world by using more CPUs but everything on one computer) much further.

It's not ram quantity, nor is it the memory bandwidth, it's the access patterns of the software. The maximum speed that memory can be transferred to the CPU is for sequential access. Current software (in general) does not get anywhere near the maximum. Sure running it across multiple cores may get you the added benefit of an additional memory controller or 2, but the real benefit is the utilization of additional prefetchers. They will increase the utilisation of your memory bandwidth. So unless you have an application that is able to pull above 60% of the memory bandwidth, making it multithreaded will increase your performance.

If you don't believe me, benchmark it yourself. It's easy enough to do. Create a massive array (couple of 100 megs, even up to a gig) and access it randomly (don't use a rng generator as it will affect your results, just do something like x = x^7 or something similar). You can play round with the test to get all sorts of memory throughput, right the way up to 100% for sequential access
Once you have your desired throughput, run the access pattern over two threads. It can be the same memory or separate memory, it does not matter. Your not after determinism so just do what is easy. Running it over two threads will increase your throughput in all cases but the sequential access case. Then run it over 4 or 8 threads (4 core, hyper threaded) and see what occurs (I'm not going to say what happens because here it becomes more complected with shared caches & context switches...YMMV).

I just wrote a really basic memory throughput tester, it's in java and it's not the best (took me about 4 minutes, but it gets the point across). It measures the time to access 50MB of an 800Mb array, either in sequential access or random in both a single thread and multiple threads.

sequential access:
1 thread: 88-91 ms
2 threads: 150-160 ms
4 threads: 180-190 ms

random access:
1 thread: 240 ms
2 threads: 1000-1500 ms
4 threads: 2500 ms
8 threads: 3600 ms

the JVM gets in the way a lot and I am doing synchronised calls to the console and I am doing string creation and.... etc, but the results (except for the 1 thread random to 2 thread random, something nasty is occurring there) are pretty much the results that you would see on any memory throughput benchmark. I ran the test on my laptop, and it's not the best example of modern technology.

Anyway, if you compare the transfer speeds of the sequential data they transfer 50mb: 1 thread every 90ms, 2 ever 75 ms and 4 threads every 50 ms. No it's not linear, but still a really nice improvement. Once you get into multiple processors and additional memory controllers then you should see a more linear relationship.

If any one has some better data then please post it. I would really like to see if there is an issue with the random access data or if there is some CPU mechanism that I am not seeing taking effect there.

Posted: **Tue Jan 31, 2017 2:28 am**

Same test, only on my desktop. This matches my expectations. Maybe the laptop was on power saving mode?
sequential:
1 thread - 225 ms
2 threads - 250 ms
4 threads - 250 ms

random access:
1 thread - 1065 ms
2 threads - 1107 ms
4 threads - 1300 - 1500 ms

remember that each thread reads 50 mb from the 800 mb array, so more threads means more data read (200 mb for 4 threads).

Posted: **Tue Jan 31, 2017 3:02 pm**

Can't you normalize the data and output MB/s or something? You numbers are pretty much meaningless and look quite the opposite of what you want to show.

Posted: **Tue Jan 31, 2017 10:39 pm**

mrvn wrote:Can't you normalize the data and output MB/s or something? You numbers are pretty much meaningless and look quite the opposite of what you want to show.

not to mention there's a big difference between read and write access

Posted: **Wed Feb 01, 2017 3:54 am**

BenSeidel wrote:If you don't believe me, benchmark it yourself. It's easy enough to do. Create a massive array (couple of 100 megs, even up to a gig) and access it randomly (don't use a rng generator as it will affect your results, just do something like x = x^7 or something similar). You can play round with the test to get all sorts of memory throughput, right the way up to 100% for sequential access

Microbenchmarks are actually very hard to design because you very often end up measuring (and optimizing!) for results that are completely pointless. But since you're so insistent on this point, I'll indulge. Here's my results (note that it's per-thread average throughput, not total throughput):

For 1 threads:
Sequential access did 8011.65MiB/s
Random access did 2338.25MiB/s
Pure cache access did 4798.9MiB/s
For 2 threads:
Sequential access did 7602.74MiB/s
Random access did 1231.57MiB/s [Hand annotation: this is probably anomalous core scheduling; other runs show on par with 1/4/8 thread times]
Pure cache access did 4795.83MiB/s
For 4 threads:
Sequential access did 4712.05MiB/s
Random access did 2496.04MiB/s
Pure cache access did 5076.86MiB/s
For 8 threads:
Sequential access did 2227.64MiB/s
Random access did 2360.71MiB/s [I've not bothered with enough runs to do a statistical test, but this number was higher than the one above for every run I saw]
Pure cache access did 2840.26MiB/s
For 16 threads:
Sequential access did 1132.05MiB/s
Random access did 1219.22MiB/s
Pure cache access did 1425.44MiB/s

Clearly, I have a pre-Haswell 4-core hyperthreaded machine. The results illustrate something important about hyperthreading (or the research name for it, SMT). The original idea behind SMT is that you take two cores and you share their functional units so that both cores can issue to as many of them as they need, allowing for better use of resources. As implemented as Intel HyperThreading, though, it's the same number of functional units as one core but two independent issue feeds. If you're CPU-bound, a 4-core HT machine will perform only as if it has 4 cores. But if you're not, if you're bound on memory latency, it actually acts like it has 8 cores.

Pure sequential access is only really useful if that's the only memory you're touching and the hardware prefetcher can detect the pattern well. If you've got enough work to hit all of the CPUs and keep SMT units busy while you're waiting on memory latency, there's not really any benefit to cache patterns.

Code for boring details

Code: Select all

#include <atomic>
#include <chrono>
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <thread>
#include <vector>

template <uint32_t Access(uint32_t)>
class AccessTest {
  uint32_t *memory;
  volatile uint32_t trash;
public:
  static const uint64_t allocSize = 1LL << 30;
  static const uint32_t sizeInMB = (allocSize * 4) >> 20;
  typedef std::chrono::duration<uint32_t, std::chrono::nanoseconds::period> ns;

  AccessTest() {
    memory = (uint32_t*)mmap(nullptr, allocSize * 4, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (uint32_t i = 0; i < allocSize; i++) {
      memory[i] = rand();
    }
  }
  ~AccessTest() {
    munmap((void*)memory, allocSize * 4);
  }

  ns accessPattern(uint32_t first) {
    uint32_t sum = 0;
    uint32_t index = first & (allocSize - 1);
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < allocSize; i++) {
      sum += memory[index];
      index = Access(index) & (allocSize - 1);
    }
    auto end = std::chrono::high_resolution_clock::now();
    trash = sum;
    return end - start;
  }

  double testNThreads(int N) {
    std::vector<std::thread> threads;
    std::atomic<uint32_t> nReady(0);
    std::atomic<uint64_t> totalTime(0);
    for (int i =0; i < N; i++) {
      threads.emplace_back([&](uint32_t start) {
          nReady++;
          while (nReady < N);
          totalTime.fetch_add(accessPattern(start).count());
      }, rand());
    }
    for (auto & t : threads)
      t.join();
    return totalTime / N;
  }

  static double getAverage(int N) {
    AccessTest<Access> base;
    return base.testNThreads(N);
  }
};

uint32_t sequential(uint32_t start) { return start + 1; }
uint32_t random(uint32_t x) {
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return x;
}
uint32_t zero(uint32_t start) { return random(start) & 1 + 1; }

double rateCalc(double timeInNS) {
  return AccessTest<sequential>::sizeInMB / timeInNS * 1e9;
}

int main() {
  for (int i = 1; i <= 16; i *= 2) {
    std::cout << "For " << i << " threads:\n";
    double seqTime = AccessTest<sequential>::getAverage(i);
    double rngTime = AccessTest<random>::getAverage(i);
    double zeroTime = AccessTest<zero>::getAverage(i);
    std::cout << "Sequential access did " << rateCalc(seqTime) << "MiB/s\n";
    std::cout << "Random access did " << rateCalc(rngTime) << "MiB/s\n";
    std::cout << "Pure cache access did " << rateCalc(zeroTime) << "MiB/s\n";
  }
}

Posted: **Wed Feb 01, 2017 6:56 am**

mrvn wrote:Can't you normalize the data and output MB/s or something? You numbers are pretty much meaningless and look quite the opposite of what you want to show.

I could have normalised them, but I wanted to show rates per core and not just rates overall. As there is a trade off for making something multithreaded (programmers time) it can be nice to look at how long it would take to process something when your are also processing something else. For example, if you split the game update loop and use the remaining cores, how will it affect the renderer loop? will that need to be split so that it gets enough throughput? Should you only go to 4 cores instead of 2? etc.
Anyway, normalised, all Mbit/sec, full data transfer into the chip:

(laptop)
sequential access:
1 thread: 4444
2 threads: 5000
4 threads: 8420

random access:
1 thread: 1666
2 threads: 533
4 threads: 640
8 threads: 507

(desktop)
sequential access:
1 thread: 1777
2 threads: 3200
4 threads: 6400

random access:
1 thread: 375
2 threads: 722
4 threads: 1066

jcranmer wrote:Microbenchmarks are actually very hard to design because you very often end up measuring (and optimizing!) for results that are completely pointless. But since you're so insistent on this point, I'll indulge. Here's my results (note that it's per-thread average throughput, not total throughput):

Sooo true, microbenchmarks are absolutely pointless as you can get them to say whatever you wish. The main reason I did not post my code is that it's shocking and doesn't do a bunch of stuff that you need to do in this sort of test, like synchronise the threads using a busy loop like you did. But when I saw an increase in performance by removing the modulo and using a bitmask, I knew I was in the ball-park. And that's all you want, a ball-park figure. In this case I don't believe that you really want to spend hours trying to get as much performance out of your benchmark as possible, we are not trying to sell the chip to someone. The point of the exercise is to try and look at "general" code and how it would behave if you could run it over multiple threads, which I believe both tests have done. Who cares that you have 3 exponent operations in one test and an addition on the other. If you swap the index bitmask with a modulo operation then you can see that yours is also in the ballpark.

I also love that you did cache access. Never thought of that.

I have also "normalised" your results as I have done with mine (personally, per core throughput with core count is far more useful, but total chip throughput is far more impressive).

For 1 threads:
Sequential access did 8011.65MiB/s
Random access did 2338.25MiB/s
Pure cache access did 4798.9MiB/s
For 2 threads:
Sequential access did 15204 MiB/s
Random access did 2462 MiB/s [Hand annotation: this is probably anomalous core scheduling; other runs show on par with 1/4/8 thread times]
Pure cache access did 9590 MiB/s
For 4 threads:
Sequential access did 18848 MiB/s
Random access did 9984 MiB/s
Pure cache access did 20304 MiB/s
For 8 threads:
Sequential access did 17816 MiB/s
Random access did 18880 MiB/s [I've not bothered with enough runs to do a statistical test, but this number was higher than the one above for every run I saw]
Pure cache access did 22720 MiB/s
For 16 threads:
Sequential access did 18112 MiB/s
Random access did 19504 MiB/s
Pure cache access did 22800 MiB/s

It's strange that the cache access was soooo much slower than the sequential access for one thread. In theory it should be much much faster, or at least be somewhat similar. I find it really surprising that running multiple threads closes the gap between sequential access and random access so dramatically, that is something I have never seen before as I have always found a marked improvement with the sequential reads over the random reads. Now I want to see if other machines exhibit this behaviour. If they do then it really blows the whole "we can make it faster by looking at our memory access" argument out of the water.

Admittedly I have not sampled many machines (8 in total) and they were not using a benchmarking tool like this but the application I have been writing. Those tests were for different memory layouts to see what ones performed the best on what core configuration and to spot areas that could be improved. In all just about all cases, memory locality made a massive difference when in single core mode (200%-300% faster), and less on a multicore mode (20-30% faster). But multicore mode still runs faster when compared to the single core mode.

I still need to look at what the laptop was doing, it may be a property of the newer chips as the desktop I am running is an i5-2xxx, so extremely old. But the results I got just don't seem to add up with my expectations. Hell, there may be CPU's out there where more threads doesn't equate to more bandwidth (in the average use case, not worst case), but I find in unbelievable that more threads would decrease the bandwidth as much as it has for my laptop.

Posted: **Wed Feb 01, 2017 7:32 am**

jcranmer, your cache bandwidth doesn't really seem to have too much to do with actual cache performance. Reading from L2 cache should be an order of magnitude greater, at around 300GB/s (see attached image).

I currently don't have access to a Linux machine so I can't run your code. I should also be working right now so I can't really come up with better code either

[edit]

I did spend some time from my work to get Stream running on my machine (i5 6300hq, a laptop CPU with 4 physical cores, using highest-bandwith RAM it officially supports at max theoretical memory bandwidth of ~34GB/s). Only changes I made to the code were to increase the array size by 5x but I don't think that changed results too much, if at all. I compiled both single-threaded and openmp versions with "-march=corei7-avx -fomit-frame-pointer -fexpensive-optimizations -O3 -m64 -fopenmp" flags.

Sadly, the code has to be changed to be able to get reasonable results for cache bandwidth and it also doesn't benchmark random access.

Single threaded results:

Code: Select all

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           21276.7     0.014817     0.007520     0.034593
Scale:          12765.1     0.015875     0.012534     0.019051
Add:            15957.5     0.025457     0.015040     0.041111
Triad:          14960.7     0.020332     0.016042     0.025568

Multithreaded (4 threads):

Code: Select all

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           22799.0     0.007911     0.007018     0.009525
Scale:          15199.2     0.012643     0.010527     0.017546
Add:            15957.5     0.017212     0.015040     0.021057
Triad:          17730.5     0.017324     0.013536     0.025566

In other words, going from 1->4 threads increased effective bandwidth by tiny amount.

Posted: **Wed Feb 01, 2017 8:08 am**

BenSeidel wrote:Sooo true, microbenchmarks are absolutely pointless as you can get them to say whatever you wish. The main reason I did not post my code is that it's shocking and doesn't do a bunch of stuff that you need to do in this sort of test, like synchronise the threads using a busy loop like you did. But when I saw an increase in performance by removing the modulo and using a bitmask, I knew I was in the ball-park. And that's all you want, a ball-park figure. In this case I don't believe that you really want to spend hours trying to get as much performance out of your benchmark as possible, we are not trying to sell the chip to someone. The point of the exercise is to try and look at "general" code and how it would behave if you could run it over multiple threads, which I believe both tests have done. Who cares that you have 3 exponent operations in one test and an addition on the other. If you swap the index bitmask with a modulo operation then you can see that yours is also in the ballpark.

You're missing the point. Compilers and systems can play lots of games with your code in the name of optimizations and speed. It is surprisingly common that an attempt to repeatedly measure a tight loop instead measures the time it takes to do nops--and convincing people that their test boiled down to a "you're testing nop speed" can be quite difficult. Microbenchmarks have at best a very, very narrow purpose, and using them to infer anything is fraught with error.

It's strange that the cache access was soooo much slower than the sequential access for one thread. In theory it should be much much faster, or at least be somewhat similar. I find it really surprising that running multiple threads closes the gap between sequential access and random access so dramatically, that is something I have never seen before as I have always found a marked improvement with the sequential reads over the random reads. Now I want to see if other machines exhibit this behaviour. If they do then it really blows the whole "we can make it faster by looking at our memory access" argument out of the water.

It's not at all strange, not if you look at the generated assembly code. (If you're ever serious about trying to microoptimize performance, reading assembly and drilling down to hardware perf counters is a matter of course). I didn't explain why it should be the case, since the explanation should be easily deduced just from looking at the code and guessing what happened. In the sequential case, the inner loop basically consists of an and, add, load, and add for each memory access; the index calculation for the subsequent memory load overlaps with the delay in a cache hit perfectly (actually, there's a pipeline stall for a single clock cycle if my timing recollections are correct). In contrast, the code for the cache access case ends up reusing the same index computations for both loads in the unrolled loop, so the second load is causing several clock cycles of latency delay, followed by some clock cycles where no cache load is being performed.

As for why the sequential access fell behind random access, my suspicion is that the hyperthreaded core scheduling screwed up the cache prefetching. The tests are nowhere near idempotent enough to isolate memory access patterns (I really can't be arsed trying to figure out equal-length, equal-critical-path assembly sequences that compute x + 1 and constant-value sequences as for the xorshift rng index calcuations).

jcranmer, your cache bandwidth doesn't really seem to have too much to do with actual cache performance. Reading from L2 cache should be an order of magnitude greater, at around 300GB/s.

The data size is 4 GB, well outside the size of L3 cache. The code is pretty much guaranteed to cause consistent L2 cache misses (and L3 for sure in case of sequential), since I'm iterating over the entire data size. The real test that's being shown is the power of the hardware cache prefetcher--and when it's not kicking in.

Posted: **Wed Feb 01, 2017 9:25 am**

jcranmer wrote:It is surprisingly common that an attempt to repeatedly measure a tight loop instead measures the time it takes to do nops

I didn't miss your point, I understand why they are dangerous, but if you have a reasonable understanding of the expected results you can make a quick and dirty one fairly quickly. You may not be testing exclusively what you want, but you can get meaningful results. That is why I know that the results from the laptop test are absolutely bogus (but I posted them anyway because that is what you do with data), and I know that the JVM's memory model is seriously screwing up the random access parts. Just apply the scientific method and you should be fine no matter how much your measuring tools blow up in your face

. And ALWAYS get a peer to review your data.

As a control, I check for changes in the timing with a modulo. They are notoriously slow and introducing one of them will slow your code down, if it does not then you are not running your loop (or you may not be running the modulo, either way you know that your code isn't doing what you want it to). It also gives you a really good idea of how much of your measurement is what you are measuring.

jcranmer wrote:It's not at all strange, not if you look at the generated assembly code

No doubt when you look at it in detail, but the "in theory" part refers to the orders of magnitude difference in speed between the cache levels and main memory. As hoho pointed out (with a nice diagram), the level 3 cache is extremely slow when compared to L1 cache. No matter how good the prefetcher is it cannot increase the bandwidth out to memory, only decrease the apparent latency. So I was expecting the cached test to be measuring the L1 cache speed and not the main memory speed. Hence I found it strange/unexpected. Although looking at it you should be performing more than 8 instructions per loop, so that should give you enough time to load the sequential data from main memory. Strange things are only strange from a distance.

jcranmer wrote:As for why the sequential access fell behind random access, my suspicion is that the hyperthreaded core scheduling screwed up the cache prefetching

Hyperthreading does indeed share the prefetcher, as well as all the other resources on the chip including branch prediction. The other resources don't matter too much, as they are more instruction-per-instruction. The historical units are brittle, in the same way the cache is (because cache is a historical unit).

However, you will only get a hardware switch when one process stalls (basically?) and that really means that the prefetcher has failed. I have never heard of a "prefetcher storm" being a thing, where a context swap cause the prefetcher to fail, causing another context switch... etc. But it may exist, lurking in the shadows, waiting... watching... Also, if it was a hyper threading issue then I would not expect it to appear so early (at 2 threads). Unless you are using a really old OS, the two threads should be scheduled on two different cores. OS's have been hyper thread aware since the early 2000's, well at least the servers, don't know when the desktops became "smart".

Anyway, the above ^^ paragraph is purely speculative (based on what I would expect). There is only one way to know and that is for you to go into your bios and turn it off.

hoho wrote: I should also be working right now so I can't really come up with better code either

Meh, ball-park guys... unless there is something really off, why bother? (and "coz I want to" is a valid answer

, Don't let me stop you)

edit:
I remember reading that hyper threading essentially boils down to register-renaming, can anyone confirm this?

Posted: **Wed Feb 01, 2017 9:32 am**

BenSeidel wrote:
hoho wrote: I should also be working right now so I can't really come up with better code either
Meh, ball-park guys... unless there is something really off, why bother? (and "coz I want to" is a valid answer , Don't let me stop you)

That's the problem - I *do* want to

I found another benchmark I'd like to get running: https://github.com/maleadt/pChase

That one seems to be much more useful for measuring "real-life" performance in scenarios with loads of random access.

Posted: **Wed Feb 01, 2017 9:47 am**

hoho wrote:I found another benchmark I'd like to get running: https://github.com/maleadt/pChase

Cool, I would like to see if a pointer read is any slower than an arithmetic generated version for random access. If the prefetcher is unable to work, how far can the out-of-order instruction execution go? Is it able to start the next load? And I don't mean a 2x slowdown, I want to see face-planting 486 speeds. In theory, a stall is a stall. If you are "chasing pointers" then it will stall until you get the next pointer, if you are generating the next address then at some point you must have saturated the pipeline, thereby stalling until the oldest "sum" has been completed.

Posted: **Thu Feb 02, 2017 9:43 am**

BenSeidel wrote:
hoho wrote:I found another benchmark I'd like to get running: https://github.com/maleadt/pChase
Cool, I would like to see if a pointer read is any slower than an arithmetic generated version for random access. If the prefetcher is unable to work, how far can the out-of-order instruction execution go? Is it able to start the next load? And I don't mean a 2x slowdown, I want to see face-planting 486 speeds. In theory, a stall is a stall. If you are "chasing pointers" then it will stall until you get the next pointer, if you are generating the next address then at some point you must have saturated the pipeline, thereby stalling until the oldest "sum" has been completed.

In ocaml anything not an primitive integer type (including bool, char, enums) is handled as pointer. If you have a structure that contains another structure that actually means it has a pointer to the other structure. And thinks like linked lists are absolutely unavoidable. In short everything non trivial is a pointer access. Despite all that ocaml code is, with the exception of heavy floating point use, about the same speed as plain C. And the ocaml compiler doesn't optimize much other than inline code.

Worrying about access patterns and saving a pointer here or there probably costs more time to modify the code than you safe in the long run. All of those optimizations are also linear. Maybe you get a 10% improvement for a loop that runs 0.001% of the time qby spending hours rewriting your code. You usually get far far more improvements by optimizing the algorithm itself. If you can change something from O(n*n) to O(n * log n) that will put you in a whole new category and only benefit more as the factory grows. Similar with multithreading. Get it right and 2 cores will be nearly twice as fast and for something like factorio you can do 16 cores at bascialy 16 times the speed given a large enough factory.

Posted: **Thu Feb 02, 2017 12:39 pm**

Hey guys, nice to see that you got a little past theory and turned to something real

I believe your benchmarks are comparing meaningless things. It is too low-level. What I think you need to compare is the performance of accessing linked lists and arrays. Looking up elements of a linked list, especially in a random-access fashion, requires far more pointers to be dereferenced than in an array. If the linked list then contains only references to objects, it becomes worse. Random accesses to arrays are very different, as they do not require traversal of the array to access a specific location in the first place.

It would be very helpful if the Factorio devs tell us something about their usage of data structures and the most common access patterns. The questions I have here are:

Which data structures are used and accessed the most, linked list, array / vector?
What type are the elements, fixed size structures (objects) or references to objects?
What size is the typical element?
What are the access patterns that happen most of the time, sequential traversal/access or random-access to a specific index?

Based on that information we can do benchmarks that are more representative.

I have some personal experience to share. I've once worked for a company where a product written in Java used a large in-memory database. The lead developer insisted on only using arrays and never Java lists (LinkedList or ArrayList) because it would be faster. I did some comparisons to check his claims and my results showed that using plain arrays was always faster, sometimes by an order of magnitude. Consider this and the fact that insertion and removal often requires copying the whole array to a new one. If you google for "vector vs list performance" you can find some peoples opinions and benchmarks on that topic (C++ related).

Posted: **Thu Feb 02, 2017 1:51 pm**

mrvn wrote:Get it right and 2 cores will be nearly twice as fast and for something like factorio you can do 16 cores at bascialy 16 times the speed given a large enough factory.

In case you didn't notice, devs already tried multithreading in their 0.15 branch. Going from 1->4 increased performance at sub-linear rate and that was best-case-scenario with using the old way of dealing with belts. Their new way makes belt handling tens of times faster meaning parallelization of the game won't even show as good results as that sub-linear rate they saw.

I know of no games where going beyond 2 cores has given linear speedups outside of purely aesthetic stuff like allowing more particle systems or blades of grass.

Amdahl's law is a bitch, basically

hansinator wrote:Based on that information we can do benchmarks that are more representative.

I'm fairly certain that factorio devs themselves have ran benchmarks on the actual code itself while trying various solutions. They have vastly better overview of where the bottlenecks exist and why. No amount of synthetic benchmarks are as useful as running a game session through VTune.

hansinator wrote:Consider this and the fact that insertion and removal often requires copying the whole array to a new one. If you google for "vector vs list performance" you can find some peoples opinions and benchmarks on that topic (C++ related).

Speed difference between linked lists and arrays VASTLY depends on how you use them. If you don't have to have the list sorted meaning you won't be inserting stuff randomly in the middle then it's pretty darn hard to write array-based thingy that's slower than linked list.

Similarly, if you require sorted list and are moving or adding stuff to the middle of the list/array but you do those additions/removals at vastly lower rate than iterating over it or searching from it, using arrays can easily be more effective even if each addition/removal requires moving around half the array.

Posted: **Thu Feb 02, 2017 4:48 pm**

hoho wrote:I'm fairly certain that factorio devs themselves have ran benchmarks on the actual code itself while trying various solutions.

They have vastly better overview of where the bottlenecks exist and why. No amount of synthetic benchmarks are as useful as running a game session through VTune.

It could also be that they ride pink elefants in their office. It's all speculation and we don't know for sure. This doesn't help us here, because lately this thread was discussing memory access patterns and people were posting synthetic benchmarks. In one of the previous posts from the devs it appeared like they had the supspicion that memory access patterns may slow down things but weren't sure. The parameters I was asking for would be helpful to design benchmarks that are not synthetic, at least with regard to memory access. The operations on the elements themselves are another story, but they are of less interest to the current discussion anyway. Mentioning a $900+ profiling tool doesn't help either. It doesn't help this discussion, nor would development benefit from it significantly more than from tools that are freely available. From my point of view your argumentation more or less translates to "you don't know what you are talking about and there is no way you can ever achieve something meaninful here". Which is something I don't agree to. So maybe we can continue discussing that memory stuff and do benchmarks as we like to do, please?

hoho wrote: Speed difference between linked lists and arrays VASTLY depends on how you use them. If you don't have to have the list sorted meaning you won't be inserting stuff randomly in the middle then it's pretty darn hard to write array-based thingy that's slower than linked list.

Similarly, if you require sorted list and are moving or adding stuff to the middle of the list/array but you do those additions/removals at vastly lower rate than iterating over it or searching from it, using arrays can easily be more effective even if each addition/removal requires moving around half the array.

That's what they told you in algorithms & data structures class based on books written in the late 90's. Nowadays we have memory speeds commonly approaching 50gb/s for sequential accesses. Thus copying around arrays is dirt-cheap. It will be fast even if you frequently insert stuff at random positions, as long as you don't do it as often as you do read accesses. As I suggested, search google and you will find loads of information on that topic.

Posted: **Thu Feb 02, 2017 8:02 pm**

hansinator wrote:Mentioning a $900+ profiling tool doesn't help either.

Using it helps the devs for sure and I'm quite certain that one time expense for vtune isn't exactly making the devs go broke.

hansinator wrote:From my point of view your argumentation more or less translates to "you don't know what you are talking about and there is no way you can ever achieve something meaninful here". Which is something I don't agree to. So maybe we can continue discussing that memory stuff and do benchmarks as we like to do, please?

No, my claims were that you'd need to know a LOT more about the inner workings of the game to design a meaningful test than just that simple list.

hansinator wrote:That's what they told you in algorithms & data structures class based on books written in the late 90's

Back in reality, that were the results I was getting when optimizing the forward looking sonar software our startup was working on that had to be able to crunch through a metric ton of data on a puny netbook with the weakest CPU on the market.

Also, you won't be seeing 50GB/s unless you're using Xeons and/or CPUs costing around $1000 with more than two memory channels.

hansinator wrote:Nowadays we have memory speeds commonly approaching 50gb/s for sequential accesses. Thus copying around arrays is dirt-cheap.

Huh? My entire point was that arrays *can* be faster than linked lists even if you have to add and remove stuff in the middle in certain scenarios. You're essentially agreeing with me.

Posted: **Sat Feb 04, 2017 2:04 pm**

What an interesting discussion, too bad I'm late to the party.

One thing I want to mention first: I love how open the devs are with respect to the code, algorithms etc. Please keep that up! I'm certain the game benefits in the end - and it's also educational for us.

There is one thing about parallelization that seems was not clearly mentioned yet:

Scientific simulation

Factorio is in essence a discrete simulation on a 2D grid. Most interactions are local. That makes it very similar to all kinds of scientific simulations, be it particle simulations, weather, computation fluid dynamics, you name it. In that sense it is ideal for a domain decomposition on sub-grids (e.g. chunks). I have said that already at the initial multithreading-FF and I am glad to see that the discussion is going in the right direction.

Now I'm not particularly sure about the 3x3 generation parallelization. For once I'm not really use I understand the approach correctly: I take it that chunks within the same generation are considered independent and can be executed by threads in parallel. I further assume this will lead to some irregularities at chunk borders, possibly changing the priority e.g. of inserters depending on the order of generations (in order to avoid double buffering).

My fear about this processing is, that it might loose a lot of data locality. In essence every border exchange will not be cached because the respective neighbor-chunk was/or will be processed far in the past/future. In order to reduce the amount of data that has to be exchanged, a decomposition can look like this:

Of course that won't work well if your megabase is in the top right corner, but then there are many load balancing techniques, for instance using a space-filling-curve and a workload metric to assign chunks:

Or keeping it even simpler, why process only complete rows of chunks in parallel and make only two generations: even and odd Y-numbers. That way, at least half of the data exchange is now between chunks that will be processed after each other.

In general, my feeling is that chunks are too small for dynamic scheduling (as Harkonnen also indicates), and larger subgrids would still expose enough parallelism. That said, I do recognize that chunks are probably an integral part of the general optimizations. BTW: I wouldn't call them single-threaded optimizations, just because they don't have anything to do with multithreading. A multithreaded Factorio can just as well benefit from those optimizations.

By the way, game of life was mentioned before and a interesting code example was shown. I'd like to show how this would be done with scientific computing methods, namely OpenMP:

Code: Select all

#pragma omp parallel
for (curgen = 0; curgen < gens_max; curgen++)
{
    #pragma omp for
    for (t = 0; t < nrows*ncols; t++)
    {
        int i = t / ncols;
        int j = t % ncols;
        const int inorth = mod (i-1, nrows);
        const int isouth = mod (i+1, nrows);
        const int jwest = mod (j-1, ncols);
        const int jeast = mod (j+1, ncols);

        const char neighbor_count = 
            BOARD (ingrid, inorth, jwest) + 
            BOARD (ingrid, inorth, j) + 
            BOARD (ingrid, inorth, jeast) + 
            BOARD (ingrid, i, jwest) +
            BOARD (ingrid, i, jeast) + 
            BOARD (ingrid, isouth, jwest) +
            BOARD (ingrid, isouth, j) + 
            BOARD (ingrid, isouth, jeast);

        BOARD(outgrid, i, j) = alivep (neighbor_count, BOARD (ingrid, i, j));
    }
    SWAP_BOARDS( outgrid, ingrid );
}

(by saeedn on stackoverflow.com)

I'm interested to hear what you think which approach is easier

. In any case, I don't say the Factorio devs are dumb/lazy/ignorant. In fact I consider them among the brightest programmers in the industry. However, I do see little bit of a tendency towards reinventing the wheel. And I don't think they should switch all code to OpenMP and PETSc. However, I can only encourage you to take a look at techniques used in scientific computing and learn from that. I have no idea how hard it will be to apply these techniques to the factorio codebase, but the problem lends itself very well to be a parallel simulation.

An interesting anecdote about game of life is, that the best optimization actually comes from hashing subgrids, which allows huge simulations. Unfortunately I don't think that is feasible with the slightly more complex ruleset of Factorio.

Brownout scenario

Power calculation is among the challenges when it comes to global effects. I do think this can be overcome: For instance changing the internal power to use binary on/off and allow single-tick energy debts/buffers. Some care has to be taken when merging/splitting electricity networks so that one cannot create free energy with combinator/power switches.

Memory optimization

It seems very much that there is still huge potential for optimization by using better memory layouts to improve locality. I would agree that the devs should focus on those. But while those improvements don't necessarily conflict with parallelization, it can be more effort to add parallelization as an aferthought.

Anyway I am very interested in taking a deeper look at Factorio performance, but I don't want to do that on 0.14. So I hope the experimental builds will soon be available again.

Oh and for multi-process-factorio, there is already a mod,.

Posted: **Sun Feb 05, 2017 3:17 pm**

TL;DR: Bigger "chunks" should result in lower CPU usage.

Zulan wrote: In general, my feeling is that chunks are too small for dynamic scheduling (as Harkonnen also indicates), and larger subgrids would still expose enough parallelism.

That is also my feeling.

I try to explain that also:

Code: Select all


   ______                           ______
  |      |                         |      |
  |chunk | -- interacts with -->   |chunk |
  |______|                         |______|


 Tiles to claculate inside "chunk": 9
 Border, that interacts with other "chunks": 12

Let's say a chunk would be only 3x3 tiles wide. Then it contains 9 tiles. But the chunk has four side á 3 tiles, which is a potential of 12 tiles that interacts with other chunks! This can be expressed in a ratio of inner tiles to outer:

Code: Select all

Participation probability =  Perimeter (outer) / Area (inner)

We can see this as a number of how tiles of the perimeter needs to be calculated versus tiles in the area.

In this case Participation probability = 12 / 9 = 1.33

For chunks with 4x4 this ratio is lower: 16 / 16 = 1

For 32x32 (as now): 128 / 1024 = 0.125

So we can say about 10% CPU of a chunk is needed for interaction with other chunks if chunksize is 32. That's is just a stupid assumption, but I think the sense can be understood. I think this is still too much, I think we need to come into an area of 1/100 instead of 1/10.

So for calculating this effetively we need a bigger "unit": 256 to 1024 tiles.

And when I compare that with "reality" it makes sense: 1024 tiles is the size of a small factory. Let's say that utilizes a CPU by 100%. So when you have two factories that should utitlize two CPUs plus the "exchange" between the two (the two players have a beltween their factories).

Factorio Forums

Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications

Re: Parrallel processing in games & applications