[0.17.56] Linux: long application startup due to sprite loading

Bigos · Post by **Bigos** » Sun Jul 14, 2019 11:56 am

Hi,

Since 0.17, I have been experiencing long application startup times. By long I mean a minute spent in "Loading sprites..." phase. I have worked it around by using the atlas cache (graphics/cache-sprite-atlas=true) but now I decided to come back to this issue to check what is going on.

I have profiled factorio using perf and it seems like most of the time is being spent in the kernel, in the page_fault routine. If perf backtraces are to be believed (and I am reading them correctly) these page faults come from the calculateSpriteCropRect function (inlined in AtlasSystem::loadSprites function). I have then generated another perf profile, this time to look for page faults explicitly.

As I am not sure if my perf profile contains sensitive data, I will just copy parts of it here (I can PM both perf reports, if desired):

CPU cycles perf report:

Code: Select all

  Children      Self  Command         Shared Object           Symbol
-   94.06%     0.76%  factorio        factorio                [.] AtlasSystem::loadSprites
   - 93.30% AtlasSystem::loadSprites
      - 91.38% calculateSpriteCropRect (inlined)
           91.23% page_fault
      + 1.49% AtlasSystem::loadSprite
   + 0.76% 0xffffffffffffffff
+   94.06%     0.00%  factorio        [unknown]               [k] 0xffffffffffffffff
+   94.06%     0.00%  factorio        factorio                [.] GlobalContext::init
+   94.06%     0.00%  factorio        factorio                [.] AtlasSystem::build
+   94.06%     0.00%  factorio        factorio                [.] AtlasSystem::tryLoadSpritesWithFallbackToMinimalMode
-   92.13%     0.00%  factorio        factorio                [.] calculateSpriteCropRect (inlined)
   - calculateSpriteCropRect (inlined)
        91.23% page_fault
+   91.49%    90.98%  factorio        [kernel.kallsyms]       [k] page_fault
     1.70%     1.70%  PreloadWorker   factorio                [.] preloadPng

Page faults perf report:

Code: Select all

  Children      Self  Command         Shared Object           Symbol
+   88.29%     0.00%  factorio        [unknown]               [.] 0xffffffffffffffff
+   88.29%     0.00%  factorio        factorio                [.] MainLoop::run
+   88.29%     0.00%  factorio        factorio                [.] GlobalContext::init
+   88.29%     0.00%  factorio        factorio                [.] AtlasSystem::build
+   88.29%    76.13%  factorio        factorio                [.] AtlasSystem::loadSprites
+   88.29%     0.00%  factorio        factorio                [.] AtlasSystem::tryLoadSpritesWithFallbackToMinimalMode
+   75.85%     0.00%  factorio        factorio                [.] calculateSpriteCropRect (inlined)
+    9.13%     0.00%  factorio        factorio                [.] AtlasSystem::loadSprite
+    8.67%     8.67%  factorio        libc-2.29.so            [.] __memmove_avx_unaligned_erms
+    8.26%     0.00%  factorio        factorio                [.] BitmapUtil::copyWithDuplicatedBorder (inlined)
+    8.20%     0.00%  factorio        factorio                [.] memcpy (inlined)
+    8.15%     0.00%  factorio        factorio                [.] operator() (inlined)
+    7.82%     0.00%  factorio        factorio                [.] BitmapUtil::copyWithDuplicatedBorderNoConversion
     7.10%     7.10%  PreloadWorker   libc-2.29.so            [.] __memmove_avx_unaligned_erms
     3.41%     3.41%  PreloadWorker   factorio                [.] preloadPng

I have looked at the annotated disassembly of AtlasSystem::loadSprites for the second report. All of the page faults are on read instructions, probably on memory held by MemoryBitmap (I see MemoryBitmap::getData() inlined).

Does MemoryBitmap by any chance hold huge amount of data? If so, you might want to enable transparent huge pages (THP), using the madvise() syscall [1]. I have performed an experiment and enabled THP on the whole system and it cut the factorio startup time in half (to half a minute). You can experiment with it yourself by writing (as the root) to the /sys/kernel/mm/transparent_hugepage/enabled file. By default (at least on my distro) THP is only enabled when the application asks for it using madvise(), by writing "always" to the sys file you can enable it on all of the memory.

(huge pages are also useful for reducing TLB misses)

Still, it is strange there are so many page faults involved. It might happen if there are a lot of allocations/deallocations and the memory is being returned to the kernel in the meantime.

[1] http://man7.org/linux/man-pages/man2/madvise.2.html

Post by **posila** » Sun Jul 14, 2019 12:37 pm

Interesting. Even half a minute is slow on your PC. From "Parallel Sprite Loader initialized" to "Sprites loaded" it should take less than 10 seconds.

Can you try to test if changing graphics/max-sprite-loading-threads to 8 and then to 1 makes any difference?

Bigos · Post by **Bigos** » Sun Jul 14, 2019 1:55 pm

That seems to be it. I have tested the startup time with various max-sprite-loading-threads values (with THP set to madvise-only, i.e. default).

The times are from the game launch (in steam) to the appearance of the main menu.

- max-sprite-loading-threads=1: 55s, 54s, 54s
- max-sprite-loading-threads=2: 30s, 29s, 29s
- max-sprite-loading-threads=4: 17s, 18s, 17s
- max-sprite-loading-threads=8: 12s, 11s, 10s
- max-sprite-loading-threads=16: 7.5s, 7.5s, 8s, 7.5s
- max-sprite-loading-threads=32: 66s, 116s, 60s, 80s, 68s

And here are interesting results:
- max-sprite-loading-threads=16 (right after 32 thread testing): 65s, 27s, 61s, 44s, 29s
- max-sprite-loading-threads=8 (right after the above 16 thread testing): 18s, 12s, 14s, 16s, 18s

Factorio bound to NUMA #1 (half CPU cores and memory available):
- max-sprite-loading-threads=8: 10s, 10s, 11s
- max-sprite-loading-threads=16: 8s, 8s, 8s

It seems there is some scalability issue present that is also non-deterministic. The higher thread count sprite loading seems to have higher variability. Maybe it depends on memory addresses, like there being some false sharing?

Note that my CPU is configured as two NUMA nodes - half of the cores have direct access to half of the memory. Looking at 16 thread results after 32 thread testing, the state of system memory (how much is allocated in each NUMA node, etc.) might have a huge effect on performance. My tests on a single NUMA node seem to confirm that.

Post by **Rseding91** » Sun Jul 14, 2019 2:09 pm

Thanks for the report. Based off Posilas comment and your reply + having a work around I'm going to say that your system configuration is to blame and move this to not-a-bug.

Our main Linux developer sits right behind me and he doesn't have the janky configuration you've got and experiences no such performance problems.

Post by **posila** » Sun Jul 14, 2019 4:23 pm

Based on loading stats from automatic crash logs, this doesn't seem to be generic problem (neither on Linux nor on Windows).

I am not sure if cross NUMA-node memory access would generate page fault; maybe the problem is not related to NUMA, but your RAM is used heavily or just OS is too eager to page out memory.

Factorio Forums

[0.17.56] Linux: long application startup due to sprite loading

[0.17.56] Linux: long application startup due to sprite loading

Re: [0.17.56] Linux: long application startup due to sprite loading

Re: [0.17.56] Linux: long application startup due to sprite loading

Re: [0.17.56] Linux: long application startup due to sprite loading

Re: [0.17.56] Linux: long application startup due to sprite loading