Factorio Forums

Posted: **Fri Jan 14, 2022 5:08 am**

Greets, been having some UPS issues in a bug-heavy game (with mods like rampant, but script times are remaining sub-1.2ms consistently), the in-game timings show that the Entity Update times are (depending on what the bugs are doing) between 15 to 35ms. I took a profile of the game while it was running and nicely enough a dwarf debug section exists in the executable so I got a nice mapping for the report. What I found was that the bugs are indeed doing a lot of work, over 47% of the total average frametime by themselves across both the bugs and their spawners (both are quite heavy in surprisingly roughly equal measures). However, among a lot of the calculations there is a single function called from multiple callstacks among the profile and this function by itself across all the calls is taking up ~16.2% of the total frametime, and that is `FixedPointNumberTemplate<int, 8u>::getDouble() const`. Now among my normal programming tasks I'd first look into seeing if anything is easily optimizable within it, perhaps a different algorithm (if my hunch about what it does based on its type and name is), and secondly might look into seeing how well it could be SIMD'd or so across what calls it (if it needs to be called so often to begin with), the latter would involve looking in a lot more places and the source code isn't open, so would it instead be possible to at least get the source for the `FixedPointNumberTemplate<int, 8u>::getDouble() const` (and whatever other potentially related functions if any) to see if I can work on optimizing it any as I get time? If there are tests for it I would like to see those as well to confirm any changes else I can write some.

Also, I'm curious, do the template arguments imply it's a 4-byte/32-bit `int` sized fixed point number with 8 after-fixed-point binary digits, and does its use really help over just using a 32-bit floating point type (especially since it will lack a lot of the most useful SIMD optimizations especially after it's more easily inlined as well), or is it primarily to use integer values so as to make the simulation more easily consistent (which floats can do as well, though are indeed more difficult to do right, especially on older hardware)?

Posted: **Fri Jan 14, 2022 9:14 am**

Code: Select all

template<class T, uint32_t PrecisionBits>
class FixedPointNumberTemplate
{
  // ...
  static constexpr T ValuesPerOne = T(1) << PrecisionBits;
  // ...
  constexpr double getDouble() const { constexpr double ValuePerIncrement = 1.0 / ValuesPerOne; return double(this->value) * ValuePerIncrement; }
  // ...
};

On Windows (so compiled by MSVC), it ends up being movd + cvtdq2pd + mulsd and is inlined everywhere I looked. Which makes me think your profiler is incorrectly attributing instructions to getDouble() as it can't figure out what belongs to what after inlining and link time optimizations.

EDIT: As for second part of the question ... fixed point number is used for position coordinates in order to have the same precision everywhere on the map.

Posted: **Fri Jan 14, 2022 4:49 pm**

posila wrote: Fri Jan 14, 2022 9:14 am On Windows (so compiled by MSVC), it ends up being movd + cvtdq2pd + mulsd and is inlined everywhere I looked. Which makes me think your profiler is incorrectly attributing instructions to getDouble() as it can't figure out what belongs to what after inlining and link time optimizations.

Even if it is inlined the DWARF section in the file (on linux) should still attribute it correctly as long as the code exists and didn't get split up, but it is indeed quite possible.

If that's the case then the same still holds for

Code: Select all

MapPosition::distanceSquared(MapPosition const&)

, of which the

Code: Select all

FixedPointNumberTemplate<int, 8u>::getDouble() const

makes up ~96.7% of the time of that called function and is the number one place where

Code: Select all

FixedPointNumberTemplate<int, 8u>::getDouble() const

is being called from according to the profiler by a substantial margin (over 98% of the time). And that function is primarily being called by

Code: Select all

Surface::listFriendlyEntities(MapPosition const&, ForceID, double, std::vector<EntityWithForce*>&, bool)

where it makes up ~80.6% of the runtime of that function. And 'that' function is primarily called (with near 100% of the runtime of it, so I'm guessing it's just a simple jump function after adding the bool argument to the end) from

Code: Select all

Surface::listFriendlyUnitsMapPosition const&, ForceID, double, std::vector<EntityWithForce*>&)

, and then this is called from quite a variety of places.

Hmm, actually I have a few minutes here, let me ssh home and run it remotely right quick so I can disassemble it properly...

Okay, so yes the

Code: Select all

MapPosition::distanceSquared(MapPosition const&)

function did get inlined into

Code: Select all

FixedPointNumberTemplate<int, 8u>::getDouble() const

which then *that* also got inlined into

Code: Select all

Surface::listFriendlyEntities(MapPosition const&, ForceID, double, std::vector<EntityWithForce*>&, bool)

. So, now I have the disassembled dump of apparently

Code: Select all

Surface::listFriendlyEntities(MapPosition const&, ForceID, double, std::vector<EntityWithForce*>&, bool)

, which contains quite a chunk of code, ~607 machine instructions

I've uploaded the 2.4gigs of 30 seconds of `perf` dump (as a highly compressed 21 meg file for ease of downloading) if you want to look at it (if you aren't familiar with perf dumps then there are a few useful GUI's for viewing its data dumps, like hotspot or so, though most IDE's are capable of viewing the data directly in very useful forms): https://overminddl1.com/Factorio/perf/p ... io.data.xz
Although you no doubt can get it yourself, I also uploaded the disassembly of that function (just from gdb, haven't loaded it up into IDA or anything like that yet) in text form to: https://overminddl1.com/Factorio/perf/S ... tities.txt

As you can see in the `perf` dump the

Code: Select all

Surface::listFriendlyEntities(MapPosition const&, ForceID, double, std::vector<EntityWithForce*>&, bool)

function takes up 20.8% of the total runtime, of which 2.9% is attributed to the function itself and most of the rest of it is attributed to

Code: Select all

MapPosition::distanceSquared(MapPosition const&)

that's been inlined (and thus also

Code: Select all

FixedPointNumberTemplate<int, 8u>::getDouble() const

that was inlined into

Code: Select all

distanceSquared

as well). Without having given structure to the disassembly yet either manually or via IDA pro or so it looks like there is a *lot* of looping, which is probably why distanceSquared/toDouble ranks so extremely very high as they are quite likely within those loops. This dump is from a multiplayer game where we've had to reduce the speed to about 20UPS to keep the other player (who's on a slower computer than mine) from getting extremely 'Behind' and eventually dropping as they can't keep up.

As an aside, the LUA GC is surprisingly heavy, and not doing much cleanup, I'm guessing it's just going over all the script objects from the mods but it's not finding much to clean up as not a whole lot of garbage is being made (but a lot of things are being cached and held purposefully in those mods). It's taking consistently well over 3ms of runtime per frame compared to the scripts execution taking 0.7ms to 1.2ms, but unsure how much that could be helped without turning down how much is incrementally GC'd each frame without more fundamental structural changes.

Posted: **Fri Jan 14, 2022 5:04 pm**

posila wrote: Fri Jan 14, 2022 9:14 ammovd + cvtdq2pd + mulsd

Interestingly GCC 11.2 compiles it to:

Code: Select all

  pxor xmm0, xmm0
  cvtsi2sd xmm0, edi
  mulsd xmm0, QWORD PTR .LC0[rip]
.LC0:
  .long 0
  .long 1064304640

And clang 13 compiles it to:

Code: Select all

.LCPI0_0:
  .quad 0x3f70000000000000 # double 0.00390625
  cvtsi2sd xmm0, edi
  mulsd xmm0, qword ptr [rip + .LCPI0_0]

EDIT: I wonder why MSVC is generating the packed 64-bit instructions instead of using the scalar 32-bit instructions like gcc and clang are, it could get rid of that useless movd if so, that seems like a bug in MSVC... The gcc version that factorio was compiled with for linux appears to use the scalar instruction properly so it doesn't have to do the useless movd, however it also seems like it's doing a useless pxor instead, seems like only clang gets it right... Lol.

Posted: **Fri Jan 14, 2022 5:13 pm**

OvermindDL1 wrote: Fri Jan 14, 2022 5:04 pm EDIT: I wonder why MSVC is generating the packed 64-bit instructions instead of using the scalar 32-bit instructions like gcc and clang are, it could get rid of that useless movd if so, that seems like a bug in MSVC... The gcc version that factorio was compiled with for linux appears to use the scalar instruction properly so it doesn't have to do the useless movd, however it also seems like it's doing a useless pxor instead, seems like only clang gets it right... Lol.

well, GCC, Clang, and MSVC all have very different compiler optimisations they're applied.

which mtune, O-level, or other CXXFLAGS are you using?

Posted: **Fri Jan 14, 2022 6:50 pm**

OvermindDL1 wrote: Fri Jan 14, 2022 4:49 pm ...

I can't download https://overminddl1.com/Factorio/perf/p ... io.data.xz, I get 403 Forbidden.

But, save would be more useful, so we can profile ourselves with tools that we are used to using and can experiment how changes affect the performance.

Anyway, listFriendlyEntities iterates over linked list of entities with force on a chunk, so distanceSquared() is working with values that are unlikely to be in CPU cache. But traversal of the linked list has the same problem, unless the CPU sees the pattern of reading position and then next pointer and prefetches the next pointer while waiting to get the position too.

Posting save to 17501 would be helpful, because in 1.1.51 there is a refactoring, that added here one more layer of indirection, so it is potentially lot slower.

Posted: **Fri Jan 14, 2022 7:03 pm**

ptx0 wrote: Fri Jan 14, 2022 5:13 pm which mtune, O-level, or other CXXFLAGS are you using?

Purely just -O3 for this test, nothing fancy.

posila wrote: Fri Jan 14, 2022 6:50 pm
OvermindDL1 wrote: Fri Jan 14, 2022 4:49 pm ...
I can't download https://overminddl1.com/Factorio/perf/p ... io.data.xz, I get 403 Forbidden.

Oh bugger all, didn't have xz files in the allow list, fixed.

posila wrote: Fri Jan 14, 2022 6:50 pm But, save would be more useful, so we can profile ourselves with tools that we are used to using and can experiment how changes affect the performance.

That it would, uploaded the latest multiplayer save (thankfully no where near as big as I was worried that save was going to be, only 69.3 megs) to: https://overminddl1.com/Factorio/perf/OverSE-E1-22.zip
(Edit: Also finished posting it to the requested thread at viewtopic.php?p=559892#p559892 ).

posila wrote: Fri Jan 14, 2022 6:50 pm Anyway, listFriendlyEntities iterates over linked list of entities with force on a chunk, so distanceSquared() is working with values that are unlikely to be in CPU cache. But traversal of the linked list has the same problem, unless the CPU sees the pattern of reading position and then next pointer and prefetches the next pointer while waiting to get the position too.

I did notice prefetches happening, but yeah it was pretty memory unfriendly, looked like an intrusive list, not something I see often anymore considering how fairly ubiquitous type chunked storage is nowadays.

posila wrote: Fri Jan 14, 2022 6:50 pm Save would be useful, because in 1.1.51 there is a refactoring, that added here one more layer of indirection, so it is potentially lot slower.

Slower? Oh if only I could get a hold of the code, it would be a large change but a form of type chunked storage would so very much speed things over repeated pointer indirections. ^.^;

Posted: **Sat Jan 15, 2022 1:30 am**

Looking at that save specifically around the things you mentioned:

during a given update 4 unique types of entities are found while iterating listFriendlyEntities on the chunks. Of which there are an average of 60,000 unique entities touched each tick through listFriendlyEntities an average of 120,000 times each tick (multiple are touched twice by different spawners near each other).

Virtually all of the time is spent loading in the memory for the MapPositions from the 60,000 unique entities. That's 8 bytes * 60,000 or 480,000 bytes just to touch each of the positions.

Since there are 4 unique entity types the logic does not know what type of entity it might get when iterating the chunk lists and since we need to keep addition and removal of an entity from these lists O(1) they can't be pooled into any kind of memory pool.

Posted: **Sat Jan 15, 2022 3:08 am**

you could just publish the portion of source for these methods like you did for the rail pathfinder and other bits.

Posted: **Sat Jan 15, 2022 3:12 am**

This function? https://gist.github.com/Rseding91/04a11 ... 925221ff47

Posted: **Sat Jan 15, 2022 5:27 am**

Rseding91 wrote: Sat Jan 15, 2022 1:30 am Since there are 4 unique entity types the logic does not know what type of entity it might get when iterating the chunk lists and since we need to keep addition and removal of an entity from these lists O(1) they can't be pooled into any kind of memory pool.

This is precisely the kind of issue typed chunk storage is supposed to solve (and there are known methods of encoding runtime data into the chunk 'types' to be able to perform things like game-related positional range checks faster as well for mostly static things, like nests/spawners), though again that would necessitate a bit larger of a structural change than I'd be comfortable attempting in factorio without a substantive amount of time and a lot of tests first (though if curious in such data structures for some light/heavy reading then might look at the flecs or similar libraries for C/C++ that encode the full type chunk storage patterns for as much speed as you can churn out of the CPU, especially with something like factorio and it's restricted types of its game objects).

A few possible hacks could potentially help though, depending on things like how readily you need to be able to insert 'inside' the list compared to just append and swap removing (removing one in the middle by popping the one on the end and moving it into the removed ones position), if that's allowed for that (still easy to keep consistent in a deadmans lockstep simulation) then could break them up into individual bag arrays for each type in the chunks and you'll have O(1) append and O(1) removal anywhere, and a potentially huge speed boost especially since code can finally be generated that knows the type that it will be operating over (much much nicer on memory especially with the linear access and known prefetching of pointers in that array and you get a lot of SIMD opportunities for a fairly minimal amount of code change, in comparison anyway). Maybe some other options as well.

Posted: **Sat Jan 15, 2022 10:36 am**

I measured the save in 1.1.51 and it is pretty much the same as in 1.1.50, so that's not bad

Posted: **Sat Jan 15, 2022 9:57 pm**

posila wrote: Sat Jan 15, 2022 10:36 am I measured the save in 1.1.51 and it is pretty much the same as in 1.1.50, so that's not bad

Indeed a very good sign considering the worry mentioned earlier, guess the CPU is able to prefetch that new change well or it gets optimized out or something. ^.^;

Factorio Forums

Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`

Re: Costly `FixedPointNumberTemplate<int, 8u>::getDouble() const`