Page 1 of 1

[wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Feb 11, 2020 12:15 pm
by linux_zealot
Short description

While looking at the profile of Factorio on Linux I noticed that most hot functions contain lots of redundant stores and loads of floating point values. My guess that it is caused by using -ffloat-store compiler flag. My second guess is that -ffloat-store was added to make 32-bit binary produce the same floating-point results as 64-bit do. My understanding is that -ffloat-store doesn't change the results of 64-bit binary, but causes massive slowdown of floating point code (different sources report up to 2x slowdown).

Is it possible to remove -ffloat-store if it is present? If it is not present perhaps there is some other reason why generated code contains so many loads/stores.

Long description

Yesterday I tried to connect to an online game and what I got was that my computer wasn't fast enough to catch up with the game. I decided to profile it to see if something can be improved. I collected a perf profile. While looking at the top functions of the profile, I noticed quite bad codegen of function Math::sincosUnsafe. Here are the first few lines:

Code: Select all

  0,64 │       sub    $0xc0,%rsp
  0,93 │       movsd  %xmm0,-0x70(%rsp)    # store to some local variable
  8,43 │       movsd  -0x70(%rsp),%xmm0    # load from it
  0,13 │       movapd %xmm0,%xmm1
  1,06 │       subsd  .LC197,%xmm1
       │       movsd  %xmm1,-0x60(%rsp)    # store to another local variable
  0,28 │       movsd  -0x60(%rsp),%xmm2    # load from it
  0,27 │       unpcklpd %xmm0,%xmm2
  0,90 │       movaps %xmm2,0x98(%rsp)     # store to another local variable
  1,79 │       cvtpd2dq 0x98(%rsp),%xmm0   # load from it
  1,80 │       cvtdq2pd %xmm0,%xmm0
My first impression was like this:
1. Wow, SRoA (scalar replacement of aggregates) did very poor job here. I wonder if I could make a reprocase to report to GCC.
2. Stop! sincos shouldn't have any aggregates. What caused GCC to generate such a bad code then?
3. Aha, it must be some volatile variables they put here to get the same results on different compilers.
4. Then I looked at other hot functions and I saw exactly the same codegen even in function read_to_mixer_linear_float_32 from allegro. No way they would put volatile in sound mixing function in allegro. It must be some compiler flag that pessimize the whole program. What can it be?
5. I remember the -ffloat-store flag from x87 era to trim excessive precision of floating point operation done on x87 coprocessor. And after a quick check I verified that it pessimizes SSE codegen in similar way as seen in Factorio.

My understanding of -ffloat-store

In my understanding (I might be wrong) -ffloat-store is needed for 32-bit binaries when x87 fpu is used to trim the excessive precision of x87 operations. SSE operations don't have excessive precision and there is no need to store/load them to memory. Some libraries use -ffloat-store only when SSE is not available source. My guess -ffloat-store was added to Factorio when the game supported 32-bit mode. Now 32-bit is dropped and there is no need for -ffloat-store to be used.

FloatingPointMath page in GCC wiki especially says that -mfpmath=sse -msse2. And compiling for x64 implies both of these options.

If 32-bit compatibility is still needed, I would recommed either (1) leaving -ffloat-store only in 32-bit mode or (2) enabling -mfpmath=sse -msse2 on 32-bit. The second option would require early pentium4-class (since year 2000) machine to run the game, but I don't think this is too constraining today.

Final remark

It is possible that my understanding is wrong and removing -ffloat-store affects the results of floating point math on x64. If this is the case I would like to help troubleshooting the problem.

Re: Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Feb 11, 2020 12:31 pm
by linux_zealot
Windows note

I also disassembled the windows binary and compared selected hot functions. I can tell that windows binary is not affected by the same problem. The generated code is much nicer. This confirms that the problem is compiler-related. Also this might imply that windows binary of the game runs measurably faster.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Feb 11, 2020 8:50 pm
by movax20h
Yeah. Using

Code: Select all

-ffloat-store
might make sense in some code when using x87 code, but shouldn't be needed when using SSE2, which should be a default on the 64-bit anyway (SSE2 is mandatory on all x86 64-bit CPUs).

Definitively this compiler flag should be removed from 64-bity build. And considering there are no 32-bit support in Factorio anyway, just remove it.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Feb 11, 2020 9:17 pm
by movax20h
I see this problem in many other functions:

Code: Select all

_ZN13LowPassFilter7processEPfj:
...
  5f2e4e:       f2 0f 59 44 24 38       mulsd  0x38(%rsp),%xmm0
  5f2e54:       f2 0f 11 44 24 40       movsd  %xmm0,0x40(%rsp)
  5f2e5a:       f2 0f 10 44 24 40       movsd  0x40(%rsp),%xmm0
...

Code: Select all

_ZNK6Corpse15getSelectionBoxEv

  5f374a:       f2 0f 11 4c 24 10       movsd  %xmm1,0x10(%rsp)
  5f3750:       f2 0f 59 44 24 10       mulsd  0x10(%rsp),%xmm0
  5f3756:       f2 0f 11 44 24 18       movsd  %xmm0,0x18(%rsp)
...

Code: Select all

_ZNK20FlyingRobotPrototype23maximumPowerConsumptionEPK9ForceData
...
  5f6e07:       f2 0f 58 44 24 e0       addsd  -0x20(%rsp),%xmm0
  5f6e0d:       f2 0f 11 44 24 c0       movsd  %xmm0,-0x40(%rsp)
  5f6e13:       f2 0f 10 44 24 c0       movsd  -0x40(%rsp),%xmm0
  5f6e19:       f2 0f 59 44 24 f8       mulsd  -0x8(%rsp),%xmm0
  5f6e1f:       f2 0f 11 44 24 c8       movsd  %xmm0,-0x38(%rsp)
  5f6e25:       f2 0f 10 44 24 c8       movsd  -0x38(%rsp),%xmm0
  5f6e2b:       f2 0f 58 44 24 e8       addsd  -0x18(%rsp),%xmm0
  5f6e31:       f2 0f 11 44 24 d0       movsd  %xmm0,-0x30(%rsp)
  5f6e37:       f2 0f 10 44 24 d0       movsd  -0x30(%rsp),%xmm0
...

Code: Select all

_ZNK26AssemblingMachinePrototype19entityInfoIconScaleEv
...
  5f720d:       f2 0f 2a c0             cvtsi2sd %eax,%xmm0
...
  5f7225:       f2 0f 11 44 24 c8       movsd  %xmm0,-0x38(%rsp)
  5f722b:       f2 0f 10 44 24 c8       movsd  -0x38(%rsp),%xmm0
  5f7231:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5f7235:       f2 0f 11 44 24 d0       movsd  %xmm0,-0x30(%rsp)
  5f723b:       66 0f ef c0             pxor   %xmm0,%xmm0
  5f723f:       f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  5f7243:       f2 0f 11 44 24 d8       movsd  %xmm0,-0x28(%rsp)
  5f7249:       f2 0f 59 4c 24 d8       mulsd  -0x28(%rsp),%xmm1
  5f724f:       f2 0f 10 44 24 d0       movsd  -0x30(%rsp),%xmm0
  5f7255:       f2 0f 11 4c 24 e0       movsd  %xmm1,-0x20(%rsp)
  5f725b:       f2 0f 59 44 24 e0       mulsd  -0x20(%rsp),%xmm0
  5f7261:       f2 0f 11 44 24 e8       movsd  %xmm0,-0x18(%rsp)
  5f7267:       f2 0f 10 44 24 e8       movsd  -0x18(%rsp),%xmm0
  5f726d:       f2 0f 51 c0             sqrtsd %xmm0,%xmm0
  5f7271:       f2 0f 11 44 24 f0       movsd  %xmm0,-0x10(%rsp)
  5f7277:       f2 0f 10 44 24 f0       movsd  -0x10(%rsp),%xmm0
  ...

Code: Select all

_ZNK16FurnacePrototype19entityInfoIconScaleEv
...
  5f72dd:       f2 0f 2a c0             cvtsi2sd %eax,%xmm0
...
  5f72f5:       f2 0f 11 44 24 c0       movsd  %xmm0,-0x40(%rsp)
  5f72fb:       f2 0f 10 44 24 c0       movsd  -0x40(%rsp),%xmm0
  5f7301:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5f7305:       f2 0f 11 44 24 c8       movsd  %xmm0,-0x38(%rsp)
  5f730b:       66 0f ef c0             pxor   %xmm0,%xmm0
  5f730f:       f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  5f7313:       f2 0f 11 44 24 d0       movsd  %xmm0,-0x30(%rsp)
  5f7319:       f2 0f 59 4c 24 d0       mulsd  -0x30(%rsp),%xmm1
  5f731f:       f2 0f 10 44 24 c8       movsd  -0x38(%rsp),%xmm0
  5f7325:       f2 0f 11 4c 24 d8       movsd  %xmm1,-0x28(%rsp)
  5f732b:       f2 0f 59 44 24 d8       mulsd  -0x28(%rsp),%xmm0
  5f7331:       f2 0f 11 44 24 e0       movsd  %xmm0,-0x20(%rsp)
  5f7337:       f2 0f 10 44 24 e0       movsd  -0x20(%rsp),%xmm0
  5f733d:       f2 0f 51 c0             sqrtsd %xmm0,%xmm0
  5f7341:       f2 0f 11 44 24 e8       movsd  %xmm0,-0x18(%rsp)
  5f7347:       f2 0f 10 44 24 e8       movsd  -0x18(%rsp),%xmm0
...

Code: Select all

_ZNK16EditorController22isCorrectSpeedForScaleEdd
...
  5f75f4:       66 0f 2f 44 24 b8       comisd -0x48(%rsp),%xmm0
  5f75fa:       76 06                   jbe    5f7602 <_ZNK16EditorController22isCorrectSpeedForScaleEdd+0x52>
  5f75fc:       f2 0f 11 44 24 b8       movsd  %xmm0,-0x48(%rsp)
  5f7602:       f2 0f 10 44 24 b8       movsd  -0x48(%rsp),%xmm0
  5f7608:       f2 0f 5c 44 24 a0       subsd  -0x60(%rsp),%xmm0
  5f760e:       f2 0f 11 44 24 a8       movsd  %xmm0,-0x58(%rsp)
  5f7614:       f3 0f 7e 44 24 a8       movq   -0x58(%rsp),%xmm0
...
  5f7666:       f2 0f 59 44 24 c0       mulsd  -0x40(%rsp),%xmm0
  5f766c:       f2 0f 11 44 24 c8       movsd  %xmm0,-0x38(%rsp)
  5f7672:       f2 0f 10 44 24 d0       movsd  -0x30(%rsp),%xmm0
  5f7678:       f2 0f 5e 44 24 c8       divsd  -0x38(%rsp),%xmm0
...

Code: Select all

_ZN13HeatInterface18copyEntitySettingsERK22CopyEntitySettingsData
...
  5f77c1:       74 29                   je     5f77ec <_ZN13HeatInterface18copyEntitySettingsERK22CopyEntitySettingsData+0x5c>
  5f77c3:       f2 0f 10 80 d0 00 00    movsd  0xd0(%rax),%xmm0
  5f77ca:       00 
  5f77cb:       0f b6 80 d8 00 00 00    movzbl 0xd8(%rax),%eax
  5f77d2:       f2 0f 11 44 24 08       movsd  %xmm0,0x8(%rsp)
  5f77d8:       f2 0f 10 44 24 08       movsd  0x8(%rsp),%xmm0
...

Code: Select all

_ZN14NumberInputGui31valueFromSliderPositionInternalEdN22NumberInputGuiSettings5ScaleEd.part.0
...
  5f7cb0:       f2 0f 59 44 24 28       mulsd  0x28(%rsp),%xmm0
  5f7cb6:       f2 0f 11 44 24 30       movsd  %xmm0,0x30(%rsp)
  5f7cbc:       f2 0f 10 44 24 30       movsd  0x30(%rsp),%xmm0
...
  5f7d3a:       f2 0f 59 44 24 10       mulsd  0x10(%rsp),%xmm0
  5f7d40:       f2 0f 11 44 24 58       movsd  %xmm0,0x58(%rsp)
  5f7d46:       f2 0f 10 44 24 58       movsd  0x58(%rsp),%xmm0
...

Code: Select all

_ZNK14CliffPrototype13buildPositionERK11MapPosition9Directionb.part.0
...
  5f99fc:       66 0f ef c0             pxor   %xmm0,%xmm0
  5f9a00:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  5f9a04:       c1 e2 05                shl    $0x5,%edx
  5f9a07:       29 d0                   sub    %edx,%eax
  5f9a09:       31 d2                   xor    %edx,%edx
  5f9a0b:       41 f7 f0                div    %r8d
  5f9a0e:       f2 0f 11 44 24 a8       movsd  %xmm0,-0x58(%rsp)
  5f9a14:       f2 0f 10 44 24 a8       movsd  -0x58(%rsp),%xmm0
...
 5f9a3b:       f2 0f 11 44 24 b8       movsd  %xmm0,-0x48(%rsp)
  5f9a41:       f2 0f 10 44 24 b0       movsd  -0x50(%rsp),%xmm0
  5f9a47:       f2 0f 58 44 24 b8       addsd  -0x48(%rsp),%xmm0
  5f9a4d:       f2 0f 11 44 24 c0       movsd  %xmm0,-0x40(%rsp)
  5f9a53:       66 0f ef c0             pxor   %xmm0,%xmm0
  5f9a57:       f2 41 0f 2a c1          cvtsi2sd %r9d,%xmm0
  5f9a5c:       f2 0f 11 44 24 c8       movsd  %xmm0,-0x38(%rsp)
  5f9a62:       f2 0f 10 44 24 c0       movsd  -0x40(%rsp),%xmm0
  5f9a68:       f2 0f 58 44 24 c8       addsd  -0x38(%rsp),%xmm0
  5f9a6e:       f2 0f 11 44 24 d0       movsd  %xmm0,-0x30(%rsp)
  5f9a74:       f2 0f 10 44 24 a0       movsd  -0x60(%rsp),%xmm0
  5f9a7a:       f2 0f 58 44 24 d0       addsd  -0x30(%rsp),%xmm0
  5f9a80:       f2 0f 11 44 24 d8       movsd  %xmm0,-0x28(%rsp)
  5f9a86:       f2 0f 10 44 24 d8       movsd  -0x28(%rsp),%xmm0
  5f9a8c:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5f9a90:       f2 0f 11 44 24 e0       movsd  %xmm0,-0x20(%rsp)
...
...

Code: Select all

_ZNSt17_Function_handlerIFbRK15EntityWithForceEZN11Commandable15findDistractionERP6EntityEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
...
  5fbaa6:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  5fbaaa:       89 f2                   mov    %esi,%edx
  5fbaac:       2b 50 04                sub    0x4(%rax),%edx
  5fbaaf:       f2 0f 11 44 24 08       movsd  %xmm0,0x8(%rsp)
  5fbab5:       f2 0f 10 44 24 08       movsd  0x8(%rsp),%xmm0
  5fbabb:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5fbabf:       f2 0f 11 44 24 10       movsd  %xmm0,0x10(%rsp)
  5fbac5:       66 0f ef c0             pxor   %xmm0,%xmm0
  5fbac9:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  5fbacd:       f2 0f 11 44 24 18       movsd  %xmm0,0x18(%rsp)
  5fbad3:       f2 0f 10 44 24 18       movsd  0x18(%rsp),%xmm0
  5fbad9:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5fbadd:       f2 0f 11 44 24 20       movsd  %xmm0,0x20(%rsp)
  5fbae3:       f2 0f 10 44 24 10       movsd  0x10(%rsp),%xmm0
  5fbae9:       f2 0f 59 c0             mulsd  %xmm0,%xmm0
  5fbaed:       f2 0f 11 44 24 28       movsd  %xmm0,0x28(%rsp)
  5fbaf3:       f2 0f 10 44 24 20       movsd  0x20(%rsp),%xmm0
  5fbaf9:       f2 0f 59 c0             mulsd  %xmm0,%xmm0
  5fbafd:       f2 0f 11 44 24 30       movsd  %xmm0,0x30(%rsp)
  5fbb03:       f2 0f 10 44 24 28       movsd  0x28(%rsp),%xmm0
  5fbb09:       f2 0f 58 44 24 30       addsd  0x30(%rsp),%xmm0
...

Code: Select all

_ZNK15GhostController16getMovementSpeedEv
...
  5ff280:       f2 0f 11 44 24 f0       movsd  %xmm0,-0x10(%rsp)
  5ff286:       f2 0f 10 05 72 64 a2    movsd  0x1a26472(%rip),%xmm0        # 2025700 <.LC1>
  5ff28d:       01 
  5ff28e:       f2 0f 10 4c 24 f0       movsd  -0x10(%rsp),%xmm1
  5ff294:       66 0f 2f c8             comisd %xmm0,%xmm1
  5ff298:       76 1e                   jbe    5ff2b8 <_ZNK15GhostController16getMovementSpeedEv+0x68>
  5ff29a:       f2 0f 5e 44 24 f0       divsd  -0x10(%rsp),%xmm0
  5ff2a0:       f2 0f 11 44 24 f8       movsd  %xmm0,-0x8(%rsp)
  5ff2a6:       f2 0f 10 44 24 e8       movsd  -0x18(%rsp),%xmm0
  5ff2ac:       f2 0f 59 44 24 f8       mulsd  -0x8(%rsp),%xmm0
  5ff2b2:       f2 0f 11 44 24 e8       movsd  %xmm0,-0x18(%rsp)
  5ff2b8:       f2 0f 10 44 24 e8       movsd  -0x18(%rsp),%xmm0
...

Code: Select all

_ZN17GameActionHandler10moveOnZoomERK11InputActionP10Controller
...
  60879b:       f2 0f 59 c8             mulsd  %xmm0,%xmm1
  60879f:       f2 0f 11 4c 24 10       movsd  %xmm1,0x10(%rsp)
  6087a5:       f2 0f 10 4b 10          movsd  0x10(%rbx),%xmm1
...

Code: Select all

_ZL22stbir__encode_scanlineP11stbir__infoiPvPfiii.isra.0
...
  6108a5:       66 0f ef c0             pxor   %xmm0,%xmm0
  6108a9:       f3 0f 5a 44 24 74       cvtss2sd 0x74(%rsp),%xmm0
  6108af:       f2 0f 11 84 24 40 01    movsd  %xmm0,0x140(%rsp)
  6108b6:       00 00 
  6108b8:       f2 0f 10 05 28 4f a1    movsd  0x1a14f28(%rip),%xmm0        # 20257e8 <.LC131>
  6108bf:       01 
  6108c0:       f2 0f 59 84 24 40 01    mulsd  0x140(%rsp),%xmm0
  6108c7:       00 00 
  6108c9:       f2 0f 11 84 24 38 01    movsd  %xmm0,0x138(%rsp)
  6108d0:       00 00 
  6108d2:       f2 0f 10 05 56 4e a1    movsd  0x1a14e56(%rip),%xmm0        # 2025730 <.LC9>
  6108d9:       01 
  6108da:       f2 0f 58 84 24 38 01    addsd  0x138(%rsp),%xmm0
  6108e1:       00 00 
  6108e3:       f2 0f 11 84 24 30 01    movsd  %xmm0,0x130(%rsp)
  6108ea:       00 00 
  6108ec:       f2 48 0f 2c b4 24 30    cvttsd2si 0x130(%rsp),%rsi
  6108f3:       01 00 00 
...
  61094a:       0f 2f 05 73 ad 79 01    comiss 0x179ad73(%rip),%xmm0        # 1dab6c4 <.LC0>
  610951:       b8 ff ff ff 7f          mov    $0x7fffffff,%eax
  610956:       77 53                   ja     6109ab <_ZL22stbir__encode_scanlineP11stbir__infoiPvPfiii.isra.0+0xbcb>
  610958:       66 0f ef c0             pxor   %xmm0,%xmm0
  61095c:       f3 0f 5a 84 24 94 00    cvtss2sd 0x94(%rsp),%xmm0
  610963:       00 00 
  610965:       f2 0f 11 84 24 00 02    movsd  %xmm0,0x200(%rsp)
  61096c:       00 00 
  61096e:       f2 0f 10 05 72 4e a1    movsd  0x1a14e72(%rip),%xmm0        # 20257e8 <.LC131>
  610975:       01 
  610976:       f2 0f 59 84 24 00 02    mulsd  0x200(%rsp),%xmm0
  61097d:       00 00 
  61097f:       f2 0f 11 84 24 08 02    movsd  %xmm0,0x208(%rsp)
  610986:       00 00 
  610988:       f2 0f 10 05 a0 4d a1    movsd  0x1a14da0(%rip),%xmm0        # 2025730 <.LC9>
  61098f:       01 
  610990:       f2 0f 58 84 24 08 02    addsd  0x208(%rsp),%xmm0
  610997:       00 00 
  610999:       f2 0f 11 84 24 10 02    movsd  %xmm0,0x210(%rsp)
  6109a0:       00 00 
  6109a2:       f2 0f 2c 84 24 10 02    cvttsd2si 0x210(%rsp),%eax
...
and many many more in various forms.

Sometimes the xmm register is stored on stack and then read immediately or few instructions later, even if there are plenty of free / unused registers available. Sometimes it is read few instructions later into same or different register, or used directly as one of the arguments of the other instruction (i.e. addpd), despite the value not needing to be used again or returned from the function or stored in global or thread local variables.

It is BAD.

Not only it is slower, it probably explains why the Linux binary is significantly bigger than the Windows one.

linux: bin/x64/factorio after manual strip: 35704288 bytes
windows: bin/x64/factorio.exe : 25807872 bytes

And that will also negatively impact cache and instruction cache pressure.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Feb 11, 2020 11:41 pm
by movax20h
I also checked the headless server binary, and it looks to also to be affected same way, but I did find significantly less of these patterns. However when looking at the same same functions I reported in previous post, and they exhibit same issues in general of redundant store-load.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Fri Feb 14, 2020 12:02 pm
by triffid_hunter
Ooh more potential performance improvements for Linux? nice!

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Tue Apr 28, 2020 3:50 pm
by wheybags
Bang on technical analysis, ffloat-store will be disabled in the next release.
It made quite a difference too:

Code: Select all

Before:
  Performed 1000 updates in 20680.999 ms
  avg: 20.681 ms, min: 16.671 ms, max: 65.007 ms

After:
  Performed 1000 updates in 15265.129 ms
  avg: 15.265 ms, min: 13.098 ms, max: 59.362 ms
Thanks a lot!

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Wed Apr 29, 2020 3:07 pm
by linux_zealot
Thank you for the fix. Can't wait to try my new 35% faster factorio (35% if I did the math right).

BTW, have the removal of -ffloat-store changed the result of floating point operations? Shortly after I reported this bug I did some fuzzing of gcc arithmetic expressions and I haven't found any instances of this when -mfpmath=sse -msse2 are enabled. Which is reasonable because all it does are additional loads and stores that don't affect the values. I wonder if gcc should just ignore this flag when it doesn't affect results.

Also do you have any numbers how gcc-compiled and msvc-compiled (you use msvc when targeting windows, right?) binaries perform relative to each other? I'm just curious what are the numbers before and after this change.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Wed Jun 17, 2020 6:24 pm
by movax20h
I can confirm this is fixed now.

Example:

Code: Select all

000000000062b120 <_ZNK15GhostController16getMovementSpeedEv>:
  62b120:       80 bf 98 00 00 00 00    cmpb   $0x0,0x98(%rdi)
  62b127:       66 0f ef c0             pxor   %xmm0,%xmm0
  62b12b:       74 37                   je     62b164 <_ZNK15GhostController16getMovementSpeedEv+0x44>
  62b12d:       48 8b 47 28             mov    0x28(%rdi),%rax
  62b131:       f2 0f 10 87 90 00 00    movsd  0x90(%rdi),%xmm0
  62b138:       00 
  62b139:       48 8b 40 18             mov    0x18(%rax),%rax
  62b13d:       48 8b 80 38 01 00 00    mov    0x138(%rax),%rax
  62b144:       48 85 c0                test   %rax,%rax
  62b147:       74 1b                   je     62b164 <_ZNK15GhostController16getMovementSpeedEv+0x44>
  62b149:       f2 0f 10 50 40          movsd  0x40(%rax),%xmm2
  62b14e:       f2 0f 10 0d 2a 47 ad    movsd  0x1ad472a(%rip),%xmm1        # 20ff880 <.LC1>
  62b155:       01 
  62b156:       66 0f 2f d1             comisd %xmm1,%xmm2
  62b15a:       76 08                   jbe    62b164 <_ZNK15GhostController16getMovementSpeedEv+0x44>
  62b15c:       f2 0f 5e ca             divsd  %xmm2,%xmm1
  62b160:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  62b164:       c3                      retq   
  62b165:       90                      nop
  62b166:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  62b16d:       00 00 00 

Also it looks way more code looks to be vectorized, like this:

Code: Select all

000000000063b8b0 <_ZL22stbir__encode_scanlineP11stbir__infoiPvPfiii.isra.0>:
...
  63bb8c:       0f 1f 40 00             nopl   0x0(%rax)
  63bb90:       83 fa 06                cmp    $0x6,%edx
  63bb93:       0f 86 d8 10 00 00       jbe    63cc71 <_ZL22stbir__encode_scanlineP11stbir__infoiPvPfiii.isra.0+0x13c1>
  63bb99:       31 f6                   xor    %esi,%esi
  63bb9b:       66 45 0f ef c0          pxor   %xmm8,%xmm8
  63bba0:       0f 10 0c 70             movups (%rax,%rsi,2),%xmm1
  63bba4:       44 0f 10 4c 70 10       movups 0x10(%rax,%rsi,2),%xmm9
  63bbaa:       0f 28 d5                movaps %xmm5,%xmm2
  63bbad:       44 0f 28 d5             movaps %xmm5,%xmm10
  63bbb1:       0f 28 d9                movaps %xmm1,%xmm3
  63bbb4:       45 0f 28 d9             movaps %xmm9,%xmm11
  63bbb8:       44 0f 28 f9             movaps %xmm1,%xmm15
  63bbbc:       41 0f c2 d8 05          cmpnltps %xmm8,%xmm3
  63bbc1:       45 0f c2 d8 05          cmpnltps %xmm8,%xmm11
  63bbc6:       45 0f 28 e1             movaps %xmm9,%xmm12
  63bbca:       0f c2 d1 05             cmpnltps %xmm1,%xmm2
  63bbce:       45 0f c2 d1 05          cmpnltps %xmm9,%xmm10
  63bbd3:       45 0f 59 fe             mulps  %xmm14,%xmm15
  63bbd7:       45 0f 59 e6             mulps  %xmm14,%xmm12
  63bbdb:       66 45 0f db d3          pand   %xmm11,%xmm10
  63bbe0:       66 0f db d3             pand   %xmm3,%xmm2
  63bbe4:       66 0f 6f c2             movdqa %xmm2,%xmm0
  63bbe8:       66 41 0f 61 d2          punpcklwd %xmm10,%xmm2
  63bbed:       66 41 0f 69 c2          punpckhwd %xmm10,%xmm0
  63bbf2:       66 44 0f 6f d2          movdqa %xmm2,%xmm10
  63bbf7:       66 44 0f 69 d0          punpckhwd %xmm0,%xmm10
  63bbfc:       66 0f 61 d0             punpcklwd %xmm0,%xmm2
  63bc00:       41 0f 12 f7             movhlps %xmm15,%xmm6
  63bc04:       41 0f 5a c7             cvtps2pd %xmm15,%xmm0
  63bc08:       66 41 0f 61 d2          punpcklwd %xmm10,%xmm2
  63bc0d:       66 0f 58 c4             addpd  %xmm4,%xmm0
  63bc11:       44 0f 5a d6             cvtps2pd %xmm6,%xmm10
  63bc15:       66 44 0f 58 d4          addpd  %xmm4,%xmm10
  63bc1a:       41 0f 12 fc             movhlps %xmm12,%xmm7
  63bc1e:       66 45 0f e6 d2          cvttpd2dq %xmm10,%xmm10
  63bc23:       66 0f e6 c0             cvttpd2dq %xmm0,%xmm0
  63bc27:       66 41 0f 6c c2          punpcklqdq %xmm10,%xmm0
  63bc2c:       45 0f 5a d4             cvtps2pd %xmm12,%xmm10
  63bc30:       44 0f 5a e7             cvtps2pd %xmm7,%xmm12
  63bc34:       66 44 0f 58 d4          addpd  %xmm4,%xmm10
  63bc39:       66 44 0f 58 e4          addpd  %xmm4,%xmm12
  63bc3e:       66 45 0f e6 d2          cvttpd2dq %xmm10,%xmm10
  63bc43:       66 45 0f e6 e4          cvttpd2dq %xmm12,%xmm12
  63bc48:       66 45 0f 6c d4          punpcklqdq %xmm12,%xmm10
  63bc4d:       66 44 0f 6f e0          movdqa %xmm0,%xmm12
  63bc52:       66 41 0f 61 c2          punpcklwd %xmm10,%xmm0
  63bc57:       66 45 0f 69 e2          punpckhwd %xmm10,%xmm12
  63bc5c:       66 44 0f 6f d0          movdqa %xmm0,%xmm10
  63bc61:       66 41 0f 61 c4          punpcklwd %xmm12,%xmm0
  63bc66:       66 45 0f 69 d4          punpckhwd %xmm12,%xmm10
  63bc6b:       66 41 0f 61 c2          punpcklwd %xmm10,%xmm0
  63bc70:       44 0f 28 d5             movaps %xmm5,%xmm10
  63bc74:       44 0f c2 d1 01          cmpltps %xmm1,%xmm10
  63bc79:       41 0f c2 c8 01          cmpltps %xmm8,%xmm1
  63bc7e:       66 0f db c2             pand   %xmm2,%xmm0
  63bc82:       66 41 0f db da          pand   %xmm10,%xmm3
  63bc87:       44 0f 28 d5             movaps %xmm5,%xmm10
  63bc8b:       66 41 0f df cd          pandn  %xmm13,%xmm1
  63bc90:       45 0f c2 d1 01          cmpltps %xmm9,%xmm10
  63bc95:       45 0f c2 c8 01          cmpltps %xmm8,%xmm9
  63bc9a:       66 45 0f db da          pand   %xmm10,%xmm11
  63bc9f:       66 44 0f 6f d3          movdqa %xmm3,%xmm10
  63bca4:       66 45 0f df cd          pandn  %xmm13,%xmm9
  63bca9:       66 41 0f 61 db          punpcklwd %xmm11,%xmm3
  63bcae:       66 45 0f 69 d3          punpckhwd %xmm11,%xmm10
  63bcb3:       66 44 0f 6f db          movdqa %xmm3,%xmm11
  63bcb8:       66 41 0f 61 da          punpcklwd %xmm10,%xmm3
  63bcbd:       66 45 0f 69 da          punpckhwd %xmm10,%xmm11
  63bcc2:       66 44 0f 6f d1          movdqa %xmm1,%xmm10
  63bcc7:       66 41 0f 61 c9          punpcklwd %xmm9,%xmm1
  63bccc:       66 45 0f 69 d1          punpckhwd %xmm9,%xmm10
  63bcd1:       66 44 0f 6f c9          movdqa %xmm1,%xmm9
  63bcd6:       66 41 0f 61 db          punpcklwd %xmm11,%xmm3
  63bcdb:       66 45 0f 69 ca          punpckhwd %xmm10,%xmm9
  63bce0:       66 41 0f 61 ca          punpcklwd %xmm10,%xmm1
  63bce5:       66 44 0f 6f fb          movdqa %xmm3,%xmm15
  63bcea:       66 41 0f 61 c9          punpcklwd %xmm9,%xmm1
  63bcef:       66 44 0f df f9          pandn  %xmm1,%xmm15
  63bcf4:       66 41 0f eb df          por    %xmm15,%xmm3
  63bcf9:       66 0f df d3             pandn  %xmm3,%xmm2
  63bcfd:       66 0f eb c2             por    %xmm2,%xmm0
  63bd01:       0f 11 04 37             movups %xmm0,(%rdi,%rsi,1)
  63bd05:       48 83 c6 10             add    $0x10,%rsi
  63bd09:       4c 39 de                cmp    %r11,%rsi
  63bd0c:       0f 85 8e fe ff ff       jne    63bba0 <_ZL22stbir__encode_scanlineP11stbir__infoiPvPfiii.isra.0+0x2f0>
  ...
[code]

This is simply beautiful. This makes me happy :)

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Wed Jun 17, 2020 6:27 pm
by movax20h
linux_zealot wrote: Wed Apr 29, 2020 3:07 pm Thank you for the fix. Can't wait to try my new 35% faster factorio (35% if I did the math right).

BTW, have the removal of -ffloat-store changed the result of floating point operations? Shortly after I reported this bug I did some fuzzing of gcc arithmetic expressions and I haven't found any instances of this when -mfpmath=sse -msse2 are enabled. Which is reasonable because all it does are additional loads and stores that don't affect the values. I wonder if gcc should just ignore this flag when it doesn't affect results.

Also do you have any numbers how gcc-compiled and msvc-compiled (you use msvc when targeting windows, right?) binaries perform relative to each other? I'm just curious what are the numbers before and after this change.
`-ffloat-store` doesn't change any results in this case, because the results in register are the same as in they would be stored and loaded back. amd64 (aka x86_64) uses sse2 by default for math. You don't need to add `-mfpmath=sse -msse2`. The `-ffloat-store` was a hack to force 64-bit doubles or 32-bit float when using 387 (x87) which by default operates on 80-bit extended precision floating points numbers. This only really apply to some i386 (32-bit) codes, and older stuff that doesn't use SSE2.

I guess, `-ffloat-store` should be ignored when using gcc and SSE2, but there might be some weird use cases. I suspect for example that when using AVX2 or FMA instruction, it will do a double rounding, instead of one, when doing operation like `a*b + c`. So there will still be a difference. But without FMA the result is probably going to be exactly the same.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Wed Jun 17, 2020 7:08 pm
by movax20h
linux_zealot wrote: Wed Apr 29, 2020 3:07 pm I wonder if gcc should just ignore this flag when it doesn't affect results.
I talked to some GCC devs about it, and they kind of agree that -ffloat-store maybe could be ignored on x86_64 (and some other flags), but the thing is even on x86_64 one can still use x87 (because some people port old stuff that expect 80-bit floats in some situations, but also rounding in some other places, and that compatibility unfortunatly need to be preserved for correct behavior; or when asking to use `long double`), and that is hard to encode to only use -ffloat-store for x87, and not non-x87 code. It also has strange and complex interactions with other flags. It is not just a matter of ignoring the flag, but actually using or not using it depending on other flags and generated code down the stack in the middle and backend. That would make a hard to explain to users and leads to people probably complaining why `-ffloat-store` is not doing anything.

It technically is possible to implement in gcc, and they are open to patches tho to make it work better.

But in the end it is a user error to really use `-ffloat-store` on x86_64. At least 99% of the time.

Re: [wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)

Posted: Wed Jun 17, 2020 7:26 pm
by movax20h
The binary size is still somehow bigger on Linux, not sure why:

Code: Select all

factorio_0.18.32$ ls -al
-rw-rw-r-- 1 user user  28299264 Jun 16 10:40 factorio.exe
-rwxr-xr-x 1 user user  36932256 Jun 17 19:20 factorio.stripped

-rwxr-xr-x 1 user user 194706448 Jun 16 08:50 factorio.orig
Compared to previous windows binary this version is about 2.5MB bigger. And compared to previous Linux binary this version is about 1.2MB bigger. So I guess considering floating point operations are only a fraction of all code, that makes sense.

The rest of the size difference is probably due to different optimizations and inlining.

And really few MB difference isn't probably a big deal for download. But I am still a bit worried about instruction cache pressure.

One would need to do a bit more benchmarks and maybe analyze with Linux perf to see instruction cache misses.