[wheybags] [0.18.4] Unoptimal code generation of Linux binary (too many redundant stores/loads)
Posted: Tue Feb 11, 2020 12:15 pm
Short description
While looking at a profile of Factorio on Linux, I noticed that most hot functions contain lots of redundant stores and loads of floating-point values. My guess is that this is caused by the -ffloat-store compiler flag. My second guess is that -ffloat-store was added to make the 32-bit binary produce the same floating-point results as the 64-bit one. My understanding is that -ffloat-store doesn't change the results of the 64-bit binary but causes a massive slowdown of floating-point code (different sources report up to a 2x slowdown).
Is it possible to remove -ffloat-store if it is present? If it is not present, perhaps there is some other reason why the generated code contains so many loads and stores.
Long description
Yesterday I tried to connect to an online game, only to find that my computer wasn't fast enough to catch up with the game. I decided to profile it to see if something could be improved, so I collected a perf profile. While looking at the top functions of the profile, I noticed quite bad codegen in the function Math::sincosUnsafe. Here are the first few lines:
0,64 │ sub $0xc0,%rsp
0,93 │ movsd %xmm0,-0x70(%rsp) # store to some local variable
8,43 │ movsd -0x70(%rsp),%xmm0 # load from it
0,13 │ movapd %xmm0,%xmm1
1,06 │ subsd .LC197,%xmm1
│ movsd %xmm1,-0x60(%rsp) # store to another local variable
0,28 │ movsd -0x60(%rsp),%xmm2 # load from it
0,27 │ unpcklpd %xmm0,%xmm2
0,90 │ movaps %xmm2,0x98(%rsp) # store to another local variable
1,79 │ cvtpd2dq 0x98(%rsp),%xmm0 # load from it
1,80 │ cvtdq2pd %xmm0,%xmm0
My first impression was like this:
1. Wow, SRoA (scalar replacement of aggregates) did a very poor job here. I wonder if I could make a repro case to report to GCC.
2. Stop! sincos shouldn't have any aggregates. What caused GCC to generate such bad code then?
3. Aha, it must be some volatile variables they put here to get the same results on different compilers.
4. Then I looked at other hot functions and saw exactly the same codegen, even in the function read_to_mixer_linear_float_32 from allegro. No way they would put volatile in a sound mixing function in allegro. It must be some compiler flag that pessimizes the whole program. What can it be?
5. I remember the -ffloat-store flag from the x87 era, used to trim the excess precision of floating-point operations done on the x87 coprocessor. After a quick check I verified that it pessimizes SSE codegen in a similar way to what is seen in Factorio.
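The quick check from item 5 can be repeated on a small function. The sketch below is a hypothetical stand-in, not Factorio's code: scaleAndShift and scaleAndShiftVolatile are made-up names. Compiling it with g++ -O2 -S, once with and once without -ffloat-store, and comparing the assembly should show whether the named intermediates get a store and a reload. The volatile variant forces the same pattern by hand, which is what item 3 suspected at first.

```cpp
// Hypothetical stand-in for a hot floating-point function (not Factorio code).
// Build twice and diff the assembly:
//   g++ -O2 -S example.cpp -o plain.s
//   g++ -O2 -ffloat-store -S example.cpp -o floatstore.s
double scaleAndShift(double x)
{
    double shifted = x - 0.25;     // named intermediate: candidate for a forced store/load
    double scaled = shifted * 4.0; // second named intermediate
    return scaled + x;
}

// Same arithmetic, but the volatile locals force a store and a reload of
// every intermediate regardless of compiler flags.
double scaleAndShiftVolatile(double x)
{
    volatile double shifted = x - 0.25;
    volatile double scaled = shifted * 4.0;
    return scaled + x;
}
```

Both functions return the same value for the same input; only the generated code is expected to differ.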
My understanding of -ffloat-store
In my understanding (I might be wrong), -ffloat-store is needed for 32-bit binaries that use the x87 FPU, to trim the excess precision of x87 operations. SSE operations don't have excess precision, so there is no need to store them to memory and load them back. Some libraries use -ffloat-store only when SSE is not available (source). My guess is that -ffloat-store was added to Factorio when the game supported 32-bit mode. Now that 32-bit support has been dropped, there is no need for -ffloat-store.
The FloatingPointMath page on the GCC wiki specifically says that -mfpmath=sse -msse2 avoids the excess-precision problem, and compiling for x64 implies both of these options.
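Whether a given build evaluates floating-point expressions with excess precision can be checked through the standard FLT_EVAL_METHOD macro from <cfloat>. The sketch below is only an illustration, and describeEvalMethod and describeThisBuild are hypothetical helper names. With GCC I would expect the value 0 on x64 and on 32-bit builds that use -mfpmath=sse -msse2, and 2 on 32-bit x87 builds.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: maps a FLT_EVAL_METHOD value to a short description.
std::string describeEvalMethod(int method)
{
    switch (method) {
    case 0:  return "operations are evaluated in the precision of their type";
    case 1:  return "float and double are evaluated as double";
    case 2:  return "float and double are evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

// Description of the evaluation method used by the current build.
std::string describeThisBuild()
{
    return describeEvalMethod(FLT_EVAL_METHOD);
}
```

Printing describeThisBuild() from builds made with different flags shows which of them are exposed to excess precision in the first place.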
If 32-bit compatibility is still needed, I would recommend either (1) leaving -ffloat-store only in 32-bit mode or (2) enabling -mfpmath=sse -msse2 on 32-bit. The second option would require at least an early Pentium 4-class machine (available since 2000) to run the game, but I don't think this is too constraining today.
Final remark
It is possible that my understanding is wrong and that removing -ffloat-store affects the results of floating-point math on x64. If this is the case, I would like to help troubleshoot the problem.
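One way to troubleshoot this would be to compare results bit for bit between a build with -ffloat-store and one without. The sketch below is only an assumption about how such a check could look: bitsOf and checksum are hypothetical helpers, not anything taken from the game.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the raw bit pattern of a double so that results can be compared
// exactly, without any tolerance.
std::uint64_t bitsOf(double value)
{
    std::uint64_t bits = 0;
    static_assert(sizeof(bits) == sizeof(value), "double is expected to be 64 bits");
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Folds a sequence of results into one 64-bit value (FNV-1a over the bit
// patterns). If two builds give the same checksum for the same inputs, their
// results were bit-identical, barring hash collisions.
std::uint64_t checksum(const std::vector<double>& results)
{
    std::uint64_t hash = 14695981039346656037ull; // FNV-1a 64-bit offset basis
    for (double r : results) {
        const std::uint64_t bits = bitsOf(r);
        for (int i = 0; i < 8; ++i) {
            hash ^= (bits >> (8 * i)) & 0xffu;    // mix in one byte at a time
            hash *= 1099511628211ull;             // FNV-1a 64-bit prime
        }
    }
    return hash;
}
```

Running the same set of inputs through functions such as Math::sincosUnsafe in both builds and comparing the checksums would show whether the flag changes any result.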