Rather off topic, but this piqued my curiosity and I didn't find anything when searching... has anyone benchmarked an SSE2/3 build (whatever you're at now) vs AVX2?
Was build targeting AVX ever benchmarked against SSE2 build?
There are 10 types of people: those who get this joke and those who don't.
-
- Filter Inserter
- Posts: 344
- Joined: Sat Nov 09, 2024 2:36 pm
- Contact:
Re: [2.0.14][macOS] Massive performance regression versus 1.1.110 when zoomed in
On a related note, as I'm curious, would it be feasible to have the game detect your hardware and OS, and if it's new enough, take advantage of the newer features?
I guess that would make a larger binary, and increase maintenance work as you'll have to write two sets of code and have two sets of bugs....
Re: [2.0.14][macOS] Massive performance regression versus 1.1.110 when zoomed in
IsaacOscar wrote: ↑Sat Nov 23, 2024 2:46 am
On a related note, as I'm curious, would it be feasible to have the game detect your hardware and OS, and if it's new enough, take advantage of the newer features?
I guess that would make a larger binary, and increase maintenance work as you'll have to write two sets of code and have two sets of bugs....
Possible, although there are some trade-offs:
- Compile the entire binary for multiple architectures, determine which is running when launched -> much larger binaries
- Offer multiple versions to download -> don't know if Steam supports this automatically, would make the website more confusing to download from, in some cases means you can't copy binaries from one computer to another
- Detect instruction sets and offer multiple versions for a few hot spots in the code -> more complicated, more work to determine those spots; works for programs with a few defined hot spots (e.g. video encoding/decoding), but Factorio doesn't seem to fall into that bucket
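To make the third option concrete, here is a minimal sketch of per-hot-spot dispatch. It assumes GCC or Clang on x86 (`__builtin_cpu_supports` and the `target` attribute are compiler extensions; MSVC would need a different mechanism), and every function name in it is hypothetical. It is not how Factorio does anything, just an illustration of the boilerplate involved.

```cpp
#include <cstddef>

// Hypothetical hot spot: sum an array of 32-bit integers.
// Baseline version, compiled for the build's minimum instruction set.
static long long sum_baseline(const int* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
// Same source, but this one function is compiled with AVX2 enabled,
// so the compiler is allowed to use 256-bit instructions inside it.
__attribute__((target("avx2")))
static long long sum_avx2(const int* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Ask the CPU at run time whether AVX2 is available.
static bool cpu_has_avx2() { return __builtin_cpu_supports("avx2") != 0; }
#else
// Other compilers/architectures: only the baseline path exists.
static long long sum_avx2(const int* data, std::size_t n) { return sum_baseline(data, n); }
static bool cpu_has_avx2() { return false; }
#endif

using SumFn = long long (*)(const int*, std::size_t);

// Pick an implementation based on what the running CPU supports.
static SumFn select_sum() { return cpu_has_avx2() ? sum_avx2 : sum_baseline; }

// Public entry point: the choice is made once, on the first call.
long long sum_dispatch(const int* data, std::size_t n) {
    static const SumFn fn = select_sum();
    return fn(data, n);
}
```

Note that even this toy version duplicates the function body, and which copy actually runs depends on the machine, which is exactly the maintenance and testing cost mentioned in the list.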
Re: [2.0.14][macOS] Massive performance regression versus 1.1.110 when zoomed in
Sorry, I didn't mean overall, I meant for the portions of code that want to use the new features. Also it's not that simple, e.g. using new graphics APIs might require them to rewrite all the graphics code, or convert assets to different formats.
Also stuff like AVX can be done using compiler intrinsics (i.e. another API), or require the code to be written in a way that the compiler can easily apply (e.g. to ensure sequential storage of numbers you want to do SIMD operations on:
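As an illustration of what "sequential storage" means here, a minimal sketch comparing two layouts. The struct and function names are made up for the example; whether a given compiler actually vectorizes either loop depends on its flags and version.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: each x is interleaved with y and id in memory.
struct ParticleAoS { float x; float y; int id; };

// Structure-of-arrays: all x values sit next to each other.
struct ParticlesSoA {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<int> id;
};

// Strided access: consecutive x values are sizeof(ParticleAoS) bytes apart
// (typically 12), so one vector load cannot pick up several of them directly.
void move_aos(std::vector<ParticleAoS>& ps, float dx) {
    for (auto& p : ps) p.x += dx;
}

// Unit-stride access: consecutive x values are adjacent, which is the
// pattern auto-vectorizers and SIMD loads handle most easily.
void move_soa(ParticlesSoA& ps, float dx) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) ps.x[i] += dx;
}
```

Both functions compute the same thing; only the memory layout differs.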
Re: Was build targeting AVX ever benchmarked against SSE2 build?
Few notes from experience:
- AVX2 is rare, most systems won't have this yet. It's currently exclusive to server machines. Float16 support that it adds is cool, but won't help Factorio as that breaks lockstep.
- Using intrinsics is hard and only suits certain problems. And I've had multiple cases where gcc would outperform me by being smarter than my intrinsic code. And my problem was highly optimizable for SIMD (unlike Factorio).
- Core 2 Duo is a fine target, it adds a lot of generic features that benefit performance that the compiler can use without special tricks. SSE4 doesn't add that much compared to SSE2 (which is a deal changer).
Re: Was build targeting AVX ever benchmarked against SSE2 build?
Umm, it's 10 years old now, and I've had various AVX2 processors for most of that time.
Well Windows 10 never even supported that processor, and factorio requires Windows 10 anyway. (Although it might work on Linux/Mac with a core 2).
Re: Was build targeting AVX ever benchmarked against SSE2 build?
I tried to compile for AVX just out of curiosity, but have not measured any interesting speedup right off the bat. We are using the strict math compiler option because of simulation determinism when compiling for different platforms (usually even using different compilers), and the way the simulation is implemented, it's not really vectorizable. So I guess the only benefit for the game update was that the compiler could use 3-operand versions of the instructions it was already using, removing a lot of mov instructions, but that didn't yield any significant speedup.
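A small sketch of why strict math and vectorization don't mix well, assuming IEEE-754 doubles: floating-point addition is not associative, so summing in "lanes" (as a vectorized loop would) can give a different result than summing in source order. The function names are hypothetical and this is not Factorio code.

```cpp
// Sum strictly left to right, as the source code is written.
double sum_in_order(const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Sum the way a 2-lane vectorized loop would: even and odd elements go
// into separate accumulators that are combined only at the end.
double sum_two_lanes(const double* v, int n) {
    double lane0 = 0.0, lane1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < n) lane0 += v[i];  // leftover element when n is odd
    return lane0 + lane1;
}
```

For the input `{1e30, 1.0, -1e30, 1.0}` the in-order sum is 1.0 (the first 1.0 is absorbed by 1e30) while the two-lane sum is 2.0. In a lockstep simulation that kind of difference between builds would be a desync, which is why the compiler is not allowed to reorder the math.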
But! In rendering, we don't need determinism, don't need strict math, and there is potential for vectorization (which is there even with SSE2), and it could really benefit from fused-multiply-add instructions. But this has not been tested. EDIT: I can see FMA is not part of AVX, but is a separate extension of the instruction set.
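For reference, a minimal sketch of what fused multiply-add means at the source level, using the standard `std::fma`. Whether it becomes a single hardware instruction depends on the compile target (otherwise it falls back to a slower library routine), and the helper names are made up.

```cpp
#include <cmath>

// Unfused form: a*b is rounded, then the addition is rounded again.
// (Depending on flags and target, a compiler may contract this into an
// FMA on its own, which is one reason results can differ between builds.)
double mul_add_unfused(double a, double b, double c) { return a * b + c; }

// Fused form: a*b + c is computed with a single rounding at the end.
double mul_add_fused(double a, double b, double c) { return std::fma(a, b, c); }
```

With `a = 1 + 2^-30` and `c = -(1 + 2^-29)`, the fused version returns exactly `2^-60`, the low-order part of `a*a` that a separately rounded multiply throws away. One operation instead of two and slightly different rounding is fine for rendering, but the rounding difference is a problem anywhere bit-identical results are required.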
It is possible to detect available instruction sets, and do dynamic dispatch to code written/compiled for that instruction set ... But there still doesn't seem to be good compiler support for it (I have not looked into it for the past 4 years, though), so it is a lot of manual boilerplate + possibly code duplication, which then means different parts of the source code run on different machines, which makes testing more complicated, and maintenance and further development more complicated ... And in the end, it needs to run fast on old HW too. And my frustration comes from always having to figure out a fast way of doing things without using the new(ish) toys.
Frankly, I am tired of Factorio. I hope it'll get released as open-source eventually. When it comes to graphics, I feel like it is cemented into decisions made many years ago. Even in early access, we were always a year away from 1.0, so the game was always nearly finished, and I didn't feel like there was room for changing the art style. Rendering performance would greatly benefit either from changing nearly all sprites to not use semi-transparency (pixels would be either fully opaque or fully transparent), which would allow us to use a depth buffer instead of having to sort sprites back to front and send them in that order to the GPU, which is inefficient for both CPU and GPU. Or by switching to rendering 3D models instead of sprites (I regret we didn't make at least the asteroids in Space Age 3D). Looking back, I feel like both would have been possible, if we had decided to do that when doing (or instead of doing) high resolution sprites. Eh, sorry for offtopic.
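To illustrate the depth buffer point, here is a toy CPU-side sketch on a one-dimensional "framebuffer". It only models the idea (a real renderer does the depth test on the GPU), and the `Sprite` layout and function names are invented for the example. Sprites are assumed to be fully opaque wherever they cover a pixel.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// A sprite covering pixels [begin, end); smaller depth = closer to the camera.
struct Sprite { int begin; int end; float depth; int color; };

// Painter's algorithm: sort back to front, then let nearer sprites overwrite.
std::vector<int> draw_sorted(std::vector<Sprite> sprites, int width) {
    std::stable_sort(sprites.begin(), sprites.end(),
                     [](const Sprite& a, const Sprite& b) { return a.depth > b.depth; });
    std::vector<int> frame(width, 0);
    for (const Sprite& s : sprites)
        for (int x = s.begin; x < s.end; ++x) frame[x] = s.color;
    return frame;
}

// Depth-buffer approach: sprites arrive in any order; a pixel is written
// only if the incoming sprite is closer than what is already stored there.
std::vector<int> draw_depth_tested(const std::vector<Sprite>& sprites, int width) {
    std::vector<int> frame(width, 0);
    std::vector<float> depth(width, std::numeric_limits<float>::infinity());
    for (const Sprite& s : sprites)
        for (int x = s.begin; x < s.end; ++x)
            if (s.depth < depth[x]) {
                depth[x] = s.depth;
                frame[x] = s.color;
            }
    return frame;
}
```

With opaque pixels both functions produce the same image, and the second one needs no sort. With semi-transparent pixels the result of blending depends on draw order, so the depth test alone is no longer enough, which is why the existing sprites force the sorted path.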
Re: Was build targeting AVX ever benchmarked against SSE2 build?
IsaacOscar wrote: ↑Sat Nov 23, 2024 9:16 am
Well Windows 10 never even supported that processor, and factorio requires Windows 10 anyway. (Although it might work on Linux/Mac with a core 2).
It still works on Win 7 (and possibly Vista) too
I guess. From SSE4 I was missing integer variants of packed min, max, abs, and other instructions. I also missed blend instruction, and I think there are some better shuffles ... I really miss floor and round instructions though.
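As an example of what "missing" means in practice, here is a sketch of a per-lane signed 32-bit minimum written with SSE2 intrinsics only. SSE4.1 has a single instruction for it (`_mm_min_epi32`) and proper blend instructions; with SSE2 you compare and then blend by hand with and/andnot/or. The function names are hypothetical, and the scalar fallback is only there so the snippet also builds on non-x86 targets.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics

// Per-lane minimum of four signed 32-bit integers, SSE2 only.
static inline __m128i min_epi32_sse2(__m128i a, __m128i b) {
    __m128i a_less = _mm_cmplt_epi32(a, b);            // all-ones in lanes where a < b
    return _mm_or_si128(_mm_and_si128(a_less, a),      // keep a where a < b
                        _mm_andnot_si128(a_less, b));  // keep b elsewhere (manual blend)
}
#endif

// Wrapper on plain arrays of 4 values so it can be called from ordinary code.
void min4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
#if defined(__SSE2__) || defined(_M_X64)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), min_epi32_sse2(va, vb));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] < b[i] ? a[i] : b[i];  // scalar fallback
#endif
}
```

That is four instructions where SSE4.1 needs one, and the same compare-and-mask pattern repeats for max, abs and every other place a blend would have been used.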
Re: Was build targeting AVX ever benchmarked against SSE2 build?
That would be awesome!
Overall, everything looks awesome (except the engineer looks a bit grainy....) and I've had no performance problems (I constantly get 60 FPS/UPS),
but I also regularly spend too much money on my computer...
So at least from my point of view, you guys have done an excellent job.
(although there was one time I turned on the debug setting for the rail planner and it drew like a hundred different possible rail paths and got a bit laggy..., but that's my fault for using a debug setting!)
Re: Was build targeting AVX ever benchmarked against SSE2 build?
Someone should update your website then https://www.factorio.com/space-age/buy
Re: Was build targeting AVX ever benchmarked against SSE2 build?
IsaacOscar wrote: ↑Sat Nov 23, 2024 10:07 am
Someone should update your website then https://www.factorio.com/space-age/buy
Even though it runs, it doesn't mean we want to support it (for new buyers)
IsaacOscar wrote: ↑Sat Nov 23, 2024 10:05 am
So at least from my point of view, you guys have done an excellent job.
Thank you, I appreciate it.
Re: Was build targeting AVX ever benchmarked against SSE2 build?
IsaacOscar wrote: ↑Sat Nov 23, 2024 9:16 am
Umm, it's 10 years old now, and I've had various AVX2 processors for most of that time.
My mistake, I mistook AVX2 for AVX512 (imagine having a memory that works)
Re: Was build targeting AVX ever benchmarked against SSE2 build?
Depending on the instructions used AVX might not speed up much if the bottleneck is not the instruction frontend.
Then it will just slightly reduce the number of instructions used and thereby increase code density (resulting in a smaller executable, provided no separate execution branches are used).
The reason you won't see a huge speedup is that all modern CPU cores are already superscalar; they check for data dependencies on the fly and will detect that the 2x128-bit instructions are independent of each other and can be executed simultaneously.
Because of that, they will usually schedule them as if they were a single 1x256-bit instruction if the core has AVX256 capability.
The Vector units usually also can be partitioned like that where it allows for 1x256 instruction to be scheduled or 2x128 at once. Same is true for even more recent stuff like AVX512.
Some AMD CPUs do some of the AVX512 stuff the other way around, where it actually splits them into 2x256 instructions that are done one after another because they had no native AVX512 ALUs until recently.
That said, depending on the application and the amount of AVX usage, there is some performance gain achievable, because in some applications the bottleneck is the instruction frontend. So if it has to load & decode only 1 instruction from the instruction cache instead of 2, then obviously you save some time with that.
But if the bottleneck is rather the data cache performance, you will likely not notice much.
Also, some of the more recent AVX extensions obviously add operations that weren't available in any form in the older SSE variants, but I can't say much about how much there is to gain from such super-special instructions.
In any case, I would always take synthetic AVX benchmarks with a HUGE grain of salt, because they obviously test very specific, handpicked workloads where the wider AVX set excels and outperforms an older set, for marketing purposes.
But in practice many applications will likely not work that way.
Re: Was build targeting AVX ever benchmarked against SSE2 build?
Thanks, Posila, for humoring my curiosity (and for splitting the thread).
It makes sense that not much of the simulation code in Factorio would vectorize well.
I have heard some about AVX helping with memory load bandwidth (shaky on the details here), but that wouldn't help for things which are bound by latency (control flow and pointer-chasing and whatnot).
posila wrote: ↑Sat Nov 23, 2024 9:58 am
IsaacOscar wrote: ↑Sat Nov 23, 2024 9:16 am
Well Windows 10 never even supported that processor, and factorio requires Windows 10 anyway. (Although it might work on Linux/Mac with a core 2).
It still works on Win 7 (and possibly Vista) too
I have a young friend who plays on Windows 7 quite regularly
IsaacOscar wrote: ↑Sat Nov 23, 2024 12:53 pm
Interestingly, my old (desktop) CPU had AVX512, but my current newer and faster one doesn't (apparently its because it doesn't work on E-cores?).
Yeah AVX-512 is a really weird situation where it exists on server chips, about 3 generations of Intel laptop chips, maybe 1 generation of Intel desktop chips, and recent AMD chips... but not recent (consumer) Intel chips, because they didn't implement it in E-cores and didn't want to deal with different instruction sets on different cores (although there was a brief while where if you disabled E-cores on a chip you could use it).
Last edited by Jap2.0 on Sun Nov 24, 2024 5:17 am, edited 1 time in total.
Re: Was build targeting AVX ever benchmarked against SSE2 build?
In the past I’ve tested compiling for newer versions of SSE and AVX using the MSVC option to target them, but saw no measurable improvement in simulation time.
As Posila said, the simulation just isn’t instruction bound and is instead memory latency bound (waiting for loads to finish before the next instruction can continue).
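A minimal sketch of what "memory latency bound" looks like in code, with a hypothetical linked list standing in for pointer-heavy game data. Both functions compute the same sum; the difference is in how the loads depend on each other, not in how many instructions there are.

```cpp
#include <cstddef>
#include <vector>

struct Node { int value; const Node* next; };

// Dependent loads: the address of the next node is only known once the
// current node has arrived from memory, so cache misses are paid one
// after another. Wider SIMD instructions do nothing for this.
long long sum_list(const Node* head) {
    long long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) total += n->value;
    return total;
}

// Independent loads: every address is computable from the index up front,
// so the hardware can prefetch and overlap the memory accesses.
long long sum_array(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) total += values[i];
    return total;
}
```

Only the second shape gives a compiler something to vectorize, and only there would a newer instruction set have a chance of showing up in a benchmark.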
If you want to get ahold of me I'm almost always on Discord.