Friday Facts #251 - A Fistful of Frames
Re: Friday Facts #251 - A Fistful of Frames
What I wrote originally was pretty technical and hard to digest even for the non-graphics programmers on the team, so we decided to lighten it up a lot. I decided to put some of the original content in this post, in case some of you are interested in reading it. I might add more info to this post if I feel something needs to be explained further.
The vertex buffer streaming articles (and others on the same topic) generally suggest using a vertex buffer of fixed size. Map the buffer with the D3D11_MAP_WRITE_NO_OVERWRITE/GL_MAP_UNSYNCHRONIZED_BIT flag, copy your batch into a part that has not been used by previous draw calls, unmap it and make a draw call. Once the buffer is full, map it with the D3D11_MAP_WRITE_DISCARD/GL_MAP_INVALIDATE_BUFFER_BIT flag. This lets the driver know you no longer care about the previous content of the buffer. It either gives you the same chunk of memory, if no pending draw calls use the buffer, or allocates new memory for you and keeps the old buffer around until the pending draw calls finish. Chances are the driver will end up reusing previously allocated buffers for subsequent discards, so the system stabilizes in a state where there are no more dynamic memory allocations. OpenGL refers to this pattern as buffer orphaning.
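To make the pattern concrete, here is a minimal CPU-only sketch of the bookkeeping. It does not call Direct3D or OpenGL at all: `StreamingBuffer`, `MapMode` and `append` are hypothetical names made up for this illustration, and a plain `std::vector` stands in for the memory the driver would hand back from a map call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the two map modes named above (assumption:
// in real code these would be the D3D11/OpenGL flags, not this enum).
enum class MapMode { NoOverwrite, Discard };

// CPU-only model of a fixed-size streaming vertex buffer.
class StreamingBuffer {
public:
    explicit StreamingBuffer(std::size_t capacity)
        : storage_(capacity), cursor_(0), discards_(0) {}

    struct Range {
        std::size_t offset;  // where the batch was written
        MapMode mode;        // which map mode the write would have used
    };

    // Copies one batch into a region that no earlier batch has used.
    // If the batch does not fit, the buffer is "discarded" and writing
    // restarts at offset 0 (a real driver may return fresh memory here).
    Range append(const void* data, std::size_t bytes) {
        assert(bytes <= storage_.size());
        MapMode mode = MapMode::NoOverwrite;
        if (cursor_ + bytes > storage_.size()) {
            mode = MapMode::Discard;
            cursor_ = 0;
            ++discards_;
        }
        std::memcpy(storage_.data() + cursor_, data, bytes);
        Range written{cursor_, mode};
        cursor_ += bytes;
        return written;
    }

    std::size_t discards() const { return discards_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t cursor_;
    std::size_t discards_;
};
```

With a 64-byte buffer and 24-byte batches, the first two appends land at offsets 0 and 24 in no-overwrite mode, and the third triggers a discard and starts again at offset 0.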
So we implemented this version of streaming, and it was pretty fast, but we noticed that mapping a buffer to system memory still takes a lot of time even with the D3D11_MAP_WRITE_NO_OVERWRITE/GL_MAP_UNSYNCHRONIZED_BIT flags, as did the memcpy of a batch into the buffer. That’s why we map the buffer once, write vertices into it directly, and write as many batches as we can before we unmap it. Then we loop through the list of prepared batches and issue a draw call for each batch. This eliminated the per-batch map/unmap and the unnecessary memcpy, and gave us a nice boost. We still use streaming though, because we often need to flush the buffer before it is full (when we change the render target, or render something other than sprites …).
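The "map once, write many batches, then draw" idea could look roughly like the sketch below. This is not the actual Factorio code: `BatchWriter`, `Batch`, `pushQuad` and `flush` are hypothetical names, the vertex layout is invented, and a `std::vector` again stands in for the mapped buffer so the example runs without a graphics API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, u, v; };  // invented layout for illustration

// One prepared batch: a run of vertices that share a texture.
struct Batch {
    std::size_t firstVertex;
    std::size_t vertexCount;
    int textureId;
};

class BatchWriter {
public:
    explicit BatchWriter(std::size_t maxVertices) : mapped_(maxVertices), used_(0) {
        assert(maxVertices >= 4);
    }

    // Writes a quad's four vertices straight into the "mapped" memory,
    // with no intermediate copy. Returns false when the buffer is full,
    // in which case the caller has to flush first.
    bool pushQuad(int textureId, float x, float y, float w, float h) {
        if (used_ + 4 > mapped_.size()) return false;
        if (batches_.empty() || batches_.back().textureId != textureId)
            batches_.push_back(Batch{used_, 0, textureId});
        Vertex* out = &mapped_[used_];
        out[0] = {x,     y,     0.f, 0.f};
        out[1] = {x + w, y,     1.f, 0.f};
        out[2] = {x,     y + h, 0.f, 1.f};
        out[3] = {x + w, y + h, 1.f, 1.f};
        used_ += 4;
        batches_.back().vertexCount += 4;
        return true;
    }

    // Models "unmap, then one draw call per prepared batch".
    // Simplification: the write position resets to 0 here, whereas a
    // streaming buffer would keep advancing until it needs a discard.
    template <class DrawFn>
    std::size_t flush(DrawFn draw) {
        std::size_t calls = 0;
        for (const Batch& b : batches_) { draw(b); ++calls; }
        batches_.clear();
        used_ = 0;
        return calls;
    }

private:
    std::vector<Vertex> mapped_;
    std::vector<Batch> batches_;
    std::size_t used_;
};
```

Pushing two quads with texture 1 and one quad with texture 2 produces two batches, so `flush` issues two draw calls covering twelve vertices in total.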
After all of this the new rendering code was already faster than the old one, but we noticed that calling our draw functions added a lot of overhead. At first we thought it was due to memory latency, but adding prefetching to this code sped it up only very little. After looking at the generated assembly we realized the draw function uses many CPU registers, whose values are backed up on the stack when the function is entered and restored when it exits. This created a lot of overhead, because we were iterating through sprite draw commands and calling render on them one by one. We changed render to operate over a range of sprite draw commands instead and gained quite a large additional speedup.
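The change from per-sprite calls to a range call can be sketched as follows. The types and function names (`SpriteDrawCommand`, `renderOne`, `renderRange`) are hypothetical and the vertex math is deliberately trivial. The point is only the shape of the interface: the loop moves inside the function, so the register save/restore at entry and exit happens once per range instead of once per sprite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SpriteDrawCommand { float x, y, w, h; };
struct Vertex { float x, y; };

// Per-sprite version: the caller loops and pays the call overhead
// (including any register spills) once for every command.
void renderOne(const SpriteDrawCommand& c, Vertex* out) {
    out[0] = {c.x,       c.y};
    out[1] = {c.x + c.w, c.y};
    out[2] = {c.x,       c.y + c.h};
    out[3] = {c.x + c.w, c.y + c.h};
}

// Range version: one call handles [first, last), so function entry and
// exit cost is paid once. Returns the number of vertices written.
std::size_t renderRange(const SpriteDrawCommand* first,
                        const SpriteDrawCommand* last,
                        Vertex* out) {
    std::size_t written = 0;
    for (const SpriteDrawCommand* c = first; c != last; ++c, written += 4) {
        Vertex* v = out + written;
        v[0] = {c->x,        c->y};
        v[1] = {c->x + c->w, c->y};
        v[2] = {c->x,        c->y + c->h};
        v[3] = {c->x + c->w, c->y + c->h};
    }
    return written;
}
```

Both versions produce the same vertices. Whether the range version is actually faster depends on the compiler and on how heavy the real function is; in a toy like this the per-sprite call may simply be inlined.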
Re: Friday Facts #251 - A Fistful of Frames
Nice to read about your progress but boring stuff.
Maybe add a random screenshot of the new GUI
Re: Friday Facts #251 - A Fistful of Frames
steinio wrote: Nice to read about your progress but boring stuff.
Maybe add a random screenshot of the new GUI
You just earned yourself 2 more weeks about blueprint library, mister.
- MasterBuilder
- Filter Inserter
- Posts: 353
- Joined: Sun Nov 23, 2014 1:22 am
Re: Friday Facts #251 - A Fistful of Frames
steinio wrote: Nice to read about your progress but boring stuff.
Maybe add a random screenshot of the new GUI
Twinsen wrote: You just earned yourself 2 more weeks about blueprint library, mister.
You say that like it's a bad thing
Give a man fire and he'll be warm for a day. Set a man on fire and he'll be warm for the rest of his life.
Re: Friday Facts #251 - A Fistful of Frames
Excellent write-up. The technical aspects of optimization are often overlooked and rarely appreciated as much as they should be.
Re: Friday Facts #251 - A Fistful of Frames
CakeDog wrote: Excellent write-up. The technical aspects of optimization are often overlooked and rarely appreciated as much as they should be.
With machines getting stronger and stronger, optimization is often overlooked in favor of faster development. Bethesda games are a prime example. It's very nice to see Factorio still trying to keep optimization a priority. I know it is hard, unfun, and often unrewarding, but it is very useful for the existing userbase.
On the other hand, today's topic is very hard indeed; hopefully there will be users who can give advice, ideas and input on it.
Re: Friday Facts #251 - A Fistful of Frames
And I was wondering what the hell I should do on my August holidays. Thanks for giving me a good excuse to visit Prague! If anyone wants to meet up for a few beers and a LAN session I'm more than happy to tag along.
Question for the devs, will these computers have 0.17 on them? Or the current 0.16 stable?
Other Crazy Suggestions:
| "Each" Signal Choice as Stack Size Control Signal |
Re: Friday Facts #251 - A Fistful of Frames
Meddleman wrote: And I was wondering what the hell I should do on my August holidays. Thanks for giving me a good excuse to visit Prague! If anyone wants to meet up for a few beers and a LAN session I'm more than happy to tag along.
Question for the devs, will these computers have 0.17 on them? Or the current 0.16 stable?
0.16 of course
-
- Filter Inserter
- Posts: 952
- Joined: Sat May 23, 2015 12:10 pm
Re: Friday Facts #251 - A Fistful of Frames
In newer OpenGL versions you can have buffers mapped persistently. Then you can do away with the map/unmap operations.
Did you also consider tessellation shaders? It's the more specialized little brother of geometry shading that doesn't have its drawbacks.
It lets you expand a single vertex into a quad using a constant tessellation control shader (hull shader in D3D speak), and the tessellation evaluation shader (domain shader in D3D speak) is what the vertex shader used to be.
- eradicator
- Smart Inserter
- Posts: 5211
- Joined: Tue Jul 12, 2016 9:03 am
Re: Friday Facts #251 - A Fistful of Frames
steinio wrote: Nice to read about your progress but boring stuff.
Maybe add a random screenshot of the new GUI :)
Twinsen wrote: You just earned yourself 2 more weeks about blueprint library, mister.
MasterBuilder wrote: You say that like it's a bad thing :)
Yea. I vote to extend to 4 weeks. Until you give in and give us directory trees :D.
How often are you guys at the Library?
I'm considering staying a day in Prague on July 30th (instead of going straight through it), but haven't yet entirely convinced myself.
Author of: Belt Planner, Hand Crank Generator, Screenshot Maker, /sudo and more.
Mod support languages: 日本語, Deutsch, English
My code in the post above is dedicated to the public domain under CC0.
Re: Friday Facts #251 - A Fistful of Frames
I know some of the words from posila's part of the post, but I'm happy that 4790k got some great results
Re: Friday Facts #251 - A Fistful of Frames
ratchetfreak wrote: In newer OpenGL versions you can have buffers mapped persistently. Then you can do away with the map/unmap operation.
I know, but we are really doing OpenGL just for legacy support, so it doesn't make sense for us to have two different backends for OGL 3.3 and 4.5. We will do Vulkan instead.
ratchetfreak wrote: Did you also consider tessellation shaders? It's the more specialized little brother of geom shading that doesn't have its drawbacks.
We didn't. We are targeting DirectX 10 class hardware, where tessellation is not available yet.
Re: Friday Facts #251 - A Fistful of Frames
Calling it now, next FFF is "For a Few Frames More", followed by "The Good, The Bad, and the Poorly Optimized"
Re: Friday Facts #251 - A Fistful of Frames
DrNick wrote: Calling it now, next FFF is "For a Few Frames More", followed by "The Good, The Bad, and the Poorly Optimized"
I am still not sure about the last one
-
- Long Handed Inserter
- Posts: 70
- Joined: Sat May 16, 2015 4:39 am
Re: Friday Facts #251 - A Fistful of Frames
You mentioned that you saw better improvements from AMD cards. After reviewing the benchmarks, were there any improvements for the NVidia cards? All I see are AMD and Intel adapters.
Re: Friday Facts #251 - A Fistful of Frames
My question may sound very stupid, but I'm not into computers:
What would be the best PC build for Factorio (what do you recommend)? Just aim for the most expensive CPU and GPU?
I'm not english, sorry for my mistakes
Re: Friday Facts #251 - A Fistful of Frames
"Next we need to improve the GPU side of things, mainly excessive usage of video memory (VRAM),"
YES PLEASE, (older) cards with less than 3GB are getting hit really hard so that would be awesome
Re: Friday Facts #251 - A Fistful of Frames
Meddleman wrote: And I was wondering what the hell I should do on my August holidays. Thanks for giving me a good excuse to visit Prague! If anyone wants to meet up for a few beers and a LAN session I'm more than happy to tag along.
Question for the devs, will these computers have 0.17 on them? Or the current 0.16 stable?
Klonan wrote: 0.16 of course
so "when it's done" is at least a few weeks to go
Re: Friday Facts #251 - A Fistful of Frames
DaemosDaen wrote: You mentioned that you saw better improvements from AMD cards. After reviewing the benchmarks, were there any improvements for the NVidia cards? All I see are AMD and Intel adapters.
The Y axis shows CPU names; we didn't mention the GPU because the optimization is supposed to affect only the CPU side of rendering. But we marked tests that ran on Intel integrated GPUs with an asterisk, because in those cases the CPU was already waiting for the GPU to finish rendering.
Here are benchmarks on my computer with the two different GPUs: