"A lot of testing later and the results were correct; it was just that much faster. The underlying algorithm didn't change, but it now ran more than 3x faster by touching less memory. This is another nice example of 'Factorio is not CPU bound, it's memory latency bound'. More cores weren't going to make this faster, because it was never limited by how fast the CPU ran."

There is potentially another approach here which could be of interest: transforming the iteration loop from sequential order to partially overlapping, cooperative execution, using coroutines with a prefetch+yield operation pair wherever you expect an L1 cache miss.
It doesn't change the logic, and it doesn't change the code (much) either, but it completely changes how sensitive your code is to memory latency.
Coroutines are really the key feature here: a prefetch without parallel execution is usually hard to use properly (you don't know early enough what you will need to prefetch), and manual interleaving is just a pain in terms of maintainability of the resulting code.
Check https://www.youtube.com/watch?v=j9tlJAqMV7U for a further in-depth explanation of how this works (and an example showcasing exactly this prefetch-yield pattern). Once you get your head wrapped around the concept of coroutines in C++, they are a surprisingly helpful feature for a threading-free iteration strategy.