Solved already, but Factorio successfully identified that I had failing hardware (Linux)
Posted: Fri Jan 31, 2025 12:00 am
So I kind of don't need help anymore, but I still want to make a post, because actually diagnosing the issue was a bit of a saga and I'd like it to be searchable for anyone hitting the same thing. I learned some things.
Basically, after I installed Space Age, Factorio started having checksum errors when loading PNG files, or occasionally zip files. This didn't happen often, only once or twice a week. I'd run Steam's "verify integrity of game files" and move on.
As this repeated, I assumed it was obviously corruption on the NVMe drive, even though the drive passed every test I tried. So, deciding to bring out the big guns of data integrity, I swapped my ext4 filesystem for btrfs in a raid1 configuration.
Oddly enough, the corruption kept happening, but btrfs detected no errors. That should be impossible if the data were bad on disk, because btrfs uses block-level checksums and so should always detect on-disk corruption.
That's when I discovered that copying the corrupt file, rebooting the computer, or, most tellingly, purging Linux's page cache would fix the corrupt file(s). This meant the files weren't corrupted on disk; they were being corrupted in memory between the first and subsequent accesses. (Bear in mind that with only 1 or 2 instances of this corruption per week, this was a slow process of discovery.)
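For anyone who wants to check a single suspect file the same way, here's a rough sketch of the idea in Python. It isn't what I actually ran; the function names are made up for illustration. It hashes the file (likely served from the page cache), asks the kernel to evict that file's cached pages with posix_fadvise, then hashes it again so the data has to come back from disk. The eviction is only a hint to the kernel and only applies to clean pages, so a match doesn't prove anything, but a mismatch points at memory rather than the drive.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def evict_from_page_cache(path):
    """Ask the kernel to drop this file's cached pages (advisory, no root needed)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0, length 0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def cached_copy_matches_disk(path):
    """Hash the file before and after evicting it from the page cache."""
    before = sha256_of(path)      # probably read from the page cache
    evict_from_page_cache(path)
    after = sha256_of(path)       # should be re-read from disk if eviction worked
    return before == after, before, after
```

If the two hashes differ for a file nobody has written to in between, the copy in RAM and the copy on disk disagree.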
At that point, it should've been obvious it was bad memory, but I do be dumb though.
Anyway, the game basically still worked fine, and whenever it happened I'd just purge the page cache and continue playing.
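In case it helps someone, purging the page cache on Linux means writing to /proc/sys/vm/drop_caches as root. Below is a minimal Python sketch of that; the function name and the control_path parameter are my own, added only so it can be pointed at a different file. Writing "1" frees the page cache only ("2" frees dentries and inodes, "3" frees both).

```python
import os


def drop_page_cache(control_path="/proc/sys/vm/drop_caches"):
    """Flush dirty data, then ask the kernel to free the page cache.

    Writing to the real /proc file requires root.
    """
    os.sync()  # write dirty pages out first so more of the cache can be freed
    with open(control_path, "w") as f:
        f.write("1\n")  # 1 = page cache only
```

This only throws away cached copies so the next read comes from disk; it does nothing about the underlying fault.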
After a couple of months of this, the errors became more frequent, roughly daily, including a couple of game crashes that exited with a "this is probably failing hardware" type message and a couple of refusals to save the game because corrupt state was detected.
I had been running a lot of CPU and memory tests in Linux which had detected nothing even when running for hours, but I finally decided to get serious and actually start swapping around hardware.
I pulled out one RAM stick and ran memtest86+ on the remaining one: 0 errors, which was what I was used to seeing.
I then swapped the RAM sticks and ran memtest86+ on the other one, and boom, it lit up like a Christmas tree with errors. The computer also wouldn't even boot successfully on this stick alone. So it was conclusively one RAM stick that had gone bad.
What I find quite remarkable is that my PC was fairly stable (though I'm sure with the bad stick gone, it'll now be extremely stable). My guess is that, either by accident or by design, Linux was mostly using the good stick and only putting less important stuff on the bad stick, and it certainly seemed to shield the bad memory from the memory tests I ran under Linux. Also, credit to the devs: Factorio was the only program complaining that the hardware was probably bad, making it one of the better diagnostic tools.