Solved already, but Factorio successfully identified that I had failing hardware (Linux)

BlakeMW
Filter Inserter
Posts: 992
Joined: Thu Jan 21, 2016 9:29 am

Post by BlakeMW »

So I kind of don't need help anymore, but I still want to make a post, because actually diagnosing the issue was a bit of a saga, and I'm posting for issue SEO. I learned some things.

Basically, after I installed Space Age, Factorio started reporting checksum errors when loading PNG files, and occasionally zip files. This didn't happen often, only once or twice a week. Each time I'd run Steam's "verify integrity of game files" and move on.

As this kept repeating, I assumed it was obviously corruption on the NVMe drive, even though the drive passed every test I tried. So I decided to bring out the big guns of data integrity and swapped my ext4 filesystem for btrfs in a raid1 configuration.

Oddly enough, the corruption kept happening, but btrfs detected no errors. That should be impossible if the data were corrupt on disk, because btrfs uses block-level checksums and so should always detect such errors.

That's when I discovered that copying the corrupt file, rebooting the computer, or most tellingly, purging Linux's page cache, would fix the corrupt file(s). This meant the files weren't corrupted on disk; they were being corrupted in memory between the first and subsequent accesses. (Bear in mind that with only 1 or 2 instances of this corruption per week, this was a slow process of discovery.)
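For anyone who wants to check for the same symptom, here is a rough sketch of the comparison in Python, not what I actually ran at the time. It hashes a file as currently cached, asks the kernel to evict that file's cached pages with `os.posix_fadvise(..., POSIX_FADV_DONTNEED)`, then hashes it again. The function names are made up for illustration, and the eviction is only advisory, so a match does not prove the RAM is good; a mismatch on a file nobody has modified is the interesting result.

```python
import hashlib
import os


def sha256_of(path, drop_cache_first=False):
    """Hash a file, optionally asking the kernel to evict its cached pages first."""
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        if drop_cache_first and hasattr(os, "posix_fadvise"):
            # Advisory only: the kernel may keep dirty or mapped pages cached.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        while True:
            chunk = os.read(fd, 1 << 20)  # read in 1 MiB chunks
            if not chunk:
                break
            digest.update(chunk)
    finally:
        os.close(fd)
    return digest.hexdigest()


def cached_copy_matches_disk(path):
    """Return True if the cached read and the post-eviction read hash the same."""
    cached = sha256_of(path)
    fresh = sha256_of(path, drop_cache_first=True)
    return cached == fresh
```

If `cached_copy_matches_disk()` returns False for a game file that hasn't changed, the copy in memory and the copy on disk disagree, which points at RAM rather than the drive.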

At that point, it should've been obvious it was bad memory, but I do be dumb though.

Anyway, the game basically still worked fine, and whenever it happened I'd just purge the page cache and continue playing.

After a couple of months of this, the errors started getting more frequent, roughly daily. That included a couple of game crashes that exited with a "this is probably failing hardware" type message, and a couple of refusals to save the game because corrupt state was detected.

I had been running a lot of CPU and memory tests in Linux, which detected nothing even when run for hours, but I finally decided to get serious and actually start swapping hardware around.

I pulled out one RAM stick and ran memtest86+ on the remaining one: 0 errors, which was what I was used to seeing.

I then swapped the RAM sticks and ran memtest86+ on the other one, and boom, it lit up like a Christmas tree with errors. Also, the computer wouldn't even successfully boot on this stick. So it was conclusively one RAM stick that had gone bad.

What I find quite remarkable is that my PC was fairly stable (though I'm sure that with the bad stick gone, it'll now be extremely stable). My guess is that, either by accident or by design, Linux was mostly using the good stick and was only putting less important stuff on the bad stick, and it certainly seemed to shield the bad memory from the memory tests. Also, credit to the devs: Factorio was the only program complaining that the hardware was probably bad, making it one of the better diagnostic tools.
andyseemight
Burner Inserter
Posts: 6
Joined: Thu Feb 06, 2025 9:54 am

Post by andyseemight »

I've also experienced this in the past, and today is when I actually found out what the culprit was. It happened extremely rarely for me. The problem was that my RAM sticks were not fully inserted. I must have nudged them while installing the CPU power cable or something, because I remember hearing all the clicks. After reading your post, and while downloading memtest and writing it to a USB stick, I decided to just verify that they were actually all the way in.

Interesting that they were not fully inserted but still 99% reliable!

For those of you still trying to figure this out, go to this website and download the free version:

https://www.memtest86.com/download.htm

Write it to an empty USB stick, using Rufus on Windows or the native USB writer program on Linux. Then restart, make that USB stick your boot drive, and follow the instructions.
eugenekay
Filter Inserter
Posts: 310
Joined: Tue May 15, 2018 2:14 am

Post by eugenekay »

In the spirit of "posting for issue SEO" for future readers:

ECC memory has real-world benefits. While sadly not available in most laptops/tablets/small desktops, it is a great idea to have in a serious "Workstation" or "Gaming Rig" - and a must-have for anything called a "Server". It does not guarantee against bit/byte corruption issues - but you will see the ECC warnings (remember to monitor dmesg!) that indicate a failing memory chip earlier, without waiting for application-level corruption to be reported and ruin your savegame file. The price difference is worth the peace of mind. 8-)
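Besides dmesg, on Linux the kernel's EDAC subsystem exposes running error counters in sysfs, which are easy to poll from a script. Below is a minimal Python sketch, assuming an EDAC driver for your memory controller is loaded and the counters appear as `ce_count` (corrected) and `ue_count` (uncorrected) under `/sys/devices/system/edac/mc/mcN/`. The function names are made up for illustration, and the `root` parameter exists only so the path can be overridden.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def controllers_with_errors(counts):
    """Names of memory controllers that have logged any ECC error."""
    return sorted(name for name, (ce, ue) in counts.items() if ce or ue)


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (no ECC, or driver not loaded).")
    for name in controllers_with_errors(counts):
        ce, ue = counts[name]
        print(f"{name}: {ce} corrected, {ue} uncorrected ECC errors")
```

A corrected-error count that keeps climbing is the early warning: the data was repaired, but the stick is worth watching or replacing before errors become uncorrectable.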
pioruns
Long Handed Inserter
Posts: 69
Joined: Tue Nov 05, 2024 3:38 pm

Post by pioruns »

I fully agree! ECC RAM is a great investment for reliability. I use it in both my home server and desktop workstation. I learned a hard lesson when I had failing RAM sticks that were extremely difficult to diagnose, causing infrequent instability and silent data corruption for months. Since then, I’ve made ECC RAM a must-have in all of my desktop systems.
Fortunately, using ECC RAM is much easier nowadays with Ryzen processors and a suitable motherboard that fully supports ECC detection and correction. I use Kingston Server Premier ECC UDIMMs with my Ryzen CPUs, and I have fully verified that ECC works.
I also run Btrfs on all my machines and use Btrfs RAID1 wherever possible. The combination of ECC RAM and Btrfs has saved me from potential data loss multiple times.

My PC setups:

Newer workstation computer for all my work and gaming:


AMD Ryzen 7 5800X (16) @ 3.8GHz
ASRock X570 Steel Legend
64GB ECC DDR4 @ 3600MT/s (overclocked from 3200 MT/s)
Older 24/7 homeserver:


AMD Ryzen 7 1700 (16) @ 3.0GHz
ASUS PRIME B350-PLUS
64GB ECC DDR4 @ 2666MT/s
With ECC RAM and Btrfs redundancy enabled, data integrity and early failure detection are easy. Definitely worth it.