[raiguard][1.1.106][Linux] crash during fork-saving, when child process gets OOM-Killed
Posted: Sat Apr 06, 2024 1:22 am
by someone1337
I use fork-saving (can't remember what it's called in game).
I explored in editor mode and ran a few commands to keep exploring while AFK, since mapgen takes a while. This made the map quite big, and it was still generating during the fork-save. Since fork uses copy-on-write memory, I guess it went boom because the RAM contents of the two processes diverged too fast ...
Long story short: the forked factorio saving process got OOM-killed. This in turn made the factorio main process die too.
What I would expect in such a case is a big red warning in the chat stating that fork-saving failed (ideally showing the reason, SIGKILL in this case), followed by a retry of the save that blocks/pauses the game as if fork-saving were not enabled.
Relevant parts of dmesg:
[Sat Apr 6 02:54:57 2024] Tasks state (memory values in pages):
[Sat Apr 6 02:54:57 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Sat Apr 6 02:54:57 2024] [ 73377] 1000 73377 5527248 2807998 39755776 0 200 factorio
[Sat Apr 6 02:54:57 2024] [ 124042] 1000 124042 5777336 2926700 40886272 0 200 factorio
[Sat Apr 6 02:54:57 2024] Out of memory: Killed process 124042 (factorio) total-vm:23109344kB, anon-rss:11706800kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:39928kB oom_score_adj:200
Posted: Sat Apr 06, 2024 3:13 am
by Rseding91
In general we don’t attempt to handle out of memory errors because there’s almost nothing that can be done in the standard execution path. But I’ll leave this to the Linux guys to decide if they want to try to handle this case. It seems unlikely it would succeed because if the fork ran out the main one likely is out and will shortly crash the next time it tries to allocate anything.
Posted: Sat Apr 06, 2024 8:23 pm
by someone1337
If I read that correctly, the main process was at 28 GB, out of the 32 GB my laptop has...
Anyway: I now loaded factorio from the previous successful autosave, which is not that far off from the one that exploded ... it only needs 9 GB of RAM ... are there maybe some memory leaks? ^^
I let it run and did exactly the same as yesterday... the fork-save that exploded now worked without any issue:
Max system RAM usage: 16 GB (11 GB factorio, 5 GB OS with chromium+thunderbird), but it spikes to 25 GB during fork-save, so it seems it actually does quite a lot of copy-on-write during fork-saving, even if mapgen does nothing.
In that case it would be really cool if the devs looked at a graceful fallback to blocking saving.
Posted: Sat Apr 06, 2024 10:15 pm
by Rseding91
Factorio on Linux is particularly bad about memory fragmentation and not releasing memory back to the OS when the process has freed it. Last I knew it was related to the C runtime library used and how the allocator there handles it.
So the longer you run a game instance, the higher the chance that an allocation fails.
And because of the way Linux works with the OOM killer, instead of letting Factorio know the allocation failed it just kills the process, which makes it really annoying to know whether a fork process failed because it crashed or because something else, like another process or the user, killed it.
Posted: Sat Apr 06, 2024 10:42 pm
by Rseding91
Actually, maybe the Linux guys will correct me, but there doesn’t seem to be any official way to detect your process has been killed due to OOM. So this whole thing would go into “not a bug”.
Posted: Wed Apr 10, 2024 7:18 am
by julus
Together, the two factorio processes were using about 22 GB of RAM (rss is counted in 4 KiB pages, so GiB = rss*4/1024/1024). Based on the limited output, it got killed by reaching a cgroup limit or possibly by systemd-oomd.
If I understand the fork-saving terminology, it basically makes a copy of the running factorio process, performs the save there and then exits, hence the temporary need for another 11 GB of RAM in this case.
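As a quick check of that arithmetic, here is a small Python sketch using the rss values from the dmesg output in the first post. The 4 KiB page size is an assumption (it is the usual x86-64 default), and the helper name is made up for illustration.

```python
# Assumed page size: 4 KiB (the common x86-64 default).
PAGE_KIB = 4


def pages_to_gib(pages: int, page_kib: int = PAGE_KIB) -> float:
    """Convert a page count from the OOM task dump into GiB."""
    return pages * page_kib / 1024 / 1024


# rss columns from the dmesg excerpt in the first post.
parent_rss_pages = 2807998  # pid 73377
child_rss_pages = 2926700   # pid 124042 (the one that got killed)

print(f"parent:   {pages_to_gib(parent_rss_pages):.1f} GiB")
print(f"child:    {pages_to_gib(child_rss_pages):.1f} GiB")
print(f"combined: {pages_to_gib(parent_rss_pages + child_rss_pages):.1f} GiB")
```

The child's 2926700 pages correspond exactly to the anon-rss:11706800kB in the "Killed process" line. Note that rss is reported per process, so pages still shared between parent and child after the fork are probably counted in both numbers, and the combined figure is an upper bound rather than the real extra usage.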
To handle this scenario ideally, the factorio can put +100 oom_score to saving fork so in case OOM occurs, only the fork is killed and original process survives (and message can be printed ingame that fork-saving failed).
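A minimal sketch of that idea in Python, assuming the save child would call it right after the fork. The function name and the +100 default are illustrative, not anything Factorio actually does. On Linux a process can raise its own oom_score_adj without extra privileges by writing to /proc/self/oom_score_adj; the valid range is -1000 to 1000.

```python
def raise_oom_score_adj(delta: int = 100,
                        path: str = "/proc/self/oom_score_adj") -> int:
    """Raise the calling process's oom_score_adj by `delta`, capped at 1000.

    `path` is a parameter only so the helper can be tried against a
    plain file; on a real system the default is the one to use.
    """
    with open(path) as f:
        current = int(f.read().strip())
    new_value = min(1000, current + delta)
    with open(path, "w") as f:
        f.write(str(new_value))
    return new_value


# Intended use (hypothetical): in the child, directly after fork():
#     pid = os.fork()
#     if pid == 0:
#         raise_oom_score_adj(100)
#         ...perform the save, then exit...
```

In the dmesg output above both processes already show oom_score_adj 200, so the adjustment is relative to whatever value the process inherited.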
Posted: Wed Apr 10, 2024 7:47 pm
by AndreasTPC
Rseding91 wrote: ↑Sat Apr 06, 2024 10:15 pm
Factorio on Linux is particularly bad about memory fragmentation and not releasing memory back to the OS when the process has freed it. Last I knew it was related to the C runtime library used and how the allocator there handles it.
Would it perhaps be worth using a non-system allocator for reducing memory issues on linux? Mimalloc, Jemalloc, and the like, are supposed to be better about handling memory fragmentation and releasing ram back to the os than the one that comes with libc.
(If someone wants to test you can load an alternate allocator with LD_PRELOAD, it's a drop-in replacement, no code changes needed).
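For anyone who wants to try this, a rough Python sketch of building the environment for such a test launch. The helper name and both paths in the commented-out call are assumptions for illustration; where the allocator library and the factorio binary live depends on the distribution and install.

```python
import os


def preload_env(allocator_so, base_env=None):
    """Return a copy of the environment with `allocator_so` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # ld.so accepts a colon-separated (or space-separated) list here.
    env["LD_PRELOAD"] = f"{allocator_so}:{existing}" if existing else allocator_so
    return env


# Hypothetical usage; both paths are placeholders, adjust them to your system:
# import subprocess
# subprocess.run(["./bin/x64/factorio"],
#                env=preload_env("/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"))
```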
Posted: Fri Apr 12, 2024 8:53 pm
by someone1337
Rseding91 wrote: ↑Sat Apr 06, 2024 10:42 pm
Actually, maybe the Linux guys will correct me, but there doesn’t seem to be any official way to detect your process has been killed due to OOM. So this whole thing would go into “not a bug”.
Maybe:
https://stackoverflow.com/questions/718 ... -a-process
What could surely be tested for is: "Did the forked save process terminate abnormally? -> retry blockingly (instead of make the parent process die horribly)".
julus wrote: ↑Wed Apr 10, 2024 7:18 am
To handle this scenario ideally, the factorio can put +100 oom_score to saving fork so in case OOM occurs, only the fork is killed and original process survives (and message can be printed ingame that fork-saving failed).
In combination with this one could make sure that the child and not the parent process gets oom killed.
The OOM killer tends to pick the biggest processes, but also short-running ones. So unless factorio gets started and immediately runs into an autosave that gets OOM-killed, this should not really be necessary, but there is also no reason not to do it.
Posted: Sun Apr 28, 2024 2:27 am
by Kallinger
On the last FFF I was told to add onto this here. As this all happened quite a while ago, I sadly don't have any crash logs anymore.
But I have also experienced a lot of crashes while saving, because another process (Firefox) was slowly eating up more memory in the background and at some point the forking of the game overwhelmed both RAM and swap. At that point my whole PC goes unresponsive until Factorio gets killed.
Would it be doable to check how much memory is left before the game gets forked? Sadly I have no idea if that's doable/easy to implement...
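One way such a check could look on Linux, as a rough Python sketch: read MemAvailable from /proc/meminfo (present since kernel 3.14) and compare it with the worst-case size of the copy. The function names and the worst-case fraction are made up for illustration, and the check is only a heuristic, since another process can eat the memory right after it passes.

```python
def mem_available_kib(meminfo_text: str) -> int:
    """Extract MemAvailable (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found in meminfo text")


def fork_save_looks_safe(process_rss_kib: int, meminfo_text: str,
                         worst_case_copy_fraction: float = 1.0) -> bool:
    """Heuristic: is there room for the child to copy this share of our memory?

    worst_case_copy_fraction=1.0 assumes every page might get duplicated
    by copy-on-write while the save is running (an assumption, not a
    measured value).
    """
    needed_kib = int(process_rss_kib * worst_case_copy_fraction)
    return mem_available_kib(meminfo_text) >= needed_kib


# Hypothetical usage before forking:
# with open("/proc/meminfo") as f:
#     if not fork_save_looks_safe(my_rss_kib, f.read()):
#         ...do a blocking save instead...
```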
Posted: Sun Apr 28, 2024 1:47 pm
by SoShootMe
someone1337 wrote: ↑Fri Apr 12, 2024 8:53 pm
Rseding91 wrote: ↑Sat Apr 06, 2024 10:42 pm
Actually, maybe the Linux guys will correct me, but there doesn’t seem to be any official way to detect your process has been killed due to OOM. So this whole thing would go into “not a bug”.
Maybe:
https://stackoverflow.com/questions/718 ... -a-process
What could surely be tested for is: "Did the forked save process terminate abnormally? -> retry blockingly (instead of make the parent process die horribly)".
Yeah; as far as I know the OOM killer more or less literally sends SIGKILL to the selected process so it's not possible to tell the difference between that and it being sent by some other process (eg the user running kill), but I don't see the relevance. Any abnormal termination should ideally cause fallback to a blocking save (maybe for the rest of the session). At minimum the effect should be the same as an error during a blocking save (which should be to report the error, eg "Async save process terminated with SIGKILL" for the OOM killer case). Exit status could also communicate handled failures (like I/O errors). That said, all this may be easier said than done within the existing codebase.
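To illustrate the general shape of this (not how Factorio's codebase actually does it), here is a small Python sketch: the parent inspects the child's wait status, reports either the terminating signal or a non-zero exit status, and falls back to a blocking save. All function names are hypothetical. A real game loop would presumably poll with WNOHANG instead of blocking in waitpid.

```python
import os
import signal


def run_forked_save(save_fn):
    """Run save_fn in a forked child. Return None on success, else an error string."""
    pid = os.fork()
    if pid == 0:
        # Child: do the save, then leave without running parent cleanup handlers.
        try:
            save_fn()
            os._exit(0)
        except BaseException:
            os._exit(1)  # handled failure, e.g. an I/O error
    # Parent: blocks here for simplicity.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"Async save process terminated with {name}"
    code = os.WEXITSTATUS(status)
    if code != 0:
        return f"Async save process exited with status {code}"
    return None


def save_with_fallback(save_fn, report=print):
    """Try a forked save; on any abnormal termination, report it and save blockingly."""
    error = run_forked_save(save_fn)
    if error is None:
        return True
    report(error + "; retrying as a blocking save")
    save_fn()  # blocking save in the main process
    return False
```

The wait status cannot say who sent the SIGKILL, which matches the point above: the parent only needs to know that the child did not finish cleanly.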
julus wrote: ↑Wed Apr 10, 2024 7:18 am
To handle this scenario ideally, the factorio can put +100 oom_score to saving fork so in case OOM occurs, only the fork is killed and original process survives (and message can be printed ingame that fork-saving failed).
In combination with this one could make sure that the child and not the parent process gets oom killed.
The OOM killer tends to pick the biggest processes, but also short-running ones. So unless factorio gets started and immediately runs into an autosave that gets OOM-killed, this should not really be necessary, but there is also no reason not to do it.
I think it's what you meant but it only makes sense to increase the likelihood that the save process is killed (rather than some other process) by the OOM killer if the behaviour of the main process in this case is improved. Another reason not to do it is that it is Linux-specific, but I suspect that's negligible given there must already be a framework for platform-specific code.