[Oxyd] [0.16.51] non-blocking saving + desync = data loss

Therax
Filter Inserter
Posts: 470
Joined: Sun May 21, 2017 6:28 pm

Post by Therax »

Running a 0.16.51 headless server on Arch Linux, kernel 4.18.16 x86_64. I have attached a log with all the information I have. Here are the highlights, as near as I can reconstruct them:

1) The server starts and appears to run flawlessly until 2019-02-02T20:44.
2) At this time the server saves _autosave5.zip.
3) At 20:54 and 21:04, the server logs that it has successfully saved _autosave1.zip and _autosave2.zip, but these files are not actually saved.
4) At 21:10, the sole connected player desyncs and is disconnected. The server logs that it saves py.zip at this time, but this save also fails.
5) The player reconnects and continues playing without issue. During this time the server logs several more successful autosaves and saves to the main py.zip file, all of which appear to have actually failed.
6) After several more hours of playtime and ~32 more failed autosaves, the server crashes on a failed fork with errno 12 (ENOMEM).
7) When I attempt to restart the server some time later, it finds bad *.tmp.zip files that it cannot load.
8) Looking in the saves directory, the most recent valid save is 2019-02-02T20:44, and every save slot has a more recent incomplete *.tmp.zip file.
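
An intact older save sitting next to a newer, incomplete *.tmp.zip is what a write-to-temp-then-rename scheme leaves behind when the writer never finishes. Below is a minimal Python sketch of that general pattern. It is not Factorio's actual code; `save_atomically` and the way the temp name is derived are assumptions for illustration only.

```python
import os


def save_atomically(final_path: str, data: bytes) -> None:
    """Write data to a sibling temp file, then rename it over final_path."""
    # Hypothetical naming: "name.zip" -> "name.tmp.zip".
    root, ext = os.path.splitext(final_path)
    tmp_path = f"{root}.tmp{ext}"

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming

    # The rename replaces the old save in one step. If the writer dies
    # before this line, final_path still holds the previous good save
    # and only the temp file is left incomplete.
    os.replace(tmp_path, final_path)
```

With a scheme like this, a writer that stalls or is killed midway never damages the last good save, but it also never produces a new one.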

Here is my hypothesis: the server logs "Saving finished" when the fork is complete, even though the child process may still fail writing out the save. Each of the child processes somehow got blocked while trying to write out the save, and never exited. Eventually, after ~32 child processes were running, all trying to save, the server memory was exhausted and all the processes were terminated when the parent process exited. We have lost several hours of playtime.
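
To illustrate the kind of guard that would keep stuck children from piling up, here is a rough Python sketch of a parent that refuses to fork a new save while the previous child is still alive. Everything in it (`SaveSupervisor`, `start_save`, `reap`, `last_status`) is hypothetical and only shows the idea; it says nothing about how the server is actually implemented.

```python
import os


class SaveSupervisor:
    """Allows at most one forked save child at a time (illustrative only)."""

    def __init__(self):
        self.child_pid = None    # pid of the save child still in flight, if any
        self.last_status = None  # raw wait status of the last reaped child

    def reap(self) -> None:
        """Collect the child if it has finished, without blocking."""
        if self.child_pid is None:
            return
        pid, status = os.waitpid(self.child_pid, os.WNOHANG)
        if pid != 0:  # 0 means the child is still running
            self.child_pid = None
            self.last_status = status

    def start_save(self, write_save) -> bool:
        """Fork a child that runs write_save; return False if one is still running."""
        self.reap()
        if self.child_pid is not None:
            return False  # previous save has not exited, do not fork another
        pid = os.fork()
        if pid == 0:
            # Child: do the write, then leave without running parent cleanup.
            try:
                write_save()
            except BaseException:
                os._exit(1)
            os._exit(0)
        self.child_pid = pid
        return True
```

A caller would log success only once `reap()` has recorded a `last_status` of 0, and could treat a repeated `False` from `start_save` as a sign that a save is stuck.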
Attachments
debug.zip
(13.65 KiB) Downloaded 96 times

Oxyd
Former Staff
Posts: 1428
Joined: Thu May 07, 2015 8:42 am

Re: [Oxyd] [0.16.51] non-blocking saving + desync = data loss

Post by Oxyd »

It logs “Saving finished” when the child exits. Also there are no children running when the server exits.

I don't know what exactly happened in your case, but I found an issue where if the child process received a signal, the parent would think the child exited normally and without error. So I fixed that in 0.17.53. At least now it shouldn't fail silently.
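
For readers unfamiliar with this class of bug, here is a small Python sketch of how a parent can tell a normal exit from death by signal when it waits on a forked child. It uses the standard `os.waitpid` status helpers. The function names `describe_child_status` and `run_in_child` are made up for this example, and this is not the actual fix that went into the game.

```python
import os
import signal


def describe_child_status(status: int) -> str:
    """Interpret a raw wait status as returned by os.waitpid."""
    if os.WIFSIGNALED(status):
        # The child never reached its own exit call; a signal ended it.
        return f"killed by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        return "ok" if code == 0 else f"exit code {code}"
    return "unknown"


def run_in_child(work) -> str:
    """Fork, run work() in the child, and report how the child ended."""
    pid = os.fork()
    if pid == 0:
        try:
            work()
        except BaseException:
            os._exit(1)
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return describe_child_status(status)
```

A parent that only looks at the exit code, without first checking whether the child was signaled, can mistake a killed child for a clean exit, which matches the silent failure described above.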
