
[Oxyd] [0.16.51] non-blocking saving + desync = data loss

Posted: Thu Feb 07, 2019 7:17 pm
by Therax
Running 0.16.51 headless server on Arch Linux, kernel 4.18.16 x86_64. I have attached a log with all the information I have. Here are the highlights, as best I can reconstruct what happened:

1) The server starts and appears to run flawlessly until 2019-02-02T20:44.
2) At this time the server saves _autosave5.zip.
3) At 20:54 and 21:04, the server logs that it has successfully saved _autosave1.zip and _autosave2.zip, but these files are not actually saved.
4) At 21:10, the sole connected player desyncs, and is disconnected. The server logs that it saves the py.zip at this time, but this save also fails.
5) The player reconnects and continues playing without issue. During this time the server logs several more successful autosaves and saves to the main py.zip file, all of which appear to have actually failed.
6) After several more hours of playtime and ~32 more failed autosaves, the server crashes on a failed fork with errno 12 (ENOMEM).
7) When I attempt to restart the server some time later, it finds bad *.tmp.zip files that it cannot load.
8) Looking in the saves directory, the most recent valid save is from 2019-02-02T20:44, and every save slot has a more recent incomplete *.tmp.zip file.

Here is my hypothesis: the server logs "Saving finished" when the fork is complete, even though the child process may still fail writing out the save. Each of the child processes somehow got blocked while trying to write out the save, and never exited. Eventually, after ~32 child processes were running, all trying to save, the server memory was exhausted and all the processes were terminated when the parent process exited. We have lost several hours of playtime.
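To make the distinction in that hypothesis concrete, here is a minimal Python sketch of a fork-based background save. It is not Factorio's code: `save_in_background`, `poll_save`, and the `.tmp` suffix are made-up names, and the write-to-temp-then-rename step is only my assumption based on the leftover temp files. The point is that a successful `fork()` says nothing about the save itself; only the child's reaped exit status does.

```python
import os
import tempfile


def save_in_background(data: bytes, final_path: str) -> int:
    """Fork a child that writes the save; return the child's PID at once."""
    pid = os.fork()
    if pid != 0:
        # Parent: the fork succeeded, but the save is NOT finished yet.
        return pid

    # Child: write to a temporary file, then rename it over the real save.
    status = 1
    try:
        tmp_path = final_path + ".tmp"  # hypothetical naming scheme
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)
        status = 0
    finally:
        # Leave without running the parent's cleanup handlers.
        os._exit(status)


def poll_save(pid: int):
    """Return None while the child is still running, else True or False."""
    reaped, status = os.waitpid(pid, os.WNOHANG)
    if reaped == 0:
        return None  # child still busy; do not log "Saving finished" yet
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

In this sketch the server would call `poll_save` once per tick and log success only when it returns `True`. A child that hangs forever keeps returning `None`, so it would show up as a save that never finishes rather than as one that succeeded.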

Re: [Oxyd] [0.16.51] non-blocking saving + desync = data loss

Posted: Fri Jun 28, 2019 4:13 pm
by Oxyd
It logs “Saving finished” when the child exits. Also there are no children running when the server exits.

I don't know what exactly happened in your case, but I found an issue where if the child process received a signal, the parent would think the child exited normally and without error. So I fixed that in 0.17.53. At least now it shouldn't fail silently.
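For anyone curious how a bug of that kind can arise in general, here is a minimal Python sketch. It is not the actual Factorio code, and all the function names are made up for illustration. It shows one plausible mechanism: on Linux, at least, the exit-code bits of a wait status are zero when the child was killed by a signal, so a check that looks only at the exit code reports success.

```python
import os
import signal


def naive_child_ok(status: int) -> bool:
    # Buggy check: reads only the exit-code bits, which are 0 when the
    # child died from a signal (observed on Linux).
    return os.WEXITSTATUS(status) == 0


def child_ok(status: int) -> bool:
    # Correct check: the child must have exited normally AND returned 0.
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def describe(status: int) -> str:
    """Turn a raw wait status into a message suitable for a log line."""
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return "unknown"


def run_child_killed_by_signal() -> int:
    """Fork a child that kills itself with SIGKILL; return its wait status."""
    pid = os.fork()
    if pid == 0:
        os.kill(os.getpid(), signal.SIGKILL)
        os._exit(0)  # not reached
    _, status = os.waitpid(pid, 0)
    return status
```

With the status returned by `run_child_killed_by_signal()`, `naive_child_ok` returns `True` while `child_ok` returns `False`, and `describe` gives the parent something concrete to log instead of failing silently.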