[raiguard] [2.0.9] Crash trying to delete .bak.zip save files on CephFS

We are aware of them, but they have low priority. We have more important things to do. They go here in order not to take space in the main bug thread list.
sirscriptsalot
Manual Inserter
Manual Inserter
Posts: 3
Joined: Wed Oct 23, 2024 10:16 am
Contact:

[raiguard] [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by sirscriptsalot »

I recently started a new server for Space Age, and the game files are stored on a CephFS network file system. This is the first time I have tried to run Factorio on a CephFS file system, I used NFS when running Factorio before.

When trying to overwrite an older autosave, the game fails to remove the .bak.zip that is created, giving either a non-fatal warning, or often, a fatal error, crashing the server.

Code: Select all

Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave47.bak.zip failed: Invalid argument

Code: Select all

Error AsyncScenarioSaver.cpp:197: Saving failed: Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument
The game seems to be completely unable to delete these .bak.zip files, and the saves folder fills up with .bak.zip files.

There seems to be an issue with deleting these files, which is causing this warning and error. It is also strange how the game crashes because it is unable to delete the files.

I believe the underlying hardware, network, OS, and file permissions are healthy, as all the files have uid/gid 845/845 (same as server process), and I have no trouble deleting the files manually with rm, even while the game is running. I never run more than 1 factorio container at a time.

Multiple other applications run on the same CephFS system without issues. By contrast, Factorio consistently fails to delete the files.

Maybe related, CephFS is notable for having high latency for fsync calls, as it has to complete the fsync on multiple storage servers before returning. This maybe takes about 50ms on average on my system, though I imagine I could have spikes up to 1 second.

The server is running on an AMD ryzen 5600X on linux 6.6.54 in the factoriotools 2.0.9 docker container on Kubernetes, with /factorio and /factorio/saves mounted as CephFS filesystems, and /factorio/config/server-settings.json and /factorio/config/rconpw mounted as read-only tmpfs filesystems.

Logs below:

server init example

Code: Select all

   0.000 2024-10-23 09:07:46; Factorio 2.0.9 (build 79509, linux64, headless, space-age)
   0.008 Operating system: Linux
   0.010 Program arguments: "/opt/factorio/bin/x64/factorio" "--port" "34197" "--server-settings" "/factorio/config/server-settings.json" "--server-banlist" "/factorio/config/server-banlist.json" "--rcon-port" "27015" "--server-whitelist" "/factorio/config/server-whitelist.json" "--use-server-whitelist" "--server-adminlist" "/factorio/config/server-adminlist.json" "--rcon-password" <private> "--server-id" "/factorio/config/server-id.json" "--mod-directory" "/factorio/mods" "--start-server-load-latest"
   0.013 Config path: /opt/factorio/config/config.ini
   0.017 Read data path: /opt/factorio/data
   0.025 Write data path: /factorio [51200/51200MB]
   0.028 Binaries path: /opt/factorio/bin
   0.042 System info: [CPU: AMD Ryzen 5 5600X 6-Core Processor, 12 cores, RAM: 64196 MB]
   0.045 Environment: DISPLAY=<unset> WAYLAND_DISPLAY=<unset> DESKTOP_SESSION=<unset> XDG_SESSION_DESKTOP=<unset> XDG_CURRENT_DESKTOP=<unset> SDL_VIDEODRIVER=<unset> __GL_FSAA_MODE=<unset> __GL_LOG_MAX_ANISO=<unset> __GL_SYNC_TO_VBLANK=<unset> __GL_SORT_FBCONFIGS=<unset> __GL_YIELD=<unset>
   0.139 Running in headless mode
   0.145 Audio is disabled
   0.154 Info ModManager.cpp:434: FeatureFlag expansion-shaders = true
   0.159 Info ModManager.cpp:434: FeatureFlag freezing = true
   0.164 Info ModManager.cpp:434: FeatureFlag quality = true
   0.168 Info ModManager.cpp:434: FeatureFlag rail-bridges = true
   0.174 Info ModManager.cpp:434: FeatureFlag segmented-units = true
   0.177 Info ModManager.cpp:434: FeatureFlag space-travel = true
   0.180 Info ModManager.cpp:434: FeatureFlag spoiling = true
   0.188 Loading mod core 0.0.0 (data.lua)
   0.208 Loading mod base 2.0.9 (data.lua)
   0.387 Loading mod elevated-rails 2.0.9 (data.lua)
   0.421 Loading mod quality 2.0.9 (data.lua)
   0.444 Loading mod space-age 2.0.9 (data.lua)
   0.792 Loading mod base 2.0.9 (data-updates.lua)
   0.813 Loading mod quality 2.0.9 (data-updates.lua)
   0.849 Loading mod space-age 2.0.9 (data-updates.lua)
   0.959 Checksum for core: 2820763882
   1.147 Checksum of base: 1355320214
   1.154 Checksum of elevated-rails: 1891730324
   1.170 Checksum of quality: 3715527832
   1.201 Checksum of space-age: 440491908
   1.691 Prototype list checksum: 984080119
   1.844 Info PlayerData.cpp:66: Local player-data.json available, timestamp 1729674461
   1.850 Info PlayerData.cpp:73: Cloud player-data.json unavailable
   1.865 Factorio initialised
Failing to delete file async

Code: Select all

 159.173 Info AppManager.cpp:290: Saving to _autosave47 (non-blocking).
 159.413 Info AsyncScenarioSaver.cpp:162: Saving process PID: 247
 160.379 [247] Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave47.bak.zip failed: Invalid argument
 160.413 Info ChildProcessAgent.cpp:61: Child 247 exited with return value 0
 160.413 Info AppManager.cpp:291: Saving finished
 279.339 Info AppManager.cpp:290: Saving to _autosave48 (non-blocking).
 279.422 Info AsyncScenarioSaver.cpp:162: Saving process PID: 288
 280.356 [288] Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave48.bak.zip failed: Invalid argument
 280.395 Info ChildProcessAgent.cpp:61: Child 288 exited with return value 0
 280.395 Info AppManager.cpp:291: Saving finished
 399.339 Info AppManager.cpp:290: Saving to _autosave49 (non-blocking).
 399.385 Info AsyncScenarioSaver.cpp:162: Saving process PID: 329
 400.093 [329] Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave49.bak.zip failed: Invalid argument
 400.122 Info ChildProcessAgent.cpp:61: Child 329 exited with return value 0
 400.122 Info AppManager.cpp:291: Saving finished
 519.339 Info AppManager.cpp:290: Saving to _autosave50 (non-blocking).
 519.382 Info AsyncScenarioSaver.cpp:162: Saving process PID: 370
 523.203 [370] Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave50.bak.zip failed: Invalid argument
 523.237 Info ChildProcessAgent.cpp:61: Child 370 exited with return value 0
 523.240 Info AppManager.cpp:291: Saving finished
 639.339 Info AppManager.cpp:290: Saving to _autosave51 (non-blocking).
 639.393 Info AsyncScenarioSaver.cpp:162: Saving process PID: 411
 654.040 [411] Warning WriteFileGuard.cpp:51: Writing /factorio/saves/_autosave51.zip failed; previous version (if any) should still be available
 654.056 Error AsyncScenarioSaver.cpp:197: Saving failed: Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument
 654.073 Error AsyncScenarioSaver.cpp:197: Saving failed: Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument
 654.089 Error AsyncScenarioSaver.cpp:197: Saving failed: Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument
 654.106 Error AsyncScenarioSaver.cpp:197: Saving failed: Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument
 654.115 Info ChildProcessAgent.cpp:61: Child 411 exited with return value 1
 654.118 Error Util.cpp:81: Attempting to create notice box in headless mode. Message: 'Couldn't remove /factorio/saves/_autosave51.bak.zip: Invalid argument'
Failing to delete file sync

Code: Select all

 410.666 Info AppManager.cpp:310: Saving to _autosave52 (blocking).
 413.017 Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave52.bak.zip failed: Invalid argument
 413.033 Info AppManagerStates.cpp:2045: Saving finished
 713.033 Info AppManager.cpp:310: Saving to _autosave53 (blocking).
 714.739 Warning WriteFileGuard.cpp:51: Writing /factorio/saves/_autosave53.zip failed; previous version (if any) should still be available
 714.740 Error ParallelScenarioSaver.cpp:156: Saving scenario failed: Couldn't remove /factorio/saves/_autosave53.bak.zip: Invalid argument
 714.750 Error Util.cpp:81: Attempting to create notice box in headless mode. Message: 'Cannot save map: Couldn't remove /factorio/saves/_autosave53.bak.zip: Invalid argument'
Thanks for any help you can offer. Let me know if there's anything I can help troubleshoot with.

I will be moving the saves folder to ext4 on RADOS Block (Ceph equivalent of iSCSI) to see if this makes my server stop crashing. I will report back with results.
eugenekay
Fast Inserter
Fast Inserter
Posts: 165
Joined: Tue May 15, 2018 2:14 am
Contact:

Re: [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by eugenekay »

Thoughts from a fellow Factorio-on-Linux enthusiast: Try attaching strace to the Factorio process, looking only at file I/O operations. You can then compare the arguments passed to the filesystem more directly:

Code: Select all

strace -e trace=read,write -p `pidof factorio`
strace rm _autosave.47.bak.zip
My best-educated-guess is that Factorio is not aware of the extended attributes used by this filesystem. You can examine these using `getfattr` as in https://docs.ceph.com/en/mimic/cephfs/f ... h-getfattr

A second guess: Do you have Snapshots enabled on the CephFS volume? Their presence causes a dramatic slowdown in the removal time, though I don't think this would lead to the "Invalid Argument" code path, just a fsync hang....
sirscriptsalot
Manual Inserter
Manual Inserter
Posts: 3
Joined: Wed Oct 23, 2024 10:16 am
Contact:

Re: [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by sirscriptsalot »

Hi eugenekay, thanks for the tip about strace!

I added a sidecar with CAP_SYS_PTRACE in a shared pid namespace to the factorio pod like so:

Code: Select all

spec:
  shareProcessNamespace: true
  containers:
    - name: factorio
      image: docker.io/factoriotools/factorio:2.0.10
      ...
    - name: debug
      image: debian
      command:
        - tail
      args:
        - -f
        - /dev/null
      securityContext:
        capabilities:
          add:
            - SYS_PTRACE
      volumeMounts:
        - name: factorio
          mountPath: /factorio/
        - name: saves
          mountPath: /factorio/saves/

I then execed into the debug container and installed and ran strace to compare rm's behavior and factorio's behavior:

Code: Select all

root@factorio-0:/# apt update && apt install -y strace procps
(lines omitted)
root@factorio-0:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65535          1  0.0  0.0    972   404 ?        Ss   10:56   0:00 /pause
845            7  6.0  1.5 1702148 1050556 ?     Ssl  10:56   0:06 /opt/factorio/bin/x64/factorio --port 34197 --server-settings /factorio/config/server-settings.json --server-banlist /factorio/config/server-banlist.json --rcon-port 27015 --server-whitelist
root         212  0.0  0.0   2516  1172 ?        Ss   10:56   0:00 tail -f /dev/null
root         223  0.0  0.0   4188  3012 pts/0    Ss   10:57   0:00 bash
root         434  0.0  0.0   8088  3884 pts/0    R+   10:58   0:00 ps aux
root@factorio-0:/# ls /factorio/saves/ -l       
total 4790001
(lines omitted)
-rw-rw-r-- 1 845 845 13531395 Oct 23 16:11 _autosave44.bak.zip
root@factorio-0:/# useradd -u 845 factorio
useradd warning: factorio's uid 845 outside of the UID_MIN 1000 and UID_MAX 60000 range.
root@factorio-0:/# su -c 'strace rm /factorio/saves/_autosave44.bak.zip' factorio
execve("/usr/bin/rm", ["rm", "/factorio/saves/_autosave44.bak."...], 0x7ffe4bf48c58 /* 27 vars */) = 0
brk(NULL)                               = 0x5573f23ae000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efe440db000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=5150, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 5150, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7efe440d9000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20t\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1922136, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7efe43ef8000
mmap(0x7efe43f1e000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7efe43f1e000
mmap(0x7efe44073000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7efe44073000
mmap(0x7efe440c6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7efe440c6000
mmap(0x7efe440cc000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7efe440cc000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efe43ef5000
arch_prctl(ARCH_SET_FS, 0x7efe43ef5740) = 0
set_tid_address(0x7efe43ef5a10)         = 450
set_robust_list(0x7efe43ef5a20, 24)     = 0
rseq(0x7efe43ef6060, 0x20, 0, 0x53053053) = 0
mprotect(0x7efe440c6000, 16384, PROT_READ) = 0
mprotect(0x5573d541f000, 4096, PROT_READ) = 0
mprotect(0x7efe4410d000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7efe440d9000, 5150)            = 0
getrandom("\xe9\xd4\x4c\xa5\xdb\xd9\xd1\xf4", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x5573f23ae000
brk(0x5573f23cf000)                     = 0x5573f23cf000
ioctl(0, TCGETS, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
newfstatat(AT_FDCWD, "/factorio/saves/_autosave44.bak.zip", {st_mode=S_IFREG|0664, st_size=13637043, ...}, AT_SYMLINK_NOFOLLOW) = 0
geteuid()                               = 845
newfstatat(AT_FDCWD, "/factorio/saves/_autosave44.bak.zip", {st_mode=S_IFREG|0664, st_size=13637043, ...}, AT_SYMLINK_NOFOLLOW) = 0
faccessat2(AT_FDCWD, "/factorio/saves/_autosave44.bak.zip", W_OK, AT_EACCESS) = 0
unlinkat(AT_FDCWD, "/factorio/saves/_autosave44.bak.zip", 0) = 0
lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
root@factorio-0:/# 
I then started my game client, joined the server, and ran that strace against the factorio process, but I realized that it didn't capture anything due to the asnyc save child process, so I restarted the server with synchronous saving.

After restarting, I reconnected to the server and then began the strace, writing the output to a file:

Code: Select all

root@factorio-0:/# strace -o /factorio/rmfilestrace.txt -e trace=read,write -p `pidof factorio`
strace: Process 7 attached
After ~2 minutes, the server tried to delete /factorio/saves/_autosave66.bak.zip and had a non-fatal warning about failing to delete the file. The logs during the error were as follows:

Code: Select all

358.404 Info AppManager.cpp:310: Saving to _autosave66 (blocking).
359.621 Warning WriteFileGuard.cpp:154: Removing /factorio/saves/_autosave66.bak.zip failed: Invalid argument
359.638 Info AppManagerStates.cpp:2050: Saving finished
I then stopped the strace, and the strace file looked like this:

Code: Select all

write(4, " 358.404 Info AppManager.cpp:310"..., 68) = 68
write(1, " 358.404 Info AppManager.cpp:310"..., 68) = 68
write(17, "\n{\n  \"available-campaign-levels\""..., 8192) = 8192
write(17, " \"status\": \"locked\",\n      \"skip"..., 8192) = 8192
write(17, "version\": 79578,\n    \"build_mode"..., 142) = 142
write(4, " 359.638 Info AppManagerStates.c"..., 57) = 57
write(1, " 359.638 Info AppManagerStates.c"..., 57) = 57
I do see those ceph extended attributes on the save files:

Code: Select all

root@factorio-0:/# getfattr -n ceph.file.layout /factorio/saves/_autosave45.bak.zip 
getfattr: Removing leading '/' from absolute path names
# file: factorio/saves/_autosave45.bak.zip
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=ssdfs-replicated"

root@factorio-0:/#
I have some snapshots in this cephfs volume, but I don't have any snapshots for the subvolumes I'm using for Factorio. I'm using ceph-csi to create and manage snapshots for specific subvolumes through Kubernetes, so I don't really know that much about how they work.

`rm`ing the files manually seems to be quite fast.

Code: Select all

root@factorio-0:/# time rm /factorio/saves/_autosave43.bak.zip

real	0m0.019s
user	0m0.000s
sys	0m0.001s
root@factorio-0:/#
I hope that helps, let me know if you have any ideas.
eugenekay
Fast Inserter
Fast Inserter
Posts: 165
Joined: Tue May 15, 2018 2:14 am
Contact:

Re: [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by eugenekay »

Awesome! I like this setup, it really appeals to my Nerdery. :-D

Can you attach the strace log from the Factorio process itself please? The most interesting lines to compare will be near the end, including calls to ioctl(), newfstatat(), faccessat(), unlinkat(). It is also possible that Factorio calls entirely different C functions than `rm`.


From `ls -l`, "total 4790001". Do you really have that many Savefiles in this directory!? By itself this shouldn't break anything in a headless Server, but I would expect a GUI client to be..... very slow.... in enumerating this many savegames. I try to keep directories under 65,000, for sanity. Presumably, having auto-delete work would drop this number by quite a bit anyway.
sirscriptsalot
Manual Inserter
Manual Inserter
Posts: 3
Joined: Wed Oct 23, 2024 10:16 am
Contact:

Re: [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by sirscriptsalot »

Yeah, Kubernetes is really powerful for servers!

Right, a full strace without the read/write filter might be more useful.

I've switched the saves folder over to an ext4 fs backed by ceph rbd, and I am no longer getting the "Invalid argument" error while autosaving. I wrote a little backup script into a sidecar container so that the latest save is copied over to the more durable cephfs storage every 30 minutes.

Code: Select all

      containers:
        - name: backup # Makes a backup of the latest save every 30 min
          image: debian:bookworm-slim
          command:
            # https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/
            - "bash"
            - "-c"
            # https://unix.stackexchange.com/questions/195304/how-do-i-copy-the-latest-file-from-one-directory-to-another
            # https://unix.stackexchange.com/questions/278939/how-do-you-put-date-and-time-in-a-file-name
            - 'while true; do cp -p `ls -dtr1 /saves/*.zip | tail -1` "/backups/$(date -u +"%FT%T").zip"; sleep 1800; done'
          volumeMounts:
            - name: saves
              mountPath: /saves/
            - name: backups
              mountPath: /backups/
I'll go look into spinning up a second server and getting that full strace for you tomorrow.

I believe "total 4790001" is the total size of the save files, aka 4GB. I have autosave slots set to 180, and then there's the .bak.zip files that built up, so there's only a bit over 300 files.
User avatar
raiguard
Factorio Staff
Factorio Staff
Posts: 669
Joined: Wed Dec 13, 2017 8:29 pm
Contact:

Re: [raiguard] [2.0.9] Crash trying to delete .bak.zip save files on CephFS

Post by raiguard »

Thanks for the report. We just use std::filesystem::remove_all, so if it doesn't support CephFS then there isn't much I can easily do. This is an extremely exotic usecase.

Side note: I really need to get a personal server up and running with some kind of containerization software so I can easily spin up VMs to test things like this...
Don't forget, you're here forever.
Post Reply

Return to “Minor issues”