[0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt
Posted: Wed May 06, 2020 9:57 am
Well, this has been a bit of a trip. Sorry for the lengthy story; I feel it's important for understanding the circumstances, and there hopefully will be enough technical meat to provide concrete leads to follow as well.
I'm playing a long 4-player multiplayer game with a few mods, including in particular Krastorio2 (0.9.15) and Armoured Biters. We're running a dedicated server, and the players are 1x Mac, 1x Windows and 2x linux64, with myself in the last category. The server would pause when nobody is connected.
The first sign that something was amiss was that last night, after the (Mac-based) admin went to sleep and I reconnected, I kept getting instant desyncs the moment I tried to reconnect to the game. I shrugged it off and went to sleep; the next day I came fairly late and the game had been going for a while, but I could reconnect without issue.
The next day was also when we decided to add Rampant, because the biter threat had become negligible. This worked well enough for a while, but then suddenly everyone desynced. Upon restarting, we found that the linux64 players were now continuously afflicted by the desync-on-connect issue. The Mac-based admin could connect and play without issue, and the Windows player reported mixed results which I unfortunately don't have a full account of.
The admin didn't put much stock in my stories of nighttime desyncs and concluded that Rampant was the most likely culprit. Upon removing Rampant, everyone could reconnect without issue; but of course, this resulted in most biters being deleted from the map altogether. To restore some semblance of challenge, the admin went around and manually spawned biter spawners around our periphery, and then went to sleep after giving me ingame admin rights to perform any further meddling as necessary.
After some observation, I realised that these manually created biter spawners were largely inert. Even after reactivating the vanilla AI (game.forces.enemy.ai_controllable=true), they only exhibited a subset of their vanilla behaviour, and in particular would not spawn new bases. I speed-read some mod code and settled on invoking a short incantation, reproduced in the Lua listing further down, that orders every enemy unit on the surface to build a base where it stands (sorry if it's self-evidently stupid, I've played the game for all of 4 days and haven't even touched Lua before this).
For me, with this, everything went south. The game crashed (I foolishly did not save the log this first time), and subsequent attempts to reconnect to the server resulted in the same desync-on-connect issue that we had earlier. Meanwhile, the Windows player was still awake and doing just fine. Finally, I saved the desync report and restarted locally from the desynced-level.zip contained in it. This worked fine for something on the order of a minute (and I was pleased to see that the biters that had come out of the handmade spawners earlier did indeed settle where they were standing)... and then the game crashed again. This time, I took note of the stack trace, which was rather curious. I'm attaching the whole thing, but the interesting slice is the one reproduced in the stack trace listing further down.
What makes a function called RandomGenerator::getInt segfault? (The crash was a SIGSEGV, so this is certainly not a "rand()%0"-type issue; an integer division by zero would have raised SIGFPE instead.) My best guess, considering that frame #4, for which we have no debug data, sits at an address very close to that of RandomGenerator::getInt, is that this is some stack-smashing problem; but without seeing the source, this is pretty blind guesswork. Throughout all this, the Windows player has been reporting no issues whatsoever, and is continuing to play on the server where all biters have been made to settle in place.
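As a quick sanity check on the "very close" remark, the gaps between the return addresses in the stack trace slice can be computed directly. This is only arithmetic on the addresses quoted in this post; the names `frames` and `gap` are made up for illustration, and proximity by itself does not prove stack corruption.

```python
# Return addresses copied verbatim from the stack trace slice in this post.
frames = {
    4: 0x0000000000C6679F,  # unresolved frame ("?? at ??:0")
    5: 0x0000000000C66C20,  # RandomGenerator::getInt()
    6: 0x000000000117816A,  # RandomGenerator::uniformDouble()
    7: 0x0000000001178269,  # BuildBaseBehavior::findBuildingPosition(...)
}


def gap(a: int, b: int) -> int:
    """Absolute distance in bytes between the addresses of frames a and b."""
    return abs(frames[a] - frames[b])


# Print the distance between each pair of adjacent frames.
for lower, upper in [(4, 5), (5, 6), (6, 7)]:
    distance = gap(lower, upper)
    print(f"#{lower} <-> #{upper}: {distance} bytes ({hex(distance)})")
```

Frames #4 and #5 come out 1153 bytes (0x481) apart, and frames #6 and #7 are 255 bytes (0xff) apart, so the unresolved frame does sit right next to the getInt address.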
Either way, this makes it seem not unlikely to me that the actual issue is due to some unsafe memory operations related to RandomGenerator::getInt and/or BuildBaseBehavior::findBuildingPosition, which for one reason or another may be more relevant in the linux64 build than on other platforms (but this might just have to do with platform peculiarities like contents of uninitialised memory or ASLR or whatever). The desyncs always occurred in busy-biter situations (the first time for me alone when the game had advanced pretty far, then with Rampant's rampant biters, and finally when all the non-Rampant biters were made to settle new bases all at once), so it seems plausible to me that they were caused by a non-crashing instance of the same bug (where getInt merely returned a value that is not the pseudorandom number it was supposed to return, causing the simulations to diverge). I would appreciate it if you could look into this, even though on the surface it involves shady admin commands and a lot of mods.
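To illustrate the "non-crashing instance" hypothesis, here is a minimal sketch of how a single bad generator state makes two otherwise identical deterministic simulations diverge for good. Everything in it is an assumption for illustration: `Lcg` is a toy generator (not Factorio's RandomGenerator), `simulate` is a stand-in for game logic, and the one-bit flip merely models some stray memory write.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MASK64 = (1 << 64) - 1


@dataclass
class Lcg:
    """Toy 64-bit linear congruential generator (hypothetical, not the game's)."""
    state: int

    def get_int(self) -> int:
        # Knuth's MMIX constants; any deterministic update rule would do.
        self.state = (self.state * 6364136223846793005 + 1442695040888963407) & MASK64
        return self.state >> 33


def simulate(seed: int, ticks: int, corrupt_at: Optional[int] = None) -> List[Tuple[int, int]]:
    """Run a toy world and return a (position, rng state) snapshot per tick."""
    rng = Lcg(seed)
    position = 0
    snapshots = []
    for tick in range(ticks):
        if tick == corrupt_at:
            # Hypothetical fault: one bit of generator state gets flipped.
            rng.state ^= 1
        # "Game logic": a unit moves by a pseudorandom offset in [-3, 3].
        position += rng.get_int() % 7 - 3
        snapshots.append((position, rng.state))
    return snapshots


def first_divergence(a, b) -> Optional[int]:
    """Return the first tick at which two snapshot lists differ, else None."""
    for tick, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return tick
    return None


healthy = simulate(seed=42, ticks=20)
faulty = simulate(seed=42, ticks=20, corrupt_at=5)
print("first diverging tick:", first_divergence(healthy, faulty))
```

The two runs agree on ticks 0 to 4 and never agree again from tick 5 onwards, which is the kind of silent, permanent split between peers that a desync check would eventually flag.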
Most recent desync report: http://twilightro.kafuka.org/~blackhole ... -54-21.zip
Last night's desync report, first one on that game (before Rampant was ever added): http://twilightro.kafuka.org/~blackhole ... -30-34.zip
Code (Lua incantation):
for _,e in pairs(game.players[1].surface.find_entities_filtered{force="enemy",type="unit"}) do
  e.set_command({type=defines.command.build_base,destination=e.position,distraction=defines.distraction.by_enemy,ignore_planner=true})
end
Code (stack trace slice):
...
#3 0x0000000000039fe0 in CrashHandler::SignalHandler(int) at /tmp/factorio-build-XqwiXo/src/Util/CrashHandler.cpp:638
#4 0x0000000000c6679f in ?? at ??:0
#5 0x0000000000c66c20 in RandomGenerator::getInt() at /tmp/factorio-build-XqwiXo/src/Util/RandomGenerator.cpp:79
#6 0x000000000117816a in RandomGenerator::uniformDouble() at /tmp/factorio-build-XqwiXo/src/Util/RandomGenerator.cpp:74
#7 0x0000000001178269 in BuildBaseBehavior::findBuildingPosition(EntityPrototype const&, BoundingBox const&, CollisionMask) at /tmp/factorio-build-XqwiXo/src/AI/BuildBaseBehavior.cpp:311
...