[0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt

This subforum contains all the issues which we already resolved.
Post Reply
2bpp
Manual Inserter
Manual Inserter
Posts: 2
Joined: Wed May 06, 2020 9:04 am
Contact:

[0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt

Post by 2bpp »

Well, this has been a bit of a trip. Sorry for the lengthy story; I feel it's important for understanding the circumstances, and there hopefully will be enough technical meat to provide concrete leads to follow as well.

I'm playing a long 4-player multiplayer game with a few mods, including in particular Krastorio2 (0.9.15) and Armoured Biters. We're running a dedicated server, and the players are 1x Mac, 1x Windows and 2x linux64, with myself in the last category. The server would pause when nobody is connected.

The first sign that something was amiss was that last night, after the (Mac-based) admin went to sleep and I reconnected, I kept getting instant desyncs the moment I tried to reconnect to the game. I shrugged it off and went to sleep; the next day I came fairly late and the game had been going for a while, but I could reconnect without issue.

The next day was also when we decided to add Rampant because the biter threat was becoming very negligible. This worked fine enough for a while, but then suddenly everyone desynced. Upon restarting, we found that the linux64 players would now be continuously afflicted by the desync-on-connect issue. The Mac-based admin could connect and play without issue; and the Windows player reported mixed results which I unfortunately don't have a full account of.

The admin didn't put much stock in my stories of nighttime desyncs and concluded that Rampant was the most likely culprit. Upon removing Rampant, everyone could reconnect without issue; but of course, this resulted in most biters being deleted from the map altogether. To restore some semblance of challenge, the admin went around and manually spawned biter spawners around our periphery, and then went to sleep after giving me ingame admin rights to perform any further meddling as necessary.

After some observation, I realised that these manually created biter spawners were largely inert. Even after reactivating the vanilla AI (game.forces.enemy.ai_controllable=true), they only exhibited a subset of their vanilla behaviour, and in particular would not spawn new bases. I speed-read some mod code and settled on invoking the following incantation (sorry if it's self-evidently stupid, I've played the game for all of 4 days and haven't even touched Lua before this):

Code: Select all

for _,e in pairs(game.players[1].surface.find_entities_filtered{force="enemy",type="unit"}) do 
e.set_command({type=defines.command.build_base,destination=e.position,distraction=defines.distraction.by_enemy,ignore_planner=true})
end
For me, with this, everything went south. The game crashed (I foolishly did not save the log this first time); and subsequent attempts to reconnect to the server resulted in the same resync-on-connect issue that we had earlier. Meanwhile, the Windows player was still awake and doing just fine. Finally, I saved the desync report and restarted locally from the desynced-level.zip contained in it. This worked fine for something on the order of a minute (and I was pleased enough to see that the biters that came out of the handmade spawners earlier did indeed settle where they were standing)... and then the game crashed again. This time, I took note of the stack trace, which was rather curious. I'm attaching the whole thing, but the interesting slice is:

Code: Select all

...
#3  0x0000000000039fe0 in CrashHandler::SignalHandler(int) at /tmp/factorio-build-XqwiXo/src/Util/CrashHandler.cpp:638
#4  0x0000000000c6679f in ?? at ??:0
#5  0x0000000000c66c20 in RandomGenerator::getInt() at /tmp/factorio-build-XqwiXo/src/Util/RandomGenerator.cpp:79
#6  0x000000000117816a in RandomGenerator::uniformDouble() at /tmp/factorio-build-XqwiXo/src/Util/RandomGenerator.cpp:74
#7  0x0000000001178269 in BuildBaseBehavior::findBuildingPosition(EntityPrototype const&, BoundingBox const&, CollisionMask) at /tmp/factorio-build-XqwiXo/src/AI/BuildBaseBehavior.cpp:311
...
What makes a function called RandomGenerator::getInt segfault? (The crash was a SIGSEGV, so this is not a "rand()%0"-type issue for sure.) My best guess is that, considering the #4 frame that we have no debug data for (and which seems to be very close to RandomGenerator::getInt), this is some stack-smashing problem (note how close the address is to RandomGenerator::getInt's), but without seeing the source, this is pretty blind guesswork. Through this, the Windows player has been reporting no issues whatsoever, and is continuing to play on the server where all biters have been made to settle in place.

Either way, this makes it seem not unlikely to me that the actual issue is due to some unsafe memory operations related to RandomGenerator::getInt and/or BuildBaseBehavior::findBuildingPosition, which for one reason or another may be more relevant in the linux64 build than on other platforms (but this might just have to do with platform peculiarities like contents of uninitialised memory or ASLR or whatever). The desyncs for sure always occurred in busy-biter situations (the first time for me alone when the game had advanced pretty far, then with Rampant's rampant biters, and finally when all the non-Rampant biters were made to settle new bases all at once), so it seems plausible to me that they were caused by a non-crashing instance of the same bug (where getInt merely returned a value that is not the pseudorandom number it was supposed to return, causing the simulations to diverge). I would appreciate if you could look into this, even though on the surface it involves shady admin commands and a lot of mods.

Most recent desync report: http://twilightro.kafuka.org/~blackhole ... -54-21.zip
Last night's desync report, first one on that game (before Rampant was ever added): http://twilightro.kafuka.org/~blackhole ... -30-34.zip
Attachments
desync_crash.txt
(7 KiB) Downloaded 110 times

User avatar
Klonan
Factorio Staff
Factorio Staff
Posts: 5150
Joined: Sun Jan 11, 2015 2:09 pm
Contact:

Re: [0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt

Post by Klonan »

Surprisingly, I managed to also get a crash just by running that command:

Code: Select all

7021.455 Time travel logging:
  72.376 Popped blueprint record (player-index: 0, ID: 14)
  72.380 Popped blueprint record (player-index: 0, ID: 47)
  72.380 Popped blueprint record (player-index: 0, ID: 56)
  72.381 Popped blueprint record (player-index: 0, ID: 84)
  72.388 Popped blueprint record (player-index: 0, ID: 146)
  72.394 Popped blueprint record (player-index: 0, ID: 209)
  72.399 Popped blueprint record (player-index: 0, ID: 272)
  72.410 Popped blueprint record (player-index: 0, ID: 307)
  72.412 Popped blueprint record (player-index: 0, ID: 365)
  72.444 Popped blueprint record (player-index: 0, ID: 372)
7002.509 Player 0 ran command: for _,e in pairs(game.players[1].surface.find_entities_filtered{force="enemy",type="unit"}) do  e.set_command({type=defines.command.build_base,destination=e.position,distraction=defines.distraction.by_enemy,ignore_planner=true}) end
7021.455 Error CrashHandler.cpp:480: Exception Code: c0000005, Address: 0x00007ff6aaa0a6ba
ModuleBase: 0x00007ff6aa030000, ImageSize: 02172000, RelativeAddress: 009da6ba
7021.455 Error CrashHandler.cpp:486: Access Violation: Read at address FFFFFFFFFFFFFFFF
7021.455 Error CrashHandler.cpp:500: Exception Context:
rax=0000000000000000, rbx=0000029c6230ef80, rcx=007465727275742d,
rdx=000000000000000f, rsi=0000029c84102210, rdi=0002324400091f8f,
rip=00007ff6aaa0a6ba, rsp=000000bcf84fe400, rbp=000000bcf84fe500,
 r8=0000029b4766cef0,  r9=0000000000000001, r10=0000029b2b4a1350,
r11=000000bcf84fe230, r12=0000000000000000, r13=0000000000000000,
r14=0000029b66754940, r15=0000029b3d8d95f0
7021.455 Crashed in C:\Factorio\bin\Releasex64vs2017\factorio-run.exe (0x00007ff6aa030000 - 0x00007ff6ac1a2000)
Factorio crashed. Generating symbolized stacktrace, please wait ...
c:\factorio\libraries\stackwalker\stackwalker.cpp (907): StackWalker::ShowCallstack
c:\factorio\src\util\logger.cpp (542): Logger::writeStacktrace
c:\factorio\src\util\logger.cpp (549): Logger::logStacktrace
c:\factorio\src\util\crashhandler.cpp (185): CrashHandler::writeStackTrace
c:\factorio\src\util\crashhandler.cpp (529): CrashHandler::SehHandler
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8A7AFEEA)
00007FFB8A7AFEEA (KERNELBASE): (filename not available): UnhandledExceptionFilter
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D7C4AB2)
00007FFB8D7C4AB2 (ntdll): (filename not available): memset
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D7AC656)
00007FFB8D7AC656 (ntdll): (filename not available): _C_specific_handler
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D7C11CF)
00007FFB8D7C11CF (ntdll): (filename not available): _chkstk
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D78A209)
00007FFB8D78A209 (ntdll): (filename not available): RtlRaiseException
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D7BFE3E)
00007FFB8D7BFE3E (ntdll): (filename not available): KiUserExceptionDispatcher
c:\factorio\src\ai\buildbasebehavior.cpp (265): BuildBaseBehavior::finishBuilding
c:\factorio\src\ai\buildbasebehavior.cpp (127): BuildBaseBehavior::execute
c:\factorio\src\ai\commandable.cpp (201): Commandable::runBehavior
c:\factorio\src\entity\unit.cpp (327): Unit::update
c:\factorio\src\surface\chunk.cpp (632): Chunk::updateActiveEntities
c:\factorio\src\surface\chunk.cpp (693): Chunk::updateEntities
c:\factorio\src\surface\surface.cpp (1608): Surface::update
c:\factorio\src\map\map.cpp (1698): Map::updateEntities
c:\factorio\src\map\map.cpp (1660): Map::update
c:\factorio\src\game.cpp (202): Game::update
c:\factorio\src\scenario\scenario.cpp (1219): Scenario::update
c:\factorio\src\scenario\scenario.cpp (1116): Scenario::updateStep
c:\factorio\src\mainloop.cpp (1171): MainLoop::gameUpdateStep
c:\factorio\src\mainloop.cpp (1037): MainLoop::gameUpdateLoop
c:\factorio\src\util\workerthread.cpp (49): WorkerThread::loop
c:\program files (x86)\microsoft visual studio\2017\community\vc\tools\msvc\14.16.27023\include\thr\xthread (230): std::_LaunchPad<std::unique_ptr<std::tuple<void (__cdecl WorkerThread::*)(void),WorkerThread *>,std::default_delete<std::tuple<void (__cdecl WorkerThread::*)(void),WorkerThread *> > > >::_Go
c:\program files (x86)\microsoft visual studio\2017\community\vc\tools\msvc\14.16.27023\include\thr\xthread (209): std::_Pad::_Call_func
minkernel\crts\ucrt\src\appcrt\startup\thread.cpp (115): thread_start<unsigned int (__cdecl*)(void * __ptr64)>
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8C7D7BD4)
00007FFB8C7D7BD4 (KERNEL32): (filename not available): BaseThreadInitThunk
ERROR: SymGetLineFromAddr64, GetLastError: 487 (Address: 00007FFB8D78CE51)
00007FFB8D78CE51 (ntdll): (filename not available): RtlUserThreadStart
Stack trace logging done
7022.208 Error CrashHandler.cpp:190: Map tick at moment of crash: 37512660
7022.208 Error Util.cpp:97: Unexpected error occurred. If you're running the latest version of the game you can help us solve the problem by posting the contents of the log file on the Factorio forums.
Please also include the save file(s), any mods you may be using, and any steps you know of to reproduce the crash.
7029.717 Error CrashHandler.cpp:258: Heap validation: success.
7029.717 Creating crash dump.
7029.864 CrashDump success

2bpp
Manual Inserter
Manual Inserter
Posts: 2
Joined: Wed May 06, 2020 9:04 am
Contact:

Re: [0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt

Post by 2bpp »

Update: I'm getting the desync problem, only affecting the Linux users, in a 0.18.22 game with Rampant and no Krastorio now as well. No crash so far, though; I'm trying to run the game in Valgrind to check for any untoward memory accesses to test my theory, but startup is taking on the order of hours.


posila
Factorio Staff
Factorio Staff
Posts: 5202
Joined: Thu Jun 11, 2015 1:35 pm
Contact:

Re: [0.18.21] Desync, potentially related to biter spawning behaviour, segfault in RandomGenerator::getInt

Post by posila »

Thanks for the report.
Fixed for 0.18.23

Post Reply

Return to “Resolved Problems and Bugs”