[Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos work

Place for things which are bugs but we have no idea how to solve them. Things related to hardware, libraries, strange setups, etc.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

[Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos work

Post by MF- »

Hello. Here goes a story about a *buntu Quantal crash.

I got 0.2.7/8 Alphas and Demos.
Alphas crashed with SIGBUS after 14th square of the splashscreen turned on.
I got a strace from the crash.

Demos worked just fine, so I played and finished the 0.2.8 demo.

Then I decided to produce a stack trace with GDB.
And the Alpha ran without a crash.
I tried without GDB. No crash. I guess it auto-resolved?

Was it solved by winning the demo campaign?
Was it solved by running in GDB? (That would hint a race condition I think)

End of the A0.2.8 strace:

Code: Select all

 gettimeofday({1361655513, 339688}, NULL) = 0
 gettimeofday({1361655513, 339839}, NULL) = 0
 gettimeofday({1361655513, 340023}, NULL) = 0
 open("/usr/share/factorio/resources/too-far.png", O_RDONLY) = 21
 fstat64(21, {st_mode=S_IFREG|0644, st_size=880, ...}) = 0
 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
 0xfffffffff6eb1000
 read(21, "\211PNG\r\n\32\n\0\0\0\rIHDR\0\0\0 \0\0\0 \10\6\0\0\0szz"..., 4096) =
 880
 ioctl(6, 0xc0086457, 0xffdc31a8)        = 0
 ioctl(6, 0xc00c6466, 0xffdc31a4)        = 0
 ioctl(6, 0xc0086457, 0xffdc34a8)        = 0
 ioctl(6, 0x4020645d, 0xffdc3220)        = 0
 ioctl(6, 0x4020645d, 0xffdc3240)        = 0
 ioctl(6, 0x4020645d, 0xffdc3240)        = 0
 ioctl(6, 0x40406469, 0xffdc3210)        = 0
 clock_gettime(CLOCK_MONOTONIC, {107983, 719418173}) = 0
 ioctl(6, 0xc00c6466, 0xffdc3154)        = 0
 ioctl(6, 0xc00c6466, 0xffdc3154)        = 0
 ioctl(6, 0xc00c6466, 0xffdc3154)        = 0
 ioctl(6, 0xc00c6466, 0xffdc31a4)        = 0
 ioctl(6, 0xc0086457, 0xffdc31e8)        = 0
 ioctl(6, 0xc00c6466, 0xffdc31e4)        = 0
 ioctl(6, 0x400c645f, 0xffdc32a4)        = 0
 ioctl(6, 0x40046460, 0xffdc330c)        = 0
 close(21)                               = 0
 munmap(0xf6eb1000, 4096)                = 0
 ioctl(6, 0xc0086457, 0xffdc4298)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4294)        = 0
 ioctl(6, 0xc0086457, 0xffdc4418)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4414)        = 0
 ioctl(6, 0x400c645f, 0xffdc4484)        = 0
 mmap2(NULL, 164630528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
 = 0xffffffffd9c97000
 ioctl(6, 0x4020645d, 0xffdc43d0)        = 0
 ioctl(6, 0x4020645d, 0xffdc43f0)        = 0
 ioctl(6, 0x4020645d, 0xffdc43f0)        = 0
 ioctl(6, 0x40406469, 0xffdc43c0)        = 0
 clock_gettime(CLOCK_MONOTONIC, {107983, 722222232}) = 0
 ioctl(6, 0xc00c6466, 0xffdc4304)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4304)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4304)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4354)        = 0
 ioctl(6, 0xc0086457, 0xffdc4398)        = 0
 ioctl(6, 0xc00c6466, 0xffdc4394)        = 0
 ioctl(6, 0xc0106464, 0xffdc4440)        = 0
 mmap2(NULL, 164626432, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0x114936) =
 0xffffffffcff97000
 ioctl(6, 0x400c645f, 0xffdc4494)        = 0
 --- SIGBUS (Bus error) @ 0 (0) ---
 +++ killed by SIGBUS +++
User avatar
Narc
Filter Inserter
Filter Inserter
Posts: 284
Joined: Mon Feb 11, 2013 7:25 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by Narc »

Okay, I did a bit of research and it turns out SIGBUS is generally a result of over-committing on memory. I had been looking around for conditions where pulseaudio specifically would do so, and found a discussion thread from 2010 explaining what could possibly trigger it. Further googling suggests SIGBUS could happen on just about any malloc()'d addresses that can't be backed by real (or virtual!) memory at the time of writing to them.

Now, for your specific case, I'd love to see the full strace because I'm curious what file descriptor 6 actually *is*. Just for fun, though, I ran a strace of my own and it turns out my Factorio opens /dev/dri/card0 as file descriptor 6 -- a video card driver interface.

Basically, to put it simply, my guess is your video card driver crashed because some exceptional situation happened, and then the exceptional situation was resolved by the time you got around to trying the demo. It's possible (even probable) the exceptional situation had to do with over-committing on RAM (note the next-to-last action taken was mmap2()ing something (probably the PNG that was read earlier, using file descriptor 21)).

Or, in other words, not a programming error.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

I sometimes run out of real RAM, but I always have 3GB of free swap space around.
I am not sure how high is the SHM limit set, but that limit is fixed and doesn't allow overcommitment AFAIK.

I don't have the full strace around. I will look it up if it happens again.

I am pretty sure that I tried the Alpha once more after the success with the demo. THEN I decided to play the demo.
(That doesn't line up with the theory that a situation got resolved between me trying the alpha and the demo)
terminology detail: If the driver crashed, it wouldn't be around when I played the demo....

I doubt it was the .png file. Why would anyone mmap a png with MAP_SHARED?
Furthermore that last mmap is getting FD=6 as an argument.

It would make sense if FD 6 was a DRI interface.
One doesn't execute a lot of ioctls on normal files, right?

For the video card: It's an Intel GPU managed by the opensource i915 module.
I have absolutely no clue how to monitor it's internal memory utilization.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

I ran a strace anyway.
FD 6 is indeed /dev/dri/card0

The fun part: There is a long stall (over 10s, didn't measure it) at the very same spot as where the crash used to be.
User avatar
Narc
Filter Inserter
Filter Inserter
Posts: 284
Joined: Mon Feb 11, 2013 7:25 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by Narc »

MF- wrote:I sometimes run out of real RAM, but I always have 3GB of free swap space around.
I am not sure how high is the SHM limit set, but that limit is fixed and doesn't allow overcommitment AFAIK.
You can't overcommit on SHM, but you can overcommit on the RAM backing it -- the SHM malloc() is just a regular one, it doesn't reserve RAM, as I understand from the linked pulseaudio discussion. Either way, there's no guarantee SHM ever came into it at all.
MF- wrote:[...]Furthermore that last mmap is getting FD=6 as an argument.
Oh, mmap2 has a file descriptor in that parameter? Then yeah, it's far more likely that was mmapping something like a back-buffer.
MF- wrote:It would make sense if FD 6 was a DRI interface.
One doesn't execute a lot of ioctls on normal files, right?
Yeah, that was why I thought it was either a video or sound driver -- I'd assumed sound first, but video was just as likely.
MF- wrote:For the video card: It's an Intel GPU managed by the opensource i915 module.
I have absolutely no clue how to monitor it's internal memory utilization.
Hmm, I've got that on my testbed. I'll try hogging as much of its RAM as I can and see if I can reproduce your crash. Either way, though, it doesn't sound like anything Factorio can really do to help.
MF- wrote:The fun part: There is a long stall (over 10s, didn't measure it) at the very same spot as where the crash used to be.
Yeah, that is impressive. I'm guessing part of it might be the way on-board shared video memory works with the i915-driven cards. Though, happily, at least they seem to run all right aside from this long init/exit stall (and occasional crash).
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

It happened again. This time it crashed instantly at the beginning.

then I ran it in gdb.. crashed at the 14th box as previously position, stack trace attached.
1GB of RAM was empty. (half buffers, half completely empty)

Fun thing: I got "The extension is not allowed" error when I tried to upload a file named "stacktrace"
WHAT? extension txt is not allowed?

EDIT: Figured it out - ZIP is the answer.
I uploaded both variant
- a text file named .txt.zip
- an actual ZIP file.
Choose your favourite.
Attachments
stacktrace.txt.zip
A stacktrace from the "14th square" crash.
THIS ONE IS A TEXT FILE!
(3.72 KiB) Downloaded 371 times
stacktrace.zip
A stacktrace from the "14th square" crash.
(Regular ZIP file)
(879 Bytes) Downloaded 335 times
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

UPDATE:
I forgot to mention I am running two X11 at once.
factorio runs on the secondary X11.

I tried stopping and starting the secondary X11.
Voila, it works.
User avatar
Narc
Filter Inserter
Filter Inserter
Posts: 284
Joined: Mon Feb 11, 2013 7:25 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by Narc »

stacktrace.txt wrote:0xf6aca058 in ?? () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
Well, if we hadn't been pointed at /dev/dri before, this would've been a big hint. As it is, it's still good confirmation.
MF- wrote:I forgot to mention I am running two X11 at once.
factorio runs on the secondary X11.

I tried stopping and starting the secondary X11.
Voila, it works.
So, is this report actually "libGL behaves strangely on a secondary X11 server", then?
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

Narc wrote:
stacktrace.txt wrote:0xf6aca058 in ?? () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
So, is this report actually "libGL behaves strangely on a secondary X11 server", then?
I am not sure.
It would be really nice if I new what is happening at that crash-or-stall time.
BTW: This time a X11 restart didn't work. Four restarts did.
I assume it's X11-instance specific. It will always run or never run in one particular X11 instance.

Maybe it's related more to DRI v.s. VT switching? No real clue.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

I think I could have done something like this:
primary X11 on VT7
started secondary X11 on VT8, played minecraft on it for a while, stopped secondary X11
started secondary X11 on VT8 as different user, ran factorio

But I am not sure. I installed new libgl-mesa-dri today, it didn't crash when I tried those steps.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

Hmm.. it's actually possible to get the bus error on shutdown.
I have played for a long time, didn't save and died.
Next start attempt... fails. I agree, it sounds like some in-driver condition.

It would be really nice to know what were the arguments passed to libgl.
Last edited by MF- on Sat Mar 02, 2013 11:23 am, edited 2 times in total.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

How hard would removing all the gameplay parts and leaving only the loading / quitting code be?
It would be really nice if I had some sort of testcase sourcecode to show to the relevant developers.

ALSO: Does "some features are disabled for the demo" cause interesting pieces of code to get skipped?
Or it only doesn't load the full texture set?


PS: I discovered this:
For Intel GPU's there exists the intel-gpu-tools from http://intellinuxgraphics.org/ project, which brings the command intel_gpu_top (amongst other things). It is similar to top and htop, but specifically for the Intel GPU.
Sadly, it doesn't seem to list the actual memory consumption
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

Another longer game session close ended in SIGBUS and inability to start again.
I believe that I had a longer minecraft session before the first handful of crashes.

It starts looking like a pattern.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

Update: Linux version works on Ubuntu 13.04 "Raring" just as it did on 12.11 "Quantal".

I played once so far.
it seems to periodically micro-freeze, but that can have many possible reasons.
The shutdown crash did not occur.

Tried playing today... and reproduced the startup crash.


EDIT: That's disturbing.. >5 attempts and still crashing.

BTW: It could be a programming error (For example if a program causes a gpu-memory leak)
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

UPDATE: I have a strong feeling that I already wrote this, but I didn't find it
Setting LIBGL_ALWAYS_INDIRECT=y makes whole X11 crash (even earlier than the game "normally" would)
For some funny reason that increases the chance of it running when I start the X11 again..
=> It makes sense to crash the X11 that way and immediately restart it.

I even wrote a script for the automatic crash when the startup fails...

#!/bin/sh

crash_unsafe(){
echo
echo "4s to crash-prone mode..."
sleep 1; echo 3
sleep 1; echo 2
sleep 1; echo 1
sleep 1; echo 0
LIBGL_ALWAYS_INDIRECT=y ~/factorio/usr/games/factorio
}

~/factorio/usr/games/factorio || crash_unsafe
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.3.1 dies with SIGBUS

Post by MF- »

This still happens in 0.3
So far it seems that the crash probability has increased (I haven't been able to start it yet)
But maybe it's ilnor's stream's fault :D
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

I'll give it one more week of testing.

But the current hypothesis is " YAYYY.. FIXED in 0.4.0 !" EDIT: Wrong :(

It doesn't even crash with LIBGL_ALWAYS_INDIRECT=y (it just gets slower)

What did you change? Did you update the library?
Last edited by MF- on Sun May 05, 2013 8:52 am, edited 1 time in total.
slpwnd
Factorio Staff
Factorio Staff
Posts: 1835
Joined: Sun Feb 03, 2013 2:51 pm
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by slpwnd »

I am not aware of any specific change related to this. Could be simply 1s and 0s magic:)
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

slpwnd wrote:I am not aware of any specific change related to this. Could be simply 1s and 0s magic:)
I have been playing minecraft after the 100% good session and didn't reboot since..
A 1/0 magic indeed. It worked at 100% in that particular X11 instance, but when I spawned another one today, it's back to the old behaviour.
MF-
Smart Inserter
Smart Inserter
Posts: 1235
Joined: Sun Feb 24, 2013 12:07 am
Contact:

Re: [Crash] [Linux] Alpha 0.2.7/8 dies with SIGBUS, demos wo

Post by MF- »

I re-read the stacktrace..
Are you sure that this is not useful for anything?
Sure, it would be better if the graphic library wasn't stripped.

Code: Select all

Thread 1 (Thread 0xf78dda40 (LWP 15685)):
#0  0xf6aca058 in ?? () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
No symbol table info available.
#1  0xf6ace0d5 in _mesa_unpack_rgba_row () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
No symbol table info available.
#2  0xf6b0a323 in _mesa_readpixels () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
No symbol table info available.
#3  0xf6dde367 in ?? () from /usr/lib/i386-linux-gnu/dri/i965_dri.so
No symbol table info available.
#4  0xf6b0b2a5 in _mesa_ReadnPixelsARB () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
No symbol table info available.
#5  0xf6b0b363 in _mesa_ReadPixels () from /usr/lib/i386-linux-gnu/libdricore9.0.0.so.1
<factorio>
#13 0xf79cc4d3 in __libc_start_main () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#14 0x08054fd1 in ?? ()
No symbol table info available.

Idea:
I got a suspicion... Can I disable the loading screen? / Can you craft a version without it for me?
When I "look away" during the critical point, the chance of a successful startup rises considerably.
("look away"= VT switch, which also means freezing of all DRI operations, giving <something> more time.)

Are you periodically repainting the loader screen, or only when the next green square is to be filled?
Post Reply

Return to “1 / 0 magic”