Stalking the Wild Heisenbug
Fedora bug 958326 has been a source of much amusement (not to mention confusion). It manifests itself by switching the video display to the old contents of the text console rather than the X11 display (but X is still there just waiting to be restored by Ctrl-Alt-F2 then Ctrl-Atl-F1).
It has been intermittant, it has seemed to be associated with high CPU activity transcoding video, it has happened when playing movies. There have been occasions when log messages about errors in the Intel video driver have appeared, but mostly it has been utterly silent. Finally it started happening every single night.
This sudden regularity gave rise to the hope that I might be able to study it and find the cause.
I loaded the Tina Time-lapse app on my android camera, pointed it at my TV screen, and let it take one picture every minute all night long:
Here we have 1:16:26 AM:
Here we have 1:17:26 AM:
So whatever is causing this happened real close to 1:17 am, and there just happens to be an entry in my user's crontab to run the script /home/tom/profile/scripts/nightly at 1:17 am every night.
I try running that script by hand in a terminal window. Nothing happens (it doesn't take very long because it doesn't actually have much to do, which makes things even more mysterious).
OK, I try running it using the at command in one minute:
echo /home/tom/profile/scripts/nightly | at now + 1 min
One minute later and, blip, my X display disappears!
The most obvious difference between cron running stuff and me running stuff is that cron won't have the DISPLAY environment variable set, so I try this:
unset DISPLAY /home/tom/profile/scripts/nightly
Once again, my GUI disappears, so something doesn't like not having a DISPLAY, and, as a result, manages to do something which seems to be impossible to do even if I really wanted to accomplish it. I know the old console contents revert when the X server is shut down, but I have no idea of an official way to show the old contents while X is still running on VT1.
Nothing in the script seems to be a likely candidate for screwing up the display. There are a couple of rsync commands which, in fact, have nothing to do at the moment since the directories already match, and there is a script which downloads any new Doctor Who episodes from my TiVo (there aren't any of those at the moment either) and updates a web page directory of all my downloaded episodes.
I start running the different pieces by hand with DISPLAY unset, and eventually find it is the script which builds the database of episodes that triggers the screen switch, so I put lots of trace code in it and some small delays so I'll be more likely to see each line of output before the screen disappears on me and get my camera again and make a video of the script trace running in a terminal window with a really large font that will be easy to read in the recorded video.
I then run the video up to the point where the screen switches, and find it is running the command:
mplayer -identify -frames 0 /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg
I try running that command with no DISPLAY, and by golly my screen switches again. I have finally tracked down the specific command! There are other mplayer commands run to do similar things, so this must be data dependent on the specific video file. I run the command under strace and attach the listings to the bugzilla.
I try using dd to make a copy of just the 1st megabyte of the file, and the same command switches the screen the same way when given just the 1 meg file, so now I even have a relatively small sample file that can reproduce the problem, so I attach that to the bugzilla as well.
That specific file is actually a second copy of the same program which I don't actually need to keep, so I go ahead and delete that directory, and hopefully I won't find my screen switching on me every night any longer (until some new file comes along that tickles the same bug).
What I should probably do is switch to using something like ffprobe to query properties of videos since it just looks at the file rather than pretending to play it, but this does seem like some kind of real bug in some ioctl() call somewhere. Perhaps my next step will be to try and find the specific call and reproduce the problem in a tiny little program. I guess I could videotape the strace and see for sure which call triggers the screen switch, then perhaps get in the debugger and find what arguments are being passed.
(And 1:17 am went past this day with no screen switching, Yea! :-).
I tried filming the strace output and it scrolls past so fast it never has time to properly render on the screen and/or get captured by the camera, but will a true hacker let such minor details stand in the way? Never!
I downloaded the source rpm for strace and modified it to call usleep(100000); just before every ptrace(PTRACE_SYSCALL,...). Now I have a version of strace that waits 1/10 of a second at each system call, and now the output doesn't scroll past so fast, though it does take more than 5 minutes to make it to the syscall that switches the screen which looks like this:
If I try to advance the video to the point where I can see the result of that final ioctl, the screen has already gone black on me.
So it looks like, if I'm gonna use mplayer rather than ffprobe, I probably ought to force it to use the null video driver when all I'm doing is querying the properties (something you'd imagine -frames 0 might imply, but you'd be wrong :-).
unset DISPLAY mplayer -identify -frames 0 -vo null trunc.mpg
Hey! My screen didn't go blank! I've arrived at a work around, and I'm not even sure if the bug is really a bug or not. I guess mplayer is trying to fall back to the low level frame buffer to display the video and that isn't sitting too well with the Intel video driver. That does seem like a bug to me, but you never know what the developers are gonna call it :-).
|Game of Linux Entry||Game of Linux Site Map||Tom's Fabulous Web Page|