Opened 4 years ago
Closed 4 years ago
#5348 closed defect (not a bug)
Windows crashes faulthandler sometimes reports many fatal exceptions
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | UI | Version: | |
Keywords: | Cc: | chimera-programmers | |
Blocked By: | Blocking: | ||
Notify when closed: | Platform: | all | |
Project: | ChimeraX |
Description
How is it that the Python faulthandler module is reporting many fatal signals on Windows in some crash reports? It seems like one fatal signal should end the process.
An example is ticket #5332 where 3 identical python stack traces in show_open_file_dialog() are given followed by 3 identical python stack traces in show_save_file_dialog(). All 6 stack traces list the same thread id.
An example of normal faulthandler stack tracebacks on windows is ticket #4836. Two tracebacks are shown, one for the "Current thread" and one for another thread with differing thread ids.
Possibly these multiple faulthandler tracebacks only occur when show_open_file_dialog() (14 crash reports) or show_save_file_dialog() is used, and may be some oddity of QFileDialog.getOpenFileNames() or of QFileDialog, which is a base class for the save file dialog. It would be interesting to know if we have had any reports where faulthandler produced multiple tracebacks that were not in either of these routines. I don't remember any and am not sure how to search for those (one option is to look for multiple occurrences of "Current thread" in a ticket).
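The "Current thread" search could be sketched roughly as follows. This is only an illustration: the function names and the sample text are made up, and it assumes the ticket body is available as a plain string containing faulthandler's usual "Current thread 0x... (most recent call first):" header lines.

```python
import re

# faulthandler starts the faulting thread's traceback with a header such as
#   Current thread 0x00001a2c (most recent call first):
# Other threads get a plain "Thread 0x..." header, which is not matched here.
CURRENT_THREAD = re.compile(r"^[ \t]*Current thread (0x[0-9a-fA-F]+)", re.MULTILINE)


def current_thread_ids(report_text):
    """Return the thread id from every "Current thread" header, in order."""
    return CURRENT_THREAD.findall(report_text)


def has_multiple_tracebacks(report_text):
    """True if the report contains more than one "Current thread" traceback."""
    return len(current_thread_ids(report_text)) > 1


# Made-up report text for illustration only (not copied from a real ticket).
SAMPLE = """\
Windows fatal exception: code 0x8001010e

Current thread 0x00001a2c (most recent call first):
  File "open_save.py", line 10 in show_open_file_dialog

Windows fatal exception: code 0x8001010e

Current thread 0x00001a2c (most recent call first):
  File "open_save.py", line 10 in show_open_file_dialog
"""

print(current_thread_ids(SAMPLE))
print(has_multiple_tracebacks(SAMPLE))
```

A report flagged this way would still need a manual look to see which routine each traceback is in.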
Change History (12)
comment:1 by , 4 years ago
comment:2 by , 4 years ago
Maybe every time the Open file dialog is opened it generates a signal but does not kill ChimeraX. But then if the atexit routine is not called and the faulthandler output file is not removed, the user ends up reporting it the next time they start ChimeraX as a crash even though it did not crash. It would be worth testing on Windows to see if using the Open file dialog generates faulthandler output.
The Python faulthandler documentation says
"Changed in version 3.6: On Windows, a handler for Windows exception is also installed."
Maybe "Windows exception" is something that is not necessarily fatal.
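For reference, the general pattern being discussed (log to a file, remove the file in an atexit routine, treat a leftover file as a crash on the next start) looks roughly like the sketch below. This is not ChimeraX's actual code; the function name and the log path handling are hypothetical, and only the standard faulthandler and atexit calls are real.

```python
import atexit
import contextlib
import faulthandler
import os


def enable_crash_log(path):
    """Return leftover log text from a previous run (or None), then log faults to path.

    Hypothetical helper for illustration; any non-empty leftover file is
    treated as evidence of a crash, which is exactly the assumption that
    non-fatal "Windows fatal exception" tracebacks would break.
    """
    previous = None
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", errors="replace") as f:
            previous = f.read()

    log = open(path, "w")          # truncates any old content
    faulthandler.enable(file=log)  # dumps all threads by default

    def cleanup():
        # On a normal exit, stop logging and remove the file so the next
        # start does not see it.
        faulthandler.disable()
        log.close()
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)

    atexit.register(cleanup)
    return previous
```

With this pattern, anything faulthandler writes during the session stays in the file unless the atexit routine runs.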
comment:3 by , 4 years ago
Looking at this page: https://github.com/ray-project/ray/issues/13511, it seems that "Windows fatal exception: access violation" is not, in fact, fatal. So if every access of the Open dialog generates that fault, then we would see this on the occasions when there is an actual crash later.
comment:4 by , 4 years ago
The faults might not occur for all users. It might require certain folder/file permissions, or for certain plugin file formats to be available.
comment:5 by , 4 years ago
I tried on Windows 10 with ChimeraX 1.2.5 to misuse the Open file dialog in every way possible and I monitored the faulthandler output file and it never produced a traceback. Some things I tried: opening a pdb, opening a map, pressing cancel, pressing the close button, navigating to a directory I don't have permissions to read, using an external drive, unplugging the external drive while the open dialog is viewing it, opening files of unrecognized format, choosing a format that has no files, opening a file, then renaming the directory and showing the open file dialog (it tries to start in the last used directory), typing a non-existent file name and pressing Open.
After letting it sit 5 minutes I then tried the Open dialog, typing a non-existent filename, and the faulthandler file showed a traceback! But I suspect the traceback was there before since it just showed ChimeraX, one thread, in event_loop(). Just looked again after 5 minutes and there are two more of those event_loop tracebacks (total of 3 now) and I did no Open dialogs in between. So ChimeraX is spontaneously generating those somehow. Let's wait another 5 minutes and see if more come along. Then we might try leaving the open dialog showing for 5 minutes and see if that produces some in show_open_file_dialog(). I did not get more of the tracebacks by waiting 5 minutes. The 3 tracebacks are reported in ticket #5350. They are "remote procedure call server not available" errors, which I think we have seen before many times.
comment:6 by , 4 years ago
Another example is ticket #5237 where there are dozens of show_open_file_dialog() tracebacks with "Windows fatal exception: code 0x8001010e" with interleaved traceback output. Then at the end there is a single traceback with "Fatal Python error: Aborted" in event_loop. Seems likely that all the show_open_file_dialog() tracebacks were not fatal and some unrelated crash from the event_loop killed the process.
comment:7 by , 4 years ago
Because faulthandler produces these tracebacks that in fact do not crash ChimeraX, and on Windows double clicking the ChimeraX icon a second time starts a second ChimeraX, that second ChimeraX will see the tracebacks and ask to report them as a crash. Probably some of our crash reports are from starting ChimeraX twice and this weird Windows behavior of reporting "Windows fatal exception" tracebacks that aren't actually fatal. Ticket #5323 looks like an example of this, where there is no different kind of exception at the end that looks like a crash.
comment:8 by , 4 years ago
comment:9 by , 4 years ago
Conclusions.
Ticket #5350 shows that faulthandler writes out tracebacks that say "Windows fatal exception: code 0x..." when in fact the program does not crash or exhibit any error.
When looking at these Windows crash reports we should always look at the last traceback since that is probably the one that is the real cause of the crash.
Some crash reports did not involve any crash. If the user starts two ChimeraX instances on Windows, which is what happens when clicking the ChimeraX icon a second time, then the second instance will see the spurious non-fatal tracebacks and ask the user to report a crash. Not sure how to fix this. The basic premise that Python faulthandler only writes out tracebacks for crashes is violated.
comment:10 by , 4 years ago
My previous comment about some crash reports being generated with no crash when a second ChimeraX instance is started is wrong. Testing shows that when the second ChimeraX is started it does not prompt the user to report a crash because it is not able to remove the faulthandler traceback file because another process (the first ChimeraX) is still using it.
So I think all crash reports resulting from the faulthandler traceback file are really crashes. But many of the tracebacks in the report may not be fatal. So only the last traceback or last few should be considered.
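A rough sketch of pulling out only the last traceback from a report is below. It assumes each faulthandler event starts with a line beginning "Windows fatal exception: " or "Fatal Python error: " (the two headers seen in the tickets above) and that the output is not interleaved; the interleaved output seen in ticket #5237 would not split cleanly. The function names are made up for illustration.

```python
import re

# Lines that begin a new faulthandler event in the reports discussed here.
EVENT_HEADER = re.compile(
    r"^(?:Windows fatal exception|Fatal Python error): .*$", re.MULTILINE
)


def split_fault_events(log_text):
    """Split faulthandler output into one text block per reported event."""
    starts = [m.start() for m in EVENT_HEADER.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[s:e].strip() for s, e in zip(starts, ends)]


def last_fault_event(log_text):
    """Return the final event block, the most likely cause of the crash, or None."""
    events = split_fault_events(log_text)
    return events[-1] if events else None
```

Called on a report with many show_open_file_dialog() tracebacks followed by a "Fatal Python error: Aborted" in event_loop, last_fault_event() would return only that final block.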
comment:11 by , 4 years ago
In 2017 there was a related Python bug report about faulthandler logging all C++ exceptions even when they were caught. They resolved it by not logging any C++ exceptions starting in Python 3.6.
The rationale was they did not want to spam the faulthandler output with all these caught exceptions. They also considered other Windows exceptions that are normal and not fatal but it does not appear they had any way to filter those out. It looks like faulthandler is not intended to guarantee that all tracebacks are fatal exceptions.
One unpleasant fix to remove these spam tracebacks that are not fatal would be to periodically check the faulthandler output file, say every minute or every 10 seconds, and clear it whenever it has nonzero length. I'd prefer to not make the code more complicated. Instead when we analyze these crashes we should be aware that the Python tracebacks may be spurious (non-fatal) and the final traceback or two is most likely to be the cause of the crash.
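For the record, the clearing step of that rejected approach would be small. The sketch below is hypothetical (the function name is made up) and assumes the application keeps the open file object it passed to faulthandler.enable(); it would be called from whatever periodic timer the application already has.

```python
import os


def clear_if_nonempty(log_file):
    """Truncate an open log file if anything was written; return the discarded size."""
    fd = log_file.fileno()
    size = os.fstat(fd).st_size
    if size > 0:
        os.ftruncate(fd, 0)
        # faulthandler writes straight to the file descriptor, so move its
        # offset back to the start; otherwise the next write would land at
        # the old offset and leave a gap of zero bytes.
        os.lseek(fd, 0, os.SEEK_SET)
    return size
```

A traceback written between the last clearing and a real crash would still survive, since nothing clears the file after the process dies, but non-fatal tracebacks written in that same window would survive with it.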
comment:12 by , 4 years ago
Cc: | added; removed |
---|---|
Resolution: | → not a bug |
Status: | assigned → closed |
Summary: | Windows crashes faulthandler sometimes reports many fatal signals → Windows crashes faulthandler sometimes reports many fatal exceptions |
Conclusion: Be aware that the faulthandler Python tracebacks in ChimeraX crash reports on Windows may not have caused the crash; many are non-fatal (despite saying "Windows fatal exception"). So attention should be on the last Python traceback when there are several.
Ticket #5323 has 102 tracebacks in show_open_file_dialog all in the same thread.