- The function reporting the crash on the pipe doesn't need to suspend all the threads. The original purpose of suspending the threads was likely to preserve the state of the process as well as possible, but that is not required and is prone to deadlocks.
Not suspending all the threads may fix the hypothetical case where the CRC main thread is waiting for a prior ensure call stack to be resolved (I observed degenerate cases on my machine where this could take more than 15 minutes), preventing it from responding promptly to an incoming crash from the Editor. The flow was as follows:
- Editor fires an ensure, suspends all the threads, pipes a message to CRC to process the ensure.
- CRC collects the ensure artifacts quickly and replies to the Editor, then starts resolving the call stack from the minidump, which blocks its main thread; degenerate cases can take several minutes.
- Editor gets CRC messages and resumes its threads.
- Editor fires a crash, suspends all the threads, pipes a message to CRC to process the crash.
- CRC main thread is busy, waiting for the previous ensure call stack to be resolved... and doesn't respond promptly to the crash message.
- With the Editor threads being suspended, the code responsible for timing out if CRC takes too long never executes, and the Editor stalls until CRC dies or responds -> the user likely kills the Editor (and possibly CRC).
As a side effect of this change, if CRC doesn't respond promptly to a crash, the thread calling ReportCrash()/ReportGPUCrash() will time out and likely terminate the Editor before CRC can collect the crash artifacts or walk the threads to collect the call stacks.
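The timeout behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical `FMonitorAck` helper standing in for the acknowledgement CRC sends back over the pipe; none of these names come from the engine, and the real reporting code may be structured differently. The point is only that the timeout can fire because the waiting thread is not suspended.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the acknowledgement CRC sends back over the pipe.
struct FMonitorAck
{
    std::mutex Mutex;
    std::condition_variable Signal;
    bool bReceived = false;

    // Called by whatever reads the pipe when CRC replies.
    void Notify()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bReceived = true;
        }
        Signal.notify_all();
    }

    // Returns true if CRC replied within Timeout, false if the wait timed out.
    // The caller decides what to do on a timeout (e.g. terminate the process).
    bool WaitFor(std::chrono::milliseconds Timeout)
    {
        assert(Timeout.count() >= 0);
        std::unique_lock<std::mutex> Lock(Mutex);
        return Signal.wait_for(Lock, Timeout, [this] { return bReceived; });
    }
};
```

Because the wait runs on a thread that keeps executing, a busy CRC main thread only delays the reporting thread up to the timeout instead of stalling the whole process.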
- Added a hint to the diagnostic logs reported with the Editor 'SummaryEvent' analytic event to indicate if the crash report was produced after the Editor died, in which case the portable call stack wasn't captured.
- Added a message displayed to the user by CRC saying that the system failed to capture the call stack.
#jira UE-108701 - Editor deadlocks when reporting an ensure, a stall or a crash.
#rb Johan.Berg
[CL 15452515 by Patrick Laflamme in ue5-main branch]
- When CRC runs out of process, instead of reading the current thread context of the crashed thread, read and use the crash context that was reported during the crash (which is different).
- Added an optional context parameter to FGenericPlatformStackWalk::CaptureThreadStackBackTrace(), implemented it across all platforms, but only used on Windows.
On Windows, fixed InitStackWalking() and InitStackWalkingForProcess() to reset the process that needs to be walked.
- CRC, running out of process, may walk its own process or the Editor process, and whichever was walked first ruled out the other.
#jira UE-105006 - [CrashReporter] VCRUNTIME140!7fffce010000 + e390
#rb Johan.Berg
#preflight 15217159
[CL 15319737 by Patrick Laflamme in ue5-main branch]
This code is meant to help locate and send reports/telemetry for slow code pathways that create unresponsive conditions
FGameThreadHitchHeartBeatThreaded was considered, but doesn't fit Editor's needs because it's designed around general GameThread deadlines
Editor workloads are much less homogeneous, and proper async support for a consistent GameThread deadline in Editor is a ways away
This necessitates a more focused approach where we can instrument specific routines such that each issues its own telemetry report
Add a "Stalls" counter in the Frame Rate and Memory title bar stats
Add LogStall Log category for viewing details about stalls that have occurred
Introduces a stall counter object on the GameThread to collect statistical data about stalls (this will not report to telemetry)
Future changes will introduce report objects into specific routines to upload to crashreporter
Future changes will introduce support for non-Windows OSes
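As an illustration only, a stall counter of the kind described above could look like the sketch below. `FStallStats`, its budget parameter, and its accessors are hypothetical names chosen for this example and are not the engine's API; the sketch only collects local statistics and reports nothing to telemetry.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stall counter: tallies instrumented routines that exceeded a time budget.
class FStallStats
{
public:
    explicit FStallStats(std::chrono::milliseconds InBudget)
        : Budget(InBudget)
    {
        assert(InBudget.count() > 0);
    }

    // Records one instrumented routine; returns true if it counted as a stall.
    bool Record(std::chrono::milliseconds Elapsed)
    {
        if (Elapsed <= Budget)
        {
            return false;
        }
        ++StallCount;
        TotalStallTime += Elapsed;
        return true;
    }

    int GetStallCount() const { return StallCount; }
    std::chrono::milliseconds GetTotalStallTime() const { return TotalStallTime; }

private:
    std::chrono::milliseconds Budget;
    std::chrono::milliseconds TotalStallTime{0};
    int StallCount = 0;
};
```

Passing the elapsed time in explicitly keeps the counter independent of any particular clock, so each instrumented routine can measure itself however it likes.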
#jira none
#rb francis.hurteau
[CL 15213394 by geoff evans in ue5-main branch]
- Count the number of crashes that occur before Analytics is initialized and report it as the DelayedCrashCount field of the Editor summary session event.
#rb Jamie.Dale
[CL 15128718 by Patrick Laflamme in ue5-main branch]
- Prevented CrashReportClient from serializing the CrashContext to the memory buffer if it was already serialized once.
- Increased the default space reserved by the memory buffer used to serialize the crash context from 32K to 128K because serializing the crash context of the 'debug crash' command in Editor uses up to 112K.
- Cleared the memory buffer before serializing the CrashContext in case it was serialized more than once, to prevent the internal buffer from growing needlessly. Also, the XML reader, being limited, would only read the first context written, ignoring later and more recent ones appended.
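The clear-before-serialize fix can be sketched as follows. `FContextBuffer` is a hypothetical stand-in built on `std::string`, not the engine's buffer type; it assumes the serialized context is passed in as text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the reusable buffer holding the serialized crash context.
class FContextBuffer
{
public:
    explicit FContextBuffer(std::size_t ReservedBytes = 128 * 1024)
    {
        Data.reserve(ReservedBytes);
    }

    // Overwrites any previously serialized context instead of appending to it,
    // so a reader that only parses the first document always sees the latest one.
    void Serialize(const std::string& ContextXml)
    {
        assert(!ContextXml.empty()); // an empty context is treated as a caller error in this sketch
        Data.clear();                // drops stale content; common implementations keep the capacity
        Data.append(ContextXml);
    }

    const std::string& Get() const { return Data; }

private:
    std::string Data;
};
```

Without the `clear()` call, a second serialization would be appended after the first and the buffer would grow with every call.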
#rb Johan.Berg
[CL 15037279 by Patrick Laflamme in ue5-main branch]
- Fixed the application title, getting it from the engine version rather than hardcoding it.
#rb Francis.Hurteau
[CL 13747857 by Patrick Laflamme in ue5-main branch]
- Implemented a special logger inside CrashReportClientEditor to capture and save important events such as crash reporting (along with the CrashGUID)
- When CrashReportClientEditor sends all the Editor summary events, if an error was detected in the session being sent, the mini-log for that session is attached to the analytic event.
#rb Chris.Gagnon, Jamie.Dale
#lockdown cristina.riverun
#ROBOMERGE-SOURCE: CL 12935952 in //UE4/Release-4.25/... via CL 12935970 via CL 12935996
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v682-12900288)
[CL 12936020 by patrick laflamme in Main branch]
- Added code to the Editor to detect and report when CrashReportClientEditor exited unexpectedly. (MonitorExceptCode 777005 is set in the Editor session summary event)
- Added a retry loop to CrashReportClientApp to retry opening the handle on the Editor process if the first attempt fails.
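A retry loop of this kind might look like the sketch below. `RetryOpen`, the attempt count, and the delay are assumptions made for illustration; the actual loop in CrashReportClientApp and the call it uses to open the process handle are not shown in this description.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls TryOpen up to MaxAttempts times, sleeping Delay between failed attempts.
// Returns true as soon as one attempt succeeds, false if every attempt fails.
inline bool RetryOpen(const std::function<bool()>& TryOpen, int MaxAttempts, std::chrono::milliseconds Delay)
{
    assert(MaxAttempts > 0);
    for (int Attempt = 1; Attempt <= MaxAttempts; ++Attempt)
    {
        if (TryOpen())
        {
            return true;
        }
        if (Attempt < MaxAttempts)
        {
            // Give the monitored process a moment before trying again.
            std::this_thread::sleep_for(Delay);
        }
    }
    return false;
}
```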
#rb Jamie.Dale
#lockdown cristina.riverun
#ROBOMERGE-SOURCE: CL 12878012 in //UE4/Release-4.25/... via CL 12878014 via CL 12878016
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v681-12776863)
[CL 12878017 by patrick laflamme in Main branch]
Details:
The 4.24.3 analytics show many unexplained exit codes, 23,647 at the moment. Normally, the Editor exits with code 0 if everything went well, 3 or 1 if it gracefully handled a crash, and 255 if it was aborted. But we also see many others, such as the predominant cases below:
-1073741819 => STATUS_ACCESS_VIOLATION => 8081 cases
-1073740791 => STATUS_STACK_BUFFER_OVERRUN => 7581 cases
-1073740771 => STATUS_FATAL_USER_CALLBACK_EXCEPTION => 5357 cases
On Windows, the crash reporting system should catch and report STATUS_ACCESS_VIOLATION and then exit with code 3 (as the error was handled). For example, if you add a null pointer dereference (STATUS_ACCESS_VIOLATION) in the code, the crash reporter handles it and the Editor exits with code 3. Likewise, if you enter the 'debug crash' console command, the Editor gracefully handles the error and exits with code 3. But if you move the null pointer dereference into the crash handler thread itself, the error is not handled and the Editor exits with code STATUS_ACCESS_VIOLATION. This hints that our crash reporting thread is likely crashing in the wild. It would be useful to isolate those cases from the others and keep count of how many times this happens.
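The negative exit codes listed above are NTSTATUS values printed as signed 32-bit integers. The sketch below shows the conversion and names only the codes mentioned in this description; `ToNtStatus` and `DescribeExitCode` are hypothetical helpers written for illustration, not engine functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reinterprets a signed process exit code as the unsigned 32-bit NTSTATUS value.
constexpr std::uint32_t ToNtStatus(std::int32_t ExitCode)
{
    return static_cast<std::uint32_t>(ExitCode);
}

// -1073741819 is the signed view of 0xC0000005.
static_assert(ToNtStatus(-1073741819) == 0xC0000005u, "signed/unsigned views must match");

// Names the exit codes discussed above; anything else is reported as unknown.
inline std::string DescribeExitCode(std::int32_t ExitCode)
{
    switch (ToNtStatus(ExitCode))
    {
        case 0u:          return "clean exit";
        case 1u:
        case 3u:          return "crash handled gracefully";
        case 255u:        return "aborted";
        case 0xC0000005u: return "STATUS_ACCESS_VIOLATION";
        case 0xC0000409u: return "STATUS_STACK_BUFFER_OVERRUN";
        case 0xC000041Du: return "STATUS_FATAL_USER_CALLBACK_EXCEPTION";
        default:          return "unknown";
    }
}
```

Seeing one of the 0xC... values as the process exit code means the exception escaped the crash handling path, which is the case this change tries to count separately.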
#jira UE-91803 - Analytics hints that crash reporting and crash handling crashes themselves.
#rb Jamie.Dale
#lockdown cristina.riverun
#ROBOMERGE-SOURCE: CL 12695027 in //UE4/Release-4.25/... via CL 12695062 via CL 12695098
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v676-12543919)
[CL 12695136 by patrick laflamme in Main branch]
#jira UE-91493 - CrashReportClientEditor may send a report owned by another concurrent instance losing the exit code in the process
#rb Jamie.Dale
#ROBOMERGE-SOURCE: CL 12598059 in //UE4/Release-4.25/... via CL 12598060 via CL 12598062
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v675-12543919)
[CL 12598063 by patrick laflamme in Main branch]
#jira UE-91318 - Events were sent from Crash Reporter after Editor Usage Data is disabled.
#rb none
#lockdown cristina.riveron
#ROBOMERGE-SOURCE: CL 12489133 in //UE4/Release-4.25/... via CL 12489135 via CL 12489146
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v673-12478461)
[CL 12489149 by patrick laflamme in Main branch]
- Fixed crash report client editor to prevent sending any usage data.
#rb Jamie.Dale
#lockdown cristina.riveron
#ROBOMERGE-SOURCE: CL 12489019 in //UE4/Release-4.25/... via CL 12489022 via CL 12489029
#ROBOMERGE-BOT: RELEASE (Release-Engine-Staging -> Main) (v673-12478461)
[CL 12489031 by patrick laflamme in Main branch]
#jira none
#rb trivial
#ROBOMERGE-SOURCE: CL 12380556 in //UE4/Release-4.25/... via CL 12380573
#ROBOMERGE-BOT: RELEASE (Release-4.25Plus -> Main) (v671-12333473)
[CL 12381193 by patrick laflamme in Main branch]