Commit Graph

81 Commits

Author SHA1 Message Date
michael atchison
6e3e668f8b Update analytics user id and crash reporting user id when the epic account id for the running process changes.
[REVIEW] [at]patrick.laflamme, [at]wes.hunt, [at]eric.day
#preflight 63f065463c1eb56f0516c03e

[CL 24355926 by michael atchison in ue5-main branch]
2023-02-22 00:22:48 -05:00
patrick laflamme
6f5ba2331d Added a command line option to CrashReportClient to disable the 'Submit and Restart' button in case the application doesn't want to be automatically restarted.
#rb Chris.Gagnon, Johan.Berg
#preflight 629f7017617cbe81d32add99

#ushell-cherrypick of 20541559 by Patrick.Laflamme
#preflight 629fbb4b521254896f6c2c43

#ROBOMERGE-AUTHOR: patrick.laflamme
#ROBOMERGE-SOURCE: CL 20546605 via CL 20546619 via CL 20546630 via CL 20546644 via CL 20546648
#ROBOMERGE-BOT: UE5 (Release-Engine-Staging -> Main) (v954-20466795)

[CL 20552343 by patrick laflamme in ue5-main branch]
2022-06-08 02:02:29 -04:00
geoff evans
1981b2f5f3 StallDetector support for Linux
### Features

This change enables the StallDetector watchdog in Editor to submit reports to crashreporter about threads violating instrumented deadlines in the source code. This feature was previously available on Windows, and this change adds Linux support.

### Notes

New APIs:
ReportStall()
CaptureThreadPortableCallStack()

Many APIs are updated from purely "Ensure" naming to more general naming. Stalls are more like Ensures than crashes, so the appropriate renames have been made to keep the code readable and clear. In some places, Ensure is replaced with the clearer "Continuable Event" nomenclature.

### Testing

I synthesized an ensure on Linux, and did the same for a stall. I then compared the crash report XML files to make sure they contain accurate data in the callstack, portable callstack, and other fields in the report. I also noted that the stall information was showing as expected in the crash reporter.

#rb brandon.schaefer, francis.hurteau
#jira UETOOL-3336
#preflight 625e20d2804460ab0fea3277

[CL 19911608 by geoff evans in ue5-main branch]
2022-04-25 19:19:04 -04:00
wes hunt
b14fc01621 Fix FGenericCrashContext::Initialize() to initialize the GameName to UE-ProjectName instead of UE5-GameName
#ROBOMERGE-AUTHOR: wes.hunt
#ROBOMERGE-SOURCE: CL 19472252 via CL 19472515 via CL 19472553 via CL 19488666 via CL 19488781
#ROBOMERGE-BOT: UE5 (Release-Engine-Staging -> Main) (v936-19480137)

[CL 19489780 by wes hunt in ue5-main branch]
2022-03-23 20:34:22 -04:00
Andriy Tylychko
2a295eb685 Deprecated FTicker and family, replaced by the thread-safe FTSTicker
#jira UE-120090
#rb francis.hurteau

[CL 17176325 by Andriy Tylychko in ue5-main branch]
2021-08-16 11:05:18 -04:00
Patrick Laflamme
24bc1477ae Removed CrashReportClient analytic field 'MonitorQueryingPipe' that was temporarily added to verify if CRC crashed while reading the pipe.
- The data show no evidence that CRC is crashing there. Capturing this state is I/O expensive and not required moving forward.

#jira UETOOL-4042 Inspect UE5/Main analytics for CRC crashes
#rb Jamie.Dale

[CL 17116844 by Patrick Laflamme in ue5-main branch]
2021-08-10 10:57:38 -04:00
aurel cordonnier
d17d20ca36 Merge from Release-Engine-Test @ 16758890 to UE5/Main
This represents UE4/Main @ 16738161 and Dev-PerfTest @ 16737719 (and Release-17.00 @ 16658211)

[CL 16763350 by aurel cordonnier in ue5-main branch]
2021-06-23 17:51:32 -04:00
Patrick Laflamme
8064fa38b0 Added temporary diagnostic code to CrashReportClient in the hope of narrowing down why it suspiciously dies so often.
#rb Jamie.Dale

[CL 16640468 by Patrick Laflamme in ue5-main branch]
2021-06-11 08:51:14 -04:00
aurel cordonnier
e0ad4e25df Merge from Release-Engine-Test @ 16624776 to UE5/Main
This represents UE4/Main @ 16579691 and Dev-PerfTest @ 16579576

[CL 16625248 by aurel cordonnier in ue5-main branch]
2021-06-10 13:13:24 -04:00
Patrick Laflamme
571c8ffe14 Added analytics to report more granular metrics about CRC performance when handling crash/ensure/stall.
#rb Jamie.Dale

[CL 16570467 by Patrick Laflamme in ue5-main branch]
2021-06-07 10:37:55 -04:00
Patrick Laflamme
ef94e39d6c UETOOL-3650 - Delete expired UECrashContext-pid.xml that could be left over by crashed/killed CRC
- Added code to run a cleanup on UECrashContext-{pid}.xml files that are 30 days old where the process ID (pid in the name) is not running anymore.
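
The cleanup rule above could be sketched as follows. This is an illustration only, not the CrashReportClient implementation: `ParsePidFromContextName`, `ShouldDeleteContextFile`, and the `IsPidRunning` callback are hypothetical names, and "30 days old" is assumed to mean an age of at least 30 days. Only the `UECrashContext-{pid}.xml` naming and the 30-day threshold come from the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>

// Hypothetical helper: extracts the pid from a name shaped like "UECrashContext-<pid>.xml".
// Returns false when the name does not match that pattern.
bool ParsePidFromContextName(const std::string& Name, unsigned long& OutPid)
{
    const std::string Prefix = "UECrashContext-";
    const std::string Suffix = ".xml";
    if (Name.size() <= Prefix.size() + Suffix.size()) return false;
    if (Name.compare(0, Prefix.size(), Prefix) != 0) return false;
    if (Name.compare(Name.size() - Suffix.size(), Suffix.size(), Suffix) != 0) return false;

    const std::string Digits = Name.substr(Prefix.size(), Name.size() - Prefix.size() - Suffix.size());
    if (Digits.size() > 9) return false; // keeps the value well inside unsigned long
    for (char C : Digits)
    {
        if (C < '0' || C > '9') return false;
    }
    OutPid = std::stoul(Digits);
    return true;
}

// Hypothetical rule: delete only when the name parses, the file is at least
// 30 days old (assumed reading of "30 days old"), and the pid is not running.
bool ShouldDeleteContextFile(const std::string& Name, std::chrono::hours Age,
                             const std::function<bool(unsigned long)>& IsPidRunning)
{
    assert(IsPidRunning != nullptr); // the caller must supply a process check
    const std::chrono::hours MaxAge(24 * 30);
    unsigned long Pid = 0;
    return ParsePidFromContextName(Name, Pid) && Age >= MaxAge && !IsPidRunning(Pid);
}
```

Passing the process check in as a callback keeps the sketch platform-neutral; a real caller would supply an OS-specific query.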

#rb Jamie.Dale

[CL 16522984 by Patrick Laflamme in ue5-main branch]
2021-06-01 16:55:30 -04:00
Patrick Laflamme
2e5316e1ca Generalized the Editor analytics summary session system to be usable/extendable by other apps.
Engine/Editor changes:

- Split the Editor summary session in two, one summary for the Engine properties and one for the Editor specific properties. Made it easy to extend the Engine summary to create other summaries.
- Made the summary sender as agnostic as possible of the keys it sends.
- Fixed the system-wide lock contention between processes when persisting a session. (One problem caused by the lock is UE-114315.)
- Fixed a concurrency issue when saving the summary sessions on Linux/Mac.
- Fixed a performance issue when saving the summary session on Linux/Mac. This enables saving at a higher frequency.
- Fixed cases where the same session summary is sent more than once.
- Fixed a Windows registry key overflow that could happen if we accumulated too many sessions (in theory, this can happen).
- Made adding new properties to the summary easy and private to the implementation.
- Brought the Linux/Mac implementation closer to Windows implementation.
- Reduced memory allocation, especially when the session records a crash.
- Improved chances to send the summary non-delayed by allowing the Editor to send the reports if CRC died unexpectedly.
- Generalized the support to collect and aggregate analytics from helper processes. For example, CRC already collects analytics that are merged with the Editor summary as supplementary information.
- Reserved the disk space required to store the summary ahead of time to prevent failing later.
- Increased the frequency at which the summary is persisted because saving the summary is more efficient (about every 10 seconds rather than every minute).
- Added unit tests

CrashReportClient changes:

- Created a 'session summary' from the CRC point of view to merge with the Editor summary.
- Moved analytics collection into a separate class to make the crash reporting code leaner and less noisy with all the analytics.
- Merged the CRC diagnostic logger into the class collecting the CRC analytics summary and made the diagnostic log a property in the summary.
- Collected analytics (on behalf of Editor) in a background thread because the CRC main thread can be blocked collecting a crash, so it doesn't pay attention to other things.
- Added MonitorBatteryLevel and MonitorOnACPower summary properties on Windows. Collected on the CRC background thread (never blocked, so we reduce the chances of missing the battery running out).
- Added MonitorSessionDuration summary property to track how long CRC ran.
- Added MonitorQuitSignalRecv summary property to detect when CRC is soft killed like: taskkill /PID 1234
- Added MonitorIsReportingCrash summary property to track when CRC dies reporting a crash.
- Added MonitorIsCollectingCrash summary property to track when CRC dies collecting crash artifacts.
- Added IsProcessingCrash summary property to track when CRC dies processing a crash.
- Added MonitorCrashed summary property to track when the CRC exception handler was triggered.
- Added MonitorWasShutdown summary property to track when the CRC summary was shut down.
- Added MonitorLoggingOut summary property to track when CRC died because the user was logging out (or as a result of shutting down or restarting the computer).
- More accurate value for DeathTimestamp summary property because this is now captured in the CRC background thread (which cannot be busy handling a crash).
- Added crash processing timing to CRC diagnostic logs (how long it takes to collect/process a crash).

#rb Jamie.Dale, Wes.Hunt, Johan.Berg
#jira UETOOL-3500
#jira UE-114315

[CL 16324612 by Patrick Laflamme in ue5-main branch]
2021-05-13 21:58:20 -04:00
aurel cordonnier
50944fd712 Merge UE5/RES @ 16162155 to UE5/Main
This represents UE4/Main @ 16130047 and Dev-PerfTest @ 16126156

[CL 16163576 by aurel cordonnier in ue5-main branch]
2021-04-29 19:32:06 -04:00
Patrick Laflamme
d6a9f2f2e9 Fixed missing PCallstack happening when the Editor has more than 256 threads and the crashing thread is not among the first 256 visited by the OS.
- Bumped the limit from 256 to 512.
  - Always reserve one spot for the crashing thread in the list transmitted to CRC, possibly ignoring some threads.
  - Added diagnostic logs in CRC to capture cases where the number of threads would reach the new limit of 512 or if the crashing thread is 0.
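
The "reserve one spot for the crashing thread" idea could be sketched like this. This is a minimal illustration under assumptions, not the GenericCrashContext code: `SelectThreadsToReport` is a hypothetical function, and placing the crashing thread first is only one way to guarantee its slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: choose which thread ids to transmit under a hard cap.
// One slot is always kept for the crashing thread, so other threads may be dropped.
std::vector<std::uint32_t> SelectThreadsToReport(const std::vector<std::uint32_t>& AllThreadIds,
                                                 std::uint32_t CrashingThreadId,
                                                 std::size_t MaxThreads)
{
    assert(MaxThreads > 0); // at least the crashing thread must fit

    std::vector<std::uint32_t> Selected;
    // The crashing thread goes first so the cap can never squeeze it out.
    Selected.push_back(CrashingThreadId);
    for (std::uint32_t Id : AllThreadIds)
    {
        if (Selected.size() >= MaxThreads) break;
        if (Id != CrashingThreadId) Selected.push_back(Id);
    }
    return Selected;
}
```

With 600 threads and a cap of 512, the result holds the crashing thread plus the first 511 other threads in enumeration order.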

#jira UE-114291 - Fail to capture some Editor PCallstack because a hard limit in GenericCrashContext
#rb Johan.Berg

[CL 16123400 by Patrick Laflamme in ue5-main branch]
2021-04-27 08:35:19 -04:00
Patrick Laflamme
296f501123 Added a diagnostic log to CRC when the handle returned by OpenProcess() is invalid. This handle is used to stack walk the crash and generate a minidump.
#rb trivial

[CL 16020816 by Patrick Laflamme in ue5-main branch]
2021-04-15 10:01:55 -04:00
Johan Berg
b2f93702ab Remove UE4 strings and names from Crash reporting
#rb none
#jira UE-111405, UE-111410, UE-111407, UE-111477, UE-111412, UE-111925, UE-111413, UE-111408, UE-111438, UE-111406

[CL 16002172 by Johan Berg in ue5-main branch]
2021-04-14 04:24:50 -04:00
Patrick Laflamme
b437ca4cd5 Report all Editor bootstrapping failures captured by CrashReportClientEditor (when the Editor dies before analytics could be initialized).
#rb Jamie.Dale

[CL 15645323 by Patrick Laflamme in ue5-main branch]
2021-03-08 16:29:55 -04:00
Martin Ridgers
b8ed8ba3d4 When capturing and reporting callstacks, use the return address of a failure instead of a count of stack frames to trim. The count approach was spread about in many places and fragile to maintain as code changed. This resulted in "noisy" callstacks with distracting boilerplate present like assert dispatch functions.
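
Trimming by return address rather than by a frame count could be sketched as below. This is a hedged illustration, not the engine's stack-walking code: `TrimToReturnAddress` is a hypothetical function, frames are assumed to be ordered innermost first, and leaving the stack untouched when the address is absent is an assumed fallback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: drop every frame above the one matching the failure's return address.
// If the address is not found, the stack is returned unchanged rather than guessed at.
std::vector<std::uint64_t> TrimToReturnAddress(const std::vector<std::uint64_t>& Frames,
                                               std::uint64_t FailureReturnAddress)
{
    assert(FailureReturnAddress != 0); // a null return address cannot identify a frame

    auto It = std::find(Frames.begin(), Frames.end(), FailureReturnAddress);
    if (It == Frames.end()) return Frames;
    return std::vector<std::uint64_t>(It, Frames.end());
}
```

Because the match is on an address, adding or removing an intermediate dispatch function does not change which frame the trimmed stack starts at.
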
#rb brandon.schaefer,will.damon,johan.berg
#rnx

#ushell-cherrypick of 15568119 by Martin.Ridgers

[CL 15568152 by Martin Ridgers in ue5-main branch]
2021-03-02 07:48:13 -04:00
Patrick Laflamme
7a4ad8f56d Fixed out-of-process crash reporting used for the Editor on Windows to prevent deadlocking on allocation.
- The function reporting the crash on the pipe doesn't need to suspend all the threads. The original purpose for suspending the threads was likely to preserve the state of the process as best as possible, but that is not required and prone to deadlocks.

Not suspending all the threads may fix the hypothetical case where the CRC main thread is waiting for a prior ensure call stack to be resolved - I observed degenerate cases on my machine where this could take more than 15 minutes - preventing it from responding promptly to an incoming crash from the Editor. The flow was as follows:
       - Editor fires an ensure, suspends all the threads, pipes a message to CRC to process the ensure.
       - CRC collects the ensure artifacts quickly, replies to the Editor, the Editor resumes, then CRC starts to resolve the call stack (blocking the main thread) from the minidump - degenerate cases can take several minutes.
       - Editor gets CRC messages and resumes its threads.
       - Editor fires a crash, suspends all the threads, pipes a message to CRC to process the crash.
       - CRC main thread is busy, waiting for the previous ensure call stack to be resolved... and doesn't respond promptly to the crash message.
       - With the Editor threads suspended, the code responsible for timing out if CRC takes too long never executes, and the Editor stalls until CRC dies or responds -> the user likely kills the Editor (and possibly CRC).

As a side effect of this change, if CRC doesn't respond promptly to a crash, the thread calling ReportCrash( )/ReportGPUCrash( ) will time out and likely terminate the Editor before CRC could collect the crash artifacts or walk the threads to collect the call stacks.
  - Added a hint to the diagnostic logs reported with the Editor 'SummaryEvent' analytic event to indicate if the crash report was produced after the Editor died, so that the portable call stack wasn't captured.
  - Added a message displayed to the user by CRC saying that the system failed to capture the callstack.
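
The timeout behaviour described above could look roughly like this. This is a minimal sketch, assuming a polling wait: `WaitForResponse` and the `HasResponded` callback are hypothetical and do not reflect the actual ReportCrash() code. The point is only that the waiting thread must keep running for the deadline check to fire, which suspending all threads prevented.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch: poll for a reply from the monitor process until a deadline.
// Returns true if HasResponded reported a reply in time, false if the wait timed out.
bool WaitForResponse(const std::function<bool()>& HasResponded, std::chrono::milliseconds Timeout)
{
    assert(HasResponded != nullptr); // the caller must supply a way to check for the reply

    const std::chrono::steady_clock::time_point Deadline = std::chrono::steady_clock::now() + Timeout;
    while (!HasResponded())
    {
        if (std::chrono::steady_clock::now() >= Deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // brief pause between polls
    }
    return true;
}
```

A caller that gets `false` back can stop waiting and proceed with termination instead of stalling indefinitely.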

#jira UE-108701 - Editor deadlocks when reporting an ensure, a stall or a crash.
#rb Johan.Berg

[CL 15452515 by Patrick Laflamme in ue5-main branch]
2021-02-18 10:40:40 -04:00
will damon
1fe47fa53a Fix the task tag scope of the crash report client on the Mac.
#rb arne.schober
#jira UE-104914
#rnx

#ROBOMERGE-SOURCE: CL 15408307 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v771-15082668)

[CL 15408317 by will damon in ue5-main branch]
2021-02-15 13:23:31 -04:00
Patrick Laflamme
a98b2214e3 On Windows, fixed CRC (out of process mode for Editor) generating an incomplete portable callstack when the crash occurred because a null function pointer was invoked
- When CRC runs out of process, instead of reading the current thread context of the crashed thread, read and use the crash context that was reported during the crash (which is different).
  - Added an optional context parameter to FGenericPlatformStackWalk::CaptureThreadStackBackTrace(), implemented it across all platforms, but only used on Windows.

On Windows, fixed InitStackWalking() and InitStackWalkingForProcess() to reset the process that needs to be walked.
  - CRC, running out of process, may walk its own process or the Editor process, and whichever was walked first ruled out the other.

#jira UE-105006 - [CrashReporter] VCRUNTIME140!7fffce010000 + e390
#rb Johan.Berg
#preflight 15217159

[CL 15319737 by Patrick Laflamme in ue5-main branch]
2021-02-04 14:06:44 -04:00
Marc Audy
cac1fe0019 Merge UE5/Release-Engine-Staging @ CL# 15299266 to UE5/Main
This represents UE4/Main @ CL# 15277572

[CL 15299962 by Marc Audy in ue5-main branch]
2021-02-03 14:57:28 -04:00
geoff evans
4f72b503b8 Add Stall Detector API, enabled only for Editor builds for Windows
This code is meant to help locate and send reports/telemetry for slow code pathways that create unresponsive conditions
FGameThreadHitchHeartBeatThreaded was considered, but doesn't fit Editor's needs because it's designed around general GameThread deadlines
Editor workloads are much less homogeneous, and proper async support for a consistent GameThread deadline in Editor is a ways away
This necessitates a more focused approach where we can instrument specific routines such that each issue their own telemetry report
Add a "Stalls" counter in the Frame Rate and Memory title bar stats
Add LogStall Log category for viewing details about stalls that have occurred
Introduces a stall counter object on the GameThread to collect statistical data about stalls (this will not report to telemetry)
Future changes will introduce report objects into specific routines to upload to crashreporter
Future changes will introduce support for non-Windows OSes
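
Instrumenting a specific routine with its own deadline could be sketched as follows. This is an illustration under assumptions, not the Stall Detector API: `FScopedStallBudget`, `GStallCount`, and `SlowRoutine` are hypothetical names, and the sketch only checks the budget when the scope exits, whereas a watchdog could notice a stall while it is still in progress.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical counter standing in for the statistical stall counter.
static std::atomic<int> GStallCount(0);

// Hypothetical RAII scope: counts a stall when the enclosed work overruns its budget.
class FScopedStallBudget
{
public:
    explicit FScopedStallBudget(std::chrono::milliseconds InBudget)
        : Budget(InBudget), Start(std::chrono::steady_clock::now())
    {
        assert(InBudget.count() >= 0); // a negative budget makes no sense
    }

    ~FScopedStallBudget()
    {
        const std::chrono::steady_clock::duration Elapsed = std::chrono::steady_clock::now() - Start;
        if (Elapsed > Budget)
        {
            ++GStallCount; // this sketch only counts the overrun
        }
    }

private:
    std::chrono::milliseconds Budget;
    std::chrono::steady_clock::time_point Start;
};

// Example instrumented routine: the budget is 1 ms but the work takes about 20 ms.
void SlowRoutine()
{
    FScopedStallBudget Scope(std::chrono::milliseconds(1));
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

Each instrumented routine carries its own budget, which matches the per-routine approach the message describes better than a single thread-wide deadline would.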

#jira none
#rb francis.hurteau

[CL 15213394 by geoff evans in ue5-main branch]
2021-01-26 20:26:53 -04:00
Marc Audy
bc88b73a29 Merge Release-Engine-Staging to Main @ CL# 15151250
Represents UE4/Main @ 15133763

[CL 15158774 by Marc Audy in ue5-main branch]
2021-01-21 16:22:06 -04:00
Patrick Laflamme
39beb94b81 #jira UETOOL-2873 - For MTBF, account for crashes happening before analytics is initialized.
- Count the number of crashes that occur before analytics is initialized and report them as the DelayedCrashCount field of the Editor summary session event.
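
The counting scheme above could be sketched as below. This is a hedged illustration, not the Editor analytics code: `FEarlyCrashTracker` and its methods are hypothetical, and the sketch keeps the count in memory, whereas a real implementation would need it to survive the crashing process. Only the `DelayedCrashCount` field name comes from the commit message.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical tracker: crashes seen before analytics is ready are only counted,
// then reported once as a DelayedCrashCount attribute.
class FEarlyCrashTracker
{
public:
    // Records a crash; it only counts as "delayed" while analytics is not yet initialized.
    void OnCrash()
    {
        if (!bAnalyticsInitialized) ++DelayedCrashCount;
    }

    // Marks analytics as ready and returns the attribute to attach to the summary event.
    std::map<std::string, std::string> OnAnalyticsInitialized()
    {
        assert(!bAnalyticsInitialized); // expected to be called once
        bAnalyticsInitialized = true;

        std::map<std::string, std::string> Attributes;
        Attributes["DelayedCrashCount"] = std::to_string(DelayedCrashCount);
        return Attributes;
    }

private:
    bool bAnalyticsInitialized = false;
    int DelayedCrashCount = 0;
};
```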

#rb Jamie.Dale

[CL 15128718 by Patrick Laflamme in ue5-main branch]
2021-01-18 10:45:38 -04:00