Significant refactor of RHI command list management and submission, and RHI breadcrumbs / RenderGraph (RDG) scopes, to allow for parallel translation of most RHI command lists.
See individual changelists in //UE5/Dev-ParallelRendering for details. A summary of the changes is as follows:
This work's primary goal was to allow as many RHI command lists as possible to be parallel translated, to make more efficient use of many-core systems. To achieve this:
- The submission code paths for the immediate and parallel RHI command lists have been merged into a single function: FRHICommandListExecutor::Submit().
- A "dispatch thread" (which is simply a series of chained task graph tasks) is used to decide which command lists are batched together in a single parallel translate job.
- Individual command lists can disable parallel translate, which forces them to be executed on the RHI thread. This happens automatically if an RHI command list performs an operation that is not thread safe (e.g. buffer lock, or low-level resource transition).
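The batching decision described above can be illustrated with a minimal sketch. This is not the engine's implementation: FCmdListSketch, FTranslateJob, BuildTranslateJobs and MaxBatchSize are hypothetical names, and the real dispatch logic inside FRHICommandListExecutor::Submit() may use different criteria. The sketch only shows the idea that consecutive command lists which allow parallel translate are grouped into jobs, while a command list that disabled parallel translate ends up in a job for the RHI thread, with submission order preserved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an RHI command list; not the engine's type.
struct FCmdListSketch
{
    int Id;
    bool bAllowParallelTranslate; // false after a non-thread-safe operation, e.g. a buffer lock
};

// One translate job: a batch of command lists that are translated together.
struct FTranslateJob
{
    bool bOnRHIThread;           // true when the batch must run on the RHI thread
    std::vector<int> CmdListIds; // command lists in submission order
};

// Groups command lists, in submission order, into translate jobs.
// MaxBatchSize is an assumed tuning knob that limits parallel batches only.
std::vector<FTranslateJob> BuildTranslateJobs(const std::vector<FCmdListSketch>& CmdLists,
                                              std::size_t MaxBatchSize)
{
    assert(MaxBatchSize > 0);

    std::vector<FTranslateJob> Jobs;
    for (const FCmdListSketch& CmdList : CmdLists)
    {
        const bool bOnRHIThread = !CmdList.bAllowParallelTranslate;

        // Start a new job when the target thread changes or a parallel batch is full.
        const bool bStartNewJob = Jobs.empty()
            || Jobs.back().bOnRHIThread != bOnRHIThread
            || (!bOnRHIThread && Jobs.back().CmdListIds.size() >= MaxBatchSize);

        if (bStartNewJob)
        {
            FTranslateJob Job;
            Job.bOnRHIThread = bOnRHIThread;
            Jobs.push_back(Job);
        }
        Jobs.back().CmdListIds.push_back(CmdList.Id);
    }
    return Jobs;
}
```

With four command lists where only the third disables parallel translate, this sketch produces three jobs: a parallel batch of two, one RHI thread job, and a final parallel batch of one.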
One of the primary blockers for parallel translation was the RHI breadcrumb system, and the way RDG builds scopes. This was also refactored to remove these limitations:
- RDG could only push/pop events on the immediate command list, which resulted in parallel and immediate work being interleaved, breaking any opportunity for parallelism.
- Platform RHI implementations of breadcrumbs (e.g. in D3D12 RHI) were not correct across multiple RHI contexts. Push/pop operations aren't necessarily balanced within any one RHI context, given that RDG builds "parallel pass sets" containing arbitrary ranges of renderer passes.
A summary of the new RHI breadcrumb system is as follows:
- A tree of breadcrumb nodes is built by the render thread and RDG. Each node contains the node name, and pointers to the parent and next nodes. When fully built, the nodes form a depth-first linked list which is used for traversing the tree for GPU crash debugging.
- The memory for breadcrumb nodes is provided by ref-counted allocator objects. These allocators are pipelined through the RHI, allowing the platform RHI implementation to extend their lifetime for GPU crash debugging purposes.
- RHIPushEvent / RHIPopEvent have been removed, replaced with RHIBeginBreadcrumbGPU / RHIEndBreadcrumbGPU. Platform RHIs implement these functions to perform GPU immediate writes using the unique ID of each node, for tracking GPU progress.
- Format string arguments are captured by-value to remove the cost of string formatting while building the breadcrumb tree. String formatting only occurs when the actual formatted string is required (e.g. during GPU crash breadcrumb stack traversal, or when calling platform GPU profiling APIs).
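A minimal sketch of the breadcrumb node layout and deferred formatting described above, under stated assumptions. FBreadcrumbNodeSketch, FindNodeById and BuildBreadcrumbStack are hypothetical names, the node captures a single integer argument for brevity, and standard C++ stands in for engine types. The engine's real node type, allocator handling and GPU write mechanism are not shown.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical breadcrumb node; field names are illustrative only.
struct FBreadcrumbNodeSketch
{
    const char* Format;            // static format string, e.g. "BasePass %d"
    int Arg;                       // argument captured by value when the node is created
    unsigned Id;                   // unique ID a platform RHI could write from the GPU
    FBreadcrumbNodeSketch* Parent; // enclosing breadcrumb, or nullptr for the root
    FBreadcrumbNodeSketch* Next;   // next node in depth-first order

    // Formatting is deferred until the name is actually needed.
    std::string GetName() const
    {
        assert(Format != nullptr);
        char Buffer[128];
        std::snprintf(Buffer, sizeof(Buffer), Format, Arg);
        return Buffer;
    }
};

// Finds a node by ID by following the depth-first Next links from the first node.
const FBreadcrumbNodeSketch* FindNodeById(const FBreadcrumbNodeSketch* First, unsigned Id)
{
    for (const FBreadcrumbNodeSketch* Node = First; Node != nullptr; Node = Node->Next)
    {
        if (Node->Id == Id)
        {
            return Node;
        }
    }
    return nullptr;
}

// Returns the formatted names from the root down to the given node,
// as a crash report might print a breadcrumb stack.
std::vector<std::string> BuildBreadcrumbStack(const FBreadcrumbNodeSketch* Node)
{
    std::vector<std::string> Stack;
    for (; Node != nullptr; Node = Node->Parent)
    {
        Stack.insert(Stack.begin(), Node->GetName());
    }
    return Stack;
}
```

In this sketch, building the tree costs only pointer and integer writes. The string is produced only when GetName() is called, for example after looking up the last ID the GPU is known to have reached.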
RenderGraph scopes have been simplified:
- The separate scope trees / arrays of ops have been combined. There is now a single tree of RDG scopes containing all types.
- Each RDG pass holds a pointer to the scope it was created under.
- BeginCPU / EndCPU are called on each RDG scope as the various RDG threads enter / exit them. This allows us to mark up each worker thread with the relevant Unreal Insights scopes.
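One way to picture the enter / exit behaviour is the sketch below. It is an illustration under assumptions, not RDG's code: FScopeSketch, PathFromRoot and TransitionScopes are hypothetical names, and events are recorded as strings instead of calling real BeginCPU / EndCPU functions. Because each pass points at the scope it was created under, a thread moving from one pass to the next only needs to end the scopes up to the common ancestor and begin the scopes below it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scope node in a single scope tree.
struct FScopeSketch
{
    std::string Name;
    const FScopeSketch* Parent; // nullptr for the root scope
};

// Returns the chain of scopes from the root down to Scope (empty when Scope is null).
static std::vector<const FScopeSketch*> PathFromRoot(const FScopeSketch* Scope)
{
    std::vector<const FScopeSketch*> Path;
    for (; Scope != nullptr; Scope = Scope->Parent)
    {
        assert(Scope->Parent != Scope); // a scope cannot be its own parent
        Path.insert(Path.begin(), Scope);
    }
    return Path;
}

// Records the events needed to move a thread from scope From to scope To.
// Either pointer may be null, meaning "no scope".
std::vector<std::string> TransitionScopes(const FScopeSketch* From, const FScopeSketch* To)
{
    const std::vector<const FScopeSketch*> FromPath = PathFromRoot(From);
    const std::vector<const FScopeSketch*> ToPath = PathFromRoot(To);

    // Length of the shared prefix, i.e. the depth of the common ancestor chain.
    std::size_t Common = 0;
    while (Common < FromPath.size() && Common < ToPath.size() && FromPath[Common] == ToPath[Common])
    {
        ++Common;
    }

    std::vector<std::string> Events;
    // Exit scopes innermost first, stopping at the common ancestor.
    for (std::size_t Index = FromPath.size(); Index > Common; --Index)
    {
        Events.push_back("EndCPU " + FromPath[Index - 1]->Name);
    }
    // Enter scopes outermost first, below the common ancestor.
    for (std::size_t Index = Common; Index < ToPath.size(); ++Index)
    {
        Events.push_back("BeginCPU " + ToPath[Index]->Name);
    }
    return Events;
}
```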
Other changes include:
- Fixes for bugs uncovered when parallel translate was enabled.
- Adjusted platform affinities as required by the new layout of thread tasks in the renderer.
- Refactored RHI draw call stats to better fit the new pipeline design.
#rb jeannoe.morissette, zach.bethel
#jira UE-139543
[CL 30973133 by Luke Thatcher in ue5-main branch]
- Create our own JNI wrapper on top of latest play-services-games-v2 classes
- Adapt OSSGooglePlay to use this wrapper
- Remove deprecated/unsupported functionality
- Fix some issues with JNI support
#jira UE-201481
[REVIEW] [at]Michael.Kirzinger, [at]Chris.Babcock, [at]Bertrand.Carre
#rb Chris.Babcock, Michael.Kirzinger, Sam.Zamani
[CL 30963393 by rafa lecina in ue5-main branch]
Known issues remaining: we have to pause before switching surface binding (this should eventually be handled internally).
Still need to finish the changes for compiling the reference ASIS Android Studio project sample. This will be in another submit.
#rb chris.babcock
#android
[CL 30685682 by ahmed siddique in ue5-main branch]
When using bForceUniqueLogNames, we record offsets to an initial Engine TimeStamp which gives us similar functionality to just using WorldTimeStamp in the Log Visualizer (i.e. multiple runs will line-up at time 0).
When not using bForceUniqueLogNames, we only record the offset when the data is cleared in the Log Visualizer. That will allow multiple runs to record properly on a single timeline (rather than having new data overwrite old data).
bForceUniqueLogNames is now the default setting.
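The two offset policies can be sketched as follows. This is an interpretation of the description above, not the Log Visualizer's code: FLogTimelineSketch and its member functions are hypothetical, and the only assumption modelled is when the offset is re-recorded.

```cpp
#include <cassert>

// Hypothetical model of how recorded engine timestamps map onto the timeline.
class FLogTimelineSketch
{
public:
    explicit FLogTimelineSketch(bool bInForceUniqueLogNames)
        : bForceUniqueLogNames(bInForceUniqueLogNames)
        , Offset(0.0)
    {
    }

    // Called when a new run starts recording.
    void OnRecordingStarted(double EngineTime)
    {
        if (bForceUniqueLogNames)
        {
            Offset = EngineTime; // each run is offset to its own start, so runs line up at time 0
        }
    }

    // Called when the recorded data is cleared.
    void OnDataCleared(double EngineTime)
    {
        if (!bForceUniqueLogNames)
        {
            Offset = EngineTime; // later runs continue on the same timeline
        }
    }

    // Converts an engine timestamp into a position on the timeline.
    double ToTimelineTime(double EngineTime) const
    {
        assert(EngineTime >= Offset);
        return EngineTime - Offset;
    }

private:
    bool bForceUniqueLogNames;
    double Offset;
};
```

In the unique-names mode of this sketch, an event 5 seconds into any run lands at 5 on the timeline. In the shared mode, a second run that starts 100 seconds after the data was cleared places the same event at 105, so it does not overwrite the first run.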
#jira UE-203873
#rb ben.hoffman, Mieszko.Zielinski, Yoan.StAmant
#lockdown marc.audy
[CL 30639009 by jodon karlik in ue5-main branch]
SlackReport debug command for TArray slack tracking.
* Requires ENABLE_ARRAY_SLACK_TRACKING define to be enabled at compile time (enabling the tracking all the time would be too significant a performance hit, as it requires a function call per array count change).
* Generates TSV (tab separated value) reports in Shared/Logs/SlackReport, which can be loaded into a spreadsheet program. A filename can be specified to the command, or a default one with an incrementing index will be used.
* Switch -Stack=N controls the number of stack frames used for sorting. Setting a value less than the maximum (9 levels) can coalesce instances of the same allocation from slightly different call stacks. Switch -Verbose=0 enables a more condensed report that only shows the largest unique Num / Max bucket per call stack.
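The coalescing effect of -Stack=N can be illustrated with a small sketch. It is not the SlackReport implementation: FSlackSampleSketch, CoalesceByStackDepth and the SlackBytes field are hypothetical, and the sketch assumes call stacks are stored innermost frame first and that entries sharing the same leading N frames are merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one tracked array allocation.
struct FSlackSampleSketch
{
    std::vector<std::string> CallStack; // innermost frame first (assumption)
    long long SlackBytes;               // illustrative metric only
};

// Sums samples whose call stacks match in their first StackDepth frames.
std::map<std::vector<std::string>, long long> CoalesceByStackDepth(
    const std::vector<FSlackSampleSketch>& Samples, std::size_t StackDepth)
{
    assert(StackDepth >= 1 && StackDepth <= 9); // 9 levels is the stated maximum

    std::map<std::vector<std::string>, long long> Totals;
    for (const FSlackSampleSketch& Sample : Samples)
    {
        // Truncate the stack to the requested depth and use it as the grouping key.
        const std::size_t Count = std::min(StackDepth, Sample.CallStack.size());
        const std::vector<std::string> Key(
            Sample.CallStack.begin(),
            Sample.CallStack.begin() + static_cast<std::ptrdiff_t>(Count));
        Totals[Key] += Sample.SlackBytes;
    }
    return Totals;
}
```

With a smaller depth, two allocations made by the same function but reached from different callers fall under one key, which is the coalescing the switch description refers to.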
#rb Steve.Robb
[CL 30346075 by jason hoerner in ue5-main branch]
[FYI] jason.hoerner
Original CL Desc
-----------------------------------------------------------------
SlackReport debug command for TArray slack tracking.
* Requires ENABLE_ARRAY_SLACK_TRACKING define to be enabled at compile time (enabling the tracking all the time would be too significant a performance hit, as it requires a function call per array count change).
* Generates TSV (tab separated value) reports in Shared/Logs/SlackReport, which can be loaded into a spreadsheet program. A filename can be specified to the command, or a default one with an incrementing index will be used.
* Switch -Stack=N controls the number of stack frames used for sorting. Setting a value less than the maximum (9 levels) can coalesce instances of the same allocation from slightly different call stacks. Switch -Verbose=0 enables a more condensed report that only shows the largest unique Num / Max bucket per call stack.
#rb Steve.Robb
[CL 30336635 by justin peterson in ue5-main branch]
* Requires ENABLE_ARRAY_SLACK_TRACKING define to be enabled at compile time (enabling the tracking all the time would be too significant a performance hit, as it requires a function call per array count change).
* Generates TSV (tab separated value) reports in Shared/Logs/SlackReport, which can be loaded into a spreadsheet program. A filename can be specified to the command, or a default one with an incrementing index will be used.
* Switch -Stack=N controls the number of stack frames used for sorting. Setting a value less than the maximum (9 levels) can coalesce instances of the same allocation from slightly different call stacks. Switch -Verbose=0 enables a more condensed report that only shows the largest unique Num / Max bucket per call stack.
#rb Steve.Robb
[CL 30332344 by jason hoerner in ue5-main branch]
- Moving IAssetCompilingManager to its own header
- Changing FAsyncCompilationNotification members to TUniquePtr<FAsyncCompilationNotification> to remove dependencies on AsyncCompilationHelpers.h
- Removing includes of AsyncCompilationHelpers.h and AssetCompilingManager.h removes 3s of compile time per file on a 3990x
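The dependency-breaking pattern behind the TUniquePtr change can be sketched in standard C++, with std::unique_ptr standing in for TUniquePtr. The class names below are hypothetical and the two halves would normally live in separate files. The point is that a header holding a smart pointer to a forward-declared type does not need to include that type's header, provided the destructor is defined where the type is complete.

```cpp
#include <cassert>
#include <memory>
#include <string>

// ---- What a widely included header might contain ----

class FNotificationSketch; // forward declaration only; no heavy include needed here

class FCompilingManagerSketch
{
public:
    FCompilingManagerSketch();
    ~FCompilingManagerSketch(); // declared here, defined where FNotificationSketch is complete

    std::string GetNotificationText() const;

private:
    // Held by pointer, so this header only needs the forward declaration.
    std::unique_ptr<FNotificationSketch> Notification;
};

// ---- What the matching source file might contain ----

// Full definition; in a real project this would come from the heavy header,
// included only by the source file.
class FNotificationSketch
{
public:
    std::string Text = "Compiling assets";
};

FCompilingManagerSketch::FCompilingManagerSketch()
    : Notification(new FNotificationSketch())
{
}

// Defined after the full type is visible so unique_ptr can delete it.
FCompilingManagerSketch::~FCompilingManagerSketch() = default;

std::string FCompilingManagerSketch::GetNotificationText() const
{
    assert(Notification != nullptr);
    return Notification->Text;
}
```

Files that include only the manager's header no longer pay for parsing the notification type's header, which is the kind of per-file compile time saving the changelist reports.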
#rb henrik.karlsson
[CL 30094069 by christopher waters in ue5-main branch]
[FYI] christopher.waters
Original CL Desc
-----------------------------------------------------------------
Dependency Cleanup
- Moving IAssetCompilingManager to its own header
- Changing FAsyncCompilationNotification members to TUniquePtr<FAsyncCompilationNotification> to remove dependencies on AsyncCompilationHelpers.h
- Removing includes of AsyncCompilationHelpers.h and AssetCompilingManager.h removes 3s of compile time per file on a 3990x
#rb henrik.karlsson
[CL 30054260 by alex kahn in ue5-main branch]
- Moving IAssetCompilingManager to its own header
- Changing FAsyncCompilationNotification members to TUniquePtr<FAsyncCompilationNotification> to remove dependencies on AsyncCompilationHelpers.h
- Removing includes of AsyncCompilationHelpers.h and AssetCompilingManager.h removes 3s of compile time per file on a 3990x
#rb henrik.karlsson
[CL 30051281 by christopher waters in ue5-main branch]
BlueprintCompileAndLoadTimerData has been wrong ever since the compilation manager was added
FScopedCompilerEvent was broken because there are two active compilation logs in many cases
Delete unused stats and replace a few with trace scopes
#rb dan.oconnor
[CL 29996340 by ben zeigler in ue5-main branch]