There was a possible race between FPipe::WaitUntilEmpty() and FPipe::ClearTask(), caused by a race in EventCount between Notify and Wait.
That led to a deadlock on iOS.
Fixed by adding a StoreLoad-style barrier into Notify to ensure proper memory ordering for the Counter value.
An RMW is used as the barrier; this could also be solved with a few seq_cst fences, but those are considered less optimal and are unsupported by TSan.
More details:
Thread 1:
- ClearTask: fetch_subs TaskCount to 0 and calls Notify
- Notify: reads count as 0 and exits without any wakes (the first bit is not set -> nobody is waiting)
Thread 2:
- WaitUntilEmpty: first check fails, second check fails, calls PrepareWait (sets count to 1, returns 0), calls WaitFor with 0
- WaitFor: reads count as 1; count without the first bit is 0, which matches the token -> waits for a Notify that will never happen.
One thread was reading stale data due to memory reordering.
#rb anderson.ramos, danny.couture, Devin.Doucette
#rnx
[CL 36755848 by denys mentiei in 5.5 branch]
- Add delegate fired when the scheduler reaches its maximum oversubscription capacity
- Add function to determine when the scheduler has exceeded its oversubscription capacity
- Add CSV accumulative stat each time the scheduler reaches its maximum oversubscription capacity
[CL 35125841 by danny couture in ue5-main branch]
- Replace busy-wait with a better approach that doesn't waste CPU cycles
#rnx
#rb kevin.macaulayvacher
[CL 34337670 by danny couture in ue5-main branch]
- Activate new frontend by default allowing the old taskgraph to benefit from retraction
#jira UE-117550
[CL 33543508 by danny couture in ue5-main branch]
- Prevent PrepareWait from being reordered with memory operations around it
- This fixes a potential deadlock on some platforms
#rb kevin.macaulayvacher
[CL 32896177 by danny couture in ue5-main branch]
- Fix a potential deadlock by always enqueueing to the global queue and unconditionally performing a wakeup for standby workers, in case they have gone to sleep.
#rb kevin.macaulayvacher
[CL 32397515 by danny couture in ue5-main branch]
- Add missing CORE_API export for WaitForAnyTaskCompleted and AnyTaskCompleted
#rb kevin.macaulayvacher
[CL 32320774 by danny couture in ue5-main branch]
- Replace busy-wait logic with oversubscription using a pool of standby threads.
- This fixes a whole class of deadlocks caused by dependency inversion between tasks on the same callstack.
- Using a simple oversubscription scope, we can now improve concurrency during any long IO or wait operation.
- Oversubscription has already been applied to many wait APIs, so it just works without developers having to know it's there.
- Standby threads go back to sleep as soon as the oversubscription period finishes and they finish their current task.
- Standby threads never busy-yield; they go to sleep when no more work is available, giving their time slice back to real work.
- Standby threads are only active when all normal threads are busy and some oversubscription scopes are active.
- Add dynamic thread creation support so we only create what's needed (especially good on high-core-count machines).
- For platforms where static thread creation is preferred, TaskGraph.UseDynamicThreadCreation=0 can be used.
- The maximum number of standby threads is controlled by TaskGraph.OversubscriptionRatio (current default: 2x).
- Deprecate reserve workers, as they are now superseded by this feature.
- The busy-wait API has been reimplemented using oversubscription and will be deprecated in another CL to keep this one focused.
#jira UE-209887
#rb kevin.macaulayvacher
[CL 32298767 by danny couture in ue5-main branch]
- Fix race where multiple threads trying to steal from the same local queue could miss a task
#rb kevin.macaulayvacher
[CL 32168260 by danny couture in ue5-main branch]
Context:
- FRayTracingSceneAddInstancesTask processed cached and non-cached raytracing primitives in a single-threaded loop
- Non-cached primitives cannot be efficiently parallelized due to auto-instancing logic.
- On the other hand, cached primitives processing doesn't have dependencies between primitives which means we can efficiently split it in parallel tasks.
Change:
- Modified GatherRayTracingRelevantPrimitives_ComputeLOD to output separate arrays for cached and non-cached static raytracing primitives.
- Also moved logic that filtered out cached static primitives depending on ShowFlags/cvars/etc out of RayTracingSceneStaticInstanceTask and into GatherRayTracingRelevantPrimitives_ComputeLOD.
- Preallocate range of instances in RayTracingScene and VisibleMeshCommands for cached static instances.
- Fill RayTracingScene instances and VisibleMeshCommands for cached static instances in parallel.
#rb aleksander.netzel
[CL 31422467 by tiago costa in ue5-main branch]
The ICompilable::WaitCompletionWithTimeout() method is now documented to poll if given a TimeLimitSeconds of 0. To adjust for this, implementers of ICompilable now poll if TimeLimitSeconds is 0 before waiting. A few implementers don't use an event and instead sleep, which is not ideal, but we now at least poll again after sleeping to avoid another round trip to learn whether the task is complete.
FAsyncTaskBase::WaitCompletionWithTimeout now polls for completion when given a time limit of 0 seconds. This simplifies use and avoids unintended yielding.
Before PIE.Startup
FAsyncTask::SyncCompletion Total Inclusive Time 5.25s (40606 calls)
After PIE.Startup
FAsyncTask::SyncCompletion Total Inclusive Time 195ms (39504 calls)
#jira UE-204061
#rb Francis.Hurteau, danny.couture
[CL 30922445 by kevin macaulayvacher in ue5-main branch]
- Restore the ParallelWithPreWork behavior of always calling prework, even when the number of tasks to execute is 0.
#rb kevin.macaulayvacher
[CL 30657555 by danny couture in ue5-main branch]