2019-12-27 09:26:59 -05:00
// Copyright Epic Games, Inc. All Rights Reserved.
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
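r.AsyncPipelineCompile dispatches pipeline compilation on any thread and then waits for a completion event. Below is a minimal sketch of that dispatch-then-wait pattern using standard C++ futures, with std::future standing in for the completion event. PipelineState, CompilePipeline and CompilePipelineAsync are hypothetical names, not RHI API.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>

// Hypothetical stand-in for a pipeline state object; real RHI types differ.
struct PipelineState {
    std::string Description;
    bool bCompiled = false;
};

// Hypothetical compile step; a real RHI would call into the graphics driver here.
PipelineState CompilePipeline(std::string Description) {
    PipelineState State;
    State.Description = std::move(Description);
    State.bCompiled = true;
    return State;
}

// Dispatch compilation onto another thread and return a handle the caller can
// wait on later (the "completion event").
std::future<PipelineState> CompilePipelineAsync(std::string Description) {
    return std::async(std::launch::async, CompilePipeline, std::move(Description));
}
```

A caller would kick off the compile early, do other work, then call wait() or get() on the future only when the pipeline is actually needed.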
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for a D3D error during PSO creation caused by NumRenderTargets. Add code to compute the correct number of valid render targets, preventing an error during PSO creation when NumRenderTargets is > 0 but none of the formats are valid (all formats are DXGI_FORMAT_UNKNOWN).
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization: the D3D12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down from 512KB to 4KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
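Here is why the buddy suballocator wastes memory on larger requests. A buddy allocator rounds each request up to a power-of-two block, so a 257KB request occupies a 512KB block. The sketch below models that rounding and the size threshold. The 256-byte minimum block size and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed minimum block size for this sketch; the real allocator's value is not stated here.
constexpr uint64_t kMinBlockSize = 256;

// A buddy allocator hands out power-of-two blocks, so each request is rounded up.
uint64_t BuddyBlockSize(uint64_t Size) {
    uint64_t Block = kMinBlockSize;
    while (Block < Size) {
        Block <<= 1;
    }
    return Block;
}

// Bytes lost to rounding for a single allocation.
uint64_t BuddyWaste(uint64_t Size) { return BuddyBlockSize(Size) - Size; }

// Only requests at or below the threshold go through the buddy suballocator;
// larger ones would be placed some other way.
bool UseBuddyAllocator(uint64_t Size, uint64_t MaxBuddySize) { return Size <= MaxBuddySize; }
```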
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
Memory optimisation: use NULL-terminated ANSI strings instead of Unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ANSI, converted (slowly) to FStrings, and finally converted back to ANSI strings before being used. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from CL 3241990, which caused distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where the texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw, which lets us use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders, which are specialised against the index-buffer usage and type (none, uint16, uint32). While here, amend the hash functions to make better use of the existing hash helpers, improving the distribution and hopefully reducing the chance of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal (pretty sure I did this before, but somehow it got lost). When an RQT_AbsoluteTime query is inserted into the command stream, add a command-buffer completion handler that records the time of completion, and submit the command buffer immediately. This breaks up command buffers, so it is noticeably slower, and it will fail if inserted in a pass that can't be restarted, but it is currently the only option available. It is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient, I've refactored the FMetalCommandBufferFence implementation to use a single shared-ptr object containing the command buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows timed waits: previously we'd block until completion, unlike the other APIs, which report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly, not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
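The key property of the refactored fence is a wait that can time out instead of blocking forever. The sketch below shows that idea with a standard mutex and condition variable standing in for Metal and dispatch semaphores. It is shared through a single shared object, as described above. CompletionFence and QueryTimeout are hypothetical names, and the timeout values are the ones the note quotes for the other APIs.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>

// Completion state shared between the completion handler (Signal) and waiters.
class CompletionFence {
public:
    void Signal() {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bSignaled = true;
        }
        Condition.notify_all();
    }

    // Returns true if the fence completed within Timeout, false if the wait timed out.
    bool Wait(std::chrono::milliseconds Timeout) {
        std::unique_lock<std::mutex> Lock(Mutex);
        return Condition.wait_for(Lock, Timeout, [this] { return bSignaled; });
    }

private:
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bSignaled = false;
};

// Intervals quoted above: 2s for RQT_AbsoluteTime, otherwise 0.5s.
std::chrono::milliseconds QueryTimeout(bool bAbsoluteTimeQuery) {
    return std::chrono::milliseconds(bAbsoluteTimeQuery ? 2000 : 500);
}
```

The predicate overload of wait_for also guards against spurious wakeups, so a timeout reliably means the work did not complete in time.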
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
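The second bullet trades exact instructions for cheap approximations. This is not the contents of fastmath.usf, which this note does not show. As an illustration of the technique, here is a well-known bit-trick reciprocal. Subtracting the float's bit pattern from a magic constant negates the exponent and gives a piecewise-linear 1/x. The constant 0x7EF311C2 is one commonly used choice, and RcpFastApprox is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret float bits as an integer and back without undefined behaviour.
static uint32_t AsUint(float X) { uint32_t U; std::memcpy(&U, &X, sizeof U); return U; }
static float AsFloat(uint32_t U) { float X; std::memcpy(&X, &U, sizeof X); return X; }

// Approximate 1/x for positive, finite, normal x. With this constant the maximum
// relative error is about 5%; no Newton-Raphson refinement step is applied.
float RcpFastApprox(float X) {
    return AsFloat(0x7EF311C2u - AsUint(X));
}
```

One Newton-Raphson step (r = r * (2 - x * r)) would sharpen the result at the cost of two more instructions. Skipping it is the usual trade-off when the result only feeds a visual estimate such as cone-traced occlusion.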
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitives becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on a GTX 970 and 4.6ms -> .3ms. Worst-case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of the Perf monitor plugin for easily consumable stat CSVs. With the plugin enabled, enter 'PerformanceMonitor help' into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method, because polling IsRunning() is not compatible with -nothreading mode.
#jira UE-40841
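Polling IsRunning() cannot work under -nothreading because, with no background thread, nothing advances the process state between polls. The sketch below shows the pump-style alternative: each Update() call does one slice of work on the caller's thread and reports whether to keep going. MonitoredTask is a hypothetical class, not FMonitoredProcess.

```cpp
#include <cassert>

// Hypothetical single-threaded task monitor. A loop like `while (IsRunning()) {}`
// would spin forever here, because no other thread ever changes the state.
class MonitoredTask {
public:
    explicit MonitoredTask(int InStepsToFinish) : StepsRemaining(InStepsToFinish) {}

    // Pump the task once on the calling thread; returns true while still running.
    bool Update() {
        if (StepsRemaining > 0) {
            --StepsRemaining; // stand-in for reading output / checking the child process
        }
        return StepsRemaining > 0;
    }

private:
    int StepsRemaining;
};
```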
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being DEPTH_READ when used for depth reads. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
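The fence reuse bug follows from how fence waits compare values. A wait completes once the fence's completed value reaches or exceeds the target, so after signaling the maximum value, a wait for 0 is satisfied immediately. Below is a minimal plain-C++ model of that rule and of the monotonic, never-reset values the fix relies on. FenceModel and FenceValueAllocator are hypothetical names, not the D3D12 API.

```cpp
#include <cassert>
#include <cstdint>

// A wait on Value is satisfied once CompletedValue >= Value.
struct FenceModel {
    uint64_t CompletedValue = 0;

    void OnGpuSignal(uint64_t Value) { CompletedValue = Value; }
    bool IsComplete(uint64_t Value) const { return CompletedValue >= Value; }
};

// Fixed scheme: hand out strictly increasing values and never reset the fence,
// so a later wait can never be satisfied by an earlier, larger signal.
struct FenceValueAllocator {
    uint64_t NextValue = 1;
    uint64_t Allocate() { return NextValue++; }
};
```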
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to the base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed an issue with the InvestigateTexture command, which could return an invalid reference depending on timing.
Added the MaxEffectiveScreenSize settings to the InvestigateTexture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompressed data size during internal shader logging.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
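The core of the new ESRAM allocator is that a request can be satisfied by any free pages, not only adjacent ones, and those pages return to the pool when the resource is destroyed. Below is a sketch of that bookkeeping. The step that maps pages onto a contiguous virtual range is platform-specific and omitted, and NoncontiguousPageAllocator is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class NoncontiguousPageAllocator {
public:
    explicit NoncontiguousPageAllocator(size_t NumPages)
        : FreeFlags(NumPages, true), NumFree(NumPages) {}

    // Returns the page indices used (possibly scattered), or an empty vector if
    // not enough pages are free. The caller would map these onto a contiguous
    // virtual address range.
    std::vector<size_t> Allocate(size_t NumPagesNeeded) {
        std::vector<size_t> Pages;
        if (NumPagesNeeded == 0 || NumPagesNeeded > NumFree) {
            return Pages;
        }
        for (size_t Index = 0; Index < FreeFlags.size() && Pages.size() < NumPagesNeeded; ++Index) {
            if (FreeFlags[Index]) {
                FreeFlags[Index] = false;
                Pages.push_back(Index);
            }
        }
        NumFree -= Pages.size();
        return Pages;
    }

    // Returns pages to the pool when the owning resource is destroyed.
    void Free(const std::vector<size_t>& Pages) {
        for (size_t Index : Pages) {
            assert(!FreeFlags[Index]);
            FreeFlags[Index] = true;
        }
        NumFree += Pages.size();
    }

    size_t GetNumFree() const { return NumFree; }

private:
    std::vector<bool> FreeFlags;
    size_t NumFree;
};
```

Because pages need not be adjacent, freeing a resource in the middle of the pool leaves no fragmentation that blocks a later large request.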
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3511476)
#lockdown Nick.Penwarden
=====================================
MAJOR FEATURES + CHANGES
=====================================
Change 3372740 by Chris.Bunner
[Experimental] Partial compute post process pipeline (r.PostProcess.PreferCompute).
StencilSceneTexture added to deferred list.
A few known issues to be fixed in a follow-up CL.
Change 3374187 by Chris.Bunner
Volume texture support for CombineLUTs/Tonemap compute pass.
Refactored common param code to shared sub-class in CombineLUTs and Tonemap PS/CS.
Skip compute post process out-of-bounds writes.
Unsigned type conversion fixes.
Trimmed compute post process shader inputs.
Change 3441680 by Uriel.Doyon
Added units to point light intensity, to allow the user to specify the value in candelas or lumens.
New point light actors now configure the intensity in candelas by default.
Replaced viewport exposure settings by an EV100 slider.
Hiding the tone mapper via its show flag now still applies the exposure.
Added a new AutoExposure method called EV100, which allows specifying:
- MinEV100, MaxEV100
- Calibration Constant
- Exposure Compensation
#jira UE-42783
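The EV100 settings and candela/lumen units rest on standard photometric relations. The sketch below expresses them in C++, assuming the common reflected-light calibration constant K = 12.5 and an isotropic point source for the unit conversion (flux = 4*pi*intensity). It also assumes that compensation is subtracted from the clamped EV, so that positive values brighten. The function names are hypothetical, and this is not the code from the (later backed-out) change.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// EV100 for an average scene luminance L (cd/m^2): EV100 = log2(L * 100 / K).
double LuminanceToEV100(double Luminance, double CalibrationConstant = 12.5) {
    return std::log2(Luminance * 100.0 / CalibrationConstant);
}

// Inverse of LuminanceToEV100.
double EV100ToLuminance(double EV100, double CalibrationConstant = 12.5) {
    return std::pow(2.0, EV100) * CalibrationConstant / 100.0;
}

// Clamp the metered EV100 to [MinEV100, MaxEV100], then apply compensation in stops.
double TargetEV100(double MeteredEV100, double MinEV100, double MaxEV100, double Compensation) {
    double Clamped = std::fmin(std::fmax(MeteredEV100, MinEV100), MaxEV100);
    return Clamped - Compensation;
}

// An isotropic point source of I candelas emits 4*pi*I lumens.
double CandelasToLumens(double Candelas) { return 4.0 * kPi * Candelas; }
double LumensToCandelas(double Lumens) { return Lumens / (4.0 * kPi); }
```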
Change 3454636 by Uriel.Doyon
Fixed point lights having an extra scale of 16 on mobile
#jira UE-45272
Change 3454844 by Uriel.Doyon
Fixed an extra x16 scale on some point lights
#jira UE-45250
Change 3454934 by Chris.Bunner
Backing out changelists 3441680, 3454636 and 3454844 for the sake of integration stability.
Change 3461206 by Guillaume.Abadie
Adds the possibility for scene captures and the player controller to render no primitives at all.
Change 3461207 by Guillaume.Abadie
Exposes showflag details to USceneCaptureComponent. This makes it possible to configure a scene capture's showflags in a Blueprint-encapsulated compositing pipeline.
#jira UE-6810
Change 3461233 by Chris.Bunner
Added Log10 material expression.
Added tooltip for Log2 and Log10.
Change 3461434 by Michael.Trepka
Copy of CL 3456118
In the Metal RHI, report texture streaming as immediately successful, as on D3D, to avoid a race condition leading to deadlock between the Main, Game, Render & RHI threads.
#jira UE-44961
Change 3461770 by Benjamin.Hyder
Submitting TM-RayTracedDistanceField map
Change 3461929 by Marc.Olano
Add Sobol blueprint and material node test maps to RenderTest project
Change 3462249 by Uriel.Doyon
Translucency after DoF is now disabled when showflag postprocess is disabled.
Change 3462371 by Brian.Karis
VT addressing is now 64-bit to support huge sparse virtualized volumes.
16-bit page tables working.
Change 3462936 by Marc.Olano
Extend Sobol testing map with a comparison between the Random Sobol and Next Sobol functions
Change 3464394 by Uriel.Doyon
Improved synchronization for texture streaming commands.
This fixes an issue when accessing FStreamingTexture for pending textures.
Change 3464743 by Guillaume.Abadie
Adds the .usf file extension to all shader source file names and adds checks to verify them at engine load time.
Change 3464818 by Guillaume.Abadie
Fixes compilation error in FindShaderRelativePath
Change 3465184 by Daniel.Wright
r.Shadow.PreShadowResolutionFactor 1.0 on Epic shadow settings
Change 3465283 by Marc.Olano
Update Sobol Gray code tables to match random order tables
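Gray-code ordering is what makes incremental ("next") Sobol generation cheap. Consecutive Gray codes differ in exactly one bit, so each point is the previous one XORed with a single direction number. The sketch below covers the first Sobol dimension only, whose direction numbers are known in closed form. The function names are hypothetical, and the RenderTest tables and their random ordering are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// First Sobol dimension: direction numbers are v_k = 2^(31-k), so each point is the
// bit-reversed Gray code of its index (a van der Corput sequence in Gray order).
uint32_t SobolDim0(uint32_t Index) {
    uint32_t Gray = Index ^ (Index >> 1);
    uint32_t Result = 0;
    for (int Bit = 0; Bit < 32; ++Bit) {
        if (Gray & (1u << Bit)) {
            Result ^= 1u << (31 - Bit);
        }
    }
    return Result;
}

// Gray(Index - 1) and Gray(Index) differ only in the lowest set bit of Index,
// so the next point needs a single XOR with that bit's direction number.
uint32_t SobolDim0Next(uint32_t Previous, uint32_t Index) {
    assert(Index != 0);
    int Bit = 0;
    while (((Index >> Bit) & 1u) == 0) {
        ++Bit;
    }
    return Previous ^ (1u << (31 - Bit));
}

// Map to [0, 1) using the top 24 bits so the result is exact and never reaches 1.
float ToUnitFloat(uint32_t Value) { return float(Value >> 8) * (1.0f / 16777216.0f); }
```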
Change 3465976 by Arne.Schober
DR - [UE-44393] - The Canvas uses the global shaders for clearing, but they are compiled asynchronously at load time. Unfortunately, code (possibly in plugins) could use a canvas to draw in between and trigger this issue. For now we need to block and wait for the shaders to be compiled before allowing the canvas to be used.
#RB none
Change 3467513 by Guillaume.Abadie
Fixes an issue where primitives would no longer draw in gameplay.
#jira UE-45550
Change 3471116 by Richard.Wallis
Mac OpenGL Is No Longer Supported - Remove All Code & Shader Platforms. Merge of CL 3327784 dev-editor stream from Michael Trepka with some extra changes.
- Also removed Metal shader platforms from PlatformSupportsDebugViewShaders() otherwise we get a compiler error. HLSL register binds not implemented in metal backend.
#jira UE-39108
Change 3471117 by Richard.Wallis
Drop-down menus clip on 27" iMacs. Disable viewport HDR rendering on macOS 10.12.x when in the editor.
#jira UE-43026
Change 3471130 by Richard.Wallis
Mac GPU hang causes the editor output log to be written to the wrong file. Try to emulate Windows behaviour when opening a file for reading or writing. Tested against the behaviour of the Windows log file with multiple instances running.
- Only defined for Mac and non-shipping builds.
#jira UE-44934
Change 3471224 by Guillaume.Abadie
Lets the ProjectFileGenerator look at Shaders/ directories in plugin and game projects.
Change 3471646 by Daniel.Wright
Fixed ensure opening UT system settings
Change 3471862 by Arne.Schober
DR - revert accidentally checked-in changes.
#RB Chris.Bunner
Change 3472249 by Guillaume.Abadie
Implements virtual shader source directory mapping.
- /Engine/... maps to Engine/Shaders/...
- /Plugin/FooBar/... maps to FooBar plugin's Shaders/ directory
- /Project/... maps to project's Shaders/ directory
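The mapping above can be sketched as a longest-prefix lookup from virtual prefixes to real directories. The sketch also rejects backslashes, as a later change in this log checks for. VirtualShaderPathMapper is a hypothetical class, and the real directories used in the example are placeholders.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

class VirtualShaderPathMapper {
public:
    void AddMapping(const std::string& VirtualPrefix, const std::string& RealDirectory) {
        Mappings[VirtualPrefix] = RealDirectory;
    }

    // Returns the real path for VirtualPath, or an empty string if no prefix matches
    // or the path contains backslashes (virtual paths use forward slashes only).
    std::string Resolve(const std::string& VirtualPath) const {
        if (VirtualPath.find('\\') != std::string::npos) {
            return std::string();
        }
        const std::pair<const std::string, std::string>* Best = nullptr;
        for (const auto& Entry : Mappings) {
            const std::string& Prefix = Entry.first;
            bool bMatches = VirtualPath.compare(0, Prefix.size(), Prefix) == 0;
            // The longest matching prefix wins, so nested mappings behave predictably.
            if (bMatches && (Best == nullptr || Prefix.size() > Best->first.size())) {
                Best = &Entry;
            }
        }
        if (Best == nullptr) {
            return std::string();
        }
        return Best->second + VirtualPath.substr(Best->first.size());
    }

private:
    std::map<std::string, std::string> Mappings;
};
```

For example, with "/Engine/" mapped to "Engine/Shaders/", the virtual path "/Engine/Private/Common.ush" resolves to "Engine/Shaders/Private/Common.ush".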
Change 3472443 by Daniel.Wright
Moved the Rendering category for lights to be just below the Light category, so the bVisible property is easily accessible
Change 3474537 by Uriel.Doyon
Fixed 'lighting needs rebuild' being triggered after a Blueprint rescript, caused by a non-symmetrical quaternion round trip: Quaternion != ToQuaternion(ToRotator(Quaternion)).
Change 3475192 by Guillaume.Abadie
Implements LensDistortion engine plugin.
This CL imports a polished version of Raven's lens distortion and undistortion from OpenCV parameters:
- It is implemented as the first engine plugin with its own shaders and render thread commands;
- Has feature tests in EngineTest with gold images directly extracted from OpenCV itself (GenerateLensDistortionUndistortReferences.py)
Change 3475209 by Guillaume.Abadie
Back out changelist 3475192
Change 3475252 by Guillaume.Abadie
Reland: Implements LensDistortion engine plugin.
This CL imports a polished version of Raven's lens distortion and undistortion from OpenCV parameters:
- It is implemented as the first engine plugin with its own shaders and render thread commands;
- Has feature tests in EngineTest with gold images directly extracted from OpenCV itself (GenerateLensDistortionUndistortReferences.py)
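The plugin works from OpenCV camera parameters. For reference, the sketch below shows OpenCV's standard distortion model, with radial k1, k2, k3 and tangential p1, p2 terms applied to normalized image coordinates. It also includes a fixed-point inverse of the kind OpenCV's point undistortion uses. This is not the plugin's shader code, which this note does not show, and the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// OpenCV-style coefficients applied to normalized coordinates ((u - cx) / fx, (v - cy) / fy).
struct DistortionCoefficients {
    double K1 = 0, K2 = 0, K3 = 0, P1 = 0, P2 = 0;
};

struct Point2 { double X, Y; };

// Forward model: undistorted normalized point -> distorted normalized point.
Point2 Distort(const DistortionCoefficients& C, Point2 P) {
    double R2 = P.X * P.X + P.Y * P.Y;
    double Radial = 1.0 + C.K1 * R2 + C.K2 * R2 * R2 + C.K3 * R2 * R2 * R2;
    double Xd = P.X * Radial + 2.0 * C.P1 * P.X * P.Y + C.P2 * (R2 + 2.0 * P.X * P.X);
    double Yd = P.Y * Radial + C.P1 * (R2 + 2.0 * P.Y * P.Y) + 2.0 * C.P2 * P.X * P.Y;
    return {Xd, Yd};
}

// The model has no closed-form inverse; fixed-point iteration converges for
// the mild distortion typical of real lenses.
Point2 Undistort(const DistortionCoefficients& C, Point2 Pd, int Iterations = 20) {
    Point2 P = Pd;
    for (int I = 0; I < Iterations; ++I) {
        double R2 = P.X * P.X + P.Y * P.Y;
        double Radial = 1.0 + C.K1 * R2 + C.K2 * R2 * R2 + C.K3 * R2 * R2 * R2;
        double DeltaX = 2.0 * C.P1 * P.X * P.Y + C.P2 * (R2 + 2.0 * P.X * P.X);
        double DeltaY = C.P1 * (R2 + 2.0 * P.Y * P.Y) + 2.0 * C.P2 * P.X * P.Y;
        P.X = (Pd.X - DeltaX) / Radial;
        P.Y = (Pd.Y - DeltaY) / Radial;
    }
    return P;
}
```

An all-zero coefficient set is the identity transform, which is the case where an overscan factor of 1.0 is the natural answer.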
Change 3475389 by Guillaume.Abadie
Adds LensDistortion plugin's feature tests.
Change 3475538 by Guillaume.Abadie
Adds the /Engine/* prefix on all of the renderer's USF file references.
Change 3475568 by Guillaume.Abadie
Adds a check for virtual shader source file path format in FShaderType::FShaderType()
Change 3475871 by Guillaume.Abadie
Fixes a bug in the shader compile worker where an error in a relative #include of a USF file would trigger a check failure in CheckVirtualShaderFilePath
Change 3475997 by Yujiang.Wang
Workaround for a compiler optimization bug introduced in VS2015 Update 3.
* The bug causes TSHVector<2>::CalcDiffuseTransfer to go to infinity at certain points, making movable objects that use ILCQ_Volume indirect lighting cache interpolation get very dark.
* Debug builds don't exhibit this bug.
* Semantics are exactly the same as the original code.
Change 3476203 by David.Hill
Compute SSAO: problem with AmbientOcclusionLevels and with various viewport test sizes. Only seen when Levels >= 2
#jira UE-45741
Change 3476536 by Benjamin.Hyder
adding player start to Ray Traced Distance Field Shadows Map
Change 3478298 by Benjamin.Hyder
disabling mesh distance fields in Tm-Raytraced_DistanceField_Shadows map
Change 3478948 by Rolando.Caloca
DR - Nicer check
Change 3478949 by Rolando.Caloca
DR - Default GPU morphs to enabled
Change 3478950 by Rolando.Caloca
DR - By default -vulkan will launch SM5
Change 3478984 by Rolando.Caloca
DR - Pass down -vulkan
Change 3479655 by Richard.Wallis
Video track does not switch in AVF Media Player. Need to disable unused video tracks to allow AVPlayerItemVideoOutput to decode the required track.
- Minimal change to allow video track changes/selection.
- Audio samples are extracted using AVAssetReaderTrackOutput but video uses AVPlayerItemVideoOutput. Video could also use AVAssetReaderTrackOutput to access the video data unless there is an iOS reason not to...
- Flush the audio sink sample buffers so we get instant audio track changes
#jira UE-39750, UE-39749
Change 3479834 by Rolando.Caloca
DR - Fix issue with bad vertex colors (per licensee)
Change 3480376 by Guillaume.Abadie
Disables ComputeLightGrid() if no volumetric fog and no lighting.
#jira UE-45377
Change 3480596 by Yujiang.Wang
Fix for dynamic shadows and raytraced distance field shadows of directional lights not appearing in planar reflection
* Bug caused by incorrect shadow culling volumes for cascaded shadow map and backface culling mode for WholeSceneShadowProjection
* Fixed by taking View.bReverseCulling into account
#jira UE-34452
Change 3480600 by Yujiang.Wang
Fix for UE-42376
* The bug is caused by post-processing ambient cubemaps not being supported in forward shading currently.
* This fix replaces all occurrences of them in CalcSceneView with a skylight using the cubemap
* If a CalcSceneView is used solely for setting the PP ambient cubemap, it is removed.
#jira UE-42376
Change 3480784 by Rolando.Caloca
DR - hlslcc - Initial support for [RW]StructuredBuffer
Change 3481690 by Uriel.Doyon
Attempt to fix static analysis warning
Change 3482012 by Simon.Tovey
Fixed issue when building distribution lookup tables where the final sample fell short of the max input time.
As sampling is done only over this range, under constant interpolation the final value was never actually sampled and so cut from the final optimized LUT.
#tests constant interpolation now works.
#jira UE-45614
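The fix hinges on the final sample landing exactly on the max input time. With a step of (Max - Min) / N, the last sample stops one step short, so a constant-interpolated key at the end is never sampled. A step of (Max - Min) / (N - 1) fixes that. The sketch below assumes uniform sampling, and BuildLookupTable is a hypothetical name, not the engine's distribution baking code.

```cpp
#include <cassert>
#include <vector>

// Sample Curve at NumSamples points spanning [MinTime, MaxTime] inclusive.
template <typename CurveFn>
std::vector<float> BuildLookupTable(CurveFn Curve, float MinTime, float MaxTime, int NumSamples) {
    assert(NumSamples >= 2);
    std::vector<float> Table;
    Table.reserve(NumSamples);
    const float Step = (MaxTime - MinTime) / float(NumSamples - 1);
    for (int I = 0; I < NumSamples; ++I) {
        // Pin the final sample to MaxTime so float rounding cannot leave it short.
        float Time = (I == NumSamples - 1) ? MaxTime : MinTime + Step * float(I);
        Table.push_back(Curve(Time));
    }
    return Table;
}
```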
Change 3482965 by Yujiang.Wang
Some quality of life changes for UE-42757
* The UV overlay in static mesh editor now has a darker background
* Selected edges are now highlighted and drawn bolder
* When some edges are selected others turn grey
#jira UE-42757
Change 3483014 by David.Hill
Change labels on bloom boost from x,y,z to min, max, mult.
#jira UE-43904
A PropertyRedirect in BaseEngine.ini keeps this working with older versions.
Change 3484573 by Yujiang.Wang
Fix for shadow color not updated after light build when a texture is changed and reimported
* Bug caused by counter-intuitive design of UMaterial::GetReferencedFunctionIds and UMaterial::GetReferencedParameterCollectionIds, both of which will reset the OutIds parameter
* Renamed to AppendReferencedFunctionIdsTo and AppendReferencedParameterCollectionIdsTo, the resets are removed
#jira UE-45647
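The problem with the old functions is that an out parameter that is reset on every call silently discards whatever a caller gathered from earlier sources. The sketch below contrasts the two styles. Both functions are hypothetical stand-ins, and the exact call site in the lighting build is not described here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Reset style: clears OutIds first, so calling it in a loop keeps only the last result.
void GetReferencedIds(const std::vector<std::string>& Source, std::vector<std::string>& OutIds) {
    OutIds.clear();
    OutIds.insert(OutIds.end(), Source.begin(), Source.end());
}

// Append style: adds to whatever the caller has already collected.
void AppendReferencedIdsTo(const std::vector<std::string>& Source, std::vector<std::string>& OutIds) {
    OutIds.insert(OutIds.end(), Source.begin(), Source.end());
}
```

The "AppendTo" naming makes the accumulating behaviour visible at the call site, which is the point of the rename.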
Change 3484969 by Yujiang.Wang
Fix for UE-39929 inconsistent type between C++ and shader code
* MeshDistanceFieldCasterIndices is declared as Buffer<uint> in CapsuleShadowShaders.usf, while created as PF_R32_SINT in CapsuleShadowRendering.cpp
* Changed PF_R32_SINT to PF_R32_UINT in CapsuleShadowRendering.cpp
#jira UE-39929
Change 3485012 by Yujiang.Wang
Fix for UE-39929 #2: Changed int32 to uint32 to match PF_R32_UINT
#jira UE-39929
Change 3485146 by Guillaume.Abadie
Destroys scene capture view states on unregister, to avoid large memory usage caused by the ViewState's render targets when moving Blueprints around.
#jira UE-43455
Change 3486602 by Joe.Conley
Adding "texcoord" keyword to UMaterialExpressionTextureCoordinate so you can search for the name that is displayed on the node in the graph.
Change 3487471 by Yujiang.Wang
Github #3659: Improved performance of DumpUnbuiltLightInteractions
* Replaced TArrays with TSets
#jira UE-45783
Change 3487641 by Guillaume.Abadie
Fixes some shader file name casing issues in LPV.
Change 3488014 by Uriel.Doyon
New AllowAsyncLoading flag for UTexture::CachePlatformData().
It allows the source texture data to be loaded in the async task if the source bulk data was not yet loaded.
Data loaded that way is not shareable between tasks and will be discarded.
This is required because updating the source data is not thread safe.
#jira UERNDR-190
#jira UE-33401
Change 3488249 by Uriel.Doyon
Fixed long stall in UpdateResourceStreaming() caused by Actor.GetComponents() not resetting the number of actors anymore.
Fixed inconsistent results in ALODActor::HasValidSubActors() caused by the same change.
#jira UE-46004
Change 3490228 by Mark.Satterthwaite
Fix the Nvidia driver bug with the old reversebits fallback function - you need to use the native reverse_bits intrinsic or use some uint(ushort()) casts to get the compiler to do the right thing, which means injecting the reverse_bits function in MetalBackend not the HLSL (as it has no such type).
#jira UE-46067
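For reference, the operation in question reverses the order of all 32 bits. The sketch below is a portable software version that swaps progressively larger bit groups. It is illustrative only: the fix relies on the native reverse_bits intrinsic in the Metal backend rather than a hand-written fallback, and ReverseBits32 is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

uint32_t ReverseBits32(uint32_t V) {
    V = ((V >> 1) & 0x55555555u) | ((V & 0x55555555u) << 1); // swap adjacent bits
    V = ((V >> 2) & 0x33333333u) | ((V & 0x33333333u) << 2); // swap bit pairs
    V = ((V >> 4) & 0x0F0F0F0Fu) | ((V & 0x0F0F0F0Fu) << 4); // swap nibbles
    V = ((V >> 8) & 0x00FF00FFu) | ((V & 0x00FF00FFu) << 8); // swap bytes
    V = (V >> 16) | (V << 16);                               // swap 16-bit halves
    return V;
}
```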
Change 3490538 by Arne.Schober
Back out changelist 3488249
#RB none
Change 3490551 by Arne.Schober
Back out changelist 3488249
#RB none
Change 3491828 by Guillaume.Abadie
Fixes another USF file reference casing issue in C++.
Change 3491924 by Yujiang.Wang
Fix for UE-43302 Crash when entering the DebugCreatePlayer console command with planar reflections in the level
* Crash caused by check(Views.Num() <= 2); in SceneCaptureRendering.cpp
* We still want to support at most 2 views for performance, but now, instead of crashing, planar reflections in additional views will simply turn black
#jira UE-43302
Change 3492359 by Guillaume.Abadie
Fixes non editor launches, failing in FGenericPlatformProcess::AddShaderSourceDirectoryMapping().
Change 3492367 by Marc.Olano
Change Sobol texture size to 32x16, tweak distribution
Change 3492599 by Marcus.Wassmer
PR #3669: -Fix logmessages ParticleModules_Location.cpp (Contributed by UpwindSpring01)
Change 3493473 by Uriel.Doyon
Back out changelist 3490538
Change 3493590 by Uriel.Doyon
Back out changelist 3490551
Fixed missing #pragma once
Change 3493911 by Marcus.Wassmer
Fix potential GPU crash/hang caused by out of bound subresource updates.
Added checks at cross-platform level to catch any instance earlier.
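A cross-platform check for out-of-bounds subresource updates comes down to validating the mip index and keeping the region inside that mip's extent. The sketch below writes the comparisons so that X + Width cannot overflow. TextureDesc, UpdateRegion and IsValidSubresourceUpdate are hypothetical names, not RHI types.

```cpp
#include <cassert>
#include <cstdint>

struct TextureDesc { uint32_t Width, Height, NumMips; };
struct UpdateRegion { uint32_t MipIndex, X, Y, Width, Height; };

// Mip extents halve per level but never drop below 1.
uint32_t MipExtent(uint32_t BaseExtent, uint32_t MipIndex) {
    if (MipIndex >= 32) {
        return 1;
    }
    uint32_t Extent = BaseExtent >> MipIndex;
    return Extent > 0 ? Extent : 1;
}

// True if the region lies entirely inside the target mip.
bool IsValidSubresourceUpdate(const TextureDesc& Tex, const UpdateRegion& R) {
    if (R.MipIndex >= Tex.NumMips) {
        return false;
    }
    uint32_t MipW = MipExtent(Tex.Width, R.MipIndex);
    uint32_t MipH = MipExtent(Tex.Height, R.MipIndex);
    return R.X <= MipW && R.Width <= MipW - R.X && R.Y <= MipH && R.Height <= MipH - R.Y;
}
```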
Change 3494139 by Uriel.Doyon
Fixed shadow variable issue on UE4Editor Linux.
Change 3494364 by Richard.Wallis
Mac OpenGL Is No Longer Supported - Remove All Code & Shader Platforms - Part 2: Remove some more areas and fixes for previous attempt. Also removed OpenGL based GPU performance checks in EditorEngine.cpp - assuming that any GPU that can run Metal is currently OK for UE4.
OpenGL left in the following areas:
- OpenGLShaderCompiler
- StandaloneRenderer
The following files need to be reviewed in conjunction with CL 3471116 as there were some logic errors made:
- OpenGLTexture.cpp
#jira UE-39108
Change 3494413 by Guillaume.Abadie
Updates r.InvalidateCachedShader and bumps ShaderVersion.ush.
Change 3494422 by Guillaume.Abadie
Adds LensDistortion plugin's Private shader directory.
Change 3494717 by Guillaume.Abadie
Strengthens shader compiler with checks on generated file names and shader type file names.
Change 3494763 by Guillaume.Abadie
Removes a no-longer-relevant TODO in GlobalBeginCompileShader() that was automatically adding the /Engine/ prefix to all relative virtual shader source file paths.
Change 3494985 by Rolando.Caloca
DR - Integrate Vulkan Rewrite
Change 3495031 by Rolando.Caloca
DR - Delete file as it moved
Change 3495032 by Rolando.Caloca
DR - Show Vulkan SM5 instead of SM4 on windows packaging
- Also added support for Vulkan SM5_UB
Change 3495202 by Uriel.Doyon
Fixed static analysis warning with pointer dereferencing.
Change 3495342 by Rolando.Caloca
DR - clang compile fix
Change 3495354 by Rolando.Caloca
DR - clang compile fixes
Change 3495420 by Marc.Olano
Use Sobol sampling for PCSS
Change 3495799 by Rolando.Caloca
DR - Delete old dev assets
Change 3496202 by Mark.Satterthwaite
Switch to using actual Vector*Matrix intrinsic for Metal to avoid a problem whereby the Metal compiler reorders operations in such a way that it loses precision and ends up being different between pre-pass and base-pass.
#jira UE-46070
Change 3496253 by Uriel.Doyon
Fixed static analysis warning for IncludeTool
Change 3496631 by Guillaume.Abadie
Makes AScreenshotFunctionalTest::ScreenshotOptions blueprint readable.
Change 3496851 by Guillaume.Abadie
Fixes backslash issues in Platform.usf.
Change 3496852 by Guillaume.Abadie
Fixes other backslash includes in PS4-specific usf files.
Change 3496941 by Guillaume.Abadie
Adds a check() for no backslash in virtual shader file paths.
Change 3497661 by Guillaume.Abadie
Lets FLensDistortionCameraModel::GetUndistortOverscanFactor() return 1.0 early if the camera model does an identity transform.
Change 3497969 by Richard.Wallis
Fix for startup movies not playing on iOS devices. Handle the case where a movie is loading async in the background: we need to wait for state changes, otherwise intermediate movies are skipped.
- Tested on iOS and Mac.
#jira UE-39585
Change 3498035 by Guillaume.Abadie
Cleans debugging artifacts out of //Engine/Plugins/Compositing/LensDistortion/Shaders/Private/UVGeneration.usf.
Change 3498101 by Rolando.Caloca
DR - Compile fix
Change 3498254 by Guillaume.Abadie
Exposes comparing FLensDistortionCameraModel to blueprint with == and != operator nodes for cross frame uv displacement map caching.
Change 3498264 by Guillaume.Abadie
Integrate 3267269: Implements SceneCaptureComponent2D::bCameraCutThisFrame
Change 3498371 by Yujiang.Wang
Fix for UE-46149 Planar Reflections display screenspace info when viewports are >2
* Prevent planar reflections being rendered when ViewIndex >= GMaxPlanarReflectionViews
* Now planar reflections in >2 viewports will fall back to other reflection methods (SSR, reflection captures)
#jira UE-46149
Change 3498409 by Rolando.Caloca
DR - Swap resolves
Change 3498410 by Guillaume.Abadie
Adds support for opacity output alpha for post process material when doing a draw material to render target.
Change 3498705 by Rolando.Caloca
DR - Add UID for debugging mem allocations
Change 3498759 by Marcus.Wassmer
No post processing in vertexcolor view mode
#jira UE-44704
Change 3498891 by Rolando.Caloca
DR - Minor Vulkan per frame allocator refactor in prep for changes
Change 3499206 by Rolando.Caloca
DR - Fix temp frame allocator OOM on Vulkan
#jira UE-45913
Change 3499319 by Rolando.Caloca
DR - Vulkan support for StorageBuffer
Change 3499339 by Rolando.Caloca
DR - Remove deprecated typedef
Change 3499400 by Rolando.Caloca
DR - Remove some RHICmdList deprecated functions
Change 3499422 by Rolando.Caloca
DR - Allow buffer transitions inside render passes
Change 3500370 by Rolando.Caloca
DR - Compile fix
Change 3500474 by Rolando.Caloca
DR - Fix static analysis
Change 3500517 by Guillaume.Abadie
Exposes r.PostProcessing.PropagateAlpha to the renderer settings.
Change 3500537 by Guillaume.Abadie
Fixes a bug where scene capture WorldToView matrix would get scale != 1 when scaling the scene capture actor in the world.
#jira UE-39389
Change 3501069 by Mark.Satterthwaite
Bring back temporary 4.16 fix for iOS 9 (CL #3425995) into Dev-Rendering for 4.17 as a real fix will need to wait for 4.18.
temporary fix for skewed textures on IOS 9
#jira UE-44468
Change 3501164 by Michael.Lentine
PR #3402: UE-43131: Format argument count not equal to actual arguments (Contributed by projectgheist)
Change 3501222 by Benjamin.Hyder
Checking in Tm_SobolNoise map
Change 3501612 by zachary.wilson
Adding testing content for RTDF shadows on planar reflections
Change 3501708 by Guillaume.Abadie
Break FPostProcessSettings into smaller structs.
Change 3501830 by Olaf.Piesche
#jira UE-39628; using fix proposed in UDN, will investigate further
Change 3501954 by Marcus.Wassmer
Duplicate 3480903
Light culling safety measures.
Change 3502032 by Mark.Satterthwaite
Fix generation of Metal precompiled headers for the bytecode compiler when using Xcode 9.
Change 3502118 by Uriel.Doyon
Fixed shader compilation issues.
Change 3502191 by Guillaume.Abadie
Implements Composure plugin to make compositing in UE4 easier.
Change 3502192 by Guillaume.Abadie
Implements Composure feature testing in EngineTests
Change 3502196 by Guillaume.Abadie
Creates a dependency of Composure plugin over LensDistortion plugin.
Change 3502213 by Arciel.Rekman
Fix for loading shaders on Linux (UE-46276).
Change 3502243 by Brian.Karis
Bent normal map support.
Multibounce AO.
Spherical Gaussian based specular occlusion.
Change 3502506 by Guillaume.Abadie
Fixes compilation failure in Composure with unity build.
Change 3502507 by Guillaume.Abadie
Fixes composure Set Pass with Render Target blueprint helper.
Change 3502510 by Guillaume.Abadie
Attempts to fix ComposureUtils.cpp compile errors.
Change 3502515 by Guillaume.Abadie
Some other composure failure fixes.
Change 3502545 by Guillaume.Abadie
Fixes some unity build related error in Composure.
Change 3502548 by Guillaume.Abadie
Fixes last missing includes in ComposurePostProcessPass.cpp
Change 3502672 by Guillaume.Abadie
Fixes linux warning in Composure.
Change 3502790 by Ryan.Brucks
float4 PseudoVolumeTexture: Fixed frame layout being a float instead of float2. Now works correctly with non-square frame layouts. Only called in custom nodes and calling with a float still functions properly so no old content will break.
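The fix above turns the frame layout argument into a float2 so the slice grid can be non-square. Below is a minimal C++ sketch of the idea, assuming a nearest-slice lookup. The `PseudoVolumeUV` function and `Float2` struct are hypothetical. This is not the engine's HLSL PseudoVolumeTexture, and it makes no attempt to reproduce that function's filtering between slices.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 2-component vector for this sketch.
struct Float2 { float X, Y; };

// Maps a 3D coordinate (U, V, W), each in [0, 1), to a 2D UV inside a
// flipbook whose slices are laid out in a grid of XYFrames.X columns by
// XYFrames.Y rows. With a float2 layout the grid need not be square.
Float2 PseudoVolumeUV(float U, float V, float W, Float2 XYFrames, float NumFrames)
{
    assert(XYFrames.X >= 1.0f && XYFrames.Y >= 1.0f && NumFrames >= 1.0f);
    assert(W >= 0.0f && W < 1.0f);
    // Nearest slice containing W (no blending between slices).
    const float Slice = std::floor(W * NumFrames);
    // Column and row of that slice in the grid.
    const float Column = std::fmod(Slice, XYFrames.X);
    const float Row = std::floor(Slice / XYFrames.X);
    // Offset to the slice's tile, then scale the in-slice UV by the tile size.
    return { (Column + U) / XYFrames.X, (Row + V) / XYFrames.Y };
}
```

With a 4x2 layout of 8 frames, W = 0.5 selects slice 4, which is the first tile of the second row.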
Change 3502836 by Guillaume.Abadie
Propagates scene capture engine showflag changes from blueprint editor to the blueprint instances.
#jira UE-6810
Change 3503096 by Guillaume.Abadie
Resave an unversioned asset.
Change 3503228 by Yujiang.Wang
Fix for UE-45646 Dynamic Light placed inside of a Dynamic Static Mesh doesn't pass through the geometry
* Bug caused by bReflectiveShadowmap not being passed into SetViewFlagsForShadowPass
* Replaced the true with bReflectiveShadowmap
#jira UE-45646
Change 3503284 by Rolando.Caloca
DR - Fixed initial clear on rendertargets
- Added support for r.Vulkan.EnableValidation 1, 2, 3 & 4
- Dump the vulkan log into VS output log
- Added validation for layouts when using dump log
Change 3503545 by Arciel.Rekman
Fix black UI on Linux (UE-46333)
- Rebuilt hlslcc with clang 3.7.0. Whatever issues we're running into with newer clangs still seem to persist.
#jira UE-46333
Change 3503638 by Daniel.Wright
[Copy] Changed DynamicBentNormalAO back to fp16, as PF_FloatR11G11B10 was not enough precision and introduced banding
Change 3503787 by Marcus.Wassmer
Fix difference between gpu/cpu morph target application
Change 3503902 by Marcus.Wassmer
Roll back TAA refactor until we have time to look into the bad interaction with DOF.
Change 3503953 by Arne.Schober
DR - UE-46319 - borked Reflections: The resource transition needs to be in this weird place for PS4 and Switch until we teach the interface to know about subresources.
#RB Marcus.Wassmer
Change 3504131 by Rolando.Caloca
DR - Maintain a cache of pipeline and descriptor set layouts
- Fix marker dump
Change 3504462 by Guillaume.Abadie
Fixes an assertion failure caused by the compute light grid not having been built, even though the shaders in use were not necessarily using the compute light grid results.
#jira UE-46277
Change 3504779 by Chris.Bunner
Potential static analysis fix.
#jira UE-46360
Change 3504950 by Marc.Olano
Allow Sobol material nodes & textures only if feature level is at least ES3.1
#jira UE-46334
#jira UE-46317
Change 3505035 by Daniel.Wright
Increased MaxSearchCount in GetShaderIncludes. The previous limit of 20 is now getting hit in BasePassPixelShader.usf, causing compiles to fail erroneously.
Change 3505386 by Daniel.Wright
GetShaderIncludes handles infinite recursion gracefully. This is needed for Metal, which causes BasePassTessellation.usf to include BasePassVertexShader.usf.
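Here is a minimal sketch of cycle-safe include collection, assuming an in-memory include graph. `IncludeGraph` and `CollectIncludes` are hypothetical names and do not match the engine's GetShaderIncludes signature. A visited set stops the recursion when it hits a cycle, and a depth limit acts like a search-count cap.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical include graph: file name -> files it #includes directly.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;

// Collects every file reachable from File into Visited. The Visited set makes
// a cycle (A includes B, B includes A) terminate instead of recursing forever,
// and MaxDepth is a separate safety limit on include nesting.
void CollectIncludes(const IncludeGraph& Graph, const std::string& File,
                     std::set<std::string>& Visited, int Depth, int MaxDepth)
{
    if (Depth > MaxDepth || !Visited.insert(File).second)
    {
        return; // too deep, or already processed
    }
    const auto It = Graph.find(File);
    if (It == Graph.end())
    {
        return; // leaf file with no known includes
    }
    for (const std::string& Child : It->second)
    {
        CollectIncludes(Graph, Child, Visited, Depth + 1, MaxDepth);
    }
}
```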
Change 3505491 by Rolando.Caloca
DR - Fix crash on first frame of particles on modern APIs
Change 3505557 by Chris.Bunner
[Duplicate] Workaround for outdated shader map crash.
#jira UE-46061
Change 3506071 by Rolando.Caloca
DR - Vulkan fixes
- Fix copy out of bounds reading textures to CPU
- Defer event deletion
- Split validation for errors and warnings
- Skip validation error about attachment not used
Change 3506698 by Guillaume.Abadie
Fixes Composure alpha channel clobbering and a performance regression in the bloom and tonemapper passes, caused by scene capture API compatibility breakage introduced by the Fortnite merge.
Change 3506797 by Rolando.Caloca
DR - Fix static analysis
#jira UE-46428
Change 3506861 by Rolando.Caloca
DR - Fix crash due to layering violation
#jira UE-46424
#jira UE-46431
Change 3508098 by Rolando.Caloca
DR - Fix for Vulkan ES31 crash
- Fix for AMD ensure
Change 3508123 by Rolando.Caloca
DR - Disable occlusion queries on Vulkan to avoid flickering
- Fix for bad HZB & cube mips on Vulkan (now using RHIGenerateMips)
- Fix for decal blending
#jira UE-46376
Change 3509064 by Uriel.Doyon
Changing the logic around generating an error when HasHadBulkDataCleared() so that it only triggers if the DDC data is not found.
#jira UE-46427
Change 3509854 by Marc.Olano
Fix 2D Sobol gray code numbers.
Just changes some numbers in initialization tables, so no effect on existing tests or content.
Change 3509920 by Marcus.Wassmer
Fix LPV fastvram ensure
Change 3509937 by Rolando.Caloca
DR - Fix crash due to deleted viewport
#jira UE-46281
Change 3509988 by Marcus.Wassmer
Roll back part of Sobol fix to avoid full shader recompile for integration.
Change 3510255 by Rolando.Caloca
DR - Fix popup window ensure
#jira UE-46511
Change 3510646 by Marcus.Wassmer
fix ios compiles
Change 3511442 by Rolando.Caloca
DR - Change mesh simplification check to ensure/checkslow to unblock
#jira UE-46538
DONE!
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CHANGES WITH MULTIPLE PLATFORMS!!! YOU MUST COPY THESE INTO THE OTHER ONES AS MAKES SENSE!!
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Change 3467095 by Guillaume.Abadie
Nukes all += TEXT(".usf")
#jira UE-45530
Change 3475084 by Guillaume.Abadie
Fixes compilation failure of the shader compiler on PS4 and XboxOne
Change 3477464 by Guillaume.Abadie
Fixes dumpshaderinfo generating an unnecessary subdirectory, which broke shell scripts.
Change 3494395 by Guillaume.Abadie
Moves all engine shader files into Public and Private directories, and introduces the .ush extension for header files that do not contain entry points.
DONE!
[CL 3511602 by Marcus Wassmer in Main branch]
2017-06-27 11:38:28 -04:00
#include "Common.ush"
#include "DeferredShadingCommon.ush"
#include "DistanceFieldLightingShared.ush"
#include "DistanceFieldAOShared.ush"
2022-04-22 19:55:41 -04:00
#include "DistanceField/GlobalDistanceFieldShared.ush"
#include "DistanceField/GlobalDistanceFieldUtils.ush"
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
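The brick and indirection bullets above can be illustrated with a CPU-side lookup. This is a hypothetical sketch, not the engine's data layout. It assumes a linear indirection table per mip and a brick stride of 7 voxels, which is the unique part of each 8^3 brick. An invalid entry means no brick was stored there because the region is far from the surface. That is where a tracer would fall back to a lower-resolution mip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for "no brick stored here" (region far from the surface).
constexpr uint32_t InvalidBrick = 0xFFFFFFFFu;
// Each 8^3 brick holds 7^3 unique voxels; the remainder is padding.
constexpr int UniqueVoxelsPerBrick = 7;

struct SparseSdfMipSketch
{
    int BricksX = 0, BricksY = 0, BricksZ = 0;
    std::vector<uint32_t> Indirection; // BricksX * BricksY * BricksZ atlas indices

    // Returns the atlas brick index covering voxel (X, Y, Z), or InvalidBrick.
    uint32_t FindBrick(int X, int Y, int Z) const
    {
        assert(X >= 0 && Y >= 0 && Z >= 0);
        const int BX = X / UniqueVoxelsPerBrick;
        const int BY = Y / UniqueVoxelsPerBrick;
        const int BZ = Z / UniqueVoxelsPerBrick;
        assert(BX < BricksX && BY < BricksY && BZ < BricksZ);
        const std::size_t Index =
            (static_cast<std::size_t>(BZ) * BricksY + BY) * BricksX + BX;
        return Indirection[Index];
    }
};
```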
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
#include "ReflectionEnvironmentShared.ush"
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
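The waste described above comes from power-of-two rounding. Below is a minimal sketch that assumes a classic buddy scheme with a 256-byte minimum block. Both the function names and the minimum block size are hypothetical, and this is not D3D12RHI code. A request just above a power of two occupies the next power of two, so a 4097-byte request uses an 8192-byte block.

```cpp
#include <cassert>
#include <cstdint>

// Block size a buddy allocator would hand out: the smallest power of two
// that is >= Size and >= MinBlockSize (which must itself be a power of two).
uint64_t BuddyBlockSize(uint64_t Size, uint64_t MinBlockSize)
{
    assert(MinBlockSize > 0 && (MinBlockSize & (MinBlockSize - 1)) == 0);
    uint64_t Block = MinBlockSize;
    while (Block < Size)
    {
        Block <<= 1; // double until the request fits
    }
    return Block;
}

// Bytes lost to internal fragmentation for a single request.
uint64_t BuddyWaste(uint64_t Size, uint64_t MinBlockSize)
{
    return BuddyBlockSize(Size, MinBlockSize) - Size;
}
```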
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ansi strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ansi and then converted to FStrings (slowly), before finally being converted back to ansi strings before being used. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
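One general way to reduce collisions between similar pipeline states is to combine per-field hashes in an order-sensitive way. The sketch below applies the widely used boost::hash_combine mixing formula to 32-bit values. It illustrates that pattern only and is not the MetalRHI hash functions this change amended.

```cpp
#include <cassert>
#include <cstdint>

// Folds Value into Seed using the boost::hash_combine mixing formula, so the
// result depends on every input and on the order in which they are combined.
uint32_t HashCombine(uint32_t Seed, uint32_t Value)
{
    return Seed ^ (Value + 0x9e3779b9u + (Seed << 6) + (Seed >> 2));
}
```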
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers so is noticeably slower and if inserted in a pass that can't be restarted will fail but is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed-waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
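The move from blocking until completion to a timed wait can be sketched with standard C++ primitives. This is a hypothetical stand-in. It uses std::condition_variable in place of the dispatch semaphore mentioned above, and the class name is invented for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical fence object; in the change described above a single shared
// object per command buffer is used instead of one allocation per query.
class CommandBufferFenceSketch
{
public:
    // Called from the command buffer's completion handler.
    void Signal()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bSignaled = true;
        }
        Condition.notify_all();
    }

    // Waits at most Timeout and returns false if the work has not completed,
    // instead of blocking until completion. The predicate guards against
    // spurious wakeups.
    bool Wait(std::chrono::milliseconds Timeout)
    {
        std::unique_lock<std::mutex> Lock(Mutex);
        return Condition.wait_for(Lock, Timeout, [this] { return bSignaled; });
    }

private:
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bSignaled = false;
};
```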
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
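The inverted culling in the first bullet, which builds a list of tiles per object instead of a list of objects per tile, can be sketched on the CPU. The `TileRect` type and `BuildTileObjectPairs` function are hypothetical. On the GPU this work happens in culling shaders, and each emitted pair would map to its own cone-tracing thread group.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical screen-space bounds of an object, in tile units (inclusive).
struct TileRect { int MinX, MinY, MaxX, MaxY; };

// Object-centric culling: for each object, emit one (TileIndex, ObjectIndex)
// pair per screen tile it overlaps, so each pair can be scheduled as its own
// unit of work instead of each tile looping over a list of objects.
std::vector<std::pair<int, int>> BuildTileObjectPairs(
    const std::vector<TileRect>& Objects, int TilesX, int TilesY)
{
    std::vector<std::pair<int, int>> Pairs;
    for (int Obj = 0; Obj < static_cast<int>(Objects.size()); ++Obj)
    {
        const TileRect& R = Objects[Obj];
        // Clamp to the screen so fully off-screen objects emit nothing.
        const int X0 = std::max(R.MinX, 0), X1 = std::min(R.MaxX, TilesX - 1);
        const int Y0 = std::max(R.MinY, 0), Y1 = std::min(R.MaxY, TilesY - 1);
        for (int Y = Y0; Y <= Y1; ++Y)
        {
            for (int X = X0; X <= X1; ++X)
            {
                Pairs.emplace_back(Y * TilesX + X, Obj);
            }
        }
    }
    return Pairs;
}
```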
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
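For a single voxel, the caching scheme comes down to a min over two sources, because the distance to a union of shapes is the minimum of the individual distances. The simplified sketch below assumes the cached static distance for the voxel is already available. The function name is hypothetical, and real clipmap updates also involve update regions and object culling.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The mostly-static cache already holds the min distance over Static and
// Stationary primitives; the full field starts from that value and only
// folds in the Movable primitives near the voxel.
float CompositeVoxelDistance(float CachedStaticDistance,
                             const std::vector<float>& MovableDistances)
{
    float Distance = CachedStaticDistance;
    for (float D : MovableDistances)
    {
        Distance = std::min(Distance, D); // union of shapes = min of distances
    }
    return Distance;
}
```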
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
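The fence reuse bug follows from how fence waits work: a wait on value V is satisfied once the fence's completed value is at least V. The minimal model below assumes a single fence with a monotonically increasing counter. `FenceModel` is a hypothetical type, not the D3D12 API. It shows why a wait for 0 is always satisfied, and why signaling the maximum value makes every later wait pass immediately.

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of a GPU fence with monotonically increasing signal values.
struct FenceModel
{
    uint64_t Completed = 0; // last value the GPU has signaled
    uint64_t NextValue = 1; // next value to hand out; never reset or reused

    // CPU side: reserve the value a new submission will signal.
    uint64_t Enqueue() { return NextValue++; }

    // GPU side: the completed value only moves forward.
    void GpuSignal(uint64_t Value)
    {
        if (Value > Completed) { Completed = Value; }
    }

    // A wait on Value is satisfied once Completed >= Value.
    bool IsComplete(uint64_t Value) const { return Completed >= Value; }
};
```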
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on the timing.
Added the MaxEffectiveScreenSize setting to the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
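The following is a minimal sketch of a noncontiguous page allocator that tracks pages by index in a free stack. The class and its methods are hypothetical and are not the ESRAM or XgMemoryLayout API. An allocation may return scattered pages, which the caller would then map onto a contiguous virtual range. Freeing returns pages to the pool so they can be reused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page allocator: physical pages need not be contiguous, because
// the caller maps the returned pages onto one contiguous virtual range.
class PageAllocatorSketch
{
public:
    explicit PageAllocatorSketch(uint32_t NumPages)
    {
        for (uint32_t P = NumPages; P > 0; --P)
        {
            FreePages.push_back(P - 1);
        }
    }

    // Returns NumPages physical page indices (in any order, possibly with
    // gaps), or an empty vector if not enough pages are free.
    std::vector<uint32_t> Allocate(uint32_t NumPages)
    {
        if (NumPages == 0 || NumPages > FreePages.size())
        {
            return {};
        }
        std::vector<uint32_t> Pages(FreePages.end() - NumPages, FreePages.end());
        FreePages.resize(FreePages.size() - NumPages);
        return Pages;
    }

    // Returns pages to the pool so later allocations can reuse them.
    void Free(const std::vector<uint32_t>& Pages)
    {
        FreePages.insert(FreePages.end(), Pages.begin(), Pages.end());
    }

    std::size_t NumFree() const { return FreePages.size(); }

private:
    std::vector<uint32_t> FreePages; // stack of free physical page indices
};
```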
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
RWTexture2D<float4> RWVisualizeMeshDistanceFields;
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
#define MAX_INTERSECTING_OBJECTS 2048
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ansi strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ansi and then converted to FStrings (slowly), before finally being converted back to ansi strings before being used. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
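The sketch below illustrates why the way hashes are combined matters. It uses the well-known boost-style `hash_combine` mixing step, which is a swapped-in generic technique and not necessarily what MetalRHI does; `FPipelineKey`, `NaiveHash` and `MixedHash` are hypothetical names, and the index type encoding (0 = none, 1 = uint16, 2 = uint32) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Boost-style hash_combine mixing step.
inline void HashCombine(std::size_t& Seed, std::size_t Value)
{
    Seed ^= Value + 0x9e3779b9 + (Seed << 6) + (Seed >> 2);
}

// Hypothetical pipeline key: shader hashes plus the index-buffer type the
// tessellation shaders are specialised against.
struct FPipelineKey
{
    std::size_t VertexShaderHash;
    std::size_t FragmentShaderHash;
    uint32_t IndexType;
};

// XOR is order-insensitive and cancels equal inputs, so distinct keys collide.
std::size_t NaiveHash(const FPipelineKey& Key)
{
    return Key.VertexShaderHash ^ Key.FragmentShaderHash ^ Key.IndexType;
}

// Order-sensitive combination: each field perturbs the running seed.
std::size_t MixedHash(const FPipelineKey& Key)
{
    std::size_t Seed = 0;
    HashCombine(Seed, Key.VertexShaderHash);
    HashCombine(Seed, Key.FragmentShaderHash);
    HashCombine(Seed, static_cast<std::size_t>(Key.IndexType));
    return Seed;
}
```

With XOR, swapping the vertex and fragment hashes (or using two identical hashes) yields the same value; the mixed version distinguishes those cases.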
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers, so it is noticeably slower and will fail if inserted in a pass that can't be restarted, but it is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed waits; previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
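The timed-wait behaviour can be sketched portably. The change uses a GCD dispatch semaphore; this C++ sketch substitutes `std::mutex` plus `std::condition_variable`, and `FTimedFence` is a hypothetical name rather than the FMetalCommandBufferFence API.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical fence shared between the submitting thread and a command-buffer
// completion handler; waiters can give up after a fixed interval.
class FTimedFence
{
public:
    // Called from the completion handler once the GPU work has finished.
    void Signal()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bSignaled = true;
        }
        Condition.notify_all();
    }

    // Returns true if signaled within Timeout, false if the wait timed out.
    bool Wait(std::chrono::milliseconds Timeout)
    {
        std::unique_lock<std::mutex> Lock(Mutex);
        return Condition.wait_for(Lock, Timeout, [this] { return bSignaled; });
    }

private:
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bSignaled = false;
};
```

A query could then call `Wait(std::chrono::seconds(2))` for RQT_AbsoluteTime and `Wait(std::chrono::milliseconds(500))` otherwise, reporting failure on `false` instead of blocking indefinitely.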
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
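The inverted culling in the first bullet can be modelled on the CPU: each object walks the screen tiles its bounds overlap and emits one (tile, object) pair per overlap, and each pair becomes its own unit of cone-tracing work. This C++ sketch assumes pixel-space rectangles and row-major tile indices; all names are hypothetical and it is not the DFAO shader code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Screen-space bounds in pixels: inclusive min, exclusive max.
struct FScreenRect { int MinX, MinY, MaxX, MaxY; };

struct FTileObjectPair { uint32_t TileIndex; uint32_t ObjectIndex; };

// Object-major culling: every object lists the tiles it intersects.
std::vector<FTileObjectPair> BuildTileObjectPairs(
    const std::vector<FScreenRect>& ObjectBounds, int TileSize, int TilesX, int TilesY)
{
    std::vector<FTileObjectPair> Pairs;
    for (uint32_t ObjectIndex = 0; ObjectIndex < ObjectBounds.size(); ++ObjectIndex)
    {
        const FScreenRect& R = ObjectBounds[ObjectIndex];
        // Clip to the screen so off-screen objects emit nothing.
        const int MinX = std::max(R.MinX, 0);
        const int MinY = std::max(R.MinY, 0);
        const int MaxX = std::min(R.MaxX, TilesX * TileSize);
        const int MaxY = std::min(R.MaxY, TilesY * TileSize);
        if (MaxX <= MinX || MaxY <= MinY)
        {
            continue;
        }
        // Inclusive tile range covered by the clipped rectangle.
        for (int TileY = MinY / TileSize; TileY <= (MaxY - 1) / TileSize; ++TileY)
        {
            for (int TileX = MinX / TileSize; TileX <= (MaxX - 1) / TileSize; ++TileX)
            {
                Pairs.push_back({uint32_t(TileY * TilesX + TileX), ObjectIndex});
            }
        }
    }
    return Pairs;
}
```

Because work is generated per intersection rather than per tile, a tile touched by many objects no longer serialises them inside one large thread group.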
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
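The caching works because the union of signed distance fields is a per-point minimum: the cached static distance can seed the composite, and only movable primitives need to be folded in again. This C++ sketch evaluates single points with sphere SDFs standing in for mesh distance fields; the real implementation works on clipmap volume textures, and all names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct FVec3 { float X, Y, Z; };
struct FSphere { FVec3 Center; float Radius; };

// Signed distance from P to a sphere (negative inside).
float SphereSDF(const FVec3& P, const FSphere& S)
{
    const float DX = P.X - S.Center.X;
    const float DY = P.Y - S.Center.Y;
    const float DZ = P.Z - S.Center.Z;
    return std::sqrt(DX * DX + DY * DY + DZ * DZ) - S.Radius;
}

// Union of SDFs: the minimum distance over all primitives, starting from Initial.
float UnionSDF(const FVec3& P, const std::vector<FSphere>& Prims, float Initial)
{
    float Distance = Initial;
    for (const FSphere& S : Prims)
    {
        Distance = std::min(Distance, SphereSDF(P, S));
    }
    return Distance;
}

// Full field = cached union of Static/Stationary primitives, then movables on top.
float CompositeFullSDF(const FVec3& P, float CachedStaticDistance, const std::vector<FSphere>& Movable)
{
    return UnionSDF(P, Movable, CachedStaticDistance);
}
```

Moving a Movable primitive only invalidates the composite step; the static union stays valid until a static primitive changes.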
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of the Perf monitor plugin, for easily consumable stat CSVs. With the plugin enabled, enter 'PerformanceMonitor help' into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add single-threaded support to FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method, because polling IsRunning() is not compatible with -nothreading mode.
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
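The fence reuse bug follows from how fence waits are evaluated: work tagged with value V counts as done once the fence's completed value is at least V. This C++ sketch models that rule on the CPU; `FFenceModel` is a hypothetical name, and the comment's reference to `ID3D12Fence::GetCompletedValue` is only an analogy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU model of a fence. Work tagged with value V is complete once
// CompletedValue >= V.
struct FFenceModel
{
    uint64_t CompletedValue = 0; // analogous to ID3D12Fence::GetCompletedValue()
    uint64_t NextValue = 1;      // only ever increases; never reset

    // Returns the value that marks the batch of work just submitted.
    uint64_t Signal() { return NextValue++; }

    // Simulates the GPU reaching a signaled value.
    void Complete(uint64_t Value) { CompletedValue = Value; }

    bool IsComplete(uint64_t Value) const { return CompletedValue >= Value; }
};
```

Under this rule, waiting for 0 is always satisfied, and once MAX UINT has been signaled every later value also reads as complete, which is why a monotonically increasing value with no reset avoids the bug.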
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed an issue with the InvestigateTexture command which could return an invalid reference depending on timing.
Added the MaxEffectiveScreenSize settings to the InvestigateTexture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompressed data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
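The noncontiguous approach can be sketched as a free list of physical pages plus a per-allocation page table that maps them onto a contiguous virtual range. This C++ sketch assumes fixed-size pages and an illustrative 64KB page size in the usage; `FPageAllocator` and `VirtualToPhysical` are hypothetical and not the XgMemoryLayout replacement itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page allocator: any free physical pages can back an allocation,
// in any order, because a page table provides the contiguous virtual view.
class FPageAllocator
{
public:
    explicit FPageAllocator(uint32_t NumPages)
    {
        // Push in reverse so page 0 is handed out first.
        for (uint32_t Page = NumPages; Page-- > 0;)
        {
            FreePages.push_back(Page);
        }
    }

    // Returns the physical page for each virtual page, or an empty table if
    // there are not enough free pages.
    std::vector<uint32_t> Allocate(uint32_t NumPages)
    {
        if (NumPages > FreePages.size())
        {
            return {};
        }
        std::vector<uint32_t> PageTable;
        for (uint32_t I = 0; I < NumPages; ++I)
        {
            PageTable.push_back(FreePages.back());
            FreePages.pop_back();
        }
        return PageTable;
    }

    // Freed pages are immediately available for reuse.
    void Free(const std::vector<uint32_t>& PageTable)
    {
        FreePages.insert(FreePages.end(), PageTable.begin(), PageTable.end());
    }

    size_t NumFreePages() const { return FreePages.size(); }

private:
    std::vector<uint32_t> FreePages;
};

// Translates a byte offset in the contiguous virtual range to a physical address.
uint64_t VirtualToPhysical(const std::vector<uint32_t>& PageTable, uint64_t Offset, uint64_t PageSize)
{
    return uint64_t(PageTable[Offset / PageSize]) * PageSize + Offset % PageSize;
}
```

After freeing an earlier allocation, a new one can be built from scattered pages (for example 2, 1, 0 and 5) while still appearing contiguous through its page table.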
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
groupshared uint IntersectingObjectIndices[MAX_INTERSECTING_OBJECTS];
|
|
|
|
|
|
|
|
|
|
groupshared uint NumIntersectingObjects;
|
|
|
|
|
|
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
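The brick indirection and the distance-based streaming rule above can be modelled on the CPU. This C++ sketch assumes each brick covers 7 unique voxels per axis, a row-major indirection table, and an invalid marker for bricks that are not stored; it also interprets "requests mip1 / mip2" as the number of streamed mips wanted, with distances measured from the view. All names are hypothetical and filtering across the padded brick border is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t UniqueVoxelsPerBrick = 7;   // 8^3 brick, 7^3 unique data
constexpr uint32_t InvalidBrick = 0xFFFFFFFFu; // brick not stored (far from surface)

struct FBrickLookup
{
    uint32_t BrickIndex;             // slot in the brick atlas, or InvalidBrick
    uint32_t LocalX, LocalY, LocalZ; // voxel offset inside the brick, each in [0, 7)
};

struct FSparseMip
{
    uint32_t BricksX, BricksY, BricksZ; // indirection table dimensions
    std::vector<uint32_t> Indirection;  // BricksX * BricksY * BricksZ entries

    FBrickLookup Lookup(uint32_t VoxelX, uint32_t VoxelY, uint32_t VoxelZ) const
    {
        const uint32_t BX = VoxelX / UniqueVoxelsPerBrick;
        const uint32_t BY = VoxelY / UniqueVoxelsPerBrick;
        const uint32_t BZ = VoxelZ / UniqueVoxelsPerBrick;
        const uint32_t Index = (BZ * BricksY + BY) * BricksX + BX;
        return {Indirection[Index],
                VoxelX % UniqueVoxelsPerBrick,
                VoxelY % UniqueVoxelsPerBrick,
                VoxelZ % UniqueVoxelsPerBrick};
    }
};

// Streamed mips wanted for an instance at DistanceMeters from the view:
// within 50m both, within the 200m Global SDF extent one, otherwise none
// (the lowest-resolution mip is always resident).
int NumStreamedMipsWanted(float DistanceMeters)
{
    if (DistanceMeters <= 50.0f)
    {
        return 2;
    }
    if (DistanceMeters <= 200.0f)
    {
        return 1;
    }
    return 0;
}
```

A lookup that returns `InvalidBrick` is the case where tracing must fall back to a lower-resolution mip to get a distance far from the surface.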
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
|
|
|
#define SDF_TRACING_TRAVERSE_MIPS 1
|
|
|
|
|
|
2022-03-04 06:03:15 -05:00
|
|
|
void RayTraceThroughTileCulledDistanceFields(float3 TranslatedWorldRayStart, float3 WorldRayDirection, float MaxRayTime, out float MinRayTime, out float TotalStepsTaken, out uint HitCulledObjectIndex)
|
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
{
|
|
|
|
|
MinRayTime = MaxRayTime;
|
|
|
|
|
TotalStepsTaken = 0;
|
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
|
|
|
HitCulledObjectIndex = 0xFFFFFFFF;
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
LOOP
for (uint ListObjectIndex = 0; ListObjectIndex < min(NumIntersectingObjects, (uint)MAX_INTERSECTING_OBJECTS); ListObjectIndex++)
{
uint ObjectIndex = IntersectingObjectIndices[ListObjectIndex];
2022-01-31 10:23:36 -05:00
FDFObjectData DFObjectData = LoadDFObjectData(ObjectIndex);
2022-03-04 06:03:15 -05:00
float4x4 TranslatedWorldToVolume = LWCMultiplyTranslation(LWCNegate(PrimaryView.PreViewTranslation), DFObjectData.WorldToVolume);
2022-03-04 06:03:15 -05:00
float3 VolumeRayStart = mul(float4(TranslatedWorldRayStart, 1.0f), TranslatedWorldToVolume).xyz;
float3 TranslatedWorldRayEnd = TranslatedWorldRayStart + WorldRayDirection * MaxRayTime;
float3 VolumeRayEnd = mul(float4(TranslatedWorldRayEnd, 1.0f), TranslatedWorldToVolume).xyz;
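The transform above takes the ray's translated-world start and end points into the object's distance-field volume space. A CPU-side C++ sketch of the same math follows, under these assumptions:

- `Float3`, `Mat4`, `MulPoint`, `MakeTranslation`, `MakeTranslatedWorldToVolume` and `RayToVolume` are hypothetical helpers.
- Matrices are row-major and applied to row vectors, mirroring HLSL `mul(float4(p, 1.0f), M)`.
- Translated world is world plus `PreViewTranslation`.
- `LWCMultiplyTranslation(T, M)` is treated as the plain product `Translation(T) * M`, ignoring large-world-coordinate precision handling.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical minimal types mirroring HLSL float3 / float4x4 (row-major).
struct Float3 { float x, y, z; };
using Mat4 = std::array<std::array<float, 4>, 4>;

// Row-vector transform of a point, like HLSL mul(float4(p, 1.0f), M).xyz.
Float3 MulPoint(const Float3& p, const Mat4& m) {
    const float v[4] = {p.x, p.y, p.z, 1.0f};
    float r[3] = {0.0f, 0.0f, 0.0f};
    for (int col = 0; col < 3; ++col)
        for (int row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return {r[0], r[1], r[2]};
}

// Row-major matrix product A * B (apply A first, then B, for row vectors).
Mat4 Mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Translation matrix for row vectors: the offset lives in the last row.
Mat4 MakeTranslation(const Float3& t) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    m[3][0] = t.x; m[3][1] = t.y; m[3][2] = t.z;
    return m;
}

// Assumed equivalent of LWCMultiplyTranslation(LWCNegate(PreViewTranslation), WorldToVolume):
// translated world -> world (subtract PreViewTranslation) -> volume.
Mat4 MakeTranslatedWorldToVolume(const Float3& preViewTranslation, const Mat4& worldToVolume) {
    const Float3 neg{-preViewTranslation.x, -preViewTranslation.y, -preViewTranslation.z};
    return Mul(MakeTranslation(neg), worldToVolume);
}

// Transforms both ray endpoints; the end point is start + direction * maxRayTime.
void RayToVolume(const Float3& start, const Float3& dir, float maxRayTime,
                 const Mat4& translatedWorldToVolume, Float3& volumeStart, Float3& volumeEnd) {
    const Float3 end{start.x + dir.x * maxRayTime, start.y + dir.y * maxRayTime,
                     start.z + dir.z * maxRayTime};
    volumeStart = MulPoint(start, translatedWorldToVolume);
    volumeEnd = MulPoint(end, translatedWorldToVolume);
}
```

The map is affine, so the segment between the two transformed endpoints is exactly the image of the world-space ray segment. This includes any scale in the world-to-volume matrix.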
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
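In a classic buddy allocator every request is served from a power-of-two block. The rounding remainder is internal fragmentation, and routing only small requests through the suballocator caps it. This C++ sketch illustrates that model only. `RoundUpToPow2`, `BuddyWasteBytes`, `ChooseAllocator` and `AllocatorKind` are hypothetical names. The 4KB and 512KB figures come from the text above. Minimum block sizes and the 64KB PC alignment are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two >= size (size > 0).
uint64_t RoundUpToPow2(uint64_t size) {
    uint64_t block = 1;
    while (block < size) block <<= 1;
    return block;
}

// Bytes lost to rounding when 'size' is served from a buddy allocator.
uint64_t BuddyWasteBytes(uint64_t size) { return RoundUpToPow2(size) - size; }

enum class AllocatorKind { BuddySuballocator, Dedicated };

// Requests up to the threshold are suballocated; larger ones get their own allocation.
AllocatorKind ChooseAllocator(uint64_t size, uint64_t maxBuddyAllocationSize) {
    return size <= maxBuddyAllocationSize ? AllocatorKind::BuddySuballocator
                                          : AllocatorKind::Dedicated;
}
```

In this model, a 300KB request under the old 512KB threshold occupies a 512KB block and wastes 212KB. Under the 4KB threshold it bypasses the buddy allocator entirely.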
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ansi strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ansi and then converted to FStrings (slowly), before finally being converted back to ansi strings for use. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
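One compact way to hold NUL-terminated ANSI symbols is a single shared char buffer plus one offset per symbol, instead of one heap-allocated wide string each. The change itself only states that ANSI strings replaced FStrings, so this layout is an assumption. `AnsiSymbolPool` is a hypothetical type, and symbol names are assumed to be plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical symbol pool: all names live back to back in one char buffer,
// each terminated by '\0', and are addressed by byte offset.
class AnsiSymbolPool {
public:
    // Appends a NUL-terminated copy of 'name' and returns its symbol index.
    uint32_t Add(const char* name) {
        offsets_.push_back(static_cast<uint32_t>(chars_.size()));
        chars_.insert(chars_.end(), name, name + std::strlen(name) + 1);  // include the '\0'
        return static_cast<uint32_t>(offsets_.size() - 1);
    }

    // Returns the stored name with no conversion or copy. The pointer is
    // invalidated by a later Add() if the buffer reallocates.
    const char* Get(uint32_t index) const { return chars_.data() + offsets_[index]; }

    // Bytes used by character data: one per char plus one terminator per symbol.
    std::size_t CharBytes() const { return chars_.size(); }

private:
    std::vector<char> chars_;
    std::vector<uint32_t> offsets_;
};
```

Compared with 2-byte wide characters (TCHAR on Windows), each character costs half as much, before counting any per-string allocation overhead.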
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
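The collisions presumably arose because the pipeline hash did not distinguish variants specialised later through function constants. The general remedy is to fold every input that affects the final pipeline, such as the index-buffer type, into the key with a well-mixing combine. This C++ sketch uses the widely used Boost-style `hash_combine` formula. `PipelineKeySketch`, `IndexBufferType` and `HashPipeline` are hypothetical and do not reflect MetalRHI's actual hash functions.

```cpp
#include <cassert>
#include <cstdint>

// Boost-style hash_combine mixing step (32-bit variant).
inline uint32_t HashCombine(uint32_t seed, uint32_t value) {
    return seed ^ (value + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}

// Hypothetical index-buffer specialisation: none, uint16, uint32.
enum class IndexBufferType : uint32_t { None = 0, UInt16 = 1, UInt32 = 2 };

// Hypothetical pipeline key: shader hashes plus the index-buffer specialisation.
struct PipelineKeySketch {
    uint32_t VertexShaderHash;
    uint32_t PixelShaderHash;
    IndexBufferType IndexType;
};

uint32_t HashPipeline(const PipelineKeySketch& key) {
    uint32_t h = 0;
    h = HashCombine(h, key.VertexShaderHash);
    h = HashCombine(h, key.PixelShaderHash);
    // Without this step the three index-buffer variants would share one hash.
    h = HashCombine(h, static_cast<uint32_t>(key.IndexType));
    return h;
}
```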
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers so is noticeably slower and if inserted in a pass that can't be restarted will fail but is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed-waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
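The fence refactor replaces an unbounded block with a wait that can time out, like the other APIs (2s for RQT_AbsoluteTime, otherwise 0.5s). This C++ sketch models that with `std::mutex` and `std::condition_variable` rather than a GCD dispatch semaphore. `CommandBufferFenceSketch` and `QueryTimeout` are hypothetical, and `Signal()` stands in for the command-buffer completion handler.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical fence shared between the waiting thread and the completion handler.
class CommandBufferFenceSketch {
public:
    // Called from the completion handler once the GPU work has finished.
    void Signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Returns true if signaled within 'timeout', false if the wait timed out.
    bool WaitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Timeouts quoted in the change description.
inline std::chrono::milliseconds QueryTimeout(bool isAbsoluteTimeQuery) {
    return isAbsoluteTimeQuery ? std::chrono::milliseconds(2000)
                               : std::chrono::milliseconds(500);
}
```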
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
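The fast approximations listed above trade accuracy for fewer instructions. To illustrate the kind of trick involved, a reciprocal can be approximated by subtracting a float's bit pattern from a magic constant, then optionally refined with one Newton-Raphson step. A C++ sketch; the constant 0x7EF311C2 is a commonly used choice and, like the function names, is an assumption rather than a copy of fastmath.usf.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit reinterpretation, equivalent to HLSL asuint()/asfloat().
static std::uint32_t AsUint(float F) { std::uint32_t U; std::memcpy(&U, &F, sizeof U); return U; }
static float AsFloat(std::uint32_t U) { float F; std::memcpy(&F, &U, sizeof F); return F; }

// Approximate 1/X for positive normal X: subtracting the bits from the constant
// negates the exponent and linearly approximates the mantissa (roughly 5% max error).
float RcpFastSketch(float X) { return AsFloat(0x7EF311C2u - AsUint(X)); }

// One Newton-Raphson step, R' = R * (2 - X * R), roughly squares the relative error.
float RcpFastNewtonSketch(float X)
{
    const float R = RcpFastSketch(X);
    return R * (2.0f - X * R);
}
```

Both variants assume positive, normal inputs. Zero, negative, denormal, or infinite inputs are not handled.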
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81MB of backing memory -> 32MB in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12MB for the new volume textures
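Reusing a cached static field works because the union of signed distance fields is their pointwise minimum. The static contribution can be read from the cache, and only movable primitives need re-evaluation. A minimal C++ sketch under that assumption; spheres stand in for mesh distance fields and all names are hypothetical. The sketch ignores clipmaps, volume textures, and any distance clamping the engine may apply.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3Sketch { float X, Y, Z; };
struct SphereSketch { Vec3Sketch Center; float Radius; };

// Signed distance from P to a sphere (negative inside).
float SphereSDF(const SphereSketch& S, Vec3Sketch P)
{
    const float DX = P.X - S.Center.X, DY = P.Y - S.Center.Y, DZ = P.Z - S.Center.Z;
    return std::sqrt(DX * DX + DY * DY + DZ * DZ) - S.Radius;
}

// Union of SDFs = pointwise minimum. The static part comes from a cache, so
// only movable primitives are evaluated when one of them changes.
float CompositeDistance(float CachedStaticDistance, const std::vector<SphereSketch>& Movables, Vec3Sketch P)
{
    float Distance = CachedStaticDistance;
    for (const SphereSketch& M : Movables) { Distance = std::min(Distance, SphereSDF(M, P)); }
    return Distance;
}
```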
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17MB of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of the performance monitor plugin for easily consumable stat CSVs. With the plugin enabled, enter 'PerformanceMonitor help' into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method, because polling IsRunning() is not compatible with -nothreading mode.
#jira UE-40841
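Polling IsRunning() breaks in -nothreading mode because, without a background thread, nothing else drives the process state forward. An Update() call lets the polling loop do that work itself. A minimal C++ sketch of the pattern; `MonitoredProcessSketch` and its tick counter are hypothetical stand-ins, and only the method names IsRunning() and Update() come from the change description.

```cpp
#include <cassert>

// Hypothetical stand-in for a monitored child process. When no worker thread
// exists (as under -nothreading), nothing advances the state unless the
// caller pumps it, so a loop that only polls IsRunning() never terminates.
class MonitoredProcessSketch
{
public:
    explicit MonitoredProcessSketch(int TicksUntilExit) : RemainingTicks(TicksUntilExit) {}

    bool IsRunning() const { return RemainingTicks > 0; }

    // Performs one slice of monitoring work (in a real implementation: read
    // pending output, check the exit code) and reports whether it is still running.
    bool Update()
    {
        if (RemainingTicks > 0) { --RemainingTicks; }
        return IsRunning();
    }

private:
    int RemainingTicks;
};
```

A caller would then loop with `while (Process.Update()) { /* other work */ }` instead of spinning on IsRunning().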
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after CL 3231395. This change reapplies the D3D12RHI changes from CL 3231395, because those changes were lost when integrating from //Dev-Rendering/, and also includes the depth fixes:
- Fix depth state not being DEPTH_READ when used for depth reads. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code; there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
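A D3D12 fence wait completes once the fence's completed value is greater than or equal to the awaited value. That is why signaling UINT64_MAX and then waiting for 0 returned immediately, and why fence values should only ever increase. A minimal C++ model of that rule; `FenceModelSketch` is hypothetical and does not call the D3D12 API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

// CPU-side model of a GPU fence (hypothetical; no D3D12 calls).
class FenceModelSketch
{
public:
    // Each submission gets a strictly increasing value; the counter is never reset.
    std::uint64_t NextSignalValue() { return ++LastIssued; }

    // What the GPU does when it reaches the signal command.
    void GpuSignal(std::uint64_t Value) { Completed.store(Value); }

    // A wait for Value is satisfied once the completed value reaches or exceeds it.
    bool IsComplete(std::uint64_t Value) const { return Completed.load() >= Value; }

private:
    std::uint64_t LastIssued = 0;
    std::atomic<std::uint64_t> Completed{0};
};
```

Once a fence has been signaled with the maximum value, every later wait passes immediately, which reproduces the bug described above.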
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on timing.
Added the MaxEffectiveScreenSize settings in the InvestigateTexture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
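The allocator described above keeps a pool of fixed-size physical pages and gives each resource its own page list. Scattered pages therefore appear as one contiguous virtual range, and they return to the pool when the resource is freed. A minimal C++ model of that idea; `PageAllocatorSketch`, the page size, and the free-list policy are hypothetical, and nothing here touches ESRAM or the XgMemoryLayout API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical page allocator: physical pages may be scattered, but each
// allocation's page table makes them look contiguous.
class PageAllocatorSketch
{
public:
    struct Allocation { std::vector<std::uint32_t> Pages; };

    PageAllocatorSketch(std::uint32_t NumPages, std::uint32_t InPageSize) : PageSize(InPageSize)
    {
        for (std::uint32_t I = NumPages; I > 0; --I) { FreePages.push_back(I - 1); }
    }

    // Takes any free pages, contiguous or not; fails only if too few remain.
    std::optional<Allocation> Allocate(std::uint64_t Bytes)
    {
        const std::uint64_t Needed = (Bytes + PageSize - 1) / PageSize;
        if (Needed > FreePages.size()) { return std::nullopt; }
        Allocation Result;
        for (std::uint64_t I = 0; I < Needed; ++I)
        {
            Result.Pages.push_back(FreePages.back());
            FreePages.pop_back();
        }
        return Result;
    }

    // Pages become reusable as soon as the resource is destroyed.
    void Free(const Allocation& A) { for (std::uint32_t P : A.Pages) { FreePages.push_back(P); } }

    // Maps an offset in the allocation's contiguous virtual range to a physical address.
    std::uint64_t Translate(const Allocation& A, std::uint64_t Offset) const
    {
        return std::uint64_t(A.Pages[Offset / PageSize]) * PageSize + Offset % PageSize;
    }

    std::size_t NumFreePages() const { return FreePages.size(); }

private:
    std::uint32_t PageSize;
    std::vector<std::uint32_t> FreePages;
};
```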
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
float3 VolumeRayDirection = VolumeRayEnd - VolumeRayStart;
float VolumeRayLength = length(VolumeRayDirection);
VolumeRayDirection /= VolumeRayLength;
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
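The C++ sketch below models two parts of this scheme. The first is locating a voxel's brick through an indirection table, where bricks far from the surface have no atlas entry. The second is choosing how many streamed mips to request from the 200m and 50m thresholds. Only the 8^3/7^3 brick sizes, the G8 format, and the distance thresholds come from the description. Several details are assumptions:
- the flat indirection layout and the `InvalidBrickSketch` marker;
- a voxel-to-brick mapping that ignores the half-voxel padding;
- reading "mip1"/"mip2" as "one/two streamed mips";
- distance to the view standing in for "inside the Global SDF extent".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int BrickSize = 8;       // stored voxels per brick axis
constexpr int UniqueDataSize = 7;  // unique voxels per brick axis (rest is padding)
constexpr std::uint32_t InvalidBrickSketch = 0xFFFFFFFFu; // hypothetical "not stored" marker

// Bytes per brick with one byte (G8) per voxel.
constexpr int BrickBytesG8() { return BrickSize * BrickSize * BrickSize; }

// Bricks needed along one axis to cover Resolution voxels of unique data.
int BricksPerAxis(int Resolution) { return (Resolution + UniqueDataSize - 1) / UniqueDataSize; }

// Looks up the atlas slot for the brick containing voxel (X, Y, Z), using a
// flat X-fastest indirection table. Padding is ignored for simplicity.
std::uint32_t LookupBrick(const std::vector<std::uint32_t>& Indirection, int Resolution, int X, int Y, int Z)
{
    const int N = BricksPerAxis(Resolution);
    const int BX = X / UniqueDataSize, BY = Y / UniqueDataSize, BZ = Z / UniqueDataSize;
    return Indirection[(BZ * N + BY) * N + BX];
}

// Streamed mips to request; the lowest-resolution mip is always resident.
int StreamedMipsToRequest(float DistanceMeters)
{
    if (DistanceMeters <= 50.0f) { return 2; }
    if (DistanceMeters <= 200.0f) { return 1; }
    return 0;
}
```

When LookupBrick returns the invalid marker, the point lies outside the narrow band. A distance query would then fall back to a lower-resolution mip, as the description notes for points far from the surface.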
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
float2 VolumeSpaceIntersectionTimes = LineBoxIntersect(VolumeRayStart, VolumeRayEnd, -DFObjectData.VolumePositionExtent, DFObjectData.VolumePositionExtent) * VolumeRayLength;
if (VolumeSpaceIntersectionTimes.x < VolumeSpaceIntersectionTimes.y && VolumeSpaceIntersectionTimes.x < VolumeRayLength)
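The shader lines above clip a world-space segment against the object's volume box. The segment length converts LineBoxIntersect's parametric results into distances, and the `if` accepts only a non-empty overlap that begins before the segment ends. A C++ sketch of that logic, assuming LineBoxIntersect is a slab test returning entry/exit times in [0,1] along the segment. That assumption is read off the multiplication by VolumeRayLength, not taken from the engine source, and the helper types and names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Minimal stand-ins for the HLSL vector types (hypothetical helpers).
struct float2 { float x, y; };
struct float3 { float x, y, z; };
static float3 Sub(float3 a, float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float3 Mul(float3 a, float3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
static float Length(float3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }
static float Saturate(float v) { return std::clamp(v, 0.0f, 1.0f); }

// Slab-method segment/box test: parametric entry/exit times along Start->End,
// clamped to [0,1]. Axis-aligned segments rely on IEEE 1/0 = inf.
float2 LineBoxIntersectSketch(float3 Start, float3 End, float3 BoxMin, float3 BoxMax)
{
    const float3 D = Sub(End, Start);
    const float3 InvD = {1.0f / D.x, 1.0f / D.y, 1.0f / D.z};
    const float3 T0 = Mul(Sub(BoxMin, Start), InvD);
    const float3 T1 = Mul(Sub(BoxMax, Start), InvD);
    const float Enter = std::max({std::min(T0.x, T1.x), std::min(T0.y, T1.y), std::min(T0.z, T1.z)});
    const float Exit = std::min({std::max(T0.x, T1.x), std::max(T0.y, T1.y), std::max(T0.z, T1.z)});
    return {Saturate(Enter), Saturate(Exit)};
}

// Mirrors the shader lines: scale parametric times by the segment length, then
// accept only a non-empty overlap that starts before the segment ends.
bool IntersectVolumeSketch(float3 RayStart, float3 RayEnd, float3 Extent, float2& OutTimes)
{
    const float RayLength = Length(Sub(RayEnd, RayStart));
    if (RayLength <= 0.0f) { return false; } // guard added for the sketch
    const float2 T = LineBoxIntersectSketch(RayStart, RayEnd, {-Extent.x, -Extent.y, -Extent.z}, Extent);
    OutTimes = {T.x * RayLength, T.y * RayLength};
    return OutTimes.x < OutTimes.y && OutTimes.x < RayLength;
}
```

The normalized VolumeRayDirection is not needed for the hit test itself, so the sketch only uses the segment length.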
#jira UE-40554
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits wern't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return invalid reference depending on the timing,
Added th MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
{
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
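The "1d pooled allocator" for the brick atlas can be illustrated with a fixed-size slot pool. Every allocation is a whole number of identical bricks, and each brick is reached through the indirection table. The slots given to one mesh therefore need not be contiguous, so freed slots can always be reused and the pool does not fragment.

BrickSlotAllocator and SlotToAtlasBrick are hypothetical names used only in this sketch. Mapping a linear slot to a 3D atlas position with a fixed X/Y footprint is one assumed way the atlas could grow along Z without invalidating existing slots. It is not necessarily how the engine lays out its atlas.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pool of fixed-size brick slots. Slots are plain indices; the
// indirection table, not contiguity, ties a mesh's bricks together.
class BrickSlotAllocator {
public:
    explicit BrickSlotAllocator(uint32_t capacity) : capacity_(capacity) {}

    // Allocates `count` slots, reusing freed slots first. Returns false (and leaves
    // `out` untouched) when the pool is exhausted; a real atlas would be resized here.
    bool Allocate(uint32_t count, std::vector<uint32_t>& out) {
        if (freeList_.size() + (capacity_ - highWater_) < count) {
            return false;
        }
        out.clear();
        for (uint32_t i = 0; i < count; ++i) {
            if (!freeList_.empty()) {
                out.push_back(freeList_.back());
                freeList_.pop_back();
            } else {
                out.push_back(highWater_++);
            }
        }
        return true;
    }

    // Returns slots to the pool; any freed slot fits any future brick.
    void Free(const std::vector<uint32_t>& slots) {
        freeList_.insert(freeList_.end(), slots.begin(), slots.end());
    }

    // Models growing the atlas: existing slot indices stay valid.
    void Grow(uint32_t newCapacity) {
        if (newCapacity > capacity_) {
            capacity_ = newCapacity;
        }
    }

private:
    uint32_t capacity_;
    uint32_t highWater_ = 0;
    std::vector<uint32_t> freeList_;
};

struct BrickCoord {
    uint32_t x, y, z;
};

// Maps a linear slot to a brick position in an atlas that is atlasBricksX by
// atlasBricksY bricks wide, so growing the atlas only adds Z layers.
inline BrickCoord SlotToAtlasBrick(uint32_t slot, uint32_t atlasBricksX, uint32_t atlasBricksY) {
    return {slot % atlasBricksX, (slot / atlasBricksX) % atlasBricksY, slot / (atlasBricksX * atlasBricksY)};
}
```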
uint ReversedMipIndex = 0;
FDFAssetData DFAssetMipData = LoadDFAssetData(DFObjectData.AssetIndex, ReversedMipIndex);
uint MaxMipIndex = DFAssetMipData.NumMips - 1;
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
Sparse, narrow band, streamed Mesh Signed Distance Fields
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
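The Embree bullet in this commit describes a two-step sign computation: an exact unsigned distance from a point query, then a small set of rays whose backface hits decide whether the sample lies inside. A minimal sketch of only the voting step, assuming the rays have already been cast (Embree itself is not called) and using a hypothetical 50% threshold; the engine's actual rule may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class RayHit { Miss, FrontFace, BackFace };

// Hypothetical threshold: share of rays that must hit a backface before the sample is
// classified as inside the mesh. The commit does not state the real rule.
constexpr float kInsideBackfaceFraction = 0.5f;

// Converts an unsigned closest distance into a signed one by voting over the sample's rays.
// Misses and front-face hits both count against "inside": such rays either escape to open
// space or reach the surface from the outside.
float SignedDistance(float unsignedDistance, const std::vector<RayHit>& rayHits) {
    assert(unsignedDistance >= 0.0f);
    assert(!rayHits.empty());
    std::size_t backfaceHits = 0;
    for (RayHit hit : rayHits) {
        if (hit == RayHit::BackFace) {
            ++backfaceHits;
        }
    }
    const bool inside = static_cast<float>(backfaceHits) >
                        kInsideBackfaceFraction * static_cast<float>(rayHits.size());
    return inside ? -unsignedDistance : unsignedDistance;
}
```

Splitting the work this way means the expensive part (the distance) comes from one point query, while the ray count only has to be large enough to classify the sign reliably.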
#if !SDF_TRACING_TRAVERSE_MIPS
2021-11-18 14:37:34 -05:00
ReversedMipIndex = MaxMipIndex;
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
DFAssetMipData = LoadDFAssetData(DFObjectData.AssetIndex, ReversedMipIndex);
#endif
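The `#if !SDF_TRACING_TRAVERSE_MIPS` branch above samples `MaxMipIndex` directly instead of walking mips. Read together with the Sparse SDF commit (three mips, the lowest resolution always loaded, mip1 requested inside the 200m Global SDF extent and mip2 at 50m), one plausible model is sketched below. It assumes the "reversed" index counts up from the always-resident coarsest mip, so `MaxMipIndex` would be the finest mip that has finished streaming; that reading, the inclusive boundaries, and all names are assumptions.

```cpp
#include <cassert>

// Assumed numbering: 0 = coarsest mip (always resident), 2 = finest of the three.
constexpr float kCentimetersPerMeter = 100.0f;                  // Unreal units are centimeters
constexpr float kMip1Distance = 200.0f * kCentimetersPerMeter;  // Global SDF extent per the commit
constexpr float kMip2Distance = 50.0f * kCentimetersPerMeter;

// Finest reversed mip an instance at this distance would request under the commit's rule.
int RequestedReversedMip(float distanceCm) {
    assert(distanceCm >= 0.0f);
    if (distanceCm <= kMip2Distance) {
        return 2;
    }
    if (distanceCm <= kMip1Distance) {
        return 1;
    }
    return 0;  // only the always-resident coarsest mip
}
```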
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
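One D3D12 fix folded into this integration (from CL 3231395) reads: "We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled." D3D12 fences are monotonic counters, and a wait completes once the fence value is greater than or equal to the requested value, so a fence holding the maximum value satisfies every wait immediately. The sketch below models only those comparison semantics on the CPU (it calls no D3D12 API; all names are hypothetical), together with the fixed pattern of signaling strictly increasing values and never resetting.

```cpp
#include <cassert>
#include <cstdint>

// CPU model of fence semantics: a wait for value V is satisfied once completed >= V.
class FenceModel {
public:
    void Signal(uint64_t value) { completed_ = value; }
    bool IsComplete(uint64_t value) const { return completed_ >= value; }

private:
    uint64_t completed_ = 0;
};

// Fixed pattern: each submission gets a fresh, strictly larger value; no resets.
class MonotonicFenceCounter {
public:
    uint64_t NextSignalValue() {
        assert(lastSignaled_ < UINT64_MAX);
        return ++lastSignaled_;
    }

private:
    uint64_t lastSignaled_ = 0;
};
```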
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
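The Sparse SDF commit (CL 15790658) stores distance fields "in 8^3 bricks (7^3 unique data, half voxel padding)", and tracing "must load the brick index from the indirection table, and only bricks near the surface are stored". That implies a two-level lookup per mip. A C++ sketch of the index arithmetic, assuming a row-major indirection table and a sentinel for empty cells; the table format, the sentinel, and all names are hypothetical, and the padding sample is not modelled. An empty cell corresponds to the commit's case where a lower resolution mip must be consulted because the query is far from the surface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Consecutive bricks advance 7 voxels per axis (7^3 unique samples per 8^3 brick).
constexpr int kBrickUniqueSize = 7;
constexpr uint32_t kInvalidBrick = 0xFFFFFFFFu;  // hypothetical "no brick stored" marker

struct Int3 { int x, y, z; };

// Hypothetical per-mip indirection table: one entry per brick cell.
struct IndirectionTable {
    Int3 dims;                      // bricks per axis
    std::vector<uint32_t> entries;  // atlas brick index, or kInvalidBrick

    uint32_t At(Int3 b) const { return entries[(b.z * dims.y + b.y) * dims.x + b.x]; }
};

struct BrickSample {
    bool valid;           // false when the cell is outside the stored narrow band
    uint32_t atlasBrick;  // index of the 8^3 brick in the atlas
    Int3 local;           // voxel inside the brick, 0..6 on each axis
};

BrickSample LookupBrick(const IndirectionTable& table, Int3 voxel) {
    assert(voxel.x >= 0 && voxel.y >= 0 && voxel.z >= 0);
    Int3 brick{voxel.x / kBrickUniqueSize, voxel.y / kBrickUniqueSize, voxel.z / kBrickUniqueSize};
    assert(brick.x < table.dims.x && brick.y < table.dims.y && brick.z < table.dims.z);
    uint32_t index = table.At(brick);
    if (index == kInvalidBrick) {
        return BrickSample{false, kInvalidBrick, Int3{0, 0, 0}};
    }
    Int3 local{voxel.x % kBrickUniqueSize, voxel.y % kBrickUniqueSize, voxel.z % kBrickUniqueSize};
    return BrickSample{true, index, local};
}
```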
float SampleRayTime = VolumeSpaceIntersectionTimes.x;
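`VolumeSpaceIntersectionTimes.x` reads like the entry time of the ray against the distance field volume's bounds, so marching would start there. Assuming that (the code producing the value is not shown in this section), the standard slab method yields such an (entry, exit) pair. A minimal C++ version with hypothetical names follows; clamping the entry time to 0 so a ray that starts inside the box marches from its origin is a choice made here, not something confirmed by the source.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

struct Vec3 { float x, y, z; };

// Slab-method ray vs axis-aligned box. Returns {tEntry, tExit}; the ray misses when tEntry > tExit.
std::pair<float, float> RayBoxIntersection(Vec3 origin, Vec3 dir, Vec3 boxMin, Vec3 boxMax) {
    assert(boxMin.x <= boxMax.x && boxMin.y <= boxMax.y && boxMin.z <= boxMax.z);
    float tEntry = 0.0f;
    float tExit = std::numeric_limits<float>::infinity();
    const float o[3] = {origin.x, origin.y, origin.z};
    const float d[3] = {dir.x, dir.y, dir.z};
    const float lo[3] = {boxMin.x, boxMin.y, boxMin.z};
    const float hi[3] = {boxMax.x, boxMax.y, boxMax.z};
    for (int axis = 0; axis < 3; ++axis) {
        if (d[axis] == 0.0f) {
            // Parallel to this pair of planes: the ray is either always between them or never.
            if (o[axis] < lo[axis] || o[axis] > hi[axis]) return {1.0f, 0.0f};
            continue;
        }
        float inv = 1.0f / d[axis];
        float t0 = (lo[axis] - o[axis]) * inv;
        float t1 = (hi[axis] - o[axis]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tEntry = std::max(tEntry, t0);
        tExit = std::min(tExit, t1);
    }
    return {tEntry, tExit};
}
```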
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
uint StepIndex = 0;
|
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
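A rough C++ sketch of the brick addressing described above, written CPU-side for illustration (the engine does this in HLSL). It assumes a per-mip indirection table indexed by brick coordinate, where each entry is either an atlas brick index or an invalid marker for bricks outside the narrow band. The names (SparseMip, LookupBrick, kInvalidBrick) and the exact half-voxel offset are hypothetical, not the engine's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Layout taken from the commit text: each brick stores 8^3 voxels, 7^3 of them
// unique; the extra layer is padding shared with the neighbouring brick.
constexpr int kBrickSize = 8;
constexpr int kUniqueDataBrickSize = kBrickSize - 1;
constexpr int32_t kInvalidBrick = -1;  // hypothetical marker: brick not stored (outside the narrow band)

struct SparseMip {
    int dimsInBricks[3];               // indirection table size per axis
    std::vector<int32_t> indirection;  // brick atlas index, or kInvalidBrick
};

struct BrickLookup {
    int32_t brickIndex = kInvalidBrick;  // where the brick lives in the atlas
    float local[3] = {0.f, 0.f, 0.f};    // sample coordinate inside the brick, in voxels
};

// Maps a position in the mip's voxel space to its brick and in-brick coordinate.
BrickLookup LookupBrick(const SparseMip& mip, const float voxelPos[3]) {
    assert(mip.indirection.size() ==
           static_cast<std::size_t>(mip.dimsInBricks[0]) * mip.dimsInBricks[1] * mip.dimsInBricks[2]);
    BrickLookup result;
    int brick[3];
    for (int i = 0; i < 3; ++i) {
        brick[i] = static_cast<int>(std::floor(voxelPos[i] / kUniqueDataBrickSize));
        if (brick[i] < 0 || brick[i] >= mip.dimsInBricks[i]) {
            return BrickLookup{};  // outside the asset's volume
        }
        // Assumed half-voxel offset so samples are placed relative to voxel centres.
        result.local[i] = voxelPos[i] - static_cast<float>(brick[i] * kUniqueDataBrickSize) + 0.5f;
    }
    const int flat = (brick[2] * mip.dimsInBricks[1] + brick[1]) * mip.dimsInBricks[0] + brick[0];
    result.brickIndex = mip.indirection[flat];
    return result;
}
```

A miss (kInvalidBrick) is the case where, per the commit text, a lower-resolution mip has to be consulted instead.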
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
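The shader loop in this file (StepIndex, MaxSteps = 64, bHit, SampleSparseMeshSignedDistanceField) looks like a sphere trace over the mesh SDF. Its hit test and step update are not visible in this excerpt, so the C++ sketch below assumes the textbook version: advance the ray by the sampled distance and stop when it drops under a small threshold. SphereTrace, hitThreshold and maxRayTime are illustrative names, and an analytic function stands in for the sparse SDF sample.

```cpp
#include <cmath>
#include <functional>

struct Float3 { float x, y, z; };

inline Float3 operator+(Float3 a, Float3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Float3 operator*(Float3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

struct TraceResult {
    bool bHit;
    float HitTime;
    unsigned Steps;
};

// Sphere trace: stepping by the sampled distance is safe as long as the field
// never overestimates the true distance to the nearest surface.
TraceResult SphereTrace(Float3 rayStart, Float3 rayDirection, float maxRayTime,
                        const std::function<float(Float3)>& sampleDistance,
                        unsigned maxSteps = 64, float hitThreshold = 1e-3f) {
    float sampleRayTime = 0.0f;
    unsigned stepIndex = 0;
    for (; stepIndex < maxSteps; ++stepIndex) {
        const Float3 samplePosition = rayStart + rayDirection * sampleRayTime;
        const float distance = sampleDistance(samplePosition);
        if (distance < hitThreshold) {
            return {true, sampleRayTime, stepIndex};
        }
        sampleRayTime += distance;
        if (sampleRayTime > maxRayTime) {
            break;  // ray left the volume without hitting anything
        }
    }
    return {false, sampleRayTime, stepIndex};
}
```

With a narrow-band field, distances far from the surface are presumably clamped to the band width, which keeps steps safe but short; that fits the commit's note that lower-resolution mips must be traversed far from the surface.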
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
|
|
|
uint MaxSteps = 64;
|
|
|
|
|
bool bHit = false;
|
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
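One way to compute the count this fix describes, sketched in C++: the declared count becomes one past the last slot that has a real format, so a request where every format is unknown yields 0. RenderTargetFormat is a hypothetical stand-in for DXGI_FORMAT, and whether the engine counts this way (rather than counting only the non-unknown slots) is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for DXGI_FORMAT; in DXGI the value 0 is DXGI_FORMAT_UNKNOWN.
enum class RenderTargetFormat : uint32_t { Unknown = 0, R8G8B8A8_UNorm, R16G16B16A16_Float };

constexpr uint32_t kMaxRenderTargets = 8;  // D3D12 allows up to 8 simultaneous render targets

// Number of render-target slots the PSO should declare: one past the last
// requested slot with a real format, or 0 when every requested slot is Unknown.
uint32_t ComputeNumValidRenderTargets(
    const std::array<RenderTargetFormat, kMaxRenderTargets>& formats, uint32_t requestedCount) {
    uint32_t numValid = 0;
    for (uint32_t i = 0; i < requestedCount && i < kMaxRenderTargets; ++i) {
        if (formats[i] != RenderTargetFormat::Unknown) {
            numValid = i + 1;
        }
    }
    return numValid;
}
```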
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
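The trade-off behind the threshold can be illustrated with a little arithmetic. A buddy allocator rounds each request up to a power of two, so a request just over a power of two wastes almost half its block; a request served outside the buddy allocator is instead rounded up to the buffer alignment, which on PC is 64KB per the text above. The helper names and the 256-byte minimum block used in the checks are assumptions for illustration, not the RHI's code.

```cpp
#include <cstdint>

// Smallest power of two >= size (precondition: size > 0), as a buddy allocator would hand out.
uint64_t RoundUpToPowerOfTwo(uint64_t size) {
    uint64_t block = 1;
    while (block < size) {
        block <<= 1;
    }
    return block;
}

// Bytes lost to internal fragmentation when a buddy suballocator serves `size`.
uint64_t BuddyWaste(uint64_t size, uint64_t minBlockSize) {
    uint64_t block = RoundUpToPowerOfTwo(size);
    if (block < minBlockSize) {
        block = minBlockSize;
    }
    return block - size;
}

// Bytes lost when the request instead gets its own allocation rounded up to
// `alignment` (a power of two, e.g. the 64KB buffer alignment on PC).
uint64_t StandaloneWaste(uint64_t size, uint64_t alignment) {
    const uint64_t aligned = (size + alignment - 1) & ~(alignment - 1);
    return aligned - size;
}
```

Small buffers waste far more under 64KB alignment than under buddy rounding, which is why the PC threshold has to stay higher.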
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ANSI strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ANSI, converted (slowly) to FStrings, and finally converted back to ANSI strings before being used. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in coherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers so is noticeably slower and if inserted in a pass that can't be restarted will fail but is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed-waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
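A portable C++ sketch of the shared, timed-wait fence idea: one object per command buffer, shared by every query through std::shared_ptr, with a timed wait instead of an unbounded block. The Metal code uses a dispatch semaphore; std::condition_variable is swapped in here as a stand-in, and the class and constant names are hypothetical. The 2s and 0.5s values are the ones quoted for the other APIs.

```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>

// Timeouts quoted in the commit text.
constexpr std::chrono::milliseconds kAbsoluteTimeQueryTimeout{2000};
constexpr std::chrono::milliseconds kDefaultQueryTimeout{500};

// One fence per command buffer, held via std::shared_ptr by every query that
// targets that command buffer. The completion handler calls Signal().
class CommandBufferFence {
public:
    void Signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Timed wait; returns false if completion was not observed within `timeout`.
    bool Wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};
```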
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
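The culling inversion in the first bullet can be sketched as follows: each object emits the screen tiles its bounds overlap, producing a flat list of (tile, object) pairs that can each be scheduled as a small cone-tracing group. The projected screen rectangles, tile size and type names are assumptions, and the engine does this on the GPU rather than in C++.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Screen-space bounds of an object in pixels, inclusive, already clipped to the screen.
struct ScreenRect { int minX, minY, maxX, maxY; };
struct TileObjectPair { int tileX, tileY; uint32_t objectIndex; };

// Object-centric culling: every object lists the tiles it overlaps, so each
// (tile, object) intersection becomes its own small unit of work.
std::vector<TileObjectPair> BuildTileObjectPairs(const std::vector<ScreenRect>& objects,
                                                 int tileSize, int tilesX, int tilesY) {
    std::vector<TileObjectPair> pairs;
    for (uint32_t objectIndex = 0; objectIndex < objects.size(); ++objectIndex) {
        const ScreenRect& r = objects[objectIndex];
        const int x0 = r.minX / tileSize, x1 = std::min(r.maxX / tileSize, tilesX - 1);
        const int y0 = r.minY / tileSize, y1 = std::min(r.maxY / tileSize, tilesY - 1);
        for (int ty = y0; ty <= y1; ++ty) {
            for (int tx = x0; tx <= x1; ++tx) {
                pairs.push_back({tx, ty, objectIndex});
            }
        }
    }
    return pairs;
}
```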
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
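A small model of why the old fence usage could not work. In D3D12 a wait on a fence value is satisfied once the fence reaches or exceeds that value (see ID3D12Fence::GetCompletedValue), so after signaling UINT64_MAX every later wait for 0 passes immediately. The sketch below is a CPU-side stand-in with hypothetical names, showing the monotonically increasing values the fix relies on.

```cpp
#include <cstdint>

// CPU-side model of a D3D12-style fence (hypothetical names). A wait on a
// value is satisfied once the completed value reaches or exceeds it.
class FenceModel {
public:
    // What the GPU does when it executes a Signal command.
    void GpuSignal(uint64_t value) { completedValue_ = value; }

    bool IsComplete(uint64_t value) const { return completedValue_ >= value; }

    // The fix: hand out strictly increasing values and never reset the fence.
    uint64_t NextSignalValue() { return ++lastIssuedValue_; }

private:
    uint64_t completedValue_ = 0;
    uint64_t lastIssuedValue_ = 0;
};
```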
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on the timing.
Added the MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
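A minimal sketch of a non-contiguous page allocator of the kind described: any free pages can back an allocation, the allocation is given a contiguous virtual range, and freeing returns the pages for reuse. The actual page mapping is a platform call that is not shown; the class, fields and free-list policy here are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Records which physical pages back an allocation and the contiguous virtual
// range they would be mapped onto.
struct PageAllocation {
    uint64_t virtualBasePage;             // first page of the contiguous virtual range
    std::vector<uint32_t> physicalPages;  // backing pages, in virtual-address order
};

class NonContiguousPageAllocator {
public:
    explicit NonContiguousPageAllocator(uint32_t numPages) {
        // Push in reverse so the lowest page numbers are handed out first.
        for (uint32_t page = numPages; page > 0; --page) freePages_.push_back(page - 1);
    }

    std::optional<PageAllocation> Allocate(uint32_t numPages) {
        if (freePages_.size() < numPages) return std::nullopt;
        PageAllocation alloc{nextVirtualPage_, {}};
        for (uint32_t i = 0; i < numPages; ++i) {
            alloc.physicalPages.push_back(freePages_.back());
            freePages_.pop_back();
        }
        nextVirtualPage_ += numPages;  // virtual ranges are never reused in this sketch
        return alloc;
    }

    // Returns the pages to the free list so later allocations can reuse them.
    void Free(const PageAllocation& alloc) {
        for (uint32_t page : alloc.physicalPages) freePages_.push_back(page);
    }

private:
    std::vector<uint32_t> freePages_;
    uint64_t nextVirtualPage_ = 0;
};
```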
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
|
|
|
|
|
LOOP
|
|
|
|
|
for (; StepIndex < MaxSteps; StepIndex++)
|
|
|
|
|
{
|
|
|
|
|
float3 SampleVolumePosition = VolumeRayStart + VolumeRayDirection * SampleRayTime;
|
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
|
|
|
float DistanceField = SampleSparseMeshSignedDistanceField(SampleVolumePosition, DFAssetMipData);
|
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller . In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of memory wastage memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ansi strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ansi and then converted to FStrings (slowly), before finally being converted them back to ansi strings before being used. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Propper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in coherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers so is noticeably slower and if inserted in a pass that can't be restarted will fail but is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed-waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
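The fence reuse bug above follows from how D3D12 fence waits work: a wait for value V is satisfied once the fence's completed value is at least V. The sketch below models that rule on the CPU. `FenceModel` is a hypothetical type, not the D3D12 or UE API. It only imitates the comparison behind `ID3D12Fence::GetCompletedValue`. It shows why signaling MAX UINT and then waiting for 0 completes immediately. It also shows why handing out monotonically increasing values removes any need to reset the fence.

```cpp
#include <cassert>
#include <cstdint>

// CPU model of a D3D12-style fence: waits compare with >=, so values should only grow.
class FenceModel {
public:
    // Value the CPU will wait on for the work just submitted.
    uint64_t NextSignalValue() { return ++LastIssued; }

    // Simulates the GPU reaching a signal. This model treats going backwards as a usage error.
    void GpuSignal(uint64_t Value) {
        assert(Value >= Completed && "fence values should not go backwards");
        Completed = Value;
    }

    bool IsComplete(uint64_t WaitValue) const { return Completed >= WaitValue; }
    uint64_t GetCompletedValue() const { return Completed; }

private:
    uint64_t LastIssued = 0;
    uint64_t Completed = 0;
};
```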
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on the timing.
Added the MaxEffectiveScreenSize settings to the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
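The allocator in this entry hands out pages that need not be adjacent in physical ESRAM and maps them onto one contiguous virtual range. Here is a minimal sketch of just the bookkeeping. It assumes a fixed page count and leaves out the virtual-memory mapping itself, which is platform API not shown here. `PageAllocator` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Tracks free physical pages. Each allocation returns the (possibly non-adjacent)
// physical pages that would be mapped, in order, onto a contiguous virtual range.
class PageAllocator {
public:
    explicit PageAllocator(uint32_t NumPages) {
        assert(NumPages > 0);
        for (uint32_t i = NumPages; i > 0; --i) FreePages.push_back(i - 1);
    }

    // Returns the physical page list, or std::nullopt if too few pages are free.
    std::optional<std::vector<uint32_t>> Allocate(uint32_t NumPagesNeeded) {
        if (NumPagesNeeded > FreePages.size()) return std::nullopt;
        std::vector<uint32_t> Pages;
        for (uint32_t i = 0; i < NumPagesNeeded; ++i) {
            Pages.push_back(FreePages.back());
            FreePages.pop_back();
        }
        return Pages;
    }

    // Returns pages to the free list so later allocations can reuse them.
    void Free(const std::vector<uint32_t>& Pages) {
        for (uint32_t Page : Pages) FreePages.push_back(Page);
    }

    size_t NumFreePages() const { return FreePages.size(); }

private:
    std::vector<uint32_t> FreePages; // used as a stack; back() is handed out next
};
```

Because contiguity is only needed in the virtual range, freed pages anywhere in ESRAM can satisfy the next request, as in the usage below where pages 0 and 1 are reused next to page 3.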
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused Perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
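The streaming rule in the change description can be sketched as a small selection function: mip1 is requested inside the 200m Global SDF extent, mip2 within 50m, and the lowest mip is always resident. This is a simplification. It treats "inside the extent" as a distance from the viewer. Whether this numbering matches the shader's `ReversedMipIndex` convention exactly is also an assumption. `DesiredReversedMip` and `NeedsStreamIn` are hypothetical names.

```cpp
#include <cassert>

// Numbering follows the description: 0 is the lowest resolution (always loaded),
// 2 the highest. Distances are in meters.
int DesiredReversedMip(float DistanceToViewerMeters) {
    assert(DistanceToViewerMeters >= 0.0f);
    const float GlobalSdfExtentMeters = 200.0f; // "inside the Global SDF extent (200m)"
    const float HighResRadiusMeters = 50.0f;    // "at 50m requests mip2"
    if (DistanceToViewerMeters <= HighResRadiusMeters) return 2;
    if (DistanceToViewerMeters <= GlobalSdfExtentMeters) return 1;
    return 0; // always resident, so no streaming request is needed
}

// A request is only generated when the desired mip is finer than what is resident.
bool NeedsStreamIn(int DesiredMip, int ResidentMip) { return DesiredMip > ResidentMip; }
```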
// Expand the surface to find thin features, but only away from the start of the trace where it won't introduce incorrect self-occlusion
// This still causes incorrect self-occlusion at grazing angles
float ExpandSurfaceDistance = DFObjectData.VolumeSurfaceBias;
const float ExpandSurfaceFalloff = 2.0f * ExpandSurfaceDistance;
const float ExpandSurfaceAmount = ExpandSurfaceDistance * saturate(SampleRayTime / ExpandSurfaceFalloff);
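The three statements above ramp the surface expansion from zero at the start of the trace up to the full `VolumeSurfaceBias` once the ray has travelled twice that distance. Below is a C++ transcription for checking the shape of that ramp on the CPU. HLSL `saturate` is re-implemented, and the function name is hypothetical. The `assert` is an addition for the sketch, since the shader does no such check.

```cpp
#include <algorithm>
#include <cassert>

// HLSL saturate(): clamp to [0, 1].
inline float Saturate(float X) { return std::clamp(X, 0.0f, 1.0f); }

// Mirrors ExpandSurfaceDistance / ExpandSurfaceFalloff / ExpandSurfaceAmount.
float ExpandSurfaceAmount(float VolumeSurfaceBias, float SampleRayTime) {
    assert(VolumeSurfaceBias > 0.0f); // sketch-only guard against dividing by zero
    const float ExpandSurfaceDistance = VolumeSurfaceBias;
    const float ExpandSurfaceFalloff = 2.0f * ExpandSurfaceDistance;
    return ExpandSurfaceDistance * Saturate(SampleRayTime / ExpandSurfaceFalloff);
}
```

At the ray start the amount is zero, so a trace that begins on a surface is not immediately reported as a hit (the self-occlusion case the comment mentions).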
#if SDF_TRACING_TRAVERSE_MIPS
float MaxEncodedDistance = DFAssetMipData.DistanceFieldToVolumeScaleBias.x + DFAssetMipData.DistanceFieldToVolumeScaleBias.y;
2022-01-18 17:46:13 -05:00
if (abs(DistanceField) > MaxEncodedDistance && ReversedMipIndex > 0)
2021-03-23 22:40:05 -04:00
{
ReversedMipIndex--;
DFAssetMipData = LoadDFAssetData(DFObjectData.AssetIndex, ReversedMipIndex);
}
else if (abs(DistanceField) < .25f * MaxEncodedDistance && ReversedMipIndex < MaxMipIndex)
{
DistanceField -= 6.0f * DFObjectData.VolumeSurfaceBias;
ReversedMipIndex++;
DFAssetMipData = LoadDFAssetData(DFObjectData.AssetIndex, ReversedMipIndex);
}
else
#endif
// Terminate the trace if we reached a negative area or went past the end of the ray
if (DistanceField < ExpandSurfaceAmount && ReversedMipIndex == MaxMipIndex)
{
bHit = true;
// One more step back to the surface
SampleRayTime = clamp(SampleRayTime + DistanceField - ExpandSurfaceAmount, VolumeSpaceIntersectionTimes.x, VolumeSpaceIntersectionTimes.y);
break;
}
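The `#if SDF_TRACING_TRAVERSE_MIPS` chain above moves to a coarser mip when the sample leaves the narrow band. It moves to a finer mip when the sample is well inside the band. Only at the finest mip does it fall through to the hit test. Below is a C++ transcription of that decision. It assumes a G8 sample `v` in [0, 1] decodes as `v * ScaleBias.x + ScaleBias.y`, which makes `x + y` the largest encodable distance, consistent with `MaxEncodedDistance`. `ScaleBias`, `DecodeG8`, `MipStep` and `ChooseMipStep` are hypothetical names. The sketch leaves out the `6.0f * VolumeSurfaceBias` back-off the shader applies when switching to a finer mip. It also omits reloading the per-mip asset data.

```cpp
#include <cassert>
#include <cmath>

// Assumed decode: v in [0, 1] maps to v * X + Y, so the maximum is X + Y.
struct ScaleBias { float X; float Y; };

float DecodeG8(float V, ScaleBias SB) { return V * SB.X + SB.Y; }

enum class MipStep { Coarser, Finer, Stay };

// Mirrors the if / else if / else structure of the shader's mip traversal.
MipStep ChooseMipStep(float DistanceField, ScaleBias SB, int ReversedMipIndex, int MaxMipIndex) {
    assert(ReversedMipIndex >= 0 && ReversedMipIndex <= MaxMipIndex);
    const float MaxEncodedDistance = SB.X + SB.Y;
    if (std::fabs(DistanceField) > MaxEncodedDistance && ReversedMipIndex > 0) {
        return MipStep::Coarser; // outside the narrow band: need a lower-resolution mip
    }
    if (std::fabs(DistanceField) < 0.25f * MaxEncodedDistance && ReversedMipIndex < MaxMipIndex) {
        return MipStep::Finer;   // close to the surface: refine
    }
    return MipStep::Stay;        // shader falls through to the hit test
}
```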
float MinStepSize = 1.0f / (16 * MaxSteps);
2017-01-26 19:20:49 -05:00
float StepDistance = max(DistanceField, MinStepSize);
SampleRayTime += StepDistance;
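Clamping each step to at least `MinStepSize = 1 / (16 * MaxSteps)` guarantees forward progress when the sampled distance is tiny or negative. Below is a simplified C++ sphere trace showing that step rule together with the end-of-ray termination. It takes the signed distance as a function of ray time. It uses a constant hit threshold in place of `ExpandSurfaceAmount` and omits mip traversal. `TraceResult` and `SphereTrace` are hypothetical names, not the shader's interface.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

struct TraceResult { bool bHit; float HitTime; };

// Sdf(t) returns the signed distance at the sample point for ray time t.
TraceResult SphereTrace(const std::function<float(float)>& Sdf, float TMin, float TMax,
                        int MaxSteps, float HitThreshold) {
    assert(MaxSteps > 0 && TMin <= TMax);
    const float MinStepSize = 1.0f / (16 * MaxSteps);
    float SampleRayTime = TMin;
    for (int Step = 0; Step < MaxSteps; ++Step) {
        const float DistanceField = Sdf(SampleRayTime);
        if (DistanceField < HitThreshold) {
            // One more step back to the surface, clamped to the ray interval.
            return {true, std::clamp(SampleRayTime + DistanceField - HitThreshold, TMin, TMax)};
        }
        SampleRayTime += std::max(DistanceField, MinStepSize);
        if (SampleRayTime > TMax + HitThreshold) break; // went past the end of the ray
    }
    return {false, TMax};
}
```

For a ray starting 3 units in front of a unit sphere, the first sample returns distance 2. The second sample lands exactly on the surface, so the trace reports a hit just short of t = 2.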
2021-03-23 22:40:05 -04:00
if (SampleRayTime > VolumeSpaceIntersectionTimes.y + ExpandSurfaceAmount)
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid rendertargets to prevent an issue during PSO creation when NumRenderTargets is >0, but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ansi strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ansi and then converted to FStrings (slowly), before finally being converted back to ansi strings for use. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from cl 3241990 which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Propper fix for UE-40211, where texture streaming bound defrag and async tasks could interact in coherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When a RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers so is noticeably slower and if inserted in a pass that can't be restarted will fail but is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed-waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits wern't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return invalid reference depending on the timing,
Added th MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
{
break;
}
}
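The early-out above stops marching once the volume-space ray time passes the far intersection with the object's bounds (`VolumeSpaceIntersectionTimes.y`) plus `ExpandSurfaceAmount`. Below is a minimal C++ sketch of such a sphere-tracing loop; the shader itself is HLSL. It rests on several assumptions:

- The distance field is a unit sphere, sampled through the hypothetical `SampleDistanceField`.
- A sample counts as a hit when its distance falls below `ExpandSurfaceAmount`.
- A hypothetical `MinStepSize` keeps the march advancing.

None of these helpers or values are taken from the engine's actual sampling code.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 Add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 Scale(Vec3 v, float s) { return {v.x * s, v.y * s, v.z * s}; }
float Length(Vec3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Hypothetical distance field: unit sphere centred at the volume origin.
float SampleDistanceField(Vec3 p) { return Length(p) - 1.0f; }

struct MarchResult
{
    bool bHit;           // sample came within ExpandSurfaceAmount of the surface
    int StepIndex;       // equals MaxSteps only if the step budget ran out
    float SampleRayTime; // volume-space ray time where the march stopped
};

// Sphere traces Origin + Dir * t starting at EnterTime. Stops on a hit, when t
// passes ExitTime + ExpandSurfaceAmount (the shader's early-out), or after MaxSteps.
MarchResult MarchVolume(Vec3 Origin, Vec3 Dir, float EnterTime, float ExitTime,
                        float ExpandSurfaceAmount, int MaxSteps)
{
    const float MinStepSize = 1e-3f; // hypothetical floor so every step advances
    float SampleRayTime = EnterTime;
    bool bHit = false;
    int StepIndex = 0;
    for (; StepIndex < MaxSteps; ++StepIndex)
    {
        const float Distance = SampleDistanceField(Add(Origin, Scale(Dir, SampleRayTime)));
        if (Distance < ExpandSurfaceAmount)
        {
            bHit = true;
            break;
        }
        // The field value is a safe step length: no surface is closer than that.
        SampleRayTime += std::max(Distance, MinStepSize);
        if (SampleRayTime > ExitTime + ExpandSurfaceAmount)
        {
            break; // left the object's bounds without touching the surface
        }
    }
    return {bHit, StepIndex, SampleRayTime};
}
```

With a `break` on both the hit and the bounds exit, `StepIndex` reaches `MaxSteps` only when the loop exhausts its budget. That is the case the following `if (bHit || StepIndex == MaxSteps)` condition singles out.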
if (bHit || StepIndex == MaxSteps)
{
|
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for a 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
const float NewHitDistance = length(VolumeRayDirection * SampleRayTime * DFObjectData.VolumeToWorldScale);

if (NewHitDistance < MinRayTime)
{
	HitCulledObjectIndex = ObjectIndex;
	MinRayTime = NewHitDistance;
}
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
}

2021-03-24 18:12:59 -04:00
TotalStepsTaken += max(StepIndex, 1u);
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
		}
	}
}
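The closest-hit update in the ray-march code converts a hit found in an object's distance field volume space back to a world-space distance. It scales the volume-space ray displacement by `VolumeToWorldScale`, takes the length, and keeps the nearest object.

Below is a CPU-side C++ sketch of that selection. It assumes per-object hit results are already available and treats the per-axis scale as a 3-component vector. `ObjectHit`, `ClosestHit` and `FindClosestHit` are hypothetical names.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Float3
{
    float x, y, z;
};

// Hypothetical per-object result of marching a ray through one mesh SDF volume.
struct ObjectHit
{
    Float3 volumeRayDirection;  // ray direction in the object's volume space
    float sampleRayTime;        // ray time at which the march reached the surface
    Float3 volumeToWorldScale;  // per-axis scale from volume space to world space
};

struct ClosestHit
{
    int objectIndex = -1;                                // -1 means nothing was hit
    float distance = std::numeric_limits<float>::max();  // world-space hit distance
};

// Mirrors the shader logic: convert each volume-space hit to a world-space
// distance, then keep the nearest one and the index of the object that produced it.
ClosestHit FindClosestHit(const std::vector<ObjectHit>& hits)
{
    ClosestHit closest;
    for (int objectIndex = 0; objectIndex < static_cast<int>(hits.size()); ++objectIndex)
    {
        const ObjectHit& h = hits[objectIndex];
        const float dx = h.volumeRayDirection.x * h.sampleRayTime * h.volumeToWorldScale.x;
        const float dy = h.volumeRayDirection.y * h.sampleRayTime * h.volumeToWorldScale.y;
        const float dz = h.volumeRayDirection.z * h.sampleRayTime * h.volumeToWorldScale.z;
        const float newHitDistance = std::sqrt(dx * dx + dy * dy + dz * dz);

        if (newHitDistance < closest.distance)
        {
            closest.objectIndex = objectIndex;
            closest.distance = newHitDistance;
        }
    }
    return closest;
}
```

In the shader, `MinRayTime` is assigned `NewHitDistance`, so after an update it holds this world-space distance. That is why the sketch names the field `distance`.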

float2 NumGroups;

[numthreads(THREADGROUP_SIZEX, THREADGROUP_SIZEY, 1)]
void VisualizeMeshDistanceFieldCS(
	uint3 GroupId : SV_GroupID,
	uint3 DispatchThreadId : SV_DispatchThreadID,
	uint3 GroupThreadId : SV_GroupThreadID)
{
	uint ThreadIndex = GroupThreadId.y * THREADGROUP_SIZEX + GroupThreadId.x;

	float2 ScreenUV = float2((DispatchThreadId.xy * DOWNSAMPLE_FACTOR + View.ViewRectMin.xy + .5f) * View.BufferSizeAndInvSize.zw);
	float2 ScreenPosition = (ScreenUV.xy - View.ScreenPositionScaleBias.wz) / View.ScreenPositionScaleBias.xy;

	float SceneDepth = CalcSceneDepth(ScreenUV);
Merging Dev-LWCRendering into Main, this includes initial work to support rendering with LWC-scale position
The basic approach is to add the HLSL types FLWCScalar, FLWCMatrix, FLWCVector, etc. Inside shaders, absolute world-space position values should be represented as FLWCVector3. Matrices that transform *into* absolute world space become FLWCMatrix; matrices that transform *from* world space become FLWCInverseMatrix. Generally, LWC values work by extending the regular 'float' value with an additional tile coordinate. The final tile size will be a trade-off between scale and accuracy; I'm using 256k for now, but it may need to be adjusted. The value represented by an FLWCVector is thus V.Tile * TileSize + V.Offset.
Most operations can be performed directly on LWC values. There are HLSL functions such as LWCAdd, LWCSub, LWCMultiply and LWCDivide (operator overloading would be really nice here). The goal is to stay with LWC values for as long as needed, then convert to regular float values when possible.
One thing that comes up a lot is working in translated (rather than absolute) world space: WorldSpace + View.PrevPreViewTranslation = TranslatedWorldspace. However, 'View.PrevPreViewTranslation' is now an FLWCVector3, and WorldSpace quantities should be as well, so this becomes LWCAdd(WorldSpace, View.PrevPreViewTranslation) = TranslatedWorldspace. Assuming the position is "reasonably close" to the camera, it should be safe to convert the translated WS value to float, because the 'tile' coordinates of the two LWC values should cancel out when added together. I've done some work throughout the shader code to do this.
Materials fully support LWC values as well. Projective texturing and vertex animation materials that I've tested work correctly even when positioned "far away" from the origin.
Lots of work remains to fully convert all of our shader code. There's a function LWCHackToFloat(), which is a simple wrapper for LWCToFloat(). The idea of HackToFloat is to mark places that need further attention, where I'm simply converting absolute WS positions to float, to get shaders to compile. Shaders converted in this way should continue to work for all existing content (without LWC-scale values), but they will break if positions get too large.
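The tile + offset representation and the tile cancellation in translated world space can be illustrated with the minimal C++ sketch below. It rests on two assumptions:
* The tile size is 256 * 1024 units. The message only says "256k".
* Addition is done per component on tiles and offsets.

`LWCVector3`, `LWCAddSketch` and `LWCToFloatSketch` are illustrative stand-ins, not the HLSL functions defined in LargeWorldCoordinates.ush.

```cpp
struct Float3
{
    float x, y, z;
};

inline Float3 operator+(Float3 a, Float3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Float3 operator*(Float3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Assumed tile size: the commit says "256k", read here as 256 * 1024 units.
constexpr float kTileSize = 256.0f * 1024.0f;

// A large-world-coordinate vector whose value is Tile * kTileSize + Offset.
struct LWCVector3
{
    Float3 tile;    // whole tiles, stored as float
    Float3 offset;  // remainder within the tile, small enough to keep float precision
};

// Adds two LWC values by summing tiles and offsets separately,
// so no large single-float intermediate is formed.
inline LWCVector3 LWCAddSketch(const LWCVector3& a, const LWCVector3& b)
{
    return {a.tile + b.tile, a.offset + b.offset};
}

// Collapses an LWC value to a plain float3. This is only precise when the tile
// term is small, e.g. after the tiles cancel when a nearby position is translated
// by the view translation.
inline Float3 LWCToFloatSketch(const LWCVector3& v)
{
    return v.tile * kTileSize + v.offset;
}
```

In the test case, a world position ten tiles from the origin is translated by a view translation of minus ten tiles. The tile terms therefore cancel, and the float conversion only sees the small offset sum.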
General overview of changed files:
LargeWorldCoordinates.ush - This defines the FLWC types and operations
GPUScene.cpp, SceneData.ush - Primitives add an extra 'float3' tile coordinate. Instance data is unchanged, so instances need to stay within single-precision range of the primitive origin. Could potentially split instances behind the scenes (I think) if we don't want this limitation
HLSLMaterialDerivativeAutogen.cpp, HLSLMaterialTranslator.cpp, Preshader.cpp - Translated materials to use LWC values
SceneView.cpp, SceneRelativeViewMatrices.cpp, ShaderCompiler.cpp, InstancedStereo.ush - View uniform buffer includes LWC values where appropriate
#jira UE-117101
#rb arne.schober, Michael.Galetzka
#ROBOMERGE-AUTHOR: ben.ingram
#ROBOMERGE-SOURCE: CL 17787435 in //UE5/Main/...
#ROBOMERGE-BOT: STARSHIP (Main -> Release-Engine-Test) (v881-17767770)
[CL 17787478 by ben ingram in ue5-release-engine-test branch]
2021-10-12 13:31:00 -04:00
|
|
|
float3 TranslatedWorldPosition = mul(float4(ScreenPosition * SceneDepth, SceneDepth, 1), View.ScreenToTranslatedWorld).xyz;
|
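A minimal C++ sketch of the tile-plus-offset idea described in the LWC change, written as host-side code rather than HLSL so it compiles on its own. The struct `LWCVector3`, the functions `LWCAdd`/`LWCToFloat` and the constant `kTileSize` are hypothetical stand-ins, not the contents of LargeWorldCoordinates.ush; reading "256k" as 2^18 units is also an assumption. It shows why adding View.PrevPreViewTranslation to a far-away position lets the tiles cancel before the value is collapsed to a single float.

```cpp
#include <cassert>

// Hypothetical tile size: "256k" read as 2^18 world units (assumption).
constexpr float kTileSize = 262144.0f;

// Hypothetical stand-in for FLWCVector3: value = Tile * kTileSize + Offset.
struct LWCVector3 {
    float Tile[3];
    float Offset[3];
};

// Hypothetical LWCAdd: tiles and offsets are summed separately, so two
// opposing tile coordinates cancel exactly before any collapse to float.
LWCVector3 LWCAdd(const LWCVector3& A, const LWCVector3& B) {
    LWCVector3 R{};
    for (int i = 0; i < 3; ++i) {
        R.Tile[i] = A.Tile[i] + B.Tile[i];
        R.Offset[i] = A.Offset[i] + B.Offset[i];
    }
    return R;
}

// Hypothetical LWCToFloat: collapse to a plain float. Only precise when the
// value is "reasonably close" to the origin, i.e. the tile is small.
void LWCToFloat(const LWCVector3& V, float Out[3]) {
    for (int i = 0; i < 3; ++i) {
        Out[i] = V.Tile[i] * kTileSize + V.Offset[i];
    }
}
```

With a position 100 tiles from the origin and a pre-view translation of roughly the opposite amount, the tile components sum to zero and the offsets carry the small remainder exactly; computing 100 * kTileSize + Offset in plain float first would round away the fractional part.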
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid render targets to prevent an issue during PSO creation when NumRenderTargets is >0 but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
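One way to compute the valid count described above, sketched in C++: trim trailing slots whose format is unknown, so a descriptor that claims render targets but has none valid ends up with a count of 0. The enum `RTFormat` and the function `CountValidRenderTargets` are hypothetical; the actual D3D12RHI change may count slots differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical format enum; Unknown = 0 plays the role of the DXGI unknown format.
enum class RTFormat : uint32_t { Unknown = 0, RGBA8 = 1, RGBA16F = 2 };

// Trim trailing Unknown formats so the reported count never extends past the
// last slot that actually has a valid format. Returns 0 if every slot is Unknown.
uint32_t CountValidRenderTargets(const RTFormat* Formats, uint32_t NumRenderTargets) {
    uint32_t Count = NumRenderTargets;
    while (Count > 0 && Formats[Count - 1] == RTFormat::Unknown) {
        --Count;
    }
    return Count;
}
```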
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
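A minimal sketch of why the threshold matters: a buddy allocator hands out power-of-two blocks, so a request just over a power of two wastes almost half of its block, and the absolute waste grows with the request size. `BuddyBlockSize`, `BuddyWaste`, `UseBuddyAllocator` and the 256-byte minimum block are assumptions for illustration; the sketch ignores the 64KB buffer alignment mentioned for PC.

```cpp
#include <cassert>
#include <cstdint>

// Round a request up to the next power-of-two block, as a buddy allocator does.
// The 256-byte minimum block size is an assumption.
uint64_t BuddyBlockSize(uint64_t Size, uint64_t MinBlock = 256) {
    uint64_t Block = MinBlock;
    while (Block < Size) {
        Block <<= 1;
    }
    return Block;
}

// Bytes lost to internal fragmentation when the buddy allocator serves a request.
uint64_t BuddyWaste(uint64_t Size) {
    return BuddyBlockSize(Size) - Size;
}

// Hypothetical routing rule: only requests at or below the threshold use the
// buddy suballocator; larger ones would take some other allocation path.
bool UseBuddyAllocator(uint64_t Size, uint64_t MaxBuddySize) {
    return Size <= MaxBuddySize;
}
```

For example, a 257KB request lands in a 512KB block and wastes 255KB, while requests capped at 4KB waste less than 2KB each.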
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ANSI strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ANSI and then converted to FStrings (slowly), before finally being converted back to ANSI strings for use. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from CL 3241990, which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where the texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When an RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers, so it is noticeably slower, and if inserted in a pass that can't be restarted it will fail, but it is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
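The timed-wait behaviour described above can be modelled with a completion flag that a GPU-completion handler sets and the CPU waits on with a timeout. This sketch swaps in `std::condition_variable` for the dispatch semaphore used by the Metal RHI; `CompletionFence` and `QueryTimeout` are hypothetical names, and only the 2s / 0.5s intervals come from the change description.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical completion fence: a completion handler calls Signal(), and the
// CPU waits with a timeout instead of blocking until the work finishes.
class CompletionFence {
public:
    void Signal() {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bCompleted = true;
        }
        Cond.notify_all();
    }

    // Returns false if the work did not complete within Timeout.
    bool WaitFor(std::chrono::milliseconds Timeout) {
        std::unique_lock<std::mutex> Lock(Mutex);
        return Cond.wait_for(Lock, Timeout, [this] { return bCompleted; });
    }

private:
    std::mutex Mutex;
    std::condition_variable Cond;
    bool bCompleted = false;
};

// Intervals quoted in the change: 2s for absolute-time queries, 0.5s otherwise.
std::chrono::milliseconds QueryTimeout(bool bAbsoluteTime) {
    return bAbsoluteTime ? std::chrono::milliseconds(2000)
                         : std::chrono::milliseconds(500);
}
```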
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method, because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code; there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
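The fence bug above follows from how a D3D12-style fence wait is satisfied: work counts as done once the fence's completed value is greater than or equal to the awaited value. A hypothetical model (`FenceModel` is not an RHI type) shows that signalling the maximum value makes every wait pass, while handing out strictly increasing values keeps later work pending, which also removes any need to reset the fence to a lower value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a D3D12-style fence: a wait on Value is satisfied
// once CompletedValue >= Value.
struct FenceModel {
    uint64_t CompletedValue = 0;
    uint64_t NextValue = 1;  // value the next submission will signal

    bool IsComplete(uint64_t Value) const { return CompletedValue >= Value; }

    // Monotonic scheme: every submission gets a fresh, strictly larger value.
    uint64_t Submit() { return NextValue++; }

    // Stand-in for the GPU writing the fence value when work finishes.
    void GpuSignal(uint64_t Value) { CompletedValue = Value; }
};
```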
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on the timing.
Added the MaxEffectiveScreenSize setting to the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompressed data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
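A simplified model of a non-contiguous page allocator as described in the ESRAM change: each resource takes any free physical pages and records which physical page backs each page of its contiguous virtual range, and destroying the resource returns the pages to the pool. `PageAllocator` is hypothetical; it only tracks page indices and does not model the actual ESRAM mapping calls or the XgMemoryLayout API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-contiguous page allocator: a resource receives any N free
// physical pages; the returned table maps virtual page i -> physical page.
class PageAllocator {
public:
    explicit PageAllocator(uint32_t NumPages) {
        // Push in reverse so page 0 is handed out first.
        for (uint32_t P = NumPages; P > 0; --P) {
            FreePages.push_back(P - 1);
        }
    }

    // Returns the virtual-to-physical page table, or an empty table on failure.
    std::vector<uint32_t> Allocate(uint32_t NumPages) {
        if (NumPages == 0 || NumPages > FreePages.size()) {
            return {};
        }
        std::vector<uint32_t> PageTable;
        for (uint32_t i = 0; i < NumPages; ++i) {
            PageTable.push_back(FreePages.back());
            FreePages.pop_back();
        }
        return PageTable;
    }

    // Pages return to the pool for reuse when the resource is destroyed.
    void Free(const std::vector<uint32_t>& PageTable) {
        FreePages.insert(FreePages.end(), PageTable.begin(), PageTable.end());
    }

    std::size_t NumFreePages() const { return FreePages.size(); }

private:
    std::vector<uint32_t> FreePages;
};
```

In the scenario below, after a one-page resource is freed, a later two-page request is served by physical pages 0 and 3, which are not adjacent, yet the resource still addresses them as virtual pages 0 and 1.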
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
|
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate the closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
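A sketch of three pieces from the list above: distance-based mip requests, a 1D pool of equally sized brick slots, and an indirection table lookup where bricks away from the surface stay unallocated. Names such as `RequestedMip`, `BrickPool`, `LookupBrick` and `kInvalidBrick` are hypothetical, as are the assumptions that mip 0 is the always-resident coarse mip and that distances are in meters; the GPU atlas resize and the streaming read-back are not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Distance thresholds from the change description (200m -> mip1, 50m -> mip2).
// Assumption: mip 0 is the always-loaded lowest resolution; higher is finer.
int RequestedMip(float DistanceMeters) {
    if (DistanceMeters <= 50.0f) return 2;
    if (DistanceMeters <= 200.0f) return 1;
    return 0;
}

// Sentinel for indirection entries whose brick is not stored.
constexpr uint32_t kInvalidBrick = 0xFFFFFFFFu;

// Hypothetical 1D pooled brick allocator: all bricks are the same size, so
// any free slot fits any request and freeing never fragments the pool.
class BrickPool {
public:
    uint32_t Allocate() {
        if (!FreeSlots.empty()) {
            uint32_t Slot = FreeSlots.back();
            FreeSlots.pop_back();
            return Slot;
        }
        return NumSlots++;  // grow; a real atlas would be resized on the GPU
    }
    void Free(uint32_t Slot) { FreeSlots.push_back(Slot); }
    uint32_t Size() const { return NumSlots; }

private:
    std::vector<uint32_t> FreeSlots;
    uint32_t NumSlots = 0;
};

// Indirection table: one entry per brick position in a mip's brick grid,
// laid out X-fastest. Bricks far from the surface keep kInvalidBrick.
uint32_t LookupBrick(const std::vector<uint32_t>& Indirection,
                     uint32_t X, uint32_t Y, uint32_t Z,
                     uint32_t DimX, uint32_t DimY) {
    return Indirection[(Z * DimY + Y) * DimX + X];
}
```

Because every slot holds exactly one 8^3 brick, any freed slot can satisfy any later request, which is why a 1D pool avoids the packing waste a 3D box allocator suffers.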
Original costs on 1080 GTX (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
|
|
|
float TraceDistance = 100000.0f;
|
2022-03-04 06:03:15 -05:00
|
|
|
float3 TranslatedWorldRayStart = PrimaryView.TranslatedWorldCameraOrigin;
|
Merging Dev-LWCRendering into Main. This includes initial work to support rendering with LWC-scale positions.
Basic approach is to add HLSL types FLWCScalar, FLWCMatrix, FLWCVector, etc. Inside shaders, absolute world space position values should be represented as FLWCVector3. Matrices that transform *into* absolute world space become FLWCMatrix. Matrices that transform *from* world space become FLWCInverseMatrix. Generally, LWC values work by extending the regular 'float' value with an additional tile coordinate. The final tile size will be a trade-off between scale and accuracy; I'm using 256k for now, but it may need to be adjusted. The value represented by an FLWCVector is thus V.Tile * TileSize + V.Offset. Most operations can be performed directly on LWC values. There are HLSL functions like LWCAdd, LWCSub, LWCMultiply, LWCDivide (operator overloading would be really nice here). The goal is to stay with LWC values for as long as needed, then convert to regular float values when possible. One thing that comes up a lot is working in translated (rather than absolute) world space: WorldSpace + View.PrevPreViewTranslation = TranslatedWorldspace. Except 'View.PrevPreViewTranslation' is now an FLWCVector3, and WorldSpace quantities should be as well, so that becomes LWCAdd(WorldSpace, View.PrevPreViewTranslation) = TranslatedWorldspace. Assuming that we're talking about a position that's "reasonably close" to the camera, it should be safe to convert the translated WS value to float, because the 'tile' coordinates of the 2 LWC values should cancel out when added together. I've done some work throughout the shader code to do this. Materials fully support LWC values as well. Projective texturing and vertex animation materials that I've tested work correctly even when positioned "far away" from the origin.
Lots of work remains to fully convert all of our shader code. There's a function LWCHackToFloat(), which is a simple wrapper for LWCToFloat(). The idea of HackToFloat is to mark places that need further attention, where I'm simply converting absolute WS positions to float, to get shaders to compile. Shaders converted in this way should continue to work for all existing content (without LWC-scale values), but they will break if positions get too large.
General overview of changed files:
LargeWorldCoordinates.ush - This defines the FLWC types and operations
GPUScene.cpp, SceneData.ush - Primitives add an extra 'float3' tile coordinate. Instance data is unchanged, so instances need to stay within single-precision range of the primitive origin. Could potentially split instances behind the scenes (I think) if we don't want this limitation
HLSLMaterialDerivativeAutogen.cpp, HLSLMaterialTranslator.cpp, Preshader.cpp - Translated materials to use LWC values
SceneView.cpp, SceneRelativeViewMatrices.cpp, ShaderCompiler.cpp, InstancedStereo.ush - View uniform buffer includes LWC values where appropriate
#jira UE-117101
#rb arne.schober, Michael.Galetzka
#ROBOMERGE-AUTHOR: ben.ingram
#ROBOMERGE-SOURCE: CL 17787435 in //UE5/Main/...
#ROBOMERGE-BOT: STARSHIP (Main -> Release-Engine-Test) (v881-17767770)
[CL 17787478 by ben ingram in ue5-release-engine-test branch]
2021-10-12 13:31:00 -04:00
|
|
|
float3 WorldRayStart = LWCHackToFloat(PrimaryView.WorldCameraOrigin);
|
|
|
|
|
float3 WorldRayDirection = normalize(TranslatedWorldPosition - View.TranslatedWorldCameraOrigin);
|
2022-02-24 20:39:55 -05:00
|
|
|
float3 WorldHitNormal = float3(0, 0, 1);
|
|
|
|
|
bool bHit = false;
|
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
#lockdown Nick.Penwarden
#rb none
==========================
MAJOR FEATURES + CHANGES
==========================
Change 3250856 on 2017/01/09 by Daniel.Wright
Only showing instruction count for 'Base pass shader' now
Change 3250943 on 2017/01/09 by Rolando.Caloca
DR - Async Compute PSO creation
Change 3251036 on 2017/01/09 by Rolando.Caloca
DR - Add r.AsyncPipelineCompile
- Dispatch on any thread
- Wait for completion event
Change 3251058 on 2017/01/09 by Ben.Woodhouse
Fix for PSO creation D3D error with NumRenderTargets. Add code to compute the correct number of valid render targets to prevent an issue during PSO creation when NumRenderTargets is >0 but none of the formats are valid (all formats are DXGI_UNKNOWN)
#jira UE-40332
Change 3251141 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite CL 3243458:
D3D12 memory optimization - The d3d12 buddy suballocator is very wasteful for allocations above 4KB, but the vast majority of allocations are smaller. In the default buffer allocator this was causing 149MB of waste in 340MB of allocations. Moving the max allocation size threshold down to 4KB from 512KB saved 100MB of wasted memory.
On PC, buffers are 64KB aligned, so we need the threshold to be higher to avoid additional wastage.
Add PIX memory tracking instrumentation for buddy allocators so we can track the memory properly in PIX
Change 3251142 on 2017/01/09 by Ben.Woodhouse
Duplicated from Fortnite 3243496
memory optimisation: use NULL-terminated ANSI strings instead of unicode FStrings for symbols, saving 118MB. Previously the strings were loaded from disk as ANSI and then converted to FStrings (slowly), before finally being converted back to ANSI strings for use. In addition to reducing memory overhead, this change reduces complexity and improves startup time.
Change 3252323 on 2017/01/10 by Rolando.Caloca
DR - Gfx async PSO creation prep
Change 3252474 on 2017/01/10 by Daniel.Wright
Added 'Compile Unreal Lightmass' to error message
Change 3252589 on 2017/01/10 by Daniel.Wright
Back out bulk data for distance fields from CL 3241990, which causes distance fields to be corrupt in Fortnite
Change 3252790 on 2017/01/10 by Daniel.Wright
Added InscatteringColorCubemapAngle to exponential height fog
Change 3252843 on 2017/01/10 by Uriel.Doyon
Proper fix for UE-40211, where the texture streaming bound defrag and async tasks could interact in incoherent ways.
The bound defrag is now done outside of the async work logic.
Change 3252866 on 2017/01/10 by Mark.Satterthwaite
Fix Metal shader pipeline hash collisions caused by deferring MTLFunction construction until PrepareToDraw so that we may use Function-Constants to specialise the shader source without generating additional permutations. This is required to generate proper tessellation shaders which are specialised against the index-buffer usage & type (none, uint16, uint32). While we're here amend the hash functions to make better use of the existing hash functions to improve the distribution and hopefully reduce the possibility of collisions in future.
#jira UE-40357
Change 3254511 on 2017/01/11 by Rolando.Caloca
DR - PSO stats
Change 3255958 on 2017/01/12 by Mark.Satterthwaite
Reimplement RQT_AbsoluteTime for Metal - pretty sure I did this before, but somehow it got lost. When an RQT_AbsoluteTime is inserted into the command-stream, insert a command-buffer completion handler to record the time of completion & submit the command-buffer immediately. This breaks command-buffers, so it is noticeably slower, and if inserted in a pass that can't be restarted it will fail, but it is currently the only option available. This is sufficient to support the GPUBenchmark used by Scalability. To make this more efficient I've refactored the FMetalCommandBufferFence implementation so that we use a single shared-ptr object containing the command-buffer and a dispatch semaphore, rather than allocating one for each query. The semaphore allows for timed waits where previously we'd block until completion, unlike the other APIs that report failure after a fixed interval (2s for RQT_AbsoluteTime, otherwise 0.5s). Sadly not all drivers support this abuse of the Metal API, so replace the GL-based workaround for not having time queries with one that just guesses based on RHI device details. Radars will be filed.
#jira UE-40554
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method, because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was that HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code; there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return invalid reference depending on the timing,
Added th MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
|
|
|
|
|
|
|
|
#if USE_GLOBAL_DISTANCE_FIELD
2020-09-08 17:44:06 -04:00
FGlobalSDFTraceInput TraceInput = SetupGlobalSDFTraceInput(WorldRayStart, WorldRayDirection, 0.0f, TraceDistance, 1.0f, 1.0f);
FGlobalSDFTraceResult SDFTraceResult = RayTraceGlobalDistanceField(TraceInput);
2022-02-24 20:39:55 -05:00
if (GlobalSDFTraceResultIsHit(SDFTraceResult))
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
{
2022-02-24 20:39:55 -05:00
float3 WorldRayEnd = WorldRayStart + WorldRayDirection * SDFTraceResult.HitTime;
2022-04-22 19:55:41 -04:00
WorldHitNormal = ComputeGlobalDistanceFieldNormal(WorldRayEnd, SDFTraceResult.HitClipmapIndex, -WorldRayDirection);
2022-02-24 20:39:55 -05:00
bHit = true;
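The shader fragment above follows a three-step hit path:

- It traces a ray through the global distance field.
- On a hit, it rebuilds the hit point as `WorldRayStart + WorldRayDirection * HitTime`.
- It then asks for a surface normal at that point.

The C++ sketch below mirrors this flow on the CPU. It is an illustration under stated assumptions, not the engine's implementation:

- It uses plain sphere tracing against one analytic signed distance function instead of the engine's clipmapped global distance field.
- It estimates the normal from a central-difference gradient.
- It treats the last argument of `ComputeGlobalDistanceFieldNormal` (`-WorldRayDirection`) as a fallback normal for a degenerate gradient. That reading is a guess.

`Float3`, `Sdf`, `TraceResult`, `SphereTrace`, `SdfNormal` and `TraceAndShade` are all hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Minimal stand-in for HLSL float3 (hypothetical helper type).
struct Float3 {
    float x, y, z;
};

static Float3 Add(Float3 a, Float3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Float3 Scale(Float3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float Length(Float3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Hypothetical signed distance function: distance from a point to the nearest surface.
using Sdf = std::function<float(Float3)>;

// Hypothetical result, loosely modeled on FGlobalSDFTraceResult.
struct TraceResult {
    bool bHit;
    float HitTime;  // distance travelled along the unit-length ray direction
};

// Sphere tracing: advance by the SDF value, which is a safe step because no
// surface can be closer than that distance. Stop when within epsilon of a surface.
TraceResult SphereTrace(const Sdf& sdf, Float3 start, Float3 dir, float minT, float maxT,
                        int maxSteps = 128, float epsilon = 1e-4f) {
    assert(maxSteps > 0 && epsilon > 0.0f);
    float t = minT;
    for (int i = 0; i < maxSteps && t <= maxT; ++i) {
        float d = sdf(Add(start, Scale(dir, t)));
        if (d < epsilon) return {true, t};
        t += d;
    }
    return {false, maxT};
}

// Normal from the normalized central-difference gradient of the SDF.
// Assumption: fallbackNormal is used only when the gradient is (nearly) zero.
Float3 SdfNormal(const Sdf& sdf, Float3 p, Float3 fallbackNormal, float h = 1e-3f) {
    Float3 g = {
        sdf({p.x + h, p.y, p.z}) - sdf({p.x - h, p.y, p.z}),
        sdf({p.x, p.y + h, p.z}) - sdf({p.x, p.y - h, p.z}),
        sdf({p.x, p.y, p.z + h}) - sdf({p.x, p.y, p.z - h}),
    };
    float len = Length(g);
    return len > 1e-6f ? Scale(g, 1.0f / len) : fallbackNormal;
}

// Mirrors the shader's hit path: trace, rebuild the hit point, then compute the normal.
bool TraceAndShade(const Sdf& sdf, Float3 rayStart, Float3 rayDir, float traceDistance,
                   Float3& outHitPos, Float3& outHitNormal) {
    TraceResult r = SphereTrace(sdf, rayStart, rayDir, 0.0f, traceDistance);
    if (!r.bHit) return false;
    outHitPos = Add(rayStart, Scale(rayDir, r.HitTime));             // like WorldRayEnd
    outHitNormal = SdfNormal(sdf, outHitPos, Scale(rayDir, -1.0f));  // like -WorldRayDirection
    return true;
}
```

For a unit sphere at the origin, a ray from `(0, 0, -5)` along `+z` should hit near `z = -1`, with a normal close to `(0, 0, -1)`. The same ray aimed along `+y` should miss within a trace distance of 10.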
#jira UE-40554
Change 3256329 on 2017/01/12 by Olaf.Piesche
#jira UE-38615
Assert shouldn't be necessary; in fact, it causes a crash when exporting emitters, since in that case we're changing the template at runtime.
Change 3256371 on 2017/01/12 by Uriel.Doyon
Reenabled texture streaming bound defrag as the fix is in CL 3252843
Change 3257032 on 2017/01/13 by Daniel.Wright
Added fastClamp to fastmath.usf
Change 3257111 on 2017/01/13 by Daniel.Wright
Disabled bAffectDistanceFieldLighting on DefaultPawn, fixes VisualizeMeshDistanceFields in game
Change 3257112 on 2017/01/13 by Daniel.Wright
DFAO optimizations
* Changed the culling algorithm to produce a list of intersecting screen tiles for each object, instead of the other way around. Each tile / object intersection gets its own cone tracing thread group so wavefronts are much smaller and scheduled better. 3.63ms -> 3.48ms (.15ms)
* Replace slow instructions in inner loop with fast approximations (exp2 -> sqr + 1, rcpFast, lengthFast) 3.25ms -> 3.09ms (.16ms)
* Moved transform from world to local space out of the inner loop (sample position constructed from local space position + direction) 3.09ms -> 3.04ms
* Compute shader for ClearUAV 3.04ms -> 2.62ms (.42ms)
Change 3257113 on 2017/01/13 by Daniel.Wright
Better distance field memory stats
Change 3257326 on 2017/01/13 by Uriel.Doyon
Workaround to support cases where several textures have the same lighting GUID.
Change 3257448 on 2017/01/13 by Daniel.Wright
Removed legacy features Distance Field Specular Occlusion, Distance Field Surface Cache AO, PreCullTriangles
Change 3257616 on 2017/01/13 by Daniel.Wright
Distance field mesh visualization now uses a cone containing the entire tile to cull objects with, making the results stable
Change 3257657 on 2017/01/13 by Daniel.Wright
Mesh distance fields are stored zlib compressed in memory until needed for uploading to GPU
* 81Mb of backing memory -> 32Mb in GPUPerfTest, atlas upload time 29ms -> 893ms
Change 3258063 on 2017/01/14 by Rolando.Caloca
DR - vk - Refactor descriptor set reuse in prep for more changes
Change 3258715 on 2017/01/16 by Daniel.Wright
Added VisualizeGlobalDistanceField show flag
Change 3258827 on 2017/01/16 by Daniel.Wright
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits wern't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return invalid reference depending on the timing,
Added th MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
}
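The hit branch above obtains WorldHitNormal from the global distance field through ComputeGlobalDistanceFieldNormal. A common general technique for this is to normalize the SDF gradient, estimated with central differences. The following is a minimal C++ sketch of that technique, not the engine's implementation. SphereSDF, ComputeSDFNormal and the step size h are hypothetical. Treating the last argument as a fallback for a degenerate gradient is an assumption about why the shader passes -WorldRayDirection.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>

using Vec3 = std::array<float, 3>;

// Hypothetical test SDF: signed distance from p to a sphere of the given radius at the origin.
float SphereSDF(const Vec3& p, float radius) {
    return std::sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]) - radius;
}

// Approximates the surface normal at p as the normalized SDF gradient,
// estimated with central differences of step h along each axis.
// fallbackNormal is returned when the gradient is too small to normalize
// (assumed role of the shader's -WorldRayDirection argument).
Vec3 ComputeSDFNormal(const std::function<float(const Vec3&)>& sdf,
                      const Vec3& p, float h, const Vec3& fallbackNormal) {
    assert(h > 0.0f);
    Vec3 grad{};
    for (int axis = 0; axis < 3; ++axis) {
        Vec3 plus = p;
        Vec3 minus = p;
        plus[axis] += h;
        minus[axis] -= h;
        grad[axis] = (sdf(plus) - sdf(minus)) / (2.0f * h);
    }
    const float len = std::sqrt(grad[0] * grad[0] + grad[1] * grad[1] + grad[2] * grad[2]);
    if (len < 1e-6f) {
        return fallbackNormal;
    }
    return {grad[0] / len, grad[1] / len, grad[2] / len};
}
```

At the center of a sphere the gradient vanishes, so the sketch returns the fallback instead of dividing by zero.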
#else
if (ThreadIndex == 0)
{
NumIntersectingObjects = 0;
}
GroupMemoryBarrierWithGroupSync();
2022-03-04 06:03:15 -05:00
float3 TileConeTranslatedVertex;
2017-01-26 19:20:49 -05:00
float3 TileConeAxis;
float TileConeAngleCos;
float TileConeAngleSin;
{
float2 ViewSize = float2(1 / View.ViewToClip[0][0], 1 / View.ViewToClip[1][1]);
float3 TileCorner00 = normalize(float3((GroupId.x + 0) / NumGroups.x * ViewSize.x * 2 - ViewSize.x, ViewSize.y - (GroupId.y + 0) / NumGroups.y * ViewSize.y * 2, 1));
float3 TileCorner10 = normalize(float3((GroupId.x + 1) / NumGroups.x * ViewSize.x * 2 - ViewSize.x, ViewSize.y - (GroupId.y + 0) / NumGroups.y * ViewSize.y * 2, 1));
float3 TileCorner01 = normalize(float3((GroupId.x + 0) / NumGroups.x * ViewSize.x * 2 - ViewSize.x, ViewSize.y - (GroupId.y + 1) / NumGroups.y * ViewSize.y * 2, 1));
float3 TileCorner11 = normalize(float3((GroupId.x + 1) / NumGroups.x * ViewSize.x * 2 - ViewSize.x, ViewSize.y - (GroupId.y + 1) / NumGroups.y * ViewSize.y * 2, 1));
float3 ViewSpaceTileConeAxis = normalize(TileCorner00 + TileCorner10 + TileCorner01 + TileCorner11);
TileConeAxis = mul(ViewSpaceTileConeAxis, (float3x3)View.ViewToTranslatedWorld);
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying the distance to the nearest surface from points far away from it
* The Distance Field Brick Atlas is now stored per FScene and dynamically resized as needed with a GPU memcopy
* The brick atlas uses a 1d pooled allocator, which has no fragmentation and greatly reduces packing waste compared to the 3d allocator
* Added a new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out, even in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are asynchronously read back to the CPU and turned into IO requests; these are polled and, once complete, uploaded to the atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate the closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
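The brick layout described above (8^3 bricks holding 7^3 unique samples, an indirection table from brick coordinate to atlas brick, and only near-surface bricks stored) can be modeled on the CPU roughly as follows. This is a sketch under stated assumptions: SparseSDFMip, kInvalidBrick and the linear [-narrowBand, +narrowBand] decoding of the 8-bit samples are hypothetical, and the half-voxel padding and trilinear filtering are not modeled. A missing brick yields no value, which is where a tracer would fall back to a coarser mip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr int kBrickSize = 8;        // samples per brick axis
constexpr int kUniqueBrickSize = 7;  // unique samples per axis; the last one overlaps the next brick
constexpr std::uint32_t kInvalidBrick = 0xFFFFFFFFu;  // hypothetical "no brick stored" sentinel

// Hypothetical CPU-side model of one mip of a sparse, narrow-band mesh SDF.
struct SparseSDFMip {
    int bricksX, bricksY, bricksZ;           // indirection table dimensions, in bricks
    std::vector<std::uint32_t> indirection;  // brick coordinate -> atlas brick index, or kInvalidBrick
    std::vector<std::uint8_t> atlas;         // kBrickSize^3 8-bit samples per allocated brick
    float narrowBand;                        // assumed largest |distance| the 8-bit encoding covers

    // Decodes the distance at integer sample coordinate (x, y, z), or returns
    // std::nullopt when no brick is stored there (far from the surface).
    std::optional<float> Sample(int x, int y, int z) const {
        assert(x >= 0 && y >= 0 && z >= 0);
        const int bx = x / kUniqueBrickSize;
        const int by = y / kUniqueBrickSize;
        const int bz = z / kUniqueBrickSize;
        assert(bx < bricksX && by < bricksY && bz < bricksZ);

        const std::uint32_t brick =
            indirection[(static_cast<std::size_t>(bz) * bricksY + by) * bricksX + bx];
        if (brick == kInvalidBrick) {
            return std::nullopt;
        }

        // Position inside the brick, in [0, kUniqueBrickSize).
        const int lx = x - bx * kUniqueBrickSize;
        const int ly = y - by * kUniqueBrickSize;
        const int lz = z - bz * kUniqueBrickSize;
        const std::size_t offset =
            static_cast<std::size_t>(brick) * kBrickSize * kBrickSize * kBrickSize +
            (static_cast<std::size_t>(lz) * kBrickSize + ly) * kBrickSize + lx;

        // Map the stored byte [0, 255] linearly onto [-narrowBand, +narrowBand].
        const float normalized = atlas[offset] / 255.0f;
        return (normalized * 2.0f - 1.0f) * narrowBand;
    }
};
```

Because empty bricks cost only one indirection entry, memory scales with the surface area of the mesh rather than its volume, which is the point of the narrow-band layout.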
Original costs on GTX 1080 (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
TileConeAngleCos = min(min(dot(ViewSpaceTileConeAxis, TileCorner00), dot(ViewSpaceTileConeAxis, TileCorner10)), min(dot(ViewSpaceTileConeAxis, TileCorner01), dot(ViewSpaceTileConeAxis, TileCorner11)));
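The tile cone built above can be checked on the CPU with a C++ transliteration of the same steps: project the four tile corners onto the z = 1 view plane, take the normalized sum of the corner directions as the axis, and use the smallest corner cosine as the half-angle. This sketch stays in view space (it omits the ViewToTranslatedWorld rotation). The sine computation is an assumption, since this excerpt declares TileConeAngleSin without assigning it. ComputeTileCone and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 {
    float x, y, z;
};

float Dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Float3 Normalize(Float3 v) {
    const float len = std::sqrt(Dot(v, v));
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

struct TileCone {
    Float3 axis;     // view-space cone axis
    float cosAngle;  // cosine of the half-angle
    float sinAngle;  // sine of the half-angle
};

// Bounds screen tile (groupX, groupY) of a numGroupsX x numGroupsY grid with a cone.
// viewSizeX/Y play the role of 1 / ViewToClip[0][0] and 1 / ViewToClip[1][1]:
// the half-extents of the frustum on the view-space plane z = 1.
TileCone ComputeTileCone(float groupX, float groupY, float numGroupsX, float numGroupsY,
                         float viewSizeX, float viewSizeY) {
    auto corner = [&](float offsetX, float offsetY) {
        return Normalize({(groupX + offsetX) / numGroupsX * viewSizeX * 2 - viewSizeX,
                          viewSizeY - (groupY + offsetY) / numGroupsY * viewSizeY * 2,
                          1.0f});
    };
    const Float3 c00 = corner(0, 0), c10 = corner(1, 0), c01 = corner(0, 1), c11 = corner(1, 1);

    // Axis: normalized sum of the four corner directions.
    const Float3 axis = Normalize({c00.x + c10.x + c01.x + c11.x,
                                   c00.y + c10.y + c01.y + c11.y,
                                   c00.z + c10.z + c01.z + c11.z});

    // Half-angle: the widest corner, i.e. the smallest cosine to the axis.
    const float cosAngle = std::min(std::min(Dot(axis, c00), Dot(axis, c10)),
                                    std::min(Dot(axis, c01), Dot(axis, c11)));
    // Assumption: the sine is derived from the cosine.
    const float sinAngle = std::sqrt(std::max(0.0f, 1.0f - cosAngle * cosAngle));
    return {axis, cosAngle, sinAngle};
}
```

For a single tile covering a 90 degree frustum (viewSize = 1), the axis is (0, 0, 1) and the half-angle cosine is 1/sqrt(3), the angle to a frustum corner. Taking the minimum over all four corners makes the cone conservative, so any object that intersects the tile also intersects the cone.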
TileConeAngleSin = sqrt(1 - TileConeAngleCos * TileConeAngleCos);
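// Illustrative note, assuming ViewSpaceTileConeAxis and the four TileCorner vectors are unit length:
// the tile's bounding cone is sized by the least-aligned corner, so
//   cos(theta) = min over corners of dot(Axis, Corner_i),   sin(theta) = sqrt(1 - cos(theta)^2).
// Worked example: if the widest corner sits 60 degrees off the axis, cos(theta) = 0.5 and
// sin(theta) = sqrt(1 - 0.25) = sqrt(0.75), roughly 0.866.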
2022-03-04 06:03:15 -05:00
TileConeTranslatedVertex = PrimaryView.TranslatedWorldCameraOrigin;
}
uint NumCulledObjects = GetCulledNumObjects();
LOOP
2022-01-31 10:23:36 -05:00
for (uint IndexInCulledList = ThreadIndex; IndexInCulledList < NumCulledObjects; IndexInCulledList += THREADGROUP_TOTALSIZE)
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single-threaded mode in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
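Polling IsRunning() can spin forever under -nothreading, and a toy task object shows why an Update() method fixes this. `MonitoredTask` is a hypothetical class, not FMonitoredProcess. The sketch assumes that in threaded mode a background thread advances the process state, while in single-threaded mode nothing advances it unless the caller does.

```cpp
#include <cassert>

// Hypothetical stand-in for a monitored child process that needs a few
// "ticks" of work (reading output, checking the exit code) before it finishes.
class MonitoredTask {
public:
    explicit MonitoredTask(int StepsUntilExit) : RemainingSteps(StepsUntilExit) {
        assert(StepsUntilExit >= 0);
    }

    // Only reports state. With no background thread advancing the task,
    // `while (Task.IsRunning()) {}` never terminates.
    bool IsRunning() const { return RemainingSteps > 0; }

    // The caller pumps the work itself and gets the running state back, so the
    // same loop works with or without a background thread.
    bool Update() {
        if (RemainingSteps > 0) {
            --RemainingSteps;  // placeholder for one unit of monitoring work
        }
        return IsRunning();
    }

private:
    int RemainingSteps;
};
```

Usage: `while (Task.Update()) { /* other work */ }` terminates in both threading modes.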
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits weren't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code; there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
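The fence-reuse bullet above comes down to how D3D12 fences compare values: a wait for value V is satisfied once the fence's completed value is at least V. The C++ sketch below models that rule in plain code. `FenceModel` and `MonotonicFence` are hypothetical names, not D3D12 or engine types. The sketch shows why signaling the all-ones value and then waiting for 0 never blocks, and how ever-increasing values remove the need to reset a fence.

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of a D3D12-style fence: a wait for value V is satisfied
// once CompletedValue >= V.
struct FenceModel {
    uint64_t CompletedValue = 0;
    void GpuSignal(uint64_t Value) { CompletedValue = Value; }
    bool IsComplete(uint64_t Value) const { return CompletedValue >= Value; }
};

// The buggy pattern: signal the all-ones value ("-1"), then wait for 0.
// Every uint64_t is >= 0, so the wait is satisfied before any new work runs.
bool BuggyWaitIsSatisfiedImmediately() {
    FenceModel Fence;
    Fence.GpuSignal(UINT64_MAX);
    const uint64_t WaitValue = 0;
    return Fence.IsComplete(WaitValue);
}

// The fixed pattern: never reset; each submission is assigned the next value.
class MonotonicFence {
public:
    uint64_t Submit() { return ++LastIssuedValue; }  // value the GPU will signal

    void GpuComplete(uint64_t Value) {
        assert(Value > Fence.CompletedValue);  // values only ever increase
        Fence.GpuSignal(Value);
    }

    bool IsComplete(uint64_t Value) const { return Fence.IsComplete(Value); }

private:
    FenceModel Fence;
    uint64_t LastIssuedValue = 0;
};
```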
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scalability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return an invalid reference depending on the timing.
Added the MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
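The page allocator described above can be sketched in C++ as a free list of physical page indices. Any free pages, contiguous or not, are handed to a resource and recorded against a contiguous run of virtual pages. The pages go back to the pool when the resource is destroyed. `PageMapping` and `NoncontiguousPageAllocator` are hypothetical names. The sketch calls no XgMemoryLayout or console API: the actual virtual-to-physical mapping is a platform operation and is only recorded here. The virtual range is modeled as a page counter that only grows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// A resource's view of its memory: contiguous virtual pages backed by an
// arbitrary, possibly non-contiguous, list of physical pages.
struct PageMapping {
    uint32_t VirtualBasePage = 0;         // first page of the contiguous virtual range
    std::vector<uint32_t> PhysicalPages;  // PhysicalPages[i] backs VirtualBasePage + i
};

class NoncontiguousPageAllocator {
public:
    explicit NoncontiguousPageAllocator(uint32_t NumPhysicalPages) : TotalPages(NumPhysicalPages) {
        for (uint32_t Page = NumPhysicalPages; Page-- > 0;) {
            FreePages.push_back(Page);  // pop_back() hands out low page numbers first
        }
    }

    // Takes any NumPages free physical pages; returns nullopt if too few are free.
    std::optional<PageMapping> Allocate(uint32_t NumPages) {
        if (NumPages == 0 || NumPages > FreePages.size()) {
            return std::nullopt;
        }
        PageMapping Mapping;
        Mapping.VirtualBasePage = NextVirtualPage;
        NextVirtualPage += NumPages;  // virtual space is never reused in this sketch
        for (uint32_t i = 0; i < NumPages; ++i) {
            Mapping.PhysicalPages.push_back(FreePages.back());
            FreePages.pop_back();
        }
        return Mapping;
    }

    // Returns the physical pages to the pool when the resource is destroyed.
    void Free(const PageMapping& Mapping) {
        assert(FreePages.size() + Mapping.PhysicalPages.size() <= TotalPages);
        for (uint32_t Page : Mapping.PhysicalPages) {
            FreePages.push_back(Page);
        }
    }

    std::size_t NumFreePages() const { return FreePages.size(); }

private:
    uint32_t TotalPages;
    std::vector<uint32_t> FreePages;
    uint32_t NextVirtualPage = 0;
};
```

After a free, a later allocation can receive a scattered set of physical pages, here pages 1, 0 and 3. It still presents them as one contiguous virtual range.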
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
{
2022-01-31 10:23:36 -05:00
const uint ObjectIndex = CulledObjectIndices[IndexInCulledList];
FDFObjectBounds ObjectBounds = LoadDFObjectBounds(ObjectIndex);
2022-03-04 06:03:15 -05:00
float4 SphereTranslatedCenterAndRadius = float4(LWCToFloat(LWCAdd(ObjectBounds.Center, PrimaryView.PreViewTranslation)), ObjectBounds.SphereRadius);
2017-01-26 19:20:49 -05:00
BRANCH
2022-03-04 06:03:15 -05:00
if (SphereIntersectCone(SphereTranslatedCenterAndRadius, TileConeTranslatedVertex, TileConeAxis, TileConeAngleCos, TileConeAngleSin))
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
{
uint ListIndex;
InterlockedAdd(NumIntersectingObjects, 1U, ListIndex);
if (ListIndex < MAX_INTERSECTING_OBJECTS)
{
IntersectingObjectIndices[ListIndex] = ObjectIndex;
}
}
}
GroupMemoryBarrierWithGroupSync();
float MinRayTime;
float TotalStepsTaken;
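The InterlockedAdd step a few lines above uses a common bounded-list pattern. Each thread reserves a slot with an atomic add and writes only if that slot is below the capacity. The group then synchronizes before any thread reads the list.

Below is a minimal CPU-side sketch in C++. It assumes `std::atomic::fetch_add` as the analogue of HLSL `InterlockedAdd`, since both return the value before the increment. `IntersectionList` and `kMaxIntersectingObjects` are hypothetical names, not engine code.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical capacity standing in for MAX_INTERSECTING_OBJECTS.
constexpr uint32_t kMaxIntersectingObjects = 8;

struct IntersectionList {
    std::atomic<uint32_t> numIntersectingObjects{0};
    uint32_t indices[kMaxIntersectingObjects] = {};

    // Mirrors InterlockedAdd(NumIntersectingObjects, 1U, ListIndex): fetch_add
    // returns the value before the increment, which becomes this caller's slot.
    // Concurrent callers each receive a unique slot.
    void TryAppend(uint32_t objectIndex) {
        const uint32_t listIndex = numIntersectingObjects.fetch_add(1u);
        if (listIndex < kMaxIntersectingObjects) {
            indices[listIndex] = objectIndex;
        }
        // Callers past capacity still bumped the counter but wrote nothing.
    }

    // The counter can exceed capacity, so readers must clamp it. On the GPU this
    // read is only valid after GroupMemoryBarrierWithGroupSync().
    uint32_t ValidCount() const {
        return std::min(numIntersectingObjects.load(), kMaxIntersectingObjects);
    }
};
```

The counter keeps counting past capacity, so a consumer should clamp it (as `ValidCount()` does) rather than use it directly. In the shader, the group barrier is what makes every slot write visible before that read.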
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are read back asynchronously to the CPU and turned into IO requests; these are polled and, once complete, uploaded to the atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and instances within 50m request mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate the closest unsigned distance, then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
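The brick bullets describe a two-level lookup. Each mip has an indirection table that maps every brick-sized region to an atlas brick, and regions away from the surface have no brick at all.

The sketch below assumes a brick stride of 7 owned samples per axis, taken from "8^3 bricks (7^3 unique data)". It treats the eighth sample per axis as padding. `kInvalidBrick` and the struct names are hypothetical; the real table format and padding rules may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical layout constants based on "8^3 bricks (7^3 unique data)".
constexpr uint32_t kBrickSize = 8;               // samples stored per brick axis
constexpr uint32_t kUniqueBrickSize = 7;         // samples each brick owns per axis
constexpr uint32_t kInvalidBrick = 0xFFFFFFFFu;  // hypothetical "no brick stored" marker
static_assert(kUniqueBrickSize + 1 == kBrickSize, "one extra sample per axis for padding");

struct BrickSample {
    uint32_t atlasBrickIndex;         // brick slot in the atlas
    uint32_t localX, localY, localZ;  // owned sample in [0, 7); the 8th is assumed padding
};

struct SparseDistanceFieldMip {
    uint32_t bricksX, bricksY, bricksZ;  // indirection table dimensions
    std::vector<uint32_t> indirection;   // bricksX * bricksY * bricksZ entries

    // Returns the brick holding sample (x, y, z), or nullopt if that region is
    // outside the narrow band and has to be answered from a coarser mip.
    std::optional<BrickSample> Lookup(uint32_t x, uint32_t y, uint32_t z) const {
        const uint32_t bx = x / kUniqueBrickSize;
        const uint32_t by = y / kUniqueBrickSize;
        const uint32_t bz = z / kUniqueBrickSize;
        assert(bx < bricksX && by < bricksY && bz < bricksZ);
        const uint32_t entry = indirection[(bz * bricksY + by) * bricksX + bx];
        if (entry == kInvalidBrick) {
            return std::nullopt;
        }
        return BrickSample{entry, x % kUniqueBrickSize, y % kUniqueBrickSize, z % kUniqueBrickSize};
    }
};
```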
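A 1D pooled allocator avoids fragmentation because every allocation is the same size: one brick. Any freed slot can therefore serve the next request.

Below is a minimal free-list sketch. `BrickPoolAllocator` is a hypothetical name. Doubling `capacity_` only stands in for the atlas resize; the GPU memcopy itself is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Every allocation is exactly one brick, so any free slot can satisfy any
// request: there is no external fragmentation to pack around.
class BrickPoolAllocator {
public:
    explicit BrickPoolAllocator(uint32_t capacity) : capacity_(capacity) {}

    // Reuses a freed slot if one exists; otherwise hands out the next unused
    // slot, doubling capacity first when the pool is full.
    uint32_t Allocate() {
        if (!freeList_.empty()) {
            const uint32_t slot = freeList_.back();
            freeList_.pop_back();
            return slot;
        }
        if (nextUnused_ == capacity_) {
            capacity_ = capacity_ == 0 ? 1 : capacity_ * 2;
        }
        return nextUnused_++;
    }

    void Free(uint32_t slot) {
        assert(slot < nextUnused_);
        freeList_.push_back(slot);
    }

    uint32_t Capacity() const { return capacity_; }
    uint32_t NumAllocated() const {
        return nextUnused_ - static_cast<uint32_t>(freeList_.size());
    }

private:
    uint32_t capacity_;
    uint32_t nextUnused_ = 0;
    std::vector<uint32_t> freeList_;
};
```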
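The streaming thresholds can be read as a simple distance-to-mip rule. The bullets leave several details open, so this sketch makes three assumptions:

- mip0 is the always-resident lowest-resolution mip.
- Distance is measured from the view to the instance.
- A distance exactly on a threshold counts as inside.

```cpp
#include <cassert>

// Distances in meters, from the change description: instances inside the 200m
// Global SDF extent request mip1, and instances within 50m request mip2.
constexpr float kMip1RequestDistance = 200.0f;
constexpr float kMip2RequestDistance = 50.0f;

// Returns the highest mip index an instance should request, assuming mip0 is
// the always-loaded lowest-resolution mip and higher indices are finer.
int DesiredMip(float distanceToViewMeters) {
    assert(distanceToViewMeters >= 0.0f);
    if (distanceToViewMeters <= kMip2RequestDistance) return 2;
    if (distanceToViewMeters <= kMip1RequestDistance) return 1;
    return 0;
}
```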
Original costs on a GTX 1080 (full updates on everything and no screen traces)
10.60ms UpdateGlobalDistanceField
3.62ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
1.73ms VoxelizeCards Clipmaps=[0,1,2,3]
0.38ms TraceCards 1 dispatch 1 groups
0.51ms TraceCards 1 dispatch 1 groups
Sparse SDF costs
12.06ms UpdateGlobalDistanceField
4.35ms LumenReflectiveTest.DirectionalLight_1 Shadowmap 1
2.30ms VoxelizeCards Clipmaps=[0,1,2,3]
0.69ms TraceCards 1 dispatch 1 groups
0.77ms TraceCards 1 dispatch 1 groups
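The two timing lists can be compared pass by pass using the relative change (sparse - original) / original. Below is a small C++ helper with the figures copied from the lists. The two identically labeled TraceCards lines are matched in the order listed, which is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

struct PassCost {
    const char* name;
    double originalMs;
    double sparseMs;
};

// Timings copied from the two lists above (GTX 1080).
const PassCost kPassCosts[] = {
    {"UpdateGlobalDistanceField", 10.60, 12.06},
    {"LumenReflectiveTest.DirectionalLight_1 Shadowmap 1", 3.62, 4.35},
    {"VoxelizeCards Clipmaps=[0,1,2,3]", 1.73, 2.30},
    {"TraceCards (first listed)", 0.38, 0.69},
    {"TraceCards (second listed)", 0.51, 0.77},
};

// Relative change in percent; positive means the sparse path is slower.
double PercentChange(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

void PrintCostComparison() {
    for (const PassCost& p : kPassCosts) {
        std::printf("%-52s %+6.2f ms (%+5.1f%%)\n", p.name, p.sparseMs - p.originalMs,
                    PercentChange(p.originalMs, p.sparseMs));
    }
}
```

For example, UpdateGlobalDistanceField goes from 10.60ms to 12.06ms, about +13.8%. The first TraceCards line goes from 0.38ms to 0.69ms, about +81.6%. Every listed pass is slower in this full-update measurement.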
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
uint HitCulledObjectIndex;
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
// Trace once to find the distance to first intersection
2022-03-04 06:03:15 -05:00
RayTraceThroughTileCulledDistanceFields(TranslatedWorldRayStart, WorldRayDirection, TraceDistance, MinRayTime, TotalStepsTaken, HitCulledObjectIndex);
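`RayTraceThroughTileCulledDistanceFields` produces three outputs: the nearest hit time, a step count and the index of the object that was hit. The sketch below shows the general sphere-tracing idea under three assumptions:

- Each culled object is an analytic sphere standing in for a mesh SDF.
- Objects are traced one at a time.
- The closest hit wins.

The names, step limit and epsilon are hypothetical and are not taken from the engine.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 Add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 Scale(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float Length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Hypothetical stand-in for a culled object's mesh SDF: an analytic sphere.
struct SphereObject {
    Vec3 center;
    float radius;
    float Distance(Vec3 p) const { return Length(Add(p, Scale(center, -1.0f))) - radius; }
};

struct TraceResult {
    float minRayTime;         // nearest hit distance, or maxDistance if none
    float totalStepsTaken;    // summed over all objects traced
    uint32_t hitObjectIndex;  // index of the closest hit object, or UINT32_MAX
};

// Sphere-traces the ray through each object independently and keeps the
// closest hit. dir must be normalized.
TraceResult TraceThroughObjects(const std::vector<SphereObject>& objects, Vec3 origin, Vec3 dir,
                                float maxDistance, int maxSteps = 64, float epsilon = 1e-3f) {
    TraceResult result{maxDistance, 0.0f, std::numeric_limits<uint32_t>::max()};
    for (uint32_t i = 0; i < objects.size(); ++i) {
        float t = 0.0f;
        // Stop early once this object can no longer beat the current best hit.
        for (int step = 0; step < maxSteps && t < result.minRayTime; ++step) {
            const float d = objects[i].Distance(Add(origin, Scale(dir, t)));
            result.totalStepsTaken += 1.0f;
            if (d < epsilon) {
                result.minRayTime = t;
                result.hitObjectIndex = i;
                break;
            }
            t += d;  // the SDF guarantees no surface lies closer than d
        }
    }
    return result;
}
```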
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
Sparse, narrow band, streamed Mesh Signed Distance Fields
* SDFs are now generated, allocated from the atlas and uploaded in 8^3 bricks (7^3 unique data, half voxel padding).
* Tracing must load the brick index from the indirection table, and only bricks near the surface are stored
* 3 mips are now generated, with the lowest resolution always loaded and the other 2 streamed
* SDFs are now G8 narrow band. Lower resolution mips must be traversed when querying distance to nearest surface far away from the surface
* The Distance Field Brick Atlas is now stored for each FScene and dynamically resized based on needs with a GPU memcopy
* Brick atlas uses a 1d pooled allocator which has no fragmentation and greatly reduces packing waste over the 3d allocator
* Added new indirection for Distance Field Asset data, so that only a single entry needs to be updated when a mip is streamed in or out in scenes with millions of instances
* Compute shaders operating on distance field instances generate streaming requests, which are async read back to CPU, turned into IO requests, which are polled and when complete uploaded to atlases
* Any mesh instance inside the Global SDF extent (200m) requests mip1, and at 50m requests mip2
* Now using a batched compute scatter to upload to the distance field atlas instead of RHIUpdateTexture3d, to bypass alignment restrictions and per-upload overhead
* Distance Field streaming uses an async task to move Memcpy and IO request overhead off of the Rendering Thread
* Distance Field Visualization now computes a normal from the SDF gradient and does simple lighting to better visualize the scene representation
* Increased r.DistanceFields.MaxPerMeshResolution from 128 to 512, to better represent large objects
* Mesh SDF generation now uses an Embree point query to calculate the closest unsigned distance, and then a much smaller set of rays to count backfaces for negative region determination, for an 11x speedup
* Upgraded mesh utilities to Embree 3.12.2 to get point queries
* Fixed wrong transform used for SDF normals in Lumen, causing non-uniformly scaled meshes to have incorrect Surface Cache interpolation
* Fixed Static Mesh materials not getting PostLoaded before SDF build, causing their blend modes to be wrong for the build, which corrupts the DDC. Also included those blend modes in the DDC key.
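Every brick in the bullets above has the same size: 8^3 voxels, at one byte each for G8, so 512 bytes. That is why a 1D pool avoids fragmentation, since any free slot fits any brick. The following is a minimal CPU-side sketch, assuming a LIFO free list and a doubling growth policy. `BrickPool` and its methods are hypothetical names, not the engine's allocator. The GPU copy performed when the real atlas is resized appears only as a comment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One brick is 8x8x8 voxels; at G8 (one byte per voxel) that is 512 bytes.
constexpr uint32_t BrickBytes = 8 * 8 * 8;

// Hypothetical 1D pool of fixed-size brick slots. Because every slot has the
// same size, any free slot can hold any brick, so frees never leave unusable gaps.
class BrickPool {
public:
    explicit BrickPool(uint32_t InitialCapacity) { Grow(InitialCapacity); }

    // Hands out a free slot index, doubling the pool when it is full.
    uint32_t Allocate() {
        if (FreeSlots.empty()) {
            Grow(Capacity == 0 ? 1 : Capacity * 2);
        }
        const uint32_t Slot = FreeSlots.back();
        FreeSlots.pop_back();
        return Slot;
    }

    // Returns a slot to the pool so a later Allocate() can reuse it.
    void Free(uint32_t Slot) {
        assert(Slot < Capacity);
        FreeSlots.push_back(Slot);
    }

    uint32_t GetCapacity() const { return Capacity; }
    uint32_t GetNumAllocated() const { return Capacity - static_cast<uint32_t>(FreeSlots.size()); }
    uint64_t GetPoolBytes() const { return static_cast<uint64_t>(Capacity) * BrickBytes; }

private:
    // Growing only appends slots, so existing indices stay valid; the real atlas
    // would copy its old contents into the larger texture at this point.
    void Grow(uint32_t NewCapacity) {
        // Push new slots in descending order so the lowest index is handed out first.
        for (uint32_t Slot = NewCapacity; Slot > Capacity; --Slot) {
            FreeSlots.push_back(Slot - 1);
        }
        Capacity = NewCapacity;
    }

    uint32_t Capacity = 0;
    std::vector<uint32_t> FreeSlots;
};
```

A LIFO free list reuses the most recently freed slot first. That is one reasonable choice; the commit message does not say which order the engine uses.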
Costs on a GTX 1080 (full updates on everything, no screen traces):
Pass                                                Original   Sparse SDF
UpdateGlobalDistanceField                            10.60ms    12.06ms
LumenReflectiveTest.DirectionalLight_1 Shadowmap 1    3.62ms     4.35ms
VoxelizeCards Clipmaps=[0,1,2,3]                      1.73ms     2.30ms
TraceCards 1 dispatch 1 groups                        0.38ms     0.69ms
TraceCards 1 dispatch 1 groups                        0.51ms     0.77ms
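Read side by side, the listed passes total 16.84ms before the change and 20.17ms after it. That is 3.33ms, or about 20%, more in this full-update, no-screen-trace configuration. The small sketch below only re-derives those totals and the per-pass changes from the numbers in the commit message; it measures nothing. `PassCost`, `SumOriginal`, `SumSparse`, `PercentChange` and `PrintCostDeltas` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>

// Per-pass GPU times in milliseconds, copied from the commit message in listed order.
struct PassCost {
    const char* Name;
    double OriginalMs;
    double SparseMs;
};

const PassCost Costs[] = {
    {"UpdateGlobalDistanceField", 10.60, 12.06},
    {"LumenReflectiveTest.DirectionalLight_1 Shadowmap 1", 3.62, 4.35},
    {"VoxelizeCards Clipmaps=[0,1,2,3]", 1.73, 2.30},
    {"TraceCards (first dispatch)", 0.38, 0.69},
    {"TraceCards (second dispatch)", 0.51, 0.77},
};

double PercentChange(double Before, double After) {
    assert(Before > 0.0);
    return (After - Before) / Before * 100.0;
}

double SumOriginal() {
    double Total = 0.0;
    for (const PassCost& Pass : Costs) Total += Pass.OriginalMs;
    return Total;
}

double SumSparse() {
    double Total = 0.0;
    for (const PassCost& Pass : Costs) Total += Pass.SparseMs;
    return Total;
}

// Prints one line per pass plus the total, e.g. "... 10.60 -> 12.06 ms (+13.8%)".
void PrintCostDeltas() {
    for (const PassCost& Pass : Costs) {
        std::printf("%-52s %6.2f -> %6.2f ms (%+.1f%%)\n", Pass.Name, Pass.OriginalMs, Pass.SparseMs,
                    PercentChange(Pass.OriginalMs, Pass.SparseMs));
    }
    std::printf("%-52s %6.2f -> %6.2f ms (%+.1f%%)\n", "Total", SumOriginal(), SumSparse(),
                PercentChange(SumOriginal(), SumSparse()));
}
```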
Tested: TopazEntry PC, Reverb PC and PS5, EngineTests, QAGame, Rift, Frosty P_Construct_WP, FortGPUTestbed
#rb Krzysztof.Narkowicz
#ROBOMERGE-OWNER: Daniel.Wright
#ROBOMERGE-AUTHOR: daniel.wright
#ROBOMERGE-SOURCE: CL 15784493 in //UE5/Release-5.0-EarlyAccess/...
#ROBOMERGE-BOT: STARSHIP (Release-5.0-EarlyAccess -> Main) (v783-15756269)
#ROBOMERGE-CONFLICT from-shelf
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
if (HitCulledObjectIndex != 0xFFFFFFFF)
Copying //UE4/Dev-Rendering to //UE4/Dev-Main (Source: //UE4/Dev-Rendering @ 3274304)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
{
2022-01-31 10:23:36 -05:00
FDFObjectData DFObjectData = LoadDFObjectData(HitCulledObjectIndex);
2022-03-04 06:03:15 -05:00
float4x4 TranslatedWorldToVolume = LWCMultiplyTranslation(LWCNegate(PrimaryView.PreViewTranslation), DFObjectData.WorldToVolume);
Sparse, narrow band, streamed Mesh Signed Distance Fields
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
2022-03-04 06:03:15 -05:00
float3 TranslatedWorldRayEnd = TranslatedWorldRayStart + WorldRayDirection * MinRayTime;
float3 VolumeHitPosition = mul(float4(TranslatedWorldRayEnd, 1.0f), TranslatedWorldToVolume).xyz;
Sparse, narrow band, streamed Mesh Signed Distance Fields
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
//VolumeHitPosition = clamp(VolumeHitPosition, -DFObjectData.VolumePositionExtent, DFObjectData.VolumePositionExtent);
uint NumMips = LoadDFAssetData(DFObjectData.AssetIndex, 0).NumMips;
FDFAssetData DFAssetData = LoadDFAssetData(DFObjectData.AssetIndex, NumMips - 1);
float3 VolumeGradient = CalculateMeshSDFGradient(VolumeHitPosition, DFAssetData);
float VolumeGradientLength = length(VolumeGradient);
float3 VolumeNormal = VolumeGradientLength > 0.0f ? VolumeGradient / VolumeGradientLength : 0;
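The normal above comes from the SDF gradient. A signed distance field's gradient points away from the nearest surface, so normalizing it gives the surface normal. The `> 0` test guards against a zero gradient, for example where the field is flat. The following is a minimal sketch, assuming `CalculateMeshSDFGradient` uses central differences (an assumption; the source does not show its body). The real function samples the brick atlas; here an analytic sphere SDF stands in for it. `CentralDifferenceGradient`, `SafeNormalize` and the step `H` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>

using Vec3 = std::array<double, 3>;
using SdfFunc = std::function<double(const Vec3&)>;

// Central-difference gradient of a signed distance function.
// The step H is an illustrative parameter, not an engine setting.
Vec3 CentralDifferenceGradient(const SdfFunc& Sdf, const Vec3& P, double H) {
    assert(H > 0.0);
    Vec3 Gradient{};
    for (int Axis = 0; Axis < 3; ++Axis) {
        Vec3 Plus = P;
        Vec3 Minus = P;
        Plus[Axis] += H;
        Minus[Axis] -= H;
        Gradient[Axis] = (Sdf(Plus) - Sdf(Minus)) / (2.0 * H);
    }
    return Gradient;
}

// Mirrors the shader's `Length > 0.0f ? G / Length : 0`: a zero gradient
// yields a zero normal instead of a division by zero.
Vec3 SafeNormalize(const Vec3& V) {
    const double Length = std::sqrt(V[0] * V[0] + V[1] * V[1] + V[2] * V[2]);
    if (Length > 0.0) {
        return {V[0] / Length, V[1] / Length, V[2] / Length};
    }
    return {0.0, 0.0, 0.0};
}
```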
2022-03-03 17:54:46 -05:00
float3 WorldGradient = mul(VolumeNormal, transpose((float3x3)DFObjectData.WorldToVolume.M));
Sparse, narrow band, streamed Mesh Signed Distance Fields
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
float WorldGradientLength = length(WorldGradient);
WorldHitNormal = WorldGradientLength > 0.0f ? WorldGradient / WorldGradientLength : 0;
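The world-space normal above is computed with the transpose of WorldToVolume, not with VolumeToWorld. With row vectors, suppose p_vol = p_world * A, where A is the 3x3 part of WorldToVolume. Then tangents map back to world space through A^-1, and normals map through the inverse transpose of A^-1, which is A^T. This keeps normals perpendicular to tangents under non-uniform scale or shear. The result is renormalized because that transform changes its length. This matches the commit message's note about the wrong SDF normal transform for non-uniformly scaled meshes.

The following is a minimal sketch with hypothetical helpers (`MulRow`, `Transpose`, `Dot`, `VolumeNormalToWorld`). It ignores the translation row, which does not affect directions.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>; // row-major, applied as V' = V * M

Vec3 MulRow(const Vec3& V, const Mat3& M) {
    Vec3 R{};
    for (int j = 0; j < 3; ++j)
        for (int k = 0; k < 3; ++k)
            R[j] += V[k] * M[k][j];
    return R;
}

Mat3 Transpose(const Mat3& M) {
    Mat3 T{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            T[i][j] = M[j][i];
    return T;
}

double Dot(const Vec3& A, const Vec3& B) { return A[0] * B[0] + A[1] * B[1] + A[2] * B[2]; }

// Volume-space normal -> world-space normal, given the 3x3 part of WorldToVolume.
// Mirrors mul(VolumeNormal, transpose((float3x3)WorldToVolume)) followed by a safe normalize.
Vec3 VolumeNormalToWorld(const Vec3& VolumeNormal, const Mat3& WorldToVolume3x3) {
    const Vec3 G = MulRow(VolumeNormal, Transpose(WorldToVolume3x3));
    const double Length = std::sqrt(Dot(G, G));
    if (Length > 0.0) {
        return {G[0] / Length, G[1] / Length, G[2] / Length};
    }
    return {0.0, 0.0, 0.0};
}
```

Transforming the normal with VolumeToWorld instead, as for positions, tilts it away from the surface whenever the mapping is not a pure rotation with uniform scale.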
2022-02-24 20:39:55 -05:00
bHit = true;
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
}
2022-02-24 20:39:55 -05:00
#endif
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
float3 Lighting = 0;
2021-03-24 02:22:11 -04:00
float3 DirectionalLightColor = 10;
float3 DirectionalLightDirection = float3(.707f, 0, .707f);

if (ForwardLightData.HasDirectionalLight)
{
DirectionalLightColor = ForwardLightData.DirectionalLightColor;
DirectionalLightDirection = ForwardLightData.DirectionalLightDirection;
}
2022-02-24 20:39:55 -05:00
if (bHit)
2020-10-29 23:43:01 -04:00
{
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
// Compute simple lighting to help visualize errors in either position or normal in a way that is intuitive to artists
2021-03-24 02:22:11 -04:00
Lighting += DirectionalLightColor * saturate(dot(DirectionalLightDirection, WorldHitNormal));
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
Lighting += GetSkySHDiffuse(WorldHitNormal) * View.SkyLightColor.rgb;
float3 Albedo = .05f;
Lighting *= Albedo;
2020-10-29 23:43:01 -04:00
}
[CL 15790658 by Daniel Wright in ue5-main branch]
2021-03-23 22:40:05 -04:00
else if (ReflectionStruct.SkyLightParameters.y > 0)
{
float SkyAverageBrightness = 1.0f;
float Roughness = 0.0f;
Lighting += GetSkyLightReflection(WorldRayDirection, Roughness, SkyAverageBrightness);
}

float3 Result = Lighting * View.PreExposure;
Global distance field update regions are clipped against others to reduce redundant updates.
Change 3258959 on 2017/01/16 by Benjamin.Hyder
Updating Planar Reflection example material in TM-Shadermodels
Change 3259270 on 2017/01/16 by Daniel.Wright
[Copy] 'r.MSAACount 1' now produces no MSAA or TAA. 'r.MSAACount 0' can be used to toggle TAA on for comparisons.
Change 3259652 on 2017/01/16 by Uriel.Doyon
Better support for static primitive becoming dynamic.
Change 3260107 on 2017/01/17 by Ben.Woodhouse
Fix FMonitoredProcess to prevent infinite loop in -nothreading mode
#jira UE-40717
Change 3260594 on 2017/01/17 by Daniel.Wright
Added a new global distance field (4x 128^3 clipmaps) which caches mostly static primitives (Mobility set to Static or Stationary)
* The full global distance field inherits from the mostly static cache, so when a Movable primitive is modified, only other movable primitives in the vicinity need to be re-composited into the global distance field
* Global distance field update cost with one large rotating object went from 2.5ms -> .2ms on 970GTX and 4.6ms -> .3ms. Worst case full volume update is mostly the same.
* Adds 12Mb for the new volume textures
Change 3260956 on 2017/01/17 by Daniel.Wright
Structured buffers for DF object data
* Full global distance field clipmap composite 3.0ms -> 2.0ms due to scalarized loads
Change 3261296 on 2017/01/17 by Daniel.Wright
Exposed MaxObjectsPerTile with 'r.AOMaxObjectsPerCullTile' and lowered the default from 512 to 256, saves 17Mb of object tile culling data structures
Removed unnecessary UAV transitions preventing object and global cone tracing from overlapping, saves ~.1ms
Change 3262036 on 2017/01/18 by Ben.Salem
V0 of Perf monitor plugin for easily consumable stat csvs. With plugin enabled, enter PerformanceMonitor help into the console to get usage details.
Change 3262056 on 2017/01/18 by Chris.Bunner
Remove inverse tonemapping when rendering HDR output.
#jira UE-40728
Change 3262661 on 2017/01/18 by Rolando.Caloca
DR - Add missing SetStencilRef() and SetBlendFactor() on most RHIs
- Fix hash for PSOs
Change 3263674 on 2017/01/19 by Chris.Bunner
PR #3144: Improved error messages (Contributed by DarkSlot)
#jira UE-40835
Change 3264150 on 2017/01/19 by Ben.Woodhouse
Add support for single threaded in FMonitoredProcess. Deprecated IsRunning() in favour of a new Update() method because polling IsRunning is not compatible with -nothreading mode
#jira UE-40841
Change 3264153 on 2017/01/19 by Ben.Woodhouse
Integrate latest changes from MS-DX12 CLs 3231395-3262526
- Added WinPixEventRuntime.tps
- Includes PIX support, various optimizations (saved 1.3ms in testbed scene)
CL 3262343:
Fix depth testing on translucency not working correctly after cl 3231395. This change reapplies the D3D12RHI changes from CL 3231395 because those changes were lost when integrating from //Dev-Rendering/ but also includes the depth fixes:
- Fix depth state not being in DEPTH_READ for use as depth read. The issue was HasDepthBits and HasStencilBits wern't intended for SRV formats and always returned false in the SRV case.
CL 3231395:
Update D3D12 RHI:
- Fix deferred MSAA path in RHI
- Add Pix3.h support
- Cleanup SetName usage and remove it from shipping builds.
- Fix fence reuse bug. We were signaling MAX UINT (-1) and then waiting for 0, which was always signaled. This change also removes the fence value reset code, there is no need to reset a fence to a previous value.
- Use FPlatformAtomics::InterlockedIncrement instead of InterlockedIncrement64
- Use InterlockedIncrement() instead of _InterlockedIncrement() and use the FPlatformAtomics:: version.
- Fix possible readback heap being evicted while in use. GetQueryData happens on the render thread and isn't tied to a command list so we should always have readback heaps resident.
Change 3264251 on 2017/01/19 by Mark.Satterthwaite
Modify some asserts in MetalRHI - technically using a store-action of ENoAction on Stencil buffers should make it invalid to restart a render-pass but on Mac it will work because ENoAction won't invalidate anything written. In future we need to use deferred store-actions in Metal so that we can "restart" passes while enforcing correct Load/Store actions.
#jira UE-40803
Change 3264642 on 2017/01/19 by Daniel.Wright
Raised GMaxShadowDepthBufferSizeX to max texture resolution on most platforms, was previously 4096.
Change 3265330 on 2017/01/20 by Ben.Salem
Stop performance plugin from building in Win32.
#tests recompiled and preflighted
Change 3265678 on 2017/01/20 by Marcus.Wassmer
Fix bad declaration.
#3055
Change 3266656 on 2017/01/20 by Mark.Satterthwaite
Changes to the FShaderCache to restore it and extend it to optionally report on shader de-duplication when generating a binary shader cache (Console Variable: r.BinaryShaderCacheLogging).
Duplicate & amend CL #3266053 from Trepka:
Fixed issues with shader cache not working properly with Mac Metal (but it still requires -norhithread to work at all). Enabled the shader cache by default if RHI thread is disabled.
Amend & integrate RCO's CL #3197085.
Change 3267741 on 2017/01/23 by Rolando.Caloca
DR - Detect duplicated shader and pipeline types
Change 3268600 on 2017/01/23 by Uriel.Doyon
Added missing r.Streaming.MaxEffectiveScreenSize config to base texture scability settings.
Integrated CL 3227368 from Orion stream
Enabled r.Streaming.UsePerTextureBias by default as this has been tested in Orion for several months.
Fixed issue with the InvestigateTexture command which could return invalid reference depending on the timing,
Added th MaxEffectiveScreenSize settings in the investigate texture command.
Change 3269512 on 2017/01/24 by Richard.Wallis
Fix for shader binary cache uncompress data size during internal shader log.
Change 3271237 on 2017/01/25 by Ben.Woodhouse
D3D12 updateTexture2D crash fix
#jira UE-41059
Change 3271564 on 2017/01/25 by Olaf.Piesche
#jira UE-40980
#udn 325525
Fix uniform buffers for mesh particles; these should really be on the mesh collector, so allocating them as a one frame resource is safe
Change 3271594 on 2017/01/25 by Ben.Woodhouse
ESRAM support stage 1:
Implemented noncontiguous ESRAM page allocator replacing XgMemoryLayout API. The allocator allocates non-contiguous ranges of pages and maps them onto a contiguous virtual address range.
Unlike the previous implementation, this allocator frees pages for reuse when resources are destroyed
Note: issues with deferred deallocation may prevent reuse in many cases - that will be addressed in the next stage
Support for the old allocator is still available (for now) via the define NEW_ESRAM_ALLOCATOR
#fyi rolando.caloca
Change 3272616 on 2017/01/25 by Rolando.Caloca
DR - Update shader version
Change 3273138 on 2017/01/26 by Ben.Woodhouse
Fix merge issue with MonitoredProcess.cpp (this arose from an integration made as an edit in dev-rendering, which confused perforce when the change was subsequently integrated from main)
[CL 3274498 by Rolando Caloca in Main branch]
2017-01-26 19:20:49 -05:00
	RWVisualizeMeshDistanceFields[DispatchThreadId.xy] = float4(Result, 0);
}

Texture2D VisualizeDistanceFieldTexture;
SamplerState VisualizeDistanceFieldSampler;

void VisualizeDistanceFieldUpsamplePS(in float4 UVAndScreenPos : TEXCOORD0, out float4 OutColor : SV_Target0)
{
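	// Distance field AO was computed with its origin at (0,0) regardless of the view rect min,
	// so shift the UVs back by ViewRectMin converted to UV space (ViewRectMin * InvBufferSize)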
	float2 DistanceFieldUVs = UVAndScreenPos.xy - View.ViewRectMin.xy * View.BufferSizeAndInvSize.zw;
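	// Worked example of the UV offset above (illustrative numbers, not taken from any real capture).
	// Assumes BufferSizeAndInvSize.zw holds (1 / BufferWidth, 1 / BufferHeight), as the name suggests.
	//   Buffer 1920x1080      -> BufferSizeAndInvSize.zw = (1/1920, 1/1080)
	//   ViewRectMin = (480, 0) -> offset = (480 / 1920, 0 / 1080) = (0.25, 0)
	//   A pixel at buffer UV (0.5, 0.5) therefore samples the visualization texture at
	//   (0.5 - 0.25, 0.5 - 0) = (0.25, 0.5).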

	float3 Value = Texture2DSampleLevel(VisualizeDistanceFieldTexture, VisualizeDistanceFieldSampler, DistanceFieldUVs, 0).xyz;

	OutColor = float4(Value, 1);
}