We currently check that non-shader-visible heaps have a NULL
handle, but that doesn't seem to be guaranteed: besides WARP,
NVIDIA drivers also return a valid pointer. And that's a pretty
useless check anyway; rather, check that shader-visible heaps
have a valid pointer, which is more interesting.
Currently the tests expect that creating buffers in COMMON or
COPY_SOURCE state on UPLOAD heaps or in COMMON state on READBACK
heaps leads to a failure. I tested WARP, AMD and NVIDIA, and in
all cases the operation is successful.
I think the D3D12 runtime used to reject resources created in the
configurations detailed above, but it no longer does (either with
the latest Agility SDK or with the runtime distributed with an
updated Windows 11 system). However, the CI still uses an
earlier runtime, so the old behaviour is still allowed as
broken.
We cannot redefine struct types in this rule, only refer to already
defined types. Struct type definition is handled by named_struct_spec,
which complains if the type was already defined at all, regardless of
class.
This is motivated by SampleId, whose presence or absence determines whether a
fragment shader runs at sample frequency or not.
In HLSL, if SV_SampleIndex is declared but not used, this results in a
signature entry, but no dcl instruction (and a zero used-mask in the
signature entry). Whether the shader actually runs at sample frequency
is then inconsistent across implementations: NVIDIA does, AMD does not,
and WARP does for d3d12 but not for d3d11.
Previously vkd3d-shader relied on the dcl instruction, thereby aligning with
AMD. This was changed by 66cb2815f0. This commit
restores the previous behaviour.
Some versions of Clang do not properly pass include paths to the
assembler when LTO is enabled. That's a problem for the DEMO_EMBED macro
used by the demos, since it relies on .incbin. To make matters worse,
compilation fails but doesn't return an error code; the resulting
binary is simply broken. Fortunately we don't really care about LTO
for the demos, so we can simply disable it. See also
https://github.com/llvm/llvm-project/issues/112920
Thanks to Giovanni for tracking down some of the pieces of this issue.
sm6_metadata_get_float_value() already calls sm6_value_is_constant()
itself, and some compilers (rightly) emit a warning that the conversion
of UINT_MAX to float changes the value.
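For instance, a minimal illustration of that warning:

#include <limits.h>

float f = UINT_MAX; /* 4294967295 is not representable as a float and
                     * becomes 4294967296.0f */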
And other related identifiers similarly.
Currently we use VALUE_TYPE_ICB to indicate a raw chunk of data
(which can be used as an immediate constant buffer, but also for
initializing an indexable temporary) and VALUE_TYPE_REG with
register type VKD3DSPR_IMMCONSTBUFFER to indicate an actual ICB
register.
Since we are now switching to representing the register type
directly in the SM6 value, it seems sensible to use VALUE_TYPE_ICB for an
ICB register. Indeed, since what we're currently describing with
VALUE_TYPE_ICB is not necessarily related to an ICB register, it
makes sense to use a more general name, which is therefore being
introduced with this commit.
VALUE_TYPE_UNDEFINED means that a value is explicitly undefined,
equivalent to what was originally represented with is_undefined.
VALUE_TYPE_INVALID will be introduced for values that arise from
invalid programs.
This effectively moves "null_event_cond" from struct d3d12_fence and
"latch" from struct vkd3d_waiting_event together into a separate
structure, as well as storing the signalling function in struct
vkd3d_waiting_event instead of getting it from struct d3d12_device. I
think that largely makes sense on its own, but storing the signalling
function in struct vkd3d_waiting_event also allows us to more easily
implement d3d12_device_SetEventOnMultipleFenceCompletion() in a
subsequent commit.
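As a rough sketch of the resulting shape (not the actual vkd3d
declarations; names other than "latch", "null_event_cond" and
PFN_vkd3d_signal_event are illustrative):

struct vkd3d_null_event
{
    pthread_cond_t cond; /* was "null_event_cond" in struct d3d12_fence */
    bool latch;          /* was "latch" in struct vkd3d_waiting_event */
};

struct vkd3d_waiting_event
{
    uint64_t value;
    HANDLE event;
    PFN_vkd3d_signal_event signal; /* previously obtained from struct d3d12_device */
    struct vkd3d_null_event *u;    /* used when no Win32 event was supplied */
};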
Currently structured TGSM loads are encoded to something like this:
ld_structured sr12 <s:float>, sr1 <s:uint>, l(0) <s:uint>, g0[sr1 <s:uint> + 0] <s:float>
Notice how the TGSM offset, expressed by sr1, is encoded twice in
the instruction. In TPF there is no expectation of two indices
in the resource source, so let's avoid producing it for DXIL as
well. The same instruction will therefore become:
ld_structured sr12 <s:float>, sr1 <s:uint>, l(0) <s:uint>, g0 <s:float>
Rather than an array of pointers to registers. This makes it nicer
to use with registers that are synthesized on the fly, a situation
that already exists and is likely to become more common in future
commits.
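For illustration (all names hypothetical), taking the registers by
value means a caller can pass a register synthesized on the stack
without also building an array of pointers to it:

struct vkd3d_shader_register regs[2];

regs[0] = ins->src[0].reg;          /* an existing register, copied */
regs[1] = synthesize_temp_reg(ctx); /* one created on the fly */
emit_with_registers(ctx, regs, 2);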
The following table shows how each intrinsic maps to d3d assembly and the
flags that appear in the tpf bytecode, in binary.
GroupMemoryBarrier()                sync_g            0010
GroupMemoryBarrierWithGroupSync()   sync_g_t          0011
DeviceMemoryBarrier()               sync_uglobal      1000
DeviceMemoryBarrierWithGroupSync()  sync_uglobal_t    1001
AllMemoryBarrier()                  sync_uglobal_g    1010
AllMemoryBarrierWithGroupSync()     sync_uglobal_g_t  1011
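Reading the binary column as bit flags gives the following sketch
(illustrative enumerant names, not necessarily the ones used in the
vkd3d source):

enum sync_flags
{
    SYNC_THREADS_IN_GROUP           = 0x1, /* the *WithGroupSync() variants */
    SYNC_THREAD_GROUP_SHARED_MEMORY = 0x2, /* "_g": TGSM visibility */
    SYNC_UAV_MEMORY_GLOBAL          = 0x8, /* "_uglobal": device-wide UAV visibility */
};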
The idea is that sm6_value and related structures should not commit
to a specific vsir register representation yet, because at the time
they are created some information might still be unavailable. For
instance, it might not yet be known whether the program uses
native 16-bit types or minimum-precision types. vsir registers
should only be synthesized during instruction emission.
Field "reg" in struct sm6_value will be handled in a later commit.
Commit 02fe9f5bdf introduced linking the
Windows build of the demos with d3d12 and dxgi, while also still linking
to libvkd3d-utils.la. That happens to more or less work on Wine; we get
vkd3d-utils' D3D12CreateDevice() and Wine's
IDXGIFactory2_CreateSwapChainForHwnd(), and because Wine's
implementation of d3d12 swapchains uses vkd3d, we're able to use the
resulting swapchain buffers even though the instance of vkd3d used by
Wine may not be the same instance of vkd3d used by the demos. Perhaps
unsurprisingly, things don't go nearly as well on Windows.
We could of course stop linking the demos to vkd3d-utils on Windows, but
that's not that interesting; we're trying to show what vkd3d can do
here, not what d3d12 can do.
Much like we did for demo_xcb.h, demo_win32.h now just has the bits for
creating windows and handling events, while demo_d3d12.h has the d3d12
and dxgi bits.
For the build-mingw-32 and build-mingw-64 jobs. These currently end up
picking up Linux pkg-config instead of MinGW pkg-config, which in turn
causes configure to pick up Linux OpenGL and XCB. We happen to get away
with that at the moment because none of the code using HAVE_OPENGL or
HAVE_XCB ends up getting built for Windows, but that's about to change.
Specifically, we'd like to build the vkd3d versions of the demos for
Windows.
All stream output objects need to have a stream index allocated,
whether they are used or not.
We allocate stream outputs here, before other objects are allocated,
because the stream index is needed to create the appropriate output
semantic variables during append_output_copy(), which will be called
in a lowering pass for the Append() method.
It seems that the NVIDIA driver leaves VBV bindings untouched
when they are NULL (or the GPU buffer address is NULL), instead of
setting them to a null binding.
Unlike other cases of inconsistent behaviour between AMD
and NVIDIA, here I'm explicitly marking the NVIDIA behaviour as
broken, because the expected behaviour is spelled out explicitly
(at least by the standards of the D3D12 specification).
The code doesn't make sense in the first place, even if it's
accepted by the compiler, so it makes sense that the behaviour
is undefined. And indeed the behaviour differs between AMD (4 is
returned), NVIDIA (QNaN is returned) and WARP (the device is removed).
Currently program errors might result in instructions that use
ctx->error_instr as a source. If we hit YYABORT while parsing another
part of the HLSL source, we have to make sure that the block that
contains such an instruction is properly cleaned up, or we might hit the
vkd3d:590273:err:hlsl_free_instr Failed assertion: list_empty(&node->uses)
assertion when hlsl_ctx_cleanup() is called after the YYABORT.
Consider the following shader:
float4 main() : sv_target
{
    // Statement A
    int p = foo;       // initializer argument is ERROR.
    // Statement B
    undeclared_fun();  // triggers YYABORT.
}
Statement A will use ctx->error_instr as a source because of the
undeclared identifier, and Statement B will trigger a YYABORT because
of the undeclared function.
The stride didn't match the structure size used in the shader.
This didn't seem to be a problem on AMD and WARP, but it was
on NVIDIA on Windows. Specifically, it seems that the buffer
is read using the shader structure size (so most tests pass),
but bounds are checked using the buffer stride, so a test
returned zero simply because an out-of-bounds read was detected.
Unless the D3D11_FEATURE_DATA_D3D11_OPTIONS2.MapOnDefaultTextures
feature is supported, textures with default usage cannot be read by the
CPU; and there is no need to either, since we're going through a
staging texture anyway.
This worked on AMD and WARP, but failed on NVIDIA on Windows.
Currently a descriptor set is represented with a structure
whose fields are the various descriptors that the shader is
going to use, each of them qualified with its type. This model
doesn't match very well what happens with D3D12, or with Vulkan
when descriptor indexing and mutable descriptors are used, where
many descriptors of different types can overlap and which of them
must be used is only decided at shader runtime.
Therefore with this patch a descriptor is represented as a
structure whose fields carry no intrinsic typing information,
but are reinterpreted according to the shader behavior on
each access. For the moment there is just one field, but more
will be added to account for fancier descriptor types and
to work around Metal limitations.
When completed, this design will be similar in spirit to what
the Metal shader converter does, with minor technical
differences.
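A sketch of the two models (field names illustrative; the real
structures live in the generated shader code rather than in C):

/* Old model: one typed field per descriptor used by the shader. */
struct descriptor_set_old
{
    struct texture_descriptor t0;
    struct sampler_descriptor s0;
};

/* New model: untyped slots, reinterpreted at each access. */
struct descriptor
{
    uint64_t word; /* more fields will be added later */
};

struct descriptor_set
{
    struct descriptor descriptors[]; /* needs tier 2 argument buffers */
};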
This choice has a couple of drawbacks:
1. shader debugging with Xcode is less comfortable, because
the shader doesn't carry static descriptor typing
information and Xcode is therefore less able to provide
useful context;
2. it requires tier 2 argument buffer support and macOS
version >= 13, which includes guarantees on the runtime
layout for descriptors.
Supporting descriptor indexing and mutable descriptors is
considered to be worth those drawbacks, though: it is
effectively required by many applications and appears to be
the first goal we should aim for. If needed, in the future
we might also add support for older hardware, at least for
shaders that do not make use of advanced features.
Like we do for GLSL; there's no reason to abort compilation here. Note
that this also avoids leaking "gen->buffer" and "gen->string_buffers" on
the error path.
With this we can finally disallow failure for the macOS CI script,
which is more valuable than checking DXC. Eventually DXC tests
will have to be fixed, though.
It's hard to pinpoint exactly what's going wrong with these
tests. They seem to be related to atomics and GPU timestamps,
both categories that are known to have problems on MoltenVK in
one way or another; those failures clearly depend on a few factors,
like the MoltenVK version, the macOS version and whether we're in
a virtual machine or not, but the exact dependency on those factors
is hard to describe (for example, in general the paravirtualized
device offered inside virtual machines has a lot more problems than
real devices, but I've seen tests, with all other conditions fixed,
work on the paravirtualized device and not on the real device).
The only thing all tests in this batch have in common is that I've
never seen them fail on a Sequoia system, thus I've settled for
using just that as the bug_if() condition. Ultimately, wasting a
lot of time to get to the bottom of each single test failure is
pointless, and being able to mark the CI job as not allowed to
fail gives better regression protection than investigating each
of those. Also, I routinely run the tests on a Sequoia system, so
if these tests get broken this is going to be noticed anyway.
Every call to shader_instruction_array_insert_at() implies a possible
reallocation of all the vsir instructions in the program, meaning that
all previously obtained pointers into the array are potentially no
longer valid.
We are currently using such potentially invalid pointers in some cases,
usually in the form of "ins->location". This commit fixes those cases.
I moved all pointer updates to right after the call to
shader_instruction_array_insert_at() to make this more evident.
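The pattern looks roughly like this (a sketch, with field names from
memory of the vkd3d tree rather than the exact code):

/* "ins" may dangle after the insertion below. */
size_t pos = ins - program->instructions.elements;

if (!shader_instruction_array_insert_at(&program->instructions, pos + 1, 1))
    return false;
/* Reload the pointer from the (possibly reallocated) array. */
ins = &program->instructions.elements[pos];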
The following pixel shader currently triggers an infinite loop during
copy propagation, which is fixed by this commit:
sampler s;
Texture2D t1, t2;
float4 main() : sv_target
{
    Texture2D t = t1;
    t1 = t2;
    t2 = t;
    return t1.Sample(s, float2(0, 0)) + t2.Sample(s, float2(0, 0));
}
The infinite loop occurs because copy_propagation_transform_object_load()
replaces t1 in the resource_load(t1, ...) instruction
with t2, t1, t2, ... repeatedly.
Note that we still have to preempt the propagation to SM1 pixel shader
uniforms. Otherwise this would turn the many constant derefs that arise
from the <index-val> copy generated in lower_index_loads() into a single
non-constant deref, causing it to allocate all the registers instead of
only up to the last one used.
Some SM1 src registers have idx_count = 0, in which case we have to
respect that instead of always reading reg->reg.idx[0].offset even when
it is invalid.
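The guard, roughly (a sketch; the layout follows my recollection of
the vkd3d-shader register structure):

unsigned int offset = 0;

/* idx[0] is only meaningful when the register actually has indices. */
if (reg->reg.idx_count)
    offset = reg->reg.idx[0].offset;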
Here, a vertex shader version of the previous test by Shaun is
introduced. Note that in this case the uniform allocates all 4 registers
instead of 3 because it is indirectly addressed.
Note that, for indices with a fractional part, the behaviour differs
depending on whether it is a temp load or a direct uniform load (which
can only happen in vertex shaders): the former rounds toward zero,
while the latter rounds to the nearest even integer.
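In C terms the two behaviours correspond to truncation and
round-to-nearest-even (illustrative only):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float index = 7.5f;

    printf("%d\n", (int)index);        /* temp load: rounds toward zero -> 7 */
    printf("%d\n", (int)rintf(index)); /* uniform load: nearest even -> 8 */
    return 0;
}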
On MoltenVK it seems that all draws are always executed,
independently of the early depth/stencil test. The problem doesn't
seem to lie in vkd3d or MoltenVK, because the generated Metal
commands look correct. I tried looking at a GPU capture with Xcode,
which was not very conclusive because it doesn't state clearly
whether the early fragment tests passed or not. Sometimes it
says that a fragment shader execution had no thread execution
data, which I interpret as the early fragment tests having
prevented the fragment shader from running, but it's not really
consistent, and it's never clear which results are based on
software simulation and which on the hardware run.
However, taking everything into account, I think the most likely
explanation is some incorrect optimization at the Metal level.
The graphics pipeline triggers an internal error in the Metal
pipeline compiler, with a completely generic error message. I have
no idea what the actual problem is.
This makes it more similar to the MSL and GLSL generators. It also looks
like a cleaner design: the backend is supposed to get access to the vsir
program after it has gone through the pipeline.
Similarly to the modulus operator, d3dbc results with constant folding
are different from results when constant folding cannot be applied, and
different from tpf results.
Note that in d3dbc target profiles it gives different results when this
operation is constant folded compared to when it is not.
This suggests that whatever pass lowers the modulus operation to d3dbc
operations doesn't do it before constant folding.
Also note that when constant folded, d3dbc results differ from tpf
results for negative operands, because of the loss of precision that
happens when NEG is constant folded.
So the same integer modulus expression can have 3 different results
depending on the context.
As far as I know there is no way to implement this properly on
Vulkan, and all the other Vulkan implementations essentially work
by luck. In Vulkan the initial layout of a resource must always be
UNDEFINED or PREINITIALIZED, and it must be transitioned away from
before any meaningful use of the image. Therefore it's possible to
alias two images and let the second one inherit the content of the
first only if both already exist (and are in the same layout)
before the first is written to. If, as in this example, the second
image is created after the first one has already been written to,
the obligatory transition away from UNDEFINED or PREINITIALIZED
will potentially wipe out the content.
Therefore I am marking this as todo, not as a bug. It might also be
that there is a bug in MoltenVK, and that's ultimately the reason
why we're reading invalid data, but technically the Vulkan
commands we generate are incorrect anyway.
When textures[1] is read for the second time it is aliased to
textures[0]. But textures[0] was left in COPY_DEST state (since
its creation), and textures[1] is currently recreated in COPY_SOURCE
state, which AFAIU is invalid. So recreate textures[1] in COPY_DEST
state and then transition it before reading it.
I haven't investigated the actual problem here, but the generated
Vulkan commands look correct (and work with basically all other
Vulkan implementations) and MoltenVK is known to have incomplete
tessellation support, so it's likely that the problem is there.
Similarly to 5d4edba925, it seems
that sometimes MoltenVK returns 0 to timestamp queries. It
doesn't happen deterministically and it might depend on the
hardware (I have seen differences between the M2 I used some
time ago and the M3 Max I have now).
My main motivation for this is to avoid generating a lot of useless
log lines from other executors when I'm interested in just one of
them, but I can imagine this also somewhat improving efficiency.
This test is currently miscompiled on SM4 because
copy_propagation_invalidate_variable_from_deref_recurse() does not
always invalidate the right components.
Valid stream output objects must be single-element and contain a
PointStream/LineStream/TriangleStream object.
Moreover, stream output objects cannot be declared globally.
Basically, separate lower_casts_to_int() into the lowering of the CAST
and the lowering of the TRUNC, so that TRUNCs that are not part of a
cast are lowered as well.
We implement a transformation that propagates loads with a single
non-constant index in their deref path. Consider a load of the form
var[[a0][a1]...[i]...[an]], where the ak are integral constants and i is
an arbitrary non-constant node. If, for all j, the following holds:
var[[a0][a1]...[j]...[an]] = x[[c0*j + d0][c1*j + d1]...[cm*j + dm]],
where the ck, dk are constants, then we can replace the load with
x[[c0*i + d0]...[cm*i + dm]]. For example, if tmp[j] was written as a
copy of src[2*j + 1] for every constant j, then a load of tmp[i] can be
replaced with a load of src[2*i + 1]. This pass is implemented by
copy_propagation_replace_with_deref().
Adds new flags --sm-min and --sm-max. They each take a shader model
identifier, with the same syntax as in the test harness. If either is
present, then only tests within the (inclusive) range are run.
Omitting one leaves the corresponding bound unconstrained.
"char" is (potentially) signed, so casting it to uint32_t will
sign-extend it. Because we use |= to assign it to "word", and don't
otherwise mask out the higher bits either, we effectively set subsequent
bytes in the same word to 0xff for input bytes > 0x7f. That potentially
includes the \0 terminator. For example, "é" (U+00e9) is "\xc3\xa9"
when encoded as UTF-8, and would get us 0xffffffc3 instead of
0x0000a9c3.
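A minimal sketch of the bug and the fix ("word" packing as described
above; function names are illustrative):

#include <stdint.h>
#include <stdio.h>

static uint32_t pack_buggy(const char *s)
{
    uint32_t word = 0;

    for (unsigned int i = 0; i < 4 && s[i]; ++i)
        word |= (uint32_t)s[i] << (8 * i); /* sign-extends bytes > 0x7f */
    return word;
}

static uint32_t pack_fixed(const char *s)
{
    uint32_t word = 0;

    for (unsigned int i = 0; i < 4 && s[i]; ++i)
        word |= (uint32_t)(unsigned char)s[i] << (8 * i); /* zero-extends */
    return word;
}

int main(void)
{
    printf("%#010x\n", pack_buggy("\xc3\xa9")); /* 0xffffffc3 */
    printf("%#010x\n", pack_fixed("\xc3\xa9")); /* 0x0000a9c3 */
    return 0;
}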
While fxc allows full expressions inside the angle brackets, we don't
parse that yet, as it'd be quite a mess to do so properly with yacc,
and I'm not aware of any game doing so in its shaders.
To reproduce:
float4 v;
SamplerState s
{
    BorderColor = 0.1 + v*0.2;
};
The expression should use more than one literal constant, as well as
a scalar in an operation that involves a vector.
Signed-off-by: Nikolay Sivov <nsivov@codeweavers.com>
Instead of "%f". vkd3d_string_buffer_print_f32() will use sufficient
precision to represent the stored value exactly, and will use '.' as
decimal separator regardless of the current locale.
We currently generate our own I/O signatures inside the DXIL parser, but
use the element counts from the DXBC container signatures to allocate
the input_params/output_params/patch_constant_params arrays. That
happens to work for well-behaved inputs, but it's asking for trouble.
We already check for error instructions when parsing swizzles, but if allocation
fails at codegen time we would like to avoid asserting when subsequently
constructing a swizzle.
Similar to d1c2ae3f0e, this is a bit too strict
and may prevent e.g. simultaneous use of float and float1 at codegen time.
However, in this case the inciting factor is that in the case of allocation
failure at codegen time, we would like to allow one or more arguments to have
error type.
The primary motivation here is to avoid needing to worry about instructions
potentially pointing to the preallocated error instruction in the case of
allocation failure.
This doesn't cover all passes, but none of the other passes make assumptions
about instruction sources.
This is because lower_nonconstant_array_loads() can potentially turn
non-constant loads into constant loads, allowing copy-prop to replace
these loads with previously computed values, which might help other
passes as well.
This patch lowers the number of required temps for the following ps_2_0
shader from 19 to 16:
int i;
float3x3 mats[4];
float4 main() : sv_target
{
    return mul(mats[i], float3(1, 2, 3)).xyzz;
}
This can save a significant number of temp registers, because it allows
us to avoid referencing the temp (and having to store it) when it is
not needed. For instance, this patch lowers the number of required
temps for the following ps_2_0 shader from 24 to 19:
int i;
float3x3 mats[4];
float4 main() : sv_target
{
    return mul(mats[i], float3(1, 2, 3)).xyzz;
}
Also, it is needed for SM1 vertex shader relative addressing, since
non-constant loads are required to be applied directly to the uniform
('c' registers) instead of the temp, and non-constant loads cannot be
transformed by copy propagation.
Since 4a94bfc2f6 we segregate
different D3D12 descriptor types into different Vulkan descriptor sets.
This change was introduced to reduce descriptor waste when
allocating a new descriptor pool; that can be very useful when
using virtual heaps, which often have to cycle through many
descriptors, but it is expected to have limited impact for Vulkan
heaps, given that in that case most descriptors are allocated through
the descriptor heap rather than through the command allocator.
Instead, it has a rather detrimental effect with Vulkan heaps,
because it tends to use many more Vulkan descriptor sets than
necessary, often with just a handful of descriptors each. This
causes a regression on some Vulkan implementations that support
too few descriptor sets.
With this change we revert to a situation similar to before,
stuffing all the descriptors that do not live in a root
descriptor table in as few descriptor sets as possible (at most
one or two, depending on whether push descriptors are used).
We're already implicitly using it for image layouts in which
either depth or stencil is writeable and the other is not.
Correspondingly, add the _KHR suffix in those cases, so the
extension usage is more evident.
According to the Vulkan Hardware Database, only four reports
without this extension were filed since 2023, and all of them
for configurations we likely don't target.
This could be useful since there are many shaders that contain `#include`
directives or use parameter-defined macros and we can't reproduce bugs
from the source alone.
The current TPF validator enforces that for each register involved
in a DCL_INDEX_RANGE instruction there must be a signature element
for that register and the DCL_INDEX_RANGE write mask. This is an
excessively strong requirement, and causes some shaders from The Falconeer
to be invalidly rejected.
The excessively strong check was needed to avoid triggering a bug
in the I/O normaliser. Since that bug is now solved, the check
can be relaxed.
A good part of the I/O normaliser's job is to merge together signature
elements that are spanned by DCL_INDEX_RANGE instructions. The
current algorithm assumes that each index range touches exactly
one signature element for each index spanned by the range. The
assumption is used in shader_signature_merge() in the form of
expecting that, if the index range is N registers long, then,
once you find the first signature element of an index range, the
other elements that will have to be merged with it are exactly the
following N-1 according to the order given by
signature_element_register_compare() or
signature_element_mask_compare(), depending on the signature type.
This doesn't necessarily happen. For example, The Falconeer has a
few hull shaders in which this happens:
hs_fork_phase
dcl_hs_fork_phase_instance_count 13
dcl_input vForkInstanceId
dcl_output o4.z
dcl_output o5.z
dcl_output o6.z
dcl_output o7.z
dcl_output o12.z
dcl_output o13.z
dcl_output o14.z
dcl_output o15.z
dcl_output o16.z
dcl_output o17.z
dcl_output o18.z
dcl_output o19.z
dcl_output o20.z
dcl_temps 1
dcl_index_range o4.z 17
iadd r0.x, vForkInstanceId.x, l(4)
ult r0.y, vForkInstanceId.x, l(4)
movc r0.x, r0.y, vForkInstanceId.x, r0.x
mov o[r0.x + 4].z, l(0)
ret
Here the index range "skips" o8.z through o11.z, because those
registers only use mask .xy. The current algorithm fails on such
a shader.
Even depending on the signature element order doesn't look ideal.
I don't have a full counterexample for that, but it looks fragile,
especially given that the register allocation algorithm in FXC is
notoriously full of unexpected corner cases.
We solve both problems by slightly changing the architecture of
the normaliser: first we move computing the masks for the merged
signature elements from signature_element_range_expand_mask(),
which is executed while merging signatures, to
io_normaliser_add_index_range(), which is executed before merging
signatures. Then, while we are merging signatures, we can decide
for each single signature element whether it has to be retained
or not, and how it should be patched. The algorithm becomes
independent of the order, because each signature element can be
processed individually.
The assumptions the I/O normaliser makes about its input program are
rather intricate. In theory the VSIR validator checks should be
strong enough, but the validator isn't run by default anyway.
Whether the TPF parser validation is strong enough is not
completely clear to me, and considering that the I/O normaliser
could end up being used on different programs as well, it's
probably better to revalidate locally just in case.
Since these tests depend on the specific code generated by the
native d3dcompiler, we add the ability to specify a "raw"
shader using a hex format. When the shader assembler is finally
available, they should be replaced with assembly code.