Like we do for GLSL: there's no reason to abort compilation here. Note
that this also avoids leaking "gen->buffer" and "gen->string_buffers" on
the error path.
With this we can finally disallow failure for the macOS CI script,
which is more valuable than checking DXC. Eventually DXC tests
will have to be fixed, though.
It's hard to pinpoint exactly what's going wrong with these
tests. They seem to be related to atomics and GPU timestamps,
both categories that are known to have problems on MoltenVK in
one way or another. The failures clearly depend on a few factors,
like the MoltenVK version, the macOS version and whether we're in
a virtual machine or not, but the exact dependency on those factors
is hard to describe (for example, in general the paravirtualized
device offered inside virtual machines has a lot more problems than
real devices, but I've seen tests, with all other conditions fixed,
work on the paravirtualized device and not on the real device).
The only thing all tests in this batch have in common is that I've
never seen them fail on a Sequoia system, thus I've settled for
using just that as the bug_if() condition. Ultimately, spending a
lot of time getting to the bottom of each individual test failure is
pointless, and being able to mark the CI job as not allowed to
fail gives better regression protection than investigating each
of those. Also, I routinely run the tests on a Sequoia system, so
if these tests get broken that is going to be noticed anyway.
Every call to shader_instruction_array_insert_at() implies a possible
reallocation of all vsir instructions in the program, meaning that all
previously obtained pointers are potentially no longer valid.
We are currently using these potentially invalid pointers in some cases,
usually in the form of "ins->location". This commit fixes those cases.
I moved all pointer changes to right after the call to
shader_instruction_array_insert_at() to make this more evident.
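A minimal sketch of the resulting pattern (field names modeled on
vkd3d-shader's internals; treat the details as illustrative rather than
the exact upstream code):

/* Save an index and copy any needed data by value before inserting. */
size_t idx = ins - program->instructions.elements;
struct vkd3d_shader_location location = ins->location;

if (!shader_instruction_array_insert_at(&program->instructions, idx + 1, 1))
    return false;

/* The array may have been reallocated; re-derive the pointer. */
ins = &program->instructions.elements[idx];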
The following pixel shader currently triggers an infinite loop during
copy propagation, which is fixed by this commit:
sampler s;
Texture2D t1, t2;
float4 main() : sv_target
{
Texture2D t = t1;
t1 = t2;
t2 = t;
return t1.Sample(s, float2(0, 0)) + t2.Sample(s, float2(0, 0));
}
The infinite loop occurs because copy_propagation_transform_object_load()
replaces t1 in the resource_load(t1, ...) instruction
with t2, then t1 again, then t2, and so on, repeatedly.
Note that we still have to preempt the propagation to SM1 pixel shader
uniforms. Otherwise, it would turn the many constant derefs arising
from the <index-val> copy generated in lower_index_loads() into a single
non-constant deref, causing all the registers to be allocated instead of
only those up to the last one used.
Some SM1 src registers have idx_count = 0, in which case we have to
respect that instead of always reading reg->reg.idx[0].offset even when
it is invalid.
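A sketch of the corresponding guard (field names follow the vsir
register structures; the fallback value is illustrative):

unsigned int offset;

if (reg->reg.idx_count)
    offset = reg->reg.idx[0].offset;
else
    offset = 0; /* The register carries no index; idx[0] is invalid. */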
Here, a vertex shader version of the previous test by Shaun is
introduced. Note that in this case the uniform allocates all 4 registers
instead of 3 because it is indirectly addressed.
Note that, for indexes with a decimal part, the behavior is different
depending on whether it is a temp load or a direct uniform load (which
can only happen on vertex shaders). The former rounds toward
zero, while the latter rounds to the nearest even number.
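For example, an index of 3.5 selects element 3 in the former case and
element 4 in the latter.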
On MoltenVK it seems that all draws are always executed,
independently of the early depth stencil test. The problem doesn't
seem to belong to vkd3d or MoltenVK, because the generated Metal
commands look correct. I tried looking at a GPU capture with Xcode,
which was not very conclusive because it doesn't state clearly
whether early fragment tests were passed or not. Sometimes it
says that a fragment shader execution had no thread execution
data, which I interpret as the early fragment tests having
prevented the fragment shader from running, but it's not really
consistent, and it's never clear which results are based on
software simulation and which on the hardware run.
However, taking everything into account, I think the most likely
explanation is some incorrect optimization at the Metal level.
The graphics pipeline triggers an internal error in the Metal
pipeline compiler, with a completely generic error message. I have
no idea what the actual problem is.
This makes it more similar to the MSL and GLSL generators. It also looks
like a cleaner design: the backend is supposed to get access to the vsir
program after it has gone through the pipeline.
Similarly to the modulus operator, d3dbc results with constant folding
are different from results when constant folding cannot be applied, and
different from tpf results.
Note that in d3dbc target profiles it gives different results when this
operation is constant folded compared to when it is not.
This suggests that whatever pass lowers the modulus operation to d3dbc
operations doesn't do it before constant folding.
Also note that when constant folded, d3dbc results differ from tpf
results for negative operands, because of the loss of precision that
happens when NEG is constant folded.
So the same integer modulus expression can have 3 different results
depending on the context.
As far as I know there is no way to implement this properly on
Vulkan, and all the other Vulkan implementations essentially work
by luck. In Vulkan the initial layout of a resource must always be
UNDEFINED or PREINITIALIZED and it must be transitioned away from
before any meaningful use of that image is done. Therefore it's
possible to alias two images and let the second one inherit the
content in the first one only if both already exist (and are in
the same layout) before the first write happens. If, as in this
example, the second image is created after the first one has
already been written to, the obligatory transition away from
UNDEFINED or PREINITIALIZED will potentially wipe out the content.
Therefore I am marking this as todo, not as a bug. It might also be
that there is a bug in MoltenVK, and ultimately that's the reason
why we're reading invalid data, but technically the Vulkan
commands we generate are incorrect anyway.
When textures[1] is read for the second time it is aliased to
textures[0]. But textures[0] was left in COPY_DEST state (since
its creation), and textures[1] is currently recreated in COPY_SOURCE
state, which AFAIU is invalid. So recreate textures[1] in COPY_DEST
state and then transition it before reading it.
I haven't investigated the actual problem here, but the generated
Vulkan commands look correct (and work with basically all other
Vulkan implementations) and MoltenVK is known to have incomplete
tessellation support, so it's likely that the problem is there.
Similarly to 5d4edba925, it seems
that sometimes MoltenVK returns 0 for timestamp queries. It
doesn't happen deterministically and it might depend on the
hardware (I have seen differences between the M2 I used some
time ago and the M3 Max I have now).
My main motivation for this is to avoid generating a lot of useless
log lines from other executors when I'm interested in just one of
them, but I can imagine this also somewhat improving efficiency.
This test currently miscompiles on SM4 because
copy_propagation_invalidate_variable_from_deref_recurse() is not always
invalidating the right components.
Valid stream output objects must have a single element, which is a
PointStream/LineStream/TriangleStream object.
Moreover, stream output objects cannot be declared globally.
Basically, separate lower_casts_to_int() into the lowering of the CAST
and the lowering of the TRUNC, so that TRUNCs that are not part of a
cast are lowered as well.
We implement a transformation that propagates loads with a single
non-constant index in their deref path. Consider a load of the form
var[[a0][a1]...[i]...[an]], where ak are integral constants, and i is
an arbitrary non-constant node. If, for all j, the following holds:
var[[a0][a1]...[j]...[an]] = x[[c0*j + d0][c1*j + d1]...[cm*j + dm]],
where ck, dk are constants, then we can replace the load with
x[[c0*i + d0]...[cm*i + dm]]. This pass is implemented by
copy_propagation_replace_with_deref().
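For instance, in the one-dimensional case: if t[j] = x[2*j + 1] holds
for every constant j, then a load of t[i] with non-constant i can be
replaced with a load of x[2*i + 1].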
Adds new flags --sm-min and --sm-max. They each take a shader model
identifier, with the same syntax as in the test harness. If either is
present, then it will only run tests within the (inclusive) range.
Omitting one allows anything as the min/max.
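For example, a hypothetical invocation restricting a run to shader
models 4.0 through 5.1 (the test file name is illustrative):

shader_runner --sm-min 4.0 --sm-max 5.1 tests/hlsl/example.shader_test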
"char" is (potentially) signed, so casting it to uint32_t will
sign-extend it. Because we use |= to assign it to "word", and don't
otherwise mask out the higher bits either, we effectively set subsequent
bytes in the same word to 0xff for input bytes > 0x7f. That potentially
includes the \0 terminator. For example, "é" (U+00e9) is "\xc3\xa9"
when encoded as UTF-8, and would get us 0xffffffc3 instead of
0x0000a9c3.
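A minimal sketch of the fix, assuming the bytes are packed into the
word little-endian first (pack_word() is a hypothetical helper, not the
actual call site):

#include <stdint.h>

/* Casting through unsigned char avoids sign extension, so "é"
 * (0xc3 0xa9) packs to 0x0000a9c3 rather than 0xffffffc3. */
static uint32_t pack_word(const char *p)
{
    uint32_t word = 0;
    unsigned int i;

    for (i = 0; i < 4 && p[i]; ++i)
        word |= (uint32_t)(unsigned char)p[i] << (8 * i);
    return word;
}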
While fxc allows full expressions inside the angle brackets, we don't parse that
yet as it'd be quite a mess to properly do so with yacc, and I'm not aware of any
game doing so in their shaders.
To reproduce:
float4 v;
SamplerState s
{
BorderColor = 0.1 + v*0.2;
};
The expression should use more than one literal constant, as well as
a scalar in an operation that involves a vector.
Signed-off-by: Nikolay Sivov <nsivov@codeweavers.com>
Instead of "%f". vkd3d_string_buffer_print_f32() will use sufficient
precision to represent the stored value exactly, and will use '.' as
decimal separator regardless of the current locale.
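For example, "%f" renders 1e-7f as "0.000000", losing the value
entirely, and in e.g. a French locale it would print "0,000000".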
We currently generate our own I/O signatures inside the DXIL parser, but
use the element counts from the DXBC container signatures to allocate
the input_params/output_params/patch_constant_params arrays. That
happens to work for well-behaved inputs, but it's asking for trouble.
We already check for error instructions when parsing swizzles, but if allocation
fails at codegen time we would like to avoid asserting when subsequently
constructing a swizzle.
Similar to d1c2ae3f0e, this is a bit too strict
and may prevent e.g. simultaneous use of float and float1 at codegen time.
However, in this case the inciting factor is that, when allocation
fails at codegen time, we would like to allow one or more arguments to
have error type.
The primary motivation here is to avoid needing to worry about instructions
potentially pointing to the preallocated error instruction in the case of
allocation failure.
This doesn't cover all passes, but none of the other passes make assumptions
about instruction sources.
This is because lower_nonconstant_array_loads() can potentially turn
nonconstant loads into constant loads, allowing copy-prop to replace
these loads with previously stored values, which might help other passes
as well.
This patch lowers the number of required temps for the following ps_2_0
shader from 19 to 16:
int i;
float3x3 mats[4];
float4 main() : sv_target
{
return mul(mats[i], float3(1, 2, 3)).xyzz;
}
This can save a significant number of temp registers because it allows
us to avoid referencing the temp (and having to store it) when not needed.
For instance, this patch lowers the number of required temps for the
following ps_2_0 shader from 24 to 19:
int i;
float3x3 mats[4];
float4 main() : sv_target
{
return mul(mats[i], float3(1, 2, 3)).xyzz;
}
Also, it is needed for SM1 vertex shader relative addressing since
non-constant loads are required to be directly on the uniform ('c'
registers) instead of the temp, and non-constant loads cannot be
transformed by copy propagation.
Since 4a94bfc2f6 we segregate
different D3D12 descriptor types into different Vulkan descriptor sets.
This change was introduced to reduce descriptor waste when
allocating a new descriptor pool; that can be very useful when
using virtual heaps, which often have to cycle through many
descriptors, but it is expected to have limited impact for Vulkan
heaps, given that in that case most descriptors are allocated through
the descriptor heap rather than through the command allocator.
Instead, it has a rather detrimental effect with Vulkan heaps,
because it tends to use many more Vulkan descriptor sets than
necessary, often with just a handful of descriptors each. This
causes a regression on some Vulkan implementations that support
too few descriptor sets.
With this change we revert to a situation similar to before,
stuffing all the descriptors that do not live in a root
descriptor table into as few descriptor sets as possible (at most
one or two, depending on whether push descriptors are used).
We're already implicitly using it for image layouts in which
either depth or stencil is writeable and the other is not.
Correspondingly, add the _KHR suffix in those cases, so the
extension usage is more evident.
According to the Vulkan Hardware Database, only four reports
without this extension have been filed since 2023, and all of them
for configurations we likely don't target.
This could be useful since there are many shaders that contain `#include`
directives or use parameter-defined macros and we can't reproduce bugs
from the source alone.
The current TPF validator enforces that for each register involved
in a DCL_INDEX_RANGE instruction there must be a signature element
for that register and the DCL_INDEX_RANGE write mask. This is an
excessively strong requirement, and causes some shaders from The Falconeer
to be invalidly rejected.
The excessively strong check was needed to avoid triggering a bug
in the I/O normaliser. Since that bug is now solved, the check
can be relaxed.
A good part of the I/O normaliser's job is to merge together signature
elements that are spanned by DCL_INDEX_RANGE instructions. The
current algorithm assumes that each index range touches exactly
one signature element for each index spanned by the range. The
assumption is used in shader_signature_merge() in the form of
expecting that, if the index range is N registers long, then,
once you find the first signature element of an index range, the
other elements that will have to be merged with it are exactly the
following N-1 according to the order given by
signature_element_register_compare() or
signature_element_mask_compare(), depending on the signature type.
This doesn't necessarily happen. For example, The Falconeer has a
few hull shaders in which the assumption fails:
hs_fork_phase
dcl_hs_fork_phase_instance_count 13
dcl_input vForkInstanceId
dcl_output o4.z
dcl_output o5.z
dcl_output o6.z
dcl_output o7.z
dcl_output o12.z
dcl_output o13.z
dcl_output o14.z
dcl_output o15.z
dcl_output o16.z
dcl_output o17.z
dcl_output o18.z
dcl_output o19.z
dcl_output o20.z
dcl_temps 1
dcl_index_range o4.z 17
iadd r0.x, vForkInstanceId.x, l(4)
ult r0.y, vForkInstanceId.x, l(4)
movc r0.x, r0.y, vForkInstanceId.x, r0.x
mov o[r0.x + 4].z, l(0)
ret
Here the index range "skips" o8.z through o11.z, because those
registers only use mask .xy. The current algorithm fails on such
a shader.
Even depending on the signature element order doesn't look ideal.
I don't have a full counterexample for that, but it looks fragile,
especially given that the register allocation algorithm in FXC is
notoriously full of unexpected corner cases.
We solve both problems by slightly changing the architecture of
the normaliser: first we move computing the masks for the merged
signature element from signature_element_range_expand_mask(),
which is executed while merging signatures, to
io_normaliser_add_index_range(), which is executed before merging
signatures. Then, while we are merging signatures, we can decide
for each single signature element whether it has to be retained
or not, and how it should be patched. The algorithm becomes
independent of the order, because each signature element can be
processed individually.
The assumptions the I/O normaliser makes on its input program are
rather intricate. In theory the VSIR validator checks should be
strong enough, but the validator isn't run by default anyway.
Whether the TPF parser validation is strong enough is not
completely clear to me, and considering that the I/O normaliser
could end up being used on different programs as well it's
probably better to revalidate locally just in case.
Since these tests depend on the specific code generated by the
native d3dcompiler, we add the possibility to specify a "raw"
shader using a hex format. When the shader assembler is finally
available they should be replaced with assembly code.
Commit 4717775abb removed the code turning
the corresponding DCL_INPUT instructions into NOPs, presumably because
vsir_program_remove_io_decls() now already removes them. That function
only removes the instructions though, not the declarations as such; we
now need to remove these from the "io_dcls" bitmap instead.
Commit 949708450b introduced support for
effect binaries embedded in DXBC containers, but only when using
auto-detection to determine the source type. That's undesirable;
although auto-detection is convenient for interactive use, it's not
necessarily suitable for use in e.g. scripts. It also meant this wasn't
listed through --print-source-types.
The semantic variables from a patch parameter are split as usual, with
the difference being that the semantic variable being added is a patch
variable itself, with the type being the split variable type, and its
number of control points being equal to the original patch variable's
number of control points. It is then stored in the original patch
variable as follows:
for (i = 0; i < n; ++i)
patch[i][f] := <inputpatch-sem-var>[i]
where n is the number of control points of "patch", and f is the field
index in patch corresponding to "<inputpatch-sem-var>".
We use special prefixes, "inputpatch-" or "outputpatch-", when adding
the semantic patch variables, in order to distinguish them from
non-patch semantic variables of the same name.
In anticipation of the need for is_patch_constant_func in
sm4_generate_vsir_reg_from_deref(), in order to generate vsir
registers from InputPatch/OutputPatch dereferences.
swprintf() expects the length of the buffer in WCHARs instead of bytes,
so ARRAY_SIZE() is used instead of sizeof().
This caused almost all tests to terminate abruptly with the following
message on my machine:
*** buffer overflow detected ***: terminated
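A minimal illustration of the difference (the buffer and format string
are illustrative, not the actual call site; WCHAR and ARRAY_SIZE() as
defined in the vkd3d headers):

WCHAR buffer[64];

/* Wrong: sizeof() yields the size in bytes, overstating the length
 * and allowing swprintf() to write past the end of the buffer. */
swprintf(buffer, sizeof(buffer), L"%u", value);

/* Right: ARRAY_SIZE() yields the length in WCHARs. */
swprintf(buffer, ARRAY_SIZE(buffer), L"%u", value);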
If the parameter is HLSL_STORAGE_IN, we add a cast from the arg to the
param type so that it can enter the function. However, this cast should
not be considered part of the lhs on the implicit assignment that happens
if the var is also HLSL_STORAGE_OUT.
While so far it has been possible to do this at parse time, this must
happen after knowing whether the complex cast is on the lhs or not.
The modified tests show that before this patch we are currently
miscompiling when this happens, because a complex lhs cast is transformed
into a load, and add_assignment() just stores to the generated "cast"
temp.
For example, given two arguments, half3 and float, and two functions,
func(float, float) and func(float3, float3), fxc/d3dcompiler prefers to
widen both arguments to float3.
Pixel shader 1.x constants must be between -1 and 1, or they will be clamped,
even constants defined in the shader.
Also mark 1.x-specific features if any.
When the client acquires the Vulkan queue it has to ensure that
it is not submitting work before other work it depends on already
submitted through the Direct3D 12 API but currently in the internal
vkd3d queue. Currently we suggest enqueueing a fence signal and
then waiting for it before acquiring the Vulkan queue, which is
correct but excessive: it will wait not just for the work currently
in the queue to be submitted, but for it to be executed too,
introducing useless dependencies.
By adding a way to enqueue signalling a fence on the CPU side we
allow the client to wait for the currently outstanding work to
be submitted to Vulkan, but nothing more.
Commit 1ed8d907b3 inadvertently dropped
emitting the tessellator domain for domain shaders. Although Vulkan
environments allow us to write the tessellator domain from the hull
shader, the domain shader, or both, that's not generally true for OpenGL
environments.
We're explicitly replacing zero with one in the only place where the
plane count being zero or one makes a difference. It also makes sense:
UNKNOWN is used for buffers, which for all intents and purposes have one
plane.
Vertex shaders do not have CMP, so we use SLT and MAD.
For example, this vertex shader:
uniform float4 f;
void main(inout float4 pos : position, out float4 t1 : TEXCOORD1)
{
t1 = (int4)f;
}
results in:
vs_2_0
dcl_position v0
slt r0, c0, -c0
frc r1, c0
add r2, -r1, c0
slt r1, -r1, r1
mad oT1, r0, r1, r2
mov oPos, v0
While we have the lower_cmp() pass, each time it is applied many
instructions are generated, so this patch introduces a specialized
version of the cast-to-int lowering for efficiency.
Turns out we are not doing this correctly in SM1, because the rounding
should be toward the number that is closer to zero, and
lower_casts_to_int() doesn't do that.
A vertex shader test is added since in SM1 they must rely on the SLT
operation instead of the CMP operation.
fxc compiles this method even without the backcompat option.
Furthermore, it even does it on ps_2_0 despite the fact that it maps to
a texldd instruction, which is not available on plain ps_2_0 or ps_2_b,
but only on ps_2_a and ps_3_0 according to the documentation.
It is worth mentioning that the additional offset parameter is not
supported when lowering.
Otherwise ubsan reports these errors on the bitwise.shader_test:
libs/vkd3d-shader/hlsl_constant_ops.c:970:50: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
libs/vkd3d-shader/hlsl_constant_ops.c:970:50: runtime error: left shift of negative value -12
Otherwise ubsan reports errors such as:
libs/vkd3d-shader/spirv.c:7266:5: runtime error: null pointer passed as argument 1, which is declared to never be null
Otherwise ubsan reports runtime errors such as:
libs/vkd3d-shader/ir.c:4731:5: runtime error: null pointer passed as argument 1, which is declared to never be null
Otherwise, when passing "-fsanitize=undefined" to the compiler, ubsan
reports errors such as:
libs/vkd3d-shader/ir.c:3794:5: runtime error: null pointer passed as argument 1, which is declared to never be null
Since we have -Wswitch, this forces the developer to update all relevant
switches when an enum case is added.
Places where the default is just a FIXME are left alone.
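As an illustrative example (not from the codebase): with -Wswitch and
no default label, adding a new enumerant to the enum below makes the
compiler warn about the unhandled case in colour_name().

enum colour { COLOUR_RED, COLOUR_GREEN };

static const char *colour_name(enum colour c)
{
    switch (c)
    {
        case COLOUR_RED:   return "red";
        case COLOUR_GREEN: return "green";
    }
    return "unknown";
}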
We apply distributivity to applicable expressions, specifically with
the following rewrite rules:
(x OPL a) OPR (x OPL b) -> x OPL (a OPR b)
(y OPR (x OPL a)) OPR (x OPL b) -> y OPR (x OPL (a OPR b))
((x OPL a) OPR y) OPR (x OPL b) -> (x OPL (a OPR b)) OPR y
(x OPL a) OPR ((x OPL b) OPR y) -> (x OPL (a OPR b)) OPR y
(x OPL a) OPR (y OPR (x OPL b)) -> (x OPL (a OPR b)) OPR y
where a, b are constants.
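For example, with OPL as multiplication and OPR as addition,
(x * 2) + (x * 3) rewrites to x * (2 + 3), which constant folding then
reduces to x * 5.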
Like we did before commit 067e6deee4.
These skips aren't all that interesting; it's entirely intentional that
e.g. a 2.0-3.0 run wouldn't run 4.0 shaders.
The two branches do essentially the same thing, but in different
ways, each omitting different details. In particular there
is no need to discriminate on whether the register is a relative
address, we can just copy the NULL pointer.
The hull shader control point normalisation pass already ensures
that output registers in the control point phase have two
indices (the control point index and the register index).
Consistent with how D3DXFindShaderComment() allows looking up comments
by tag. This also makes it a little easier to move CTAB generation out
of d3dbc.c.
It makes the code quite a bit longer, but also easier to read and extend
with further properties.
A (desirable) side effect of this commit is that it is now checked
whether I/O register types are legal depending on the shader type
and phase, whereas before that was assumed.
This commit introduces enum vsir_io_reg_type and enum vsir_phase
which shadow enum vkd3d_shader_register_type and enum
vkd3d_shader_opcode, with the goal of making the data tables
smaller.
Adjust the algorithm for deciding for which profiles to test compilation.
We first ensure that if the compilation result changes (most often as the result
of a feature introduced in a specific version), we test the versions immediately
on either side of the change, to validate that vkd3d-shader is emulating the
same version behaviour.
We then ensure that we are testing at least one version from each set of sm1,
sm4, and sm6.
Mainly in order to not waste time compile-testing the same version
more than once [as we do with e.g. the d3d11 and d3d12 runners, or
d3d12, GL, and Vulkan].
When a shader fails to compile for a range of versions, we want to validate that
we are correctly implementing that behaviour. E.g. if a feature requires shader
model 5.0, we should validate that it compiles correctly with 5.0 (which we do),
but also that it *fails* to compile with 4.1 (which we do not).
The obvious solution is to simply run compile tests for each version.
There are, however, at least 12 versions of HLSL up to and including 6.0, at
least 10 of which are known to introduce new features. Shader compilation takes
about 10-15% of the time that draw and dispatch does, both for native and
(currently) vkd3d. Testing every version for every shader would add a
noticeable amount of time to the tests.
In practice, the interesting versions to test for most shaders are:
* At least one from each range 1-3, 4-5, and 6. It's common enough for the
semantics of the HLSL to differ between bytecode formats, or for features to
be added or removed across those boundaries.
* If the shader requires a given feature, we want to test both sides of the cusp
to ensure we're requiring the same version for the feature.
In practice this is 3 or 4 versions, which is measurably less than the 12 we'd
otherwise be running.
In order to achieve this goal of testing only the 3 or 4 interesting versions
for a shader, we need to know what version is actually required for a feature.
This is encoded in the shader itself using e.g. [pixel shader fail(sm<5)].
This patch therefore implements the first step towards this goal, namely,
determining which versions succeed and fail, so we can figure out which ones are
interesting.
We could require the test writer to specify which versions are interesting ahead
of time (e.g. "for version in 2.0 4.1 5.0 6.0") but this is both redundant (and
there are a *lot* of tests that need some feature gate) and easy for a test
writer to get wrong by missing interesting versions.
Sometimes SM1-3 shaders contain write masks that exceed the
signature element masks. That happens because SM1-3 shaders do not
have a concept of signature and signature masks, and OTOH aren't
always able to express any given write mask.
In VSIR we don't want to deal with I/O register masks exceeding the
corresponding signature element mask or usage mask, because, for
instance, for higher shader models it can complicate dealing with
DCL_INDEX_RANGE. In order to have uniform rules for all shader
models we normalise masks coming from SM1-3 shaders.
We don't do that normalisation when disassembling, in order to
preserve the expected output.
The previous names "not normalised" and "fully normalised" have meanings
which are likely to change with time. OTOH including a description of the
normalisation level in the enumerant seems excessive. Relating
normalisation levels to shader model versions might be a reasonable
compromise.
We normalize binary expressions by attempting to group constants
together, in order to facilitate further simplification of the
expressions.
For any binary operator OP, non-constants x, y, and constants a, b, we
apply the following rewrite rules:
a OP x -> x OP a, if OP is commutative.
(x OP a) OP b -> x OP (a OP b), if OP is associative.
(x OP a) OP y -> (x OP y) OP a, if OP is associative and commutative.
x OP (y OP a) -> (x OP y) OP a, if OP is associative.
Note that we consider floating point operations to be
non-associative.
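For example, for integer addition, (x + 1) + (y + 2) becomes
(x + (y + 2)) + 1 by the third rule, then ((x + y) + 2) + 1 by the
fourth, and finally (x + y) + (2 + 1) by the second, which constant
folding reduces to (x + y) + 3.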
The alternative to adding the vsir_program->tess_output_primitive and
vsir_program->tess_partitioning fields would be to emit the vsir
DCL_TESSELLATOR_OUTPUT_PRIMITIVE and DCL_TESSELLATOR_PARTITIONING
instructions, like DXIL does, but I think that the preference is to store
this kind of data directly in the vsir_program.
The combined sampler is created as a SAMPLER instead of a TEXTURE
because that fits all our current infrastructure. The only problem is
that in the CTAB it must appear as a Texture, so the new field
hlsl_type.is_combined_sampler is added.
Co-authored-by: Elizabeth Figura <zfigura@codeweavers.com>
Looking at the implementation of shader_sm4_read_dcl_sampler(), vsir
stores the resource index range both in
vkd3d_shader_instruction.declaration.sampler.range
and in the
vkd3d_shader_instruction.declaration.sampler.src.reg.idx[1-2]
indexes, so we do the same.
It is also worth noting that for shader models lower than 5.1, vsir
has a normalization on the ins->declaration src register indexes.
Refer to the following comment:
/* SM5.1 places a symbol identifier in idx[0] and moves
* other values up one slot. Normalize to SM5.1. */
on shader_sm4_read_param().
This normalization is also added to the generated vsir instructions.
Currently, when using Vulkan heaps, we create descriptor set
layouts with as many descriptors as allowed by the Vulkan
implementation limits. For some implementations this can mean
hundreds of millions of descriptors or more, which is wasteful,
given that even on the best resource binding tier Direct3D 12
applications should not expect to have more than a million usable
descriptors.
Recently this began being a problem, because since Mesa 24.2.7
the Intel driver advertises more than 200 million descriptors,
but pipeline compilation takes RAM linear in the number of
descriptors declared in the pipeline layout. This means that
compiling even a simple shader requires 10-20 GB of RAM.
In order to avoid using too much memory, with this commit we clamp
the number of descriptors declared in the set layouts to how many
we actually need to guarantee tier 3 resource binding support.
Creating a pool of 16k descriptors is wasteful if an allocator only uses
a fraction of them, so start at 1k and double for each subsequent
allocation until 16k is reached.
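A sketch of the growth policy (the names are illustrative, not the
actual vkd3d symbols):

#define INITIAL_POOL_SIZE 1024u
#define MAX_POOL_SIZE 16384u

/* Each new pool doubles the previous size, up to the cap. */
static unsigned int next_pool_size(unsigned int current_size)
{
    if (current_size >= MAX_POOL_SIZE / 2)
        return MAX_POOL_SIZE;
    return current_size * 2;
}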
Now that our Vulkan descriptor sets contain only a single vkd3d
descriptor type, we're able to create descriptor pools containing only a
single vkd3d descriptor type as well. This avoids wasting unallocated
descriptors of one type when we run out of descriptors of another type.
We currently create statically sized descriptor pools, shared among
different descriptor types. Once we're unable to allocate a descriptor
set from a pool, we create a new pool. The unfortunate but predictable
consequence is that when we run out of descriptors of one type, we waste
any unallocated descriptors of the other types.
Dynamically adjusting the pool sizes could mitigate the issue, but it
seems non-trivial to handle all the edge cases, particularly in
situations where the descriptor count ratios change significantly
between frames. Instead, by storing only a single vkd3d descriptor type
in each Vulkan descriptor set we're able to create separate descriptor
pools for each vkd3d descriptor type, which also avoids the issue.
The main drawback of using separate descriptor sets for each descriptor
type is that we can no longer pack all bounded descriptor ranges into a
single descriptor set, potentially leaving fewer descriptor sets
available for unbounded ranges. That seems worth it, but we may end up
having to switch to a more complicated strategy if this ends up being a
problem on Vulkan implementations with a very limited number of
available descriptor sets.
Most I/O registers are already described by the shader signatures.
The registers that are not do not have any property other than
being used by the program or not, so they can be collectively
described with a bitmap.
The register storage class is now represented in
vkd3d_register_builtins, so spirv_compiler_emit_io_register()
doesn't need to receive it from the caller.
Instead of returning nonsense (such as, currently, a type with zero size).
In practice this improves error reporting for shaders such as the following:
void func(float x[])
{
float y[] = {x};
}
Currently this outputs nonsense:
test.hlsl:1:19: E5002: Implicit size arrays not allowed in function parameters.
test.hlsl:3:7: E5002: Implicit size arrays need to be initialized.
With this patch the second error is removed.
Some test programs, particularly the shader runner, are built from
many different files nowadays, and a line number is relatively
cumbersome to use if you don't know which file that line comes from.
These are redundant either because we already have a broader tag like
"sm<6", or because the tests are never executed with the GLSL runner in
the first place.
Instead of using DCL_INPUT.
The main goal here is to eventually get rid of the I/O
declaration instructions. A positive side effect is that we don't
add a useless barrier to shaders which have a DCL_INPUT instruction
in the patch constant phase but don't actually read OUTCONTROLPOINT
registers.
This makes it available to all backends, without requiring an
ad-hoc solution for each of them. It also gets rid of an
undocumented flag we're currently passing to
DCL_CONTROL_POINT_PHASE.
Translate the instructions that contain hlsl_blocks. Also move
other control flow instructions such as HS_CONTROL_POINT_PHASE and
RET to the vsir_program so that we can directly iterate over it now.
These system values are bound to the same allocation rules as other
semantics: they can share registers with other semantics with the same
interpolation mode and they prefer forming shorter writemasks. However,
for some reason, these don't allow further semantics to share the same
register once allocated, except among themselves.