For temporary registers, SM1-SM3 integer types are internally
represented as floating point, so a cast from int to float requires
nothing more than a MOV.
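For instance, a hypothetical cast of an int value held in r0.x into a float
value in r1.x would be emitted simply as:
mov r1.x, r0.x
since the source is already stored using the float representation.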
For the constant integer registers "iN" there is no operation for casting
from a floating-point register to them. For the address registers "aN" and
the loop counter register "aL", vertex shaders have the "mova" instruction,
but we haven't used these registers in any way yet.
We would probably want to introduce these as synthetic variables
allocated in a special register set. In that case we have to remember to
use MOVA instead of MOV in the store operations, but they shouldn't be the
src or dst of CAST operations.
Regarding constant integer registers, in some shaders constants are
expected to be received formatted as integers, for example:
int m;
float4 main() : sv_target
{
    float4 res = {0, 0, 0, 0};
    for (int k = 0; k < m; ++k)
        res += k;
    return res;
}
which compiles as:
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   m            i0       1
//
ps_3_0
def c0, 0, 1, 0, 0
mov r0, c0.x
mov r1.x, c0.x
rep i0
add r0, r0, r1.x
add r1.x, r1.x, c0.y
endrep
mov oC0, r0
but this only happens if the integer constant is used directly in an
instruction that needs it, and, as said above, there is no instruction
that allows converting it to a float representation.
Notice how a more complex shader that performs operations with this
integer variable "m":
int m;
float4 main() : sv_target
{
    float4 res = {0, 0, 0, 0};
    for (int k = 0; k < m * m; ++k)
        res += k;
    return res;
}
gives the following output:
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   m            c0       1
//
ps_3_0
def c1, 0, 0, 1, 0
defi i0, 255, 0, 0, 0
mul r0.x, c0.x, c0.x
mov r1, c1.y
mov r0.y, c1.y
rep i0
mov r0.z, r0.x
break_ge r0.y, r0.z
add r1, r0.y, r1
add r0.y, r0.y, c1.z
endrep
mov oC0, r1
Meaning that the uniform "m" is simply stored as a floating-point value in
"c0", the constant integer register "i0" is just set to 255 (hoping
that is a high enough iteration count) using "defi", and the "break_ge"
involving c0 is used to break out of the loop.
We could potentially use this approach to implement loops from SM3
without expecting the variables to be received in constant integer
registers.
According to the D3D documentation, for SM1-SM3 constant integer
registers are only used by the 'loop' and 'rep' instructions.
These tests should actually compile and run in SM1, which is possible
if we pass the int and uint uniforms in the expected IEEE 754 float
format for SM1 shaders.
Also, bools should be passed as 1.0f or 0.0f to SM1.
When the "if" qualifier is added to a directive, the directive is
skipped if the shader->minimum_shader_model is not included in the
range.
This can be used on the "probe" directives for tests that have different
expected results on different shader models, without having to resort to
[require] blocks.
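For instance, a pair of probes with different expected results on SM1-3 and
SM4+ could look something like this (the probe values here are just
hypothetical):
if(sm<4)  probe all rgba (1.0, 1.0, 1.0, 1.0)
if(sm>=4) probe all rgba (0.0, 0.0, 0.0, 0.0)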
This test currently hits a Metal bug when run on Apple Silicon with
MoltenVK, and fails. We don't have an easy way to mark shader runner
tests as buggy, and we're not interested in tracking that bug anyway,
so I'm just working around it.
The structurizer is implemented along the lines of what is usually called
the "structured program theorem": the control flow is completely
virtualized by means of an additional TEMP register which stores the
index of the block that is currently running. The whole program is then
converted to a huge switch construction enclosed in a loop, executing
at each iteration the appropriate block and updating the register
according to the block's jump instruction.
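As a rough sketch, in HLSL-like pseudocode, a program with three blocks ends
up with roughly this shape (the block bodies, the jump targets and the "cond"
value are all hypothetical):
uint block_idx = 0;
for (;;)
{
    switch (block_idx)
    {
        case 0:
            /* ...instructions of block 0... */
            block_idx = cond ? 1 : 2;
            break;
        case 1:
            /* ...instructions of block 1... */
            block_idx = 2;
            break;
        case 2:
            /* ...instructions of block 2... */
            return;
    }
}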
The algorithm's generality is also its major weakness: it accepts any
input program, even if its CFG is not reducible, but the output
program lacks any useful convergence information. It satisfies the
letter of the SPIR-V requirements, but it is expected that it will
be very inefficient to run on a GPU (unless a downstream compiler is
able to devirtualize the control flow and do a proper convergence
analysis pass). The algorithm is however very simple, and good enough
to at least pass tests, enabling further development. A better
alternative is expected to be upstreamed incrementally.
Side note: the structured program theorem is often called the
Böhm-Jacopini theorem; Böhm and Jacopini did indeed prove a variation
of it, but their algorithm is different from what is commonly attributed
to them and implemented here, so I opted for not using their name.
These can be disassembled by D3DDisassemble() just fine, and perhaps
more importantly, shader model 1 vertex shaders do not require dcl_
instructions in Direct3D 8.
The implementation of upload_buffer_data_with_states(), unlike the
implementation of upload_texture_data_with_states(), does not expect a
pointer to a D3D12_SUBRESOURCE_DATA, but rather a direct pointer to the
data.
The generated Vulkan calls look right and do not trigger any
validation error, but the returned timestamp is 0. A valid
timestamp is returned if the CopyResource() call is commented out,
or if the second EndQuery() call is moved before CopyResource(),
or if the first EndQuery() call is commented out. I am not seeing any
sensible pattern here, so I guess there is just a bug in
MoltenVK.
Specifically, MoltenVK seems to be able to load from stencil, but
the specific replicating swizzle (repeating the stencil value on
all the channels) is not honored. The stencil value is read only
on the red channel.
At the moment this is a little odd, because for SM1 the [test]
directives are skipped, and the [shader] directives are not handled by
shader_runner_vulkan.c:compile_shader() but by the general
shader_runner.c:compile_shader(). So in principle it is a little weird
that we go through the Vulkan runner at all.
But fret not, because in the future we plan to make the parser agnostic
to the language of the tests, so we will get rid of the general
shader_runner.c:compile_shader() function and instead call a
runner->compile_shader() function, defined for each runner. Granted,
most of these may call a generic implementation that uses the native
compiler on Windows and vkd3d-shader on Linux, but it would be more
conceptually correct.
If the runners require multiple calls to run_shader_tests() for
different shader model ranges, these are moved inside the single runner
call.
For the same reason, the trace() messages are also moved inside the
runner calls.
We can now pass (sm<4) and (sm>=4) to the "fail" and "todo" qualifiers, and
we can combine multiple of these qualifier arguments using "&" for AND and
"|" for OR.
Examples:
todo(sm>=4 & sm<6)
todo(sm<4 | sm>6)
Parentheses are not supported.
Adding additional model ranges for the tests, if we need them, should be
easier now, since they only have to be added to the "valid_args" array.
The relative-addressed case in shader_register_normalise_arrayed_addressing()
leaves the control point id in idx[0], while for constant register
indices it is placed in idx[1]. The latter case could be fixed instead,
but placing the control point count in the outer dimension is more
logical.
The FXC optimiser sometimes converts a local array of input values into
direct array addressing of the inputs, which can result in a
dcl_indexrange instruction spanning input elements with different masks.
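For instance, a shader of roughly this shape (a hypothetical example; whether
FXC actually performs the conversion depends on its optimiser) dynamically
indexes a local array built from the inputs:
float4 main(float4 t0 : TEXCOORD0, float3 t1 : TEXCOORD1, uniform uint i) : sv_target
{
    float4 vals[2] = {t0, float4(t1, 0.0)};
    return vals[i];
}
Here TEXCOORD0 uses mask .xyzw while TEXCOORD1 only uses .xyz, so an index
range covering both inputs spans elements with different masks.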
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=56162
Storing to a vector component using a non-constant index is not allowed
on profiles lower than 6.0, unless it happens inside a loop that can be
unrolled, which we are not doing yet.
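A minimal example of such a store would be something like:
uint i;
float4 main() : sv_target
{
    float4 v = {1, 2, 3, 4};
    v[i] = 0;
    return v;
}
where the value of "i" is not known at compile time.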
For this reason, a validate_nonconstant_vector_store_derefs pass is
added to detect these cases.
Ideally we would want to emit an hlsl_error in this pass, but until loop
unrolling is implemented we could reach this point with valid HLSL.
Also, as pointed out by Nikolay in the mentioned bug, currently
new_offset_from_path_index() fails an assertion when this happens,
because it expects an hlsl_ir_constant, so a check is added.
It also felt correct to emit an hlsl_fixme there, despite the
redundancy.
Apparently Metal doesn't support specifying a bias directly in the
sampler, and, with "nearest" mip filtering, it doesn't switch
precisely at LOD 0.5 (though still between 0.5 and 0.6).
Currently, if a probe fails, it will print the line number of the [test]
block the probe is in, not the line number of the probe itself. This
makes it somewhat difficult to debug.
This commit makes it print the line number that a test fails at.
This prevents us from replacing a swizzle incorrectly, as in the
following example:
1: A.x = 1.0
2: A
3: A.x = 2.0
4: @2.x
where @4 ends up being 2.0 instead of 1.0, because that's the value stored in
A.x at time 4, whereas we should be querying it at time 2.
This also helps us avoid replacing a swizzle with itself in copy-prop,
which can result in infinite loops, as with the tests included in this commit.
Consider the following sequence of instructions:
1 : A
2 : B = @1
3 : B
4 : A = @3
5 : @1.x
Current copy-prop would replace 5 so it points to @3 now:
1 : A
2 : B = @1
3 : B
4 : A = @3
5 : @3.x
But in the next iteration it would make it point back to @1, keeping it
spinning infinitely.
The solution is to index the instructions and not replace the swizzle
if the new load happens after the current load.
The included test fails because copy_propagation_transform_swizzle()
is using the value recorded for the variable at the point where the
swizzle is read, and not at the point of the swizzle's load.
This was broken by commit e390bc35e2c9b0a2110370f916033eea2366317e; that
commit fixed the source count for these instructions, but didn't adjust
shader_sm1_skip_opcode(). Note that this only affects shader model 1;
later versions have a token count embedded in the initial opcode token.
For relative addressing, the vkd3d_shader_registers must point to
another vkd3d_shader_src_param. For now, use the sm4_instruction to save
them, since the only purpose of this struct is to be used as a parameter
for write_sm4_instruction().
I'm not sure what's happening here, but it seems that this change
fixes a crash when running on Windows in the CI. Since most of the
test excludes SM1-3 anyway, this shouldn't be a big loss.