I realized that it is better to lower casts to int to FLOOR+REINTERPET
instead of appending a FLOOR to all casts to int and assuming that this
is the case for all of them in d3dbc.c.
This in case we introduce new passes in the future that add casts that
we forget to lower, after the lower_casts_to_bool pass.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=56162
Storing to a vector component using a non-constant index is not allowed
on profiles lower than 6.0. Unless this happens inside a loop that can be
unrolled, which we are not doing yet.
For this reason, a validate_nonconstant_vector_store_derefs pass is
added to detect these cases.
Ideally we would want to emit an hlsl_error on this pass, but before
implementing loop unrolling, we could reach this point on valid HLSL.
Also, as pointed out by Nikolay in the mentioned bug, currently
new_offset_from_path_index() fails an assertion when this happens,
because it expects an hlsl_ir_constant, so a check is added.
It also felt correct to emit an hlsl_fixme there, despite the
redundancy.
This preempts us from replacing a swizzle incorrectly, as in the
following example:
1: A.x = 1.0
2: A
3: A.x = 2.0
4: @2.x
were @4 ends up being 2.0 instead of 1.0, because that's the value stored in
A.x at time 4, and we should be querying it at time 2.
This also helps us to avoid replacing a swizzle with itself in copy-prop
which can result in infinite loops, as with the included tests this commit.
Consider the following sequence of instructions:
1 : A
2 : B = @1
3 : B
4 : A = @3
5 : @1.x
Current copy-prop would replace 5 so it points to @3 now:
1 : A
2 : B = @1
3 : B
4 : A = @3
5 : @3.x
But in the next iteration it would make it point back to @1, keeping it
spinning infinitively.
The solution is to index the instructions and don't replace the swizzle
if the new load happens after the current load.
Instead of only storing the value that each variable's component has at
the moment of the instruction currently handled by copy-prop, we store
the trace of all the historic values with their timestamps, i.e. the
instruction index on which the value was stored.
This would allow us to query the value that the variable had at the time
of execution of previous instructions.
The choice to store them in an rbtree was made early on. It does not seem likely
that HLSL programs would define many overloads for any of their functions, but I
suspect the idea was rather that intrinsics would be defined as plain
hlsl_ir_function_decl structures [cf. 447463e590]
and that some intrinsics that could operate on any type would therefore need
many overrides.
This is not how we deal with intrinsics, however. When the first intrinsics were
implemented I made the choice disregard this intended design, and instead match
and convert their types manually, in C. Nothing that has happened in the time
since has led me to question that choice, and in fact, the flexibility with
which we must accommodate functions has led me to believe that matching in this
way was definitely the right choice. The main other designs I see would have
been:
* define each intrinsic variant separately using existing HLSL types. Besides
efficiency concerns (i.e. this would take more space in memory, and would take
longer to generate each variant), the normal type-matching rules don't really
apply to intrinsics.
[For example: elementwise intrinsics like abs() return the same type as the
input, including preserving the distinction between float and float1. It is
legal to define separate HLSL overloads taking float and float1, but trying to
invoke these functions yields an "ambiguous function call" error.]
* introduce new (semi-)generic types. This is far more code and ends up acting
like our current scheme (with helpers) in a slightly more complex form.
So I think we can go ahead and rip out this vestige of the original design for
intrinsics.
As for why to change it: rbtrees are simply more complex to deal with, and it
seems unlikely to me that the difference is going to matter. I do not expect any
program to define large quantities of intrinsics; linked list search should be
good enough.
This field is now analogous to vkd3d_shader_register_index.rel_addr.
Also, it makes sense to rename it now because all the constant part of
the offset is now handled to hlsl_deref.const_offset. Consequently, it
may also be NULL now.
This uint will be used for the following:
- Since SM4's relative addressing (the capability of passing a register
as an index to another register) only has whole-register granularity,
we will need to make the offset node express the offset in
whole-registers and specify the register component in this uint,
otherwise we would have to add additional / and % operations in the
output binary.
- If, after we apply constant folding and copy propagation, we determine
that the offset is a single constant node, we can store all the offset
in this uint constant, and remove the offset src.
This allows DCE to remove a good bunch of the nodes previously required
only for the offset constants, which makes the output more liteweight
and readable, and simplifies the implementation of relative addressing
when writing tpf in the following patches.
In dump_deref(), we use "c" to indicate components instead of whole
registers. Since now both the offset node and the offset uint are in
components a lowered deref would look like:
var[@42c + 2c]
But, once we express the offset node in whole registers we will remove
the "c" from the node part:
var[@22 + 3c]
Some functions work with dereferences and need to know if they are
lowered yet.
This can be known checking if deref->offset.node is NULL or
deref->data_type is NULL. I am using the latter since it keeps working
even after the following patches that split deref->offset into
constant and variable parts.
We have to distinguish between the "bind count" and the "allocation size"
of variables.
The "allocation size" affects the starting register id for the resource to
be allocated next, while the "bind count" is determined by the last field
actually used. The former may be larger than the latter.
What we are currently calling hlsl_reg.bind_count is actually the
"allocation size", so a rename is in order.
The real "bind count", which will be introduced in following patches,
is important because it is what should be shown in the RDEF table and
some resource allocation rules depend on it.
For instance, for this shader:
texture2D texs[3];
texture2D tex;
float4 main() : sv_target
{
return texs[0].Load(int3(0, 0, 0)) + tex.Load(int3(0, 0, 0));
}
the variable "texs" has a "bind count" of 1, but an "allocation size" of
3:
// Resource Bindings:
//
// Name Type Format Dim HLSL Bind Count
// ------------------------------ ---------- ------- ----------- -------------- ------
// texs texture float4 2d t0 1
// tex texture float4 2d t3 1
We are using the hlsl_ir_var.is_uniform flag to indicate when an object
is a uniform copy created from a variable with the HLSL_STORAGE_UNIFORM
modifier.
We should be checking for this instead of the HLSL_STORAGE_UNIFORM flag
which is also set to 1 for the original variables, and there should be
no reason to use this flag instead of "is_uniform" after the uniform
copies and combined/separated samplers are created.
After lowering the derefs path to a single offset node, there was no way
of knowing the type of the referenced part of the variable. This little
modification allows to avoid having to pass the data type everywhere and
it is required for supporting instructions that reference objects
components within struct types.
Since deref->data_type allows us to retrieve the type of the deref,
deref->offset_regset is no longer necessary.
lower_narrowing_casts() currently creates a new cast calling
hlsl_new_cast(). This cast may be redundant, but it is not folded, which
is making SM1 emit an unnecessary fixme in some shaders:
Aborting due to not yet implemented feature: SM1 "cast" expression.
Other passes that call hlsl_new_cast() are lower_int_division() and
lower_int_modulus(), so the new fold_redundant_casts() pass is called
after these as well.
Non-constant vector indexing is not solved with relative addressing
in the register indexes because this indexation cannot be at the level
of register-components.
Mathematical operations must be used instead.
Variables that contain more than one object (arrays or structs) require
the allocation of contiguous registers in the respective object
register spaces.
This patch makes index expressions on resources hlsl_ir_index nodes
instead of hlsl_ir_resource_load nodes, because it is not known if they
will be used later as the lhs of an hlsl_ir_resource_store.
For now, the only benefit is consistency.
Since in SM1 all vector types use 4 register components, and since SM1
doesn't consider vectors of different dimx incompatible, it is necessary
to ensure that the semantic var is created with dimx=4, and to add a
cast node.
The use of the hlsl_semantic.reported_duplicated_output_next_index field
allows reporting multiple overlapping indexes, such as in the following
vertex shader:
void main(out float1x3 x : OVERLAP0, out float1x3 y : OVERLAP1)
{
x = float3(1.0, 2.0, 3.2);
y = float3(5.0, 6.0, 5.0);
}
apple.hlsl:1:41: E5013: Output semantic "OVERLAP1" is used multiple times.
apple.hlsl:1:13: First use of "OVERLAP1" is here.
apple.hlsl:1:41: E5013: Output semantic "OVERLAP2" is used multiple times.
apple.hlsl:1:13: First use of "OVERLAP2" is here.
While at the same time avoiding reporting overlaps more than once for
large arrays:
struct apple
{
float2 p : sv_position;
};
void main(out apple aps[4])
{
}
apple.hlsl:3:8: E5013: Output semantic "sv_position0" is used multiple times.
apple.hlsl:3:8: First use of "sv_position0" is here.
From this point on, it is no longer true that only hlsl_ir_loads can
return objects, because an object can also come from chain of
hlsl_ir_indexes that ends in an hlsl_ir_load.
The lower_index_loads pass takes care of lowering all hlsl_ir_indexes
into hlsl_ir_loads.
For this reason, hlsl_resource_load_params now expects both the resource
as the sampler to be just an hlsl_ir_node pointer instead of a pointer
to a more specific hlsl_ir_load.
This node type is intended for use during parse-time.
While we parse an indexing expression such as "a[3]", we don't know if
it will end up as part of an expression (in which case it must be folded
into a load) or it is for the lhs of a store (in which case it must be
folded into the store's deref).
Otherwise we may create nodes of different dimensions than the ones we
are replacing.
"count" is the number of components of the source deref (without
considering the swizzle), while "instr_component_count" is the actual
number of components of the instruction to be replaced.
Because of the change introduced in
f21693b2 vkd3d-shader/hlsl: Use reg_size as component count when allocating a single register.
SM1 scalars and vectors were not longer getting the correct writemask
when they are allocated.
This happened because they have to reserve the whole register even if
they only use some of its components, so their reg_size may differ from
the number of components.
This commit fixes that.
Co-authored-by: Francisco Casas <fcasas@codeweavers.com>
Co-authored-by: Zebediah Figura <zfigura@codeweavers.com>
Because copy_propagation_transform_object_load() replaces a deref
instead of an instruction, it is currently prone to two problems:
1- It can replace a deref with the same deref, returning true every
time and getting the compilation stuck in an endless loop of
copy-propagation iterations.
2- When performed multiple times in the same deref, the second time it
can replace the deref with a deref from a temp that is only valid in
another point of the program execution, resulting in an incorrect value.
This patch preempts this by avoiding replacing derefs when the new deref
doesn't point to a uniform variable. Because, uniform variables cannot
be written to.
If a hlsl_ir_load loads a variable whose components are stored from different
instructions, copy propagation doesn't replace it.
But if all these instructions are constants (which currently is the case
for value constructors), the load could be replaced with a constant value.
Which is expected in some other instructions, e.g. texel_offsets when
using aoffimmi modifiers.
For instance, this shader:
```
sampler s;
Texture2D t;
float4 main() : sv_target
{
return t.Gather(s, float2(0.6, 0.6), int2(0, 0));
}
```
results in the following IR before applying the patch:
```
float | 6.00000024e-01
float | 6.00000024e-01
uint | 0
| = (<constructor-2>[@4].x @2)
uint | 1
| = (<constructor-2>[@6].x @3)
float2 | <constructor-2>
int | 0
int | 0
uint | 0
| = (<constructor-5>[@11].x @9)
uint | 1
| = (<constructor-5>[@13].x @10)
int2 | <constructor-5>
float4 | gather_red(resource = t, sampler = s, coords = @8, offset = @15)
| return
| = (<output-sv_target0> @16)
```
and this IR afterwards:
```
float2 | {6.00000024e-01 6.00000024e-01 }
int2 | {0 0 }
float4 | gather_red(resource = t, sampler = s, coords = @2, offset = @3)
| return
| = (<output-sv_target0> @4)
```
Rename it to copy_propagation_replace_with_single_instr() accordingly.
The idea is to introduce a constant vector replacement pass which will do the
same thing.
copy_propagation_compute_replacement() is not doing very much for us, and
conceptually is a bit of an odd fit anyway, since it's meant to deal with
multi-component types.
Note that in the future we should call
validate_static_object_references() after DCE and pruning branches,
because shaders such as these compile (at least in more modern versions
of the native compiler):
Branch pruning:
```
static RWTexture2D<float> tex;
float4 main() : sv_target
{
if (0)
{
tex[int2(0, 0)] = 2;
}
return 0;
}
```
DCE:
```
static Texture2D tex;
uniform uint i;
float4 main() : sv_target
{
float4 unused = tex.Load(int3(0, 1, 2));
return 0;
}
```
These are "todo" tests in hlsl-static-initializer.shader_test
that depend on this.
Otherwise, for instance, the added test results in:
debug_hlsl_writemask: Assertion `!(writemask & ~VKD3DSP_WRITEMASK_ALL)' failed.
Which happens in allocate_variable_temp_register() when the variable's
type reg_size is <= 4 but its component count is larger, which may
happen if it contains objects.
This should silence warnings about some branches non returning any value
without requiring additional "return 0" statement or similar.
Also, in theory this might enable to compiler to optimize the program
a little bit more, though that's unlikely to have any measurable effect.
It is responsibility of the shader's programmer to ensure that
object references can be solved statically.
Resource arrays for ps_5_1 and vs_5_1 are an exception which is not
properly handled yet. They probably deserve a different object type.
Signed-off-by: Francisco Casas <fcasas@codeweavers.com>
hlsl_new_store() and hlsl_new_load() are deleted, so now there are no more
direct ways to create derefs with offsets in hlsl.c and hlsl.h.
Signed-off-by: Francisco Casas <fcasas@codeweavers.com>
Signed-off-by: Giovanni Mascellani <gmascellani@codeweavers.com>
Signed-off-by: Francisco Casas <fcasas@codeweavers.com>
Signed-off-by: Zebediah Figura <zfigura@codeweavers.com>
Signed-off-by: Giovanni Mascellani <gmascellani@codeweavers.com>
This can be done now, to ensure that register offsets are no longer used
in hlsl.c and hlsl.h.
Signed-off-by: Francisco Casas <fcasas@codeweavers.com>
Signed-off-by: Giovanni Mascellani <gmascellani@codeweavers.com>