It is meant as a generic pass to host all program changes to single
instructions that do not require keeping any global state, instead
of having to loop through the whole program many times.
This is in preparation for handling more than one function (as
happens for hull shaders), which will require having a single
row of declarations, but handling more than one CFG.
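
The idea of one walk hosting several stateless per-instruction rewrites
can be sketched as follows. This is a minimal Python illustration, not
the code in ir.c: the tuple-based instruction model, the
lower_instructions() helper and the "mad"/"nop" rewrites are all
assumptions made up for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical instruction model: (opcode, operands). Not the VSIR layout.
Instruction = Tuple[str, tuple]
# A lowering maps one instruction to the sequence that replaces it.
Lowering = Callable[[Instruction], List[Instruction]]


def lower_mad(ins: Instruction) -> List[Instruction]:
    # Hypothetical rewrite: mad dst, a, b, c -> mul dst, a, b; add dst, dst, c
    _, (dst, a, b, c) = ins
    return [("mul", (dst, a, b)), ("add", (dst, dst, c))]


def drop_nop(ins: Instruction) -> List[Instruction]:
    # Removing an instruction is just an empty replacement sequence.
    return []


LOWERINGS: Dict[str, Lowering] = {"mad": lower_mad, "nop": drop_nop}


def lower_instructions(program: Iterable[Instruction],
                       lowerings: Dict[str, Lowering]) -> List[Instruction]:
    """Apply every registered per-instruction rewrite in a single walk."""
    out: List[Instruction] = []
    for ins in program:
        handler = lowerings.get(ins[0])
        # Instructions without a registered lowering pass through unchanged.
        out.extend(handler(ins) if handler else [ins])
    return out
```

Each rewrite only looks at the instruction it is given, so adding another
one means adding a table entry rather than another loop over the program.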
RADV converts temps to phi instructions, so converting phis to MOVC in
VSIR just translates back to phis feeding into a MOVC. This commit
eliminates the MOVC.
vsir_program_normalise() calls basically everything in ir.c, so it's
useful to have it in an easily reachable place, from which you can
quickly jump to wherever you need using your favorite code editor's
features.
The label itself is certainly an unsigned integer, but the register
has no meaningful data type. It cannot be evaluated to anything.
The goal of this is to reduce clutter in the internal ASM dumps.
The new structurizer therefore reaches feature parity with the
older simple one, except for a couple of points:
* the old structurizer accepts any CFG, without requiring reducibility;
however, the DXIL specification requires the CFG to be reducible
anyway, so we're not really losing anything;
* the new structurizer additionally requires that no block has two
incoming back arrows; AFAIK this is a condition that can happen,
but in practice it seems to be rare; also, it's not hard to add
support for it, as soon as it is decided that it is useful.
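
The second restriction can be checked with a depth-first search. The
sketch below is an illustration in Python, not the structurizer's code;
the dictionary-based CFG and both function names are assumptions. It
counts an edge as a back arrow when its target is still on the DFS
stack, which matches the dominance-based definition only when the CFG is
reducible, as assumed here.

```python
def back_edge_counts(cfg, entry):
    """Return {block: number of incoming back edges} for blocks in `cfg`.

    `cfg` maps each block to the list of its successors (hypothetical model).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {block: WHITE for block in cfg}
    counts = {block: 0 for block in cfg}

    colour[entry] = GREY
    stack = [(entry, iter(cfg[entry]))]
    while stack:
        block, successors = stack[-1]
        for succ in successors:
            if colour[succ] == GREY:
                # The target is an ancestor on the stack: a back edge.
                counts[succ] += 1
            elif colour[succ] == WHITE:
                colour[succ] = GREY
                stack.append((succ, iter(cfg[succ])))
                break  # descend; this block's iterator resumes later
        else:
            # All successors handled: the block is finished.
            colour[block] = BLACK
            stack.pop()
    return counts


def has_single_back_edges(cfg, entry):
    """True when no block has two or more incoming back edges."""
    return all(n <= 1 for n in back_edge_counts(cfg, entry).values())
```

For example, a loop header reached by two different "continue" paths, as
in {0: [1], 1: [2, 3], 2: [1], 3: [1, 4], 4: []}, gets two back arrows
into block 1 and would fail this check.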
On the other hand, the new structurizer makes use of the merging
information that is reconstructed from the CFG, which is important
for downstream optimization and fundamental for correctly emitting
tangled instructions.
Taking these considerations into account, the old structurizer is
considered superseded and is therefore removed.
The structurizer is implemented along the lines of what is usually called
the "structured program theorem": the control flow is completely
virtualized by means of an additional TEMP register which stores the
index of the block which is currently running. The whole program is then
converted to a huge switch construction enclosed in a loop, executing
at each iteration the appropriate block and updating the register
depending on the block's jump instruction.
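
The shape of the resulting program can be sketched in a few lines. This
is a Python illustration of the loop-plus-switch construction, not the
emitted VSIR: the run_virtualised() helper, the dictionary standing in
for the switch, and the three example blocks are assumptions made up for
the example.

```python
def run_virtualised(blocks, entry, state):
    """Run `blocks` under one loop, dispatching on a block-index variable."""
    current = entry                # plays the role of the extra TEMP register
    while current is not None:     # the single enclosing loop
        body = blocks[current]     # stands in for the switch on the index
        current = body(state)      # each block ends by storing its target
    return state


# A hypothetical three-block CFG computing 1 + 2 + ... + n.
def block_init(s):
    s["i"], s["total"] = 0, 0
    return 1                       # unconditional jump to block 1


def block_test(s):
    return 2 if s["i"] < s["n"] else None   # conditional jump, or exit


def block_body(s):
    s["i"] += 1
    s["total"] += s["i"]
    return 1                       # back edge, expressed as an index update


BLOCKS = {0: block_init, 1: block_test, 2: block_body}
```

Calling run_virtualised(BLOCKS, 0, {"n": 4}) walks the blocks in the
order 0, 1, 2, 1, 2, ... purely by rewriting `current`; the original loop
structure is no longer visible in the control flow, which is the loss of
convergence information described below.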
The algorithm's generality is also its major weakness: it accepts any
input program, even if its CFG is not reducible, but the output
program lacks any useful convergence information. It satisfies the
letter of the SPIR-V requirements, but it is expected that it will
be very inefficient to run on a GPU (unless a downstream compiler is
able to devirtualize the control flow and do a proper convergence
analysis pass). The algorithm is however very simple, and good enough
to at least pass tests, enabling further development. A better
alternative is expected to be upstreamed incrementally.
Side note: the structured program theorem is often called the
Böhm-Jacopini theorem; Böhm and Jacopini did indeed prove a variation
of it, but their algorithm is different from what is commonly attributed
to them and implemented here, so I opted for not using their name.
For simplicity PHI nodes are not currently handled.
The goal for this pass is to make the CFG structurizer simpler, because
it doesn't have to care about the rules that SSA registers have to
satisfy, which are more rigid than those for TEMP registers.
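
As a rough picture of what such a pass does, the sketch below renames
every SSA register to a freshly allocated TEMP register. It is a Python
illustration under made-up assumptions: registers are modelled as
(kind, index) pairs, instructions as (opcode, registers) tuples, and
PHI nodes are ignored entirely.

```python
def materialise_ssas_to_temps(program, first_free_temp):
    """Rename each SSA register to a fresh TEMP register.

    Returns the rewritten program and the new TEMP count.
    """
    mapping = {}  # SSA index -> TEMP index

    def rename(reg):
        kind, idx = reg
        if kind != "ssa":
            return reg  # TEMPs and other registers are left alone
        if idx not in mapping:
            # Allocate TEMPs past the ones the program already uses.
            mapping[idx] = first_free_temp + len(mapping)
        return ("temp", mapping[idx])

    rewritten = [(op, tuple(rename(r) for r in regs)) for op, regs in program]
    return rewritten, first_free_temp + len(mapping)
```

After the renaming, later passes can write the same register from
several blocks, which an SSA register would not allow.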
It is likely that the generated code will be harder for downstream
compilers to optimize and execute efficiently, so once a complete
structurizer is in place this pass should be removed, or at least
greatly reduced in scope.
PHI nodes must be fixed up after this pass, because the block references
might have become broken. For simplicity this is not handled yet.
The goal for this pass is to make the CFG structurizer simpler, because
only conditional and unconditional branches must be supported.
Eventually this limitation might be lifted if there is an advantage in
doing so.
The relative-addressed case in shader_register_normalise_arrayed_addressing()
leaves the control point id in idx[0], while for constant register
indices it is placed in idx[1]. The latter case could be fixed instead,
but placing the control point count in the outer dimension is more
logical.
For example, this occurred in a shader:
    reg_idx  write_mask
    0        xyz
    1        xyzw
    2        xyzw
    3        xyz
The dcl_indexrange instruction covered only xyz, so once merged, searching for
xyzw failed.
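
The mismatch can be reproduced with plain bitmasks. The sketch below is
a Python illustration only: the bit values, helper names and the
REGISTER_MASKS table are assumptions mirroring the example above, and it
shows the failing comparison rather than the fix.

```python
# Hypothetical component bits, one per vector component.
COMPONENT_BITS = {"x": 1, "y": 2, "z": 4, "w": 8}


def parse_mask(text):
    """Convert a write mask such as "xyz" to a bitmask."""
    mask = 0
    for component in text:
        mask |= COMPONENT_BITS[component]
    return mask


def merged_mask(masks):
    """Union of the write masks of all registers merged into one range."""
    result = 0
    for text in masks:
        result |= parse_mask(text)
    return result


# Per-register write masks from the example above.
REGISTER_MASKS = {0: "xyz", 1: "xyzw", 2: "xyzw", 3: "xyz"}

range_mask = parse_mask("xyz")                    # what the declaration covered
merged = merged_mask(REGISTER_MASKS.values())     # what the merge produced
lookup_succeeds = (merged == range_mask)          # exact-match search
```

Here `merged` has the w bit set while `range_mask` does not, so an
exact-match search for the merged mask finds nothing.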
It is impossible to declare an input array where elements have different
component counts, but the optimiser can create this case. One way for
this to occur is to dynamically index input values via a local array
containing copies of the input values. The optimiser converts this to
dynamically indexed inputs.