The code doesn't make sense in the first place, even if it's
accepted by the compiler, so it makes sense that the behaviour
is undefined. And indeed the behaviour is different on AMD (4 is
returned), NVIDIA (QNaN is returned) and WARP (device is removed).
The stride didn't match the structure size used in the shader.
This didn't seem to be a problem on AMD and WARP, but it was
on NVIDIA on Windows. Specifically, it seems that the buffer
is read using the shader structure size (so most tests pass),
but bounds are checked using the buffer stride, so a test
returned zero simply because an out-of-bounds read was detected.
Note that we still have to preempt the propagation to SM1 pixel shader
uniforms. Otherwise this will turn the many constant derefs that appear
from the <index-val> copy generated in lower_index_loads() into a single
non-constant deref, causing it to allocate all the registers instead of
up until the last one used.
Here, a vertex shader version of the previous test by Shaun is
introduced. Note that in this case the uniform allocates all 4 registers
instead of 3 because it is indirectly addressed.
Note that, for indexes with a decimal part, the behavior is different
depending on whether it is a temp load or a direct uniform load (which
can only happen on vertex shaders). The former rounds to the
closest-to-zero, while the latter rounds to the nearest even.