The goal is to simplify the CS queue handling: with this and the following
changes operations are always started by d3d12_command_queue_flush_ops(),
in order to make further refactoring easier.
Notice that while with this change executing an operation on an empty CS
queue is a bit less efficient, it doesn't require more locking. On the other
hand, this change paves the road for executing CS operations without holding
the queue lock.
Otherwise it could be added more than once.
Note that the deleted comment is wrong: between when d3d12_command_queue_flush_ops()
returns and when the queue is added back to the blocked list, the queue
might have been pushed to and flushed an arbitrary number of times.
Otherwise it's not clear which clauses in vkd3d_shader_compile() really
apply to other functions. For example, many of the functions currently
refering to vkd3d_shader_compile() don't even take a vkd3d_shader_compile_info
parameter.
Without this patch, dp3 and dp4 map src swizzles to the dst writemask,
which is not correct.
Before b84f560bdf59c35e093e51bfdf9a166c196d3a9b, these ops worked
despite this, because the dst register had, incorrectly, the full
writemask.
To solve this problem, write_sm1_binary_op_dot() is introduced,
similarly to write_sm4_binary_op_dot().