Mostly to be able to associate a version number with each tag and
get rid of all the foo<1.2.3 tags. The new system also has fixed
tag slots, rather than dealing with strings, so we don't have to
manually adjust the size of the `tags' array.
With the new system each tag can be present or not, and if it is
present it can have an associated version number (of the form
major.minor.patch). If the version is not available, it is set to
0.0.0. Each tag can be queried for existence, and its version can
be compared against a given version number.
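As a rough sketch of the idea (the names and tags below are
invented and don't necessarily match the actual implementation),
the fixed slots could look something like this:

    #include <stdbool.h>

    enum test_tag
    {
        TAG_MVK,
        TAG_LLVMPIPE,
        TAG_COUNT,
    };

    struct tag_info
    {
        bool present;
        /* 0.0.0 when the version is not available. */
        unsigned int major, minor, patch;
    };

    static struct tag_info tags[TAG_COUNT];

    static bool tag_present(enum test_tag tag)
    {
        return tags[tag].present;
    }

    /* True if the tag is present and its version is strictly lower
     * than major.minor.patch, roughly what "foo<1.2.3" used to mean. */
    static bool tag_version_lt(enum test_tag tag, unsigned int major,
            unsigned int minor, unsigned int patch)
    {
        const struct tag_info *info = &tags[tag];

        if (!info->present)
            return false;
        if (info->major != major)
            return info->major < major;
        if (info->minor != minor)
            return info->minor < minor;
        return info->patch < patch;
    }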
So we have a more recent version of SPIRV-Tools and also don't
have to recompile Mesa to test llvmpipe. This fixes a few failing
tests, but also breaks a couple.
This used to work when the macOS runner had a Sonoma host system.
Now it has Sequoia, even though the guest is still Sonoma, and the
test crashes with:
[mvk-error] VK_TIMEOUT: MTLCommandBuffer "vkQueueSubmit MTLCommandBuffer on Queue 3-0" execution failed (code 2): Caused GPU Hang Error (00000003:kIOGPUCommandBufferCallbackErrorHang)
vkd3d:56072:err:vkd3d_wait_for_gpu_timeline_semaphore Failed to wait for Vulkan timeline semaphore, vr -4.
Upgrading MoltenVK or the guest to Sequoia doesn't seem to help.
I haven't investigated the issue, but in my experience the
paravirtualized Metal driver has a number of problems.
It seems that the NVIDIA driver leaves VBV bindings untouched
when they are NULL (or the GPU buffer address is NULL), instead of
setting them to a null binding.
Unlike other cases of inconsistent behaviour between AMD and
NVIDIA, here I'm explicitly marking the NVIDIA behaviour as
broken, because the expected behaviour is spelled out explicitly
(at least by D3D12 specification standards).
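For reference, the pattern at stake is roughly the following (a
minimal sketch, not the actual test; the helper name and setup are
assumed, and the usual d3d12 headers are taken for granted):

    /* Bind a valid vertex buffer view to slot 0, then rebind the
     * slot with a zeroed view (GPU VA of 0).  Per the D3D12
     * specification the slot should now behave as a null binding;
     * the NVIDIA driver appears to keep the previous binding
     * instead. */
    static void rebind_slot0_with_null_view(ID3D12GraphicsCommandList *list,
            const D3D12_VERTEX_BUFFER_VIEW *valid_view)
    {
        static const D3D12_VERTEX_BUFFER_VIEW null_view;  /* BufferLocation == 0 */

        ID3D12GraphicsCommandList_IASetVertexBuffers(list, 0, 1, valid_view);
        ID3D12GraphicsCommandList_IASetVertexBuffers(list, 0, 1, &null_view);
    }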
It's hard to pinpoint exactly what's going wrong with these
tests. They seem to be related to atomics and GPU timestamps,
both categories known to have problems on MoltenVK in one way or
another; those failures clearly depend on a few factors
like the MoltenVK version, the macOS version and whether we're in
a virtual machine or not, but the exact dependency on those factors
is hard to describe (for example, in general the paravirtualized
device offered inside virtual machines has a lot more problems than
real devices, but I've seen tests, with all other conditions
fixed, working on the paravirtualized device and not on the real
device).
The only thing all tests in this batch have in common is that I've
never seen them fail on a Sequoia system, thus I've settled for
using just that as the bug_if() condition. Ultimately, spending a
lot of time getting to the bottom of each individual test failure
is pointless, and being able to mark the CI job as not allowed to
fail gives better regression protection than investigating each of
those failures. Also, I routinely run the tests on a Sequoia
system, so if these tests break, that will be noticed anyway.
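In C terms the intent is roughly the following; is_sequoia_host()
is a hypothetical predicate standing in for however the condition
ends up being expressed:

    /* Only require this check to pass on Sequoia systems; anywhere
     * else a failure is tolerated as a known bug. */
    bug_if(!is_sequoia_host())
    ok(value == expected, "Got unexpected value.\n");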
Some test programs, particularly the shader runner, are built from
many different files nowadays, and a line number is relatively
cumbersome to use if you don't know which file that line comes from.
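A hypothetical illustration of the difference (the macro is made
up, not the test framework's actual one):

    #include <stdio.h>

    /* Before: "Test failed at line 123." -- line 123 of what?
     * After:  "Test failed at shader_runner_d3d12.c:123." */
    #define report_failure() \
            fprintf(stderr, "Test failed at %s:%d.\n", __FILE__, __LINE__)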
I don't know the exact driver version that fixed this todo, but
this test was failing on the same hardware a couple of years ago,
so I presume something was fixed at some point. I am putting down
my current driver version, but a lower one might turn out to be
sufficient.