This can be used by all test users. Avoids duplicated code. We can handle all
known issues in one place.
There is a Docker bug which causes restore to fail sporadically. See
https://github.com/moby/moby/issues/42900. This has been broken at least since
Docker v19.03.12 (when the issue was reported) and was fixed in v25.0.4. Added
the handling for this issue.
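As a rough illustration of how a caller might decide whether the workaround is still relevant, the sketch below compares a Docker version string against the fixed release named above. `hasRestoreFix` and `parseVersion` are hypothetical helpers, not part of this change, and pre-release suffixes are simply ignored.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fixedIn is the Docker release that the text above names as containing the fix.
var fixedIn = [3]int{25, 0, 4}

// parseVersion turns "v19.03.12" or "25.0.4-rc.1" into {major, minor, patch}.
// Anything after a '-' or '+' is dropped.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in version %q: %w", p, v, err)
		}
		out[i] = n
	}
	return out, nil
}

// hasRestoreFix (hypothetical) reports whether version is at or after fixedIn.
func hasRestoreFix(version string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	for i := range got {
		if got[i] != fixedIn[i] {
			return got[i] > fixedIn[i], nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"19.03.12", "25.0.3", "25.0.4", "26.1.0"} {
		ok, err := hasRestoreFix(v)
		fmt.Println(v, ok, err)
	}
}
```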
Also removed the testutil.Poll() around restore, which was added in
0990ef7517 ("Make checkpoint/restore e2e test less flaky"), because polling
can hide gVisor restore flakiness. The original sleep is back in its place.
PiperOrigin-RevId: 734303878
Unlike Ubuntu VMs where we use Docker's `--gpus` flag, COS VMs do not use
this flag and instead mount the NVIDIA library directories automatically.
However, nothing guarantees that these directories are added to the LD
config. This change fixes that. It takes advantage of the fact that all GPU
tests have the sniffer binary as entrypoint, which slightly overloads the
role of the sniffer within the GPU test infrastructure... but then again
the ioctl sniffer is already deeply intertwined with ld configuration
because it already overrides the `ioctl` libc function, so this doesn't
seem like too big of a stretch.
This change makes the ffmpeg test succeed with `runc` on COS, but it still
fails with gVisor (with `CUDA_ERROR_OUT_OF_MEMORY` errors). So there must be
some further gVisor-specific error.
Updates #11351
Updates #11321
PiperOrigin-RevId: 715222952
This includes the ability to force using local images only (do not check
for updated manifests), and to explicitly mount and specify external caches
for Go repositories via `rules_go`'s `GO_REPOSITORY_USE_HOST_MODCACHE`.
PiperOrigin-RevId: 713473497
This test was broken by 5e6589e0b7 ("Update CUDA test compatibility to keep
up with added gVisor support.") which requires all
images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities.
PiperOrigin-RevId: 713170590
These CUDA tests were initially broken in gVisor but now appear to pass.
The test now also verifies that all capabilities are enabled when running.
PiperOrigin-RevId: 713094806
Without this, tests that `exec` commands into existing GPU containers (e.g.
the CUDA samples test) would not actually get ioctl compatibility
enforcement.
Updates issue #10885.
PiperOrigin-RevId: 686686310
This makes it easier to deal with client/server-type GPU tests, such as
ollama or vLLM, where the main GPU process is a long-running one. Prior to
this CL, when enforcement mode was set on such a process, compatibility was
only checked if the test had explicit code to shut the container down in an
orderly way by sending a signal to the server process, giving the ioctl
sniffer a chance to produce its report. This is easy to miss, because
nothing indicates that the ioctl sniffer is being bypassed in such tests.
Now that the sniffer exits as soon as an unsupported ioctl is found, such
code is no longer needed.
However, the previous behavior is still useful when testing new
applications, so it remains available. The `--enforce_compatibility` flag
is now a tri-state flag: enforcement can be turned off, applied on the spot
(as soon as an unsupported ioctl is seen), or applied at exit time.
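A tri-state flag like this can be sketched in Go with a custom `flag.Value`. This is only an illustration: the type name and the value spellings (`off`, `instant`, `on_exit`) are assumptions, not necessarily what the sniffer accepts.

```go
package main

import (
	"flag"
	"fmt"
)

// enforcementMode is a hypothetical tri-state value for a flag such as
// --enforce_compatibility. The mode names below are illustrative.
type enforcementMode int

const (
	enforceOff     enforcementMode = iota // Do not enforce compatibility.
	enforceInstant                        // Exit as soon as an unsupported ioctl is seen.
	enforceOnExit                         // Report unsupported ioctls at exit time.
)

// String implements flag.Value.
func (m *enforcementMode) String() string {
	if m == nil {
		return "off"
	}
	switch *m {
	case enforceInstant:
		return "instant"
	case enforceOnExit:
		return "on_exit"
	default:
		return "off"
	}
}

// Set implements flag.Value and rejects unknown spellings.
func (m *enforcementMode) Set(s string) error {
	switch s {
	case "off":
		*m = enforceOff
	case "instant":
		*m = enforceInstant
	case "on_exit":
		*m = enforceOnExit
	default:
		return fmt.Errorf("invalid enforcement mode %q", s)
	}
	return nil
}

func main() {
	var mode enforcementMode
	fs := flag.NewFlagSet("sniffer", flag.ContinueOnError)
	fs.Var(&mode, "enforce_compatibility", "one of: off, instant, on_exit")
	if err := fs.Parse([]string{"--enforce_compatibility=instant"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("mode:", mode.String())
}
```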
Updates issue #10885.
PiperOrigin-RevId: 686653876
This wraps all GPU tests' command line with the nvproxy ioctl sniffer.
This has multiple functions:
- Verifying that the application does not call ioctls unsupported by
nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which
is initially set to `true` in all tests to mirror current behavior, but
should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
of applications.
- Later down the line, enforcing that the application only calls ioctls
that are part of GPU capabilities that it has a need for. This is
controlled by a capability string which is currently only used to set
the `NVIDIA_DRIVER_CAPABILITIES` environment variable.
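The wrapping described above can be sketched roughly as follows. `SniffOptions`, `wrapWithSniffer`, the `Capabilities` field name, and the way enforcement is passed to the sniffer are all assumptions made for illustration; only `AllowIncompatibleIoctl` and `NVIDIA_DRIVER_CAPABILITIES` come from the description.

```go
package main

import "fmt"

// SniffOptions is a hypothetical options struct for a wrapped GPU command.
type SniffOptions struct {
	// AllowIncompatibleIoctl disables enforcement when true.
	AllowIncompatibleIoctl bool
	// Capabilities is the capability string (hypothetical field name).
	Capabilities string
}

// wrapWithSniffer returns the command line prefixed with the sniffer binary,
// plus the environment variables the container should get.
func wrapWithSniffer(snifferPath string, opts SniffOptions, argv []string) (cmd, env []string) {
	cmd = append(cmd, snifferPath)
	if !opts.AllowIncompatibleIoctl {
		// Assumed flag spelling; the real sniffer may differ.
		cmd = append(cmd, "--enforce_compatibility")
	}
	cmd = append(cmd, argv...)
	if opts.Capabilities != "" {
		env = append(env, "NVIDIA_DRIVER_CAPABILITIES="+opts.Capabilities)
	}
	return cmd, env
}

func main() {
	opts := SniffOptions{AllowIncompatibleIoctl: true, Capabilities: "compute,utility"}
	cmd, env := wrapWithSniffer("/ioctl_sniffer", opts, []string{"nvidia-smi"})
	fmt.Println(cmd, env)
}
```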
Updates issue #10856
PiperOrigin-RevId: 672714520
Recently printf.Analyzer has become stricter
(https://github.com/golang/go/issues/60529)
which led to new findings.
gVisor nogo tests run this analyzer and fail if it produces findings.
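For illustration, and assuming the new findings are of the "non-constant format string with no other arguments" kind, a typical fix looks like the sketch below. The function name is made up for this example.

```go
package main

import "fmt"

// report formats a message that may come from outside the program.
//
// Writing fmt.Sprintf(msg) here, with msg used as the format string and no
// other arguments, is the kind of call the stricter analyzer is assumed to
// flag: any '%' inside msg would be interpreted as a formatting verb.
// Passing msg as an argument to a constant format avoids that.
func report(msg string) string {
	return fmt.Sprintf("error: %s", msg)
}

func main() {
	fmt.Println(report("disk is 100% full"))
}
```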
PiperOrigin-RevId: 671657227
This forwards the output of `runsc debug` to stderr if it fails during
a container run. Additionally, for runs with profiling enabled, it checks
the runtime arguments and prints an error if the `--profile` flag is not
found in them.
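A check of this kind could look like the minimal sketch below. `hasProfileFlag` is a hypothetical helper, and matching both `--profile` and `--profile=value` is an assumption about how the flag may be spelled in the runtime arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// hasProfileFlag reports whether runtimeArgs contains "--profile", either
// bare or in "--profile=value" form. Flags that merely share the prefix
// (for example a hypothetical "--profile-cpu") do not count.
func hasProfileFlag(runtimeArgs []string) bool {
	for _, arg := range runtimeArgs {
		if arg == "--profile" || strings.HasPrefix(arg, "--profile=") {
			return true
		}
	}
	return false
}

func main() {
	args := []string{"--debug", "--debug-log=/tmp/runsc/"}
	if !hasProfileFlag(args) {
		fmt.Println("profiling requested, but --profile is missing from runtime args:", args)
	}
}
```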
Profiling is also disabled by default on benchmarks now. This forces the
user to be explicit about where profiles are stored, which is less
confusing than the current behavior of empty output with no explanation
as to where the profiles are.
Fixes #10433
PiperOrigin-RevId: 642132089
This is part of a series of changes to add metric charts in performance
benchmarks.
This is intended to increase the reliability of logs-based profiling metrics
by ensuring that Docker keeps enough of them around during long tests, or
tests where there are a lot of metrics being profiled, or tests with a very
fast profiling rate, or all of the above.
This also calls `sync` on the underlying logging file descriptor if possible.
It also gives a more helpful error message if a log buffer overrun does occur.
PiperOrigin-RevId: 633389921
This is a set of CUDA tests defined by NVIDIA in this repository:
https://github.com/NVIDIA/cuda-samples
This change introduces a large new test (`cuda_test`) which runs each CUDA
sample test in a container.
There are many subtleties involved: the CUDA samples repository isn't always
meant to be run as a test, some samples are graphical applications, many
require specific CUDA features that not all NVIDIA GPUs support, some require
multiple GPUs on the machine, etc.
Therefore, the test maps each sample test to its `Compatibility` data, which
determines whether or not that sample is expected to fail when run in a
certain environment. The overall test also has a
`--cuda_verify_compatibility` flag to verify that this mapping is accurate,
by running expected-to-be-broken tests and verifying that their failure
matches how this expected failure typically manifests.
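The mapping idea can be sketched as below. This is a simplified illustration: the `Env` and `Compatibility` fields are assumptions, and in verify mode the sketch only checks that an expected-to-fail sample fails at all, not how the failure manifests.

```go
package main

import (
	"errors"
	"fmt"
)

// Env describes the environment a sample runs in (hypothetical fields).
type Env struct {
	IsGVisor bool
	GPUCount int
}

// Compatibility is a simplified stand-in for per-sample compatibility data.
type Compatibility struct {
	MinGPUs        int
	BrokenInGVisor bool
}

// ExpectedFailure returns a non-empty reason if the sample is expected to
// fail in environment e.
func (c Compatibility) ExpectedFailure(e Env) string {
	switch {
	case e.GPUCount < c.MinGPUs:
		return fmt.Sprintf("needs %d GPUs, have %d", c.MinGPUs, e.GPUCount)
	case c.BrokenInGVisor && e.IsGVisor:
		return "known to be broken in gVisor"
	}
	return ""
}

// check runs one sample. verify plays the role of --cuda_verify_compatibility:
// expected-to-fail samples are run anyway, and passing is reported as an error.
func check(name string, c Compatibility, e Env, verify bool, run func() error) error {
	reason := c.ExpectedFailure(e)
	if reason == "" {
		return run()
	}
	if !verify {
		return nil // Expected to fail here; skip it.
	}
	if err := run(); err == nil {
		return fmt.Errorf("%s: expected failure (%s) but it passed", name, reason)
	}
	return nil
}

func main() {
	env := Env{IsGVisor: true, GPUCount: 1}
	multiGPU := Compatibility{MinGPUs: 2}
	err := check("someMultiGPUSample", multiGPU, env, true, func() error {
		return errors.New("no peer GPU found")
	})
	fmt.Println("result:", err)
}
```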
Because there are a lot of CUDA sample tests (213 as of the CUDA 12.3 release
of the cuda-samples repo), and they don't all require the whole GPU to
themselves, and spawning a GPU-using container is expensive (~seconds),
the test uses a pool of reusable containers in which it `exec`s tests (at
most one per container at any given time, but in parallel across containers).
If any test unexpectedly fails, we drain the entire pool of containers and
only run this one test without anything else running on other containers.
This de-flakes tests, especially those that fail because they require more
resources than the GPU has when other tests are using it at the same time.
However, this removes parallelism and therefore increases test time
significantly.
Despite these optimizations, the test is very long and has the maximum deadline
of 1 hour. Because it may get close to the timeout (especially when
`--cuda_verify_compatibility` is on, because that needs to run all the tests
even if they are known to fail, and the ones that do fail can hang for a
while rather than crash), the test also has more logging and debugging than
the typical test, as enabled with the `--cuda_log_successful_tests` and
`--cuda_test_debug` flags. It periodically logs the status of each container
and the pool's utilization ratio. This is useful when debugging the test to
see which pooled container is doing what, and/or to `docker exec` into
containers while they are running a certain test.
The test also does its own timekeeping, which is useful so that it can print
a more helpful failure message that distinguishes between tests failing due
to actual failure reasons vs those that are failing purely because the test
timed out.
To run the test manually (from a VM with the repo checked out):
```
$ docker build -t gvisor.dev/images/gpu/cuda-tests images/gpu/cuda-tests -f images/gpu/cuda-tests/Dockerfile.x86_64 && mkdir -p bin && make copy TARGETS=runsc DESTINATION=bin/ && ./bin/runsc install -- --nvproxy=true --debug=true --debug-log=/tmp/runsc/ && systemctl reload docker && make test TEST_OPTIONS='--test_output=streamed --verbose_failures=true' TARGETS=//test/gpu:cuda_test OPTIONS='--test_env=RUNTIME=runsc --test_arg=--cuda_test_debug=true --test_arg=--cuda_verify_compatibility=true --test_arg=--cuda_log_successful_tests=true'
```
PiperOrigin-RevId: 626528982
`ContainerPool` is useful to make sure tests can productively use the CPU
by using pre-spun containers in parallel. While the test is running, it is
difficult to know which containers are doing what, and how effective the
container pool actually is.
This change adds state tracking to each container to make this process
easier. Each container has a state and a user-assignable label (e.g. the
test name) to indicate what it is doing. The pool now also has a `String`
method that prints what each container is doing.
This is useful for CUDA tests which are very close to the test runtime limit
of one hour, in order to see how they can better spend their time, and to
know which container to `exec` into when debugging.
PiperOrigin-RevId: 626501528
This allows tests to consider the exit code of `exec`'d processes.
This is useful for CUDA tests because they exit with a specific error code
(`EXIT_WAIVED`) when they run on GPUs that don't support the features the
test needs.
PiperOrigin-RevId: 626478594
This is useful for CUDA tests, some of which work in gVisor and some not.
By checking whether the runtime in use is gVisor, the test can adjust its
expectations of success/failure.
PiperOrigin-RevId: 626443132
Enables save/resume with the checkpoint command. Previously, when
--leave-running was set, the sandbox was destroyed after the checkpoint and
restored with the same ID. With this change, the sandbox is not destroyed and
resumes running after the checkpoint.
PiperOrigin-RevId: 623282685
Callers may request a container from the pool, and must release it back
when they are done with it.
This is useful for large tests which can `exec` individual test cases
inside the same set of reusable containers, to avoid the cost of creating
and destroying containers for each test.
It also supports reserving the whole pool ("exclusive"), which locks out
all other callers from getting any other container from the pool.
This is useful for tests where running in parallel may induce unexpected
errors which running serially would not cause. This allows the test to
first try to run in parallel, and then re-run failing tests exclusively
to make sure their failure is not due to parallel execution.
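The get/release and exclusive-reservation semantics can be sketched with a buffered channel and a `sync.RWMutex`, as below. This is an assumption-laden illustration (containers are plain strings, names are made up), not the actual pool implementation. Each caller must hold at most one container at a time, otherwise a pending exclusive reservation could deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool hands out containers (plain names here). The read lock is held while
// a single container is checked out; the write lock reserves the whole pool.
type Pool struct {
	mu   sync.RWMutex
	free chan string
}

// NewPool creates a pool holding the given containers.
func NewPool(names ...string) *Pool {
	p := &Pool{free: make(chan string, len(names))}
	for _, n := range names {
		p.free <- n
	}
	return p
}

// Get blocks until a container is free and returns it with a release func.
func (p *Pool) Get() (string, func()) {
	p.mu.RLock()
	c := <-p.free
	return c, func() {
		p.free <- c
		p.mu.RUnlock()
	}
}

// GetExclusive waits for all containers to be released, then reserves them
// all, locking out other callers until the returned func is called.
func (p *Pool) GetExclusive() ([]string, func()) {
	p.mu.Lock()
	all := make([]string, 0, cap(p.free))
	for i := 0; i < cap(p.free); i++ {
		all = append(all, <-p.free)
	}
	return all, func() {
		for _, c := range all {
			p.free <- c
		}
		p.mu.Unlock()
	}
}

func main() {
	p := NewPool("c0", "c1")
	var wg sync.WaitGroup
	var mu sync.Mutex
	var failed []string
	for _, name := range []string{"a", "b", "c", "d"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			_, release := p.Get()
			defer release()
			if name == "c" { // Pretend "c" fails when run in parallel.
				mu.Lock()
				failed = append(failed, name)
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	for _, name := range failed {
		all, release := p.GetExclusive()
		fmt.Printf("re-running %s alone; %d containers reserved\n", name, len(all))
		release()
	}
}
```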
I plan to use this for CUDA sample tests.
PiperOrigin-RevId: 619377512