10 Commits

Author SHA1 Message Date
Etienne Perot 3649ca9d9e On COS, add NVIDIA library directory to LD configuration and update cache.
Unlike Ubuntu VMs where we use Docker's `--gpus` flag, COS VMs do not use
this flag and instead mount the NVIDIA library directories automatically.
However, nothing guarantees that these directories are added to the LD
config. This change fixes that. It takes advantage of the fact that all GPU
tests have the sniffer binary as entrypoint, which slightly overloads the
role of the sniffer within the GPU test infrastructure... but then again
the ioctl sniffer is already deeply intertwined with ld configuration
because it already overrides the `ioctl` libc function, so this doesn't
seem like too big of a stretch.
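
As a rough illustration of the LD configuration step (not the sniffer's actual code), the sketch below renders the body of an `ld.so.conf.d` drop-in file from a list of library directories. The function name and the directory paths are assumptions made for this example; the real COS mount points may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ldConfContents renders the body of an ld.so.conf.d drop-in file that lists
// the given library directories, one per line. Blank entries and duplicates
// are skipped. This is a hypothetical helper, not part of gVisor.
func ldConfContents(dirs []string) string {
	seen := make(map[string]bool)
	var b strings.Builder
	for _, d := range dirs {
		d = strings.TrimSpace(d)
		if d == "" || seen[d] {
			continue
		}
		seen[d] = true
		b.WriteString(d)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	// Assumed directories, for illustration only.
	dirs := []string{"/usr/local/nvidia/lib64", "/usr/local/nvidia/lib64"}
	fmt.Print(ldConfContents(dirs))
}
```

A caller would typically write this output to a file under `/etc/ld.so.conf.d/` and then run `ldconfig` to refresh the linker cache.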

This change makes the ffmpeg test succeed with `runc` on COS, but it still
fails with gVisor (with `CUDA_ERROR_OUT_OF_MEMORY` errors), so there must be
some further gVisor-specific error.

Updates #11351
Updates #11321

PiperOrigin-RevId: 715222952
2025-01-13 21:13:14 -08:00
Ayush Ranjan 2edcee37d9 Specify all capabilities in TestGPUCheckpointRestore.
This test was broken by 5e6589e0b7 ("Update CUDA test compatibility to keep
up with added gVisor support.") which requires all
images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities.

PiperOrigin-RevId: 713170590
2025-01-07 23:28:03 -08:00
Etienne Perot d299b3998c Sniffer: Exit instantly on unknown ioctls in compatibility enforcement mode.
This makes it easier to deal with client/server-type GPU tests, such as
ollama or vLLM, where the main GPU process is a long-running one. Prior to
this CL, setting enforcement mode on such a process only took effect if the
test added explicit code to shut down the container in an orderly way by
sending a signal to the server process, giving the ioctl sniffer a chance
to produce its report. This is easy to miss, because nothing indicates that
the ioctl sniffer isn't being respected in such tests. By exiting instantly
as soon as an unsupported ioctl is found, such code doesn't need to be
added.

However, the previous behavior is still useful when testing new
applications, so it is still available as well. The
`--enforce_compatibility` flag is changed to a tri-state flag: it can be
turned off, enforce compatibility on the spot, or enforce compatibility at
exit time.
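
A tri-state flag of this kind can be sketched in Go with a custom `flag.Value`. The type name and the value strings (`off`, `instant`, `report`) below are assumptions for illustration; the commit does not list the values the sniffer actually accepts.

```go
package main

import (
	"flag"
	"fmt"
)

// enforceMode is a hypothetical tri-state setting for compatibility
// enforcement: off, fail on the spot, or report at exit time.
type enforceMode int

const (
	enforceOff enforceMode = iota
	enforceInstant
	enforceAtExit
)

// String implements flag.Value.
func (m *enforceMode) String() string {
	if m == nil {
		return "off"
	}
	switch *m {
	case enforceInstant:
		return "instant"
	case enforceAtExit:
		return "report"
	default:
		return "off"
	}
}

// Set implements flag.Value. The accepted strings are assumptions.
func (m *enforceMode) Set(s string) error {
	switch s {
	case "off":
		*m = enforceOff
	case "instant":
		*m = enforceInstant
	case "report":
		*m = enforceAtExit
	default:
		return fmt.Errorf("invalid enforcement mode %q", s)
	}
	return nil
}

func main() {
	var mode enforceMode
	fs := flag.NewFlagSet("sniffer", flag.ContinueOnError)
	fs.Var(&mode, "enforce_compatibility", "one of: off, instant, report")
	if err := fs.Parse([]string{"--enforce_compatibility=instant"}); err != nil {
		panic(err)
	}
	fmt.Println(mode.String())
}
```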

Updates issue #10885.

PiperOrigin-RevId: 686653876
2024-10-16 15:03:45 -07:00
Etienne Perot 1e97c039bf Automated rollback of changelist 673541771
PiperOrigin-RevId: 673651019
2024-09-11 20:42:21 -07:00
Etienne Perot 64de876102 Do not embed the run_sniffer binary in the dockerutil library.
This is causing nogo test failures. Use a `data` dependency instead.

Updates #10885

PiperOrigin-RevId: 673541771
2024-09-11 14:42:34 -07:00
Etienne Perot a689c11a76 Integrate GPU ioctl sniffer in GPU tests.
This wraps all GPU tests' command line with the nvproxy ioctl sniffer.

This has multiple functions:

- Verifying that the application does not call ioctls unsupported by
  nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which
  is initially set to `true` in all tests to mirror current behavior, but
  should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
  of applications.
- Later down the line, enforcing that the application only calls ioctls
  that are part of GPU capabilities that it has a need for. This is
  controlled by a capability string which is currently only used to set
  the `NVIDIA_DRIVER_CAPABILITIES` environment variable.
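
The wrapping described above can be pictured roughly as follows. This is a minimal sketch, not the dockerutil implementation: the helper names and the sniffer path are hypothetical, and passing `--enforce_compatibility` as a bare flag is an assumption about how the sniffer is invoked.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapWithSniffer prefixes a test's command line with the sniffer binary.
// snifferPath and the way the enforcement flag is passed are assumptions.
func wrapWithSniffer(snifferPath string, allowIncompatibleIoctl bool, argv []string) []string {
	wrapped := []string{snifferPath}
	if !allowIncompatibleIoctl {
		wrapped = append(wrapped, "--enforce_compatibility")
	}
	return append(wrapped, argv...)
}

// capabilityEnv renders the environment entry derived from the capability
// string, which NVIDIA's container tooling reads as a comma-separated list.
func capabilityEnv(capabilities []string) string {
	return "NVIDIA_DRIVER_CAPABILITIES=" + strings.Join(capabilities, ",")
}

func main() {
	cmd := wrapWithSniffer("/sniffer", true, []string{"nvidia-smi"})
	fmt.Println(strings.Join(cmd, " "))
	fmt.Println(capabilityEnv([]string{"compute", "utility"}))
}
```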

Updates issue #10856

PiperOrigin-RevId: 672714520
2024-09-09 16:34:19 -07:00
Etienne Perot 4810afc36c GPU support: Add NVIDIA CUDA sample tests.
This is a set of CUDA tests defined by NVIDIA in this repository:
https://github.com/NVIDIA/cuda-samples

This change introduces a large new test (`cuda_test`) which runs each CUDA
sample test in a container.

There are many subtleties involved: the CUDA samples repository isn't always
meant to be run as a test, some samples are graphical applications, many
require specific CUDA features that not all NVIDIA GPUs support, some require
multiple GPUs on the machine, etc.
Therefore, the test maps each sample test to its `Compatibility` data, which
determines whether or not a sample test is expected to fail when run in a
certain environment. The overall test also has a
`--cuda_verify_compatibility` flag to verify the veracity of this mapping,
by running expected-to-be-broken tests and verifying that their failure
matches how this expected failure typically manifests.
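
To make the idea of a per-sample compatibility mapping concrete, here is a simplified sketch. It is not the `Compatibility` type from the test; the struct fields, method name, and sample names are all hypothetical.

```go
package main

import "fmt"

// environment describes the machine a sample runs on (hypothetical fields).
type environment struct {
	GPUCount   int
	HasDisplay bool
	Features   map[string]bool
}

// compatibility lists what a sample needs in order to pass (hypothetical).
type compatibility struct {
	MinGPUs      int
	NeedsDisplay bool
	Features     []string
}

// expectedFailure returns a non-empty reason if the sample is expected to
// fail in env, and the empty string if it is expected to pass.
func (c compatibility) expectedFailure(env environment) string {
	if env.GPUCount < c.MinGPUs {
		return fmt.Sprintf("needs %d GPUs, have %d", c.MinGPUs, env.GPUCount)
	}
	if c.NeedsDisplay && !env.HasDisplay {
		return "needs a display"
	}
	for _, f := range c.Features {
		if !env.Features[f] {
			return "needs CUDA feature " + f
		}
	}
	return ""
}

func main() {
	env := environment{GPUCount: 1, Features: map[string]bool{"fp16": true}}
	samples := []struct {
		name string
		c    compatibility
	}{
		{"singleGPUSample", compatibility{MinGPUs: 1, Features: []string{"fp16"}}},
		{"multiGPUSample", compatibility{MinGPUs: 2}},
		{"graphicalSample", compatibility{MinGPUs: 1, NeedsDisplay: true}},
	}
	for _, s := range samples {
		if reason := s.c.expectedFailure(env); reason != "" {
			fmt.Printf("%s: expected to fail (%s)\n", s.name, reason)
			continue
		}
		fmt.Printf("%s: expected to pass\n", s.name)
	}
}
```

A verification mode in the spirit of `--cuda_verify_compatibility` would run the expected-to-fail samples anyway and check that they do fail for the stated reason.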

Because there are a lot of CUDA sample tests (213 as of the CUDA 12.3 release
of the cuda-samples repo), and they don't all require the whole GPU to
themselves, and spawning a GPU-using container is expensive (~seconds),
the test uses a pool of reusable containers in which it `exec`s tests (at
most one per container at any given time, but in parallel across containers).
If any test unexpectedly fails, we drain the entire pool of containers and
only run this one test without anything else running on other containers.
This de-flakes tests, especially those that fail because they require more
resources than the GPU has when other tests are using it at the same time.
However, this removes parallelism and therefore increases test time
significantly.
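
The pooling and exclusive-retry strategy can be approximated as below. This is a simplified sketch rather than the test's implementation: slots stand in for containers, all names are hypothetical, and failed tests are retried alone only after all parallel work has finished, whereas the real test drains the pool as soon as a failure occurs.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errSampleFailed is a placeholder error used by the example below.
var errSampleFailed = errors.New("sample failed")

// runAll runs each test in one of poolSize reusable slots (standing in for
// containers), at most one test per slot at a time. Tests that fail while
// running in parallel are retried one at a time once the pool is idle.
// It returns the tests that still fail when run exclusively.
// poolSize must be at least 1.
func runAll(tests []string, poolSize int, run func(name string, exclusive bool) error) []string {
	slots := make(chan int, poolSize)
	for i := 0; i < poolSize; i++ {
		slots <- i
	}
	var (
		mu    sync.Mutex
		retry []string
		wg    sync.WaitGroup
	)
	for _, name := range tests {
		slot := <-slots // wait for a free slot
		wg.Add(1)
		go func(name string, slot int) {
			defer wg.Done()
			defer func() { slots <- slot }() // return the slot to the pool
			if err := run(name, false); err != nil {
				mu.Lock()
				retry = append(retry, name)
				mu.Unlock()
			}
		}(name, slot)
	}
	wg.Wait() // the pool is now drained

	sort.Strings(retry) // deterministic retry order
	var failed []string
	for _, name := range retry {
		if err := run(name, true); err != nil {
			failed = append(failed, name)
		}
	}
	return failed
}

func main() {
	run := func(name string, exclusive bool) error {
		// "needsWholeGPU" only passes when nothing else is running.
		if name == "alwaysBroken" || (name == "needsWholeGPU" && !exclusive) {
			return errSampleFailed
		}
		return nil
	}
	failed := runAll([]string{"ok", "needsWholeGPU", "alwaysBroken"}, 2, run)
	fmt.Println("still failing when run alone:", failed)
}
```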

Despite these optimizations, the test is very long and has the maximum deadline
of 1 hour. Because it may get close to the timeout (especially when
`--cuda_verify_compatibility` is on, because that needs to run all the tests
even if they are known to fail, and the ones that do fail can hang for a
while rather than crash), the test also has more logging and debugging than
the typical test, enabled with the `--cuda_log_successful_tests` and
`--cuda_test_debug` flags. It periodically logs the status of each container
and the pool's utilization ratio. This is useful when debugging the test to
see which pooled container is doing what, and/or to `docker exec` into
containers while they are running a certain test.
The test also does its own timekeeping, which is useful so that it can print
a more helpful failure message that distinguishes between tests failing due
to actual failure reasons vs those that are failing purely because the test
timed out.

To run the test manually (from a VM with the repo checked out):

```
$ docker build -t gvisor.dev/images/gpu/cuda-tests images/gpu/cuda-tests -f images/gpu/cuda-tests/Dockerfile.x86_64 && \
  mkdir -p bin && \
  make copy TARGETS=runsc DESTINATION=bin/ && \
  ./bin/runsc install -- --nvproxy=true --debug=true --debug-log=/tmp/runsc/ && \
  systemctl reload docker && \
  make test TEST_OPTIONS='--test_output=streamed --verbose_failures=true' TARGETS=//test/gpu:cuda_test OPTIONS='--test_env=RUNTIME=runsc --test_arg=--cuda_test_debug=true --test_arg=--cuda_verify_compatibility=true --test_arg=--cuda_log_successful_tests=true'
```

PiperOrigin-RevId: 626528982
2024-04-19 19:13:42 -07:00
Etienne Perot 641a1a56b8 dockerutil.GPURunOpts: Expose all GPUs to containers, not just the first.
PiperOrigin-RevId: 617956471
2024-03-21 14:05:10 -07:00
Etienne Perot 07e86e27b0 Add ollama GPU test.
This runs https://ollama.ai/ in a gVisor container and loads two models:
an English-Chinese translation model, and a code assistant model.

It asks the first one to translate "Hello World" to Chinese, and then asks
the second one to generate a test case to verify that the translation is
correct.

This change includes a server and client library for spawning ollama in a
container and interacting through its HTTP API. This will be useful to turn
it into a benchmark that measures its throughput in tokens/second.
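
As a sketch of how such a throughput figure might be derived (this is not the client library added here), the code below parses a final response from ollama's `/api/generate` endpoint and divides `eval_count` by `eval_duration`, which ollama reports in nanoseconds. The type and function names are hypothetical, and the JSON body in `main` is made-up sample data.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// generateResponse holds the fields of interest from a final
// /api/generate response. Only a subset of the fields is decoded.
type generateResponse struct {
	Response     string `json:"response"`
	EvalCount    int64  `json:"eval_count"`
	EvalDuration int64  `json:"eval_duration"` // nanoseconds
}

// tokensPerSecond computes generation throughput from a response body.
func tokensPerSecond(body []byte) (float64, error) {
	var r generateResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	if r.EvalDuration <= 0 {
		return 0, errors.New("response has no eval_duration")
	}
	return float64(r.EvalCount) / (float64(r.EvalDuration) / 1e9), nil
}

func main() {
	// Made-up sample data, not output from a real run.
	body := []byte(`{"response":"...","eval_count":50,"eval_duration":2000000000}`)
	tps, err := tokensPerSecond(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f tokens/second\n", tps)
}
```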

PiperOrigin-RevId: 590295278
2023-12-12 12:28:23 -08:00
Etienne Perot e1d2ce8cfa Move GPU test utilities to their own package.
These will be reusable across GPU tests, not just the smoke tests.

PiperOrigin-RevId: 587888928
2023-12-04 17:26:28 -08:00