gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Etienne Perot	3649ca9d9e	On COS, add NVIDIA library directory to LD configuration and update cache. Unlike Ubuntu VMs where we use Docker's `--gpus` flag, COS VMs do not use this flag and instead mount the NVIDIA library directories automatically. However, nothing guarantees that these directories are added to the LD config. This change fixes that. It takes advantage of the fact that all GPU tests have the sniffer binary as entrypoint, which slightly overloads the role of the sniffer within the GPU test infrastructure... but then again the ioctl sniffer is already deeply intertwined with ld configuration because it already overrides the `ioctl` libc function, so this doesn't seem like too big of a stretch. This change makes the ffmpeg test succeed with `runc` on COS, but it still fails with gVisor (with `CUDA_ERROR_OUT_OF_MEMORY` errors). So there must be some further gVisor-specific error. Updates #11351 Updates #11321 PiperOrigin-RevId: 715222952	2025-01-13 21:13:14 -08:00
Ayush Ranjan	2edcee37d9	Specify all capabilities in TestGPUCheckpointRestore. This test was broken by `5e6589e0b7` ("Update CUDA test compatibility to keep up with added gVisor support.") which requires all images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities. PiperOrigin-RevId: 713170590	2025-01-07 23:28:03 -08:00
Etienne Perot	d299b3998c	Sniffer: Exit instantly on unknown ioctls in compatibility enforcement mode. This makes it easier to deal with client/server-type GPU tests, such as ollama or vLLM, where the main GPU process is a long-running one. Prior to this CL, when enforcement mode was set on such a process, compatibility would only be checked if explicit code was added to shut down the container in an orderly manner by sending a signal to the server process, giving the ioctl sniffer a chance to produce its report. This is easy to miss, because there is no feedback that would suggest that the ioctl sniffer isn't being respected in such tests. By exiting instantly as soon as an unsupported ioctl is found, such code doesn't need to be added. However, the previous behavior is still useful when testing new applications, so it is still available as well. The `--enforce_compatibility` flag is changed to a tri-state flag that can handle either being turned off, or enforcing compatibility on the spot vs at exit time. Updates issue #10885. PiperOrigin-RevId: 686653876	2024-10-16 15:03:45 -07:00
Etienne Perot	1e97c039bf	Automated rollback of changelist 673541771 PiperOrigin-RevId: 673651019	2024-09-11 20:42:21 -07:00
Etienne Perot	64de876102	Do not embed the `run_sniffer` binary in the `dockerutil` library. This is causing nogo test failures. Use a `data` dependency instead. Updates #10885 PiperOrigin-RevId: 673541771	2024-09-11 14:42:34 -07:00
Etienne Perot	a689c11a76	Integrate GPU `ioctl` sniffer in GPU tests. This wraps all GPU tests' command line with the nvproxy ioctl sniffer. This has multiple functions: - Verifying that the application does not call ioctls unsupported by nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which is initially set to `true` in all tests to mirror current behavior, but should be flipped as we verify that they do not call unsupported ioctls. - Verifying that the sniffer itself works transparently for a wide range of applications. - Later down the line, enforcing that the application only calls ioctls that are part of GPU capabilities that it has a need for. This is controlled by a capability string which is currently only used to set the `NVIDIA_DRIVER_CAPABILITIES` environment variable. Updates issue #10856 PiperOrigin-RevId: 672714520	2024-09-09 16:34:19 -07:00
Etienne Perot	4810afc36c	GPU support: Add NVIDIA CUDA sample tests. This is a set of CUDA tests defined by NVIDIA in this repository: https://github.com/NVIDIA/cuda-samples This change introduces a large new test (`cuda_test`) which runs each CUDA sample test in a container. There are many subtleties involved due to how the CUDA samples repository isn't always meant to be run as a test, some of it involves graphical applications, and a lot of them require this or that CUDA feature which not all NVIDIA GPUs support, some require multiple GPUs to be on the machine, etc. Therefore, the test maps each test to its `Compatibility` data which determines whether or not a sample test is expected to fail when run in a certain environment. The overall test also has a `--cuda_verify_compatibility` flag to verify the veracity of this mapping, by running expected-to-be-broken tests and verifying that their failure matches how this expected failure typically manifests. Because there are a lot of CUDA sample tests (213 as of the CUDA 12.3 release of the cuda-samples repo), and they don't all require the whole GPU to themselves, and spawning a GPU-using container is expensive (~seconds), the test uses a pool of reusable containers in which it `exec`s tests (at most one per container at any given time, but in parallel across containers). If any test unexpectedly fails, we drain the entire pool of containers and only run this one test without anything else running on other containers. This de-flakes tests, especially those that fail because they require more resources than the GPU has when other tests are using it at the same time. However, this removes parallelism and therefore increases test time significantly. Despite these optimizations, the test is very long and has the maximum deadline of 1 hour.
Because it may get close to the timeout (especially when `--cuda_verify_compatibility` is on, because that needs to run all the tests even if they are known to fail, and the ones that do fail can hang for a while rather than crash), the test also has more logging and debugging than the typical test, as enabled with the `--cuda_log_successful_tests` and `--cuda_test_debug` flags. It periodically logs the status of each container and the pool's utilization ratio. This is useful when debugging the test to see which pooled container is doing what, and/or to `docker exec` into containers while they are running a certain test. The test also does its own timekeeping, which is useful so that it can print a more helpful failure message that distinguishes between tests failing due to actual failure reasons vs those that are failing purely because the test timed out. To run the test manually (from a VM with the repo checked out): ``` $ docker build -t gvisor.dev/images/gpu/cuda-tests images/gpu/cuda-tests -f images/gpu/cuda-tests/Dockerfile.x86_64 && mkdir -p bin && make copy TARGETS=runsc DESTINATION=bin/ && ./bin/runsc install -- --nvproxy=true --debug=true --debug-log=/tmp/runsc/ && systemctl reload docker && make test TEST_OPTIONS='--test_output=streamed --verbose_failures=true' TARGETS=//test/gpu:cuda_test OPTIONS='--test_env=RUNTIME=runsc --test_arg=--cuda_test_debug=true --test_arg=--cuda_verify_compatibility=true --test_arg=--cuda_log_successful_tests=true' ``` PiperOrigin-RevId: 626528982	2024-04-19 19:13:42 -07:00
Etienne Perot	641a1a56b8	`dockerutil.GPURunOpts`: Expose all GPUs to containers, not just the first. PiperOrigin-RevId: 617956471	2024-03-21 14:05:10 -07:00
Etienne Perot	07e86e27b0	Add ollama GPU test. This runs https://ollama.ai/ in a gVisor container and loads two models: an English-Chinese translation model, and a code assistant model. It asks the first one to translate "Hello World" to Chinese, and then asks the second one to generate a test case to verify that the translation is correct. This change includes a server and client library for spawning ollama in a container and interacting through its HTTP API. This will be useful to turn it into a benchmark that measures its throughput in tokens/second. PiperOrigin-RevId: 590295278	2023-12-12 12:28:23 -08:00
Etienne Perot	e1d2ce8cfa	Move GPU test utilities to their own package. These will be reusable across GPU tests, not just the smoke tests. PiperOrigin-RevId: 587888928	2023-12-04 17:26:28 -08:00

10 Commits