75 Commits

Author SHA1 Message Date
gVisor bot 225a7bc0d2 Internal change.
PiperOrigin-RevId: 739263572
2025-03-21 12:32:21 -07:00
Ayush Ranjan e699298d58 Add Container.RestoreInTest() to handle known Docker bugs.
This can be used by all test users. Avoids duplicated code. We can handle all
known issues in one place.

There is a Docker bug which causes restore to fail sporadically. See
https://github.com/moby/moby/issues/42900. This has been broken at least since
Docker v19.03.12 (when the issue was reported) and was fixed in v25.0.4. Added
the handling for this issue.
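
As a rough illustration of gating a workaround on the Docker version named above, the sketch below checks whether a version string is at least v25.0.4. The helper names `parseVersion` and `hasRestoreFix` are hypothetical and are not taken from `Container.RestoreInTest()`; how that function actually detects the bug is not described here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a string such as "v25.0.4" or "19.03.12" into three
// integers. Suffixes introduced by "-" or "+" are ignored.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version format %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad version component %q: %w", p, err)
		}
		out[i] = n
	}
	return out, nil
}

// hasRestoreFix (hypothetical helper) reports whether the given Docker
// version is at least v25.0.4, the release named above as fixing the bug.
func hasRestoreFix(version string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	fixed := [3]int{25, 0, 4}
	for i := range got {
		if got[i] != fixed[i] {
			return got[i] > fixed[i], nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"19.03.12", "v25.0.3", "25.0.4", "26.1.0"} {
		ok, err := hasRestoreFix(v)
		fmt.Println(v, ok, err)
	}
}
```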

Also got rid of the testutil.Poll() around restore. That can hide gVisor
restore flakiness issues. That was added in 0990ef7517 ("Make
checkpoint/restore e2e test less flaky"). The original sleep has been restored.

PiperOrigin-RevId: 734303878
2025-03-06 15:13:16 -08:00
Etienne Perot 3649ca9d9e On COS, add NVIDIA library directory to LD configuration and update cache.
Unlike Ubuntu VMs where we use Docker's `--gpus` flag, COS VMs do not use
this flag and instead mount the NVIDIA library directories automatically.
However, nothing guarantees that these directories are added to the LD
config. This change fixes that. It takes advantage of the fact that all GPU
tests have the sniffer binary as entrypoint, which slightly overloads the
role of the sniffer within the GPU test infrastructure... but then again
the ioctl sniffer is already deeply intertwined with ld configuration
because it already overrides the `ioctl` libc function, so this doesn't
seem like too big of a stretch.

This change makes the ffmpeg tests succeed with `runc` on COS, but they still
fail with gVisor (with `CUDA_ERROR_OUT_OF_MEMORY` errors). So there must be
some further gVisor-specific error.

Updates #11351
Updates #11321

PiperOrigin-RevId: 715222952
2025-01-13 21:13:14 -08:00
Etienne Perot 501618dbe2 Modify make rules to allow using a fully-local build cache.
This includes the ability to force using local images only (do not check
for updated manifests), and to explicitly mount and specify external caches
for Go repositories via `rules_go`'s `GO_REPOSITORY_USE_HOST_MODCACHE`.

PiperOrigin-RevId: 713473497
2025-01-09 12:23:53 -08:00
Ayush Ranjan 2edcee37d9 Specify all capabilities in TestGPUCheckpointRestore.
This test was broken by 5e6589e0b7 ("Update CUDA test compatibility to keep
up with added gVisor support.") which requires all
images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities.

PiperOrigin-RevId: 713170590
2025-01-07 23:28:03 -08:00
Etienne Perot 5e6589e0b7 Update CUDA test compatibility to keep up with added gVisor support.
These CUDA tests were initially broken in gVisor but now appear to pass.

The test now also verifies that all capabilities are enabled when running.

PiperOrigin-RevId: 713094806
2025-01-07 17:28:00 -08:00
gVisor bot 680df82012 Internal change
PiperOrigin-RevId: 698529562
2024-11-20 14:53:44 -08:00
Etienne Perot abe38d82ac Docker GPU tests: Use the sniffer on exec'd commands too.
Without this, tests that `exec` commands into existing GPU containers (e.g.
the CUDA samples test) would not actually get ioctl compatibility
enforcement.

Updates issue #10885.

PiperOrigin-RevId: 686686310
2024-10-16 16:48:27 -07:00
Etienne Perot d299b3998c Sniffer: Exit instantly on unknown ioctls in compatibility enforcement mode.
This makes it easier to deal with client/server-type GPU tests, such as
ollama or vLLM, where the main GPU process is a long-running one. Prior to
this CL, enforcement mode on such a process only took effect if the test
added explicit code to shut down the container in an orderly fashion by
sending a signal to the server process, giving the ioctl sniffer a chance
to produce its report. This is easy to miss, because nothing indicates
that the ioctl sniffer is being ignored in such tests. Because the sniffer
now exits as soon as it finds an unsupported ioctl, such code is no longer
needed.

However, the previous behavior is still useful when testing new
applications, so it is still available as well. The
`--enforce_compatibility` flag is changed to a tri-state flag that can
handle either being turned off, or enforcing compatibility on the spot vs
at exit time.
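
A tri-state flag of this kind can be modeled in Go with a custom `flag.Value`. The following is a minimal sketch only: the type name `enforceMode` and the value spellings `off`, `instant`, and `on_exit` are assumptions chosen for illustration and may not match what the sniffer actually accepts.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// enforceMode is a hypothetical tri-state value for a flag such as
// --enforce_compatibility. The accepted spellings are illustrative only.
type enforceMode int

const (
	enforceOff     enforceMode = iota // do not enforce compatibility
	enforceInstant                    // exit as soon as an unknown ioctl is seen
	enforceOnExit                     // report unknown ioctls at exit time
)

// String implements flag.Value.
func (m *enforceMode) String() string {
	if m == nil {
		return "off"
	}
	switch *m {
	case enforceInstant:
		return "instant"
	case enforceOnExit:
		return "on_exit"
	default:
		return "off"
	}
}

// Set implements flag.Value.
func (m *enforceMode) Set(s string) error {
	switch s {
	case "", "off":
		*m = enforceOff
	case "instant":
		*m = enforceInstant
	case "on_exit":
		*m = enforceOnExit
	default:
		return fmt.Errorf("invalid enforcement mode %q", s)
	}
	return nil
}

// parseMode parses args with a private FlagSet so it can be called repeatedly.
func parseMode(args []string) (enforceMode, error) {
	var mode enforceMode
	fs := flag.NewFlagSet("sniffer", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the caller's output
	fs.Var(&mode, "enforce_compatibility", "one of: off, instant, on_exit")
	if err := fs.Parse(args); err != nil {
		return enforceOff, err
	}
	return mode, nil
}

func main() {
	mode, err := parseMode([]string{"--enforce_compatibility=instant"})
	fmt.Println(mode.String(), err)
}
```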

Updates issue #10885.

PiperOrigin-RevId: 686653876
2024-10-16 15:03:45 -07:00
Koichi Shiraishi 0cf77c02f8 all: remove use of deprecated io/ioutil package & fix some deprecated usages
Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>
2024-10-10 20:36:24 +09:00
Etienne Perot 1ea84d6db0 Add test that runs runsc do inside a non-gVisor container.
This is used in contexts such as Dangerzone:
https://gvisor.dev/blog/2024/09/23/safe-ride-into-the-dangerzone/

Updates issue #10944.

PiperOrigin-RevId: 682454284
2024-10-04 14:40:07 -07:00
Ayush Ranjan e3aa1bf7dd Disable nogo for pkg/test/criutil:criutil and pkg/test/dockerutil:profile_test.
PiperOrigin-RevId: 674352028
2024-09-13 10:40:44 -07:00
Etienne Perot 1e97c039bf Automated rollback of changelist 673541771
PiperOrigin-RevId: 673651019
2024-09-11 20:42:21 -07:00
Etienne Perot 64de876102 Do not embed the run_sniffer binary in the dockerutil library.
This is causing nogo test failures. Use a `data` dependency instead.

Updates #10885

PiperOrigin-RevId: 673541771
2024-09-11 14:42:34 -07:00
Etienne Perot a689c11a76 Integrate GPU ioctl sniffer in GPU tests.
This wraps all GPU tests' command line with the nvproxy ioctl sniffer.

This has multiple functions:

- Verifying that the application does not call ioctls unsupported by
  nvproxy. This is controlled by a `AllowIncompatibleIoctl` option, which
  is initially set to `true` in all tests to mirror current behavior, but
  should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
  of applications.
- Later down the line, enforcing that the application only calls ioctls
  that are part of GPU capabilities that it has a need for. This is
  controlled by a capability string which is currently only used to set
  the `NVIDIA_DRIVER_CAPABILITIES` environment variable.
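
The wrapping described above might look roughly like the sketch below. Only `AllowIncompatibleIoctl` and `NVIDIA_DRIVER_CAPABILITIES` come from this message; the `SniffOpts` type, the `wrapCommand` helper, the sniffer path, and the exact flag spelling passed to the sniffer are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// SniffOpts is a hypothetical options struct holding the two knobs
// described above.
type SniffOpts struct {
	AllowIncompatibleIoctl bool
	Capabilities           string // e.g. "compute,utility" or "all"
}

// wrapCommand prefixes argv with the sniffer binary and returns any extra
// environment variables the container should receive.
func wrapCommand(snifferPath string, opts SniffOpts, argv []string) (cmd, env []string) {
	cmd = append(cmd, snifferPath)
	// Assumed flag spelling: enforcement is the inverse of the allow option.
	cmd = append(cmd, fmt.Sprintf("--enforce_compatibility=%t", !opts.AllowIncompatibleIoctl))
	cmd = append(cmd, argv...)
	if opts.Capabilities != "" {
		env = append(env, "NVIDIA_DRIVER_CAPABILITIES="+opts.Capabilities)
	}
	return cmd, env
}

func main() {
	opts := SniffOpts{AllowIncompatibleIoctl: true, Capabilities: "all"}
	cmd, env := wrapCommand("/run_sniffer", opts, []string{"nvidia-smi"})
	fmt.Println(strings.Join(env, " "), strings.Join(cmd, " "))
}
```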

Updates issue #10856

PiperOrigin-RevId: 672714520
2024-09-09 16:34:19 -07:00
gVisor bot 3c4b246cf2 Fix printf violations inside of the gvisor code
Recently printf.Analyzer has become stricter
(https://github.com/golang/go/issues/60529)
which led to new findings.
gvisor nogo tests run this analyzer and fail if it produces findings.

PiperOrigin-RevId: 671657227
2024-09-06 00:45:23 -07:00
Etienne Perot c1661e7c84 Provide more helpful error messages when profiling is misconfigured.
This forwards the output of `runsc debug` to stderr if it fails during
a container run. Additionally, for runs with profiling enabled, it checks
the runtime arguments and prints an error if the `--profile` flag is not
found in them.
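
The argument check could be as simple as the following sketch. The helper names `hasProfileFlag` and `checkProfiling` are hypothetical, and the real check may inspect the runtime configuration differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// hasProfileFlag reports whether args contains the profile flag, either as
// a bare flag or in name=value form. Related flags with longer names (for
// example a hypothetical "--profile-cpu") do not count.
func hasProfileFlag(args []string) bool {
	for _, a := range args {
		name := strings.TrimLeft(a, "-")
		if name == a {
			continue // no leading dash: not a flag
		}
		if i := strings.IndexByte(name, '='); i >= 0 {
			name = name[:i]
		}
		if name == "profile" {
			return true
		}
	}
	return false
}

// checkProfiling returns a descriptive error when profiling was requested
// but the runtime was not started with the flag it needs.
func checkProfiling(profilingEnabled bool, runtimeArgs []string) error {
	if profilingEnabled && !hasProfileFlag(runtimeArgs) {
		return errors.New("profiling requested, but the runtime arguments do not include --profile")
	}
	return nil
}

func main() {
	fmt.Println(checkProfiling(true, []string{"--debug"}))
	fmt.Println(checkProfiling(true, []string{"--debug", "--profile"}))
}
```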

Profiling is also disabled by default on benchmarks now. This forces the
user to be explicit about where benchmarks are stored, which is less
confusing than the current behavior of empty output with no explanation
as to where the profiles are.

Fixes #10433

PiperOrigin-RevId: 642132089
2024-06-10 22:02:12 -07:00
Etienne Perot 1d800dc14b Set default test container log config to allow higher logging volume.
This is part of a series of changes to add metric charts in performance
benchmarks.

This is intended to increase the reliability of logs-based profiling metrics
by ensuring that Docker keeps enough of them around during long tests, or
tests where there are a lot of metrics being profiled, or tests with a very
fast profiling rate, or all of the above.

This also calls `sync` on the underlying logging file descriptor if possible.

Give a more helpful error message if a log buffer overrun does occur.

PiperOrigin-RevId: 633389921
2024-05-13 18:07:57 -07:00
Etienne Perot 4810afc36c GPU support: Add NVIDIA CUDA sample tests.
This is a set of CUDA tests defined by NVIDIA in this repository:
https://github.com/NVIDIA/cuda-samples

This change introduces a large new test (`cuda_test`) which runs each CUDA
sample test in a container.

There are many subtleties involved: the CUDA samples repository isn't always
meant to be run as a test, some samples are graphical applications, many
require specific CUDA features that not all NVIDIA GPUs support, some require
multiple GPUs to be on the machine, etc.
Therefore, the test maps each test to their `Compatibility` data which
determines whether or not a sample test is expected to fail when run in a
certain environment. The overall test also has a
`--cuda_verify_compatibility` flag to verify the veracity of this mapping,
by running expected-to-be-broken tests and verifying that their failure
matches how this expected failure typically manifests.
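
The idea of mapping each sample to an expectation can be sketched as below. This is a simplified model, not the test's actual `Compatibility` type: the `Env` and `Compatibility` fields and the `Check` helper are hypothetical, and verification here only compares pass/fail rather than matching the specific way a failure manifests.

```go
package main

import "fmt"

// Env describes where a sample runs (hypothetical fields).
type Env struct {
	GVisor  bool
	NumGPUs int
}

// Compatibility records what a sample needs in order to pass
// (hypothetical fields).
type Compatibility struct {
	BrokenInGVisor bool
	MinGPUs        int
}

// ExpectPass reports whether the sample is expected to pass in env.
func (c Compatibility) ExpectPass(e Env) bool {
	if c.BrokenInGVisor && e.GVisor {
		return false
	}
	return e.NumGPUs >= c.MinGPUs
}

// Check compares an observed result with the expectation. In verify mode,
// a sample that passes despite being marked incompatible is also an error,
// because it means the mapping is out of date.
func Check(name string, c Compatibility, e Env, passed, verify bool) error {
	expectPass := c.ExpectPass(e)
	switch {
	case expectPass && !passed:
		return fmt.Errorf("%s: unexpected failure", name)
	case !expectPass && passed && verify:
		return fmt.Errorf("%s: passed but is marked incompatible", name)
	default:
		return nil
	}
}

func main() {
	env := Env{GVisor: true, NumGPUs: 1}
	multiGPU := Compatibility{MinGPUs: 2}
	fmt.Println(Check("sampleNeedingTwoGPUs", multiGPU, env, false, true))
	fmt.Println(Check("sampleNeedingTwoGPUs", multiGPU, env, true, true))
}
```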

Because there are a lot of CUDA sample tests (213 as of the CUDA 12.3 release
of the cuda-samples repo), and they don't all require the whole GPU to
themselves, and spawning a GPU-using container is expensive (~seconds),
the test uses a pool of reusable containers in which it `exec`s tests (at
most one per container at any given time, but in parallel across containers).
If any test unexpectedly fails, we drain the entire pool of containers and
only run this one test without anything else running on other containers.
This de-flakes tests, especially those that fail because they require more
resources than the GPU has when other tests are using it at the same time.
However, this removes parallelism and therefore increases test time
significantly.

Despite these optimizations, the test is very long and has the maximum deadline
of 1 hour. Because it may get close to the timeout (especially when
`--cuda_verify_compatibility` is on, because that needs to run all the tests
even if they are known to fail, and the ones that do fail can hang for a
while rather than crash), the test also has more logging and debugging than
the typical test, as enabled with the `--cuda_log_successful_tests` and
`--cuda_test_debug` flags. It periodically logs the status of each container
and the pool's utilization ratio. This is useful when debugging the test to
see which pooled container is doing what, and/or to `docker exec` into
containers while they are running a certain test.
The test also does its own timekeeping, which is useful so that it can print
a more helpful failure message that distinguishes between tests failing due
to actual failure reasons vs those that are failing purely because the test
timed out.

To run the test manually (from a VM with the repo checked out):

```
$ docker build -t gvisor.dev/images/gpu/cuda-tests images/gpu/cuda-tests -f images/gpu/cuda-tests/Dockerfile.x86_64 && mkdir -p bin && make copy TARGETS=runsc DESTINATION=bin/ && ./bin/runsc install -- --nvproxy=true --debug=true --debug-log=/tmp/runsc/ && systemctl reload docker && make test TEST_OPTIONS='--test_output=streamed --verbose_failures=true' TARGETS=//test/gpu:cuda_test OPTIONS='--test_env=RUNTIME=runsc --test_arg=--cuda_test_debug=true --test_arg=--cuda_verify_compatibility=true --test_arg=--cuda_log_successful_tests=true'
```

PiperOrigin-RevId: 626528982
2024-04-19 19:13:42 -07:00
Etienne Perot cc8c584508 dockerutil.ContainerPool: Add debugging and utilization information.
`ContainerPool` is useful to make sure tests can productively use the CPU
by using pre-spun containers in parallel. While the test is running, it is
difficult to know which containers are doing what, and how effective the
container pool actually is.

This change adds state tracking to each container to make this process
easier. Each container has a state and a user-assignable label (e.g. the
test name) to indicate what it is doing. Then, the pool now has a `String`
function to print what each container is doing.

This is useful for CUDA tests which are very close to the test runtime limit
of one hour, in order to see how they can better spend their time, and to
know which container to `exec` into when debugging.
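
A minimal sketch of per-container state tracking with a `String` method follows. The `Tracker` and `slot` types and the output format are hypothetical and much simpler than what `ContainerPool` tracks.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// slot is the tracked state of one pooled container (hypothetical type).
type slot struct {
	name  string
	busy  bool
	label string // user-assignable, e.g. the name of the running test
}

// Tracker records what each container in a pool is doing.
type Tracker struct {
	mu    sync.Mutex
	slots []*slot
}

// NewTracker creates a tracker with one idle slot per container name.
func NewTracker(names ...string) *Tracker {
	tr := &Tracker{}
	for _, n := range names {
		tr.slots = append(tr.slots, &slot{name: n})
	}
	return tr
}

// SetLabel marks container i as busy with the given label.
func (tr *Tracker) SetLabel(i int, label string) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.slots[i].busy, tr.slots[i].label = true, label
}

// Clear marks container i as idle again.
func (tr *Tracker) Clear(i int) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.slots[i].busy, tr.slots[i].label = false, ""
}

// String lists every container's state followed by a utilization count.
func (tr *Tracker) String() string {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	var b strings.Builder
	busy := 0
	for _, s := range tr.slots {
		if s.busy {
			busy++
			fmt.Fprintf(&b, "%s: busy (%s)\n", s.name, s.label)
		} else {
			fmt.Fprintf(&b, "%s: idle\n", s.name)
		}
	}
	fmt.Fprintf(&b, "utilization: %d/%d", busy, len(tr.slots))
	return b.String()
}

func main() {
	tr := NewTracker("c0", "c1")
	tr.SetLabel(1, "vectorAdd")
	fmt.Println(tr)
}
```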

PiperOrigin-RevId: 626501528
2024-04-19 16:50:13 -07:00
Etienne Perot 74b903782c dockerutil: Return exit code in Container.Exec.
This allows tests to consider the exit code of `exec`'d processes.

This is useful for CUDA tests because they exit with a specific error code
(`EXIT_WAIVED`) when they run on GPUs that don't support the features the
test needs.

PiperOrigin-RevId: 626478594
2024-04-19 15:04:08 -07:00
Etienne Perot 7c4d57fbe6 dockerutil: Add IsGVisorRuntime helper function.
This is useful for CUDA tests, some of which work in gVisor and some not.
By checking whether the runtime in use is gVisor, the test can adjust its
expectations of success/failure.

PiperOrigin-RevId: 626443132
2024-04-19 12:45:49 -07:00
Nayana Bidari 87d8df37c7 Enable save/checkpoint resume with runsc checkpoint command.
Enables save resume with checkpoint command. Previously when --leave-running
was set, the sandbox was destroyed after the checkpoint and restored with the
same id. With this change the sandbox will not be destroyed and resumes running
after the checkpoint.

PiperOrigin-RevId: 623282685
2024-04-09 14:34:50 -07:00
Etienne Perot 08ed01b285 dockerutil: Implement ContainerPool, a pool of reusable test containers.
Callers may request a container from the pool, and must release it back
when they are done with it.

This is useful for large tests which can `exec` individual test cases
inside the same set of reusable containers, to avoid the cost of creating
and destroying containers for each test.

It also supports reserving the whole pool ("exclusive"), which locks out
all other callers from getting any other container from the pool.
This is useful for tests where running in parallel may induce unexpected
errors which running serially would not cause. This allows the test to
first try to run in parallel, and then re-run failing tests exclusively
to make sure their failure is not due to parallel execution.

I plan to use this for CUDA sample tests.
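
The request/release and exclusive-reservation semantics can be sketched with a buffered channel and a read/write lock. This is an illustrative model only: the `Pool` type and its methods are hypothetical, containers are represented by their names, and each caller is assumed to hold at most one container at a time and to call its release function exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool hands out reusable containers, modeled here as names.
type Pool struct {
	mu   sync.RWMutex // read-held per container in use, write-held when exclusive
	free chan string  // containers that are currently idle
}

// NewPool creates a pool in which every named container starts idle.
func NewPool(names ...string) *Pool {
	p := &Pool{free: make(chan string, len(names))}
	for _, n := range names {
		p.free <- n
	}
	return p
}

// Get blocks until a container is free and returns it together with a
// function that releases it back to the pool.
func (p *Pool) Get() (string, func()) {
	p.mu.RLock()
	name := <-p.free
	return name, func() {
		p.free <- name
		p.mu.RUnlock()
	}
}

// GetExclusive waits until every container has been released, then returns
// one container while locking all other callers out of the pool.
func (p *Pool) GetExclusive() (string, func()) {
	p.mu.Lock()
	name := <-p.free
	return name, func() {
		p.free <- name
		p.mu.Unlock()
	}
}

func main() {
	p := NewPool("c0", "c1", "c2")
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			name, release := p.Get()
			defer release()
			fmt.Printf("case %d runs in %s\n", id, name)
		}(i)
	}
	wg.Wait()
	name, release := p.GetExclusive()
	fmt.Println("re-running a failed case alone in", name)
	release()
}
```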

PiperOrigin-RevId: 619377512
2024-03-26 18:53:14 -07:00
Etienne Perot 641a1a56b8 dockerutil.GPURunOpts: Expose all GPUs to containers, not just the first.
PiperOrigin-RevId: 617956471
2024-03-21 14:05:10 -07:00