gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
gVisor bot	225a7bc0d2	Internal change. PiperOrigin-RevId: 739263572	2025-03-21 12:32:21 -07:00
Ayush Ranjan	e699298d58	Add Container.RestoreInTest() to handle known Docker bugs. This can be used by all test users. Avoids duplicated code. We can handle all known issues in one place. There is a Docker bug which causes restore to fail sporadically. See https://github.com/moby/moby/issues/42900. This has been broken at least since Docker v19.03.12 (when the issue was reported) and was fixed in v25.0.4. Added the handling for this issue. Also got rid of the testutil.Poll() around restore. That can hide gVisor restore flakiness issues. That was added in `0990ef7517` ("Make checkpoint/restore e2e test less flaky"). The original sleep has been restored. PiperOrigin-RevId: 734303878	2025-03-06 15:13:16 -08:00
Etienne Perot	3649ca9d9e	On COS, add NVIDIA library directory to LD configuration and update cache. Unlike Ubuntu VMs where we use Docker's `--gpus` flag, COS VMs do not use this flag and instead mount the NVIDIA library directories automatically. However, nothing guarantees that these directories are added to the LD config. This change fixes that. It takes advantage of the fact that all GPU tests have the sniffer binary as entrypoint, which slightly overloads the role of the sniffer within the GPU test infrastructure... but then again the ioctl sniffer is already deeply intertwined with ld configuration because it already overrides the `ioctl` libc function, so this doesn't seem like too big of a stretch. This change makes the ffmpeg tests succeed with `runc` on COS, but they still fail with gVisor (with `CUDA_ERROR_OUT_OF_MEMORY` errors). So there must be some further gVisor-specific error. Updates #11351 Updates #11321 PiperOrigin-RevId: 715222952	2025-01-13 21:13:14 -08:00
Etienne Perot	501618dbe2	Modify `make` rules to allow using a fully-local build cache. This includes the ability to force using local images only (do not check for updated manifests), and to explicitly mount and specify external caches for Go repositories via `rules_go`'s `GO_REPOSITORY_USE_HOST_MODCACHE`. PiperOrigin-RevId: 713473497	2025-01-09 12:23:53 -08:00
Ayush Ranjan	2edcee37d9	Specify all capabilities in TestGPUCheckpointRestore. This test was broken by `5e6589e0b7` ("Update CUDA test compatibility to keep up with added gVisor support.") which requires all images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities. PiperOrigin-RevId: 713170590	2025-01-07 23:28:03 -08:00
Etienne Perot	5e6589e0b7	Update CUDA test compatibility to keep up with added gVisor support. These CUDA tests were initially broken in gVisor but now appear to pass. The test now also verifies that all capabilities are enabled when running. PiperOrigin-RevId: 713094806	2025-01-07 17:28:00 -08:00
gVisor bot	680df82012	Internal change PiperOrigin-RevId: 698529562	2024-11-20 14:53:44 -08:00
Etienne Perot	abe38d82ac	Docker GPU tests: Use the sniffer on `exec`'d commands too. Without this, tests that `exec` commands into existing GPU containers (e.g. the CUDA samples test) would not actually get ioctl compatibility enforcement. Updates issue #10885. PiperOrigin-RevId: 686686310	2024-10-16 16:48:27 -07:00
Etienne Perot	d299b3998c	Sniffer: Exit instantly on unknown ioctls in compatibility enforcement mode. This makes it easier to deal with client/server-type GPU tests, such as ollama or vLLM, where the main GPU process is a long-running one. Prior to this CL, when enforcement mode was set on such a process, compatibility was only checked if the test added explicit code to shut down the container in an orderly fashion by sending a signal to the server process, giving the ioctl sniffer a chance to produce its report. This is easy to miss, because there is no feedback that would suggest that the ioctl sniffer isn't being respected in such tests. By exiting instantly as soon as an unsupported ioctl is found, such code doesn't need to be added. However, the previous behavior is still useful when testing new applications, so it is still available as well. The `--enforce_compatibility` flag is changed to a tri-state flag that can be turned off, enforce compatibility on the spot, or enforce it at exit time. Updates issue #10885. PiperOrigin-RevId: 686653876	2024-10-16 15:03:45 -07:00
Koichi Shiraishi	0cf77c02f8	all: remove use of deprecated io/ioutil package & fix some deprecated things Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>	2024-10-10 20:36:24 +09:00
Etienne Perot	1ea84d6db0	Add test that runs `runsc do` inside a non-gVisor container. This is used in contexts such as Dangerzone: https://gvisor.dev/blog/2024/09/23/safe-ride-into-the-dangerzone/ Updates issue #10944. PiperOrigin-RevId: 682454284	2024-10-04 14:40:07 -07:00
Ayush Ranjan	e3aa1bf7dd	Disable nogo for pkg/test/criutil:criutil and pkg/test/dockerutil:profile_test. PiperOrigin-RevId: 674352028	2024-09-13 10:40:44 -07:00
Etienne Perot	1e97c039bf	Automated rollback of changelist 673541771 PiperOrigin-RevId: 673651019	2024-09-11 20:42:21 -07:00
Etienne Perot	64de876102	Do not embed the `run_sniffer` binary in the `dockerutil` library. This is causing nogo test failures. Use a `data` dependency instead. Updates #10885 PiperOrigin-RevId: 673541771	2024-09-11 14:42:34 -07:00
Etienne Perot	a689c11a76	Integrate GPU `ioctl` sniffer in GPU tests. This wraps all GPU tests' command line with the nvproxy ioctl sniffer. This has multiple functions: - Verifying that the application does not call ioctls unsupported by nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which is initially set to `true` in all tests to mirror current behavior, but should be flipped as we verify that they do not call unsupported ioctls. - Verifying that the sniffer itself works transparently for a wide range of applications. - Later down the line, enforcing that the application only calls ioctls that are part of GPU capabilities that it has a need for. This is controlled by a capability string which is currently only used to set the `NVIDIA_DRIVER_CAPABILITIES` environment variable. Updates issue #10856 PiperOrigin-RevId: 672714520	2024-09-09 16:34:19 -07:00
gVisor bot	3c4b246cf2	Fix printf violations inside of the gvisor code Recently printf.Analyzer has become stricter (https://github.com/golang/go/issues/60529) which led to new findings. gvisor nogo tests run this analyzer and fail if it produces findings. PiperOrigin-RevId: 671657227	2024-09-06 00:45:23 -07:00
Etienne Perot	c1661e7c84	Provide more helpful error messages when profiling is misconfigured. This forwards the output of `runsc debug` to stderr if it fails during a container run. Additionally, for runs with profiling enabled, it checks the runtime arguments and prints an error if the `--profile` flag is not found in it. Profiling is also disabled by default on benchmarks now. This forces the user to be explicit about where benchmarks are stored, which is less confusing than the current behavior of empty output with no explanation as to where the profiles are. Fixes #10433 PiperOrigin-RevId: 642132089	2024-06-10 22:02:12 -07:00
Etienne Perot	1d800dc14b	Set default test container log config to allow higher logging volume. This is part of a series of changes to add metric charts in performance benchmarks. This is intended to increase the reliability of logs-based profiling metrics by ensuring that Docker keeps enough of them around during long tests, or tests where there are a lot of metrics being profiled, or tests with a very fast profiling rate, or all of the above. This also calls `sync` on the underlying logging file descriptor if possible. Give a more helpful error message if a log buffer overrun does occur. PiperOrigin-RevId: 633389921	2024-05-13 18:07:57 -07:00
Etienne Perot	4810afc36c	GPU support: Add NVIDIA CUDA sample tests. This is a set of CUDA tests defined by NVIDIA in this repository: https://github.com/NVIDIA/cuda-samples This change introduces a large new test (`cuda_test`) which runs each CUDA sample test in a container. There are many subtleties involved due to how the CUDA samples repository isn't always meant to be run as a test, some of it involves graphical applications, and a lot of them require this or that CUDA feature which not all NVIDIA GPUs support, some require multiple GPUs to be on the machine, etc. Therefore, the test maps each test to their `Compatibility` data which determines whether or not a sample test is expected to fail when run in a certain environment. The overall test also has a `--cuda_verify_compatibility` flag to verify the veracity of this mapping, by running expected-to-be-broken tests and verifying that their failure matches how this expected failure typically manifests. Because there are a lot of CUDA sample tests (213 as of the CUDA 12.3 release of the cuda-samples repo), and they don't all require the whole GPU to themselves, and spawning a GPU-using container is expensive (~seconds), the test uses a pool of reusable containers in which it `exec`s tests (at most one per container at any given time, but in parallel across containers). If any test unexpectedly fails, we drain the entire pool of containers and only run this one test without anything else running on other containers. This de-flakes tests, especially those that fail because they require more resources than the GPU has when other tests are using it at the same time. However, this removes parallelism and therefore increases test time significantly. Despite these optimizations, the test is very long and has the maximum deadline of 1 hour. 
Because it may get close to the timeout (especially when `--cuda_verify_compatibility` is on, because that needs to run all the tests even if they are known to fail, and the ones that do fail can hang for a while rather than crash), the test also has more logging and debugging than the typical test, as enabled with the `--cuda_log_successful_tests` and `--cuda_test_debug` flags. It periodically logs the status of each container and the pool's utilization ratio. This is useful when debugging the test to see which pooled container is doing what, and/or to `docker exec` into containers while they are running a certain test. The test also does its own timekeeping, which is useful so that it can print a more helpful failure message that distinguishes between tests failing due to actual failure reasons vs those that are failing purely because the test timed out. To run the test manually (from a VM with the repo checked out): ``` $ docker build -t gvisor.dev/images/gpu/cuda-tests images/gpu/cuda-tests -f images/gpu/cuda-tests/Dockerfile.x86_64 && mkdir -p bin && make copy TARGETS=runsc DESTINATION=bin/ && ./bin/runsc install -- --nvproxy=true --debug=true --debug-log=/tmp/runsc/ && systemctl reload docker && make test TEST_OPTIONS='--test_output=streamed --verbose_failures=true' TARGETS=//test/gpu:cuda_test OPTIONS='--test_env=RUNTIME=runsc --test_arg=--cuda_test_debug=true --test_arg=--cuda_verify_compatibility=true --test_arg=--cuda_log_successful_tests=true' ``` PiperOrigin-RevId: 626528982	2024-04-19 19:13:42 -07:00
Etienne Perot	cc8c584508	`dockerutil.ContainerPool`: Add debugging and utilization information. `ContainerPool` is useful to make sure tests can productively use the CPU by using pre-spun containers in parallel. While the test is running, it is difficult to know which containers are doing what, and how effective the container pool actually is. This change adds state tracking to each container to make this process easier. Each container has a state and a user-assignable label (e.g. the test name) to indicate what it is doing. Then, the pool now has a `String` function to print what each container is doing. This is useful for CUDA tests which are very close to the test runtime limit of one hour, in order to see how they can better spend their time, and to know which container to `exec` into when debugging. PiperOrigin-RevId: 626501528	2024-04-19 16:50:13 -07:00
Etienne Perot	74b903782c	dockerutil: Return exit code in `Container.Exec`. This allows tests to consider the exit code of `exec`'d processes. This is useful for CUDA tests because they exit with a specific error code (`EXIT_WAIVED`) when they run on GPUs that don't support the features the test needs. PiperOrigin-RevId: 626478594	2024-04-19 15:04:08 -07:00
Etienne Perot	7c4d57fbe6	`dockerutil`: Add `IsGVisorRuntime` helper function. This is useful for CUDA tests, some of which work in gVisor and some not. By checking whether the runtime in use is gVisor, the test can adjust its expectations of success/failure. PiperOrigin-RevId: 626443132	2024-04-19 12:45:49 -07:00
Nayana Bidari	87d8df37c7	Enable save/checkpoint resume with runsc checkpoint command. Enables save resume with checkpoint command. Previously when --leave-running was set, the sandbox was destroyed after the checkpoint and restored with the same id. With this change the sandbox will not be destroyed and resumes running after the checkpoint. PiperOrigin-RevId: 623282685	2024-04-09 14:34:50 -07:00
Etienne Perot	08ed01b285	`dockerutil`: Implement `ContainerPool`, a pool of reusable test containers. Callers may request a container from the pool, and must release it back when they are done with it. This is useful for large tests which can `exec` individual test cases inside the same set of reusable containers, to avoid the cost of creating and destroying containers for each test. It also supports reserving the whole pool ("exclusive"), which locks out all other callers from getting any other container from the pool. This is useful for tests where running in parallel may induce unexpected errors which running serially would not cause. This allows the test to first try to run in parallel, and then re-run failing tests exclusively to make sure their failure is not due to parallel execution. I plan to use this for CUDA sample tests. PiperOrigin-RevId: 619377512	2024-03-26 18:53:14 -07:00
Etienne Perot	641a1a56b8	`dockerutil.GPURunOpts`: Expose all GPUs to containers, not just the first. PiperOrigin-RevId: 617956471	2024-03-21 14:05:10 -07:00
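
Commit 08ed01b285 describes a pool in which callers request a container, release it when done, and can optionally reserve the whole pool ("exclusive") so that a failing test can be re-run with nothing else running. The following is a minimal Go sketch of that get/release/exclusive pattern only. It is not the `dockerutil.ContainerPool` API: the type `Pool`, its methods `Get`, `Release`, and `GetExclusive`, and the use of plain strings in place of containers are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool is a hypothetical stand-in for a pool of reusable test containers.
// Containers are represented by their names only.
type Pool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	idle      []string // containers not currently in use
	size      int      // total number of containers in the pool
	exclusive bool     // true while one caller has reserved the whole pool
}

// NewPool builds a pool from the given (assumed pre-started) containers.
func NewPool(names ...string) *Pool {
	p := &Pool{idle: append([]string(nil), names...), size: len(names)}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Get blocks until a container is idle and nobody holds the pool exclusively.
func (p *Pool) Get() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.exclusive || len(p.idle) == 0 {
		p.cond.Wait()
	}
	c := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	return c
}

// Release returns a container obtained from Get back to the pool.
func (p *Pool) Release(c string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, c)
	p.cond.Broadcast()
}

// GetExclusive locks out new Get callers, waits for every container to be
// released, and returns one container plus a function that ends the
// exclusive reservation.
func (p *Pool) GetExclusive() (string, func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.exclusive {
		p.cond.Wait()
	}
	p.exclusive = true // new Get callers now block
	for len(p.idle) != p.size {
		p.cond.Wait() // drain: wait for in-flight users to Release
	}
	c := p.idle[0] // stays in idle; the exclusive flag protects it
	return c, func() {
		p.mu.Lock()
		defer p.mu.Unlock()
		p.exclusive = false
		p.cond.Broadcast()
	}
}

func main() {
	pool := NewPool("c1", "c2")

	// Parallel phase: each test case borrows one container.
	a, b := pool.Get(), pool.Get()
	fmt.Println("parallel run on", a, "and", b)
	pool.Release(a)
	pool.Release(b)

	// Retry phase: re-run a failing case with the whole pool reserved.
	c, done := pool.GetExclusive()
	fmt.Println("exclusive re-run on", c)
	done()
}
```

A test following the flow in the commit message would first run every case through `Get`/`Release` in parallel, then re-run only the failures through `GetExclusive`, which trades parallelism for isolation.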


75 Commits