57 Commits

Author SHA1 Message Date
Zach Koopmans 4ce00d28f6 Fix broken cuda tests
PiperOrigin-RevId: 734342887
2025-03-06 17:11:19 -08:00
Ayush Ranjan e699298d58 Add Container.RestoreInTest() to handle known Docker bugs.
This can be used by all test users. Avoids duplicated code. We can handle all
known issues in one place.

There is a Docker bug which causes restore to fail sporadically. See
https://github.com/moby/moby/issues/42900. This has been broken at least since
Docker v19.03.12 (when the issue was reported) and was fixed in v25.0.4. Added
the handling for this issue.

Also got rid of the testutil.Poll() around restore. That can hide gVisor
restore flakiness issues. That was added in 0990ef7517 ("Make
checkpoint/restore e2e test less flaky"). The original sleep has been restored.

PiperOrigin-RevId: 734303878
2025-03-06 15:13:16 -08:00
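
The version range named in e699298d58 above (restore reported broken in Docker v19.03.12, fixed in v25.0.4) can be turned into a simple gate. The following is a minimal sketch, not the code from that commit: `parseVersion` and `mayHitRestoreBug` are hypothetical names, and it assumes plain "major.minor.patch" version strings. Because the message says "at least since" v19.03.12, the sketch conservatively treats every version below v25.0.4 as possibly affected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a "major.minor.patch" string (optional leading "v")
// into three integers. Suffixed versions such as "1.2.3-rc1" are rejected.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("unexpected version format %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("bad component %q in %q: %w", p, s, err)
		}
		v[i] = n
	}
	return v, nil
}

// mayHitRestoreBug (hypothetical helper) reports whether the given Docker
// version predates v25.0.4, the release the commit message names as fixed.
func mayHitRestoreBug(version string) (bool, error) {
	fixed := [3]int{25, 0, 4}
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	for i := range v {
		if v[i] != fixed[i] {
			return v[i] < fixed[i], nil
		}
	}
	return false, nil // exactly the fixed version
}
```

A test helper could call `mayHitRestoreBug` with the daemon's reported version and tolerate the known sporadic failure only when it returns true, so that genuine gVisor restore flakiness still surfaces on fixed Docker versions.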
gVisor bot 7d4b4bd634 Merge pull request #11495 from zkoopmans:cuda
PiperOrigin-RevId: 733445831
2025-03-04 13:35:08 -08:00
zkoopmans 725669a152 Update cuda-tests for ARM workloads
Add image for ARM workloads for cuda-tests and mark tests that work on ARM.

Most tests don't work due to cross-compiling between sbsa and aarch64.
However, a few do. Add an image to support them.
2025-02-27 21:55:38 +00:00
Etienne Perot 4ba931dd22 CUDA test compatibility: Remove p2pBandwidthLatencyTest special case.
While this test does use the P2P capability, it also has fallback code to
do regular memcopies when P2P is not available. So it does not require the
P2P capability.

PiperOrigin-RevId: 714344567
2025-01-10 22:34:39 -08:00
gVisor bot c5c74f4c90 Merge pull request #11321 from 2022tgoel:nvproxy_video_cap
PiperOrigin-RevId: 713755984
2025-01-09 12:27:22 -08:00
2022tgoel 7399a32b4c Add GPU video codecs support to nvproxy (so that tools like ffmpeg work)
adding cap

tests work

fix merge

unit test

small fixes

additional ioctls for L4 gpu
2025-01-09 01:09:47 +00:00
Zach Koopmans 9803629d92 Don't fail compatibility test when COS versions are not yet released.
COS can take a day or two to release a new version. While this is happening,
versions can appear in gcloud projects but not on the public site. In this
case, pass the tests so that we don't have failures while a release is in progress.

PiperOrigin-RevId: 713388040
2025-01-08 13:02:49 -08:00
Ayush Ranjan 2edcee37d9 Specify all capabilities in TestGPUCheckpointRestore.
This test was broken by 5e6589e0b7 ("Update CUDA test compatibility to keep
up with added gVisor support.") which requires all
images/gpu/cuda-tests/run_sample.go users to specify "all" driver capabilities.

PiperOrigin-RevId: 713170590
2025-01-07 23:28:03 -08:00
Etienne Perot 5e6589e0b7 Update CUDA test compatibility to keep up with added gVisor support.
These CUDA tests were initially broken in gVisor but now appear to pass.

The test now also verifies that all capabilities are enabled when running.

PiperOrigin-RevId: 713094806
2025-01-07 17:28:00 -08:00
Etienne Perot 232c17cbb6 ollama benchmark: Add embedding benchmark, refresh set of models.
This refreshes the set of models built into the image to a more diverse
set of models while keeping the same categories covered.

It also adds support for embedding generation and benchmark metrics for
embedding-type models.

The image is also (slightly) smaller which helps make benchmarks not take
forever.

PiperOrigin-RevId: 706594537
2024-12-15 23:53:52 -08:00
Zach Koopmans 23c8b4b042 Add test to check COS drivers as they are posted.
Our current check of COS drivers often lags behind COS releases.
This is due to needing to preload GPU docker images onto the
images that run in our CI pipelines.

In addition, COS is more complex than originally thought: it releases
driver versions across both GPU types and release branches.

Thus, this test searches the latest COS images on each family for
new drivers. It does this by looking at COS's published release notes
which include a proto of LATEST/DEFAULT drivers selected for each device.

This will flag new versions faster and with more coverage than our
CI pipeline currently does. Because the test does not need a GPU,
it can run on any VM.

PiperOrigin-RevId: 693736100
2024-11-06 08:30:38 -08:00
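
The check described in 23c8b4b042 above amounts to comparing the driver versions COS publishes against the set already supported. A minimal sketch of that comparison follows, assuming both sides are already available as plain string slices; `unsupportedDrivers` is a hypothetical name, and fetching and parsing the COS release notes is not shown.

```go
package main

import "sort"

// unsupportedDrivers (hypothetical helper) returns the published driver
// versions that are absent from supported, deduplicated and sorted.
func unsupportedDrivers(published, supported []string) []string {
	known := make(map[string]bool, len(supported))
	for _, v := range supported {
		known[v] = true
	}
	seen := make(map[string]bool)
	var missing []string
	for _, v := range published {
		if !known[v] && !seen[v] {
			seen[v] = true
			missing = append(missing, v)
		}
	}
	sort.Strings(missing)
	return missing
}
```

A non-empty result would be the signal to flag a new driver. Since the comparison is pure string handling, it is consistent with the commit's point that no GPU is needed to run it.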
Etienne Perot b0042bfc5b Parallelize CUDA test and run it continuously.
Tested in presubmit mode:
https://buildkite.com/gvisor/pipeline/builds/32983#0192e4d7-2a41-4505-8eea-422477b07644

PiperOrigin-RevId: 692198767
2024-11-01 09:17:14 -07:00
Etienne Perot 0c4a709bc1 Run CUDA tests as part of GPU tests.
Attempt #2.

This runs in continuous mode only.

PiperOrigin-RevId: 691516066
2024-10-30 12:39:30 -07:00
Etienne Perot 74f6136c45 Automated rollback of changelist 688755446
PiperOrigin-RevId: 688831563
2024-10-22 23:24:30 -07:00
Etienne Perot 7a0401fb8e Run CUDA tests as part of GPU tests.
This runs in continuous mode only.

PiperOrigin-RevId: 688755446
2024-10-22 18:04:35 -07:00
Etienne Perot e13cf36ad7 Update all GPU tests to use the ioctl sniffer.
Fixes issue #10885.

PiperOrigin-RevId: 688728104
2024-10-22 16:21:47 -07:00
Nathan Wang 229d01f0d4 Add ffmpeg nvdec test
2024-10-07 12:16:02 +00:00
Etienne Perot b89f53b2ce Add ffmpeg GPU test with h264_nvenc video codec (which uses NVENC).
This test does NOT work yet in gVisor.

Updates #9452

PiperOrigin-RevId: 682127429
2024-10-03 19:33:50 -07:00
Etienne Perot 0fcb9b7f2e Disable GPU sniffer on all non-smoke GPU tests.
The presubmit pipeline does not exercise these, so commit
a689c11a76 broke them.

This change disables the sniffer on all the non-smoke tests to unbreak
the release pipeline. I will then send another change to re-enable them
on tests where the sniffer works fine after manual testing.

Updates #10885

PiperOrigin-RevId: 673555026
2024-09-11 15:17:58 -07:00
Etienne Perot a689c11a76 Integrate GPU ioctl sniffer in GPU tests.
This wraps all GPU tests' command line with the nvproxy ioctl sniffer.

This has multiple functions:

- Verifying that the application does not call ioctls unsupported by
  nvproxy. This is controlled by an `AllowIncompatibleIoctl` option, which
  is initially set to `true` in all tests to mirror current behavior, but
  should be flipped as we verify that they do not call unsupported ioctls.
- Verifying that the sniffer itself works transparently for a wide range
  of applications.
- Later down the line, enforcing that the application only calls ioctls
  that are part of GPU capabilities that it has a need for. This is
  controlled by a capability string which is currently only used to set
  the `NVIDIA_DRIVER_CAPABILITIES` environment variable.

Updates issue #10856

PiperOrigin-RevId: 672714520
2024-09-09 16:34:19 -07:00
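
The wrapping described in a689c11a76 above can be pictured as prepending the sniffer to each test's command line and deriving the environment from the capability string. The sketch below is illustrative only: `SnifferOptions`, `wrapCommand`, and the `--placeholder-allow-incompatible` flag are hypothetical and do not reflect the real sniffer's command-line interface. Only the `AllowIncompatibleIoctl` option name and the `NVIDIA_DRIVER_CAPABILITIES` variable come from the commit message.

```go
package main

import "fmt"

// SnifferOptions is a hypothetical stand-in for the per-test options
// mentioned in the commit message.
type SnifferOptions struct {
	// AllowIncompatibleIoctl mirrors the option named in the commit; the
	// message says it starts out true in all tests.
	AllowIncompatibleIoctl bool
	// Capabilities is the capability string, e.g. "all".
	Capabilities string
}

// wrapCommand (hypothetical) prefixes cmd with the sniffer binary and
// returns the environment entries the container should receive.
func wrapCommand(snifferBin string, opts SnifferOptions, cmd []string) (argv, env []string) {
	argv = append(argv, snifferBin)
	// Placeholder flag name: an assumption for illustration, not the real CLI.
	argv = append(argv, fmt.Sprintf("--placeholder-allow-incompatible=%t", opts.AllowIncompatibleIoctl))
	argv = append(argv, cmd...)
	// Per the commit message, the capability string currently only sets
	// this environment variable.
	env = []string{"NVIDIA_DRIVER_CAPABILITIES=" + opts.Capabilities}
	return argv, env
}
```

Keeping the wrapping in one function means flipping `AllowIncompatibleIoctl` to false for a given test is a one-field change once that test is verified not to call unsupported ioctls.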
Etienne Perot 043ce9c5d2 Fix sniffer_test by embedding the run_sniffer binary in it.
This works around file path resolution issues.

PiperOrigin-RevId: 663985469
2024-08-16 20:56:29 -07:00
Anthony Cui 6199fc8395 Fix sniffer_test to work.
Previously, run_sniffer was not correctly reporting when unsupported ioctls
were found with the compatibility flag set. At the same time, the sniffer test
was not correctly testing a supported CUDA program, since it was using
run_sample which is currently broken with the sniffer.

This also adds the sniffer test to the list of gpu tests.

PiperOrigin-RevId: 663911778
2024-08-16 16:34:11 -07:00
Anthony Cui 1294157cbe Create named subtests for each NCCL test
PiperOrigin-RevId: 651601366
2024-07-11 18:39:59 -07:00
Anthony Cui ab513ff9bb Add NCCL tests as a regression test.
PiperOrigin-RevId: 651541375
2024-07-11 14:48:31 -07:00