14 Commits

Author SHA1 Message Date
Ayush Ranjan 138e98fb7d nvproxy: Refactor DriverVersion out to nvconf package.
This allows for runsc to be able to use DriverVersion without having to depend
on the entirety of nvproxy.

PiperOrigin-RevId: 733912696
2025-03-05 16:43:03 -08:00
Etienne Perot 5fdedb5276 Parallelize BuildKite "All GPU drivers test".
Each shard takes on a subset of the supported driver versions, as determined
by a counter.

Also sort the order in which the list of supported versions is written out
so that console output isn't confusing.
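
A minimal sketch of counter-based sharding as described above, in Go. The function name `shardVersions`, the round-robin rule, and the sample version list are illustrative assumptions, not the code in the actual pipeline script. Plain string sorting is used for brevity; it is not a true numeric version ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// shardVersions returns the versions handled by shard shardIndex out of
// shardCount shards. The list is sorted first so that every shard iterates
// in the same order, then versions are assigned round-robin by a counter.
// Hypothetical helper: the real script may shard differently.
func shardVersions(versions []string, shardIndex, shardCount int) []string {
	sorted := append([]string(nil), versions...) // copy; leave the input untouched
	sort.Strings(sorted)
	var out []string
	for counter, v := range sorted {
		if counter%shardCount == shardIndex {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// Sample versions for illustration only.
	versions := []string{"550.90.07", "535.129.03", "525.60.13"}
	for shard := 0; shard < 2; shard++ {
		fmt.Printf("shard %d: %v\n", shard, shardVersions(versions, shard, 2))
	}
}
```

Because every shard sorts the same list before counting, the shards cover the full list without overlap as long as they all agree on `shardCount`.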

PiperOrigin-RevId: 706887043
2024-12-16 17:24:57 -08:00
Zach Koopmans 891ab9fd14 Add -test.v to cos gpu driver test.
PiperOrigin-RevId: 693919284
2024-11-06 17:25:19 -08:00
Zach Koopmans 23c8b4b042 Add test to check COS drivers as they are posted.
Our current check of COS drivers often lags behind COS releases.
This is due to needing to preload GPU docker images onto the
images that run in our CI pipelines.

In addition, COS is more complex than originally thought: it releases
driver versions across both GPU types and release branches.

Thus, this test searches the latest COS images on each family for
new drivers. It does this by looking at COS's published release notes
which include a proto of LATEST/DEFAULT drivers selected for each device.

This will flag new versions faster and with more coverage than our
CI pipeline currently does. Because this test does not actually need a GPU,
it can run on any VM.

PiperOrigin-RevId: 693736100
2024-11-06 08:30:38 -08:00
Jamie Liu 0608f803c8 Work around Ubuntu 22.04 kernel compiler version problem.
Before this CL (extraneous whitespace trimmed):

```
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.90.07..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
...
[nvidia-installer]:    warning: the compiler differs from the one used to build the kernel
[nvidia-installer]:      The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
[nvidia-installer]:      You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[nvidia-installer]:
[nvidia-installer]:    Warning: Compiler version check failed:
[nvidia-installer]:
[nvidia-installer]:    The major and minor number of the compiler used to
[nvidia-installer]:    compile the kernel:
[nvidia-installer]:
[nvidia-installer]:    x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38
[nvidia-installer]:
[nvidia-installer]:    does not match the compiler used here:
[nvidia-installer]:
[nvidia-installer]:    cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[nvidia-installer]:    Copyright (C) 2021 Free Software Foundation, Inc.
[nvidia-installer]:    This is free software; see the source for copying conditions.  There is NO
[nvidia-installer]:    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[nvidia-installer]:
[nvidia-installer]:
[nvidia-installer]:    It is recommended to set the CC environment variable
[nvidia-installer]:    to the compiler that was used to compile the kernel.
[nvidia-installer]:
[nvidia-installer]:    To skip the test and silence this warning message, set
[nvidia-installer]:    the IGNORE_CC_MISMATCH environment variable to "1".
[nvidia-installer]:    However, mixing compiler versions between the kernel
[nvidia-installer]:    and kernel modules can result in subtle bugs that are
[nvidia-installer]:    difficult to diagnose.
[nvidia-installer]:
[nvidia-installer]:    *** Failed CC version check. ***
...
[nvidia-installer]:    cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
[nvidia-installer]:    make[4]: *** [scripts/Makefile.build:243: /tmp/selfgz40736/NVIDIA-Linux-x86_64-550.90.07/kernel/nvidia/nv.o] Error 1
```

After this CL:

```
I1025 19:39:00.337553   62025 install_driver.go:325] getOSRelease() = map[BUG_REPORT_URL:https://bugs.launchpad.net/ubuntu/ HOME_URL:https://www.ubuntu.com/ ID:ubuntu ID_LIKE:debian NAME:Ubuntu PRETTY_NAME:Ubuntu 22.04.5 LTS PRIVACY_POLICY_URL:https://www.ubuntu.com/legal/terms-and-policies/privacy-policy SUPPORT_URL:https://help.ubuntu.com/ UBUNTU_CODENAME:jammy VERSION:22.04.5 LTS (Jammy Jellyfish) VERSION_CODENAME:jammy VERSION_ID:22.04]
I1025 19:39:00.337657   62025 install_driver.go:281] Forcing gcc-12
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.90.07..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

Fri Oct 25 19:42:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   76C    P0             31W /   70W |       1MiB /  15360MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
I1025 19:42:46.758261   62025 install_driver.go:125] Installation Complete!
```
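
The "After" log suggests the installer reads os-release data and then forces a compiler. A rough Go sketch of that idea follows. `parseOSRelease` and `compilerOverride` are hypothetical names, and the Ubuntu 22.04 condition is an assumption inferred from the log above; the real logic in install_driver.go may differ. The chosen compiler would then be passed to the NVIDIA installer through the `CC` environment variable, as the installer's own warning recommends.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseOSRelease parses KEY=VALUE lines in the os-release format into a map,
// stripping surrounding quotes from values. Hypothetical helper.
func parseOSRelease(content string) map[string]string {
	out := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[key] = strings.Trim(value, `"'`)
	}
	return out
}

// compilerOverride returns the compiler to force, or "" for the default.
// Assumption: only Ubuntu 22.04 needs gcc-12, matching the log above.
func compilerOverride(osRelease map[string]string) string {
	if osRelease["ID"] == "ubuntu" && osRelease["VERSION_ID"] == "22.04" {
		return "gcc-12"
	}
	return ""
}

func main() {
	rel := parseOSRelease("ID=ubuntu\nVERSION_ID=\"22.04\"\n")
	if cc := compilerOverride(rel); cc != "" {
		fmt.Printf("Forcing %s (would set CC=%s for the installer)\n", cc, cc)
	}
}
```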

PiperOrigin-RevId: 692271389
2024-11-01 13:18:16 -07:00
Ayush Ranjan fd6a338977 Remove --debug from GPU tests
This flag is not required. Unsupported ioctls are logged with warning level so
they will appear in the logs regardless.

PiperOrigin-RevId: 655611563
2024-07-24 10:19:47 -07:00
Ayush Ranjan 0fa8b565ac Ignore Nvidia drivers that fail to install in all_drivers_test.sh.
Some drivers fail to build on our Ubuntu VMs due to
NVIDIA/open-gpu-kernel-modules#594.

PiperOrigin-RevId: 634085739
2024-05-15 15:00:33 -07:00
Ayush Ranjan 426d08bdab Don't run GPU tests with strace.
We usually only need the "nvproxy: " logs to debug GPU workload issues. These
are already visible with --debug=true alone. Strace makes things really slow.

And consistently run "All GPU Drivers Test" with --debug=true.

PiperOrigin-RevId: 633199684
2024-05-13 07:20:17 -07:00
Etienne Perot 1643e55713 gVisor GPU installer: Support the multi-GPU case.
PiperOrigin-RevId: 629829045
2024-05-01 13:59:44 -07:00
Ayush Ranjan e89d94be48 Refactor nvproxy to expose useful API for supported driver versions.
- Replaced GetSupportedDriversAndChecksums() with ForEachSupportDriver(),
  LatestDriver() and GetDriverChecksum(). This API is more efficient:
  GetSupportedDriversAndChecksums() allocated a map of all versions and
  checksums even though most callers did not need the checksums.
- Added a validate_checksum command to tools/gpu:main, which validates that
  the abis map has valid and correct checksums. The checksum command was fixed
  to print the checksum for the provided version (as its description suggested).
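
To illustrate why an iterator plus targeted lookups can be cheaper than returning a full map, here is a toy Go sketch. The lowercase names (`forEachDriver`, `latestDriver`, `driverChecksum`), their signatures, and the placeholder checksums are all assumptions made for illustration; they are not the nvproxy signatures.

```go
package main

import "fmt"

type driver struct {
	version  string
	checksum string
}

// supported is a toy registry kept in ascending version order.
// The checksum strings are placeholders, not real values.
var supported = []driver{
	{"535.129.03", "placeholder-checksum-a"},
	{"550.90.07", "placeholder-checksum-b"},
}

// forEachDriver visits each version without allocating a map.
func forEachDriver(f func(version string)) {
	for _, d := range supported {
		f(d.version)
	}
}

// latestDriver returns the newest version in the registry.
func latestDriver() string {
	return supported[len(supported)-1].version
}

// driverChecksum looks up the checksum only for callers that need it.
func driverChecksum(version string) (string, bool) {
	for _, d := range supported {
		if d.version == version {
			return d.checksum, true
		}
	}
	return "", false
}

func main() {
	forEachDriver(func(v string) { fmt.Println("supported:", v) })
	fmt.Println("latest:", latestDriver())
	if sum, ok := driverChecksum("550.90.07"); ok {
		fmt.Println("checksum:", sum)
	}
}
```

Callers that only list versions never touch the checksums, and no intermediate map is built per call.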

PiperOrigin-RevId: 599230148
2024-01-17 11:09:44 -08:00
Etienne Perot 8042c6f3f5 Rename existing GPU tests to GPU *smoke* tests.
This is their current function, and it is useful to have smoke tests for
the gVisor release pipeline.

GPU tests that aren't of the "hello world" variety require multi-GiB
images and longer time to run, so we need to have them as separate
targets so that they can be run separately from the smoke tests.

PiperOrigin-RevId: 587187005
2023-12-01 18:32:18 -08:00
Ayush Ranjan 3183080393 Mark 535.129.03 Nvidia driver as supported.
PiperOrigin-RevId: 586710906
2023-11-30 10:13:03 -08:00
Zach Koopmans 59af1edc78 Add script to run gpu test for all supported driver versions.
Add script and subsequent calls in the buildkite pipeline to run
gpu tests on all supported drivers.

In addition, add an outfile flag to the "list" command for the
driver installer which allows us to get the list of drivers
and use it in the script (instead of a bunch of output from
make/bazel/other stuff).

PiperOrigin-RevId: 575870051
2023-10-23 11:05:11 -07:00
Zach Koopmans d6e83e2802 Add nvidia installer tool for installing NVIDIA drivers in buildkite tests.
PiperOrigin-RevId: 574371646
2023-10-17 23:20:17 -07:00