Ayush Ranjan a5e93550c1 Move GPU device ownership to gofer process.
Tested on a T4 GPU with driver version 525.60.13:
```
$ docker run --runtime=runsc --gpus=all --rm -it nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Also tested on GKE with the same vectoradd workload, and verified that the
device gofer connection is actually closed when the container is deleted.
Note that the gofer logs for the GPU container sometimes end abruptly (the
"All lisafs servers exited." line is not printed). This is because
runsc/container/container.go:stop() SIGKILLs the gofer before it can clean up
naturally. The device gofer connection is only closed at the end of
Loader.destroySubcontainer(), which leaves little time before the gofer is
SIGKILL-ed.

PiperOrigin-RevId: 581365665
2023-11-10 14:20:31 -08:00

```
load("//tools:defs.bzl", "go_library")

package(default_applicable_licenses = ["//:license"])

licenses(["notice"])

go_library(
    name = "devutil",
    srcs = [
        "context.go",
        "devutil.go",
    ],
    visibility = ["//visibility:public"],
    deps = [
        "//pkg/context",
        "//pkg/fsutil",
        "//pkg/lisafs",
        "//pkg/log",
        "//pkg/unet",
        "@org_golang_x_sys//unix:go_default_library",
    ],
)
```