5 Commits

Author SHA1 Message Date
Jamie Liu d08e4a850b Internal change
PiperOrigin-RevId: 631677570
2024-05-08 00:09:42 -07:00
Ayush Ranjan db6ee959df Add GoferClientProvider to devutil.
Also introduce CtxDevGoferClientProvider, which is provided in the restore context.

PiperOrigin-RevId: 631237657
2024-05-06 17:37:13 -07:00
Ayush Ranjan 6e61813c1b Save container name in nvproxy FDs.
PiperOrigin-RevId: 630626291
2024-05-04 03:10:56 -07:00
Ayush Ranjan a5e93550c1 Move GPU device ownership to gofer process.
Tested on a T4 GPU with driver version 525.60.13:
```
$ docker run --runtime=runsc --gpus=all --rm -it nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Also tested this on GKE with the same vectoradd workload. Checked that the
device gofer connection is actually being closed when the container is deleted.
Note that the gofer logs for the GPU container sometimes end
abruptly (the "All lisafs servers exited." line is not printed). This is
because runsc/container/container.go:stop() SIGKILLs the gofer before it can
clean up naturally. The device gofer connection is only closed at the end of
Loader.destroySubcontainer(), which leaves little time before the gofer is
SIGKILLed.
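
The timing issue described above can be illustrated with a toy model. This is only a sketch, not gVisor code: the function name `goferLogTail`, its millisecond parameters, and the first log line are hypothetical. Only the "All lisafs servers exited." line comes from the message above.

```go
package main

import "fmt"

// goferLogTail is a hypothetical model of the final gofer log lines.
// killDelayMs is the time between the device gofer connection closing and
// SIGKILL arriving; cleanupMs is the time the gofer needs to shut down its
// servers after noticing the closed connection.
func goferLogTail(killDelayMs, cleanupMs int) []string {
	// Illustrative log line; not an actual gofer log message.
	lines := []string{"device gofer connection closed"}
	// The final line is only printed if cleanup finishes before SIGKILL.
	if cleanupMs <= killDelayMs {
		lines = append(lines, "All lisafs servers exited.")
	}
	return lines
}

func main() {
	// SIGKILL arrives before cleanup completes: the log ends abruptly.
	fmt.Println(goferLogTail(1, 5))
	// SIGKILL arrives after cleanup completes: the final line is present.
	fmt.Println(goferLogTail(10, 5))
}
```

In this model the log is truncated whenever the kill arrives sooner than the cleanup can finish, which matches the observation that closing the connection late in teardown leaves little room for a natural exit.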

PiperOrigin-RevId: 581365665
2023-11-10 14:20:31 -08:00
Ayush Ranjan cf9d55bb6e Add device gofer connection.
Adds a gofer connection for the /dev directory on the gofer when GPU functionality
is requested. This gofer connection is currently unused. The gofer client is
owned by the kernel, which injects the connection into the context. The gofer
connection is closed on container exit. Save/restore (S/R) should be supported with this.
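
The ownership pattern described above (the kernel owns the client, injects it into the context, and closes it on container exit) could be sketched roughly as follows. This is a minimal illustration using Go's standard `context` package, not gVisor's own context package. All type and function names here (`kernel`, `devGoferClient`, `ContextFor`, and so on) are hypothetical and are not the actual gVisor API.

```go
package main

import (
	"context"
	"fmt"
)

// devGoferClient is a hypothetical stand-in for the /dev gofer client.
type devGoferClient struct {
	containerName string
	closed        bool
}

// Close marks the connection as closed.
func (c *devGoferClient) Close() { c.closed = true }

// ctxKey is an unexported key type, so other packages cannot collide with it.
type ctxKey struct{}

// kernel is a hypothetical owner of per-container device gofer clients.
type kernel struct {
	clients map[string]*devGoferClient
}

func newKernel() *kernel {
	return &kernel{clients: make(map[string]*devGoferClient)}
}

// AddDevGofer registers a client for the named container.
func (k *kernel) AddDevGofer(name string) {
	k.clients[name] = &devGoferClient{containerName: name}
}

// ContextFor returns a context carrying the named container's client.
func (k *kernel) ContextFor(name string) context.Context {
	return context.WithValue(context.Background(), ctxKey{}, k.clients[name])
}

// RemoveDevGofer closes and forgets the client, as on container exit.
func (k *kernel) RemoveDevGofer(name string) {
	if c, ok := k.clients[name]; ok {
		c.Close()
		delete(k.clients, name)
	}
}

// devGoferFromContext retrieves the injected client, or nil if absent.
func devGoferFromContext(ctx context.Context) *devGoferClient {
	c, _ := ctx.Value(ctxKey{}).(*devGoferClient)
	return c
}

func main() {
	k := newKernel()
	k.AddDevGofer("gpu-container")
	client := devGoferFromContext(k.ContextFor("gpu-container"))
	fmt.Println(client.containerName, client.closed)
	k.RemoveDevGofer("gpu-container") // container exit closes the connection
	fmt.Println(client.closed)
}
```

Consumers only ever reach the client through the context, so the owner stays free to close it at container exit without callers holding their own lifecycle logic.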

PiperOrigin-RevId: 581298536
2023-11-10 10:22:54 -08:00