Ubuntu TPU images do not have the vfio-dev directories that COS images do,
so we need a more robust way of setting up the sandbox chroot to handle this
case. This change implements a way to get devices and minor numbers into the
sandbox with minimal support from the host filesystem and cleans up a few
methods to reflect their current usage.
Addresses #10795
PiperOrigin-RevId: 674363342
This is required for Open MPI => hwloc to detect the number of CPUs and set the
number of Open MPI slots accordingly.
Before this CL, on amd64:
```
root@0496c77e84e9:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
96
root@0496c77e84e9:/# mpirun --allow-run-as-root -np 96 /bin/true
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
[0496c77e84e9:00667] OPAL ERROR: Not supported in file ../../../../../opal/mca/hwloc/base/hwloc_base_util.c at line 418
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
topology discovery failed
--> Returned value Not supported (-8) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
```
Before this CL, on arm64:
```
root@bfad6b71e996:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
4
root@bfad6b71e996:/# mpirun --allow-run-as-root -np 4 /bin/true
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
```
After this CL, on amd64:
```
root@0512ce557005:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
96
root@0512ce557005:/# mpirun --allow-run-as-root -np 96 /bin/true
root@0512ce557005:/# mpirun --allow-run-as-root -np 97 /bin/true
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 97
slots that were requested by the application:
...
```
After this CL, on arm64:
```
root@faf6ff491bc5:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
4
root@faf6ff491bc5:/# mpirun --allow-run-as-root -np 4 /bin/true
root@faf6ff491bc5:/# mpirun --allow-run-as-root -np 5 /bin/true
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 5
slots that were requested by the application:
...
```
Per #10484, this also lets `nvidia-smi topo` make more progress:
```
root@d5a696bb7e2c:/# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
Failed to run topology matrix
```
PiperOrigin-RevId: 663395602
V5 support was broken in a few different ways:
- TPU device files (/dev/vfio/X) were created with incorrect minor device numbers.
- PCI paths on buses other than 0000:00 were not supported.
- VFIO unmap was broken and not properly added to the seccomp allowlist.
- The VFIO main device file (/dev/vfio/vfio) did not account for overlapping
device address ranges that correspond to different VFIO container groups.
Previously TPU support was tested on machines with single TPUs, which masked
most of these issues.
All these issues should be fixed by this change. Tested manually on GKE.
PiperOrigin-RevId: 656037853
The symlinks mirror the relationship between TPU devices and their respective
IOMMU groups on the host.
Before this change, iommu_group was treated as a regular file, so no such
symlink was created for the devices.
PiperOrigin-RevId: 597939650
- Added `fsutil.ForEachDirent()` (used by directfs).
- Added `fsutil.DirentNames()`, which uses `ForEachDirent()`. Used by sysfs
and in the future will be used by nvproxy/tpuproxy.
Separately, fixed some bugs in sysfs:
- We were leaking FDs in `hostDirEntries()`.
- We were leaking an FD in `hostFile.Generate()` on the error path.
- Got rid of `hostFileBufSize` users. They assumed file contents fit in 4096
bytes. Instead made the functions agnostic of file size.
PiperOrigin-RevId: 580688250
The TPU userspace driver needs access to specific PCI device information
located in Linux sysfs. We mirror the sysfs paths the driver reads on the host
in the Sentry sysfs. This way we can ensure we only expose the host device
information that's strictly necessary for TPU to run.
PiperOrigin-RevId: 550005271
These inodes can never be part of a filesystem tree. They are nameless and
never have a parent.
This allows us to avoid taking a lock in kernfs.InotifyWithParent for such
anonymous inodes.
PiperOrigin-RevId: 538823227
This catches up the interface to the `EmitUnimplementedEvent` method signature
on `kernel.Kernel`.
Also add a build-time test to verify that `kernel.Kernel` implements this
interface, in order to catch such breakages at build time in the future.
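One common Go idiom for this kind of check is a blank-identifier assignment, which fails compilation if the type stops satisfying the interface. The sketch below shows the idiom only; the interface name, the `Kernel` stand-in, and the method signature are hypothetical and do not reflect the real `kernel.Kernel` API or how this change implements the check.

```go
package main

import "fmt"

// UnimplementedEventEmitter is a hypothetical stand-in for the interface
// discussed above. The real method signature may differ.
type UnimplementedEventEmitter interface {
	EmitUnimplementedEvent(sysno uintptr)
}

// Kernel is a hypothetical stand-in for kernel.Kernel.
type Kernel struct {
	events []uintptr
}

// EmitUnimplementedEvent records the syscall number of the event.
func (k *Kernel) EmitUnimplementedEvent(sysno uintptr) {
	k.events = append(k.events, sysno)
}

// Build-time assertion: this line stops compiling if *Kernel no longer
// implements UnimplementedEventEmitter. It has no runtime cost.
var _ UnimplementedEventEmitter = (*Kernel)(nil)

func main() {
	k := &Kernel{}
	var e UnimplementedEventEmitter = k
	e.EmitUnimplementedEvent(42)
	fmt.Println("recorded events:", len(k.events))
}
```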
PiperOrigin-RevId: 519000411
Some implementations handle more flags than others, so it doesn't
make sense to have one set of rules for all.
This change should functionally be a no-op.
PiperOrigin-RevId: 502712415