76 Commits

Author SHA1 Message Date
Lucas Manning 8482715727 Enable save/restore with TPUproxy.
This change also adds some small cleanup to TPU code.

PiperOrigin-RevId: 737673712
2025-03-17 10:55:06 -07:00
Andrei Vagin 9fcf0b5b53 proc: invalidate task inodes when tasks are destroyed
PiperOrigin-RevId: 705785809
2024-12-13 00:58:08 -08:00
Jing Chen a093ad0450 Simplify and format gVisor codebase.
The changes are just the output of `gofmt -s -w .`.
2024-10-13 00:50:32 -07:00
Lucas Manning 290789bab8 Refactor tpu chroot operations.
Ubuntu TPU images do not have the vfio-dev directories that COS images do,
so we need a more robust way of setting up the sandbox chroot to handle this
case. This change implements a way to get devices and minor numbers into the
sandbox with minimal support from the host filesystem and cleans up a few
methods to reflect their current usage.

Addresses #10795

PiperOrigin-RevId: 674363342
2024-09-13 11:15:28 -07:00
Lucas Manning 932d9dc64b Add nested PCI device support and option to read directly from host dev files.
PiperOrigin-RevId: 670751194
2024-09-03 16:51:13 -07:00
Lucas Manning 974e6dac72 Internal change.
PiperOrigin-RevId: 670667358
2024-09-03 12:48:06 -07:00
Jamie Liu 371107d57f sysfs: implement some cpu topology files
This is required for Open MPI => hwloc to detect the number of CPUs and set the
number of Open MPI slots accordingly.

Before this CL, on amd64:

```
root@0496c77e84e9:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
96
root@0496c77e84e9:/# mpirun --allow-run-as-root -np 96 /bin/true
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
[0496c77e84e9:00667] OPAL ERROR: Not supported in file ../../../../../opal/mca/hwloc/base/hwloc_base_util.c at line 418
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  topology discovery failed
  --> Returned value Not supported (-8) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
```

Before this CL, on arm64:

```
root@bfad6b71e996:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
4
root@bfad6b71e996:/# mpirun --allow-run-as-root -np 4 /bin/true
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
[hwloc/linux] failed to find sysfs cpu topology directory, aborting linux discovery.
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
```

After this CL, on amd64:

```
root@0512ce557005:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
96
root@0512ce557005:/# mpirun --allow-run-as-root -np 96 /bin/true
root@0512ce557005:/# mpirun --allow-run-as-root -np 97 /bin/true
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 97
slots that were requested by the application:
...
```

After this CL, on arm64:

```
root@faf6ff491bc5:/# ls /sys/devices/system/cpu/ | grep cpu | wc -l
4
root@faf6ff491bc5:/# mpirun --allow-run-as-root -np 4 /bin/true
root@faf6ff491bc5:/# mpirun --allow-run-as-root -np 5 /bin/true
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 5
slots that were requested by the application:
...
```

Per #10484, this also lets `nvidia-smi topo` make more progress:

```
root@d5a696bb7e2c:/# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
Failed to run topology matrix
```

PiperOrigin-RevId: 663395602
2024-08-15 12:10:36 -07:00
Lucas Manning bbbecc35cc Add support for v5pod and fix TPU v5 bugs.
V5 support was broken in a few different ways:
- TPU device files (/dev/vfio/X) were created with incorrect minor device numbers.
- PCI paths on buses other than 0000:00 were not supported.
- VFIO unmap was broken and not properly added to the seccomp allowlist.
- The VFIO main device file (/dev/vfio/vfio) did not account for overlapping
  device address ranges that correspond to different VFIO container groups.

Previously TPU support was tested on machines with single TPUs, which masked
most of these issues.

All these issues should be fixed by this change. Tested manually on GKE.

PiperOrigin-RevId: 656037853
2024-07-25 12:12:30 -07:00
Lucas Manning 67596b46a7 Fix pci device mirroring so it doesn't overwrite device directories.
PiperOrigin-RevId: 650320903
2024-07-08 11:39:54 -07:00
Kevin Krakauer c9964aa985 netstack: remove GRO from ingress flow
GRO is getting moved and updated. This removes it in preparation for a
follow-up CL.

PiperOrigin-RevId: 621984030
2024-04-04 15:17:08 -07:00
Lucas Manning 655b50cc53 Change statfs of /sys/fs/cgroup to return TMPFS_MAGIC.
PiperOrigin-RevId: 616184176
2024-03-15 11:04:19 -07:00
Jing Chen 5ee7588184 Update TPU v5 unit tests to cover VFIO devices.
PiperOrigin-RevId: 612986266
2024-03-05 15:21:27 -08:00
Lucas Manning 4596265605 Add unit test for v5 TPU sysfs proxying.
PiperOrigin-RevId: 599297513
2024-01-17 14:55:19 -08:00
Lucas Manning 588d87b40a Add a unit test for sentry sysfs PCI mirroring.
PiperOrigin-RevId: 599006437
2024-01-16 17:29:08 -08:00
Jing Chen 056b797c6b Add path /sys/kernel/iommu_groups when TPU proxy is enabled.
PiperOrigin-RevId: 598941007
2024-01-16 13:36:15 -08:00
Jing Chen 21f697000c Create IOMMU symlink(s) for TPU devices.
The symlinks mirror the relationship between TPU devices and their respective
IOMMU groups on the host.

Before this change, iommu_group was treated as a normal file, so no such
symlink was created for the devices.

PiperOrigin-RevId: 597939650
2024-01-12 13:41:20 -08:00
Jing Chen d7a3ec8305 Allow vfio related subdirectories to be mirrored by gVisor.
It expects to mirror sysfs device subdirectories like `vfio-dev` and `vfio#`.

PiperOrigin-RevId: 588855389
2023-12-07 11:29:03 -08:00
Jing Chen 063000448f Make gVisor search for all potential registered TPU devices.
gVisor creates symlinks under /sys/class when the device is present.

PiperOrigin-RevId: 588222909
2023-12-05 16:30:22 -08:00
Ayush Ranjan 612e63a7c8 Refactor dirent parsing utilities.
- Added `fsutil.ForEachDirent()` (used by directfs).
- Added `fsutil.DirentNames()`, which uses `ForEachDirent()`. Used by sysfs
  and in the future will be used by nvproxy/tpuproxy.

Separately, fixed some bugs in sysfs:
- We were leaking FDs in `hostDirEntries()`.
- We were leaking an FD in `hostFile.Generate()` on the error path.
- Got rid of `hostFileBufSize` users. They expected contents to be of 4096
  bytes only. The functions are now agnostic of file size.

PiperOrigin-RevId: 580688250
2023-11-08 15:57:42 -08:00
Andrei Vagin aa2c8c33c6 Implement setns for mount namespaces
PiperOrigin-RevId: 552859231
2023-08-01 11:12:29 -07:00
Lucas Manning 19e04218b9 Add methods for generating PCI sysfs paths and registering accel devices.
The TPU userspace driver needs access to specific PCI device information
located in Linux sysfs. We mirror the sysfs paths the driver reads on the host
in the Sentry sysfs. This way we can ensure we only expose the host device
information that's strictly necessary for TPU to run.
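
The "expose only what is strictly necessary" approach can be sketched as an allowlist filter over host directory entries. This is an illustration only: `allowedPCIFiles` and `mirrorable` are hypothetical names, and the entries in the allowlist are an assumption, not the set of files the Sentry actually mirrors.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedPCIFiles is a hypothetical allowlist of PCI sysfs attribute names.
// The real set of mirrored files is not given in this log.
var allowedPCIFiles = map[string]bool{
	"vendor":           true,
	"device":           true,
	"subsystem_vendor": true,
}

// mirrorable returns, in sorted order, the host entries that appear in the
// allowlist. Everything else is dropped rather than exposed to the sandbox.
func mirrorable(hostEntries []string) []string {
	var out []string
	for _, name := range hostEntries {
		if allowedPCIFiles[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	host := []string{"vendor", "resource0", "device", "config"}
	fmt.Println(mirrorable(host))
}
```
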

PiperOrigin-RevId: 550005271
2023-07-21 11:43:55 -07:00
Nicolas Lacasse 8c975e6e6e Mark some kernfs inodes as Anonymous.
These inodes can never be part of a filesystem tree. They are nameless and
never have a parent.

This allows us to avoid taking a lock in kernfs.InotifyWithParent for such
anonymous inodes.

PiperOrigin-RevId: 538823227
2023-06-08 10:25:04 -07:00
Etienne Perot f8b9824813 Update unimpl.EmitUnimplementedEvent interface to add the syscall number.
This brings the interface in line with the `EmitUnimplementedEvent` method
signature on `kernel.Kernel`.

Also add a build-time test to verify that `kernel.Kernel` implements this
interface, in order to catch such breakages at build time in the future.
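
A build-time check of this kind is commonly written in Go as an assignment to the blank identifier. The sketch below uses simplified, hypothetical stand-ins (`Events`, `Kernel`, and a one-argument method) rather than the real `unimpl` and `kernel` types, whose exact signatures are not shown in this log.

```go
package main

import "fmt"

// Events is a hypothetical stand-in for the unimpl interface.
type Events interface {
	EmitUnimplementedEvent(sysno uintptr)
}

// Kernel is a hypothetical stand-in for kernel.Kernel.
type Kernel struct {
	last uintptr // Last syscall number reported, kept only for the demo.
}

// EmitUnimplementedEvent records the syscall number it was given.
func (k *Kernel) EmitUnimplementedEvent(sysno uintptr) {
	k.last = sysno
}

// Compile-time assertion: the build fails if *Kernel stops satisfying Events,
// for example when the method signature changes on one side only.
var _ Events = (*Kernel)(nil)

func main() {
	k := &Kernel{}
	var e Events = k
	e.EmitUnimplementedEvent(42)
	fmt.Println(k.last)
}
```

The assertion costs nothing at run time; it exists only so that a signature mismatch is reported by the compiler instead of surfacing later.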

PiperOrigin-RevId: 519000411
2023-03-23 17:01:37 -07:00
Adin Scannell 1ceb814544 Add default_applicable_licenses rules to packages.
PiperOrigin-RevId: 513581243
2023-03-02 10:50:04 -08:00
Lucas Manning 3c93bb1040 Defer kernfs openflag handling to inode implementations.
Some implementations handle more flags than others, so it doesn't
make sense to have one set of rules for all.

This change should functionally be a no-op.

PiperOrigin-RevId: 502712415
2023-01-17 16:01:40 -08:00