FDTable.descriptorTable is a slice of unsafe.Pointers whose maximum length
is MaxInt32, so at 8 bytes per pointer it can require up to 16GB of memory.
A process may use only a few descriptors but assign one or more of them high
numbers. In this case, FDTable.descriptorTable is extended to the maximum size.
This causes two problems. First, the Go runtime zeroes memory regions when
they are reused. In the case of fdtable, the memory region is 16GB, so this is
a time-consuming operation. Second, zeroing forces the kernel to allocate
physical pages for the entire region.
This change adds another level to descriptorTable, so the first level is
a slice of buckets where each bucket is a slice of descriptors. The bucket
size is fixed at 512 entries so that one bucket fits in one page.
Before:
BenchmarkFDLookupAndDecRef-12 50834290 23.70 ns/op
BenchmarkCreateWithMaxFD-12 2 7194873988 ns/op
BenchmarkFDLookupAndDecRefConcurrent-12 23775555 49.68 ns/op
BenchmarkTableLookup-12 412888780 2.835 ns/op
BenchmarkTableMapLookup-12 87944782 12.84 ns/op
After:
BenchmarkFDLookupAndDecRef-12 46229940 25.03 ns/op
BenchmarkCreateWithMaxFD-12 13 82573899 ns/op
BenchmarkFDLookupAndDecRefConcurrent-12 21889380 54.13 ns/op
BenchmarkTableLookup-12 415851230 2.821 ns/op
BenchmarkTableMapLookup-12 97236267 11.89 ns/op
Reported-by: syzbot+af17678e3bfb7ca7c65a@syzkaller.appspotmail.com
PiperOrigin-RevId: 539138632
This is the first in a series of changes that will enable shared mount
subtrees. This will emulate the Linux kernel implementation described here:
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
Mounts can now be part of a mount group. These groups are identified by IDs
assigned by the VFS. A group ID can be reused after all mounts that are
members of that group are destroyed.
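The ID lifecycle described above can be sketched as a small allocator. This
is an assumption-laden illustration rather than the VFS code in this change:
the groupIDs type and its newGroup, join, and leave methods are hypothetical,
and starting IDs at 1 is an assumption made here so that 0 can mean "no
group".

```go
package main

import "fmt"

// groupIDs is a hypothetical allocator for mount group IDs.
type groupIDs struct {
	next    uint32         // next never-used ID (assumed to start at 1)
	free    []uint32       // IDs released by destroyed groups
	members map[uint32]int // live mount count per group ID
}

func newGroupIDs() *groupIDs {
	return &groupIDs{next: 1, members: map[uint32]int{}}
}

// newGroup returns an ID for a new group with one member mount,
// preferring a previously released ID over a fresh one.
func (g *groupIDs) newGroup() uint32 {
	var id uint32
	if n := len(g.free); n > 0 {
		id, g.free = g.free[n-1], g.free[:n-1]
	} else {
		id = g.next
		g.next++
	}
	g.members[id] = 1
	return id
}

// join records one more mount as a member of group id.
func (g *groupIDs) join(id uint32) { g.members[id]++ }

// leave removes one member mount. The ID becomes reusable only after
// the last member of the group is destroyed.
func (g *groupIDs) leave(id uint32) {
	g.members[id]--
	if g.members[id] == 0 {
		delete(g.members, id)
		g.free = append(g.free, id)
	}
}

func main() {
	g := newGroupIDs()
	a := g.newGroup() // first group
	g.join(a)         // a peer joins the same group
	b := g.newGroup() // second group gets a fresh ID
	g.leave(a)
	g.leave(a)        // last member gone, ID released
	c := g.newGroup() // reuses the released ID
	fmt.Println(a, b, c) // prints "1 2 1"
}
```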
Newly created mounts inherit the propagation type of their parent mount. A
mount created on a shared mount will also be a shared mount, though it will
be part of a different group. A shared mount 'A' bound to a shared mount 'B'
will replicate that bind to all of 'B's peers. These new mounts will all be
part of a new shared peer group.
Unmounting a mount 'A' that is a direct child of a shared mount 'B' mounted
at dentry 'b' will propagate that event to all of B's peers. So peers B1, B2,
B3, etc. will all unmount the mounts (A1, A2...) located at 'b'. However, if
any of the peers of A have children, they are skipped. If the original mount A
has children, the unmount fails entirely.
The initial root mount has propagation type MS_PRIVATE.
This change only implements a basic version of mount groups. Notably, it does
not implement MS_SLAVE, MS_UNBINDABLE, or MS_REC.
PiperOrigin-RevId: 483792757