51 Commits

Author SHA1 Message Date
Jimmy Tran de6637c27c Recompute max variable after setting FD in the bitmap.
`fdBitmap.FirstZero()` can return the `max` value; when it does, recompute
`max` so that the old value is not reused a second time.

The default bitmap size for file descriptors in gVisor is 65535.

Add a pipe test that attempts to create more than 65535 FDs to hit the edge
case where fdBitmap.FirstZero() returns the default bitmap max value of 65535.

TESTED:
http://sponge2/4c12ce75-3763-4773-ad62-87c6b8fe0446
http://sponge2/9c9d6ea0-b69c-432c-a16b-9446214109ba
PiperOrigin-RevId: 724410846
2025-02-07 11:22:54 -08:00
Ayush Ranjan 448f894c70 Expose some kernel hooks.
- Add Kernel.IsPaused() which indicates whether the kernel is currently paused.
- Add TaskSet.ForEachThreadGroup() which allows callers to iterate through all
  thread groups in the kernel.
- Export FDTable.ForEach() which allows other packages to iterate over all FDs.

PiperOrigin-RevId: 640760723
2024-06-05 21:43:43 -07:00
Ayush Ranjan f52d36ccc2 Allow FDTable.forEach() to be interrupted from caller function.
FDTable.fdBitmap.ForEach() conveniently allows this interruption. Just plumb
the caller function result to Bitmap.ForEach().
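
A minimal sketch of the interruptible-iteration pattern this commit describes:
the callback returns a bool, and returning false stops the walk. The names
`fdSet`, `forEach`, and `firstAtLeast` are hypothetical stand-ins and are not
the gVisor types or signatures.

```go
package main

import "fmt"

// fdSet is a hypothetical, sorted list of open FDs standing in for the bitmap.
type fdSet []int32

// forEach calls fn for each FD in order and stops as soon as fn returns false.
func (s fdSet) forEach(fn func(fd int32) bool) {
	for _, fd := range s {
		if !fn(fd) {
			return
		}
	}
}

// firstAtLeast returns the first FD >= floor (or -1) and how many FDs were
// visited before the walk was interrupted.
func firstAtLeast(s fdSet, floor int32) (int32, int) {
	found, visited := int32(-1), 0
	s.forEach(func(fd int32) bool {
		visited++
		if fd >= floor {
			found = fd
			return false // interrupt the iteration
		}
		return true
	})
	return found, visited
}

func main() {
	fd, visited := firstAtLeast(fdSet{0, 1, 2, 7, 9}, 5)
	fmt.Println(fd, visited) // FD 9 is never visited
}
```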

PiperOrigin-RevId: 636775362
2024-05-23 21:13:12 -07:00
Ayush Ranjan 7e395bbbd4 Plumb restore context to load*() methods.
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").

Updates #1956.

PiperOrigin-RevId: 614125262
2024-03-08 20:28:02 -08:00
Ayush Ranjan 980de72deb Call FileDescription.OnClose() for newfd being replaced in dup2 and dup3.
dup(2) man page specifies:
       If the file descriptor newfd was previously open, it is closed
       before being reused; the close is performed silently (i.e., any
       errors during the close are not reported by dup2()).

Even though we were DecRef-ing and hence releasing the replaced FD, we were
not calling OnClose(). Compare fs/file.c:do_dup2() -> filp_close(tofree), which
in turn calls filp_flush(). In gVisor, FileDescription.OnClose() analogously
does such flush operations.

PiperOrigin-RevId: 583147682
2023-11-16 13:38:17 -08:00
Ayush Ranjan f154acfb7b Minor fixes in fd_table.
- Fixed up some documentation.
- Got rid of some redundant FDTable.get() calls from FDTable.Remove() and
  FDTable.RemoveNextInRange().
- Consistently handled the result of FDTable.set().
- Added file != nil precondition to NewFDAt().
- Only call fdBitmap.Add() and fdBitmap.Remove() when necessary.

PiperOrigin-RevId: 583096816
2023-11-16 10:51:24 -08:00
Andrei Vagin 68cdc88378 Implement the fs.nr_open sysctl
fs/nr_open limits the maximum size of FD tables.

PiperOrigin-RevId: 580795874
2023-11-08 23:41:32 -08:00
Andrei Vagin 52692c3647 fdtable: avoid large arrays
FDTable.descriptorTable is a slice of unsafe.Pointer values whose maximum
length is MaxInt32, so it can require up to 16GB of memory. A process may use
only a few descriptors but place one or more of them at high FD numbers. In
that case, FDTable.descriptorTable is extended to the maximum size.

The first problem is that the Go runtime zeroes memory regions when they are
reused. In the case of fdtable, the memory region is 16GB, so zeroing it is a
time-consuming operation. Second, it forces the kernel to allocate physical
pages for the entire region.

This change adds another level to descriptorTable, so the first level is
a slice of buckets where each bucket is a slice of descriptors. The bucket
size is fixed to 512 entries to fit one page.

Before:
BenchmarkFDLookupAndDecRef-12              	50834290	        23.70 ns/op
BenchmarkCreateWithMaxFD-12                	       2	7194873988 ns/op
BenchmarkFDLookupAndDecRefConcurrent-12    	23775555	        49.68 ns/op
BenchmarkTableLookup-12                    	412888780	         2.835 ns/op
BenchmarkTableMapLookup-12                 	87944782	        12.84 ns/op

After:
BenchmarkFDLookupAndDecRef-12              	46229940	        25.03 ns/op
BenchmarkCreateWithMaxFD-12                	      13	  82573899 ns/op
BenchmarkFDLookupAndDecRefConcurrent-12    	21889380	        54.13 ns/op
BenchmarkTableLookup-12                    	415851230	         2.821 ns/op
BenchmarkTableMapLookup-12                 	97236267	        11.89 ns/op

Reported-by: syzbot+af17678e3bfb7ca7c65a@syzkaller.appspotmail.com
PiperOrigin-RevId: 539138632
2023-06-09 11:49:28 -07:00
Fabricio Voznika fc94225c33 Fix crash with large FD value
There were 2 problems when trying to allocate a high FD value:
- Rlimit is stored as uint64 and could be truncated when converting to int32
  to calculate the max value allowed for the FD.
- While trying to double the FD table size, the new length for the table could
  end up too short, again due to an invalid type conversion.
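
Both problems come down to narrowing integer conversions. The sketch below
shows one way to avoid them, assuming a saturating conversion and 64-bit
arithmetic while doubling. `clampFDLimit` and `nextTableLen` are hypothetical
helpers, not functions from the gVisor source.

```go
package main

import (
	"fmt"
	"math"
)

// clampFDLimit converts a uint64 rlimit to int32, saturating at MaxInt32
// instead of silently truncating the high bits.
func clampFDLimit(rlimit uint64) int32 {
	if rlimit > math.MaxInt32 {
		return math.MaxInt32
	}
	return int32(rlimit)
}

// nextTableLen doubles cur until it exceeds fd, then caps the result at limit.
// The doubling is done in int64 so it cannot overflow int32. The caller is
// assumed to have checked 0 <= fd < limit already.
func nextTableLen(cur, fd, limit int32) int32 {
	n := int64(cur)
	if n <= 0 {
		n = 1
	}
	for n <= int64(fd) {
		n *= 2
	}
	if n > int64(limit) {
		n = int64(limit)
	}
	return int32(n)
}

func main() {
	var rlimit uint64 = 1 << 32
	// A plain conversion keeps only the low 32 bits; the clamp saturates.
	fmt.Println(int32(rlimit), clampFDLimit(rlimit))
	fmt.Println(nextTableLen(4, 9, clampFDLimit(rlimit)))
}
```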

Reported-by: syzbot+e4a60cfb88b515cbd2b1@syzkaller.appspotmail.com
PiperOrigin-RevId: 518362257
2023-03-21 13:21:27 -07:00
Kevin Krakauer 28472cc03f don't take an unnecessary reference in proc.fdSymlink.Valid()
Reported-by: syzbot+8622a8a08287adc17bc9@syzkaller.appspotmail.com
PiperOrigin-RevId: 510454946
2023-02-17 09:51:21 -08:00
Ayush Ranjan 1fa3c06f1e Delete VFS1 completely.
- Delete pkg/sentry/fs/*.
- Move pkg/sentry/fs/fsutil out of VFS1 directory and remove VFS1 components.
- Remove remaining unused references to VFS1 from remaining codebase.
- Rename/refactor code to avoid even referencing VFS2, unless necessary.
- Rewrite VFS1-only tests to VFS2.

Updates #1624

PiperOrigin-RevId: 490064269
2022-11-21 13:57:52 -08:00
Ayush Ranjan 7eeeb796f8 Delete VFS1 filesystem implementations.
Updates #1624

PiperOrigin-RevId: 488986080
2022-11-16 11:05:10 -08:00
Ayush Ranjan 7c3ff55fab Update fd_table_test to use VFS2.
This unit test had been using VFS1.
Updates #1624

PiperOrigin-RevId: 488740255
2022-11-15 13:15:15 -08:00
Andrei Vagin 604233c9f6 kernel: use lockdep mutexes
PiperOrigin-RevId: 449877248
2022-05-19 18:33:59 -07:00
Andrei Vagin da439ae2f4 kernel: release FDTable lock before calling file methods
This change breaks dependency of FDTable.mu and kernfs.filesystemRWMutex.

panic: WARNING: circular locking detected: kernel.taskMutex -> kernel.fdTableMutex:

gvisor.dev/gvisor/pkg/sentry/kernel.(*fdTableMutex).Lock(0xc00491a110)
        bazel-out/k8-fastbuild-ST-1a50d7a562c6/bin/pkg/sentry/kernel/fd_table_mutex.go:16 +0x34
gvisor.dev/gvisor/pkg/sentry/kernel.(*FDTable).Fork(0xc00491a100, {0x158ef70, 0xc005fdb500}, 0x2073b60)
        pkg/sentry/kernel/fd_table.go:637 +0xca
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).Unshare(0xc005fdb500, 0x4040400)
        pkg/sentry/kernel/task_clone.go:483 +0x72d

known lock chain: kernel.fdTableMutex -> kernfs.filesystemRWMutex -> kernel.taskSetRWMutex -> kernel.signalHandlersMutex -> kernel.taskMutex

gvisor.dev/gvisor/pkg/sentry/kernel.(*FDTable).String(0xc000cd36c0)
        pkg/sentry/kernel/fd_table.go:245 +0x139

====== kernfs.filesystemRWMutex -> kernel.taskSetRWMutex =====
gvisor.dev/gvisor/pkg/sentry/kernel.(*PIDNamespace).IDOfThreadGroup(0xc00019ce40, 0xc0004a2800)
        pkg/sentry/kernel/threads.go:264 +0x35
gvisor.dev/gvisor/pkg/sentry/fsimpl/proc.(*selfSymlink).Readlink(0xc000074460, {0x158ef70, 0xc00027e000}, 0xc0003ba510)
        pkg/sentry/fsimpl/proc/tasks_files.go:58 +0x6d
gvisor.dev/gvisor/pkg/sentry/fsimpl/proc.(*selfSymlink).Getlink(0xc0004b5500, {0x158ef70, 0xc00027e000}, 0x4)
        pkg/sentry/fsimpl/proc/tasks_files.go:66 +0x2d

====== kernel.taskSetRWMutex -> kernel.signalHandlersMutex =====
gvisor.dev/gvisor/pkg/sentry/kernel.(*TaskSet).newTask(0xc00019cea0, 0xc0004f1570)
        pkg/sentry/kernel/task_start.go:177 +0x656
gvisor.dev/gvisor/pkg/sentry/kernel.(*TaskSet).NewTask(0xc00014f180, {0x158eff8, 0xc000436640}, 0xc0004f1570)
        pkg/sentry/kernel/task_start.go:122 +0xa5

====== kernel.signalHandlersMutex -> kernel.taskMutex =====
gvisor.dev/gvisor/pkg/sentry/kernel.(*TaskSet).newTask(0xc00019cea0, 0xc0004f1570)
        pkg/sentry/kernel/task_start.go:228 +0x9a8
gvisor.dev/gvisor/pkg/sentry/kernel.(*TaskSet).NewTask(0xc00014f180, {0x158eff8, 0xc000436640}, 0xc0004f1570)
        pkg/sentry/kernel/task_start.go:122 +0xa5
gvisor.dev/gvisor/pkg/sentry/kernel.(*Kernel).CreateProcess(0xc00014f180, {{0xc00039bf70, 0x5}, {0x0, 0x0}, {0xc00037d440, 0x1, 0x4}, {0xc000436620, 0x2, ...}, ...})
        pkg/sentry/kernel/kernel.go:1069 +0x147d
gvisor.dev/gvisor/runsc/boot.(*Loader).createContainerProcess(0xc0003f2000, 0x1, {0x7fff0a092fb9, 0x8}, 0xc0003f2010)
        runsc/boot/loader.go:809 +0x445
gvisor.dev/gvisor/runsc/boot.(*Loader).run(0xc0003f2000)
        runsc/boot/loader.go:630 +0x1ad
gvisor.dev/gvisor/runsc/boot.(*Loader).Run(0xc0003f2000)
        runsc/boot/loader.go:581 +0x25
2022-04-14 23:08:13 -07:00
Konstantin Bogomolov 4503ba3f5e Fix data race when using UNSHARE in close_range.
Also add a test that fails under gotsan without the fix.

PiperOrigin-RevId: 433857741
2022-03-10 14:53:45 -08:00
Konstantin Bogomolov 5c95e1d39c Implement close_range.
Fixes #5500

PiperOrigin-RevId: 431454836
2022-02-28 09:37:03 -08:00
gVisor bot c9aac64e0f Merge pull request #6257 from zhlhahaha:2193-1
PiperOrigin-RevId: 387885663
2021-07-30 14:43:13 -07:00
Andrei Vagin 68cf8cc9a2 Don't create an extra fd bitmap to allocate a new fd.
2021-07-27 13:16:02 +08:00
Howard Zhang c8d252466f apply bitmap for fd_table
Use a bitmap in fd_table to record which file descriptors are open. This
speeds up allocating FDs in, and removing FDs from, the fdtable.
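
A rough sketch of why a bitmap helps: the lowest free FD can be found one
64-bit word at a time instead of probing every table slot. The `bitmap` type
and its methods are illustrative assumptions and differ from gVisor's bitmap
package.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitmap is a hypothetical growable bitmap; a set bit means the FD is in use.
type bitmap struct{ words []uint64 }

// add marks FD i as in use, growing the bitmap if needed.
func (b *bitmap) add(i uint32) {
	w := int(i / 64)
	for w >= len(b.words) {
		b.words = append(b.words, 0)
	}
	b.words[w] |= uint64(1) << (i % 64)
}

// remove marks FD i as free.
func (b *bitmap) remove(i uint32) {
	if w := int(i / 64); w < len(b.words) {
		b.words[w] &^= uint64(1) << (i % 64)
	}
}

// firstZero returns the lowest free FD that is >= start.
func (b *bitmap) firstZero(start uint32) uint32 {
	for i := start; ; {
		w := int(i / 64)
		if w >= len(b.words) {
			return i // everything past the end is free
		}
		// Free bits in this word, ignoring positions below i.
		free := ^b.words[w] &^ (uint64(1)<<(i%64) - 1)
		if free != 0 {
			return uint32(w*64 + bits.TrailingZeros64(free))
		}
		i = uint32(w+1) * 64 // skip the whole word in one step
	}
}

func main() {
	var b bitmap
	for fd := uint32(0); fd < 3; fd++ {
		b.add(fd)
	}
	fmt.Println(b.firstZero(0)) // FDs 0, 1, 2 are taken
	b.remove(1)
	fmt.Println(b.firstZero(0), b.firstZero(2))
}
```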

Signed-off-by: Howard Zhang <howard.zhang@arm.com>
2021-07-13 14:16:07 +08:00
Zach Koopmans e1dc1c78e7 [syserror] Add conversions to linuxerr with temporary Equals method.
Add Equals method to compare syserror and unix.Errno errors to linuxerr errors.
This will facilitate removal of syserror definitions in a followup, and
finding needed conversions from unix.Errno to linuxerr.

PiperOrigin-RevId: 380909667
2021-06-22 15:53:32 -07:00
Dean Deng 894187b2c6 Resolve remaining O_PATH TODOs.
O_PATH is now implemented in vfs2.

Fixes #2782.

PiperOrigin-RevId: 373861410
2021-05-14 14:04:46 -07:00
Ayush Ranjan a9441aea27 [op] Replace syscall package usage with golang.org/x/sys/unix in pkg/.
The syscall package has been deprecated in favor of golang.org/x/sys.

Note that syscall is still used in the following places:
- pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities
  are not yet available in golang.org/x/sys.
- syscall.Stat_t is still used in some places because os.FileInfo.Sys() still
  returns it and not unix.Stat_t.

Updates #214

PiperOrigin-RevId: 360701387
2021-03-03 10:25:58 -08:00
Dean Deng 3946075403 Do not generate extraneous IN_CLOSE inotify events.
IN_CLOSE should only be generated when a file description loses its last
reference; not when a file descriptor is closed.

See fs/file_table.c:__fput.

Updates #5348.

PiperOrigin-RevId: 353810697
2021-01-26 00:02:52 -08:00
Dean Deng 55332aca95 Move Lock/UnlockPOSIX into LockFD util.
PiperOrigin-RevId: 352904728
2021-01-20 16:55:07 -08:00