gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Ayush Ranjan	156f457e28	Add kernel.ThreadGroup.ForEachTask(). PiperOrigin-RevId: 733991896	2025-03-05 22:23:11 -08:00
Etienne Perot	04f9204697	Yield thread group leader `*Task` in `TaskSet.ForEachThreadGroup`. This makes this function usable from outside of the `kernel` package without needing to call `tg.Leader()` (which requires a lock that `TaskSet.ForEachThreadGroup` already acquires). PiperOrigin-RevId: 721168957	2025-01-29 17:30:05 -08:00
Jamie Liu	2d90353f9f	kernel: drive all CPU timers in CPU clock ticker

gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine" (task_sched.go:Kernel.runCPUClockTicker()) periodically advances Kernel.cpuClock, causing it to serve as a very coarse but inexpensive monotonic wall clock (that happens to be suspended when no tasks are running).
- Task goroutines observe the most recent value of Kernel.cpuClock when changing state (Task.gosched.Timestamp), and use it to compute the number of CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are approximately based on the wall time during which they were marked as running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the CPU clock ticker, and are not informed as to when corresponding tasks start or stop running (due to overhead on the task execution critical path), so they can't determine when CPU clocks have/will advance; instead, they simply poll CPU clocks on a period equal to that of the represented timer, resulting in significant overhead for CPU-clock-based POSIX interval timers and timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much later than when the CPU clock is actually incremented; in the interval timer case, this can result in notification signals being sent long after tasks have stopped running. (This is the same problem as in b/116538398, which motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be concurrently marked running, so if more tasks are marked running than the number of CPUs advertised to applications, application CPU utilization can appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and RLIMIT_CPU timers lose their special-casing and instead behave like other CPU timers (see task_acct.go). Kernel.cpuClock is still required, but only for the sentry watchdog.

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem (https://go.dev/blog/randv2#problem.rand).

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:

```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:

```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313	2024-11-10 22:19:30 -08:00
Jamie Liu	4298980325	Disallow task creation after Kernel.WaitExited() returns. Otherwise tasks can be created via the control server between when Kernel.WaitExited() returns and when the control server is stopped, resulting in task goroutines running when Kernel.Release() is called. PiperOrigin-RevId: 658891833	2024-08-02 13:41:17 -07:00
Ayush Ranjan	448f894c70	Expose some kernel hooks. - Add Kernel.IsPaused() which indicates whether the kernel is currently paused. - Add TaskSet.ForEachThreadGroup() which allows callers to iterate through all thread groups in the kernel. - Export FDTable.ForEach() which allows other packages to iterate over all FDs. PiperOrigin-RevId: 640760723	2024-06-05 21:43:43 -07:00
Shambhavi Srivastava	3657484eee	Adding /proc/[pid]/task/[tid]/children PiperOrigin-RevId: 557888186	2023-08-17 11:41:48 -07:00
Fabricio Voznika	500658dc81	Return correct number of PIDs with multi-container Updates #172 PiperOrigin-RevId: 552620987	2023-07-31 16:20:47 -07:00
Nicolas Lacasse	e7bd1b4c9c	Implement PR_{S,G}ET_CHILD_SUBREAPER. Closes #2323 PiperOrigin-RevId: 548205854	2023-07-14 13:19:25 -07:00
Jamie Liu	94bf4b6469	Return consistent IDs for PID namespaces via procfs. PiperOrigin-RevId: 543529000	2023-06-26 13:49:23 -07:00
Andrei Vagin	604233c9f6	kernel: use lockdep mutexes PiperOrigin-RevId: 449877248	2022-05-19 18:33:59 -07:00
Etienne Perot	c9f8b165cf	Cache each thread group's TID within their own namespace. This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the following lock cycle: `kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex` -> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex` (Also, less locking within `createVMALocked` is probably for the better in general.) PiperOrigin-RevId: 449588573	2022-05-18 15:14:14 -07:00
Rahat Mahmood	bf251f1838	cgroupfs: Implement pids controller. This also introduces the controller charge interface. PiperOrigin-RevId: 444703063	2022-04-26 16:50:15 -07:00
Etienne Perot	95d883a92e	Refactor task start and exit from a PID namespace into separate functions. PiperOrigin-RevId: 426083905	2022-02-03 01:47:27 -08:00
Jamie Liu	8682ce689e	Remove state:"nosave"/"zerovalue" annotations from all waiter.Queues. Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields, which meant that it couldn't handle intrusive linked lists, which meant that it couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a result, VFS1 unregisters all epoll waiters before saving and re-registers them after loading, and waitable VFS1 file implementations tag their waiter.Queues state:"nosave" (causing them to be skipped by the save/restore machinery) or state:"zerovalue" (causing them to only be checked for zero-value-equality on save). VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2 epoll assumes that waiter.Queues will be saved and loaded correctly, and VFS2 file implementations do not tag waiter.Queues. Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used by both VFS1 and VFS2 (the latter via signalfd); as a result of the above, tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll. Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2), and remove these tags from all waiter.Queues. Also clean up after the epoll test added by cl/402323053, which implied this issue (by instantiating DisableSave in the new test) without reporting it. PiperOrigin-RevId: 402596216	2021-10-12 10:25:30 -07:00
Rahat Mahmood	932c8abd0f	Implement cgroupfs. A skeleton implementation of cgroupfs. It supports trivial cpu and memory controllers with no support for hierarchies. PiperOrigin-RevId: 366561126	2021-04-02 21:10:44 -07:00
Dean Deng	acd516cfe2	Add YAMA security module restrictions on ptrace(2). Restrict ptrace(2) according to the default configurations of the YAMA security module (mode 1), which is a common default among various Linux distributions. The new access checks only permit the tracer to proceed if one of the following conditions is met: a) The tracer is already attached to the tracee. b) The target is a descendant of the tracer. c) The target has explicitly given permission to the tracer through the PR_SET_PTRACER prctl. d) The tracer has CAP_SYS_PTRACE. See security/yama/yama_lsm.c for more details. Note that these checks are added to CanTrace, which is checked for PTRACE_ATTACH as well as some other operations, e.g., checking a process' memory layout through /proc/[pid]/mem. Since this patch adds restrictions to ptrace, it may break compatibility for applications run by non-root users that, for instance, rely on being able to trace processes that are not descended from the tracer (e.g., `gdb -p`). YAMA restrictions can be turned off by setting /proc/sys/kernel/yama/ptrace_scope to 0, or exceptions can be made on a per-process basis with the PR_SET_PTRACER prctl. Reported-by: syzbot+622822d8bca08c99e8c8@syzkaller.appspotmail.com PiperOrigin-RevId: 359237723	2021-02-24 02:03:16 -08:00
Dean Deng	f52f0101bb	Implement F_GETLK fcntl. Fixes #5113. PiperOrigin-RevId: 353313374	2021-01-22 13:58:16 -08:00
Jamie Liu	6bbf662271	Reduce the cost of sysinfo(2). - sysinfo(2) does not actually require a fine-grained breakdown of memory usage. Accordingly, instead of calling pgalloc.MemoryFile.UpdateUsage() to update the sentry's fine-grained memory accounting snapshot, just use pgalloc.MemoryFile.TotalUsage() (which is a single fstat(), and therefore far cheaper). - Use the number of threads in the root PID namespace (i.e. globally) rather than in the task's PID namespace for consistency with Linux (which just reads global variable nr_threads), and add a new method to kernel.PIDNamespace to allow this to be read directly from an underlying map rather than requiring the allocation and population of an intermediate slice. PiperOrigin-RevId: 336353100	2020-10-09 13:23:30 -07:00
Rahat Mahmood	d201feb8c5	Enable automated marshalling for the syscall package. PiperOrigin-RevId: 331940975	2020-09-15 23:38:57 -07:00
Nicolas Lacasse	810748f5c9	Port aio to VFS2. In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analogous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380	2020-06-16 08:49:06 -07:00
Ian Gudger	27500d529f	New sync package. * Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387	2020-01-09 22:02:24 -08:00
gVisor bot	b50122379c	Merge pull request #452 from zhangningdlut:chris_test_pidns PiperOrigin-RevId: 260220279	2019-07-26 15:00:51 -07:00
Adin Scannell	add40fd6ad	Update canonical repository. This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620	2019-06-13 16:50:15 -07:00
Michael Pratt	4d52a55201	Change copyright notice to "The gVisor Authors" Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9	2019-04-29 14:26:23 -07:00
Michael Pratt	75a5ccf5d9	Remove defer from trivial ThreadID methods In particular, ns.IDOfTask and tg.ID are used for gettid and getpid, respectively, where removing defer saves ~100ns. This may be a small improvement to application logging, which may call gettid/getpid frequently. PiperOrigin-RevId: 242039616 Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80	2019-04-04 17:14:27 -07:00

33 Commits