gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
gVisor bot	86abc85f37	Merge pull request #11473 from Champ-Goblem:shim-add-cgroup-v2-metrics-support PiperOrigin-RevId: 730560110	2025-02-25 14:52:09 -08:00
Fabricio Voznika	fb730ff784	Remove checkpoint_count from `runsc wait --checkpoint` This is done because external callers are not able to know the snapshot generation number from the outside. PiperOrigin-RevId: 707979556	2024-12-19 11:48:10 -08:00
Nayana Bidari	df9ba5fb67	Restore listening connections when netstack s/r is enabled. This CL restores the listening connections when netstack s/r is enabled. The changes include: - New method as a workaround to replace the new routes and nics to the loaded stack after restore. - New Restore() for transport layer protocols to restore the protocol level background workers. - Adds afterLoad() method for fdbased processors. - Adds a test to verify listening connection is restored after checkpointing with netstack s/r enabled. - Few other changes to save restore fields to enable netstack s/r. PiperOrigin-RevId: 698453124	2024-11-20 11:13:57 -08:00
Jamie Liu	7920b5b40a	kernel: improve tcpip.Timer implementation - Move ktime.VariableTimer to kernel.timekeeperTcpipTimer, its only use case. This allows timekeeperTcpipTimer to use concrete types kernel.timekeeperClock and ktime.SampledTimer instead of ktime.Clock and ktime.Timer, saving a tiny amount of memory (interface values consist of two pointers) and CPU (for interface method calls). - Fix a bug where timekeeperTcpipTimer expiration can cancel a racing call to timekeeperTcpipTimer.Reset() (see use of new field timekeeperTcpipTimer.resets). - Define Listener.NotifyTimer directly on timekeeperTcpipTimer (dropping ktime.functionNotifier), and move goroutine spawning from the anonymous function in ktime.AfterFunc() into timekeeperTcpipTimer.NotifyTimer(). This slightly simplifies the control flow and saves an allocation for the anonymous function object. - Use monotonicClock rather than realtimeClock. It doesn't make sense for time-of-day clock adjustments to affect netstack timeouts, and this is consistent with tcpip.stdClock => time.AfterFunc => runtime.timer. PiperOrigin-RevId: 695504159	2024-11-11 15:38:55 -08:00
Jamie Liu	2d90353f9f	kernel: drive all CPU timers in CPU clock ticker

gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine" (task_sched.go:Kernel.runCPUClockTicker()) periodically advances Kernel.cpuClock, causing it to serve as a very coarse but inexpensive monotonic wall clock (that happens to be suspended when no tasks are running).
- Task goroutines observe the most recent value of Kernel.cpuClock when changing state (Task.gosched.Timestamp), and use it to compute the number of CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are approximately based on the wall time during which they were marked as running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the CPU clock ticker, and are not informed as to when corresponding tasks start or stop running (due to overhead on the task execution critical path), so they can't determine when CPU clocks have/will advance; instead, they simply poll CPU clocks on a period equal to that of the represented timer, resulting in significant overhead for CPU-clock-based POSIX interval timers and timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much later than when the CPU clock is actually incremented; in the interval timer case, this can result in notification signals being sent long after tasks have stopped running. (This is the same problem as in b/116538398, which motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be concurrently marked running, so if more tasks are marked running than the number of CPUs advertised to applications, application CPU utilization can appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and RLIMIT_CPU timers lose their special-casing and instead behave like other CPU timers (see task_acct.go). Kernel.cpuClock is still required, but only for the sentry watchdog.

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem (https://go.dev/blog/randv2#problem.rand).

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:

```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:

```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313	2024-11-10 22:19:30 -08:00
Jamie Liu	2e6cfa72f2	ktime: support varying Timer implementations - Rename Timer to SampledTimer. - Move all Clock methods except Now to new interface SampledClock. - Move SampledTimer's exported methods (except SetClock) to new interface Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant methods that must be implemented. - Add interface method Clock.NewTimer. This is in preparation for cl/693856539, which adds a second Timer implementation. PiperOrigin-RevId: 694299679	2024-11-07 17:19:25 -08:00
Jamie Liu	e23347e5b5	Move //pkg/sentry/kernel/time to //pkg/sentry/ktime. This avoids needing to rename it everywhere it's imported. PiperOrigin-RevId: 693930089	2024-11-06 18:13:51 -08:00
Ayush Ranjan	1e5b6ec429	Add more context to errors during restore. This will help with debugging restore failures. PiperOrigin-RevId: 693436013	2024-11-05 12:18:57 -08:00
Fabricio Voznika	c8c41e5e30	Move S/R code to separate file Move Kernel S/R code to kernel_restore.go. PiperOrigin-RevId: 691441815	2024-10-30 09:15:35 -07:00
Ayush Ranjan	56ccc08f37	Rename OCIEnviron to SpecEnviron. PiperOrigin-RevId: 684626452	2024-10-10 17:12:03 -07:00
cweld510	db4ffada10	style feedback: remove newlines, fix import, remove stray comment	2024-10-07 22:39:13 +00:00
cweld510	727bc9c72a	Add and implement option to close unsaveable gofer-backed unix sockets on save	2024-10-04 20:13:38 +00:00
Ayush Ranjan	cb418b7f09	Add kernel.Saver.OCIEnviron(). PiperOrigin-RevId: 682080366	2024-10-03 16:48:03 -07:00
Nicolas Lacasse	cceb04f05a	Clean up host.TTYFileOperations. We used to track the foreground process group & session on the TTYFileOperation, but these are already tracked in kernel.TTY.ThreadGroup. So remove TTYFileOperations.fgProcessGroup and .session, and replace them with a kernel.TTY. This is analogous to how sentry-internal tty's already work. Updates #10925 PiperOrigin-RevId: 681957240	2024-10-03 11:25:52 -07:00
Nicolas Lacasse	d5a9d523bb	Implement /dev/tty for donated host TTYs Fixes #10925 PiperOrigin-RevId: 681684673	2024-10-02 19:40:43 -07:00
Jamie Liu	b99fd8711f	kernel: fix lock order inversion in ThreadGroup.Release() PiperOrigin-RevId: 681199251	2024-10-01 16:10:15 -07:00
Jamie Liu	03bebc4402	kernel: add ThreadGroup.signalLock() This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing to lock TaskSet.mu, analogously to Linux's lock_task_sighand(). This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally requires TaskSet.mu to be locked since it reads Task.exitState. To fix this, use atomic memory operations on Task.exitState when required. PiperOrigin-RevId: 681128063	2024-10-01 12:48:10 -07:00
Jamie Liu	41f01d8f9c	pgalloc: integrate async page loading

When a pages file is provided to `runsc restore`, reads from that file are asynchronous (via statefile.AsyncReader) in order to maximize throughput. However, all such reads must complete before Kernel.LoadFrom() returns, so applications cannot execute before MemoryFile loading is complete.

The main objective of this CL is to allow reads to continue after Kernel.LoadFrom() returns, allowing applications to execute while MemoryFile loading is still in progress. This behavior is user-visible: it affects whether deleting the pages file frees disk space immediately on POSIX filesystems, may affect whether deletion is possible on non-POSIX filesystems, and prevents unmounting regardless. Thus it is flag-guarded as `runsc restore --background`.

MemoryFile ranges that have yet to be loaded, but that are being waited-for by applications, should be prioritized over ranges for which no application is waiting. This requires that application requests for data (calls to MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which ranges have not yet been loaded, request reads for such ranges with elevated priority, and wait for only those reads to be completed; none of these are supported by the existing statefile.AsyncReader. Thus:

- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is designed to be easily implementable using a goroutine pool, Linux native AIO, or io_uring, though only includes a goroutine pool implementation. (io_uring is widely disabled due to security vulnerabilities. In my testing, Linux native AIO is slower than the goroutine pool, but this may change with lower GOMAXPROCS which needs further testing.)
- Move I/O scheduling into pgalloc: introduce an async page loader goroutine that is started by MemoryFile.LoadFrom() when async page loading is requested (implicitly, via the existence of a pages file), which is responsible for driving submission of read requests and handling their completions.

PiperOrigin-RevId: 679321884	2024-09-26 15:51:13 -07:00
Nayana Bidari	740dc367db	Mark netstack as save and use it only in tests - Adds a new flag which will enable netstack s/r. When the flag is not enabled, there is no change in the existing behavior. The flag will be enabled only in tests to verify the s/r functionality of netstack. - Some additional fields in netstack were causing panic when netstack is save/restored. Such fields are marked as 'save'/'nosave' accordingly to resolve the panic. PiperOrigin-RevId: 668566657	2024-08-28 12:49:43 -07:00
Ayush Ranjan	218f52a9f5	Parallelize MemoryFile save and kernel save. This complements `39730b714c` ("Load pgalloc.MemoryFile and kernel parallely with compression=none mode.") This is a performance optimization. Kernel and MemoryFile are saved independently. The save can be done in parallel when using --compression=none because the kernel and MemoryFile are being saved in different files. PiperOrigin-RevId: 668128205	2024-08-27 14:03:15 -07:00
Jamie Liu	87ec1007b4	Buffer page metadata file I/O. PiperOrigin-RevId: 666590672	2024-08-22 19:55:11 -07:00
Jamie Liu	4298980325	Disallow task creation after Kernel.WaitExited() returns. Otherwise tasks can be created via the control server between when Kernel.WaitExited() returns and when the control server is stopped, resulting in task goroutines running when Kernel.Release() is called. PiperOrigin-RevId: 658891833	2024-08-02 13:41:17 -07:00
Ayush Ranjan	bd9b5a819f	Add a `runsc wait --checkpoint n` command to wait for a checkpoint to complete. This command waits for (n-1)th checkpoint to complete successfully. Then waits for the next checkpoint attempt (which would increment checkpoint count to n) and returns its status. If sandbox checkpoint count has already reached n, it returns immediately. PiperOrigin-RevId: 651884599	2024-07-12 14:19:46 -07:00
Ayush Ranjan	847bd58dc7	Add a checkpoint counter to the kernel. This counter is incremented after each checkpoint upon a successful restore or resume. It can be used to track the number of times the sandbox has been checkpointed. The counter is protected by a new mutex in the kernel. This mutex is used to protect all checkpointing-related fields in the kernel. Earlier, set and get on Kernel.saver were not synchronized. PiperOrigin-RevId: 650858921	2024-07-09 21:57:12 -07:00
Ayush Ranjan	69c3e8d632	Move VDSOParamPage out of Timekeeper. We want to decouple the netstack from the kernel. This coupling is causing bugs in restore because netstack needs to be created before the Kernel is restored. So right now, netstack ends up using a "temporary" Kernel which is later destroyed in the restore sequence, but netstack keeps referencing it. Before this change, netstack was being initialized with a `kernel.TimeKeeper`. This `Timekeeper` was being initialized with `VDSOParamPage`, which references the MemoryFile of the kernel. So as a result, on restore the netstack ends up referencing the destroyed kernel's MemoryFile. So instead move out VDSOParamPage from TimeKeeper altogether. The callers of `TimeKeeper.SetClocks()` and `TimeKeeper.ResumeUpdates()` pass VDSOParamPage from the correct kernel being used currently. PiperOrigin-RevId: 647168588	2024-06-26 20:31:38 -07:00


228 Commits