228 Commits

Author SHA1 Message Date
gVisor bot 86abc85f37 Merge pull request #11473 from Champ-Goblem:shim-add-cgroup-v2-metrics-support
PiperOrigin-RevId: 730560110
2025-02-25 14:52:09 -08:00
Fabricio Voznika fb730ff784 Remove checkpoint_count from runsc wait --checkpoint
This is done because external callers are not able to know
the snapshot generation number from the outside.

PiperOrigin-RevId: 707979556
2024-12-19 11:48:10 -08:00
Nayana Bidari df9ba5fb67 Restore listening connections when netstack s/r is enabled.
This CL restores the listening connections when netstack s/r is enabled.
The changes include:
- New method, as a workaround, to install the new routes and NICs into the
loaded stack after restore.
- New Restore() for transport layer protocols to restore the protocol level
background workers.
- Adds afterLoad() method for fdbased processors.
- Adds a test to verify listening connection is restored after checkpointing
with netstack s/r enabled.
- A few other changes to save/restore fields to enable netstack s/r.

PiperOrigin-RevId: 698453124
2024-11-20 11:13:57 -08:00
Jamie Liu 7920b5b40a kernel: improve tcpip.Timer implementation
- Move ktime.VariableTimer to kernel.timekeeperTcpipTimer, its only use case.
  This allows timekeeperTcpipTimer to use concrete types kernel.timekeeperClock
  and ktime.SampledTimer instead of ktime.Clock and ktime.Timer, saving a tiny
  amount of memory (interface values consist of two pointers) and CPU (for
  interface method calls).

- Fix a bug where timekeeperTcpipTimer expiration can cancel a racing call to
  timekeeperTcpipTimer.Reset() (see use of new field
  timekeeperTcpipTimer.resets).

- Define Listener.NotifyTimer directly on timekeeperTcpipTimer (dropping
  ktime.functionNotifier), and move goroutine spawning from the anonymous
  function in ktime.AfterFunc() into timekeeperTcpipTimer.NotifyTimer(). This
  slightly simplifies the control flow and saves an allocation for the
  anonymous function object.

- Use monotonicClock rather than realtimeClock. It doesn't make sense for
  time-of-day clock adjustments to affect netstack timeouts, and this is
  consistent with tcpip.stdClock => time.AfterFunc => runtime.timer.

PiperOrigin-RevId: 695504159
2024-11-11 15:38:55 -08:00
Jamie Liu 2d90353f9f kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine"
  (task_sched.go:Kernel.runCPUClockTicker()) periodically advances
  Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
  monotonic wall clock (that happens to be suspended when no tasks are
  running).

- Task goroutines observe the most recent value of Kernel.cpuClock when
  changing state (Task.gosched.Timestamp), and use it to compute the number of
  CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
  approximately based on the wall time during which they were marked as
  running.

- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
  ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
  timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
  goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
  CPU clock ticker, and are not informed as to when corresponding tasks start
  or stop running (due to overhead on the task execution critical path), so
  they can't determine when CPU clocks have advanced or will advance; instead, they simply
  poll CPU clocks on a period equal to that of the represented timer, resulting
  in significant overhead for CPU-clock-based POSIX interval timers and
  timerfds.

- For the same reason, CPU clock interval timers and timerfds may expire much
  later than when the CPU clock is actually incremented; in the interval timer
  case, this can result in notification signals being sent long after tasks
  have stopped running. (This is the same problem as in b/116538398, which
  motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
  above, but applied to POSIX interval timers.)

- The sentry does not impose a limit on the number of tasks that may be
  concurrently marked running, so if more tasks are marked running than the
  number of CPUs advertised to applications, application CPU utilization can
  appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
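
The approach in the paragraph above can be sketched roughly as follows. This is an illustration only: `task`, `cpuTimer`, and `tick` are hypothetical names, the core budget stands in for Kernel.applicationCores, and the choice of which running tasks get charged is simplified to "the first ones found".

```go
package main

import "fmt"

// cpuTimer is a hypothetical one-shot CPU timer expressed in ticks.
type cpuTimer struct {
	deadline uint64
	fired    bool
}

// task is a hypothetical stand-in for a sentry task.
type task struct {
	running  bool
	cpuTicks uint64
	timers   []*cpuTimer
}

// tick advances the CPU clock of at most cores running tasks by one tick
// and expires any timer whose deadline has been reached. It returns the
// number of tasks that were charged.
func tick(tasks []*task, cores int) int {
	charged := 0
	for _, t := range tasks {
		if charged == cores {
			break
		}
		if !t.running {
			continue
		}
		t.cpuTicks++
		charged++
		for _, tm := range t.timers {
			if !tm.fired && t.cpuTicks >= tm.deadline {
				tm.fired = true
			}
		}
	}
	return charged
}

func main() {
	tm := &cpuTimer{deadline: 2}
	tasks := []*task{
		{running: true, timers: []*cpuTimer{tm}},
		{running: true},
		{running: true}, // exceeds the core budget below
	}
	for i := 0; i < 2; i++ {
		tick(tasks, 2)
	}
	fmt.Println(tasks[0].cpuTicks, tasks[2].cpuTicks, tm.fired)
}
```

Because timers are expired in the same step that advances the clock, no separate polling goroutine is needed, and total charged time per tick never exceeds the core budget.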

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.

- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
  (https://go.dev/blog/randv2#problem.rand).
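
For reference, a bounded random draw with math/rand/v2 looks like the sketch below. Which v2 function the CL actually calls is not stated here; `rand.Int32N` is assumed as the closest counterpart, and `randBelow` is a hypothetical wrapper.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// randBelow returns a uniformly distributed value in [0, n).
// It panics if n <= 0, as rand.Int32N does.
func randBelow(n int32) int32 {
	return rand.Int32N(n)
}

func main() {
	for i := 0; i < 3; i++ {
		v := randBelow(10)
		fmt.Println(v >= 0 && v < 10)
	}
}
```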

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
      .it_interval = {0, kTimerPeriodNS},
      .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313
2024-11-10 22:19:30 -08:00
Jamie Liu 2e6cfa72f2 ktime: support varying Timer implementations
- Rename Timer to SampledTimer.

- Move all Clock methods except Now to new interface SampledClock.

- Move SampledTimer's exported methods (except SetClock) to new interface
  Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
  methods that must be implemented.

- Add interface method Clock.NewTimer.

This is in preparation for cl/693856539, which adds a second Timer
implementation.

PiperOrigin-RevId: 694299679
2024-11-07 17:19:25 -08:00
Jamie Liu e23347e5b5 Move //pkg/sentry/kernel/time to //pkg/sentry/ktime.
This avoids needing to rename it everywhere it's imported.

PiperOrigin-RevId: 693930089
2024-11-06 18:13:51 -08:00
Ayush Ranjan 1e5b6ec429 Add more context to errors during restore.
This will help with debugging restore failures.

PiperOrigin-RevId: 693436013
2024-11-05 12:18:57 -08:00
Fabricio Voznika c8c41e5e30 Move S/R code to separate file
Move Kernel S/R code to kernel_restore.go.

PiperOrigin-RevId: 691441815
2024-10-30 09:15:35 -07:00
Ayush Ranjan 56ccc08f37 Rename OCIEnviron to SpecEnviron.
PiperOrigin-RevId: 684626452
2024-10-10 17:12:03 -07:00
cweld510 db4ffada10 style feedback: remove newlines, fix import, remove stray comment
2024-10-07 22:39:13 +00:00
cweld510 727bc9c72a Add and implement option to close unsaveable gofer-backed unix sockets on save
2024-10-04 20:13:38 +00:00
Ayush Ranjan cb418b7f09 Add kernel.Saver.OCIEnviron().
PiperOrigin-RevId: 682080366
2024-10-03 16:48:03 -07:00
Nicolas Lacasse cceb04f05a Clean up host.TTYFileOperations.
We used to track the foreground process group & session on the
TTYFileOperation, but these are already tracked in kernel.TTY.ThreadGroup.

So remove TTYFileOperations.fgProcessGroup and .session, and replace them with
a kernel.TTY.

This is analogous to how sentry-internal tty's already work.

Updates #10925

PiperOrigin-RevId: 681957240
2024-10-03 11:25:52 -07:00
Nicolas Lacasse d5a9d523bb Implement /dev/tty for donated host TTYs
Fixes #10925

PiperOrigin-RevId: 681684673
2024-10-02 19:40:43 -07:00
Jamie Liu b99fd8711f kernel: fix lock order inversion in ThreadGroup.Release()
PiperOrigin-RevId: 681199251
2024-10-01 16:10:15 -07:00
Jamie Liu 03bebc4402 kernel: add ThreadGroup.signalLock()
This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().

This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.

PiperOrigin-RevId: 681128063
2024-10-01 12:48:10 -07:00
Jamie Liu 41f01d8f9c pgalloc: integrate async page loading
When a pages file is provided to `runsc restore`, reads from that file are
asynchronous (via statefile.AsyncReader) in order to maximize throughput.
However, all such reads must complete before Kernel.LoadFrom() returns, so
applications cannot execute before MemoryFile loading is complete. The main
objective of this CL is to allow reads to continue after Kernel.LoadFrom()
returns, allowing applications to execute while MemoryFile loading is still in
progress. This behavior is user-visible: it affects whether deleting the pages
file frees disk space immediately on POSIX filesystems, may affect whether
deletion is possible on non-POSIX filesystems, and prevents unmounting
regardless. Thus it is flag-guarded as `runsc restore --background`.

MemoryFile ranges that have yet to be loaded, but that are being waited-for by
applications, should be prioritized over ranges for which no application is
waiting. This requires that application requests for data (calls to
MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which
ranges have not yet been loaded, request reads for such ranges with elevated
priority, and wait for only those reads to be completed; none of these are
supported by the existing statefile.AsyncReader.

Thus:

- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is
  designed to be easily implementable using a goroutine pool, Linux native AIO,
  or io_uring, though it only includes a goroutine pool implementation.
  (io_uring is widely disabled due to security vulnerabilities. In my testing,
  Linux native AIO is slower than the goroutine pool, but this may change with
  lower GOMAXPROCS, which needs further testing.)

- Move I/O scheduling into pgalloc: introduce an async page loader goroutine
  that is started by MemoryFile.LoadFrom() when async page loading is requested
  (implicitly, via the existence of a pages file), which is responsible for
  driving submission of read requests and handling their completions.
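
The prioritization requirement above (waited-for ranges before background ranges) can be illustrated with a toy scheduler. This is a sketch under assumptions, not the pgalloc design: `loader`, `demand`, and `next` are hypothetical, ranges are reduced to integer IDs, and actual I/O is omitted.

```go
package main

import "fmt"

// loader is a hypothetical scheduler for page ranges that still need to be
// read from the pages file.
type loader struct {
	pending []int // range IDs in background order
	urgent  []int // range IDs an application is waiting on
	loaded  map[int]bool
}

func newLoader(ranges ...int) *loader {
	return &loader{pending: ranges, loaded: make(map[int]bool)}
}

// demand marks a range as waited-for, unless it is already loaded.
func (l *loader) demand(id int) {
	if !l.loaded[id] {
		l.urgent = append(l.urgent, id)
	}
}

// next returns the next range to read, preferring waited-for ranges, and
// marks it loaded. ok is false when nothing is left.
func (l *loader) next() (id int, ok bool) {
	for len(l.urgent) > 0 {
		id, l.urgent = l.urgent[0], l.urgent[1:]
		if !l.loaded[id] {
			l.loaded[id] = true
			return id, true
		}
	}
	for len(l.pending) > 0 {
		id, l.pending = l.pending[0], l.pending[1:]
		if !l.loaded[id] {
			l.loaded[id] = true
			return id, true
		}
	}
	return 0, false
}

func main() {
	l := newLoader(0, 1, 2, 3)
	l.demand(2) // an application touches range 2 first
	for id, ok := l.next(); ok; id, ok = l.next() {
		fmt.Println(id)
	}
}
```

With range 2 demanded before loading starts, this sketch reads the ranges in the order 2, 0, 1, 3.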

PiperOrigin-RevId: 679321884
2024-09-26 15:51:13 -07:00
Nayana Bidari 740dc367db Mark netstack as save and use it only in tests
- Adds a new flag which will enable netstack s/r. When the flag is not enabled,
there is no change in the existing behavior. The flag will be enabled only in
tests to verify the s/r functionality of netstack.
- Some additional fields in netstack were causing panics when netstack was
saved and restored. Such fields are marked as 'save'/'nosave' accordingly to
resolve the panics.

PiperOrigin-RevId: 668566657
2024-08-28 12:49:43 -07:00
Ayush Ranjan 218f52a9f5 Parallelize MemoryFile save and kernel save.
This complements 39730b714c ("Load pgalloc.MemoryFile and kernel parallely
with compression=none mode.")

This is a performance optimization. Kernel and MemoryFile are saved
independently. The save can be done in parallel when using --compression=none
because the kernel and MemoryFile are being saved in different files.

PiperOrigin-RevId: 668128205
2024-08-27 14:03:15 -07:00
Jamie Liu 87ec1007b4 Buffer page metadata file I/O.
PiperOrigin-RevId: 666590672
2024-08-22 19:55:11 -07:00
Jamie Liu 4298980325 Disallow task creation after Kernel.WaitExited() returns.
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.

PiperOrigin-RevId: 658891833
2024-08-02 13:41:17 -07:00
Ayush Ranjan bd9b5a819f Add a runsc wait --checkpoint n command to wait for a checkpoint to complete.
This command waits for (n-1)th checkpoint to complete successfully. Then waits
for the next checkpoint attempt (which would increment checkpoint count to n)
and returns its status.

If sandbox checkpoint count has already reached n, it returns immediately.

PiperOrigin-RevId: 651884599
2024-07-12 14:19:46 -07:00
Ayush Ranjan 847bd58dc7 Add a checkpoint counter to the kernel.
This counter is incremented after each checkpoint upon a successful restore or
resume. It can be used to track the number of times the sandbox has been
checkpointed.

The counter is protected by a new mutex in the kernel. This mutex is used to
protect all checkpointing-related fields in the kernel. Earlier, set and get on
Kernel.saver were not synchronized.

PiperOrigin-RevId: 650858921
2024-07-09 21:57:12 -07:00
Ayush Ranjan 69c3e8d632 Move VDSOParamPage out of Timekeeper.
We want to decouple the netstack from the kernel. This coupling is
causing bugs in restore because netstack needs to be created before the Kernel
is restored. So right now, netstack ends up using a "temporary" Kernel which is
later destroyed in the restore sequence, but netstack keeps referencing it.

Before this change, netstack was being initialized with a `kernel.Timekeeper`.
This `Timekeeper` was being initialized with `VDSOParamPage`, which references
the MemoryFile of the kernel. So as a result, on restore the netstack ends up
referencing the destroyed kernel's MemoryFile.

So instead, move VDSOParamPage out of Timekeeper altogether. The callers of
`Timekeeper.SetClocks()` and `Timekeeper.ResumeUpdates()` pass the
VDSOParamPage from the kernel that is currently in use.

PiperOrigin-RevId: 647168588
2024-06-26 20:31:38 -07:00