This CL restores the listening connections when netstack s/r is enabled.
The changes include:
- Adds a workaround method that replaces the routes and NICs of the loaded
stack with the new ones after restore.
- Adds Restore() to transport-layer protocols to restore the protocol-level
background workers.
- Adds an afterLoad() method for fdbased processors.
- Adds a test to verify that a listening connection is restored after
checkpointing with netstack s/r enabled.
- Makes a few other changes to save/restore fields to enable netstack s/r.
PiperOrigin-RevId: 698453124
- Move ktime.VariableTimer to kernel.timekeeperTcpipTimer, its only use case.
This allows timekeeperTcpipTimer to use concrete types kernel.timekeeperClock
and ktime.SampledTimer instead of ktime.Clock and ktime.Timer, saving a tiny
amount of memory (interface values consist of two pointers) and CPU (for
interface method calls).
- Fix a bug where timekeeperTcpipTimer expiration can cancel a racing call to
timekeeperTcpipTimer.Reset() (see use of new field
timekeeperTcpipTimer.resets).
- Define Listener.NotifyTimer directly on timekeeperTcpipTimer (dropping
ktime.functionNotifier), and move goroutine spawning from the anonymous
function in ktime.AfterFunc() into timekeeperTcpipTimer.NotifyTimer(). This
slightly simplifies the control flow and saves an allocation for the
anonymous function object.
- Use monotonicClock rather than realtimeClock. It doesn't make sense for
time-of-day clock adjustments to affect netstack timeouts, and this is
consistent with tcpip.stdClock => time.AfterFunc => runtime.timer.
PiperOrigin-RevId: 695504159
gVisor currently implements CPU clocks as follows:
- A per-sentry "CPU clock ticker goroutine"
(task_sched.go:Kernel.runCPUClockTicker()) periodically advances
Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
monotonic wall clock (that happens to be suspended when no tasks are
running).
- Task goroutines observe the most recent value of Kernel.cpuClock when
changing state (Task.gosched.Timestamp), and use it to compute the number of
CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
approximately based on the wall time during which they were marked as
running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
goroutines.
This has three major problems:
- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
CPU clock ticker, and are not informed as to when corresponding tasks start
or stop running (due to overhead on the task execution critical path), so
they can't determine when CPU clocks have advanced or will advance; instead,
they simply poll CPU clocks on a period equal to that of the represented
timer, resulting in significant overhead for CPU-clock-based POSIX interval
timers and timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much
later than when the CPU clock is actually incremented; in the interval timer
case, this can result in notification signals being sent long after tasks
have stopped running. (This is the same problem as in b/116538398, which
motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be
concurrently marked running, so if more tasks are marked running than the
number of CPUs advertised to applications, application CPU utilization can
appear to exceed 100%.
This CL fixes these problems by introducing explicit per-Task and
per-ThreadGroup CPU clocks, directly advancing up to Kernel.applicationCores
of them in the CPU clock ticker, and directly expiring CPU timers when doing
so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
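The ticker behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in task_acct.go: the task struct, its fields, and the tick function are hypothetical, and charging the first running tasks in slice order when more tasks are running than applicationCores is an assumption made only to keep the example short.

```go
package main

import "fmt"

// task is a hypothetical, heavily simplified model of a task with an explicit
// CPU clock and at most one CPU timer.
type task struct {
	running  bool
	cpuTicks uint64 // explicit per-task CPU clock, in ticks
	deadline uint64 // CPU timer deadline in ticks; 0 means no timer is armed
	expired  bool   // set when the CPU timer has fired
}

// tick advances the CPU clocks of at most applicationCores running tasks by
// one tick and expires CPU timers whose deadline has been reached. It returns
// the number of tasks charged.
func tick(tasks []*task, applicationCores int) int {
	charged := 0
	for _, t := range tasks {
		if charged == applicationCores {
			break // Never account for more CPU time than the advertised cores.
		}
		if !t.running {
			continue
		}
		t.cpuTicks++
		charged++
		// Expire the timer directly, instead of polling from another goroutine.
		if t.deadline != 0 && t.cpuTicks >= t.deadline {
			t.expired = true
			t.deadline = 0
		}
	}
	return charged
}

func main() {
	tasks := []*task{
		{running: true, deadline: 1},
		{running: true},
		{running: true},
	}
	charged := tick(tasks, 2)
	fmt.Println(charged, tasks[0].expired, tasks[2].cpuTicks) // prints "2 true 0"
}
```

Because at most applicationCores clocks advance per tick, the sum of task CPU time cannot grow faster than the advertised number of CPUs allows.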
Minor cleanup changes:
- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
(https://go.dev/blog/randv2#problem.rand).
Test workload:
```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```
Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```
After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```
PiperOrigin-RevId: 695198313
- Rename Timer to SampledTimer.
- Move all Clock methods except Now to new interface SampledClock.
- Move SampledTimer's exported methods (except SetClock) to new interface
Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
methods that must be implemented.
- Add interface method Clock.NewTimer.
This is in preparation for cl/693856539, which adds a second Timer
implementation.
PiperOrigin-RevId: 694299679
We used to track the foreground process group and session on
TTYFileOperations, but these are already tracked in kernel.TTY.ThreadGroup.
So remove TTYFileOperations.fgProcessGroup and .session, and replace them with
a kernel.TTY.
This is analogous to how sentry-internal TTYs already work.
Updates #10925
PiperOrigin-RevId: 681957240
This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().
This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.
PiperOrigin-RevId: 681128063
When a pages file is provided to `runsc restore`, reads from that file are
asynchronous (via statefile.AsyncReader) in order to maximize throughput.
However, all such reads must complete before Kernel.LoadFrom() returns, so
applications cannot execute before MemoryFile loading is complete. The main
objective of this CL is to allow reads to continue after Kernel.LoadFrom()
returns, allowing applications to execute while MemoryFile loading is still in
progress. This behavior is user-visible: it affects whether deleting the pages
file frees disk space immediately on POSIX filesystems, may affect whether
deletion is possible on non-POSIX filesystems, and prevents unmounting
regardless. Thus it is flag-guarded as `runsc restore --background`.
MemoryFile ranges that have yet to be loaded, but that are being waited-for by
applications, should be prioritized over ranges for which no application is
waiting. This requires that application requests for data (calls to
MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which
ranges have not yet been loaded, request reads for such ranges with elevated
priority, and wait for only those reads to be completed; none of these are
supported by the existing statefile.AsyncReader.
Thus:
- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is
designed to be easily implementable using a goroutine pool, Linux native AIO,
or io_uring, though it only includes a goroutine pool implementation.
(io_uring is widely disabled due to security vulnerabilities. In my testing,
Linux native AIO is slower than the goroutine pool, but this may change with
lower GOMAXPROCS, which needs further testing.)
- Move I/O scheduling into pgalloc: introduce an async page loader goroutine
that is started by MemoryFile.LoadFrom() when async page loading is requested
(implicitly, via the existence of a pages file), which is responsible for
driving submission of read requests and handling their completions.
PiperOrigin-RevId: 679321884
- Adds a new flag which will enable netstack s/r. When the flag is not enabled,
there is no change in the existing behavior. The flag will be enabled only in
tests to verify the s/r functionality of netstack.
- Some additional fields in netstack caused panics when netstack was
saved/restored. Such fields are now marked as 'save' or 'nosave' as
appropriate to resolve the panics.
PiperOrigin-RevId: 668566657
This complements 39730b714c ("Load pgalloc.MemoryFile and kernel parallely
with compression=none mode.")
This is a performance optimization. Kernel and MemoryFile are saved
independently. The save can be done in parallel when using --compression=none
because the kernel and MemoryFile are being saved in different files.
PiperOrigin-RevId: 668128205
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
PiperOrigin-RevId: 658891833
This command waits for the (n-1)th checkpoint to complete successfully. It then
waits for the next checkpoint attempt (which would increment the checkpoint
count to n) and returns its status.
If the sandbox's checkpoint count has already reached n, it returns immediately.
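The waiting semantics described above can be sketched with a condition variable. This is a hypothetical model, not the runsc implementation: the checkpointTracker type, its fields, and the finish/wait names are invented for illustration, and it assumes the count is incremented only by successful attempts.

```go
package main

import (
	"fmt"
	"sync"
)

// checkpointTracker is a hypothetical tracker of checkpoint attempts.
type checkpointTracker struct {
	mu       sync.Mutex
	cond     *sync.Cond
	count    uint32 // number of successful checkpoints
	attempts uint64 // number of finished attempts, successful or not
	lastErr  error  // status of the most recent attempt
}

func newCheckpointTracker() *checkpointTracker {
	c := &checkpointTracker{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// finish records the outcome of one checkpoint attempt and wakes all waiters.
func (c *checkpointTracker) finish(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.attempts++
	if err == nil {
		c.count++
	}
	c.lastErr = err
	c.cond.Broadcast()
}

// wait returns immediately if n checkpoints have already succeeded.
// Otherwise it waits for n-1 successes, then for the next attempt, and
// returns that attempt's status.
func (c *checkpointTracker) wait(n uint32) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.count >= n {
		return nil
	}
	for c.count < n-1 {
		c.cond.Wait()
	}
	seen := c.attempts
	for c.attempts == seen && c.count < n {
		c.cond.Wait()
	}
	if c.count >= n {
		return nil
	}
	return c.lastErr
}

func main() {
	c := newCheckpointTracker()
	done := make(chan error, 1)
	go func() { done <- c.wait(2) }()
	c.finish(nil)
	c.finish(nil)
	fmt.Println(<-done, c.wait(1)) // prints "<nil> <nil>"
}
```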
PiperOrigin-RevId: 651884599
This counter is incremented after each checkpoint upon a successful restore or
resume. It can be used to track the number of times the sandbox has been
checkpointed.
The counter is protected by a new mutex in the kernel. This mutex is used to
protect all checkpointing-related fields in the kernel. Earlier, set and get on
Kernel.saver were not synchronized.
PiperOrigin-RevId: 650858921
We want to decouple the netstack from the kernel. This coupling is
causing bugs in restore because netstack needs to be created before the Kernel
is restored. So right now, netstack ends up using a "temporary" Kernel which is
later destroyed in the restore sequence, but netstack keeps referencing it.
Before this change, netstack was being initialized with a `kernel.TimeKeeper`.
This `TimeKeeper` was being initialized with `VDSOParamPage`, which references
the MemoryFile of the kernel. As a result, on restore the netstack ended up
referencing the destroyed kernel's MemoryFile.
So instead, move VDSOParamPage out of TimeKeeper altogether. The callers of
`TimeKeeper.SetClocks()` and `TimeKeeper.ResumeUpdates()` pass the
VDSOParamPage of the kernel that is currently in use.
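The ownership change can be sketched as passing the parameter page per call instead of storing it. The types below are hypothetical placeholders (paramPage and timeKeeper are not the gVisor types, and the SetClocks signature is assumed); the sketch only shows that the timekeeper retains no reference to any page, so it cannot outlive the kernel that owns one.

```go
package main

import "fmt"

// paramPage is a hypothetical stand-in for a VDSO parameter page owned by a
// particular kernel. seq counts updates written to it.
type paramPage struct{ seq uint64 }

// timeKeeper holds clock state but no reference to any parameter page.
type timeKeeper struct{ monotonicOffset int64 }

// SetClocks updates the clock state and publishes it to the page supplied by
// the caller, which is expected to belong to the kernel currently in use.
func (t *timeKeeper) SetClocks(offset int64, page *paramPage) {
	t.monotonicOffset = offset
	page.seq++
}

func main() {
	tk := &timeKeeper{}
	temporary := &paramPage{} // page of the short-lived kernel used early in restore
	restored := &paramPage{}  // page of the restored kernel
	tk.SetClocks(0, temporary)
	tk.SetClocks(42, restored)
	fmt.Println(temporary.seq, restored.seq) // prints "1 1"
}
```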
PiperOrigin-RevId: 647168588