gVisor currently implements CPU clocks as follows:
- A per-sentry "CPU clock ticker goroutine"
(task_sched.go:Kernel.runCPUClockTicker()) periodically advances
Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
monotonic wall clock (that happens to be suspended when no tasks are
running).
- Task goroutines observe the most recent value of Kernel.cpuClock when
changing state (Task.gosched.Timestamp), and use it to compute the number of
CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
approximately based on the wall time during which they were marked as
running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
goroutines.
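As an illustration of the accounting described above, here is a minimal Go sketch, not the sentry's actual code. The names cpuClock, task, and setRunning are hypothetical stand-ins for Kernel.cpuClock, the per-task scheduling state, and the state-change path; the only behavior modeled is "charge the ticks elapsed since the last state change to the task if it was marked running".

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cpuClock is a hypothetical stand-in for Kernel.cpuClock: a coarse tick
// counter that only the ticker goroutine advances.
var cpuClock atomic.Uint64

// task is a hypothetical stand-in for per-task scheduling state.
type task struct {
	running   bool
	timestamp uint64 // cpuClock value observed at the last state change
	ticks     uint64 // ticks accumulated while marked running
}

// setRunning records a state change. If the task was marked running, the
// ticks that elapsed since the previous state change are charged to it.
func (t *task) setRunning(running bool) {
	now := cpuClock.Load()
	if t.running {
		t.ticks += now - t.timestamp
	}
	t.running = running
	t.timestamp = now
}

func main() {
	var t task
	t.setRunning(true)
	cpuClock.Add(3) // ticker advances while the task is marked running
	t.setRunning(false)
	cpuClock.Add(5) // ticks while the task is not running are not charged
	t.setRunning(true)
	cpuClock.Add(1)
	t.setRunning(false)
	fmt.Println(t.ticks) // prints 4
}
```

Because the task only samples the counter at state changes, its CPU clock is really the wall time spent marked running, quantized to the ticker period.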
This has three major problems:
- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
CPU clock ticker, and are not informed as to when corresponding tasks start
or stop running (due to overhead on the task execution critical path), so
they can't determine when CPU clocks have advanced or will advance; instead, they simply
poll CPU clocks on a period equal to that of the represented timer, resulting
in significant overhead for CPU-clock-based POSIX interval timers and
timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much
later than when the CPU clock is actually incremented; in the interval timer
case, this can result in notification signals being sent long after tasks
have stopped running. (This is the same problem as in b/116538398, which
motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be
concurrently marked running, so if more tasks are marked running than the
number of CPUs advertised to applications, application CPU utilization can
appear to exceed 100%.
This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing up to Kernel.applicationCores of them in the
CPU clock ticker, and directly expiring CPU timers when doing so. ITIMER_VIRTUAL,
ITIMER_PROF, and RLIMIT_CPU timers lose their special-casing and instead behave
like other CPU timers (see task_acct.go). Kernel.cpuClock is still required, but
only for the sentry watchdog.
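The approach can be sketched roughly as follows. This is a simplified Go model under stated assumptions, not the code in task_acct.go: taskClock, cpuTimer, and tick are hypothetical names, and the policy of picking the first running tasks in slice order when more tasks are running than cores is an assumption made only for the example.

```go
package main

import "fmt"

// cpuTimer is a hypothetical one-shot timer on a task CPU clock.
type cpuTimer struct {
	deadline uint64 // clock value at which the timer expires
	fired    bool
}

// taskClock is a hypothetical explicit per-task CPU clock.
type taskClock struct {
	running bool
	ticks   uint64
	timers  []*cpuTimer
}

// tick advances the clocks of at most applicationCores running tasks by one
// tick and expires any of their timers whose deadline has been reached.
// It returns the number of clocks advanced.
func tick(tasks []*taskClock, applicationCores int) int {
	advanced := 0
	for _, t := range tasks {
		if advanced == applicationCores {
			break // never account more CPU time than the advertised cores
		}
		if !t.running {
			continue
		}
		t.ticks++
		advanced++
		// Timers are expired here, by the ticker, with no polling goroutine.
		for _, tm := range t.timers {
			if !tm.fired && t.ticks >= tm.deadline {
				tm.fired = true
			}
		}
	}
	return advanced
}

func main() {
	tm := &cpuTimer{deadline: 2}
	tasks := []*taskClock{
		{running: true, timers: []*cpuTimer{tm}},
		{running: true},
		{running: true},
	}
	fmt.Println(tick(tasks, 2), tm.fired) // prints "2 false"
	fmt.Println(tick(tasks, 2), tm.fired) // prints "2 true"
}
```

Capping the number of clocks advanced per tick is what keeps reported utilization at or below 100% of the advertised CPUs, and expiring timers in the same pass removes the delay between a clock advancing and its timers firing.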
Minor cleanup changes:
- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
(https://go.dev/blog/randv2#problem.rand).
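For reference, a bounded random value can be obtained from math/rand/v2 as sketched below. Which v2 function the CL actually calls is not stated here; rand.Int32N is assumed as the closest analogue of an Int31n-style helper, and boundedRand is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// boundedRand returns a uniformly distributed value in [0, n) using the
// package-level generator of math/rand/v2. It panics if n <= 0.
func boundedRand(n int32) int32 {
	return rand.Int32N(n)
}

func main() {
	fmt.Println(boundedRand(10)) // some value in [0, 10)
}
```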
Test workload:
```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;  // 10 ms

int main(int argc, char** argv) {
  // Arm kNumTimers periodic timers on this thread's CPU clock. SIGEV_NONE
  // means expirations send no signal; only the timer bookkeeping cost remains.
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  // Sleep so the thread consumes almost no CPU time; any CPU usage observed
  // is overhead from maintaining the timers.
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```
Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```
After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```
PiperOrigin-RevId: 695198313
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
PiperOrigin-RevId: 658891833
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").
Updates #1956.
PiperOrigin-RevId: 614125262
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes #209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.
PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284