11 Commits

Author SHA1 Message Date
Jamie Liu 2d90353f9f kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine"
  (task_sched.go:Kernel.runCPUClockTicker()) periodically advances
  Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
  monotonic wall clock (that happens to be suspended when no tasks are
  running).

- Task goroutines observe the most recent value of Kernel.cpuClock when
  changing state (Task.gosched.Timestamp), and use it to compute the number of
  CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
  approximately based on the wall time during which they were marked as
  running.

- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
  ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
  timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
  goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
  CPU clock ticker, and are not informed as to when corresponding tasks start
  or stop running (due to overhead on the task execution critical path), so
  they can't determine when CPU clocks have/will advance; instead, they simply
  poll CPU clocks on a period equal to that of the represented timer, resulting
  in significant overhead for CPU-clock-based POSIX interval timers and
  timerfds.

- For the same reason, CPU clock interval timers and timerfds may expire much
  later than when the CPU clock is actually incremented; in the interval timer
  case, this can result in notification signals being sent long after tasks
  have stopped running. (This is the same problem as in b/116538398, which
  motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
  above, but applied to POSIX interval timers.)

- The sentry does not impose a limit on the number of tasks that may be
  concurrently marked running, so if more tasks are marked running than the
  number of CPUs advertised to applications, application CPU utilization can
  appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.

- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
  (https://go.dev/blog/randv2#problem.rand).

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
      .it_interval = {0, kTimerPeriodNS},
      .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313
2024-11-10 22:19:30 -08:00
Jamie Liu 4298980325 Disallow task creation after Kernel.WaitExited() returns.
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.

PiperOrigin-RevId: 658891833
2024-08-02 13:41:17 -07:00
Ayush Ranjan 7e395bbbd4 Plumb restore context to load*() methods.
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").

Updates #1956.

PiperOrigin-RevId: 614125262
2024-03-08 20:28:02 -08:00
Jamie Liu ff81c0c639 Remove //pkg/sentry/device.
This package was used for VFS1 device number assignment.

PiperOrigin-RevId: 538918926
2023-06-08 16:21:04 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Rahat Mahmood 7cff746ef2 Save/restore simple devices.
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.

PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-04-01 15:39:16 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Zhaozhong Ni b1683df90b netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 203958972
Change-Id: Ia6fe16547539296d48e2c6731edacdd96bd6e93c
2018-07-10 09:23:35 -07:00
Brian Geffon 51c1e510ab Automated rollback of changelist 201596247
PiperOrigin-RevId: 202151720
Change-Id: I0491172c436bbb32b977f557953ba0bc41cfe299
2018-06-26 10:33:24 -07:00
Zhaozhong Ni 0e434b66a6 netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 201596247
Change-Id: Id22f47b2cdcbe14aa0d930f7807ba75f91a56724
2018-06-21 15:19:45 -07:00