33 Commits

Author SHA1 Message Date
Ayush Ranjan 156f457e28 Add kernel.ThreadGroup.ForEachTask().
PiperOrigin-RevId: 733991896
2025-03-05 22:23:11 -08:00
Etienne Perot 04f9204697 Yield thread group leader *Task in TaskSet.ForEachThreadGroup.
This makes this function usable from outside of the `kernel` package
without needing to call `tg.Leader()` (which requires a lock that
`TaskSet.ForEachThreadGroup` already acquires).
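
The pattern described above can be sketched as follows. This is a minimal illustration, not gVisor's code: `Task`, `ThreadGroup`, `TaskSet`, and `leaderNames` are simplified stand-ins, and the real locking is more involved.

```go
package main

import (
	"fmt"
	"sync"
)

// Task and ThreadGroup are simplified stand-ins, not gVisor's real types.
type Task struct{ Name string }

type ThreadGroup struct{ leader *Task }

// TaskSet guards its thread groups (and their leaders) with a single lock.
type TaskSet struct {
	mu     sync.RWMutex
	groups []*ThreadGroup
}

// ForEachThreadGroup calls f with each thread group and its leader while
// holding ts.mu for reading, so f never has to take the lock itself.
func (ts *TaskSet) ForEachThreadGroup(f func(tg *ThreadGroup, leader *Task)) {
	ts.mu.RLock()
	defer ts.mu.RUnlock()
	for _, tg := range ts.groups {
		f(tg, tg.leader)
	}
}

// leaderNames collects leader names using only the callback arguments.
func leaderNames(ts *TaskSet) []string {
	var names []string
	ts.ForEachThreadGroup(func(_ *ThreadGroup, leader *Task) {
		names = append(names, leader.Name)
	})
	return names
}

func main() {
	ts := &TaskSet{groups: []*ThreadGroup{
		{leader: &Task{Name: "init"}},
		{leader: &Task{Name: "worker"}},
	}}
	fmt.Println(leaderNames(ts))
}
```

Passing the leader into the callback means the caller never needs a second, nested lock acquisition to obtain it.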

PiperOrigin-RevId: 721168957
2025-01-29 17:30:05 -08:00
Jamie Liu 2d90353f9f kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine"
  (task_sched.go:Kernel.runCPUClockTicker()) periodically advances
  Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
  monotonic wall clock (that happens to be suspended when no tasks are
  running).

- Task goroutines observe the most recent value of Kernel.cpuClock when
  changing state (Task.gosched.Timestamp), and use it to compute the number of
  CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
  approximately based on the wall time during which they were marked as
  running.

- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
  ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
  timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
  goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
  CPU clock ticker, and are not informed as to when corresponding tasks start
  or stop running (due to overhead on the task execution critical path), so
  they can't determine when CPU clocks have advanced or will advance; instead, they simply
  poll CPU clocks on a period equal to that of the represented timer, resulting
  in significant overhead for CPU-clock-based POSIX interval timers and
  timerfds.

- For the same reason, CPU clock interval timers and timerfds may expire much
  later than when the CPU clock is actually incremented; in the interval timer
  case, this can result in notification signals being sent long after tasks
  have stopped running. (This is the same problem as in b/116538398, which
  motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
  above, but applied to POSIX interval timers.)

- The sentry does not impose a limit on the number of tasks that may be
  concurrently marked running, so if more tasks are marked running than the
  number of CPUs advertised to applications, application CPU utilization can
  appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
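
A rough sketch of the ticker-driven approach, under heavy simplification: the `task` type, its fields, and `tick` are hypothetical, and how tasks are selected when more are running than there are cores is a policy question this sketch does not attempt to answer (it simply takes the first ones it sees).

```go
package main

import "fmt"

// task is a simplified stand-in for a sentry task with its own CPU clock.
type task struct {
	name     string
	running  bool
	cpuTicks uint64 // per-task CPU clock, in ticks
	deadline uint64 // CPU timer expiry in ticks; 0 means no timer armed
	expired  int    // number of timer expirations observed
}

// tick advances the CPU clocks of at most maxCores running tasks by one tick
// and expires any CPU timer whose deadline has been reached.
func tick(tasks []*task, maxCores int) {
	advanced := 0
	for _, t := range tasks {
		if !t.running {
			continue
		}
		if advanced == maxCores {
			break
		}
		advanced++
		t.cpuTicks++
		if t.deadline != 0 && t.cpuTicks >= t.deadline {
			t.expired++
			t.deadline = 0 // one-shot timer in this sketch
		}
	}
}

func main() {
	tasks := []*task{
		{name: "a", running: true, deadline: 2},
		{name: "b", running: true},
		{name: "c", running: true},
	}
	for i := 0; i < 3; i++ {
		tick(tasks, 2)
	}
	for _, t := range tasks {
		fmt.Printf("%s: ticks=%d expired=%d\n", t.name, t.cpuTicks, t.expired)
	}
}
```

Because the ticker both advances the clock and checks the deadline, no separate goroutine has to poll the clock, and total CPU time charged per tick is capped by `maxCores`.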

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.

- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
  (https://go.dev/blog/randv2#problem.rand).
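
For reference, drawing a bounded `int32` with math/rand/v2 might look like the following. `jitter` is a hypothetical helper name, not something taken from the gVisor tree.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// jitter returns a uniformly distributed value in [0, n). n must be > 0,
// otherwise rand.Int32N panics.
func jitter(n int32) int32 {
	return rand.Int32N(n)
}

func main() {
	fmt.Println(jitter(10))
}
```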

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
      .it_interval = {0, kTimerPeriodNS},
      .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313
2024-11-10 22:19:30 -08:00
Jamie Liu 4298980325 Disallow task creation after Kernel.WaitExited() returns.
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
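
One way to close such a window, sketched with hypothetical names (`kernel`, `NewTask`, `TaskExited`, and `errNoNewTasks` are illustrative, not the real API), is to set a "no new tasks" flag under the same lock that `WaitExited` uses to observe that the task count has reached zero:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoNewTasks = errors.New("kernel has exited; task creation is disabled")

// kernel is a simplified stand-in that only tracks the number of live tasks.
type kernel struct {
	mu         sync.Mutex
	exited     sync.Cond // signaled when liveTasks drops to zero
	liveTasks  int
	noNewTasks bool
}

func newKernel() *kernel {
	k := &kernel{}
	k.exited.L = &k.mu
	return k
}

// NewTask registers a task unless WaitExited has already returned.
func (k *kernel) NewTask() error {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.noNewTasks {
		return errNoNewTasks
	}
	k.liveTasks++
	return nil
}

// TaskExited records that one task has finished.
func (k *kernel) TaskExited() {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.liveTasks--
	if k.liveTasks == 0 {
		k.exited.Broadcast()
	}
}

// WaitExited blocks until no tasks remain, then disables task creation
// before releasing the lock, so no task can slip in afterwards.
func (k *kernel) WaitExited() {
	k.mu.Lock()
	defer k.mu.Unlock()
	for k.liveTasks != 0 {
		k.exited.Wait()
	}
	k.noNewTasks = true
}

func main() {
	k := newKernel()
	fmt.Println(k.NewTask())
	k.TaskExited()
	k.WaitExited()
	fmt.Println(k.NewTask())
}
```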

PiperOrigin-RevId: 658891833
2024-08-02 13:41:17 -07:00
Ayush Ranjan 448f894c70 Expose some kernel hooks.
- Add Kernel.IsPaused() which indicates whether the kernel is currently paused.
- Add TaskSet.ForEachThreadGroup() which allows callers to iterate through all
  thread groups in the kernel.
- Export FDTable.ForEach() which allows other packages to iterate over all FDs.

PiperOrigin-RevId: 640760723
2024-06-05 21:43:43 -07:00
Shambhavi Srivastava 3657484eee Adding /proc/[pid]/task/[tid]/children
PiperOrigin-RevId: 557888186
2023-08-17 11:41:48 -07:00
Fabricio Voznika 500658dc81 Return correct number of PIDs with multi-container
Updates #172

PiperOrigin-RevId: 552620987
2023-07-31 16:20:47 -07:00
Nicolas Lacasse e7bd1b4c9c Implement PR_{S,G}ET_CHILD_SUBREAPER.
Closes #2323

PiperOrigin-RevId: 548205854
2023-07-14 13:19:25 -07:00
Jamie Liu 94bf4b6469 Return consistent IDs for PID namespaces via procfs.
PiperOrigin-RevId: 543529000
2023-06-26 13:49:23 -07:00
Andrei Vagin 604233c9f6 kernel: use lockdep mutexes
PiperOrigin-RevId: 449877248
2022-05-19 18:33:59 -07:00
Etienne Perot c9f8b165cf Cache each thread group's TID within its own namespace.
This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the
following lock cycle:
`kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex`
-> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex`

(Also, less locking within `createVMALocked` is probably for the better in
general.)
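
The caching idea can be sketched like this. The types and the use of an atomic field are assumptions made for illustration; the actual change may store and invalidate the cached value differently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// threadGroup caches the ID assigned to it in its own PID namespace.
type threadGroup struct {
	cachedID atomic.Int32
}

// ID reads the cached value without taking the namespace lock, so callers
// that already hold other locks cannot create a lock cycle through here.
func (tg *threadGroup) ID() int32 { return tg.cachedID.Load() }

// pidNamespace is a simplified stand-in that hands out sequential IDs.
type pidNamespace struct {
	mu    sync.RWMutex
	last  int32
	tgids map[*threadGroup]int32
}

// add assigns an ID under the lock and publishes it to the cache.
func (ns *pidNamespace) add(tg *threadGroup) int32 {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	ns.last++
	ns.tgids[tg] = ns.last
	tg.cachedID.Store(ns.last)
	return ns.last
}

func main() {
	ns := &pidNamespace{tgids: make(map[*threadGroup]int32)}
	tg := &threadGroup{}
	ns.add(tg)
	fmt.Println(tg.ID())
}
```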

PiperOrigin-RevId: 449588573
2022-05-18 15:14:14 -07:00
Rahat Mahmood bf251f1838 cgroupfs: Implement pids controller.
This also introduces the controller charge interface.

PiperOrigin-RevId: 444703063
2022-04-26 16:50:15 -07:00
Etienne Perot 95d883a92e Refactor task start and exit from a PID namespace into separate functions.
PiperOrigin-RevId: 426083905
2022-02-03 01:47:27 -08:00
Jamie Liu 8682ce689e Remove state:"nosave"/"zerovalue" annotations from all waiter.Queues.
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields,
which meant that it couldn't handle intrusive linked lists, which meant that it
couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a
result, VFS1 unregisters all epoll waiters before saving and re-registers them
after loading, and waitable VFS1 file implementations tag their waiter.Queues
state:"nosave" (causing them to be skipped by the save/restore machinery) or
state:"zerovalue" (causing them to only be checked for zero-value-equality on
save).

VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance
pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2
epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2
file implementations do not tag waiter.Queues.

Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used
by both VFS1 and VFS2 (the latter via signalfd); as a result of the above,
tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll.
Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2),
and remove these tags from all waiter.Queues.

Also clean up after the epoll test added by cl/402323053, which implied this
issue (by instantiating DisableSave in the new test) without reporting it.

PiperOrigin-RevId: 402596216
2021-10-12 10:25:30 -07:00
Rahat Mahmood 932c8abd0f Implement cgroupfs.
A skeleton implementation of cgroupfs. It supports trivial cpu and
memory controllers with no support for hierarchies.

PiperOrigin-RevId: 366561126
2021-04-02 21:10:44 -07:00
Dean Deng acd516cfe2 Add YAMA security module restrictions on ptrace(2).
Restrict ptrace(2) according to the default configurations of the YAMA security
module (mode 1), which is a common default among various Linux distributions.
The new access checks only permit the tracer to proceed if one of the following
conditions is met:

a) The tracer is already attached to the tracee.

b) The target is a descendant of the tracer.

c) The target has explicitly given permission to the tracer through the
PR_SET_PTRACER prctl.

d) The tracer has CAP_SYS_PTRACE.

See security/yama/yama_lsm.c for more details.

Note that these checks are added to CanTrace, which is checked for
PTRACE_ATTACH as well as some other operations, e.g., checking a process'
memory layout through /proc/[pid]/mem.

Since this patch adds restrictions to ptrace, it may break compatibility for
applications run by non-root users that, for instance, rely on being able to
trace processes that are not descended from the tracer (e.g., `gdb -p`). YAMA
restrictions can be turned off by setting /proc/sys/kernel/yama/ptrace_scope
to 0, or exceptions can be made on a per-process basis with the PR_SET_PTRACER
prctl.
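
Conditions (a) through (d) above can be sketched as a single predicate. The `proc` type and `yamaAllows` are hypothetical, and the sketch simplifies PR_SET_PTRACER to a single permitted process.

```go
package main

import "fmt"

// proc is a simplified stand-in for a process, not gVisor's Task type.
type proc struct {
	parent       *proc
	tracer       *proc // tracer currently attached, if any
	allowed      *proc // process named via PR_SET_PTRACER, if any
	capSysPtrace bool  // whether the process holds CAP_SYS_PTRACE
}

// isDescendant reports whether target has ancestor somewhere up its
// parent chain.
func isDescendant(target, ancestor *proc) bool {
	for p := target.parent; p != nil; p = p.parent {
		if p == ancestor {
			return true
		}
	}
	return false
}

// yamaAllows applies the four mode-1 conditions in order.
func yamaAllows(tracer, target *proc) bool {
	switch {
	case target.tracer == tracer: // (a) already attached
		return true
	case isDescendant(target, tracer): // (b) target descends from tracer
		return true
	case target.allowed == tracer: // (c) explicit PR_SET_PTRACER grant
		return true
	case tracer.capSysPtrace: // (d) privileged tracer
		return true
	}
	return false
}

func main() {
	parent := &proc{}
	child := &proc{parent: parent}
	stranger := &proc{}
	fmt.Println(yamaAllows(parent, child), yamaAllows(stranger, child))
}
```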

Reported-by: syzbot+622822d8bca08c99e8c8@syzkaller.appspotmail.com
PiperOrigin-RevId: 359237723
2021-02-24 02:03:16 -08:00
Dean Deng f52f0101bb Implement F_GETLK fcntl.
Fixes #5113.

PiperOrigin-RevId: 353313374
2021-01-22 13:58:16 -08:00
Jamie Liu 6bbf662271 Reduce the cost of sysinfo(2).
- sysinfo(2) does not actually require a fine-grained breakdown of memory
  usage. Accordingly, instead of calling pgalloc.MemoryFile.UpdateUsage() to
  update the sentry's fine-grained memory accounting snapshot, just use
  pgalloc.MemoryFile.TotalUsage() (which is a single fstat(), and therefore far
  cheaper).

- Use the number of threads in the root PID namespace (i.e. globally) rather
  than in the task's PID namespace for consistency with Linux (which just reads
  global variable nr_threads), and add a new method to kernel.PIDNamespace to
  allow this to be read directly from an underlying map rather than requiring
  the allocation and population of an intermediate slice.
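
The second point can be illustrated by contrasting a slice-returning accessor with a direct count. `pidNamespace`, `Tasks`, and `NumTasks` here are simplified, hypothetical versions of whatever the real type exposes.

```go
package main

import (
	"fmt"
	"sync"
)

type task struct{ tid int32 }

// pidNamespace is a simplified stand-in holding tasks keyed by thread ID.
type pidNamespace struct {
	mu    sync.RWMutex
	tasks map[int32]*task
}

// Tasks allocates and fills a slice; wasteful if the caller only wants
// the length.
func (ns *pidNamespace) Tasks() []*task {
	ns.mu.RLock()
	defer ns.mu.RUnlock()
	out := make([]*task, 0, len(ns.tasks))
	for _, t := range ns.tasks {
		out = append(out, t)
	}
	return out
}

// NumTasks reads the count straight from the map with no allocation.
func (ns *pidNamespace) NumTasks() int {
	ns.mu.RLock()
	defer ns.mu.RUnlock()
	return len(ns.tasks)
}

func main() {
	ns := &pidNamespace{tasks: map[int32]*task{1: {tid: 1}, 2: {tid: 2}}}
	fmt.Println(len(ns.Tasks()), ns.NumTasks())
}
```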

PiperOrigin-RevId: 336353100
2020-10-09 13:23:30 -07:00
Rahat Mahmood d201feb8c5 Enable automated marshalling for the syscall package.
PiperOrigin-RevId: 331940975
2020-09-15 23:38:57 -07:00
Nicolas Lacasse 810748f5c9 Port aio to VFS2.
In order to make sure all aio goroutines have stopped during S/R, a new
WaitGroup was added to TaskSet, analogous to runningGoroutines. This WaitGroup
is incremented with each aio goroutine, and waited on during kernel.Pause.
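
A minimal sketch of that arrangement, with hypothetical names (`taskSet`, `queueAIO`, `pause`, `runAll`):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// taskSet is a simplified stand-in that tracks in-flight aio goroutines.
type taskSet struct {
	aioGoroutines sync.WaitGroup
}

// queueAIO runs cb on its own goroutine, counted by the WaitGroup.
// Add is called before the goroutine starts so pause cannot miss it.
func (ts *taskSet) queueAIO(cb func()) {
	ts.aioGoroutines.Add(1)
	go func() {
		defer ts.aioGoroutines.Done()
		cb()
	}()
}

// pause blocks until every queued aio goroutine has finished.
func (ts *taskSet) pause() {
	ts.aioGoroutines.Wait()
}

// runAll queues n callbacks, pauses, and reports how many completed.
func runAll(n int) int64 {
	var ts taskSet
	var done atomic.Int64
	for i := 0; i < n; i++ {
		ts.queueAIO(func() { done.Add(1) })
	}
	ts.pause()
	return done.Load()
}

func main() {
	fmt.Println(runAll(8))
}
```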

The old VFS1 aio code was changed to use this new WaitGroup, rather than
fs.Async. The only uses of fs.Async are now inode and mount Release operations,
which do not call fs.Async recursively. This fixes a lock-ordering violation
that can cause deadlocks.

Updates #1035.

PiperOrigin-RevId: 316689380
2020-06-16 08:49:06 -07:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
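
The alias approach might look roughly like this. In a real wrapper the aliases would live in their own package; they are declared in `main` here only to keep the sketch self-contained, and `counter` and `countTo` are hypothetical.

```go
package main

import (
	"fmt"
	stdsync "sync"
)

// Aliases make the wrapper's types identical to the standard library's,
// so callers can switch imports without changing any other code. Swapping
// in a different implementation later only requires editing these lines.
type (
	Mutex     = stdsync.Mutex
	WaitGroup = stdsync.WaitGroup
)

// counter uses only the aliased types.
type counter struct {
	mu Mutex
	n  int
}

func (c *counter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// countTo increments a shared counter from n goroutines.
func countTo(n int) int {
	var c counter
	var wg WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.inc()
		}()
	}
	wg.Wait()
	return c.n
}

func main() {
	fmt.Println(countTo(100))
}
```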

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
gVisor bot b50122379c Merge pull request #452 from zhangningdlut:chris_test_pidns
PiperOrigin-RevId: 260220279
2019-07-26 15:00:51 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Michael Pratt 75a5ccf5d9 Remove defer from trivial ThreadID methods
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.

PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-04 17:14:27 -07:00