This makes this function usable from outside of the `kernel` package
without needing to call `tg.Leader()` (which requires a lock that
`TaskSet.ForEachThreadGroup` already acquires).
PiperOrigin-RevId: 721168957
gVisor currently implements CPU clocks as follows:
- A per-sentry "CPU clock ticker goroutine"
(task_sched.go:Kernel.runCPUClockTicker()) periodically advances
Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
monotonic wall clock (that happens to be suspended when no tasks are
running).
- Task goroutines observe the most recent value of Kernel.cpuClock when
changing state (Task.gosched.Timestamp), and use it to compute the number of
CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
approximately based on the wall time during which they were marked as
running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
goroutines.
This has three major problems:
- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
CPU clock ticker, and are not informed as to when corresponding tasks start
or stop running (due to overhead on the task execution critical path), so
they can't determine when CPU clocks have advanced or will advance; instead, they simply
poll CPU clocks on a period equal to that of the represented timer, resulting
in significant overhead for CPU-clock-based POSIX interval timers and
timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much
later than when the CPU clock is actually incremented; in the interval timer
case, this can result in notification signals being sent long after tasks
have stopped running. (This is the same problem as in b/116538398, which
motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be
concurrently marked running, so if more tasks are marked running than the
number of CPUs advertised to applications, application CPU utilization can
appear to exceed 100%.
This CL fixes these problems by introducing explicit per-Task and per-ThreadGroup
CPU clocks, directly advancing up to Kernel.applicationCores of them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
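The ticker behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in task_sched.go or task_acct.go: the `task` and `cpuTimer` types, the `tick` function, and the `maxCores` parameter (standing in for Kernel.applicationCores) are all hypothetical, and the sketch ignores locking, thread group clocks, and any fairness in choosing which running tasks are advanced.

```go
package main

import "fmt"

// cpuTimer is a hypothetical one-shot timer that expires once its task's CPU
// clock reaches deadline (measured in ticks).
type cpuTimer struct {
	deadline int64
	fired    bool
}

// task is a hypothetical stand-in for a task with an explicit CPU clock.
type task struct {
	running  bool
	cpuTicks int64
	timers   []*cpuTimer
}

// tick advances the CPU clocks of at most maxCores running tasks by one tick
// and expires any of their timers whose deadline has been reached. It returns
// the number of tasks whose clocks were advanced.
func tick(tasks []*task, maxCores int) int {
	advanced := 0
	for _, t := range tasks {
		if advanced == maxCores {
			break
		}
		if !t.running {
			continue
		}
		t.cpuTicks++
		advanced++
		// Timers are expired by the ticker itself, so no separate goroutine
		// has to poll the clock.
		for _, tm := range t.timers {
			if !tm.fired && t.cpuTicks >= tm.deadline {
				tm.fired = true
			}
		}
	}
	return advanced
}

func main() {
	tasks := []*task{
		{running: true, timers: []*cpuTimer{{deadline: 2}}},
		{running: false},
		{running: true},
		{running: true},
	}
	for i := 0; i < 2; i++ {
		fmt.Println("advanced:", tick(tasks, 2))
	}
	fmt.Println("timer fired:", tasks[0].timers[0].fired)
}
```

Because at most `maxCores` clocks advance per tick, the total CPU time charged per tick cannot exceed the advertised core count, which is the property the third problem above is about.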
Minor cleanup changes:
- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
(https://go.dev/blog/randv2#problem.rand).
Test workload:
```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```
Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```
After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```
PiperOrigin-RevId: 695198313
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
PiperOrigin-RevId: 658891833
- Add Kernel.IsPaused() which indicates whether the kernel is currently paused.
- Add TaskSet.ForEachThreadGroup() which allows callers to iterate through all
thread groups in the kernel.
- Export FDTable.ForEach() which allows other packages to iterate over all FDs.
PiperOrigin-RevId: 640760723
This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the
following lock cycle:
`kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex`
-> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex`
(Also, less locking within `createVMALocked` is probably for the better in
general.)
PiperOrigin-RevId: 449588573
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields,
which meant that it couldn't handle intrusive linked lists, which meant that it
couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a
result, VFS1 unregisters all epoll waiters before saving and re-registers them
after loading, and waitable VFS1 file implementations tag their waiter.Queues
state:"nosave" (causing them to be skipped by the save/restore machinery) or
state:"zerovalue" (causing them to only be checked for zero-value-equality on
save).
VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance
pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2
epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2
file implementations do not tag waiter.Queues.
Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used
by both VFS1 and VFS2 (the latter via signalfd); as a result of the above,
tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll.
Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2),
and remove these tags from all waiter.Queues.
Also clean up after the epoll test added by cl/402323053, which implied this
issue (by instantiating DisableSave in the new test) without reporting it.
PiperOrigin-RevId: 402596216
Restrict ptrace(2) according to the default configurations of the YAMA security
module (mode 1), which is a common default among various Linux distributions.
The new access checks only permit the tracer to proceed if one of the following
conditions is met:
a) The tracer is already attached to the tracee.
b) The target is a descendant of the tracer.
c) The target has explicitly given permission to the tracer through the
PR_SET_PTRACER prctl.
d) The tracer has CAP_SYS_PTRACE.
See security/yama/yama_lsm.c for more details.
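As a rough illustration of conditions (a) through (d), the check might be sketched as below. The `proc` type and the `yamaAllows` and `isDescendant` functions are hypothetical and are not the CanTrace implementation. The sketch also simplifies (c) to an exact match on the permitted tracer.

```go
package main

import "fmt"

// proc is a hypothetical, simplified process record.
type proc struct {
	parent        *proc
	tracer        *proc // current ptrace tracer, if any
	allowedTracer *proc // tracer permitted via PR_SET_PTRACER, if any
	capSysPtrace  bool  // whether the process holds CAP_SYS_PTRACE
}

// isDescendant reports whether target is a descendant of ancestor.
func isDescendant(target, ancestor *proc) bool {
	for p := target.parent; p != nil; p = p.parent {
		if p == ancestor {
			return true
		}
	}
	return false
}

// yamaAllows reports whether tracer may trace target under the four
// conditions listed above.
func yamaAllows(tracer, target *proc) bool {
	switch {
	case target.tracer == tracer: // (a) already attached
		return true
	case isDescendant(target, tracer): // (b) target descends from tracer
		return true
	case target.allowedTracer == tracer: // (c) PR_SET_PTRACER permission
		return true
	case tracer.capSysPtrace: // (d) tracer has CAP_SYS_PTRACE
		return true
	}
	return false
}

func main() {
	shell := &proc{}
	child := &proc{parent: shell}
	stranger := &proc{}
	fmt.Println("parent traces child:", yamaAllows(shell, child))
	fmt.Println("stranger traces child:", yamaAllows(stranger, child))
	child.allowedTracer = stranger
	fmt.Println("stranger after PR_SET_PTRACER:", yamaAllows(stranger, child))
}
```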
Note that these checks are added to CanTrace, which is checked for
PTRACE_ATTACH as well as some other operations, e.g., checking a process'
memory layout through /proc/[pid]/mem.
Since this patch adds restrictions to ptrace, it may break compatibility for
applications run by non-root users that, for instance, rely on being able to
trace processes that are not descended from the tracer (e.g., `gdb -p`). YAMA
restrictions can be turned off by setting /proc/sys/kernel/yama/ptrace_scope
to 0, or exceptions can be made on a per-process basis with the PR_SET_PTRACER
prctl.
Reported-by: syzbot+622822d8bca08c99e8c8@syzkaller.appspotmail.com
PiperOrigin-RevId: 359237723
- sysinfo(2) does not actually require a fine-grained breakdown of memory
usage. Accordingly, instead of calling pgalloc.MemoryFile.UpdateUsage() to
update the sentry's fine-grained memory accounting snapshot, just use
pgalloc.MemoryFile.TotalUsage() (which is a single fstat(), and therefore far
cheaper).
- Use the number of threads in the root PID namespace (i.e. globally) rather
than in the task's PID namespace for consistency with Linux (which just reads
the global variable nr_threads), and add a new method to kernel.PIDNamespace to
allow this to be read directly from an underlying map rather than requiring
the allocation and population of an intermediate slice.
PiperOrigin-RevId: 336353100
In order to make sure all aio goroutines have stopped during S/R, a new
WaitGroup was added to TaskSet, analogous to runningGoroutines. This WaitGroup
is incremented with each aio goroutine, and waited on during kernel.Pause.
The old VFS1 aio code was changed to use this new WaitGroup, rather than
fs.Async. The only uses of fs.Async are now inode and mount Release operations,
which do not call fs.Async recursively. This fixes a lock-ordering violation
that can cause deadlocks.
Updates #1035.
PiperOrigin-RevId: 316689380
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes #209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.
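The shape of the change is sketched below under assumed names: `pidNamespace`, `idOfDeferred`, and `idOf` are hypothetical and do not reproduce ns.IDOfTask or tg.ID. Both variants return the same result. The second releases the lock explicitly, which is safe only because nothing between lock and unlock can panic or return early.

```go
package main

import (
	"fmt"
	"sync"
)

// pidNamespace is a hypothetical, simplified ID table.
type pidNamespace struct {
	mu  sync.RWMutex
	ids map[string]int
}

// idOfDeferred looks up an ID and releases the lock with defer.
func (ns *pidNamespace) idOfDeferred(name string) int {
	ns.mu.RLock()
	defer ns.mu.RUnlock()
	return ns.ids[name]
}

// idOf performs the same lookup but unlocks explicitly, avoiding the defer on
// a hot path.
func (ns *pidNamespace) idOf(name string) int {
	ns.mu.RLock()
	id := ns.ids[name]
	ns.mu.RUnlock()
	return id
}

func main() {
	ns := &pidNamespace{ids: map[string]int{"init": 1, "worker": 42}}
	fmt.Println(ns.idOfDeferred("worker"), ns.idOf("worker"))
}
```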
PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80