Previously, CheckChange (corresponding to Linux's tty/tty_check_change()) was
only used by the host TTY implementation, not the devpts implementation.
Furthermore, ThreadGroup.SetForegroundProcessGroup() duplicated some of the
logic in CheckChange, notably sending SIGTTOU to background tasks. This means
that, for host TTYs, we could send SIGTTOU multiple times. In some
circumstances, this leads to the ioctl returning ERESTARTSYS in an infinite loop.
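A rough sketch of the check that CheckChange performs, modeled on Linux's
tty_check_change(). The types, field names, and error values here are
hypothetical simplifications, not the sentry's actual API. The point is that
SIGTTOU is sent from exactly one place:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values standing in for the kernel's errnos.
var (
	errRestartSys = errors.New("ERESTARTSYS")
	errIO         = errors.New("EIO")
)

// tty and caller are simplified, hypothetical stand-ins for the sentry types.
type tty struct {
	id             int
	foregroundPGID int // 0 means no foreground process group
}

type caller struct {
	controllingTTY   int
	pgid             int
	sigttouInhibited bool // SIGTTOU is blocked or ignored by the caller
	pgOrphaned       bool // the caller's process group is orphaned
	sigttouSent      int  // counts SIGTTOUs sent to the caller's group
}

// checkChange decides whether a caller may change settings on the TTY.
func checkChange(c *caller, term *tty) error {
	// Only the caller's controlling TTY is subject to job control.
	if c.controllingTTY != term.id {
		return nil
	}
	// Foreground callers may always proceed.
	if term.foregroundPGID == 0 || c.pgid == term.foregroundPGID {
		return nil
	}
	// Background callers that block or ignore SIGTTOU may proceed.
	if c.sigttouInhibited {
		return nil
	}
	if c.pgOrphaned {
		return errIO
	}
	// Send SIGTTOU once and ask for the syscall to be restarted.
	c.sigttouSent++
	return errRestartSys
}

func main() {
	term := &tty{id: 1, foregroundPGID: 100}
	bg := &caller{controllingTTY: 1, pgid: 200}
	fmt.Println(checkChange(bg, term), bg.sigttouSent)
}
```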
PiperOrigin-RevId: 735934036
From the setpgid(2) man page:
EACCES - An attempt was made to change the process group ID of one
of the children of the calling process and the child had
already performed an execve(2) (setpgid(), setpgrp()).
This CL makes gVisor implement this rule and updates the exec test
suite accordingly.
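The rule can be sketched as follows. The proc type and error values are
hypothetical, and the other setpgid(2) checks (session membership, EPERM, and
so on) are deliberately left out:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values standing in for errnos.
var (
	errACCES = errors.New("EACCES")
	errSRCH  = errors.New("ESRCH")
)

// proc is a simplified, hypothetical process record.
type proc struct {
	parent *proc
	pgid   int
	execed bool // set once the process has called execve(2)
}

// setpgid applies only the child-after-exec rule plus the ESRCH check.
func setpgid(caller, target *proc, pgid int) error {
	if target != caller {
		// The target must be the caller or one of its children.
		if target.parent != caller {
			return errSRCH
		}
		// A child that has already exec'd may not be moved.
		if target.execed {
			return errACCES
		}
	}
	target.pgid = pgid
	return nil
}

func main() {
	parent := &proc{pgid: 1}
	child := &proc{parent: parent, pgid: 1}
	fmt.Println(setpgid(parent, child, 42)) // child has not exec'd yet
	child.execed = true
	fmt.Println(setpgid(parent, child, 43)) // rejected after execve
}
```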
TESTED: http://sponge2/7f364e8a-4f82-463e-ba62-79234c4d054d
PiperOrigin-RevId: 727095560
Before cl/695198313, this bug only affected RLIMIT_CPU soft limits, which were
represented by tg.rlimitCPUSoftSetting, which was similarly left uninitialized by
Kernel.NewThreadGroup(); the CPU clock ticker fetched RLIMIT_CPU hard limits in
each tick. After cl/695198313, this bug affects both RLIMIT_CPU soft and hard
limits.
Itimers don't have the same issue since they're not preserved across fork().
PiperOrigin-RevId: 695936410
gVisor currently implements CPU clocks as follows:
- A per-sentry "CPU clock ticker goroutine"
(task_sched.go:Kernel.runCPUClockTicker()) periodically advances
Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
monotonic wall clock (that happens to be suspended when no tasks are
running).
- Task goroutines observe the most recent value of Kernel.cpuClock when
changing state (Task.gosched.Timestamp), and use it to compute the number of
CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
approximately based on the wall time during which they were marked as
running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
goroutines.
This has three major problems:
- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
CPU clock ticker, and are not informed as to when corresponding tasks start
or stop running (due to overhead on the task execution critical path), so
they can't determine when CPU clocks have advanced or will advance; instead, they simply
poll CPU clocks on a period equal to that of the represented timer, resulting
in significant overhead for CPU-clock-based POSIX interval timers and
timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much
later than when the CPU clock is actually incremented; in the interval timer
case, this can result in notification signals being sent long after tasks
have stopped running. (This is the same problem as in b/116538398, which
motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be
concurrently marked running, so if more tasks are marked running than the
number of CPUs advertised to applications, application CPU utilization can
appear to exceed 100%.
This CL fixes these problems by introducing explicit per-Task and
per-ThreadGroup CPU clocks, directly advancing up to Kernel.applicationCores of
them in the CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
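A minimal sketch of one ticker iteration under this design. The task type and
its fields are hypothetical, and the choice of which tasks advance when more
are running than applicationCores (here, slice order) is an assumption of the
sketch, not necessarily what task_acct.go does:

```go
package main

import "fmt"

// task is a simplified, hypothetical stand-in for a sentry task.
type task struct {
	running     bool
	cpuTicks    uint64 // explicit per-task CPU clock, in ticks
	timerExpiry uint64 // CPU clock value at which a timer fires; 0 = unarmed
	expirations int
}

// tick advances the CPU clocks of at most applicationCores running tasks
// and expires any CPU timer whose deadline has been reached.
func tick(tasks []*task, applicationCores int) {
	advanced := 0
	for _, tk := range tasks {
		if advanced == applicationCores {
			break // cap utilization at the advertised CPU count
		}
		if !tk.running {
			continue
		}
		tk.cpuTicks++
		advanced++
		// Timers are expired directly by the ticker; nothing polls.
		if tk.timerExpiry != 0 && tk.cpuTicks >= tk.timerExpiry {
			tk.expirations++
			tk.timerExpiry = 0
		}
	}
}

func main() {
	tasks := []*task{
		{running: true, timerExpiry: 1},
		{running: true},
		{running: true},
		{running: false},
	}
	tick(tasks, 2)
	for i, tk := range tasks {
		fmt.Println(i, tk.cpuTicks, tk.expirations)
	}
}
```

With three tasks marked running and two application cores, only two clocks
advance per tick, so total accounted CPU time cannot exceed 100% of the
advertised CPUs.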
Minor cleanup changes:
- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
(https://go.dev/blog/randv2#problem.rand).
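For illustration, a bounded random integer with math/rand/v2 looks like this.
pickIndex is a hypothetical caller, and the sketch assumes that
kernel.randInt31n() took an int32 bound, which this message does not confirm:

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// pickIndex returns a uniformly distributed value in [0, n).
// rand.Int32N panics if n <= 0, so callers must pass a positive bound.
func pickIndex(n int32) int32 {
	return rand.Int32N(n)
}

func main() {
	fmt.Println(pickIndex(10))
}
```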
Test workload:
```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>
constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;
int main(int argc, char** argv) {
for (int i = 0; i < kNumTimers; i++) {
struct sigevent sev = {.sigev_notify = SIGEV_NONE};
timer_t timerid;
if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
err(1, "timer_create failed");
}
struct itimerspec it = {
.it_interval = {0, kTimerPeriodNS},
.it_value = {0, kTimerPeriodNS},
};
if (timer_settime(timerid, 0, &it, nullptr) < 0) {
err(1, "timer_settime failed");
}
}
std::this_thread::sleep_for(std::chrono::seconds(5));
return 0;
}
```
Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```
After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```
PiperOrigin-RevId: 695198313
- Rename Timer to SampledTimer.
- Move all Clock methods except Now to new interface SampledClock.
- Move SampledTimer's exported methods (except SetClock) to new interface
Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
methods that must be implemented.
- Add interface method Clock.NewTimer.
This is in preparation for cl/693856539, which adds a second Timer
implementation.
PiperOrigin-RevId: 694299679
We used to track the foreground process group & session on the
TTYFileOperations, but these are already tracked in kernel.TTY.ThreadGroup.
So remove TTYFileOperations.fgProcessGroup and .session, and replace them with
a kernel.TTY.
This is analogous to how sentry-internal TTYs already work.
Updates #10925
PiperOrigin-RevId: 681957240
This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().
This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.
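A sketch of the idea, with hypothetical names and a simplified state enum:
the exit state is stored in an atomic so that the signal path can read it
without holding the lock that otherwise guards writes to it.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// taskExitState is a simplified, hypothetical version of the exit states.
type taskExitState int32

const (
	exitNone taskExitState = iota
	exitInitiated
	exitZombie
	exitDead
)

var errSRCH = errors.New("ESRCH")

// task is a hypothetical stand-in for kernel.Task.
type task struct {
	exitState atomic.Int32 // written by the exit path, read by signal senders
	pending   int          // in real code, guarded by the signal handlers lock
}

// setExitState is what the exit path would call.
func (tk *task) setExitState(s taskExitState) {
	tk.exitState.Store(int32(s))
}

// sendSignal reads exitState atomically instead of relying on a wider lock.
func (tk *task) sendSignal() error {
	if taskExitState(tk.exitState.Load()) == exitDead {
		return errSRCH
	}
	tk.pending++
	return nil
}

func main() {
	tk := &task{}
	fmt.Println(tk.sendSignal())
	tk.setExitState(exitDead)
	fmt.Println(tk.sendSignal())
}
```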
PiperOrigin-RevId: 681128063
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").
Updates #1956.
PiperOrigin-RevId: 614125262
This means that when tg is being released, IFF tg.tty.tg == tg (which means tg
was tg.tty's controlling process), then we can reset tty.tg to nil.
Otherwise, as shown in reproducers of #9898, when a non-controlling process
exits, it resets the TTY's tg field (which indicates the controlling thread
group) and subsequently the alive controlling thread group can no longer
receive signals from the TTY.
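The release-time check can be sketched as below. The types are hypothetical
simplifications of kernel.TTY and kernel.ThreadGroup, and locking is omitted:

```go
package main

import "fmt"

// tty records which thread group is its controlling process.
type tty struct {
	tg *threadGroup
}

// threadGroup records the TTY it is attached to, if any.
type threadGroup struct {
	name string
	tty  *tty
}

// release detaches tg from its TTY. The TTY's controlling thread group is
// cleared only if tg itself was the controlling process.
func (tg *threadGroup) release() {
	if tg.tty == nil {
		return
	}
	if tg.tty.tg == tg {
		tg.tty.tg = nil
	}
	tg.tty = nil
}

func main() {
	term := &tty{}
	shell := &threadGroup{name: "shell", tty: term}
	child := &threadGroup{name: "child", tty: term}
	term.tg = shell

	child.release() // non-controlling process exits
	fmt.Println(term.tg == shell)
	shell.release() // controlling process exits
	fmt.Println(term.tg == nil)
}
```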
Fixes #9898
PiperOrigin-RevId: 600987817
Added syscall test case and also tested with:
```
$ docker container run -it --name debian-runsc --runtime=runsc debian:12 bash -c "apt update && apt install -y curl"
<snip>
Setting up libkeyutils1:amd64 (1.6.3-2) ...
Setting up libpsl5:amd64 (0.21.2-1) ...
Setting up libbrotli1:amd64 (1.0.9-2+b6) ...
Setting up libssl3:amd64 (3.0.11-1~deb12u2) ...
Setting up libnghttp2-14:amd64 (1.52.0-1) ...
Setting up krb5-locales (1.20.1-2+deb12u1) ...
Setting up libldap-common (2.5.13+dfsg-5) ...
Setting up libkrb5support0:amd64 (1.20.1-2+deb12u1) ...
Setting up libsasl2-modules-db:amd64 (2.1.28+dfsg-10) ...
Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2+b2) ...
Setting up libk5crypto3:amd64 (1.20.1-2+deb12u1) ...
Setting up libsasl2-2:amd64 (2.1.28+dfsg-10) ...
Setting up libssh2-1:amd64 (1.10.0-3+b1) ...
Setting up libkrb5-3:amd64 (1.20.1-2+deb12u1) ...
Setting up openssl (3.0.11-1~deb12u2) ...
Setting up publicsuffix (20230209.2326-1) ...
Setting up libsasl2-modules:amd64 (2.1.28+dfsg-10) ...
Setting up libldap-2.5-0:amd64 (2.5.13+dfsg-5) ...
Setting up ca-certificates (20230311) ...
<snip>
```
With these changes, there is no more error about TIOCSCTTY, and the `Setting
up..` log lines are formatted properly.
Fixes #9642
PiperOrigin-RevId: 580204710
We no longer store the foreground process directly in the terminal. Instead, we
get it from the terminal TTY's ThreadGroup. Added a new method:
tty.SignalForegroundProcessGroup to simplify this.
Cleaned up some things along the way:
* Terminal had a bunch of methods to get/set foreground process group and
controlling TTY, but those methods were only usable by Ioctl, since they
read/wrote to syscall arguments. I moved that logic to Ioctl, and deleted the
methods from Terminal, which is now a very simple type.
* Fixed a bug in ThreadGroup.SetForegroundProcessGroup where we were
overwriting the ID of an existing process group, rather than setting a new
process group on the session.
* Simplified the construction of lineDiscipline type.
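One reading of the SetForegroundProcessGroup bug above, as a sketch with
hypothetical types: the old code mutated the ID of the group the session
already pointed at, while the fix makes the session point at the new group.

```go
package main

import "fmt"

// Hypothetical, simplified process group and session types.
type processGroup struct{ id int }

type session struct{ foreground *processGroup }

// setForegroundBuggy overwrites the existing group's ID, corrupting that
// group for everything else that references it.
func setForegroundBuggy(s *session, pg *processGroup) {
	s.foreground.id = pg.id
}

// setForeground replaces the session's foreground group instead.
func setForeground(s *session, pg *processGroup) {
	s.foreground = pg
}

func main() {
	oldPG, newPG := &processGroup{id: 1}, &processGroup{id: 2}
	s := &session{foreground: oldPG}
	setForeground(s, newPG)
	fmt.Println(oldPG.id, s.foreground == newPG)
}
```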
Reported-by: syzbot+ae5b769cec8ad969c086@syzkaller.appspotmail.com
PiperOrigin-RevId: 512330758
This CL does the following:
- Add the ability for nested locks to have names.
- Give names to all current uses of nested locks in the codebase.
- Truncate `lockdep` debug stack traces to avoid the clutter from the
`lockdep` code itself.
- Simplify `lockdep` to no longer require `classMap`.
PiperOrigin-RevId: 491486620
Fixes two issues in TTOU handling while handling TIOCSPGRP on a tty
device.
Fixes #7941; see that issue for details of the bugs.
Updates the tests to test the fixed behavior; both tests are verified
to fail without the `thread_group.go` fixes.