65 Commits

Author SHA1 Message Date
Nicolas Lacasse f9b1ce2f7d Clean up tty.CheckChange and call it in SetForegroundProcessGroup.
Previously, CheckChange (corresponding to Linux's tty/tty_check_change()) was
only used by the host TTY implementation, not the devpts implementation.

Furthermore, ThreadGroup.SetForegroundProcessGroup() duplicated some of the
logic in CheckChange, notably sending SIGTTOU to background tasks. This means
that, for host TTYs, we could send SIGTTOU multiple times. In some
circumstances, this leads to the ioctl returning ERESTARTSYS in an infinite loop.

PiperOrigin-RevId: 735934036
2025-03-11 16:46:55 -07:00
Jimmy Tran 17563a8af9 Return EACCES when calling setpgid() after execve()
From the setpgid manpage:

EACCES - An attempt was made to change the process group ID of one
of the children of the calling process and the child had
already performed an execve(2) (setpgid(), setpgrp()).

This CL makes gVisor implement this rule and updates the exec test
suite accordingly.
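
As a rough sketch of the rule quoted above (not the actual gVisor change), the check could look like the following. The `process` type, its `execed` flag, and the error values are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for errno values.
var (
	errAcces = errors.New("EACCES")
	errSrch  = errors.New("ESRCH")
)

// process is a simplified, hypothetical model of a process.
type process struct {
	pid, pgid int
	parent    *process
	execed    bool // set once the process has called execve()
}

// setpgid moves target into process group pgid on behalf of caller.
// Session and process group validity checks are omitted for brevity.
func setpgid(caller, target *process, pgid int) error {
	if target != caller {
		if target.parent != caller {
			return errSrch // target is neither the caller nor its child
		}
		if target.execed {
			return errAcces // the child has already performed execve()
		}
	}
	target.pgid = pgid
	return nil
}

func main() {
	parent := &process{pid: 1, pgid: 1}
	child := &process{pid: 2, pgid: 1, parent: parent}
	fmt.Println(setpgid(parent, child, 2)) // allowed: child has not exec'd
	child.execed = true
	fmt.Println(setpgid(parent, child, 1)) // rejected after execve()
}
```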

TESTED: http://sponge2/7f364e8a-4f82-463e-ba62-79234c4d054d
PiperOrigin-RevId: 727095560
2025-02-14 16:14:14 -08:00
Etienne Perot 2b55090a58 Do not crash when creating thread group with already-exceeded soft CPU limit.
Reported-by: syzbot+da9595a72d0762aaa48d@syzkaller.appspotmail.com
PiperOrigin-RevId: 699425946
2024-11-23 01:28:50 -08:00
Jamie Liu 94aa652d10 kernel: start RLIMIT_CPU timers in NewThreadGroup
Before cl/695198313, this bug only affected RLIMIT_CPU soft limits, which were
represented by tg.rlimitCPUSoftSetting, which was similarly left uninitialized by
Kernel.NewThreadGroup(); the CPU clock ticker fetched RLIMIT_CPU hard limits in
each tick. After cl/695198313, this bug affects both RLIMIT_CPU soft and hard
limits.

Itimers don't have the same issue since they're not preserved across fork().

PiperOrigin-RevId: 695936410
2024-11-12 18:15:32 -08:00
Jamie Liu 2d90353f9f kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine"
  (task_sched.go:Kernel.runCPUClockTicker()) periodically advances
  Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
  monotonic wall clock (that happens to be suspended when no tasks are
  running).

- Task goroutines observe the most recent value of Kernel.cpuClock when
  changing state (Task.gosched.Timestamp), and use it to compute the number of
  CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
  approximately based on the wall time during which they were marked as
  running.

- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
  ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
  timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
  goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
  CPU clock ticker, and are not informed as to when corresponding tasks start
  or stop running (due to overhead on the task execution critical path), so
  they can't determine when CPU clocks have advanced or will advance; instead, they simply
  poll CPU clocks on a period equal to that of the represented timer, resulting
  in significant overhead for CPU-clock-based POSIX interval timers and
  timerfds.

- For the same reason, CPU clock interval timers and timerfds may expire much
  later than when the CPU clock is actually incremented; in the interval timer
  case, this can result in notification signals being sent long after tasks
  have stopped running. (This is the same problem as in b/116538398, which
  motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
  above, but applied to POSIX interval timers.)

- The sentry does not impose a limit on the number of tasks that may be
  concurrently marked running, so if more tasks are marked running than the
  number of CPUs advertised to applications, application CPU utilization can
  appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
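
The approach can be pictured with the following simplified sketch. It is not the code in task_acct.go; the `task` and `cpuTimer` types, the tick-based units, and the first-come selection of which running tasks to advance are all assumptions made for illustration.

```go
package main

import "fmt"

// cpuTimer is a hypothetical CPU-clock timer measured in ticks.
type cpuTimer struct {
	deadline    int64 // tick at which the timer next expires; 0 means disarmed
	interval    int64 // reload period in ticks; 0 means one-shot
	expirations int   // number of times the timer has fired
}

// task is a hypothetical task with its own explicit CPU clock.
type task struct {
	running  bool
	cpuClock int64
	timers   []*cpuTimer
}

// tick advances the CPU clocks of at most applicationCores running tasks by
// one tick and expires any of their timers whose deadline has been reached.
func tick(tasks []*task, applicationCores int) {
	advanced := 0
	for _, t := range tasks {
		if !t.running {
			continue
		}
		if advanced == applicationCores {
			break // never account more CPU time than the advertised CPUs
		}
		advanced++
		t.cpuClock++
		for _, tm := range t.timers {
			if tm.deadline != 0 && t.cpuClock >= tm.deadline {
				tm.expirations++
				if tm.interval > 0 {
					tm.deadline += tm.interval
				} else {
					tm.deadline = 0
				}
			}
		}
	}
}

func main() {
	tm := &cpuTimer{deadline: 2, interval: 2}
	tasks := []*task{
		{running: true, timers: []*cpuTimer{tm}},
		{running: true},
		{running: true},
	}
	for i := 0; i < 4; i++ {
		tick(tasks, 2)
	}
	fmt.Println(tasks[0].cpuClock, tasks[2].cpuClock, tm.expirations)
}
```

Because timers are expired in the same step that advances the clock, no per-timer polling goroutine is needed, and total accounted CPU time per tick is bounded by the number of advertised cores.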

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.

- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
  (https://go.dev/blog/randv2#problem.rand).
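
For reference, a bounded random integer with math/rand/v2 can be obtained as sketched below. The commit message does not say which math/rand/v2 function replaced kernel.randInt31n(); `rand.Int32N` is used here as one plausible choice, and `randBelow` is a hypothetical wrapper.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// randBelow returns a uniformly distributed value in [0, n).
// It panics if n <= 0, as rand.Int32N does.
func randBelow(n int32) int32 {
	return rand.Int32N(n)
}

func main() {
	fmt.Println(randBelow(10))
}
```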

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
      .it_interval = {0, kTimerPeriodNS},
      .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313
2024-11-10 22:19:30 -08:00
Jamie Liu 2e6cfa72f2 ktime: support varying Timer implementations
- Rename Timer to SampledTimer.

- Move all Clock methods except Now to new interface SampledClock.

- Move SampledTimer's exported methods (except SetClock) to new interface
  Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
  methods that must be implemented.

- Add interface method Clock.NewTimer.

This is in preparation for cl/693856539, which adds a second Timer
implementation.

PiperOrigin-RevId: 694299679
2024-11-07 17:19:25 -08:00
Jamie Liu 379108ca91 ktime: simplify Listener.NotifyTimer()
No implementations of Listener use the Setting argument or the ability to
override the new Setting, so remove these.

PiperOrigin-RevId: 694195729
2024-11-07 11:47:58 -08:00
Jamie Liu e23347e5b5 Move //pkg/sentry/kernel/time to //pkg/sentry/ktime.
This avoids needing to rename it everywhere it's imported.

PiperOrigin-RevId: 693930089
2024-11-06 18:13:51 -08:00
Nicolas Lacasse cceb04f05a Clean up host.TTYFileOperations.
We used to track the foreground process group & session on the
TTYFileOperation, but these are already tracked in kernel.TTY.ThreadGroup.

So remove TTYFileOperations.fgProcessGroup and .session, and replace them with
a kernel.TTY.

This is analogous to how sentry-internal TTYs already work.

Updates #10925

PiperOrigin-RevId: 681957240
2024-10-03 11:25:52 -07:00
Jamie Liu b99fd8711f kernel: fix lock order inversion in ThreadGroup.Release()
PiperOrigin-RevId: 681199251
2024-10-01 16:10:15 -07:00
Jamie Liu a32d047f68 kernel: don't hold TaskSet.mu during most of Kernel.runCPUClockTicker()
The removed `tg.leader == nil` check doesn't actually affect the correctness of
the rest of the loop body.

PiperOrigin-RevId: 681163998
2024-10-01 14:25:47 -07:00
Jamie Liu 03bebc4402 kernel: add ThreadGroup.signalLock()
This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().

This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.

PiperOrigin-RevId: 681128063
2024-10-01 12:48:10 -07:00
Ayush Ranjan 7e395bbbd4 Plumb restore context to load*() methods.
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").

Updates #1956.

PiperOrigin-RevId: 614125262
2024-03-08 20:28:02 -08:00
NymanRobin f481172b53 Convert atomic.Value to atomic.Pointer[T]
2024-03-05 11:09:23 +02:00
Ayush Ranjan f82d97c9ee Only reset tty.tg to nil when its controlling process is being released.
This means that when tg is being released, IFF tg.tty.tg == tg (which means tg
was tg.tty's controlling process), then we can reset tty.tg to nil.

Otherwise, as shown in reproducers of #9898, when a non-controlling process
exits, it resets the TTY's tg field (which indicates the controlling thread
group) and subsequently the alive controlling thread group can no longer
receive signals from the TTY.

Fixes #9898

PiperOrigin-RevId: 600987817
2024-01-23 20:22:06 -08:00
Etienne Perot 69e0c7643d Use clear on map types wherever possible.
This is similar to pull request #9749 but for maps rather than slices.
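
As a small illustration of the pattern (the helper names here are hypothetical), the `clear` builtin from Go 1.21 replaces the manual delete loop:

```go
package main

import "fmt"

// resetOld empties a map with the pre-Go 1.21 idiom.
func resetOld(m map[string]int) {
	for k := range m {
		delete(m, k)
	}
}

// resetNew empties a map with the clear builtin (Go 1.21+).
func resetNew(m map[string]int) {
	clear(m)
}

func main() {
	a := map[string]int{"x": 1, "y": 2}
	b := map[string]int{"x": 1, "y": 2}
	resetOld(a)
	resetNew(b)
	fmt.Println(len(a), len(b))
}
```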

PiperOrigin-RevId: 586504320
2023-11-29 18:00:07 -08:00
Nicolas Lacasse 47db4119a2 ThreadGroup should disassociate from tty on exit.
Added syscall test case and also tested with:
```
$ docker container run -it --name debian-runsc --runtime=runsc debian:12 bash -c "apt update && apt install -y curl"
<snip>
Setting up libkeyutils1:amd64 (1.6.3-2) ...
Setting up libpsl5:amd64 (0.21.2-1) ...
Setting up libbrotli1:amd64 (1.0.9-2+b6) ...
Setting up libssl3:amd64 (3.0.11-1~deb12u2) ...
Setting up libnghttp2-14:amd64 (1.52.0-1) ...
Setting up krb5-locales (1.20.1-2+deb12u1) ...
Setting up libldap-common (2.5.13+dfsg-5) ...
Setting up libkrb5support0:amd64 (1.20.1-2+deb12u1) ...
Setting up libsasl2-modules-db:amd64 (2.1.28+dfsg-10) ...
Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2+b2) ...
Setting up libk5crypto3:amd64 (1.20.1-2+deb12u1) ...
Setting up libsasl2-2:amd64 (2.1.28+dfsg-10) ...
Setting up libssh2-1:amd64 (1.10.0-3+b1) ...
Setting up libkrb5-3:amd64 (1.20.1-2+deb12u1) ...
Setting up openssl (3.0.11-1~deb12u2) ...
Setting up publicsuffix (20230209.2326-1) ...
Setting up libsasl2-modules:amd64 (2.1.28+dfsg-10) ...
Setting up libldap-2.5-0:amd64 (2.5.13+dfsg-5) ...
Setting up ca-certificates (20230311) ...
<snip>
```

With these changes, there is no more error about TIOCSCTTY, and the `Setting
up..` log lines are formatted properly.

Fixes #9642

PiperOrigin-RevId: 580204710
2023-11-07 09:26:16 -08:00
Andrei Vagin b357d71828 TIOCSCTTY has to succeed if a specified tty is a controlling one already
This behavior isn't documented in the tty_ioctl man page,
but it has been in the kernel for ages.

PiperOrigin-RevId: 577097643
2023-10-26 23:57:42 -07:00
Nicolas Lacasse e7bd1b4c9c Implement PR_{S,G}ET_CHILD_SUBREAPER.
Closes #2323
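
For context, the subreaper attribute changes where orphaned descendants are reparented. The sketch below shows the core lookup under simplifying assumptions: the `process` type and `findReaper` are hypothetical, and details such as PID namespaces and sibling threads are ignored.

```go
package main

import "fmt"

// process is a simplified, hypothetical model of a process.
type process struct {
	pid       int
	parent    *process
	subreaper bool // set via prctl(PR_SET_CHILD_SUBREAPER, 1)
}

// findReaper returns the process that adopts the children of exiting:
// the nearest ancestor marked as a subreaper, or init if there is none.
func findReaper(exiting, init *process) *process {
	for p := exiting.parent; p != nil; p = p.parent {
		if p.subreaper {
			return p
		}
	}
	return init
}

func main() {
	init := &process{pid: 1}
	supervisor := &process{pid: 10, parent: init, subreaper: true}
	worker := &process{pid: 20, parent: supervisor}
	fmt.Println(findReaper(worker, init).pid)
}
```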

PiperOrigin-RevId: 548205854
2023-07-14 13:19:25 -07:00
Nicolas Lacasse 8184fa1db0 Clean up devpts code, and deduplicate the foreground process state.
We no longer store the foreground process directly in the terminal. Instead, we
get it from the terminal TTY's ThreadGroup. Added a new method:
tty.SignalForegroundProcessGroup to simplify this.

Cleaned up some things along the way:
* Terminal had a bunch of methods to get/set foreground process group and
  controlling TTY, but those methods were only usable by Ioctl, since they
  read/wrote to syscall arguments. I moved that logic to Ioctl, and deleted the
  methods from Terminal, which is now a very simple type.
* Fixed a bug in ThreadGroup.SetForegroundProcessGroup where we were
  overwriting the ID of an existing process group, rather than setting a new
  process group on the session.
* Simplified the construction of lineDiscipline type.

Reported-by: syzbot+ae5b769cec8ad969c086@syzkaller.appspotmail.com
PiperOrigin-RevId: 512330758
2023-02-25 14:08:58 -08:00
Etienne Perot 445fa6f40c Lockdep: Print more info in the "unbalanced unlock" case.
This CL does the following:

- Add the ability for nested locks to have names.
- Give names to all current uses of nested locks in the codebase.
- Truncate `lockdep` debug stack traces to avoid the clutter from the
  `lockdep` code itself.
- Simplify `lockdep` to no longer require `classMap`.

PiperOrigin-RevId: 491486620
2022-11-28 17:53:09 -08:00
Ayush Ranjan 1fa3c06f1e Delete VFS1 completely.
- Delete pkg/sentry/fs/*.
- Move pkg/sentry/fs/fsutil out of VFS1 directory and remove VFS1 components.
- Remove remaining unused references to VFS1 from remaining codebase.
- Rename/refactor code to avoid even referencing VFS2, unless necessary.
- Rewrite VFS1-only tests to VFS2.

Updates #1624

PiperOrigin-RevId: 490064269
2022-11-21 13:57:52 -08:00
Nelson Elhage a3a5772491 Fixes to TTOU handling in TIOCSPGRP
Fixes two issues in TTOU handling while handling TIOCSPGRP on a tty
device.

Fixes #7941; see that issue for details of the bugs.

Updates the tests to test the fixed behavior; both tests are verified
to fail without the `thread_group.go` fixes.
2022-09-02 14:32:54 -07:00
Andrei Vagin 604233c9f6 kernel: use lockdep mutexes
PiperOrigin-RevId: 449877248
2022-05-19 18:33:59 -07:00
Ayush Ranjan f6ed4523dc Reformat codebase.
PiperOrigin-RevId: 449358041
2022-05-17 17:48:35 -07:00