779 Commits

Author SHA1 Message Date
Nicolas Lacasse f9b1ce2f7d Clean up tty.CheckChange and call it in SetForegroundProcessGroup.
Previously, CheckChange (corresponding to Linux's tty/tty_check_change()) was
only used by the host TTY implementation, not the devpts implementation.

Furthermore, ThreadGroup.SetForegroundProcessGroup() duplicated some of the
logic in CheckChange, notably sending SIGTTOU to background tasks. This means
that, for host TTYs, we could send SIGTTOU multiple times. In some
circumstances, this leads to the ioctl returning ERESTARTSYS in an infinite loop.

PiperOrigin-RevId: 735934036
2025-03-11 16:46:55 -07:00
Fabricio Voznika c041d9bd58 Add missing binary_sha256 field
Fixes #11466

PiperOrigin-RevId: 734209881
2025-03-06 11:01:58 -08:00
Ayush Ranjan 156f457e28 Add kernel.ThreadGroup.ForEachTask().
PiperOrigin-RevId: 733991896
2025-03-05 22:23:11 -08:00
gVisor bot 86abc85f37 Merge pull request #11473 from Champ-Goblem:shim-add-cgroup-v2-metrics-support
PiperOrigin-RevId: 730560110
2025-02-25 14:52:09 -08:00
Jimmy Tran 17563a8af9 Return EACCES when calling setpgid() after execve()
From the setpgid manpage:

EACCES - An attempt was made to change the process group ID of one
of the children of the calling process and the child had
already performed an execve(2) (setpgid(), setpgrp()).

This CL makes gVisor implement this rule and updates the exec test
suite accordingly.
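
The rule quoted above can be sketched in isolation. This is a minimal illustration, not the gVisor implementation: the `proc` type, its fields, and `errAccess` are hypothetical names, and every other setpgid() check is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// errAccess stands in for EACCES; real code would return a syscall errno.
var errAccess = errors.New("EACCES")

// proc is a hypothetical, simplified process record.
type proc struct {
	parent *proc
	pgid   int
	execed bool // set once the process has called execve(2)
}

// setpgid applies only the EACCES rule from the manpage excerpt:
// a parent may not move a child that has already performed execve(2).
func setpgid(caller, target *proc, pgid int) error {
	if target != caller && target.parent == caller && target.execed {
		return errAccess
	}
	target.pgid = pgid
	return nil
}

func main() {
	parent := &proc{pgid: 1}
	child := &proc{parent: parent, pgid: 1}
	fmt.Println(setpgid(parent, child, 2)) // child has not called execve yet
	child.execed = true
	fmt.Println(setpgid(parent, child, 3)) // child has called execve
}
```

In this sketch the first call succeeds and the second returns the EACCES stand-in, leaving the child's process group unchanged.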

TESTED: http://sponge2/7f364e8a-4f82-463e-ba62-79234c4d054d
PiperOrigin-RevId: 727095560
2025-02-14 16:14:14 -08:00
Nicolas Lacasse d949e7177c taskCopyContext should not require holding task.mu.
The primary existing user (ptrace) does not do this, and it leads to lock
inversion with MemoryManager.mappingMu.

PiperOrigin-RevId: 725353311
2025-02-10 14:49:56 -08:00
Jimmy Tran de6637c27c Recompute max variable after setting FD in the bitmap.
`fdBitmap.FirstZero()` could return `max` value; if it does, then
recompute the max value to avoid reusing the old max value twice.

The default bitmap size for file descriptors in gVisor is 65535.

Add a pipe test that attempts to create more than 65535 FDs to hit the edge
case where fdBitmap.FirstZero() returns the default bitmap max value of 65535.

TESTED:
http://sponge2/4c12ce75-3763-4773-ad62-87c6b8fe0446
http://sponge2/9c9d6ea0-b69c-432c-a16b-9446214109ba
PiperOrigin-RevId: 724410846
2025-02-07 11:22:54 -08:00
gVisor bot e0435b9a53 Merge pull request #11415 from avagin:codespell
PiperOrigin-RevId: 721421397
2025-01-30 09:44:28 -08:00
Andrei Vagin f010ae01ac Fix a few typos
2025-01-29 21:16:51 -08:00
Etienne Perot 04f9204697 Yield thread group leader *Task in TaskSet.ForEachThreadGroup.
This makes this function usable from outside of the `kernel` package
without needing to call `tg.Leader()` (which requires a lock that
`TaskSet.ForEachThreadGroup` already acquires).

PiperOrigin-RevId: 721168957
2025-01-29 17:30:05 -08:00
Andrei Vagin 1864d9d091 Untag user addresses before handling them in the Sentry
Top-Byte-Ignore (TBI) is a feature on all ARMv8.0 CPUs that causes the top byte
of virtual addresses to be ignored on loads and stores. Instead, bit 55 is
extended over bits 56-63 before address translation. This feature allows use of
the (ignored) top byte as a tag or for other in-band metadata.

In Linux, brk()/mmap()/mremap() syscalls don't untag addresses. More details
are in dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()")
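
The sign extension of bit 55 described above can be written as a pair of shifts. The following is a minimal sketch of the arithmetic only; the function name `untag` is hypothetical and this is not the Sentry's code.

```go
package main

import "fmt"

// untag sign-extends bit 55 over bits 56-63, discarding any tag stored
// in the top byte of a 64-bit virtual address.
func untag(addr uint64) uint64 {
	// Shifting left by 8 moves bit 55 into the sign position; the
	// arithmetic right shift then copies it back over the top byte.
	return uint64(int64(addr<<8) >> 8)
}

func main() {
	// User-space address (bit 55 clear) carrying the tag 0xAB.
	fmt.Printf("%#x\n", untag(0xAB007FFF12345678))
	// Kernel-space address (bit 55 set) carrying the tag 0x12.
	fmt.Printf("%#x\n", untag(0x12FFFFFF00000000))
}
```

The first address comes back with a zero top byte and the second with a top byte of 0xFF, matching what the hardware uses for translation when TBI is enabled.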

PiperOrigin-RevId: 715885990
2025-01-15 11:52:40 -08:00
gVisor bot 7aa4c49b0d Merge pull request #11291 from xianzhe-databricks:fix-uds-auth
PiperOrigin-RevId: 712981221
2025-01-07 11:25:40 -08:00
xianzhe-databricks c4f686f4e1 Add a new RPC ConnectWithCreds to allow gofer to connect to a unix domain socket with application's credentials
2025-01-03 17:50:06 +01:00
Fabricio Voznika fb730ff784 Remove checkpoint_count from runsc wait --checkpoint
This is done because external callers are not able to know
the snapshot generation number from the outside.

PiperOrigin-RevId: 707979556
2024-12-19 11:48:10 -08:00
Nayana Bidari a3e5887415 Changes to support netstack save restore.
- Added a new Stats() method in inet.Stack to get the saved stats
during restore.
- Mark stack.nic, tcpip.Route and stack.addressState structs as "nosave".
These fields should not be saved because the IP addresses and routes can
change during restore and new configuration of routes and IP addresses will be
extracted from the restore spec and initialized in the saved stack.
- Changes in Restore() method in icmp, udp, tcp, packet and raw endpoint files
to support save restore of these endpoints. These changes are flag guarded by
the TESTONLY-save-restore-netstack flag.

PiperOrigin-RevId: 707639274
2024-12-18 12:52:22 -08:00
Andrei Vagin c27c9a02ae kernel: use the kernel context to run task destroy actions
A task context can be used only if actions are executed in a task goroutine.
In addition, these actions are executed asynchronously, so the task can be
destroyed.

Reported-by: syzbot+a9f3e03ea801374b8089@syzkaller.appspotmail.com
PiperOrigin-RevId: 706078457
2024-12-13 19:08:53 -08:00
Andrei Vagin 9fcf0b5b53 proc: invalidate task inodes when tasks are destroyed
PiperOrigin-RevId: 705785809
2024-12-13 00:58:08 -08:00
Etienne Perot 2b55090a58 Do not crash when creating thread group with already-exceeded soft CPU limit.
Reported-by: syzbot+da9595a72d0762aaa48d@syzkaller.appspotmail.com
PiperOrigin-RevId: 699425946
2024-11-23 01:28:50 -08:00
Nayana Bidari df9ba5fb67 Restore listening connections when netstack s/r is enabled.
This CL restores the listening connections when netstack s/r is enabled.
The changes include:
- New method, as a workaround, to install the new routes and NICs in the loaded
stack after restore.
- New Restore() for transport layer protocols to restore the protocol level
background workers.
- Adds afterLoad() method for fdbased processors.
- Adds a test to verify listening connection is restored after checkpointing
with netstack s/r enabled.
- A few other changes to save/restore fields to enable netstack s/r.

PiperOrigin-RevId: 698453124
2024-11-20 11:13:57 -08:00
Jamie Liu 94aa652d10 kernel: start RLIMIT_CPU timers in NewThreadGroup
Before cl/695198313, this bug only affected RLIMIT_CPU soft limits, which were
represented by tg.rlimitCPUSoftSetting and were similarly uninitialized by
Kernel.NewThreadGroup(); the CPU clock ticker fetched RLIMIT_CPU hard limits in
each tick. After cl/695198313, this bug affects both RLIMIT_CPU soft and hard
limits.

Itimers don't have the same issue since they're not preserved across fork().

PiperOrigin-RevId: 695936410
2024-11-12 18:15:32 -08:00
Jamie Liu 7920b5b40a kernel: improve tcpip.Timer implementation
- Move ktime.VariableTimer to kernel.timekeeperTcpipTimer, its only use case.
  This allows timekeeperTcpipTimer to use concrete types kernel.timekeeperClock
  and ktime.SampledTimer instead of ktime.Clock and ktime.Timer, saving a tiny
  amount of memory (interface values consist of two pointers) and CPU (for
  interface method calls).

- Fix a bug where timekeeperTcpipTimer expiration can cancel a racing call to
  timekeeperTcpipTimer.Reset() (see use of new field
  timekeeperTcpipTimer.resets).

- Define Listener.NotifyTimer directly on timekeeperTcpipTimer (dropping
  ktime.functionNotifier), and move goroutine spawning from the anonymous
  function in ktime.AfterFunc() into timekeeperTcpipTimer.NotifyTimer(). This
  slightly simplifies the control flow and saves an allocation for the
  anonymous function object.

- Use monotonicClock rather than realtimeClock. It doesn't make sense for
  time-of-day clock adjustments to affect netstack timeouts, and this is
  consistent with tcpip.stdClock => time.AfterFunc => runtime.timer.

PiperOrigin-RevId: 695504159
2024-11-11 15:38:55 -08:00
Jamie Liu 2d90353f9f kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:

- A per-sentry "CPU clock ticker goroutine"
  (task_sched.go:Kernel.runCPUClockTicker()) periodically advances
  Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
  monotonic wall clock (that happens to be suspended when no tasks are
  running).

- Task goroutines observe the most recent value of Kernel.cpuClock when
  changing state (Task.gosched.Timestamp), and use it to compute the number of
  CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
  approximately based on the wall time during which they were marked as
  running.

- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
  ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
  timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
  goroutines.

This has three major problems:

- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
  CPU clock ticker, and are not informed as to when corresponding tasks start
  or stop running (due to overhead on the task execution critical path), so
  they can't determine when CPU clocks have advanced or will advance; instead, they simply
  poll CPU clocks on a period equal to that of the represented timer, resulting
  in significant overhead for CPU-clock-based POSIX interval timers and
  timerfds.

- For the same reason, CPU clock interval timers and timerfds may expire much
  later than when the CPU clock is actually incremented; in the interval timer
  case, this can result in notification signals being sent long after tasks
  have stopped running. (This is the same problem as in b/116538398, which
  motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
  above, but applied to POSIX interval timers.)

- The sentry does not impose a limit on the number of tasks that may be
  concurrently marked running, so if more tasks are marked running than the
  number of CPUs advertised to applications, application CPU utilization can
  appear to exceed 100%.

This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing (up to Kernel.applicationCores of) them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
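
The approach described above can be sketched as a single tick function. This is a simplified model under stated assumptions, not the code in task_acct.go: the `task` type, its fields, and the policy of picking the first running tasks when more are running than there are cores are all hypothetical.

```go
package main

import "fmt"

// task is a hypothetical model of a task with one CPU timer.
type task struct {
	running bool
	cpuNS   int64 // per-task CPU clock
	timerNS int64 // CPU time at which the timer expires; 0 if unset
	fired   bool
}

// tick advances the CPU clocks of at most cores running tasks by
// periodNS and expires any CPU timer reached by that advance.
func tick(tasks []*task, cores int, periodNS int64) {
	advanced := 0
	for _, t := range tasks {
		if !t.running {
			continue
		}
		if advanced == cores {
			break // cap utilization at the advertised core count
		}
		t.cpuNS += periodNS
		advanced++
		if t.timerNS != 0 && t.cpuNS >= t.timerNS {
			t.fired = true // expire directly from the ticker
			t.timerNS = 0
		}
	}
}

func main() {
	tasks := []*task{
		{running: true, timerNS: 10_000_000},
		{running: true},
		{running: true},
	}
	tick(tasks, 2, 10_000_000)
	for i, t := range tasks {
		fmt.Println(i, t.cpuNS, t.fired)
	}
}
```

With three running tasks and two cores, only two clocks advance per tick, so total CPU time per tick never exceeds the advertised core count, and the timer expires in the same step that its clock reaches the deadline.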

Minor cleanup changes:

- Gather all stateify hooks in kernel_state.go.

- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
  (https://go.dev/blog/randv2#problem.rand).

Test workload:

```
#include <err.h>
#include <signal.h>
#include <time.h>
#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
      .it_interval = {0, kTimerPeriodNS},
      .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```

Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```

After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```

PiperOrigin-RevId: 695198313
2024-11-10 22:19:30 -08:00
Jamie Liu 2e6cfa72f2 ktime: support varying Timer implementations
- Rename Timer to SampledTimer.

- Move all Clock methods except Now to new interface SampledClock.

- Move SampledTimer's exported methods (except SetClock) to new interface
  Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
  methods that must be implemented.

- Add interface method Clock.NewTimer.

This is in preparation for cl/693856539, which adds a second Timer
implementation.

PiperOrigin-RevId: 694299679
2024-11-07 17:19:25 -08:00
Jamie Liu 379108ca91 ktime: simplify Listener.NotifyTimer()
No implementations of Listener use the Setting argument or the ability to
override the new Setting, so remove these.

PiperOrigin-RevId: 694195729
2024-11-07 11:47:58 -08:00
Jamie Liu e23347e5b5 Move //pkg/sentry/kernel/time to //pkg/sentry/ktime.
This avoids needing to rename it everywhere it's imported.

PiperOrigin-RevId: 693930089
2024-11-06 18:13:51 -08:00