Previously, CheckChange (corresponding to Linux's tty/tty_check_change()) was
only used by the host TTY implementation, not the devpts implementation.
Furthermore, ThreadGroup.SetForegroundProcessGroup() duplicated some of the
logic in CheckChange, notably sending SIGTTOU to background tasks. This means
that, for host TTYs, we could send SIGTTOU multiple times. In some
circumstances, this leads to the ioctl returning ERESTARTSYS in an infinite loop.
PiperOrigin-RevId: 735934036
From the setpgid(2) man page:
EACCES - An attempt was made to change the process group ID of one
of the children of the calling process and the child had
already performed an execve(2) (setpgid(), setpgrp()).
This CL makes gVisor implement this rule and updates the exec test
suite accordingly.
TESTED: http://sponge2/7f364e8a-4f82-463e-ba62-79234c4d054d
PiperOrigin-RevId: 727095560
`fdBitmap.FirstZero()` can return the `max` value; when it does, recompute
the max value so that the old max value is not used twice.
The default bitmap size for file descriptors in gVisor is 65535.
Add a pipe test that attempts to create more than 65535 FDs to hit the edge
case where fdBitmap.FirstZero() returns the default bitmap max value of 65535.
TESTED:
http://sponge2/4c12ce75-3763-4773-ad62-87c6b8fe0446
http://sponge2/9c9d6ea0-b69c-432c-a16b-9446214109ba
PiperOrigin-RevId: 724410846
This makes this function usable from outside of the `kernel` package
without needing to call `tg.Leader()` (which requires a lock that
`TaskSet.ForEachThreadGroup` already acquires).
PiperOrigin-RevId: 721168957
Top-Byte-Ignore (TBI) is a feature on all ARMv8.0 CPUs that causes the top byte
of virtual addresses to be ignored on loads and stores. Instead, bit 55 is
extended over bits 56-63 before address translation. This feature allows use of
the (ignored) top byte as a tag or for other in-band metadata.
In Linux, the brk()/mmap()/mremap() syscalls don't untag addresses. More details
are in Linux commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()").
PiperOrigin-RevId: 715885990
- Add a new Stats() method to inet.Stack to get the saved stats
during restore.
- Mark the stack.nic, tcpip.Route and stack.addressState structs as "nosave".
These structs should not be saved because the IP addresses and routes can
change during restore; the new configuration of routes and IP addresses is
extracted from the restore spec and initialized in the saved stack.
- Change the Restore() method in the icmp, udp, tcp, packet and raw endpoint
files to support save/restore of these endpoints. These changes are guarded by
the TESTONLY-save-restore-netstack flag.
PiperOrigin-RevId: 707639274
A task context can be used only if actions are executed in a task goroutine.
In addition, these actions are executed asynchronously, so the task may already
have been destroyed by the time they run.
Reported-by: syzbot+a9f3e03ea801374b8089@syzkaller.appspotmail.com
PiperOrigin-RevId: 706078457
This CL restores the listening connections when netstack s/r is enabled.
The changes include:
- Add a new method, as a workaround, that installs the new routes and NICs into
the loaded stack after restore.
- Add a new Restore() for transport layer protocols to restore the protocol-level
background workers.
- Add an afterLoad() method for fdbased processors.
- Add a test to verify that a listening connection is restored after
checkpointing with netstack s/r enabled.
- Make a few other changes to saved/restored fields to enable netstack s/r.
PiperOrigin-RevId: 698453124
Before cl/695198313, this bug only affected RLIMIT_CPU soft limits, which were
represented by tg.rlimitCPUSoftSetting, a field that Kernel.NewThreadGroup()
similarly left uninitialized; the CPU clock ticker fetched RLIMIT_CPU hard limits in
each tick. After cl/695198313, this bug affects both RLIMIT_CPU soft and hard
limits.
Itimers don't have the same issue since they're not preserved across fork().
PiperOrigin-RevId: 695936410
- Move ktime.VariableTimer to kernel.timekeeperTcpipTimer, its only use case.
This allows timekeeperTcpipTimer to use concrete types kernel.timekeeperClock
and ktime.SampledTimer instead of ktime.Clock and ktime.Timer, saving a tiny
amount of memory (interface values consist of two pointers) and CPU (for
interface method calls).
- Fix a bug where timekeeperTcpipTimer expiration can cancel a racing call to
timekeeperTcpipTimer.Reset() (see use of new field
timekeeperTcpipTimer.resets).
- Define Listener.NotifyTimer directly on timekeeperTcpipTimer (dropping
ktime.functionNotifier), and move goroutine spawning from the anonymous
function in ktime.AfterFunc() into timekeeperTcpipTimer.NotifyTimer(). This
slightly simplifies the control flow and saves an allocation for the
anonymous function object.
- Use monotonicClock rather than realtimeClock. It doesn't make sense for
time-of-day clock adjustments to affect netstack timeouts, and this is
consistent with tcpip.stdClock => time.AfterFunc => runtime.timer.
PiperOrigin-RevId: 695504159
gVisor currently implements CPU clocks as follows:
- A per-sentry "CPU clock ticker goroutine"
(task_sched.go:Kernel.runCPUClockTicker()) periodically advances
Kernel.cpuClock, causing it to serve as a very coarse but inexpensive
monotonic wall clock (that happens to be suspended when no tasks are
running).
- Task goroutines observe the most recent value of Kernel.cpuClock when
changing state (Task.gosched.Timestamp), and use it to compute the number of
CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are
approximately based on the wall time during which they were marked as
running.
- ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock
ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and
timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer
goroutines.
This has three major problems:
- ktime.SampledTimer goroutines for CPU clock timers run concurrently with the
CPU clock ticker, and are not informed as to when corresponding tasks start
or stop running (due to overhead on the task execution critical path), so
they can't determine when CPU clocks have advanced or will advance; instead, they simply
poll CPU clocks on a period equal to that of the represented timer, resulting
in significant overhead for CPU-clock-based POSIX interval timers and
timerfds.
- For the same reason, CPU clock interval timers and timerfds may expire much
later than when the CPU clock is actually incremented; in the interval timer
case, this can result in notification signals being sent long after tasks
have stopped running. (This is the same problem as in b/116538398, which
motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described
above, but applied to POSIX interval timers.)
- The sentry does not impose a limit on the number of tasks that may be
concurrently marked running, so if more tasks are marked running than the
number of CPUs advertised to applications, application CPU utilization can
appear to exceed 100%.
This CL fixes these problems by introducing explicit per-Task and ThreadGroup
CPU clocks, directly advancing up to Kernel.applicationCores of them in the
CPU clock ticker, and directly expiring CPU timers when doing so. Itimer and
RLIMIT_CPU timers lose their special-casing and instead behave like other CPU
timers (see task_acct.go). Kernel.cpuClock is still required, but only for the
sentry watchdog.
Minor cleanup changes:
- Gather all stateify hooks in kernel_state.go.
- Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem
(https://go.dev/blog/randv2#problem.rand).
Test workload:
```
#include <err.h>
#include <signal.h>
#include <time.h>

#include <chrono>
#include <thread>

constexpr int kNumTimers = 1000;
constexpr long kTimerPeriodNS = 10000000;

int main(int argc, char** argv) {
  for (int i = 0; i < kNumTimers; i++) {
    struct sigevent sev = {.sigev_notify = SIGEV_NONE};
    timer_t timerid;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid) < 0) {
      err(1, "timer_create failed");
    }
    struct itimerspec it = {
        .it_interval = {0, kTimerPeriodNS},
        .it_value = {0, kTimerPeriodNS},
    };
    if (timer_settime(timerid, 0, &it, nullptr) < 0) {
      err(1, "timer_settime failed");
    }
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  return 0;
}
```
Before this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
1.50user 0.17system 0:05.25elapsed 31%CPU (0avgtext+0avgdata 35792maxresident)k
0inputs+184outputs (10major+20889minor)pagefaults 0swaps
```
After this CL:
```
# /usr/bin/time ./runsc --ignore-cgroups --platform kvm --network none do $(pwd)/workloads/threadcputimers
0.10user 0.12system 0:05.22elapsed 4%CPU (0avgtext+0avgdata 34040maxresident)k
0inputs+192outputs (6major+20929minor)pagefaults 0swaps
```
PiperOrigin-RevId: 695198313
- Rename Timer to SampledTimer.
- Move all Clock methods except Now to new interface SampledClock.
- Move SampledTimer's exported methods (except SetClock) to new interface
Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant
methods that must be implemented.
- Add interface method Clock.NewTimer.
This is in preparation for cl/693856539, which adds a second Timer
implementation.
PiperOrigin-RevId: 694299679