Now we don't need to trigger a second fault to figure out whether it was a
write or a read access.
Fixes #11008
Co-developed-by: Jamie Liu <jamieliu@google.com>
PiperOrigin-RevId: 697677262
This helps to rectify a long-standing problem of Systrap panicking
when encountering corrupted sysmsg stub memory.
These errors specifically are easier to notice and debug since we
check for them in the stub code and flag them to the sentry
explicitly. They are now very grep-able to make finding their origin
in the stub code easier.
PiperOrigin-RevId: 604743496
The current systrap fastpath heuristics deliver high performance when there
are idle CPUs, but break down when there are not enough of them, performing
much worse than even the "pure slowpath".
Here is a summary of changes made to remedy that:
1. Disable stub and dispatcher fastpath by default.
2. Decouple fastpath states to be separate between dispatcher and stub
fastpath.
3. Implement response latency metrics for both sentry->stub and
stub->sentry messages. Use these latency metrics in order to keep track of
the baseline latency for both sides. With baseline latency established,
compare fastpath latency to determine how beneficial it is to keep
fastpath enabled.
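
As a rough illustration of item 3, the comparison between baseline and
fastpath latency could look like the sketch below. This is not the code from
this CL: the type names, the moving-average weights, and the decision rule are
all assumptions made for the example.

```go
package main

import "fmt"

// latencyTracker keeps a moving average of response latencies in
// nanoseconds. Hypothetical type; the 0.75/0.25 weights are arbitrary.
type latencyTracker struct {
	avgNs   float64
	samples int
}

// record folds one latency sample into the moving average.
func (t *latencyTracker) record(ns float64) {
	if t.samples == 0 {
		t.avgNs = ns
	} else {
		t.avgNs = 0.75*t.avgNs + 0.25*ns
	}
	t.samples++
}

// fastpathState holds the metrics for one side (stub or dispatcher).
// Each side gets its own instance, so the two are decoupled.
type fastpathState struct {
	baseline latencyTracker // latency observed with fastpath disabled
	fastpath latencyTracker // latency observed with fastpath enabled
}

// keepFastpath reports whether fastpath currently beats the baseline.
// Without samples on both sides it stays disabled, matching the
// "disabled by default" policy.
func (s *fastpathState) keepFastpath() bool {
	if s.baseline.samples == 0 || s.fastpath.samples == 0 {
		return false
	}
	return s.fastpath.avgNs < s.baseline.avgNs
}

func main() {
	var stub, dispatcher fastpathState
	for _, ns := range []float64{3000, 3200, 3100} {
		stub.baseline.record(ns)
		dispatcher.baseline.record(ns)
	}
	stub.fastpath.record(1000)        // fastpath helps on this side
	dispatcher.fastpath.record(75000) // fastpath hurts on this side
	fmt.Println("keep stub fastpath:", stub.keepFastpath())
	fmt.Println("keep dispatcher fastpath:", dispatcher.keepFastpath())
}
```

Keeping one state per side means a bad fastpath result on the dispatcher does
not force the stub fastpath off, and vice versa.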
Some sampled benchmarks:
- sysbench-X-Y:
```
```
- gettid_benchmark
- getpid_benchmark
Some benchmark results (5-run average):
- On a 4 core machine:
[]()               | HEAD     | This CL
-------------------|----------|----------
sysbench-1-8:      | 48218ms  | 50282ms
sysbench-2-4:      | 65900ms  | 72282ms
sysbench-4-2:      | 427880ms | 175714ms
sysbench-1-2:      | 12998ms  | 13688ms
getpid_benchmark: (HEAD)
```
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Getpid 3471 ns 3441 ns 212121
BM_GetpidOpt 1039 ns 1029 ns 700000
```
getpid_benchmark: (This CL)
```
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Getpid 3718 ns 3600 ns 200000
BM_GetpidOpt 1320 ns 1281 ns 538462
```
gettid_benchmark: Like getpid, this CL is slightly slower on the test variants
with lower thread counts.
- On a 1 core machine:
getpid_benchmark: (HEAD)
```
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Getpid 74868 ns 75000 ns 10000
BM_GetpidOpt 74463 ns 74286 ns 8750
```
getpid_benchmark: (This CL)
```
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Getpid 12425 ns 12443 ns 53846
BM_GetpidOpt 8645 ns 8686 ns 87500
```
gettid_benchmark: Same trend as for getpid_benchmark across the board.
Another interesting case to look at for 1-core machines is copying one large file:
```
./runsc --rootless --network none --ignore-cgroups do --force-overlay=false sh -c "time head -c 1073741824 </dev/zero >full-file"
```
- file copy (HEAD): 36.07user 0.00system 0:36.44elapsed 98%CPU
- file copy (This CL): 2.96user 0.23system 0:07.14elapsed 44%CPU
Fixes #9119.
PiperOrigin-RevId: 576600019
Now all threads wait on queue->num_thread_to_wakeup, which serves as a
single wait point for all threads.
This change avoids cases where num_active_threads is inconsistent
with the thread states, because the two can't be changed
atomically.
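
The idea of a single wakeup counter can be sketched in Go as follows. This is
an illustration only: the real stub code is not Go and sleeps on the counter
rather than polling it, and the type and method names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// queue models only the single shared wakeup counter. The field name
// mirrors queue->num_thread_to_wakeup; everything else is made up.
type queue struct {
	numThreadsToWakeup atomic.Int32
}

// requestWakeups is what the waking side would call to release n threads.
func (q *queue) requestWakeups(n int32) {
	q.numThreadsToWakeup.Add(n)
}

// tryClaimWakeup lets a waiting thread consume one pending wakeup.
// The compare-and-swap makes "check and decrement" a single atomic step,
// so no second counter has to be kept in sync with it.
func (q *queue) tryClaimWakeup() bool {
	for {
		n := q.numThreadsToWakeup.Load()
		if n <= 0 {
			return false
		}
		if q.numThreadsToWakeup.CompareAndSwap(n, n-1) {
			return true
		}
	}
}

func main() {
	var q queue
	q.requestWakeups(2)

	var claimed atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if q.tryClaimWakeup() {
				claimed.Add(1)
			}
		}()
	}
	wg.Wait()
	// Four threads compete, but only the two requested wakeups are granted.
	fmt.Println("threads woken:", claimed.Load())
}
```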
PiperOrigin-RevId: 531020012
syshandler can be interrupted by SIGCHLD. There are two separate cases. The
first is when the interrupt is addressed to the context that triggered
syshandler. In this case, the interrupt can be ignored, because the context
is switching to the sentry. The second is when syshandler is resuming a new
context. In this case, we need to stop resuming it and return the context to
the sentry. The good thing is that the context state is up to date, so
sighandler can do its job while ignoring the state from the signal frame.
Reported-by: syzbot+2e305803e0d29e8faeb3@syzkaller.appspotmail.com
PiperOrigin-RevId: 520986791
The spinning queue is a queue of spinning threads. It solves the
fragmentation problem. The idea is to minimize the number of threads
processing requests. We can't control how system threads are scheduled, so
we can't distribute requests efficiently. The spinning queue emulates virtual
threads sorted by their spinning time.
PiperOrigin-RevId: 518470754
This is the initial implementation of the systrap context queue via a ringbuffer
in shared memory between stub threads and the sentry.
In this new model there is no longer a bound sysmsg thread for every context;
instead each subprocess starts with one initial sysmsg thread, which starts
polling the context queue for new contexts arriving from the sentry. If the
sentry detects that contexts are spending too much time in the context queue
without being processed, it will create new sysmsg threads or wake sleeping
ones. Conversely, sysmsg threads will go to sleep if they spend too much time
busy looping without new context arrivals.
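
A minimal single-threaded sketch of such a queue is shown below, assuming a
fixed-size ring of context IDs with enqueue timestamps. The capacity, the
names, and the threshold check are assumptions for illustration. The actual
queue lives in shared memory and has to be safe for concurrent access, which
this sketch does not attempt.

```go
package main

import "fmt"

// queueSize is a hypothetical capacity. A power of two keeps the index
// arithmetic correct when the uint32 counters wrap around.
const queueSize = 8

// entry is one queued context plus the time it was enqueued.
type entry struct {
	contextID  uint32
	enqueuedNs int64
}

// contextQueue is a ring buffer: head is the next slot to pop,
// tail is the next slot to push.
type contextQueue struct {
	slots      [queueSize]entry
	head, tail uint32
}

// push adds a context; it returns false when the ring is full.
func (q *contextQueue) push(id uint32, nowNs int64) bool {
	if q.tail-q.head == queueSize {
		return false
	}
	q.slots[q.tail%queueSize] = entry{contextID: id, enqueuedNs: nowNs}
	q.tail++
	return true
}

// pop removes the oldest context; ok is false when the ring is empty.
func (q *contextQueue) pop() (id uint32, ok bool) {
	if q.head == q.tail {
		return 0, false
	}
	e := q.slots[q.head%queueSize]
	q.head++
	return e.contextID, true
}

// needsAnotherThread reports whether the oldest queued context has waited
// longer than thresholdNs, which is the signal to create or wake a thread.
func (q *contextQueue) needsAnotherThread(nowNs, thresholdNs int64) bool {
	if q.head == q.tail {
		return false
	}
	return nowNs-q.slots[q.head%queueSize].enqueuedNs > thresholdNs
}

func main() {
	var q contextQueue
	q.push(1, 0)
	q.push(2, 1000)
	fmt.Println("needs another thread:", q.needsAnotherThread(5000, 2000))
	id, _ := q.pop()
	fmt.Println("popped context:", id)
	fmt.Println("needs another thread:", q.needsAnotherThread(2000, 2000))
}
```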
This model does not yet take into account the full load of the host system or
even multiple subprocesses in the same sandbox. Multiple overloaded subprocesses
are liable to make each other run slower by kicking sysmsg threads more often
than they need to; this will be remedied in follow-up CLs.
PiperOrigin-RevId: 516680504
Also add interrupt handling for context decoupling.
Previously, syshandler interrupts would be retriggered regardless of whether
the interrupt arrived before or after the switch to the sentry. We only need to
handle the case where it arrives after.
Additionally this CL introduces interrupt handling for the decoupled context
mode, by making interrupts target task contexts rather than sysmsg threads.
PiperOrigin-RevId: 515065101
Rewrite the syshandler assembly routine to save the full state of user threads,
like the sighandler would. With fpstate, it does so by writing straight to the
thread context struct, so there is no need to do an intermediate copy.
PiperOrigin-RevId: 514751894
Saves task context state to the separate context memory region which is mapped
to all subprocess sysmsg threads, instead of always saving the context to the
thread-specific sysmsg.
When context decoupling is disabled fpstate is not saved to this region, but
GP registers and signal info are.
PiperOrigin-RevId: 514432596
Introduces the ThreadContext struct for systrap, and maps the region where the
contexts will be stored into both the sentry and the address space of stub
processes.
PiperOrigin-RevId: 513913793
This is a quick and dirty way to activate context decoupling related changes.
Will be removed once context decoupling is finalized.
PiperOrigin-RevId: 513854004
The systrap platform, like the ptrace platform, uses stub processes to manage
the user address space. The difference is how they intercept system calls and
other events like memory faults, exceptions, etc.
In the case of systrap, all events that have to be handled by the Sentry trigger
signals that are handled by a custom signal handler installed on stub
processes. The signal handler switches control to the Sentry.
Here are a few other optimizations:
* On x86, system calls can be replaced with a function call to remove the
overhead of signals.
* For fast interactions between sentry and stub processes, futex wait/wake can
be a bottleneck, so we use a polling mode.
The platform is launched for the purpose of testing and gathering initial
feedback. It is not yet ready for use in production.
PiperOrigin-RevId: 511650064