17 Commits

Author SHA1 Message Date
Andrei Vagin 03a28d158e platform/systrap: return memory access type based on a page fault error code
Now we don't need to trigger a second fault to figure out whether the access
was a write or a read.
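
As an illustration of the idea, on x86 the page fault error code already encodes the access type: bit 1 is set for a write and bit 4 for an instruction fetch. The sketch below is not the code from this commit; the function and constant names are hypothetical, and only the x86 bit layout is assumed.

```python
# x86 page fault error code bits (architectural; the names are illustrative).
PF_PRESENT = 1 << 0  # fault on a present page (protection violation)
PF_WRITE = 1 << 1    # faulting access was a write
PF_USER = 1 << 2     # fault happened in user mode
PF_INSTR = 1 << 4    # faulting access was an instruction fetch


def access_type(error_code: int) -> str:
    """Classify a fault as 'execute', 'write', or 'read' from its error code."""
    if error_code & PF_INSTR:
        return "execute"
    if error_code & PF_WRITE:
        return "write"
    return "read"
```

With a decoder like this, a single fault is enough to tell a read from a write.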

Fixes #11008

Co-developed-by: Jamie Liu <jamieliu@google.com>
PiperOrigin-RevId: 697677262
2024-11-18 10:33:59 -08:00
Konstantin Bogomolov fe66cae2ed Enumerate known systrap stub failures to exit process cleanly.
This helps to rectify a long standing problem of Systrap panicking
when encountering corrupted sysmsg stub memory.

These errors specifically are easier to notice and debug since we
check for them in the stub code and flag them to the sentry
explicitly. They are now very grep-able to make finding their origin
in the stub code easier.

PiperOrigin-RevId: 604743496
2024-02-06 13:19:26 -08:00
Konstantin Bogomolov cffce1a94a systrap: Revise slow-path enablement.
The current systrap fastpath heuristics do a good job getting high
performance when there are idle CPUs, but fail when there are not
enough, doing much worse than even the "pure slowpath".

Here is a summary of changes made to remedy that:

1. Disable stub and dispatcher fastpath by default.
2. Decouple fastpath states to be separate between dispatcher and stub
   fastpath.
3. Implement response latency metrics for both sentry->stub and
   stub->sentry messages. Use these latency metrics in order to keep track of
   the baseline latency for both sides. With baseline latency established,
   compare fastpath latency to determine how beneficial it is to keep
   fastpath enabled.
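
The latency comparison in item 3 can be sketched roughly as follows. This is a minimal illustration, not the heuristic implemented in this CL: the class name, the moving-average smoothing, and the decision rule are all assumptions. Following item 2, the dispatcher and the stub would each hold their own instance.

```python
class FastpathGovernor:
    """Hypothetical sketch: keep fastpath on only while it beats the baseline."""

    def __init__(self, alpha: float = 0.25):
        self.alpha = alpha        # smoothing factor for the moving averages
        self.baseline_ns = None   # average latency with fastpath disabled
        self.fastpath_ns = None   # average latency with fastpath enabled
        self.enabled = False      # fastpath is disabled by default

    def _ewma(self, old, sample):
        # Exponentially weighted moving average; the first sample is taken as-is.
        return sample if old is None else (1 - self.alpha) * old + self.alpha * sample

    def record(self, latency_ns: float) -> None:
        """Record one response latency for whichever mode is currently active."""
        if self.enabled:
            self.fastpath_ns = self._ewma(self.fastpath_ns, latency_ns)
        else:
            self.baseline_ns = self._ewma(self.baseline_ns, latency_ns)

    def reevaluate(self) -> bool:
        """Decide whether fastpath should be enabled from now on."""
        if self.baseline_ns is None:
            self.enabled = False  # no baseline yet, stay on the slow path
        elif self.fastpath_ns is None:
            self.enabled = True   # try fastpath to collect a first sample
        else:
            self.enabled = self.fastpath_ns < self.baseline_ns
        return self.enabled
```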

Some sampled benchmarks:
- sysbench-X-Y:
```
```
- gettid_benchmark
- getpid_benchmark

Some benchmark results (5-run average):
- On a 4 core machine:

Benchmark     | HEAD     | This CL
--------------|----------|----------
sysbench-1-8  | 48218ms  | 50282ms
sysbench-2-4  | 65900ms  | 72282ms
sysbench-4-2  | 427880ms | 175714ms
sysbench-1-2  | 12998ms  | 13688ms

getpid_benchmark: (HEAD)
```
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_Getpid          3471 ns         3441 ns       212121
BM_GetpidOpt       1039 ns         1029 ns       700000
```

getpid_benchmark: (This CL)
```
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_Getpid          3718 ns         3600 ns       200000
BM_GetpidOpt       1320 ns         1281 ns       538462
```

gettid_benchmark: Like getpid, this CL is slightly slower on lower thread count
                  test variants.

- On a 1 core machine:
getpid_benchmark: (HEAD)

```
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_Getpid         74868 ns        75000 ns        10000
BM_GetpidOpt      74463 ns        74286 ns         8750
```

getpid_benchmark: (This CL)
```
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_Getpid         12425 ns        12443 ns        53846
BM_GetpidOpt       8645 ns         8686 ns        87500
```

gettid_benchmark: Same trend as for getpid_benchmark across the board.

  Another interesting case to look at for 1-core machines is copying one large file:
```
  ./runsc --rootless --network none --ignore-cgroups do --force-overlay=false sh -c "time head -c 1073741824 </dev/zero >full-file"
```
- file copy (HEAD):   36.07user 0.00system 0:36.44elapsed 98%CPU
- file copy (This CL): 2.96user 0.23system 0:07.14elapsed 44%CPU

Fixes #9119.

PiperOrigin-RevId: 576600019
2023-10-25 11:56:38 -07:00
Andrei Vagin 74e63e9e29 Update packages
PiperOrigin-RevId: 532582853
2023-05-16 15:01:22 -07:00
Andrei Vagin cd358f833a systrap: don't wake up each thread separately
Now all threads wait on queue->num_thread_to_wakeup, which serves as a
single wakeup point for all of them.

This change allows us to avoid cases where num_active_threads
is inconsistent with thread states, because the two can't be
changed atomically.
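
The idea of a single wakeup point can be sketched as below. This is only an analogy: the real stub threads wait on a counter in shared memory, while this sketch uses a Python condition variable, and the class and method names are hypothetical.

```python
import threading


class WakeupCounter:
    """Hypothetical stand-in for one shared counter that all sleepers wait on."""

    def __init__(self):
        self._cond = threading.Condition()
        self._to_wakeup = 0  # number of threads that have been asked to wake up

    def wake(self, n: int = 1) -> None:
        """Ask n sleeping threads to wake up with a single counter update."""
        with self._cond:
            self._to_wakeup += n
            self._cond.notify(n)

    def try_claim(self) -> bool:
        """Claim one pending wakeup without blocking."""
        with self._cond:
            if self._to_wakeup == 0:
                return False
            self._to_wakeup -= 1
            return True

    def wait(self, timeout=None) -> bool:
        """Sleep until a wakeup can be claimed; returns False on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._to_wakeup > 0, timeout):
                return False
            self._to_wakeup -= 1
            return True
```

Because there is only one counter, there is no per-thread state that has to be kept in sync with it.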

PiperOrigin-RevId: 531020012
2023-05-10 15:36:38 -07:00
Konstantin Bogomolov d7f590dd00 Clean up context decoupling experiment.
This change removes code branches and variables only used in coupled-context
mode.

PiperOrigin-RevId: 529776383
2023-05-05 11:55:50 -07:00
Andrei Vagin 96aa115516 systrap: simplify interrupt handling in syshandler
syshandler can be interrupted by SIGCHLD. There are two separate cases. The first
one is when the interrupt is addressed to the context that has triggered
syshandler. In this case, the interrupt can be ignored, because the context
is switching to the sentry. The other case is when syshandler is resuming a new
context. In that case, we need to stop resuming it and return the context to
the sentry. The good thing is that the context state is up to date, so
sighandler can do its job while ignoring the state from the signal frame.

Reported-by: syzbot+2e305803e0d29e8faeb3@syzkaller.appspotmail.com
PiperOrigin-RevId: 520986791
2023-03-31 12:34:16 -07:00
Andrei Vagin d6ed799ade systrap: save context pointer on sysmsg
We don't need to calculate an address from context_id each time.

PiperOrigin-RevId: 518640998
2023-03-22 12:26:03 -07:00
Andrei Vagin f8a73a7d1a Remove sysmsg->interrupted_context_id
ctx->interrupt can be used to find out whether the current context has to
be interrupted or not.

PiperOrigin-RevId: 518597531
2023-03-22 09:58:00 -07:00
Andrei Vagin 0cbe6fc835 systrap: introduce a spinning queue
The spinning queue is a queue of spinning threads. It solves the
fragmentation problem. The idea is to minimize the number of threads
processing requests. We can't control how system threads are scheduled, so we
can't distribute requests efficiently. The spinning queue emulates virtual
threads sorted by their spinning time.

PiperOrigin-RevId: 518470754
2023-03-21 22:10:15 -07:00
Konstantin Bogomolov 897c03039e Implement systrap context queue.
This is the initial implementation of the systrap context queue via a ringbuffer
in shared memory between stub threads and the sentry.

In this new model there is no longer a bound sysmsg thread for every context;
instead each subprocess starts with one initial sysmsg thread, which starts
polling the context queue for new contexts arriving from the sentry. If the
sentry detects that contexts are spending too much time in the context queue
without being processed, it will create new sysmsg threads or wake sleeping
ones. Tangentially, sysmsg threads will go to sleep if they spend too much time
busy looping without new context arrivals.
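
A minimal single-threaded sketch of this model is shown below. It is not the shared-memory implementation from this CL: the class, its methods, and the `max_wait_ns` threshold are hypothetical, and a real ring buffer shared between processes would need atomic operations.

```python
class ContextQueue:
    """Hypothetical fixed-size ring buffer of context IDs with enqueue times."""

    def __init__(self, capacity: int, max_wait_ns: int):
        self._slots = [None] * capacity
        self._head = 0   # next slot to pop
        self._tail = 0   # next slot to push
        self._size = 0
        self.max_wait_ns = max_wait_ns

    def push(self, context_id: int, now_ns: int) -> bool:
        """Sentry side: enqueue a context; returns False if the ring is full."""
        if self._size == len(self._slots):
            return False
        self._slots[self._tail] = (context_id, now_ns)
        self._tail = (self._tail + 1) % len(self._slots)
        self._size += 1
        return True

    def pop(self):
        """Stub thread side: dequeue the oldest context ID, or None if empty."""
        if self._size == 0:
            return None
        context_id, _ = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return context_id

    def needs_more_threads(self, now_ns: int) -> bool:
        """Sentry side: True if the oldest queued context has waited too long."""
        if self._size == 0:
            return False
        _, enqueued_ns = self._slots[self._head]
        return now_ns - enqueued_ns > self.max_wait_ns
```

In this sketch, the sentry would call `needs_more_threads` after pushing and create or wake a sysmsg thread when it returns True.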

This model does not yet take into account the full load of the host system or
even multiple subprocesses in the same sandbox. Multiple overloaded subprocesses
are liable to make each other run slower by kicking sysmsg threads more often
than they need to; this will be remedied in follow up CLs.

PiperOrigin-RevId: 516680504
2023-03-14 17:48:13 -07:00
Konstantin Bogomolov 263dad6258 Handle context interrupts based on syshandler state.
Also add interrupt handling for context decoupling.

Previously syshandler interrupts would be retriggered no matter if the interrupt
arrived before the switch to sentry or after. We only need to handle the case of
it arriving after.

Additionally this CL introduces interrupt handling for the decoupled context
mode, by making interrupts target task contexts rather than sysmsg threads.

PiperOrigin-RevId: 515065101
2023-03-08 09:58:12 -08:00
Konstantin Bogomolov 39f2721c9b Implement saving decoupled context from syshandler.
Rewrite the syshandler assembly routine to save the full state of user threads,
like the sighandler would. With fpstate, it does so by writing straight to the
thread context struct, so there is no need to do an intermediate copy.

PiperOrigin-RevId: 514751894
2023-03-07 09:16:13 -08:00
Konstantin Bogomolov 702540baec Implement saving decoupled context from sighandler.
Saves task context state to the separate context memory region which is mapped
to all subprocess sysmsg threads, instead of always saving the context to the
thread-specific sysmsg.

When context decoupling is disabled fpstate is not saved to this region, but
GP registers and signal info are.

PiperOrigin-RevId: 514432596
2023-03-06 09:24:24 -08:00
Konstantin Bogomolov 9ec69054f8 Map shared region for systrap thread contexts.
Introduces the ThreadContext struct for systrap and maps the region where
contexts will be stored into both the sentry and the address space of stub
processes.

PiperOrigin-RevId: 513913793
2023-03-04 00:42:24 -08:00
Konstantin Bogomolov 35937b7f61 Add context decoupling flag.
This is a quick and dirty way to activate changes related to context decoupling.
It will be removed once context decoupling is finalized.

PiperOrigin-RevId: 513854004
2023-03-03 09:58:24 -08:00
Andrei Vagin 192bfb03fb Open-sourcing the systrap platform.
The systrap platform, like the ptrace platform, uses stub processes to manage
the user address space. The difference is in how they intercept system calls and
other events like memory faults, exceptions, etc.

In the case of systrap, all events that have to be handled by the Sentry trigger
signals that are handled by a custom signal handler installed on stub
processes. The signal handler switches control to the Sentry.

Here are a few other optimizations:
* On x86, system calls can be replaced with a function call to remove the
  overhead of signals.
* For fast interactions of sentry and stub processes, futex wait/wake can
  be a bottleneck, so we use a polling mode.
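
The polling mode can be illustrated with the sketch below. It is an analogy only: a `threading.Event` stands in for the futex, the function name and return values are hypothetical, and the fallback to a blocking wait after a bounded spin is an assumption rather than something this commit message states.

```python
import threading


def wait_for_event(event: threading.Event, spin_iterations: int, timeout: float) -> str:
    """Poll the event a bounded number of times, then fall back to blocking.

    Returns 'polled' if the event was seen while spinning, 'blocked' if it
    was seen by the blocking wait, and 'timeout' otherwise.
    """
    for _ in range(spin_iterations):
        if event.is_set():
            return "polled"
    if event.wait(timeout):
        return "blocked"
    return "timeout"
```

When the other side responds quickly, the waiter never reaches the blocking call, so no wait/wake round trip is needed.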

The platform is launched for the purpose of testing and gathering initial
feedback. It is not yet ready for use in production.

PiperOrigin-RevId: 511650064
2023-02-22 18:22:49 -08:00