16 Commits

Author SHA1 Message Date
Jing Chen a093ad0450 Simplify and format gVisor codebase.
The changes are just the output of `gofmt -s -w .`.
2024-10-13 00:50:32 -07:00
Andrei Vagin 32bbb18823 systrap: use seccomp notifications to communicate with syscall threads
The new synchronous mode of seccomp-unotify (v6.6-rc1~205^2~6) reduces the
overhead of context switches.

PiperOrigin-RevId: 622924980
2024-04-08 12:47:44 -07:00
Fabricio Voznika c087777e37 Plumb restore context to afterLoad()
This allows for external information to be passed to restore code, like
host FDs to be remapped.

Updates #1956

PiperOrigin-RevId: 612540749
2024-03-04 12:21:50 -08:00
Etienne Perot 62175dea49 seccomp.BuildProgram: Add ProgramOptions struct.
This is just a refactoring, but the intention of this `struct` is to add
other useful options for the program build, such as the list of expected
"hottest" syscalls by frequency.

The interface is a bit awkward, because of the need for two entry points
(one which didn't have any way to set the default actions), and because the
zero value of `linux.BPFAction` is a valid (and common) "default action":
killing the program (`linux.SECCOMP_RET_KILL_THREAD`).
We also can't use `*linux.BPFAction`, as `linux.SECCOMP_RET_*` are constants.
So the struct fields use functions that "resolve" to an action.
This reads fairly well in the call sites (`DefaultAction: Return(action)`),
at the cost of slightly convoluted logic in `seccomp.go`.
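
A minimal sketch of the "function that resolves to an action" idea described
above. The names `Action`, `ActionFn`, `Options`, and `resolve` are
hypothetical stand-ins and are not the actual `seccomp` package API; only the
`Return(action)` call shape is taken from the message. The point it shows is
that a nil function can mean "not set", which a plain action value cannot,
because its zero value already means "kill the thread".

```go
package main

import "fmt"

// Action is a hypothetical stand-in for linux.BPFAction.
type Action uint32

const (
	// KillThread is the zero value, mirroring SECCOMP_RET_KILL_THREAD.
	KillThread Action = 0x00000000
	// Allow mirrors SECCOMP_RET_ALLOW.
	Allow Action = 0x7fff0000
)

// ActionFn resolves to an action. A nil ActionFn means "not set".
type ActionFn func() Action

// Return wraps a fixed action so it can be stored in an options field.
func Return(a Action) ActionFn {
	return func() Action { return a }
}

// Options is a hypothetical analogue of a program-options struct.
type Options struct {
	DefaultAction ActionFn
}

// resolve returns the configured default action, or fallback when the
// field was left unset.
func (o Options) resolve(fallback Action) Action {
	if o.DefaultAction == nil {
		return fallback
	}
	return o.DefaultAction()
}

func main() {
	unset := Options{}
	explicitKill := Options{DefaultAction: Return(KillThread)}

	// An unset field falls back; an explicit zero-valued action does not.
	fmt.Println(unset.resolve(Allow) == Allow)
	fmt.Println(explicitKill.resolve(Allow) == KillThread)
}
```

With a plain `Action` field, `Options{}` and `Options{DefaultAction: KillThread}`
would be indistinguishable, which is the awkwardness the message refers to.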

PiperOrigin-RevId: 581477348
2023-11-11 00:44:40 -08:00
Etienne Perot f098b9b06e seccomp: Make SyscallRules map type opaque.
This wraps the `map[uintptr]SyscallRule` into an unexported field of a struct
so that it cannot be accessed directly.

This is helpful for the `runsc` and `fsgofer` seccomp filters which are quite
complex and built across multiple files and multiple functions, where it is
not always clear which order they are executed in. By forcing mutations to be
more explicit about their intent (especially "merge with this new rule" vs
"override what happens for this syscall with this new rule"), we can crash if
that intent isn't what's actually happening.
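
The pattern described above can be sketched roughly as follows. This is an
illustration under assumptions, not the real `seccomp.SyscallRules` type: the
`Rule`, `Rules`, `Set`, `Merge`, `Override`, and `Get` names are hypothetical,
and a rule is reduced to a string for brevity.

```go
package main

import "fmt"

// Rule is a hypothetical stand-in for a seccomp rule.
type Rule string

// Rules keeps its map unexported so callers must state their intent
// through methods instead of writing to the map directly.
type Rules struct {
	rules map[uintptr][]Rule
}

// NewRules returns an empty rule set.
func NewRules() *Rules {
	return &Rules{rules: map[uintptr][]Rule{}}
}

// Set installs the first rule for sysno and panics if one already exists.
func (r *Rules) Set(sysno uintptr, rule Rule) {
	if _, ok := r.rules[sysno]; ok {
		panic(fmt.Sprintf("syscall %d already has a rule", sysno))
	}
	r.rules[sysno] = []Rule{rule}
}

// Merge adds rule as an extra alternative for sysno.
func (r *Rules) Merge(sysno uintptr, rule Rule) {
	r.rules[sysno] = append(r.rules[sysno], rule)
}

// Override replaces the rules for sysno and panics if there is nothing
// to replace.
func (r *Rules) Override(sysno uintptr, rule Rule) {
	if _, ok := r.rules[sysno]; !ok {
		panic(fmt.Sprintf("syscall %d has no rule to override", sysno))
	}
	r.rules[sysno] = []Rule{rule}
}

// Get returns a copy of the rules for sysno.
func (r *Rules) Get(sysno uintptr) []Rule {
	return append([]Rule(nil), r.rules[sysno]...)
}

// panics reports whether f panicked.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	r := NewRules()
	r.Set(1, "fd == 1")
	r.Merge(1, "fd == 2")
	fmt.Println(r.Get(1))

	// Mismatched intent crashes instead of silently changing the filter.
	fmt.Println(panics(func() { r.Set(1, "any") }))
	fmt.Println(panics(func() { r.Override(99, "any") }))
}
```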

PiperOrigin-RevId: 572361619
2023-10-10 14:11:01 -07:00
Etienne Perot 71dc79e653 secbench: Benchmark optimization duration and compression ratio.
Current values for the Sentry filters:

```
              │   current   │
              │  build-sec  │
SentrySystrap   13.73m ± 0%
SentryKVM       16.36m ± 0%

              │      current      │
              │ compression-ratio │
SentrySystrap          2.165 ± 0%
SentryKVM              2.132 ± 0%

              │   current   │
              │  gen-instr  │
SentrySystrap   1.288k ± 0%
SentryKVM       1.373k ± 0%

              │  current   │
              │ opt-instr  │
SentrySystrap   595.0 ± 0%
SentryKVM       644.0 ± 0%

              │   current   │
              │   opt-sec   │
SentrySystrap   819.0µ ± 2%
SentryKVM       897.0µ ± 1%
```

PiperOrigin-RevId: 572089103
2023-10-09 17:53:22 -07:00
Etienne Perot addac5f248 Refactor seccomp rules with interfaces rather than disjunctive normal form.
This replaces the `seccomp.Rule` type with the `seccomp.SyscallRule`
interface, which is an abstraction that defines how to match a syscall's
arguments and RIP.

This has the following benefits:

- The code can verify that rules are self-contained, as the
  `SyscallRule.Render` contract specifies that the rule must jump to
  either a "matched" or "not matched" label, and may not fall through.
  It uses `ProgramBuilder`'s support for asserting unreachability to
  enforce this.
- Rules that match everything are more explicit (no more implicit
  "no rules means everything matches" behavior; instead you have to
  explicitly specify `seccomp.MatchAll{}`).
- "OR" behavior is explicit (a disjunctive rule is marked as `seccomp.Or`
  rather than the current implicit meaning of a list of rules).
- Allows the creation of more sophisticated matching rules that don't work
  on a per-argument basis. This change does not do any of that yet; it
  simply refactors existing rules without changing the way they work.
- Decouples rule-specific rendering code from the larger program generation
  code (BST, architecture check, etc.).

Unfortunately there is no easy way to split this change into multiple
sub-changes without introducing additional complexity to support both forms
of expressing rules, so sorry if this is a large change. But note that it
is actually net-negative in line count.

Despite the size of this change, please review it carefully, as this is a
security-sensitive change.

PiperOrigin-RevId: 571459670
2023-10-06 16:19:26 -07:00
Etienne Perot dcfe2d169e seccomp: Rename seccomp.MatchAny to seccomp.AnyValue.
This reduces the diff on an upcoming refactor which modifies all seccomp
rules.

`AnyValue` better reflects the fact that the matcher is about matching a
single syscall argument value, as opposed to e.g. a rule that allows a
syscall through regardless of its argument.

PiperOrigin-RevId: 571110444
2023-10-05 13:19:06 -07:00
Etienne Perot 5f5692dd20 bpf: Replace most uses of linux.BPFInstruction with bpf.Instruction.
`bpf.Instruction` is the same type as `linux.BPFInstruction`, except that it
uses the BPF instruction-to-string decoder to give a nice human-readable
stringification.

PiperOrigin-RevId: 570499020
2023-10-03 14:34:53 -07:00
Konstantin Bogomolov 6763252ef0 Cleanup unused systrap code.
PiperOrigin-RevId: 529810117
2023-05-05 14:10:37 -07:00
Konstantin Bogomolov d7f590dd00 Clean up context decoupling experiment.
This change removes code branches and variables only used in coupled-context
mode.

PiperOrigin-RevId: 529776383
2023-05-05 11:55:50 -07:00
Konstantin Bogomolov f727f06c81 Add debug logging to systrap futex waits.
In general it is probably a good idea to set a timeout on any futex waits that
the sentry is doing. For now just output some helpful logs about what the shared
memory looks like; in the future we may want to do something more useful on
ETIMEDOUT events.

PiperOrigin-RevId: 518919966
2023-03-23 11:44:22 -07:00
Konstantin Bogomolov 897c03039e Implement systrap context queue.
This is the initial implementation of the systrap context queue via a ringbuffer
in shared memory between stub threads and the sentry.

In this new model there is no longer a bound sysmsg thread for every context;
instead each subprocess starts with one initial sysmsg thread, which starts
polling the context queue for new contexts arriving from the sentry. If the
sentry detects that contexts are spending too much time in the context queue
without being processed, it will create new sysmsg threads or wake sleeping
ones. Tangentially, sysmsg threads will go to sleep if they spend too much time
busy looping without new context arrivals.

This model does not yet take into account the full load of the host system or
even multiple subprocesses in the same sandbox. Multiple overloaded subprocesses
are liable to make each other run slower by kicking sysmsg threads more often
than they need to; this will be remedied in follow-up CLs.

PiperOrigin-RevId: 516680504
2023-03-14 17:48:13 -07:00
Konstantin Bogomolov 263dad6258 Handle context interrupts based on syshandler state.
Also add interrupt handling for context decoupling.

Previously syshandler interrupts would be retriggered regardless of whether the
interrupt arrived before or after the switch to the sentry. We only need to
handle the case of it arriving after.

Additionally this CL introduces interrupt handling for the decoupled context
mode, by making interrupts target task contexts rather than sysmsg threads.

PiperOrigin-RevId: 515065101
2023-03-08 09:58:12 -08:00
Konstantin Bogomolov 702540baec Implement saving decoupled context from sighandler.
Saves task context state to the separate context memory region which is mapped
to all subprocess sysmsg threads, instead of always saving the context to the
thread-specific sysmsg.

When context decoupling is disabled, fpstate is not saved to this region, but
GP registers and signal info are.

PiperOrigin-RevId: 514432596
2023-03-06 09:24:24 -08:00
Andrei Vagin 192bfb03fb Open-sourcing the systrap platform.
The systrap platform, like the ptrace platform, uses stub processes to manage
the user address space. The difference is how they intercept system calls and
other events like memory faults, exceptions, etc.

In the case of systrap, all events that have to be handled by the Sentry trigger
signals that are handled by a custom signal handler installed on stub
processes. The signal handler switches control to the Sentry.

Here are a few other optimizations:
* On x86, system calls can be replaced with a function call to remove the
  overhead of signals.
* For fast interactions between the sentry and stub processes, futex wait/wake
  can be a bottleneck, so we use a polling mode.

The platform is launched for the purpose of testing and gathering initial
feedback. It is not yet ready for use in production.

PiperOrigin-RevId: 511650064
2023-02-22 18:22:49 -08:00