This is just a refactoring, but the intention of this `struct` is to add
other useful options for the program build, such as the list of expected
"hottest" syscalls by frequency.
The interface is a bit awkward, because of the need for two entry points
(one which didn't have any way to set the default actions), and because the
zero value of `linux.BPFAction` is a valid (and common) "default action":
killing the program (`linux.SECCOMP_RET_KILL_THREAD`).
We also can't use `*linux.BPFAction`, as `linux.SECCOMP_RET_*` are constants.
So the struct fields use functions that "resolve" to an action.
This reads fairly well in the call sites (`DefaultAction: Return(action)`),
at the cost of slightly convoluted logic in `seccomp.go`.
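
A minimal sketch of the "function that resolves to an action" idea, assuming simplified stand-in types. `Action`, `ActionFunc`, `Options`, and `resolve` are hypothetical names for illustration, not the actual definitions in `seccomp.go`. It shows why a nil function can mean "not set" while the zero action value still means "kill the thread":

```go
package main

import "fmt"

// Action stands in for linux.BPFAction in this sketch.
type Action uint32

const (
	KillThread Action = 0          // zero value, yet a meaningful action
	Allow      Action = 0x7fff0000 // used here only as a fallback
)

// ActionFunc resolves to an action. A nil ActionFunc means "not set",
// which a plain Action field could not express.
type ActionFunc func() Action

// Return wraps a constant action so it can be used as an ActionFunc.
func Return(a Action) ActionFunc {
	return func() Action { return a }
}

// Options is a hypothetical options struct for a program build.
type Options struct {
	DefaultAction ActionFunc
}

// resolve returns the configured action, or fallback if none was set.
func resolve(f ActionFunc, fallback Action) Action {
	if f == nil {
		return fallback
	}
	return f()
}

func main() {
	unset := Options{}
	kill := Options{DefaultAction: Return(KillThread)}
	fmt.Println(resolve(unset.DefaultAction, Allow) == Allow)
	fmt.Println(resolve(kill.DefaultAction, Allow) == KillThread)
}
```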
PiperOrigin-RevId: 581477348
This wraps the `map[uintptr]SyscallRule` in an unexported field of a struct
so that it cannot be accessed directly.
This is helpful for the `runsc` and `fsgofer` seccomp filters which are quite
complex and built across multiple files and multiple functions, where it is
not always clear which order they are executed in. By forcing mutations to be
more explicit about their intent (especially "merge with this new rule" vs
"override what happens for this syscall with this new rule"), we can crash if
that intent isn't what's actually happening.
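
As a rough illustration of the pattern (not the actual API), the sketch below hides the map behind methods that state their intent and panic when the intent does not match reality. `RuleSet`, `Set`, `Override`, and `Rule` are hypothetical names, and "set a new rule" vs "override an existing rule" stand in for the real merge/override operations:

```go
package main

import "fmt"

// Rule is a placeholder for a syscall rule; here it is just a label.
type Rule string

// RuleSet keeps the map unexported so every mutation goes through a
// method that states its intent.
type RuleSet struct {
	rules map[uintptr]Rule
}

func NewRuleSet() *RuleSet {
	return &RuleSet{rules: make(map[uintptr]Rule)}
}

// Set adds a rule for a syscall that must not have one yet.
func (s *RuleSet) Set(sysno uintptr, r Rule) {
	if _, ok := s.rules[sysno]; ok {
		panic(fmt.Sprintf("syscall %d already has a rule", sysno))
	}
	s.rules[sysno] = r
}

// Override replaces the rule for a syscall that must already have one.
func (s *RuleSet) Override(sysno uintptr, r Rule) {
	if _, ok := s.rules[sysno]; !ok {
		panic(fmt.Sprintf("syscall %d has no rule to override", sysno))
	}
	s.rules[sysno] = r
}

// Get returns the rule for a syscall, if any.
func (s *RuleSet) Get(sysno uintptr) (Rule, bool) {
	r, ok := s.rules[sysno]
	return r, ok
}

// panics reports whether f panicked.
func panics(f func()) (didPanic bool) {
	defer func() {
		if recover() != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	s := NewRuleSet()
	s.Set(1, "allow write to fd 1")
	s.Override(1, "allow any write")
	r, _ := s.Get(1)
	fmt.Println(r)
	fmt.Println(panics(func() { s.Set(1, "duplicate") }))
}
```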
PiperOrigin-RevId: 572361619
This replaces the `seccomp.Rule` type with the `seccomp.SyscallRule`
interface, which is an abstraction that defines how to match a syscall's
arguments and RIP.
This has the following benefits:
- The code can verify that rules are self-contained, as the
`SyscallRule.Render` contract specifies that the rule must jump to
either a "matched" or "not matched" label, and may not fall through.
It uses `ProgramBuilder`'s support for asserting unreachability to
enforce this.
- Rules that match everything are more explicit (no more implicit
"no rules means everything matches" behavior; instead you have to
explicitly specify `seccomp.MatchAll{}`).
- "OR" behavior is explicit (a disjunctive rule is marked as `seccomp.Or`
rather than the current implicit meaning of a list of rules).
- Allows the creation of more sophisticated matching rules that don't work
on a per-argument basis. This change does not do any of that yet; it
simply refactors existing rules without changing the way they work.
- Decouples rule-specific rendering code from the larger program generation
code (BST, architecture check, etc.).
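
To make the shape of the abstraction concrete, here is a simplified sketch. It evaluates rules directly against arguments instead of rendering BPF with "matched"/"not matched" labels, so the `Matches` method, `Args`, and `Arg0Equals` are assumptions for illustration only. `MatchAll` and `Or` mirror the names mentioned above:

```go
package main

import "fmt"

// Args holds the six syscall arguments.
type Args [6]uintptr

// SyscallRule is a simplified stand-in: the real interface renders BPF,
// whereas this one just answers whether the arguments match.
type SyscallRule interface {
	Matches(args Args) bool
}

// MatchAll explicitly matches every invocation of a syscall.
type MatchAll struct{}

func (MatchAll) Matches(Args) bool { return true }

// Or matches if any of its sub-rules matches. An empty Or matches
// nothing, so "match everything" always has to be spelled out.
type Or []SyscallRule

func (o Or) Matches(args Args) bool {
	for _, r := range o {
		if r.Matches(args) {
			return true
		}
	}
	return false
}

// Arg0Equals is a hypothetical per-argument rule on the first argument.
type Arg0Equals uintptr

func (v Arg0Equals) Matches(args Args) bool { return args[0] == uintptr(v) }

func main() {
	rule := Or{Arg0Equals(1), Arg0Equals(2)}
	fmt.Println(rule.Matches(Args{2}))
	fmt.Println(Or{}.Matches(Args{2}))
	fmt.Println(MatchAll{}.Matches(Args{9}))
}
```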
Unfortunately there is no easy way to split this change into multiple
sub-changes without introducing additional complexity to support both forms
of expressing rules, hence its size. Note, however, that it is net-negative
in line count.
Despite the size of this change, please review it carefully, as this is a
security-sensitive change.
PiperOrigin-RevId: 571459670
This reduces the diff on an upcoming refactor which modifies all seccomp
rules.
`AnyValue` better reflects the fact that the matcher is about matching a
single syscall argument value, as opposed to e.g. a rule that allows a
syscall through regardless of its argument.
PiperOrigin-RevId: 571110444
`bpf.Instruction` is the same type as `linux.BPFInstruction`, except that it
uses the BPF instruction-to-string decoder to give a nice human-readable
stringification.
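
One way to get this effect in Go is a defined type with the same underlying struct plus a `String` method. The sketch below assumes the classic BPF instruction layout (opcode, two jump offsets, constant). `RawInstruction` is a hypothetical stand-in for `linux.BPFInstruction`, and the tiny decoder and its output format are illustrative, not the real decoder:

```go
package main

import "fmt"

// RawInstruction mirrors the layout of a classic BPF instruction.
type RawInstruction struct {
	OpCode      uint16
	JumpIfTrue  uint8
	JumpIfFalse uint8
	K           uint32
}

// Instruction has the same underlying type, so values convert freely,
// but it carries a human-readable String method.
type Instruction RawInstruction

// String decodes two common opcodes and falls back to a raw dump.
func (i Instruction) String() string {
	switch i.OpCode {
	case 0x06: // BPF_RET | BPF_K
		return fmt.Sprintf("ret #0x%x", i.K)
	case 0x20: // BPF_LD | BPF_W | BPF_ABS
		return fmt.Sprintf("ld [%d]", i.K)
	default:
		return fmt.Sprintf("op=0x%02x jt=%d jf=%d k=0x%x",
			i.OpCode, i.JumpIfTrue, i.JumpIfFalse, i.K)
	}
}

func main() {
	raw := RawInstruction{OpCode: 0x20, K: 4}
	fmt.Println(Instruction(raw))
	fmt.Println(Instruction{OpCode: 0x06, K: 0x7fff0000})
}
```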
PiperOrigin-RevId: 570499020
In general it is probably a good idea to set a timeout on any futex waits that
the sentry is doing. For now just output some helpful logs about what the shared
memory looks like; in the future we may want to do something more useful on
ETIMEDOUT events.
PiperOrigin-RevId: 518919966
This is the initial implementation of the systrap context queue via a ringbuffer
in shared memory between stub threads and the sentry.
In this new model there is no longer a bound sysmsg thread for every context;
instead each subprocess starts with one initial sysmsg thread, which starts
polling the context queue for new contexts arriving from the sentry. If the
sentry detects that contexts are spending too much time in the context queue
without being processed, it will create new sysmsg threads or wake sleeping
ones. Tangentially, sysmsg threads will go to sleep if they spend too much time
busy looping without new context arrivals.
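
A minimal, single-threaded sketch of such a queue, assuming a fixed-size ring with head and tail counters. The real queue lives in shared memory and needs atomic operations; `ContextQueue`, `NeedsMoreThreads`, the queue size, and the abstract time units are all assumptions made for illustration:

```go
package main

import "fmt"

const queueSize = 4 // power of two, so counters may wrap safely

// entry records a queued context and when it was enqueued.
type entry struct {
	contextID  uint32
	enqueuedAt int64 // arbitrary time units
}

// ContextQueue is a fixed-size ring buffer of pending contexts.
type ContextQueue struct {
	slots      [queueSize]entry
	head, tail uint32 // head: next to pop; tail: next to push
}

// Push enqueues a context, reporting false if the ring is full.
func (q *ContextQueue) Push(id uint32, now int64) bool {
	if q.tail-q.head == queueSize {
		return false
	}
	q.slots[q.tail%queueSize] = entry{contextID: id, enqueuedAt: now}
	q.tail++
	return true
}

// Pop dequeues the oldest context, reporting false if the ring is empty.
func (q *ContextQueue) Pop() (uint32, bool) {
	if q.head == q.tail {
		return 0, false
	}
	e := q.slots[q.head%queueSize]
	q.head++
	return e.contextID, true
}

// NeedsMoreThreads reports whether the oldest queued context has waited
// longer than limit, which is when a producer might wake another thread.
func (q *ContextQueue) NeedsMoreThreads(now, limit int64) bool {
	if q.head == q.tail {
		return false
	}
	return now-q.slots[q.head%queueSize].enqueuedAt > limit
}

func main() {
	var q ContextQueue
	q.Push(7, 100)
	fmt.Println(q.NeedsMoreThreads(105, 10))
	fmt.Println(q.NeedsMoreThreads(120, 10))
	id, ok := q.Pop()
	fmt.Println(id, ok)
}
```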
This model does not yet take into account the full load of the host system or
even multiple subprocesses in the same sandbox. Multiple overloaded subprocesses
are liable to make each other run slower by kicking sysmsg threads more often
than they need to; this will be remedied in follow-up CLs.
PiperOrigin-RevId: 516680504
Also add interrupt handling for context decoupling.
Previously syshandler interrupts would be retriggered regardless of whether the
interrupt arrived before or after the switch to the sentry. We only need to
handle the case of it arriving after.
Additionally this CL introduces interrupt handling for the decoupled context
mode, by making interrupts target task contexts rather than sysmsg threads.
PiperOrigin-RevId: 515065101
Saves task context state to the separate context memory region which is mapped
to all subprocess sysmsg threads, instead of always saving the context to the
thread-specific sysmsg.
When context decoupling is disabled, fpstate is not saved to this region, but
GP registers and signal info are.
PiperOrigin-RevId: 514432596
The systrap platform, like the ptrace platform, uses stub processes to manage
the user address space. The difference is how they intercept system calls and
other events like memory faults, exceptions, etc.
In the case of systrap, all events that have to be handled by the Sentry trigger
signals that are handled by a custom signal handler installed on stub
processes. The signal handler switches control to the Sentry.
Here are a few other optimizations:
* On x86, system calls can be replaced with a function call to remove the
overhead of signals.
* For fast interactions between the sentry and stub processes, futex wait/wake
can be a bottleneck, so we use a polling mode.
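
The polling idea from the last bullet can be sketched as "spin on a shared flag for a while, and only block if nothing shows up". This is a generic illustration using Go atomics and a channel rather than shared memory and futexes; `waitForEvent`, `ready`, and `maxSpins` are hypothetical names:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// waitForEvent polls ready up to maxSpins times and returns true if the
// flag was observed while polling. Otherwise it falls back to blocking
// on wake (the slow path) and returns false once woken.
func waitForEvent(ready *int32, maxSpins int, wake <-chan struct{}) bool {
	for i := 0; i < maxSpins; i++ {
		if atomic.LoadInt32(ready) != 0 {
			return true // fast path: no blocking wait needed
		}
	}
	<-wake
	return false
}

func main() {
	var ready int32 = 1
	fmt.Println(waitForEvent(&ready, 1000, nil))

	var idle int32
	wake := make(chan struct{})
	close(wake) // simulate an immediate wake-up on the slow path
	fmt.Println(waitForEvent(&idle, 1000, wake))
}
```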
The platform is launched for the purpose of testing and gathering initial
feedback. It is not yet ready for use in production.
PiperOrigin-RevId: 511650064