As of https://go.dev/cl/646095, the Go runtime calls
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) when mapping memory to annotate
mappings in /proc/self/maps. Since this is a system call made throughout the
application lifetime, it needs to be allowed through the system call filters.
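As a rough illustration of what "allowing" this call means, here is a minimal sketch of an argument-aware allowlist. The `argRule` type, the `allowlist` map, and the `allowed` helper are hypothetical and are not the actual seccomp rule API; the constants are the Linux UAPI values, with the syscall number given for linux/amd64.

```go
package main

import "fmt"

// Values from the Linux UAPI headers (linux/prctl.h).
const (
	prSetVMA         = 0x53564d41 // PR_SET_VMA
	prSetVMAAnonName = 0          // PR_SET_VMA_ANON_NAME
	sysPrctl         = 157        // prctl(2) on linux/amd64
)

// argRule is a hypothetical allowlist entry: the syscall is allowed only
// if its first two arguments match exactly.
type argRule struct {
	arg0, arg1 uint64
}

// allowlist maps a syscall number to the argument patterns allowed for it.
var allowlist = map[uintptr][]argRule{
	sysPrctl: {{arg0: prSetVMA, arg1: prSetVMAAnonName}},
}

// allowed reports whether the syscall passes the allowlist.
func allowed(sysno uintptr, arg0, arg1 uint64) bool {
	for _, r := range allowlist[sysno] {
		if r.arg0 == arg0 && r.arg1 == arg1 {
			return true
		}
	}
	return false
}

func main() {
	// The annotation call made by the Go runtime is allowed.
	fmt.Println(allowed(sysPrctl, prSetVMA, prSetVMAAnonName))
	// Any other prctl option (here 4, PR_SET_DUMPABLE) is still rejected.
	fmt.Println(allowed(sysPrctl, 4, 0))
}
```

Matching on both arguments keeps the filter narrow: only the VMA-naming form of `prctl` is let through.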
PiperOrigin-RevId: 734182524
Functions in debug builds use more stack space than normal, so the KVM
nosplit function call chain no longer fits. This patch replaces calls into
unix.RawSyscall* functions with variants that do not grow the stack, and inlines
some functions in ring0/pagetables to reduce stack usage. Additionally,
seccompMmapHandler is no longer used in debug builds, so that the call chain
fits within the nosplit stack size limit.
PiperOrigin-RevId: 679774881
Structure before:
```
if sysno < current node value:
    goto left_node
if sysno > current node value:
    goto right_node
// Render rules for current sysno here...
left_node:
    // Recursively render left node code here...
right_node:
    // Recursively render right node code here...
```
This is fine, but if the "render rules for current sysno" part is large
enough for any syscall in the BST, the jumps to `left_node` and
`right_node` of the current node (and of all its ancestor nodes) have to use
unconditional jumps (i.e. extra instructions) during BST traversal.
Since BST traversal must be fast, it is better to keep all the rules for
each syscall separate from the BST traversal code. This ensures that the BST
traversal code all fits in the maximum conditional jump size (255
instructions), and then we do just one possibly-unconditional jump to the set
of rules for that syscall. Compare that to the previous structure, where
multiple jumps during BST traversal could have been unconditional jumps.
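The 255-instruction limit comes from the classic BPF instruction encoding, where conditional jump offsets are 8 bits wide and only the unconditional jump uses the 32-bit immediate. A minimal sketch of the resulting cost follows; `branchCost` is a hypothetical helper for illustration, not part of the bytecode generator.

```go
package main

import "fmt"

// sockFilter mirrors the layout of a classic BPF instruction
// (struct sock_filter in linux/filter.h).
type sockFilter struct {
	Code uint16 // opcode
	JT   uint8  // conditional jump offset taken when the test is true
	JF   uint8  // conditional jump offset taken when the test is false
	K    uint32 // immediate; also the offset of an unconditional jump
}

// maxCondJump is the furthest a conditional jump can reach.
const maxCondJump = 255

// branchCost returns how many instructions it takes to conditionally
// branch `offset` instructions ahead: one if the offset fits in 8 bits,
// otherwise a conditional jump plus an extra unconditional jump.
func branchCost(offset int) int {
	if offset <= maxCondJump {
		return 1
	}
	return 2
}

func main() {
	var insn sockFilter
	fmt.Printf("%T %T\n", insn.JT, insn.K)
	fmt.Println(branchCost(200), branchCost(300))
}
```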
Structure after:
```
if sysno < current node value:
    goto left_node
if sysno > current node value:
    goto right_node
goto sysno_rules
left_node:
    // Recursively render left node traversal code here...
right_node:
    // Recursively render right node traversal code here...
sysno_rules:
    // Render rules for current sysno here...
    // Recursively render left node syscall rules code here...
    // Recursively render right node syscall rules code here...
```
With all the optimizations done in previous CLs, BSTs in practice are actually
small enough that both the traversal and the syscall rules together all
fit under 255 instructions, so this only rarely comes into play.
However, as we add more syscalls and syscall rules, the effect of this
optimization should increase.
This actually results in slightly larger bytecode (because most syscall filter
rules do fit in just a few instructions, but now they have to be grouped at
the end which takes a bigger jump to reach), but *execution* is still faster,
because only one unconditional jump is ever done per program execution.
Benchmarks expectedly don't show much change, except KVM which is quite
happy for some reason:
```
                          │    before    │                 after                 │
                          │    sec/op    │    sec/op     vs base                 │
SentrySystrap/Postgres-48   51.43n ± 20%   50.50n ± 14%        ~ (p=0.394 n=93+92)
SentryKVM/Postgres-48       48.15n ± 10%   40.19n ± 14%  -16.53% (p=0.012 n=96+99)
NVProxyIoctl/nvproxy-48     65.75n ±  2%   65.43n ±  2%        ~ (p=0.703 n=99+98)
geomean                     57.84n         57.69n         -0.49%
```
(Most other benchmarks, including `futex` and such, show no change,
which makes sense because they're part of the "hot" syscalls which
aren't in a BST at all.)
PiperOrigin-RevId: 595816325
Changing the default action in `seccomp.Install` now does nothing,
because the rules are never instantiated when they are loaded from
a precompiled state.
Moving the debugging instructions to the `filter` package also makes
more sense for debugging the Sentry, as the `seccomp` package is also
used to install other seccomp rules which are not meant to have their
default action overwritten even when debugging.
PiperOrigin-RevId: 586538349
While precompilation is pretty fast, its presence in the build graph
removes concurrency from the `runsc` build process. This is because the
precompilation generation binary is dependency-heavy (e.g. it needs all
platform implementations), yet must be built before the `//runsc/boot/filter`
package is built, thus slowing down the `runsc` build process.
This change creates a low-dependency "stubbed" version of this generation
binary when running in `fastbuild` mode. The generated code in this mode
contains no precompiled seccomp programs; they are instead all built from
scratch on container startup. This trades off build speed vs container
startup speed.
PiperOrigin-RevId: 586520033
No change in behavior, but the rule now matches by AND'ing the flag bits once
rather than serially comparing the value four times.
(`FUTEX_WAIT` = 0; it is the opposite of `FUTEX_WAKE`. The bitmask still
includes it for completeness.)
This also includes a small optimization: replacing `halfMaskedEqual` when
the value to match is zero with `halfNotSet`. This replaces two operations
(AND + equal) with a single "is any of these bits set" operation. The futex
rules benefit from this.
PiperOrigin-RevId: 586510742
This applies the same logic as `extractRepeatedMatchers` to each half
of all `splitMatcher`s. This allows common 32-bit matchers in disjunctions
to be extracted out of them.
This is useful for almost all `EqualTo` rules, because they tend to look for
values that fit in the lower 32 bits. As such, the check for the higher 32
bits (that must be equal to zero in all cases of the disjunction) can be moved
out of the `Or`.
`splitMatcher` is updated to more efficiently handle the cases where either
of its branches is set to `halfAnyValue{}`.
```
              │    before    │                after                 │
              │    sec/op    │    sec/op     vs base                │
SentrySystrap   71.31n ± 9%    71.88n ± 5%        ~ (p=0.744 n=145+146)
SentryKVM       58.23n ± 8%    59.17n ± 5%        ~ (p=0.930 n=145)
NVProxyIoctl    92.92n ± 1%    86.14n ± 1%   -7.30% (n=145)

              │    before    │                after                 │
              │  build-sec   │  build-sec    vs base                │
SentrySystrap    74.15m ± 0%    42.34m ± 0%  -42.90% (n=145+146)
SentryKVM       419.51m ± 0%    54.67m ± 0%  -86.97% (n=145)
NVProxyIoctl    1139.7m ± 0%    116.9m ± 0%  -89.75% (n=145)

              │      before       │                   after                   │
              │ compression-ratio │ compression-ratio  vs base                │
SentrySystrap    3.443 ± 0%          3.872 ± 0%        +12.46% (p=0.000 n=145+146)
SentryKVM        3.501 ± 0%          3.983 ± 0%        +13.77% (p=0.000 n=145)
NVProxyIoctl     3.504 ± 0%          4.026 ± 0%        +14.90% (p=0.000 n=145)

              │    before    │                after                 │
              │  gen-instr   │  gen-instr    vs base                │
SentrySystrap   1.618k ± 0%    1.545k ± 0%   -4.51% (n=145+146)
SentryKVM       1.733k ± 0%    1.677k ± 0%   -3.23% (n=145)
NVProxyIoctl    2.355k ± 0%    2.351k ± 0%   -0.17% (n=145)

              │    before    │                after                 │
              │  opt-instr   │  opt-instr    vs base                │
SentrySystrap    470.0 ± 0%     399.0 ± 0%   -15.11% (n=145+146)
SentryKVM        495.0 ± 0%     421.0 ± 0%   -14.95% (n=145)
NVProxyIoctl     672.0 ± 0%     584.0 ± 0%   -13.10% (n=145)

              │    before    │                after                 │
              │   opt-sec    │   opt-sec     vs base                │
SentrySystrap   109.70m ± 1%    91.88m ± 1%  -16.25% (n=145+146)
SentryKVM       103.24m ± 1%    90.06m ± 1%  -12.77% (n=145)
NVProxyIoctl     264.9m ± 1%    205.0m ± 1%  -22.61% (n=145)
```
PiperOrigin-RevId: 586505148
For `MaskedEqual`'s matchers, this looks at the mask being matched against,
and simplifies the matcher if that mask is either 0 (in which case any value
is allowed), or full bits (in which case there is no need to run an `AND`
operation on the bits).
The same simplification is applied to `halfNotSet`.
Benchmarks:
```
              │    before    │                 after                  │
              │  build-sec   │  build-sec    vs base                  │
SentrySystrap    34.10m ± 0%    74.15m ± 0%  +117.49% (p=0.000 n=144+145)
SentryKVM        142.2m ± 0%    419.5m ± 0%  +195.08% (p=0.000 n=144+145)
NVProxyIoctl     380.0m ± 0%   1139.7m ± 0%  +199.96% (p=0.000 n=144+145)

              │      before       │                   after                   │
              │ compression-ratio │ compression-ratio  vs base                │
SentrySystrap    3.252 ± 0%          3.443 ± 0%        +5.87% (p=0.000 n=144+145)
SentryKVM        3.318 ± 0%          3.501 ± 0%        +5.52% (p=0.000 n=144+145)
NVProxyIoctl     3.298 ± 0%          3.504 ± 0%        +6.25% (p=0.000 n=144+145)

              │    before    │                 after                  │
              │  gen-instr   │  gen-instr    vs base                  │
SentrySystrap   1.538k ± 0%    1.618k ± 0%   +5.20% (p=0.000 n=144+145)
SentryKVM       1.649k ± 0%    1.733k ± 0%   +5.09% (p=0.000 n=144+145)
NVProxyIoctl    2.233k ± 0%    2.355k ± 0%   +5.46% (p=0.000 n=144+145)

              │    before    │                 after                  │
              │  opt-instr   │  opt-instr    vs base                  │
SentrySystrap    473.0 ± 0%     470.0 ± 0%   -0.63% (n=144+145)
SentryKVM        497.0 ± 0%     495.0 ± 0%   -0.40% (n=144+145)
NVProxyIoctl     677.0 ± 0%     672.0 ± 0%   -0.74% (n=144+145)

              │    before    │                 after                  │
              │   opt-sec    │   opt-sec     vs base                  │
SentrySystrap    109.8m ± 0%    109.7m ± 1%        ~ (p=0.179 n=144+145)
SentryKVM        104.8m ± 0%    103.2m ± 1%   -1.44% (p=0.000 n=144+145)
NVProxyIoctl     265.7m ± 0%    264.9m ± 1%        ~ (p=0.115 n=144+145)
```
PiperOrigin-RevId: 586484636
This does the following:
- Only allocate maps once.
- Check whether the filter can run before doing any expensive allocation
or map modifications.
- Recursively optimize other arguments earlier on.
- Replace `PerArg.Copy` with a specialized version.
This helps make this function more efficient in `gotsan` mode.
PiperOrigin-RevId: 586382284
This adds a `Copy` function to the `SyscallRule` interface, so that syscall
rules can be deeply copied.
This is useful in the context of precompiled rules, which can be precompiled
in parallel and where some optimizers will modify the `SyscallRule` objects
themselves (e.g. removing the `MatchAll` rules from an `Or` rule). When this
happens in parallel, this causes a race of two goroutines writing to the same
slice when the rules are based on a similar source.
Each goroutine should be dealing with its own set of rules, hence making
`Copy` a deep copy.
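To illustrate why the copy has to be deep, here is a minimal sketch with hypothetical `rule`, `matchAll`, `equalTo`, and `or` types (not the real `SyscallRule` implementations). The in-place optimizer overwrites the slice's backing array, so running it on a shared slice would affect every other user of that slice.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for the SyscallRule interface.
type rule interface {
	Copy() rule
}

type matchAll struct{}

func (matchAll) Copy() rule { return matchAll{} }

type equalTo struct{ v uint64 }

func (e equalTo) Copy() rule { return e }

// or is a disjunction of rules backed by a slice.
type or []rule

// Copy allocates a new slice and copies each child rule recursively.
func (o or) Copy() rule {
	c := make(or, len(o))
	for i, r := range o {
		c[i] = r.Copy()
	}
	return c
}

// dropMatchAll is a hypothetical in-place optimizer: it reuses the
// backing array of its input, so it mutates whatever slice it is given.
func dropMatchAll(o or) or {
	kept := o[:0]
	for _, r := range o {
		if _, ok := r.(matchAll); !ok {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	shared := or{matchAll{}, equalTo{v: 1}}

	// Each worker optimizes its own deep copy, leaving `shared` intact.
	mine := dropMatchAll(shared.Copy().(or))

	fmt.Printf("%d %T\n", len(mine), shared[0])
}
```

Had `dropMatchAll` been given `shared` directly, `shared[0]` would have been overwritten with the `equalTo` rule, which is the kind of concurrent write the deep copy avoids.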
PiperOrigin-RevId: 584169637
Prior to this change, the Sentry seccomp filters used the program's PID
as a filter for the `tgkill(2)` system call. However, some platforms expand
this filter to allow any PID, which the optimizer detects, optimizing
the initial filter away. The PID then ends up being treated as an unused
variable. This change addresses this by detecting this situation and
allowing unused variables in optimized bytecode, so long as they do show
up in non-optimized bytecode.
Also add some small helper functions for using 64-bit variables.
PiperOrigin-RevId: 583222206
This adds a `precompiledseccomp` library which provides tooling to compile
`seccomp-bpf` programs and generate Go source code that contains the
resulting bytecode embedded into it. In turn, this bytecode can be used in
Go libraries.
This avoids spending time compiling and optimizing `seccomp-bpf` programs
at runsc container creation time.
This library also contains support for "variables", which are `uint32`s whose
values are part of the seccomp filters but only known at runtime. To support
this, the program is compiled twice with placeholder values for these
variables, and we verify that the offsets at which these values show up in the
bytecode are consistent across these two compilation attempts.
PiperOrigin-RevId: 583117683
This change adds a `filter_fuzz_golden.bpf` BPF program that was generated
manually prior to my recent set of seccomp bytecode and rule optimization
changes. It represents the "reference logic"; the new test
verifies that the current seccomp-bpf library produces BPF bytecode that
has the same behavior, using fuzz testing with full line-based coverage.
PiperOrigin-RevId: 582914572
This ensures that the optimized and unoptimized seccomp programs
are equivalent in behavior. Full coverage is enforced on the optimized
program. It cannot be enforced on the unoptimized program, because it
naturally ends up generating code that can never be satisfied.
PiperOrigin-RevId: 582892981
This subdivides all `RuleSet`s into single-syscall rulesets, and then
classifies them depending on:
- Whether they are "trivial" or not, where "trivial" means that the syscall
rules do not perform any verification of the syscall arguments or RIP.
- Whether they are marked "hot" or not, where "hot" means "expected to be
frequently called".
It then orders the program as follows:
- All hot non-trivial rules go first. This makes it so that the host kernel
can clear the syscall faster for frequently-called syscalls. These are
checked linearly, as they tend to follow a Pareto distribution in terms of
frequency. If they need a vsyscall check, that check is added individually.
- All cold rules go next, and form a BST. This mimics the structure of the BST
construction that existed prior to this change.
- Lastly, all the trivial syscalls are added as a last BST.
This speeds up rule evaluation because it maximizes the use of Linux's
seccomp cache for trivial syscalls. These are therefore only ever checked
once, so they can stay at the "bottom" of the program.
All remaining (non-trivial) syscalls are ordered such that hot syscalls
are checked first, and then cold syscalls are checked with a BST.
This is a complex and security-sensitive change, but fuzz testing with full
branch coverage has shown that this has the exact same behavior as a BPF
program taken from before any of my recent seccomp/BPF changes (other than
the one adding non-negative FD checks to all `ioctl(2)` system calls).
Some benchmark results ("orig" is the state before this change):
```
                          │     orig     │               reordered               │
                          │    sec/op    │    sec/op     vs base                 │
SentrySystrap/futex         79.44n ±  2%   73.93n ±  2%  -6.93% (n=729+722)
SentrySystrap/nanosleep     112.3n ± 12%   107.2n ± 12%       ~ (p=0.505 n=482+477)
SentrySystrap/sendmmsg      88.50n ±  1%   81.62n ±  1%  -7.78% (n=729+722)
SentrySystrap/fstat         30.80n ±  2%   30.63n ±  3%       ~ (p=0.903 n=722+712)
[...]
SentrySystrap/Postgres-48   64.30n ±  5%   61.74n ±  6%  -3.97% (p=0.039 n=376+377)
```
PiperOrigin-RevId: 582808055
This does an optimization pass over `RuleSet`s prior to rendering anything,
rather than optimizing at the last minute for each syscall.
PiperOrigin-RevId: 581671748