98 Commits

Author SHA1 Message Date
Michael Pratt 46833fbeee Allow prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) through syscall filters
As of https://go.dev/cl/646095, the Go runtime calls
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) when mapping memory to annotate
mappings in /proc/self/maps. Since this is a system call made throughout the
application lifetime, it needs to be allowed through the system call filters.

PiperOrigin-RevId: 734182524
2025-03-06 09:54:22 -08:00
Etienne Perot b7334af658 Internal change.
PiperOrigin-RevId: 686612361
2024-10-16 13:07:09 -07:00
Jing Chen a093ad0450 Simplify and format gVisor codebase.
The changes are just output of `gofmt -s -w .`.
2024-10-13 00:50:32 -07:00
Koichi Shiraishi 0cf77c02f8 all: remove use of the deprecated io/ioutil package & fix some deprecated things
Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>
2024-10-10 20:36:24 +09:00
Konstantin Bogomolov 0760a3df59 kvm: reduce stack usage
Debug build functions use more stack space than normal, such that the
KVM-nosplit function call chain doesn't fit. This patch replaces calls into
unix.RawSyscall* functions with variants that do not grow the stack, and inlines
some functions in ring0/pagetables in order to reduce stack usage. Additionally,
seccompMmapHandler is no longer used in debug builds, so that the call chain
fits within the nosplit stack size limit.

PiperOrigin-RevId: 679774881
2024-09-27 16:58:17 -07:00
Etienne Perot 11efa60e01 Display list of precompiled seccomp-bpf programs in debug logs.
See [this comment](https://github.com/freedomofpress/dangerzone/pull/590#issuecomment-2149642086) for context.

PiperOrigin-RevId: 643463161
2024-06-14 15:07:41 -07:00
Jing Chen cf5c4c9cbf Replace reflect.DeepEqual with [slices/maps].Equal.
They are faster on slice/map comparisons.
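
As a rough illustration of the replacement (a minimal sketch, not code from this change; `sameSyscalls` and `sameActions` are hypothetical helper names), the typed comparisons from the standard `slices` and `maps` packages (Go 1.21+) can be used like this:

```go
package main

import (
	"fmt"
	"maps"
	"reflect"
	"slices"
)

// sameSyscalls reports whether two syscall-number lists are identical,
// using the typed, reflection-free slices.Equal.
func sameSyscalls(a, b []uintptr) bool {
	return slices.Equal(a, b)
}

// sameActions does the same for a map keyed by syscall number.
func sameActions(a, b map[uintptr]string) bool {
	return maps.Equal(a, b)
}

func main() {
	a := []uintptr{0, 1, 3}
	b := []uintptr{0, 1, 3}
	fmt.Println(sameSyscalls(a, b), reflect.DeepEqual(a, b))

	// One behavioral difference to keep in mind when migrating:
	// slices.Equal treats a nil slice and an empty slice as equal,
	// while reflect.DeepEqual does not.
	var none []uintptr
	fmt.Println(sameSyscalls(none, []uintptr{}), reflect.DeepEqual(none, []uintptr{}))

	m := map[uintptr]string{0: "trace", 3: "kill thread"}
	fmt.Println(sameActions(m, map[uintptr]string{0: "trace", 3: "kill thread"}))
}
```

The nil-versus-empty difference is the main thing to check at each call site when doing this kind of replacement.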

PiperOrigin-RevId: 633080355
2024-05-12 21:20:18 -07:00
Etienne Perot 1e61310ce6 seccomp-bpf: Render syscall rules after binary search tree traversal code.
Structure before:

```
if sysno < current node value:
  goto left_node
if sysno > current node value:
  goto right_node
// Render rules for current sysno here...
left_node:
  // Recursively render left node code here...
right_node:
  // Recursively render right node code here...
```

This is fine, but if the "render rules for current sysno" part is large
enough for any syscall in the BST, the jumps to `left_node` and `right_node`
of the current node (and of all its ancestor nodes) have to use
unconditional jumps (i.e. extra instructions) during BST traversal.

Since BST traversal must be fast, it is better to keep all the rules for
each syscall separate from the BST traversal code. This ensures that the BST
traversal code all fits within the maximum conditional jump offset (255
instructions), and then we do just one possibly-unconditional jump to the set
of rules for that syscall. Compare that to the previous structure, where
multiple jumps during BST traversal could have been unconditional jumps.

Structure after:

```
if sysno < current node value:
  goto left_node
if sysno > current node value:
  goto right_node
goto sysno_rules
left_node:
  // Recursively render left node traversal code here...
right_node:
  // Recursively render right node traversal code here...

sysno_rules:
  // Render rules for current sysno here...
// Recursively render left node syscall rules code here...
// Recursively render right node syscall rules code here...
```

With all the optimizations done in previous CLs, BSTs in practice are actually
small enough that both the traversal and the syscall rules together all
fit under 255 instructions, so this only rarely comes into play.
However, as we add more syscalls and syscall rules, the effect of this
optimization should increase.

This actually results in slightly larger bytecode (because most syscall filter
rules do fit in just a few instructions, but now they have to be grouped at
the end which takes a bigger jump to reach), but *execution* is still faster,
because only one unconditional jump is ever done per program execution.
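
The "structure after" layout can be sketched in Go as follows. This is an illustrative toy that emits pseudo-instructions as strings, not the actual seccomp-bpf renderer; the `node` type, the `node_N`/`rules_N`/`no_match` labels and the `render*` functions are all hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one entry of a BST keyed by syscall number.
type node struct {
	sysno       int
	left, right *node
}

// build makes a balanced BST from sorted syscall numbers.
func build(sorted []int) *node {
	if len(sorted) == 0 {
		return nil
	}
	mid := len(sorted) / 2
	return &node{sysno: sorted[mid], left: build(sorted[:mid]), right: build(sorted[mid+1:])}
}

// label names the traversal code of n, or the fallthrough target if n is nil.
func label(n *node) string {
	if n == nil {
		return "no_match"
	}
	return fmt.Sprintf("node_%d", n.sysno)
}

// renderTraversal emits only comparisons and jumps, keeping this part short.
func renderTraversal(n *node, out []string) []string {
	if n == nil {
		return out
	}
	out = append(out,
		fmt.Sprintf("node_%d:", n.sysno),
		fmt.Sprintf("  if sysno < %d goto %s", n.sysno, label(n.left)),
		fmt.Sprintf("  if sysno > %d goto %s", n.sysno, label(n.right)),
		fmt.Sprintf("  goto rules_%d", n.sysno),
	)
	out = renderTraversal(n.left, out)
	return renderTraversal(n.right, out)
}

// renderRules emits the (possibly long) per-syscall rules after all traversal code.
func renderRules(n *node, out []string) []string {
	if n == nil {
		return out
	}
	out = append(out,
		fmt.Sprintf("rules_%d:", n.sysno),
		fmt.Sprintf("  // rules for syscall %d", n.sysno),
	)
	out = renderRules(n.left, out)
	return renderRules(n.right, out)
}

// render lays out the whole program: traversal first, rules last.
func render(sorted []int) []string {
	root := build(sorted)
	out := renderTraversal(root, nil)
	out = renderRules(root, out)
	return append(out, "no_match:", "  // default action")
}

func main() {
	fmt.Println(strings.Join(render([]int{27, 96, 202, 263, 265}), "\n"))
}
```

In this layout every jump between `node_*` labels only spans other traversal code, so its distance does not depend on how long any syscall's rules are; only the final `goto rules_N` may be long.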

As expected, benchmarks don't show much change, except KVM, which is quite
happy for some reason:

```
                                      │    before    │                   after                   │
                                      │    sec/op    │    sec/op     vs base                     │
SentrySystrap/Postgres-48               51.43n ± 20%   50.50n ± 14%        ~ (p=0.394 n=93+92)
SentryKVM/Postgres-48                   48.15n ± 10%   40.19n ± 14%  -16.53% (p=0.012 n=96+99)
NVProxyIoctl/nvproxy-48                 65.75n ±  2%   65.43n ±  2%        ~ (p=0.703 n=99+98)
geomean                                 57.84n         57.69n         -0.49%
```

(Most other benchmarks, including `futex` and such, show no change,
which makes sense because they're part of the "hot" syscalls which
aren't in a BST at all.)

PiperOrigin-RevId: 595816325
2024-01-04 15:22:45 -08:00
Etienne Perot 326e1681e7 Improve seccomp ruleset debug logging readability.
Before (sample ruleset):

```
Hot non-trivial syscalls:
  - sysno=1: {(arg0 == 0x4) => trace (0)}
  - sysno=39[vsyscall]: {(arg0 == 0x14) => errno (0)}
  - sysno=73: {(arg0 == 0x83) => trace (0)}
  - sysno=257: {(arg0 == 0x1cf) => kill thread, (arg0 == 0x71267) => kill process}
Cold non-trivial syscalls:
  - sysno=27[vsyscall]: {(arg0 == 0x4e) => errno (0)}
  - sysno=96[vsyscall]: {(true) => errno (0)}
  - sysno=202[vsyscall]: {(true) => errno (0)}
  - sysno=263: {((arg0 high=halfEq(0x0) && (arg0 low=halfEq(0x1d8) || arg0 low=halfEq(0x73598)))) => kill process, (arg0 == 0x1c295b98) => trap (0)}
  - sysno=265: {(arg0 == 0x1d7) => kill thread, (arg0 == 0x731af) => kill process}
Trivial syscalls:
  - sysno=0: {(true) => trace (0)}
  - sysno=3: {(true) => kill thread}
  - sysno=80: {(true) => kill thread}
  - sysno=81: {(true) => kill process}
```

After (same ruleset):

```
Hot non-trivial syscalls:
  - Syscall    1: (arg[0] == 0x4) => trace
  - Vsyscall  39: (arg[0] == 0x14) => return errno=0x0
  - Syscall   73: (arg[0] == 0x83) => trace
  - Syscall  257: {(arg[0] == 0x1cf) => kill thread; (arg[0] == 0x71267) => kill process}
Cold non-trivial syscalls:
  - Vsyscall  27: (arg[0] == 0x4e) => return errno=0x0
  - Vsyscall  96: return errno=0x0
  - Vsyscall 202: return errno=0x0
  - Syscall  263: {((arg[0].high == 0 && (arg[0].low == 0x1d8 || arg[0].low == 0x73598))) => kill process; (arg[0] == 0x1c295b98) => trap}
  - Syscall  265: {(arg[0] == 0x1d7) => kill thread; (arg[0] == 0x731af) => kill process}
Trivial syscalls:
  - Syscall    0: trace
  - Syscall    3: kill thread
  - Syscall   80: kill thread
  - Syscall   81: kill process
```

PiperOrigin-RevId: 587130340
2023-12-01 15:02:32 -08:00
Etienne Perot bcbb32955e Fix seccomp debugging tip to work with precompiled filters.
Changing the default action in `seccomp.Install` now does nothing,
because the rules are never instantiated when they are loaded from
a precompiled state.

Moving the debugging instructions to the `filter` package also makes
more sense for debugging the Sentry, as the `seccomp` package is also
used to install other seccomp rules which are not meant to have their
default action overwritten even when debugging.

PiperOrigin-RevId: 586538349
2023-11-29 21:23:27 -08:00
Etienne Perot c11e182262 runsc: Do not precompile seccomp-bpf filters in fastbuild mode.
While precompilation is pretty fast, its presence in the build graph
removes concurrency from the `runsc` build process. This is because the
precompilation generation binary is dependency-heavy (e.g. it needs all
platform implementations), yet must be built before the `//runsc/boot/filter`
package is built, thus slowing down the `runsc` build process.

This change creates a low-dependency "stubbed" version of this generation
binary when running in `fastbuild` mode. The generated code in this mode
contains no precompiled seccomp programs; they are instead all built from
scratch on container startup. This trades off build speed vs container
startup speed.

PiperOrigin-RevId: 586520033
2023-11-29 19:45:20 -08:00
Etienne Perot d61fcf15de Optimize syscall filter rule for futex(2).
No change in behavior, but the rule now matches by AND-ing the flag bits once
rather than serially comparing them four times.

(`FUTEX_WAIT` = 0, it is the opposite of `FUTEX_WAKE`. The bitmask still
includes it for completeness.)

This also includes a small optimization: replacing `halfMaskedEqual` when
the value to match is zero with `halfNotSet`. This replaces two operations
(AND + equal) with a single "is any of these bits set" operation. The futex
rule benefits from this.
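
To make the idea concrete, here is a minimal sketch in plain Go rather than BPF. It assumes the four accepted operation values are `FUTEX_WAIT` and `FUTEX_WAKE`, each with or without `FUTEX_PRIVATE_FLAG`; that set is an assumption for illustration, and `allowedSerial`/`allowedMasked` are hypothetical names.

```go
package main

import "fmt"

// Values from the Linux futex(2) ABI.
const (
	futexWait        = 0   // FUTEX_WAIT
	futexWake        = 1   // FUTEX_WAKE
	futexPrivateFlag = 128 // FUTEX_PRIVATE_FLAG
)

// allowedSerial compares the operation against each accepted value in turn.
func allowedSerial(op uint32) bool {
	return op == futexWait ||
		op == futexWake ||
		op == (futexWait|futexPrivateFlag) ||
		op == (futexWake|futexPrivateFlag)
}

// allowedMasked performs a single "is any bit outside the accepted set
// present" check, which accepts exactly the same four values.
func allowedMasked(op uint32) bool {
	const accepted = futexWait | futexWake | futexPrivateFlag
	return op&^accepted == 0
}

func main() {
	for _, op := range []uint32{0, 1, 2, 128, 129, 130, 256} {
		fmt.Printf("op=%3d serial=%-5v masked=%v\n", op, allowedSerial(op), allowedMasked(op))
	}
}
```

Because `futexWait` is zero, including it in the mask changes nothing; it is only there to document intent, which mirrors the remark about completeness above.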

PiperOrigin-RevId: 586510742
2023-11-29 18:42:59 -08:00
Etienne Perot a880da69f2 seccomp: Optimize common 32-bit matchers away from disjunctions.
This applies the same logic as `extractRepeatedMatchers` to each half
of all `splitMatcher`s. This allows common 32-bit matchers in disjunctions
to be extracted out of them.

This is useful for almost all `EqualTo` rules, because they tend to look for
values that fit in the lower 32 bits. As such, the check for the higher 32
bits (that must be equal to zero in all cases of the disjunction) can be moved
out of the `Or`.
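
The effect can be sketched in plain Go (an illustration only; the real optimizer rewrites matcher trees, and `matchNaive`/`matchHoisted` are hypothetical names). Each candidate value fits in 32 bits, so the "high half is zero" check is common to every branch of the disjunction:

```go
package main

import "fmt"

// matchNaive mirrors an unoptimized disjunction: every branch re-checks
// both the high and the low 32 bits of the argument.
func matchNaive(arg uint64, values []uint32) bool {
	hi, lo := uint32(arg>>32), uint32(arg)
	for _, v := range values {
		if hi == 0 && lo == v {
			return true
		}
	}
	return false
}

// matchHoisted checks the shared high-half condition once, outside the
// disjunction, and only compares the low half per branch.
func matchHoisted(arg uint64, values []uint32) bool {
	if uint32(arg>>32) != 0 {
		return false
	}
	lo := uint32(arg)
	for _, v := range values {
		if lo == v {
			return true
		}
	}
	return false
}

func main() {
	values := []uint32{0x1d8, 0x73598}
	for _, arg := range []uint64{0x1d8, 0x73598, 0x1d9, 1<<32 | 0x1d8} {
		fmt.Printf("arg=%#x naive=%v hoisted=%v\n", arg, matchNaive(arg, values), matchHoisted(arg, values))
	}
}
```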

`splitMatcher` is updated to more efficiently handle the cases where either
of its branches is set to `halfAnyValue{}`.

```
              │   before    │                  after                  │
              │   sec/op    │   sec/op     vs base                    │
SentrySystrap   71.31n ± 9%   71.88n ± 5%       ~ (p=0.744 n=145+146)
SentryKVM       58.23n ± 8%   59.17n ± 5%       ~ (p=0.930 n=145)
NVProxyIoctl    92.92n ± 1%   86.14n ± 1%  -7.30% (n=145)

              │    before    │              after               │
              │  build-sec   │  build-sec   vs base             │
SentrySystrap    74.15m ± 0%   42.34m ± 0%  -42.90% (n=145+146)
SentryKVM       419.51m ± 0%   54.67m ± 0%  -86.97% (n=145)
NVProxyIoctl    1139.7m ± 0%   116.9m ± 0%  -89.75% (n=145)

              │      before       │                     after                      │
              │ compression-ratio │ compression-ratio  vs base                     │
SentrySystrap          3.443 ± 0%          3.872 ± 0%  +12.46% (p=0.000 n=145+146)
SentryKVM              3.501 ± 0%          3.983 ± 0%  +13.77% (p=0.000 n=145)
NVProxyIoctl           3.504 ± 0%          4.026 ± 0%  +14.90% (p=0.000 n=145)

              │   before    │              after              │
              │  gen-instr  │  gen-instr   vs base            │
SentrySystrap   1.618k ± 0%   1.545k ± 0%  -4.51% (n=145+146)
SentryKVM       1.733k ± 0%   1.677k ± 0%  -3.23% (n=145)
NVProxyIoctl    2.355k ± 0%   2.351k ± 0%  -0.17% (n=145)

              │   before   │              after              │
              │ opt-instr  │ opt-instr   vs base             │
SentrySystrap   470.0 ± 0%   399.0 ± 0%  -15.11% (n=145+146)
SentryKVM       495.0 ± 0%   421.0 ± 0%  -14.95% (n=145)
NVProxyIoctl    672.0 ± 0%   584.0 ± 0%  -13.10% (n=145)

              │    before    │              after               │
              │   opt-sec    │   opt-sec    vs base             │
SentrySystrap   109.70m ± 1%   91.88m ± 1%  -16.25% (n=145+146)
SentryKVM       103.24m ± 1%   90.06m ± 1%  -12.77% (n=145)
NVProxyIoctl     264.9m ± 1%   205.0m ± 1%  -22.61% (n=145)
```

PiperOrigin-RevId: 586505148
2023-11-29 18:08:54 -08:00
Etienne Perot be011b9bfe seccomp: Optimize half value matchers when possible.
For `MaskedEqual`'s matchers, this looks at the mask being matched against,
and simplifies the matcher if that mask is either 0 (in which case any value
is allowed), or full bits (in which case there is no need to run an `AND`
operation on the bits).

The same simplification applies to `halfNotSet`.
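
A minimal sketch of this kind of simplification, assuming a 32-bit matcher with the semantics `x & mask == value`. The `halfMatcher` type, its `kind` strings and `simplify` are hypothetical stand-ins, not the library's types:

```go
package main

import "fmt"

// halfMatcher is a toy 32-bit matcher.
type halfMatcher struct {
	kind  string // "any", "none", "equal" or "maskedEqual"
	mask  uint32
	value uint32
}

// simplify picks the cheapest matcher equivalent to "x & mask == value".
func simplify(mask, value uint32) halfMatcher {
	switch {
	case value&^mask != 0:
		// value has bits outside the mask, so nothing can ever match.
		return halfMatcher{kind: "none"}
	case mask == 0:
		// No bits are inspected: any value is allowed.
		return halfMatcher{kind: "any"}
	case mask == ^uint32(0):
		// All bits are inspected: the AND is redundant.
		return halfMatcher{kind: "equal", value: value}
	default:
		return halfMatcher{kind: "maskedEqual", mask: mask, value: value}
	}
}

// matches evaluates the simplified matcher against x.
func (m halfMatcher) matches(x uint32) bool {
	switch m.kind {
	case "any":
		return true
	case "none":
		return false
	case "equal":
		return x == m.value
	default:
		return x&m.mask == m.value
	}
}

func main() {
	fmt.Println(simplify(0, 0).kind)
	fmt.Println(simplify(^uint32(0), 0x4e).kind)
	fmt.Println(simplify(0xff, 0x14).kind)
}
```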

Benchmarks:

```
              │   before    │                   after                    │
              │  build-sec  │  build-sec    vs base                      │
SentrySystrap   34.10m ± 0%    74.15m ± 0%  +117.49% (p=0.000 n=144+145)
SentryKVM       142.2m ± 0%    419.5m ± 0%  +195.08% (p=0.000 n=144+145)
NVProxyIoctl    380.0m ± 0%   1139.7m ± 0%  +199.96% (p=0.000 n=144+145)

              │      before       │                     after                     │
              │ compression-ratio │ compression-ratio  vs base                    │
SentrySystrap          3.252 ± 0%          3.443 ± 0%  +5.87% (p=0.000 n=144+145)
SentryKVM              3.318 ± 0%          3.501 ± 0%  +5.52% (p=0.000 n=144+145)
NVProxyIoctl           3.298 ± 0%          3.504 ± 0%  +6.25% (p=0.000 n=144+145)

              │   before    │                  after                  │
              │  gen-instr  │  gen-instr   vs base                    │
SentrySystrap   1.538k ± 0%   1.618k ± 0%  +5.20% (p=0.000 n=144+145)
SentryKVM       1.649k ± 0%   1.733k ± 0%  +5.09% (p=0.000 n=144+145)
NVProxyIoctl    2.233k ± 0%   2.355k ± 0%  +5.46% (p=0.000 n=144+145)

              │   before   │             after              │
              │ opt-instr  │ opt-instr   vs base            │
SentrySystrap   473.0 ± 0%   470.0 ± 0%  -0.63% (n=144+145)
SentryKVM       497.0 ± 0%   495.0 ± 0%  -0.40% (n=144+145)
NVProxyIoctl    677.0 ± 0%   672.0 ± 0%  -0.74% (n=144+145)

              │   before    │                  after                  │
              │   opt-sec   │   opt-sec    vs base                    │
SentrySystrap   109.8m ± 0%   109.7m ± 1%       ~ (p=0.179 n=144+145)
SentryKVM       104.8m ± 0%   103.2m ± 1%  -1.44% (p=0.000 n=144+145)
NVProxyIoctl    265.7m ± 0%   264.9m ± 1%       ~ (p=0.115 n=144+145)
```

PiperOrigin-RevId: 586484636
2023-11-29 16:27:20 -08:00
Etienne Perot 5d45603a55 seccomp: Make extractRepeatedMatchers more efficient.
This does the following:

- Only allocate maps once.
- Check whether the filter can run before doing any expensive allocation
  or map modifications.
- Recursively optimize other arguments earlier on.
- Replace `PerArg.Copy` with a specialized version.

This helps make this function more efficient in `gotsan` mode.

PiperOrigin-RevId: 586382284
2023-11-29 10:25:10 -08:00
Etienne Perot 14e291014a seccomp: Optimize PerArg rules.
This adds a few `SyscallRule` optimizers aimed at reducing the work done
by `PerArg` disjunctions.

It identifies the matchers common to all branches of `Or` rules and extracts
them where possible.

Benchmark results:

```
              │   before    │                  after                  │
              │   sec/op    │   sec/op     vs base                    │
SentrySystrap   69.91n ± 4%   69.81n ± 8%       ~ (p=0.685 n=565+144)
SentryKVM       57.80n ± 3%   58.20n ± 9%       ~ (p=0.768 n=568+144)
NVProxyIoctl    99.18n ± 1%   90.46n ± 1%  -8.79% (n=570+144)

              │   before    │                   after                    │
              │  build-sec  │  build-sec    vs base                      │
SentrySystrap   13.76m ± 0%    34.10m ± 0%  +147.71% (p=0.000 n=570+144)
SentryKVM       16.57m ± 0%   142.17m ± 0%  +758.07% (p=0.000 n=570+144)
NVProxyIoctl    42.55m ± 0%   379.96m ± 0%  +792.99% (p=0.000 n=570+144)

              │      before       │                     after                      │
              │ compression-ratio │ compression-ratio  vs base                     │
SentrySystrap          2.678 ± 0%          3.252 ± 0%  +21.43% (p=0.000 n=570+144)
SentryKVM              2.630 ± 0%          3.318 ± 0%  +26.16% (p=0.000 n=570+144)
NVProxyIoctl           2.275 ± 0%          3.298 ± 0%  +44.97% (p=0.000 n=570+144)

              │   before    │                  after                   │
              │  gen-instr  │  gen-instr   vs base                     │
SentrySystrap   1.288k ± 0%   1.538k ± 0%  +19.41% (p=0.000 n=570+144)
SentryKVM       1.373k ± 0%   1.649k ± 0%  +20.10% (p=0.000 n=570+144)
NVProxyIoctl    2.250k ± 0%   2.233k ± 0%   -0.76% (n=570+144)

              │   before   │              after              │
              │ opt-instr  │ opt-instr   vs base             │
SentrySystrap   481.0 ± 0%   473.0 ± 0%   -1.66% (n=570+144)
SentryKVM       522.0 ± 0%   497.0 ± 0%   -4.79% (n=570+144)
NVProxyIoctl    989.0 ± 0%   677.0 ± 0%  -31.55% (n=570+144)

              │   before    │                  after                   │
              │   opt-sec   │   opt-sec    vs base                     │
SentrySystrap   108.2m ± 0%   109.8m ± 0%   +1.46% (p=0.000 n=570+144)
SentryKVM       101.0m ± 0%   104.8m ± 0%   +3.68% (p=0.000 n=570+144)
NVProxyIoctl    396.6m ± 0%   265.7m ± 0%  -33.01% (n=570+144)
```

PiperOrigin-RevId: 586196073
2023-11-28 21:36:07 -08:00
Jamie Liu 9cf9d1d01d Avoid redundant allocation in seccomp.optimizeSyscallRuleFunc().
PiperOrigin-RevId: 586120281
2023-11-28 15:25:20 -08:00
Etienne Perot f221e212aa seccomp: Make SyscallRules.Copy do a deep copy.
This adds a `Copy` function to the `SyscallRule` interface, so that syscall
rules can be deeply copied.

This is useful in the context of precompiled rules, which can be precompiled
in parallel and where some optimizers will modify the `SyscallRule` objects
themselves (e.g. removing the `MatchAll` rules from an `Or` rule). When this
happens in parallel, this causes a race of two goroutines writing to the same
slice when the rules are based on a similar source.
Each goroutine should be dealing with its own set of rules, hence making
`Copy` a deep copy.
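
The hazard and the fix can be sketched as follows. This is a simplified model: the `Rule` interface, the `EqualTo`, `MatchAll` and `Or` types, and `dropMatchAll` are hypothetical stand-ins for the real `SyscallRule` types and optimizers.

```go
package main

import (
	"fmt"
	"sync"
)

// Rule is a toy rule interface with a deep Copy method.
type Rule interface{ Copy() Rule }

// MatchAll matches anything.
type MatchAll struct{}

func (MatchAll) Copy() Rule { return MatchAll{} }

// EqualTo matches a single value.
type EqualTo uint64

func (e EqualTo) Copy() Rule { return e }

// Or is a disjunction of rules backed by a slice.
type Or []Rule

// Copy allocates a new backing array and copies every child recursively.
func (o Or) Copy() Rule {
	c := make(Or, len(o))
	for i, r := range o {
		c[i] = r.Copy()
	}
	return c
}

// dropMatchAll filters in place: it writes into o's backing array, which is
// exactly what races if two goroutines share that array.
func dropMatchAll(o Or) Or {
	kept := o[:0]
	for _, r := range o {
		if _, ok := r.(MatchAll); !ok {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	shared := Or{EqualTo(1), MatchAll{}, EqualTo(2)}
	results := make([]Or, 4)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine optimizes its own deep copy.
			results[i] = dropMatchAll(shared.Copy().(Or))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(shared), len(results[0]))
}
```

Calling `dropMatchAll(shared)` directly from several goroutines would be flagged by the race detector, since each call rewrites the same backing array.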

PiperOrigin-RevId: 584169637
2023-11-20 17:25:50 -08:00
Etienne Perot 2bc70b209b seccomp: Don't treat variables that are optimized away as unused.
Prior to this change, the Sentry seccomp filters use the program's PID
as a filter for the `tgkill(2)` system call. However, some platforms expand
this filter to allow any PID; the optimizer detects this and optimizes
the initial filter away, so the PID ends up being treated as an unused
variable. This change addresses this by detecting this situation and
allowing unused variables in optimized bytecode, so long as they do show
up in non-optimized bytecode.

Also add some small helper functions for using 64-bit variables.

PiperOrigin-RevId: 583222206
2023-11-16 18:04:59 -08:00
Etienne Perot 0a3bced479 Add tooling to compile seccomp-bpf programs at bazel build time.
This adds a `precompiledseccomp` library which provides tooling to compile
`seccomp-bpf` programs and generate Go source code that contains the
resulting bytecode embedded into it. In turn, this bytecode can be used in
Go libraries.

This avoids spending time compiling and optimizing `seccomp-bpf` programs
at runsc container creation time.

This library also contains support for "variables", which are `uint32`s whose
values are part of the seccomp filters but only known at runtime. To support
this, the program is compiled twice with placeholder values for these
variables, and we verify that the offsets at which these values show up in the
bytecode are consistent across these two compilation attempts.
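
The double-compilation check might look roughly like the sketch below. It uses a toy program representation (a slice of `uint32` words), and all names (`variableOffsets`, `offsetsOf`, `patch`, `compileTgkill`) are hypothetical. Presumably the point of compiling twice is that a single placeholder could coincide with an unrelated constant, whereas two different placeholders expose such a collision.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// offsetsOf returns every index at which v appears in prog.
func offsetsOf(prog []uint32, v uint32) []int {
	var offs []int
	for i, w := range prog {
		if w == v {
			offs = append(offs, i)
		}
	}
	return offs
}

// variableOffsets compiles twice with distinct placeholders and verifies
// that the placeholder shows up at the same offsets both times, and that
// the two programs are otherwise identical.
func variableOffsets(compile func(v uint32) []uint32, p1, p2 uint32) ([]int, error) {
	a, b := compile(p1), compile(p2)
	oa, ob := offsetsOf(a, p1), offsetsOf(b, p2)
	if len(a) != len(b) || !slices.Equal(oa, ob) {
		return nil, errors.New("placeholder offsets differ between compilations")
	}
	if len(oa) == 0 {
		return nil, errors.New("variable does not appear in the program")
	}
	for i := range a {
		if a[i] != b[i] && !slices.Contains(oa, i) {
			return nil, errors.New("programs differ outside of variable offsets")
		}
	}
	return oa, nil
}

// patch returns a copy of prog with the runtime value written at offs.
func patch(prog []uint32, offs []int, v uint32) []uint32 {
	out := slices.Clone(prog)
	for _, i := range offs {
		out[i] = v
	}
	return out
}

// compileTgkill is a stand-in compiler that embeds a PID twice (toy encoding).
func compileTgkill(pid uint32) []uint32 {
	return []uint32{0x20, 0x15, pid, 0x06, 0x15, pid, 0x06}
}

func main() {
	const p1, p2 = 0x11111111, 0x22222222
	offs, err := variableOffsets(compileTgkill, p1, p2)
	if err != nil {
		panic(err)
	}
	fmt.Println(offs, patch(compileTgkill(p1), offs, 4242))
}
```

At runtime, only the `patch` step is needed: the precompiled words are copied and the real value is written at the recorded offsets.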

PiperOrigin-RevId: 583117683
2023-11-16 12:00:44 -08:00
Etienne Perot 201a046299 seccomp: Enforce that Sentry filters match against reference program.
This change adds a `filter_fuzz_golden.bpf` BPF program that was generated
manually prior to my recent set of changes to seccomp bytecode generation
and rule optimization. It represents the "reference logic"; the new test
verifies that the current seccomp-bpf library produces BPF bytecode that
has the same behavior, using fuzz testing with full line-based coverage.

PiperOrigin-RevId: 582914572
2023-11-15 22:38:00 -08:00
Etienne Perot 6eed17ce4b seccomp: Add fuzz test for Sentry syscall filters.
This ensures that the optimized and unoptimized seccomp programs
are equivalent in behavior. Full coverage is enforced on the optimized
program. It cannot be enforced on the unoptimized program, because it
naturally ends up generating code that can never be satisfied.
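
The shape of such a test can be sketched with two plain Go functions standing in for the unoptimized and optimized programs. This is a toy model under stated assumptions: the real test executes BPF bytecode, and the rules, input pools and names here (`reference`, `optimized`, `fuzz`) are invented for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// reference is the straightforward, unoptimized decision logic.
func reference(sysno uint32, arg0 uint64) string {
	switch {
	case sysno == 3:
		return "allow"
	case sysno == 1 && arg0 == 4:
		return "allow"
	}
	return "kill"
}

const numBranches = 5

// optimized is a restructured version; it records which branch decided.
func optimized(sysno uint32, arg0 uint64, hit []bool) string {
	if sysno == 3 {
		hit[0] = true
		return "allow"
	}
	if sysno != 1 {
		hit[1] = true
		return "kill"
	}
	if arg0>>32 != 0 {
		hit[2] = true
		return "kill"
	}
	if uint32(arg0) == 4 {
		hit[3] = true
		return "allow"
	}
	hit[4] = true
	return "kill"
}

// fuzz feeds both versions the same inputs, fails on any disagreement,
// and then requires every branch of the optimized version to be reached.
func fuzz(seed int64, iterations int) error {
	r := rand.New(rand.NewSource(seed))
	sysnos := []uint32{0, 1, 3, 257}
	args := []uint64{0, 4, 4 | 1<<32, 1 << 63}
	hit := make([]bool, numBranches)
	for i := 0; i < iterations; i++ {
		// Mostly draw from "interesting" values, sometimes fully random ones.
		sysno, arg0 := sysnos[r.Intn(len(sysnos))], args[r.Intn(len(args))]
		if r.Intn(4) == 0 {
			sysno, arg0 = r.Uint32(), r.Uint64()
		}
		if want, got := reference(sysno, arg0), optimized(sysno, arg0, hit); want != got {
			return fmt.Errorf("sysno=%d arg0=%#x: reference=%s optimized=%s", sysno, arg0, want, got)
		}
	}
	for b, ok := range hit {
		if !ok {
			return fmt.Errorf("branch %d of the optimized program was never reached", b)
		}
	}
	return nil
}

func main() {
	fmt.Println(fuzz(1, 10000))
}
```

Coverage is only tracked on the optimized side, matching the reasoning above: an unoptimized program may legitimately contain branches that no input can reach.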

PiperOrigin-RevId: 582892981
2023-11-15 20:32:34 -08:00
Etienne Perot e6979cb4d6 seccomp: Reorder generated syscall rules for better efficiency.
This subdivides all `RuleSet`s into single-syscall rulesets, and then
classifies them depending on:

- Whether they are "trivial" or not, where "trivial" means that the syscall
  rules do not perform any verification of the syscall arguments or RIP.
- Whether they are marked "hot" or not, where "hot" means "expected to be
  frequently called".

It then orders the program as follows:

- All hot non-trivial rules go first. This makes it so that the host kernel
  can clear the syscall faster for frequently-called syscalls. These are
  checked linearly, as they tend to follow a Pareto distribution in terms of
  frequency. If they need a vsyscall check, that check is added individually.
- All cold rules go next, and form a BST. This mimics the structure of the BST
  construction that existed prior to this change.
- Lastly, all the trivial syscalls are added as a last BST.

This speeds up rule evaluation because it maximizes the use of Linux's
seccomp cache for trivial syscalls. These are therefore only ever checked
once, so they can stay at the "bottom" of the program.
All remaining (non-trivial) syscalls are ordered such that hot syscalls
are checked first, and then cold syscalls are checked with a BST.
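
The classification step can be sketched as below. This is a simplified model, not the library's code: `ruleSet`, `layout` and `arrange` are hypothetical names, and the hot list is assumed to already be in most-frequent-first order.

```go
package main

import (
	"fmt"
	"sort"
)

// ruleSet is a single-syscall ruleset with its two classification bits.
type ruleSet struct {
	sysno   int
	trivial bool // no argument or RIP checks
	hot     bool // expected to be called frequently
}

// layout lists syscall numbers in the order the program would check them.
type layout struct {
	hot     []int // checked linearly, in the order given
	cold    []int // sorted, to be rendered as a BST
	trivial []int // sorted, rendered as a final BST
}

// arrange splits rulesets into the three groups described above.
func arrange(sets []ruleSet) layout {
	var l layout
	for _, s := range sets {
		switch {
		case s.trivial:
			// Trivial rules go last regardless of hotness.
			l.trivial = append(l.trivial, s.sysno)
		case s.hot:
			l.hot = append(l.hot, s.sysno)
		default:
			l.cold = append(l.cold, s.sysno)
		}
	}
	sort.Ints(l.cold)
	sort.Ints(l.trivial)
	return l
}

func main() {
	l := arrange([]ruleSet{
		{sysno: 265},
		{sysno: 1, hot: true},
		{sysno: 81, trivial: true},
		{sysno: 27},
		{sysno: 0, trivial: true, hot: true},
		{sysno: 257, hot: true},
	})
	fmt.Println("hot:", l.hot, "cold:", l.cold, "trivial:", l.trivial)
}
```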

This is a complex and security-sensitive change, but fuzz testing with full
branch coverage has shown that this has the exact same behavior as a BPF
program taken from before any of my recent seccomp/BPF changes (other than
the one adding non-negative FD checks to all `ioctl(2)` system calls).

Some benchmark results ("orig" is the state before this change):

```
                                      │     orig      │                  reordered                   │
                                      │    sec/op     │    sec/op      vs base                       │
SentrySystrap/futex                      79.44n ±  2%    73.93n ±  2%   -6.93% (n=729+722)
SentrySystrap/nanosleep                  112.3n ± 12%    107.2n ± 12%        ~ (p=0.505 n=482+477)
SentrySystrap/sendmmsg                   88.50n ±  1%    81.62n ±  1%   -7.78% (n=729+722)
SentrySystrap/fstat                      30.80n ±  2%    30.63n ±  3%        ~ (p=0.903 n=722+712)
[...]
SentrySystrap/Postgres-48                64.30n ±  5%    61.74n ±  6%   -3.97% (p=0.039 n=376+377)
```

PiperOrigin-RevId: 582808055
2023-11-15 14:32:12 -08:00
Etienne Perot e671a64c47 seccomp: Add basic PerArg optimizations.
PiperOrigin-RevId: 582387222
2023-11-14 11:31:28 -08:00
Etienne Perot c46ffacf2f Separate out rule optimizers from main syscall rendering.
This does an optimization pass over `RuleSet`s prior to rendering anything,
rather than optimizing at the last minute for each syscall.

PiperOrigin-RevId: 581671748
2023-11-12 00:47:32 -08:00