gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Michael Pratt	46833fbeee	Allow prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) through syscall filters As of https://go.dev/cl/646095, the Go runtime calls prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) when mapping memory to annotate mappings in /proc/self/maps. Since this is a system call made throughout the application lifetime, it needs to be allowed through the system call filters. PiperOrigin-RevId: 734182524	2025-03-06 09:54:22 -08:00
Etienne Perot	b7334af658	Internal change. PiperOrigin-RevId: 686612361	2024-10-16 13:07:09 -07:00
Jing Chen	a093ad0450	Simplify and format gVisor codebase. The changes are just output of `gofmt -s -w .`.	2024-10-13 00:50:32 -07:00
Koichi Shiraishi	0cf77c02f8	all: remove use of deprecated io/ioutil package & fix some deprecated things Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>	2024-10-10 20:36:24 +09:00
Konstantin Bogomolov	0760a3df59	kvm: reduce stack usage Debug build functions use more stack space than normal, such that the KVM-nosplit function call chain doesn't fit. This patch replaces calls into unix.RawSyscall* functions with variants that do not grow the stack, and inlines some functions in ring0/pagetables in order to reduce stack usage. Additionally, seccompMmapHandler is no longer used in debug builds, so that the call chain fits within the nosplit stack size requirements. PiperOrigin-RevId: 679774881	2024-09-27 16:58:17 -07:00
Etienne Perot	11efa60e01	Display list of precompiled seccomp-bpf programs in debug logs. See [this comment](https://github.com/freedomofpress/dangerzone/pull/590#issuecomment-2149642086) for context. PiperOrigin-RevId: 643463161	2024-06-14 15:07:41 -07:00
Jing Chen	cf5c4c9cbf	Replace `reflect.DeepEqual` with `[slices/maps].Equal`. They are faster on slice/map comparisons. PiperOrigin-RevId: 633080355	2024-05-12 21:20:18 -07:00
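As a rough illustration of the swap described in the commit above, the sketch below compares `reflect.DeepEqual` with the typed `slices.Equal` and `maps.Equal` helpers from the Go standard library (Go 1.21 or later). The helper names `compareSlices` and `compareMaps` are hypothetical and are not taken from the gVisor tree.

```go
package main

import (
	"fmt"
	"maps"
	"reflect"
	"slices"
)

// compareSlices (hypothetical helper) reports the result of the
// reflection-based and the typed comparison for two int slices.
func compareSlices(a, b []int) (deep, typed bool) {
	return reflect.DeepEqual(a, b), slices.Equal(a, b)
}

// compareMaps (hypothetical helper) does the same for two maps.
func compareMaps(a, b map[string]int) (deep, typed bool) {
	return reflect.DeepEqual(a, b), maps.Equal(a, b)
}

func main() {
	fmt.Println(compareSlices([]int{1, 2, 3}, []int{1, 2, 3}))
	fmt.Println(compareMaps(map[string]int{"read": 0}, map[string]int{"read": 0}))
	// The two styles disagree on nil versus empty: DeepEqual treats them as
	// different, while slices.Equal only compares length and elements.
	fmt.Println(compareSlices(nil, []int{}))
}
```

The nil-versus-empty case is worth checking when migrating call sites, since it is one place where the two comparisons are not interchangeable.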
Etienne Perot	1e61310ce6	`seccomp-bpf`: Render syscall rules after binary search tree traversal code. Structure before: ``` if sysno < current node value: goto left_node if sysno > current node value: goto right_node // Render rules for current sysno here... left_node: // Recursively render left node code here... right_node: // Recursively render right node code here... ``` This is fine, but if the "render rules for current sysno" part is large enough for any syscall in the BST, this makes the jumps to `left_node` and `right_node` of the current node (and all its ancestor nodes) have to use unconditional jumps (i.e. extra instructions) during BST traversal. Since BST traversal must be fast, it is better to keep all the rules for each syscall separate from the BST traversal code. This ensures that the BST traversal code all fits in the maximum conditional jump size (255 instructions), and then we do just one possibly-unconditional jump to the set of rules for that syscall. Compare that to the previous structure, where multiple jumps during BST traversal could have been unconditional jumps. Structure after: ``` if sysno < current node value: goto left_node if sysno > current node value: goto right_node goto sysno_rules left_node: // Recursively render left node traversal code here... right_node: // Recursively render right node traversal code here... sysno_rules: // Render rules for current sysno here... // Recursively render left node syscall rules code here... // Recursively render right node syscall rules code here... ``` With all the optimizations done in previous CLs, BSTs in practice are actually small enough that both the traversal and the syscall rules together all fit under 255 instructions, so this only rarely comes into play. However, as we add more syscalls and syscall rules, the effect of this optimization should increase. This actually results in slightly larger bytecode (because most syscall filter rules do fit in just a few instructions, but now they have to be grouped at the end which takes a bigger jump to reach), but execution is still faster, because only one unconditional jump is ever done per program execution. As expected, benchmarks don't show much change, except KVM which is quite happy for some reason: ``` │ before │ after │ │ sec/op │ sec/op vs base │ SentrySystrap/Postgres-48 51.43n ± 20% 50.50n ± 14% ~ (p=0.394 n=93+92) SentryKVM/Postgres-48 48.15n ± 10% 40.19n ± 14% -16.53% (p=0.012 n=96+99) NVProxyIoctl/nvproxy-48 65.75n ± 2% 65.43n ± 2% ~ (p=0.703 n=99+98) geomean 57.84n 57.69n -0.49% ``` (Most other benchmarks including `futex` and such show no change, which makes sense because they're part of the "hot" syscalls which aren't in a BST at all.) PiperOrigin-RevId: 595816325	2024-01-04 15:22:45 -08:00
Etienne Perot	326e1681e7	Improve seccomp ruleset debug logging readability. Before (sample ruleset): ``` Hot non-trivial syscalls: - sysno=1: {(arg0 == 0x4) => trace (0)} - sysno=39[vsyscall]: {(arg0 == 0x14) => errno (0)} - sysno=73: {(arg0 == 0x83) => trace (0)} - sysno=257: {(arg0 == 0x1cf) => kill thread, (arg0 == 0x71267) => kill process} Cold non-trivial syscalls: - sysno=27[vsyscall]: {(arg0 == 0x4e) => errno (0)} - sysno=96[vsyscall]: {(true) => errno (0)} - sysno=202[vsyscall]: {(true) => errno (0)} - sysno=263: {((arg0 high=halfEq(0x0) && (arg0 low=halfEq(0x1d8) || arg0 low=halfEq(0x73598)))) => kill process, (arg0 == 0x1c295b98) => trap (0)} - sysno=265: {(arg0 == 0x1d7) => kill thread, (arg0 == 0x731af) => kill process} Trivial syscalls: - sysno=0: {(true) => trace (0)} - sysno=3: {(true) => kill thread} - sysno=80: {(true) => kill thread} - sysno=81: {(true) => kill process} ``` After (same ruleset): ``` Hot non-trivial syscalls: - Syscall 1: (arg[0] == 0x4) => trace - Vsyscall 39: (arg[0] == 0x14) => return errno=0x0 - Syscall 73: (arg[0] == 0x83) => trace - Syscall 257: {(arg[0] == 0x1cf) => kill thread; (arg[0] == 0x71267) => kill process} Cold non-trivial syscalls: - Vsyscall 27: (arg[0] == 0x4e) => return errno=0x0 - Vsyscall 96: return errno=0x0 - Vsyscall 202: return errno=0x0 - Syscall 263: {((arg[0].high == 0 && (arg[0].low == 0x1d8 || arg[0].low == 0x73598))) => kill process; (arg[0] == 0x1c295b98) => trap} - Syscall 265: {(arg[0] == 0x1d7) => kill thread; (arg[0] == 0x731af) => kill process} Trivial syscalls: - Syscall 0: trace - Syscall 3: kill thread - Syscall 80: kill thread - Syscall 81: kill process ``` PiperOrigin-RevId: 587130340	2023-12-01 15:02:32 -08:00
Etienne Perot	bcbb32955e	Fix seccomp debugging tip to work with precompiled filters. Changing the default action in `seccomp.Install` now does nothing, because the rules are never instantiated when they are loaded from a precompiled state. Moving the debugging instructions to the `filter` package also makes more sense for debugging the Sentry, as the `seccomp` package is also used to install other seccomp rules which are not meant to have their default action overwritten even when debugging. PiperOrigin-RevId: 586538349	2023-11-29 21:23:27 -08:00
Etienne Perot	c11e182262	`runsc`: Do not precompile seccomp-bpf filters in `fastbuild` mode. While precompilation is pretty fast, its presence in the build graph removes concurrency from the `runsc` build process. This is because the precompilation generation binary is dependency-heavy (e.g. it needs all platform implementations), yet must be built before the `//runsc/boot/filter` package is built, thus slowing down the `runsc` build process. This change creates a low-dependency "stubbed" version of this generation binary when running in `fastbuild` mode. The generated code in this mode contains no precompiled seccomp programs; they are instead all built from scratch on container startup. This trades off build speed vs container startup speed. PiperOrigin-RevId: 586520033	2023-11-29 19:45:20 -08:00
Etienne Perot	d61fcf15de	Optimize syscall filter rule for `futex(2)`. No change in behavior, but now it matches on the AND'd flag bits once rather than serially comparing them four times. (`FUTEX_WAIT` = 0; it is the opposite of `FUTEX_WAKE`. The bitmask still includes it for completeness.) This also includes a small optimization: replacing `halfMaskedEqual` with `halfNotSet` when the value to match is zero. This replaces two operations (AND + equal) with a single "is any of these bits set" operation. The futex rule benefits from this. PiperOrigin-RevId: 586510742	2023-11-29 18:42:59 -08:00
Etienne Perot	a880da69f2	`seccomp`: Optimize common 32-bit matchers away from disjunctions. This applies the same logic as `extractRepeatedMatchers` to each half of all `splitMatcher`s. This allows common 32-bit matchers in disjunctions to be extracted out of them. This is useful for almost all `EqualTo` rules, because they tend to look for values that fit in the lower 32 bits. As such, the check for the higher 32 bits (that must be equal to zero in all cases of the disjunction) can be moved out of the `Or`. `splitMatcher` is updated to more efficiently handle the cases where either of its branches are set to `halfAnyValue{}`. ``` │ before │ after │ │ sec/op │ sec/op vs base │ SentrySystrap 71.31n ± 9% 71.88n ± 5% ~ (p=0.744 n=145+146) SentryKVM 58.23n ± 8% 59.17n ± 5% ~ (p=0.930 n=145) NVProxyIoctl 92.92n ± 1% 86.14n ± 1% -7.30% (n=145) │ before │ after │ │ build-sec │ build-sec vs base │ SentrySystrap 74.15m ± 0% 42.34m ± 0% -42.90% (n=145+146) SentryKVM 419.51m ± 0% 54.67m ± 0% -86.97% (n=145) NVProxyIoctl 1139.7m ± 0% 116.9m ± 0% -89.75% (n=145) │ before │ after │ │ compression-ratio │ compression-ratio vs base │ SentrySystrap 3.443 ± 0% 3.872 ± 0% +12.46% (p=0.000 n=145+146) SentryKVM 3.501 ± 0% 3.983 ± 0% +13.77% (p=0.000 n=145) NVProxyIoctl 3.504 ± 0% 4.026 ± 0% +14.90% (p=0.000 n=145) │ before │ after │ │ gen-instr │ gen-instr vs base │ SentrySystrap 1.618k ± 0% 1.545k ± 0% -4.51% (n=145+146) SentryKVM 1.733k ± 0% 1.677k ± 0% -3.23% (n=145) NVProxyIoctl 2.355k ± 0% 2.351k ± 0% -0.17% (n=145) │ before │ after │ │ opt-instr │ opt-instr vs base │ SentrySystrap 470.0 ± 0% 399.0 ± 0% -15.11% (n=145+146) SentryKVM 495.0 ± 0% 421.0 ± 0% -14.95% (n=145) NVProxyIoctl 672.0 ± 0% 584.0 ± 0% -13.10% (n=145) │ before │ after │ │ opt-sec │ opt-sec vs base │ SentrySystrap 109.70m ± 1% 91.88m ± 1% -16.25% (n=145+146) SentryKVM 103.24m ± 1% 90.06m ± 1% -12.77% (n=145) NVProxyIoctl 264.9m ± 1% 205.0m ± 1% -22.61% (n=145) ``` PiperOrigin-RevId: 586505148	2023-11-29 18:08:54 -08:00
Etienne Perot	be011b9bfe	`seccomp`: Optimize half value matchers when possible. For `MaskedEqual`'s matchers, this looks at the mask being matched against, and simplifies the matcher if that mask is either 0 (in which case any value is allowed), or full bits (in which case there is no need to run an `AND` operation on the bits). Similar thing for `halfNotSet`. Benchmarks: ``` │ before │ after │ │ build-sec │ build-sec vs base │ SentrySystrap 34.10m ± 0% 74.15m ± 0% +117.49% (p=0.000 n=144+145) SentryKVM 142.2m ± 0% 419.5m ± 0% +195.08% (p=0.000 n=144+145) NVProxyIoctl 380.0m ± 0% 1139.7m ± 0% +199.96% (p=0.000 n=144+145) │ before │ after │ │ compression-ratio │ compression-ratio vs base │ SentrySystrap 3.252 ± 0% 3.443 ± 0% +5.87% (p=0.000 n=144+145) SentryKVM 3.318 ± 0% 3.501 ± 0% +5.52% (p=0.000 n=144+145) NVProxyIoctl 3.298 ± 0% 3.504 ± 0% +6.25% (p=0.000 n=144+145) │ before │ after │ │ gen-instr │ gen-instr vs base │ SentrySystrap 1.538k ± 0% 1.618k ± 0% +5.20% (p=0.000 n=144+145) SentryKVM 1.649k ± 0% 1.733k ± 0% +5.09% (p=0.000 n=144+145) NVProxyIoctl 2.233k ± 0% 2.355k ± 0% +5.46% (p=0.000 n=144+145) │ before │ after │ │ opt-instr │ opt-instr vs base │ SentrySystrap 473.0 ± 0% 470.0 ± 0% -0.63% (n=144+145) SentryKVM 497.0 ± 0% 495.0 ± 0% -0.40% (n=144+145) NVProxyIoctl 677.0 ± 0% 672.0 ± 0% -0.74% (n=144+145) │ before │ after │ │ opt-sec │ opt-sec vs base │ SentrySystrap 109.8m ± 0% 109.7m ± 1% ~ (p=0.179 n=144+145) SentryKVM 104.8m ± 0% 103.2m ± 1% -1.44% (p=0.000 n=144+145) NVProxyIoctl 265.7m ± 0% 264.9m ± 1% ~ (p=0.115 n=144+145) ``` PiperOrigin-RevId: 586484636	2023-11-29 16:27:20 -08:00
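The mask-based simplification in the commit above can be sketched roughly as follows for one 32-bit half. The `halfMatcher` type and its string `kind` field are hypothetical stand-ins chosen for readability; they are not the matcher types used in the `seccomp` package.

```go
package main

import "fmt"

// halfMatcher is a hypothetical matcher over one 32-bit half of an argument.
type halfMatcher struct {
	kind  string // "any", "equal", or "maskedEqual"
	mask  uint32
	value uint32
}

// simplify picks the cheapest matcher equivalent to (x & mask) == value.
func simplify(mask, value uint32) halfMatcher {
	switch {
	case mask == 0 && value == 0:
		// No bits are inspected, so every input matches.
		return halfMatcher{kind: "any"}
	case mask == 0xffffffff:
		// Every bit is inspected, so the AND is redundant.
		return halfMatcher{kind: "equal", value: value}
	default:
		return halfMatcher{kind: "maskedEqual", mask: mask, value: value}
	}
}

// matches evaluates the simplified matcher against a concrete value.
func (m halfMatcher) matches(x uint32) bool {
	switch m.kind {
	case "any":
		return true
	case "equal":
		return x == m.value
	default:
		return x&m.mask == m.value
	}
}

func main() {
	for _, mask := range []uint32{0, 0xffffffff, 0x0000ff00} {
		fmt.Printf("mask=%#x -> %s\n", mask, simplify(mask, 0).kind)
	}
}
```

In this sketch, a zero mask with a non-zero value falls through to the general case, where it can never match. Only the zero-mask, zero-value combination is treated as "any value".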
Etienne Perot	5d45603a55	`seccomp`: Make `extractRepeatedMatchers` more efficient. This does the following: - Only allocate maps once. - Check whether the filter can run before doing any expensive allocation or map modifications. - Recursively optimize other arguments earlier on. - Replace `PerArg.Copy` with a specialized version. This helps make this function more efficient in `gotsan` mode. PiperOrigin-RevId: 586382284	2023-11-29 10:25:10 -08:00
Etienne Perot	14e291014a	`seccomp`: Optimize `PerArg` rules. This adds a few `SyscallRule` optimizers aimed at reducing the work done by `PerArg` disjunctions. It identifies the common matchers from `Or` rules and extracts them out where possible. Benchmark results: ``` │ before │ after │ │ sec/op │ sec/op vs base │ SentrySystrap 69.91n ± 4% 69.81n ± 8% ~ (p=0.685 n=565+144) SentryKVM 57.80n ± 3% 58.20n ± 9% ~ (p=0.768 n=568+144) NVProxyIoctl 99.18n ± 1% 90.46n ± 1% -8.79% (n=570+144) │ before │ after │ │ build-sec │ build-sec vs base │ SentrySystrap 13.76m ± 0% 34.10m ± 0% +147.71% (p=0.000 n=570+144) SentryKVM 16.57m ± 0% 142.17m ± 0% +758.07% (p=0.000 n=570+144) NVProxyIoctl 42.55m ± 0% 379.96m ± 0% +792.99% (p=0.000 n=570+144) │ before │ after │ │ compression-ratio │ compression-ratio vs base │ SentrySystrap 2.678 ± 0% 3.252 ± 0% +21.43% (p=0.000 n=570+144) SentryKVM 2.630 ± 0% 3.318 ± 0% +26.16% (p=0.000 n=570+144) NVProxyIoctl 2.275 ± 0% 3.298 ± 0% +44.97% (p=0.000 n=570+144) │ before │ after │ │ gen-instr │ gen-instr vs base │ SentrySystrap 1.288k ± 0% 1.538k ± 0% +19.41% (p=0.000 n=570+144) SentryKVM 1.373k ± 0% 1.649k ± 0% +20.10% (p=0.000 n=570+144) NVProxyIoctl 2.250k ± 0% 2.233k ± 0% -0.76% (n=570+144) │ before │ after │ │ opt-instr │ opt-instr vs base │ SentrySystrap 481.0 ± 0% 473.0 ± 0% -1.66% (n=570+144) SentryKVM 522.0 ± 0% 497.0 ± 0% -4.79% (n=570+144) NVProxyIoctl 989.0 ± 0% 677.0 ± 0% -31.55% (n=570+144) │ before │ after │ │ opt-sec │ opt-sec vs base │ SentrySystrap 108.2m ± 0% 109.8m ± 0% +1.46% (p=0.000 n=570+144) SentryKVM 101.0m ± 0% 104.8m ± 0% +3.68% (p=0.000 n=570+144) NVProxyIoctl 396.6m ± 0% 265.7m ± 0% -33.01% (n=570+144) ``` PiperOrigin-RevId: 586196073	2023-11-28 21:36:07 -08:00
Jamie Liu	9cf9d1d01d	Avoid redundant allocation in seccomp.optimizeSyscallRuleFunc(). PiperOrigin-RevId: 586120281	2023-11-28 15:25:20 -08:00
Etienne Perot	f221e212aa	`seccomp`: Make `SyscallRules.Copy` do a deep copy. This adds a `Copy` function to the `SyscallRule` interface, so that syscall rules can be deeply copied. This is useful in the context of precompiled rules, which can be precompiled in parallel and where some optimizers will modify the `SyscallRule` objects themselves (e.g. removing the `MatchAll` rules from an `Or` rule). When this happens in parallel, this causes a race of two goroutines writing to the same slice when the rules are based on a similar source. Each goroutine should be dealing with its own set of rules, hence making `Copy` a deep copy. PiperOrigin-RevId: 584169637	2023-11-20 17:25:50 -08:00
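To illustrate why the commit above needs a deep copy, the sketch below uses hypothetical stand-in types (`Rule`, `MatchAll`, `EqualTo`, `Or`) that only mimic the shape of the `seccomp` rule types. A shallow copy of a nested `Or` still shares the inner slice, so an in-place rewrite through the copy is visible in the original. A recursive `Copy` avoids that.

```go
package main

import "fmt"

// Rule is a hypothetical stand-in for a syscall rule that can copy itself.
type Rule interface {
	Copy() Rule
}

// MatchAll and EqualTo are leaf rules; copying them by value is enough.
type MatchAll struct{}

func (MatchAll) Copy() Rule { return MatchAll{} }

type EqualTo struct{ Value uint64 }

func (e EqualTo) Copy() Rule { return e }

// Or is a disjunction of rules backed by a slice.
type Or []Rule

// Copy allocates a new slice and recursively copies every sub-rule.
func (o Or) Copy() Rule {
	c := make(Or, len(o))
	for i, r := range o {
		c[i] = r.Copy()
	}
	return c
}

// shallowCopy duplicates only the outer slice; nested Or values still
// share their backing arrays with the original.
func shallowCopy(o Or) Or {
	c := make(Or, len(o))
	copy(c, o)
	return c
}

// mutateFirstInner stands in for an optimizer that rewrites a rule in place.
func mutateFirstInner(o Or) {
	o[0].(Or)[0] = EqualTo{Value: 99}
}

func main() {
	orig := Or{Or{EqualTo{Value: 1}, MatchAll{}}}

	mutateFirstInner(orig.Copy().(Or))
	fmt.Println("after mutating a deep copy:   ", orig[0].(Or)[0])

	mutateFirstInner(shallowCopy(orig))
	fmt.Println("after mutating a shallow copy:", orig[0].(Or)[0])
}
```

With goroutines optimizing rule sets in parallel, the shared inner slice in the shallow case is exactly the kind of state that two writers could race on.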
Etienne Perot	2bc70b209b	`seccomp`: Don't treat variables that are optimized away as unused. Prior to this change, the Sentry seccomp filters use the program's PID as a filter for the `tgkill(2)` system call. However, some platforms expand this filter to allow any PID, which the optimizer detects and optimizes the initial filter away. So the PID ends up being treated as an unused variable. This change addresses this by detecting this situation and allowing unused variables in optimized bytecode, so long as they do show up in non-optimized bytecode. Also add some small helper functions for using 64-bit variables. PiperOrigin-RevId: 583222206	2023-11-16 18:04:59 -08:00
Etienne Perot	0a3bced479	Add tooling to compile `seccomp-bpf` programs at `bazel build` time. This adds a `precompiledseccomp` library which provides tooling to compile `seccomp-bpf` programs and generate Go source code that contains the resulting bytecode embedded into it. In turn, this bytecode can be used in Go libraries. This avoids spending time compiling and optimizing `seccomp-bpf` programs at runsc container creation time. This library also contains support for "variables", which are `uint32`s whose values are part of the seccomp filters but only known at runtime. To support this, the program is compiled twice with placeholder values for these variables, and we verify that the offsets at which these values show up in the bytecode are consistent across these two compilation attempts. PiperOrigin-RevId: 583117683	2023-11-16 12:00:44 -08:00
Etienne Perot	201a046299	`seccomp`: Enforce that Sentry filters match against reference program. This change adds a `filter_fuzz_golden.bpf` BPF program that was generated manually prior to my recent set of seccomp bytecode and rule optimization changes. It represents the "reference logic"; the new test verifies that the current seccomp-bpf library produces BPF bytecode that has the same behavior, using fuzz testing with full line-based coverage. PiperOrigin-RevId: 582914572	2023-11-15 22:38:00 -08:00
Etienne Perot	6eed17ce4b	`seccomp`: Add fuzz test for Sentry syscall filters. This ensures that the optimized and unoptimized seccomp programs are equivalent in behavior. Full coverage is enforced on the optimized program. It cannot be enforced on the unoptimized program, because it naturally ends up generating code that can never be satisfied. PiperOrigin-RevId: 582892981	2023-11-15 20:32:34 -08:00
Etienne Perot	e6979cb4d6	`seccomp`: Reorder generated syscall rules for better efficiency. This subdivides all `RuleSet`s into single-syscall rulesets, and then classifies them depending on: - Whether they are "trivial" or not, where "trivial" means that the syscall rules do not perform any verification of the syscall arguments or RIP. - Whether they are marked "hot" or not, where "hot" means "expected to be frequently called". It then orders the program as follows: - All hot non-trivial rules go first. This makes it so that the host kernel can clear the syscall faster for frequently-called syscalls. These are checked linearly, as they tend to follow a Pareto distribution in terms of frequency. If they need a vsyscall check, that check is added individually. - All cold rules go next, and form a BST. This mimics the structure of the BST construction that existed prior to this change. - Lastly, all the trivial syscalls are added as a last BST. This speeds up rule evaluation because it maximizes the use of Linux's seccomp cache for trivial syscalls. These are therefore only ever checked once, so they can stay at the "bottom" of the program. All remaining (non-trivial) syscalls are ordered such that hot syscalls are checked first, and then cold syscalls are checked with a BST. This is a complex and security-sensitive change, but fuzz testing with full branch coverage has shown that this has the exact same behavior as a BPF program taken from before any of my recent seccomp/BPF changes (other than the one adding non-negative FD checks to all `ioctl(2)` system calls). Some benchmark results ("orig" is the state before this change): ``` │ orig │ reordered │ │ sec/op │ sec/op vs base │ SentrySystrap/futex 79.44n ± 2% 73.93n ± 2% -6.93% (n=729+722) SentrySystrap/nanosleep 112.3n ± 12% 107.2n ± 12% ~ (p=0.505 n=482+477) SentrySystrap/sendmmsg 88.50n ± 1% 81.62n ± 1% -7.78% (n=729+722) SentrySystrap/fstat 30.80n ± 2% 30.63n ± 3% ~ (p=0.903 n=722+712) [...] SentrySystrap/Postgres-48 64.30n ± 5% 61.74n ± 6% -3.97% (p=0.039 n=376+377) ``` PiperOrigin-RevId: 582808055	2023-11-15 14:32:12 -08:00
Etienne Perot	e671a64c47	`seccomp`: Add basic `PerArg` optimizations. PiperOrigin-RevId: 582387222	2023-11-14 11:31:28 -08:00
Etienne Perot	c46ffacf2f	Separate out rule optimizers from main syscall rendering. This does an optimization pass over `RuleSet`s prior to rendering anything, rather than optimizing at the last minute for each syscall. PiperOrigin-RevId: 581671748	2023-11-12 00:47:32 -08:00

98 Commits