Mirror of https://github.com/linux-apfs/linux-apfs.git (synced 2026-05-01 15:00:59 -07:00)
Documentation: prctl/seccomp_filter

Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 and a semi-generic
example using a macro-based code generator.

Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>

v18: - added acked by
     - update no new privs numbers
v17: - remove @compat note and add Pitfalls section for arch checking
       (keescook@chromium.org)
v16: -
v15: -
v14: - rebase/nochanges
v13: - rebase on to 88ebdda615
v12: - comment on the ptrace_event use
     - update arch support comment
     - note the behavior of SECCOMP_RET_DATA when there are multiple filters
       (keescook@chromium.org)
     - lots of samples/ clean up incl 64-bit bpf-direct support
       (markus@chromium.org)
     - rebase to linux-next
v11: - overhaul return value language, updates (keescook@chromium.org)
     - comment on do_exit(SIGSYS)
v10: - update for SIGSYS
     - update for new seccomp_data layout
     - update for ptrace option use
v9:  - updated bpf-direct.c for SIGILL
v8:  - add PR_SET_NO_NEW_PRIVS to the samples.
v7:  - updated for all the new stuff in v7: TRAP, TRACE
     - only talk about PR_SET_SECCOMP now
     - fixed bad JLE32 check (coreyb@linux.vnet.ibm.com)
     - adds dropper.c: a simple system call disabler
v6:  - tweak the language to note the requirement of
       PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
v5:  - update sample to use system call arguments
     - adds a "fancy" example using a macro-based generator
     - cleaned up bpf in the sample
     - update docs to mention arguments
     - fix prctl value (eparis@redhat.com)
     - language cleanup (rdunlap@xenotime.net)
v4:  - update for no_new_privs use
     - minor tweaks
v3:  - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
     - document use of tentative always-unprivileged
     - guard sample compilation for i386 and x86_64
v2:  - move code to samples (corbet@lwn.net)

Signed-off-by: James Morris <james.l.morris@oracle.com>
Committed by: James Morris
Parent: c6cfbeb402
Commit: 8ac270d1e2
@@ -0,0 +1,163 @@
SECure COMPuting with filters
=============================

Introduction
------------

A large number of system calls are exposed to every userland process,
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit from having a reduced set
of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: the system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers, which constrains all filters to evaluating only the system
call arguments directly.

What it isn't
-------------

System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.

Usage
-----

An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:

PR_SET_SECCOMP:
	Now takes an additional argument which specifies a new filter
	using a BPF program.
	The BPF program will be executed over struct seccomp_data
	reflecting the system call number, arguments, and other
	metadata. The BPF program must then return one of the
	acceptable values to inform the kernel which action should be
	taken.

	Usage:
		prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

	The 'prog' argument is a pointer to a struct sock_fprog which
	will contain the filter program. If the program is invalid, the
	call will return -1 and set errno to EINVAL.

	If fork/clone and execve are allowed by 'prog', any child
	processes will be constrained to the same filters and system
	call ABI as the parent.
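
As a rough illustration of the data a filter operates on, the sketch
below computes the byte offsets that a BPF load instruction would read
from. It assumes the struct seccomp_data layout exported by
<linux/seccomp.h> (nr, arch, instruction_pointer, args[6]). The macro
names and the helper arg_lo_offset() are hypothetical and not part of
any kernel header, and the endianness test relies on the GCC/Clang
__BYTE_ORDER__ macro.

```c
#include <stddef.h>
#include <linux/seccomp.h>

/* Offsets that a BPF_LD | BPF_W | BPF_ABS instruction would load from. */
#define SYSCALL_NR_OFFSET	offsetof(struct seccomp_data, nr)
#define ARCH_OFFSET		offsetof(struct seccomp_data, arch)

/*
 * Hypothetical helper: offset of the low 32 bits of system call
 * argument n (0..5). BPF loads 32-bit words, so each 64-bit argument
 * has to be examined as two halves; which half comes first depends on
 * the byte order of the machine.
 */
static inline size_t arg_lo_offset(unsigned int n)
{
	size_t base = offsetof(struct seccomp_data, args) + n * sizeof(__u64);

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	return base + sizeof(__u32);
#else
	return base;
#endif
}
```

A filter would pass these offsets as the 'k' operand of its load
instructions before comparing the loaded word against a constant.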

Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
run with CAP_SYS_ADMIN privileges in its namespace. If neither is
true, -EACCES will be returned. This requirement ensures that filter
programs cannot be applied to child processes with greater privileges
than the task that installed them.

Additionally, if prctl(2) is allowed by the attached filter,
additional filters may be layered on, which will increase evaluation
time but allow for further decreasing the attack surface during
execution of a process.

The above call returns 0 on success and non-zero on error.
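
The calls described in this section might be combined roughly as
follows. This is a minimal sketch, not the code from samples/seccomp/:
the one-instruction filter simply allows every system call, and the
names allow_all, allow_all_prog and install_filter() are made up for
illustration. The BPF_STMT macro and struct sock_fprog are assumed to
come from <linux/filter.h>.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* A one-instruction filter that allows every system call. */
static struct sock_filter allow_all[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog allow_all_prog = {
	.len = sizeof(allow_all) / sizeof(allow_all[0]),
	.filter = allow_all,
};

/*
 * Hypothetical helper: returns 0 on success, or -1 with errno set
 * (for example EINVAL for an invalid program).
 */
int install_filter(struct sock_fprog *prog)
{
	/* Needed unless the task runs with CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
}
```

A caller would invoke install_filter(&allow_all_prog) once, early in
the life of the process. Note that an installed filter cannot be
removed again.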

Return values
-------------

A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
SECCOMP_RET_KILL will always take precedence.)

In precedence order, they are:

SECCOMP_RET_KILL:
	Results in the task exiting immediately without executing the
	system call. The exit status of the task (status & 0x7f) will
	be SIGSYS, not SIGKILL.

SECCOMP_RET_TRAP:
	Results in the kernel sending a SIGSYS signal to the triggering
	task without executing the system call. The kernel will
	roll back the register state to just before the system call
	entry such that a signal handler in the task will be able to
	inspect the ucontext_t->uc_mcontext registers and emulate
	system call success or failure upon return from the signal
	handler.

	The SECCOMP_RET_DATA portion of the return value will be passed
	as si_errno.

	SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.

SECCOMP_RET_ERRNO:
	Results in the lower 16 bits of the return value being passed
	to userland as the errno without executing the system call.

SECCOMP_RET_TRACE:
	When returned, this value will cause the kernel to attempt to
	notify a ptrace()-based tracer prior to executing the system
	call. If there is no tracer present, -ENOSYS is returned to
	userland and the system call is not executed.

	A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
	using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
	of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
	the BPF program return value will be available to the tracer
	via PTRACE_GETEVENTMSG.

SECCOMP_RET_ALLOW:
	Results in the system call being executed.
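
For SECCOMP_RET_TRACE, the tracer side might look roughly like the
sketch below. The three helper names are hypothetical, the tracee is
assumed to be attached and stopped already, and the status check
assumes the usual encoding of ptrace event stops, where the event code
is placed above SIGTRAP in the waitpid() status.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Ask to be notified when a filter in the tracee returns SECCOMP_RET_TRACE. */
long enable_seccomp_events(pid_t pid)
{
	return ptrace(PTRACE_SETOPTIONS, pid, NULL,
		      (void *)(unsigned long)PTRACE_O_TRACESECCOMP);
}

/* True if a waitpid() status reports a PTRACE_EVENT_SECCOMP stop. */
int is_seccomp_stop(int status)
{
	return WIFSTOPPED(status) &&
	       (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}

/* Fetch the SECCOMP_RET_DATA value; returns -1 if ptrace() fails. */
long read_seccomp_data(pid_t pid)
{
	unsigned long msg = 0;

	if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == -1)
		return -1;
	return (long)(msg & 0xffff);
}
```

A tracer loop would call waitpid(), test the status with
is_seccomp_stop(), and then use read_seccomp_data() to decide how to
handle the pending system call.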

Precedence is only determined using the SECCOMP_RET_ACTION mask. When
multiple filters return values of the same precedence, only the
SECCOMP_RET_DATA from the most recently installed filter will be
returned.
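
The precedence rule can be modelled in userspace as below. This is an
illustrative sketch rather than the kernel's implementation:
combine_results() is a hypothetical name, and the model assumes the
constants in <linux/seccomp.h>, where a numerically lower action value
has higher precedence (SECCOMP_RET_KILL is 0 and SECCOMP_RET_ALLOW is
the largest).

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * results[0] holds the return value of the oldest filter and
 * results[n - 1] that of the most recently installed one.
 */
uint32_t combine_results(const uint32_t *results, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	/* Walk from newest to oldest; the strict "<" lets the newest win ties. */
	for (i = n; i-- > 0; ) {
		if ((results[i] & SECCOMP_RET_ACTION) <
		    (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

With two filters that both return SECCOMP_RET_ERRNO, for instance, the
errno carried in the SECCOMP_RET_DATA bits of the newer filter is the
one that reaches userland.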

Pitfalls
--------

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
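
One way to perform the arch check is to start every filter with a
short prologue like the one sketched here. The helper name
emit_arch_check() is hypothetical, and the BPF_STMT and BPF_JUMP
macros are assumed to come from <linux/filter.h>. The sketch kills the
task on a mismatch, but a filter could return a different action
instead.

```c
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: writes a three-instruction prologue into out[]
 * and returns the number of instructions written. System call number
 * checks would follow the prologue.
 */
size_t emit_arch_check(struct sock_filter *out, __u32 expected_arch)
{
	const struct sock_filter prologue[] = {
		/* A = seccomp_data.arch */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		/* If A == expected_arch, skip the next instruction. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0),
		/* Unexpected calling convention: kill the task. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	};

	memcpy(out, prologue, sizeof(prologue));
	return sizeof(prologue) / sizeof(prologue[0]);
}
```

On x86-64, for example, a caller would pass AUDIT_ARCH_X86_64 from
<linux/audit.h> as expected_arch.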

Example
-------

The samples/seccomp/ directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.

Adding architecture support
---------------------------

See arch/Kconfig for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp return
value checking. It then only needs to add CONFIG_HAVE_ARCH_SECCOMP_FILTER
to its arch-specific Kconfig.