Commit Graph

164 Commits

Author SHA1 Message Date
Linus Torvalds
bcfeebbff3 Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exit cleanups from Eric Biederman:
 "In preparation of doing something about PTRACE_EVENT_EXIT I have
  started cleaning up various pieces of code related to do_exit. Most of
  that code I did not manage to get tested and reviewed before the merge
  window opened but a handful of very useful cleanups are ready to be
  merged.

  The first change is simply the removal of the bdflush system call. The
  code has now been disabled long enough that even the oldest userspace
  working userspace setups anyone can find to test are fine with the
  bdflush system call being removed.

  Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
  calling do_exit directly is interesting only in that it is nearly the
  most difficult of the incorrect uses of do_exit to remove.

  The change to the seccomp code to simply send a signal instead of
  calling do_coredump directly is a very nice little cleanup made
  possible by realizing the existing signal sending helpers were missing
  a little bit of functionality that is easy to provide"

* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal/seccomp: Dump core when there is only one live thread
  signal/seccomp: Refactor seccomp signal and coredump generation
  signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
  exit/bdflush: Remove the deprecated bdflush system call
2021-09-01 14:52:05 -07:00
Eric W. Biederman
d21918e5a9 signal/seccomp: Dump core when there is only one live thread
Replace get_nr_threads with atomic_read(&current->signal->live) as
that is a more accurate number that is decremented sooner.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87lf6z6qbd.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-26 18:06:41 -05:00
Eric W. Biederman
307d522f5e signal/seccomp: Refactor seccomp signal and coredump generation
Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c.  The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should
be reset to SIGDFL so that a coredump will be generated when the
signal is delivered.

force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.

force_sig_info_to_task gains an extra parameter to force using
the default signal action.

With this change seccomp is no longer a special case and there
becomes exactly one place do_coredump is called from.

Further it no longer becomes necessary for __seccomp_filter
to call do_group_exit.

Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-26 10:30:12 -05:00
Hsuan-Chi Kuo
b4d8a58f8d seccomp: Fix setting loaded filter count during TSYNC
The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
exposed to userspace; it is not used for reference counting, etc.

Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com>
Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com
Co-developed-by: Wiktor Garbacz <wiktorg@google.com>
Signed-off-by: Wiktor Garbacz <wiktorg@google.com>
Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: c818c03b66 ("seccomp: Report number of loaded filters in /proc/$pid/status")
2021-08-11 11:48:28 -07:00
Rodrigo Campos
0ae71c7720 seccomp: Support atomic "addfd + send reply"
Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag.

This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and
SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's
return value, nor error can be set by the user. Upon successful invocation
of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND
flag, the notifying process's errno will be 0, and the return value will
be the file descriptor number that was installed.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me
2021-06-28 12:49:52 -07:00
Sargun Dhillon
ddc4739169 seccomp: Refactor notification handler to prepare for new semantics
This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Rodrigo Campos also identified a bug that can result in addfd causing
an early return, when the supervisor didn't actually handle the
syscall [1].

[1]: https://lore.kernel.org/lkml/20210413160151.3301-1-rodrigo@kinvolk.io/

Fixes: 7cf97b1254 ("seccomp: Introduce addfd ioctl to seccomp user notifier")
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Rodrigo Campos <rodrigo@kinvolk.io>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210517193908.3113-3-sargun@sargun.me
2021-05-29 11:13:27 -07:00
Linus Torvalds
23806a3e96 Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull receive_fd update from Al Viro:
 "Cleanup of receive_fd mess"

* 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: split receive_fd_replace from __receive_fd
2021-05-03 11:05:28 -07:00
Christoph Hellwig
42eb0d54c0 fs: split receive_fd_replace from __receive_fd
receive_fd_replace shares almost no code with the general case, so split
it out.  Also remove the "Bump the sock usage counts" comment from
both copies, as that is now what __receive_sock actually does.

[AV: ... and make the only user of receive_fd_replace() choose between
it and receive_fd() according to what userland had passed to it in
flags]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-16 00:13:04 -04:00
Cui GaoSheng
a3fc712c5b seccomp: Fix "cacheable" typo in comments
Do a trivial typo fix: s/cachable/cacheable/

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210331030724.84419-1-cuigaosheng1@huawei.com
2021-03-30 22:34:30 -07:00
wanghongzhe
a381b70a1c seccomp: Improve performace by optimizing rmb()
According to Kees's suggest, we started with the patch that just replaces
rmb() with smp_rmb() and did a performance test with UnixBench. The
results showed the overhead about 2.53% in rmb() test compared to the
smp_rmb() one, in a x86-64 kernel with CONFIG_SMP enabled running inside a
qemu-kvm vm. The test is a "syscall" testcase in UnixBench, which executes
5 syscalls in a loop during a certain timeout (100 second in our test) and
counts the total number of executions of this 5-syscall sequence. We set
a seccomp filter with all allow rule for all used syscalls in this test
(which will go bitmap path) to make sure the rmb() will be executed. The
details for the test:

with rmb():
/txm # ./syscall_allow_min 100
COUNT|35861159|1|lps
/txm # ./syscall_allow_min 100
COUNT|35545501|1|lps
/txm # ./syscall_allow_min 100
COUNT|35664495|1|lps

with smp_rmb():
/txm # ./syscall_allow_min 100
COUNT|36552771|1|lps
/txm # ./syscall_allow_min 100
COUNT|36491247|1|lps
/txm # ./syscall_allow_min 100
COUNT|36504746|1|lps

For a x86-64 kernel with CONFIG_SMP enabled, the smp_rmb() is just a
compiler barrier() which have no impact in runtime, while rmb() is a
lfence which will prevent all memory access operations (not just load
according the recently claim by Intel) behind itself. We can also figure
it out in disassembly:

with rmb():
0000000000001430 <__seccomp_filter>:
    1430:   41 57                   push   %r15
    1432:   41 56                   push   %r14
    1434:   41 55                   push   %r13
    1436:   41 54                   push   %r12
    1438:   55                      push   %rbp
    1439:   53                      push   %rbx
    143a:   48 81 ec 90 00 00 00    sub    $0x90,%rsp
    1441:   89 7c 24 10             mov    %edi,0x10(%rsp)
    1445:   89 54 24 14             mov    %edx,0x14(%rsp)
    1449:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
    1450:   00 00
    1452:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
    1459:   00
    145a:   31 c0                   xor    %eax,%eax
*   145c:   0f ae e8                lfence
    145f:   48 85 f6                test   %rsi,%rsi
    1462:   49 89 f4                mov    %rsi,%r12
    1465:   0f 84 42 03 00 00       je     17ad <__seccomp_filter+0x37d>
    146b:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
    1472:   00 00
    1474:   48 8b 98 80 07 00 00    mov    0x780(%rax),%rbx
    147b:   48 85 db                test   %rbx,%rbx

with smp_rmb();
0000000000001430 <__seccomp_filter>:
    1430:   41 57                   push   %r15
    1432:   41 56                   push   %r14
    1434:   41 55                   push   %r13
    1436:   41 54                   push   %r12
    1438:   55                      push   %rbp
    1439:   53                      push   %rbx
    143a:   48 81 ec 90 00 00 00    sub    $0x90,%rsp
    1441:   89 7c 24 10             mov    %edi,0x10(%rsp)
    1445:   89 54 24 14             mov    %edx,0x14(%rsp)
    1449:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
    1450:   00 00
    1452:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
    1459:   00
    145a:   31 c0                   xor    %eax,%eax
    145c:   48 85 f6                test   %rsi,%rsi
    145f:   49 89 f4                mov    %rsi,%r12
    1462:   0f 84 42 03 00 00       je     17aa <__seccomp_filter+0x37a>
    1468:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
    146f:   00 00
    1471:   48 8b 98 80 07 00 00    mov    0x780(%rax),%rbx
    1478:   48 85 db                test   %rbx,%rbx

Signed-off-by: wanghongzhe <wanghongzhe@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1612496049-32507-1-git-send-email-wanghongzhe@huawei.com
2021-02-10 12:40:11 -08:00
Paul Cercueil
04b38d0125 seccomp: Add missing return in non-void function
We don't actually care about the value, since the kernel will panic
before that; but a value should nonetheless be returned, otherwise the
compiler will complain.

Fixes: 8112c4f140 ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210111172839.640914-1-paul@crapouillou.net
2021-01-11 12:05:14 -08:00
Linus Torvalds
e994cc240a Merge tag 'seccomp-v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
 "The major change here is finally gaining seccomp constant-action
  bitmaps, which internally reduces the seccomp overhead for many
  real-world syscall filters to O(1), as discussed at Plumbers this
  year.

   - Improve seccomp performance via constant-action bitmaps (YiFei Zhu
     & Kees Cook)

   - Fix bogus __user annotations (Jann Horn)

   - Add missed CONFIG for improved selftest coverage (Mickaël Salaün)"

* tag 'seccomp-v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: Update kernel config
  seccomp: Remove bogus __user annotations
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  xtensa: Enable seccomp architecture tracking
  sh: Enable seccomp architecture tracking
  s390: Enable seccomp architecture tracking
  riscv: Enable seccomp architecture tracking
  powerpc: Enable seccomp architecture tracking
  parisc: Enable seccomp architecture tracking
  csky: Enable seccomp architecture tracking
  arm: Enable seccomp architecture tracking
  arm64: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead
  x86: Enable seccomp architecture tracking
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
2020-12-16 11:30:10 -08:00
Linus Torvalds
1ac0884d54 Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:

   - More generalization of entry/exit functionality

   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.

   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come seperate via Jens.

   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.

   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.

   - Preparatory changes to make the generic entry code accomodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"

* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry_Add_enter_from_user_mode_wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ...
2020-12-14 17:13:53 -08:00
Kees Cook
7ef95e3dbc Merge branch 'for-linus/seccomp' into for-next/seccomp 2020-11-20 11:39:39 -08:00
Jann Horn
fab686eb03 seccomp: Remove bogus __user annotations
Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
2020-11-20 11:39:21 -08:00
YiFei Zhu
0d8315dddd seccomp/cache: Report cache data through /proc/pid/seccomp_cache
Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
2020-11-20 11:16:35 -08:00
YiFei Zhu
8e01b51a31 seccomp/cache: Add "emulator" to check if filter is constant allow
SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD  | BPF_W    | BPF_ABS
BPF_JMP | BPF_JEQ  | BPF_K
BPF_JMP | BPF_JGE  | BPF_K
BPF_JMP | BPF_JGT  | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND  | BPF_K

Each of these instructions are emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
2020-11-20 11:16:34 -08:00
YiFei Zhu
f9d480b6ff seccomp/cache: Lookup syscall allowlist bitmap for fast path
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.

When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] ae0ef82b90/profiles/seccomp/default.json
[5] 6743a1caf4/src/shared/seccomp-util.c (L270)
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
2020-11-20 11:16:34 -08:00
Mickaël Salaün
fb14528e44 seccomp: Set PF_SUPERPRIV when checking capability
Replace the use of security_capable(current_cred(), ...) with
ns_capable_noaudit() which set PF_SUPERPRIV.

Since commit 98f368e9e2 ("kernel: Add noaudit variant of
ns_capable()"), a new ns_capable_noaudit() helper is available.  Let's
use it!

Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Fixes: e2cfabdfd0 ("seccomp: add system call filtering using BPF")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net
2020-11-17 12:53:22 -08:00
Gabriel Krisman Bertazi
23d67a5485 seccomp: Migrate to use SYSCALL_WORK flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com
2020-11-16 21:53:15 +01:00
Jann Horn
dfe719fef0 seccomp: Make duplicate listener detection non-racy
Currently, init_listener() tries to prevent adding a filter with
SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already
has a listener. However, this check happens without holding any lock that
would prevent another thread from concurrently installing a new filter
(potentially with a listener) on top of the ones we already have.

Theoretically, this is also a data race: The plain load from
current->seccomp.filter can race with concurrent writes to the same
location.

Fix it by moving the check into the region that holds the siglock to guard
against concurrent TSYNC.

(The "Fixes" tag points to the commit that introduced the theoretical
data race; concurrent installation of another filter with TSYNC only
became possible later, in commit 51891498f2 ("seccomp: allow TSYNC and
USER_NOTIF together").)

Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com
2020-10-08 13:17:47 -07:00
Denis Efremov
2d9ca267a9 seccomp: Use current_pt_regs() instead of task_pt_regs(current)
As described in commit a3460a5974 ("new helper: current_pt_regs()"):
- arch versions are "optimized versions".
- some architectures have task_pt_regs() working only for traced tasks
  blocked on signal delivery. current_pt_regs() needs to work for *all*
  processes.

In preparation for adding a coccinelle rule for using current_*(), instead
of raw accesses to current members, modify seccomp_do_user_notification(),
__seccomp_filter(), __secure_computing() to use current_pt_regs().

Signed-off-by: Denis Efremov <efremov@linux.com>
Link: https://lore.kernel.org/r/20200824125921.488311-1-efremov@linux.com
[kees: Reworded commit log, add comment to populate_seccomp_data()]
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08 16:26:45 -07:00
Rich Felker
4d671d922d seccomp: kill process instead of thread for unknown actions
Asynchronous termination of a thread outside of the userspace thread
library's knowledge is an unsafe operation that leaves the process in
an inconsistent, corrupt, and possibly unrecoverable state. In order
to make new actions that may be added in the future safe on kernels
not aware of them, change the default action from
SECCOMP_RET_KILL_THREAD to SECCOMP_RET_KILL_PROCESS.

Signed-off-by: Rich Felker <dalias@libc.org>
Link: https://lore.kernel.org/r/20200829015609.GA32566@brightrain.aerifal.cx
[kees: Fixed up coredump selection logic to match]
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08 12:00:49 -07:00
Tycho Andersen
e839317900 seccomp: don't leave dangling ->notif if file allocation fails
Christian and Kees both pointed out that this is a bit sloppy to open-code
both places, and Christian points out that we leave a dangling pointer to
->notif if file allocation fails. Since we check ->notif for null in order
to determine if it's ok to install a filter, this means people won't be
able to install a filter if the file allocation fails for some reason, even
if they subsequently should be able to.

To fix this, let's hoist this free+null into its own little helper and use
it.

Reported-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200902140953.1201956-1-tycho@tycho.pizza
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08 11:30:16 -07:00
Tycho Andersen
a566a9012a seccomp: don't leak memory when filter install races
In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
the listener fd, then check to see if we can actually use it later in
seccomp_may_assign_mode(), which can fail if anyone else in our thread
group has installed a filter and caused some divergence. If we can't, we
partially clean up the newly allocated file: we put the fd, put the file,
but don't actually clean up the *memory* that was allocated at
filter->notif. Let's clean that up too.

To accomplish this, let's hoist the actual "detach a notifier from a
filter" code to its own helper out of seccomp_notify_release(), so that in
case anyone adds stuff to init_listener(), they only have to add the
cleanup code in one spot. This does a bit of extra locking and such on the
failure path when the filter is not attached, but it's a slow failure path
anyway.

Fixes: 51891498f2 ("seccomp: allow TSYNC and USER_NOTIF together")
Reported-by: syzbot+3ad9614a12f80994c32e@syzkaller.appspotmail.com
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200902014017.934315-1-tycho@tycho.pizza
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08 11:19:50 -07:00