Commit Graph

15873 Commits

Author SHA1 Message Date
Kees Cook
81ff7fa232 seccomp: implement SECCOMP_FILTER_FLAG_TSYNC
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer includes a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:54:06 +00:00
Kees Cook
d924727911 seccomp: allow mode setting across threads
This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c

Change-Id: I091ffa55d8f4e83ff02558a55e2b4dc76ac26905
2014-08-28 01:53:49 +00:00
Kees Cook
743266ae25 seccomp: introduce writer locking
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic.  However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/fork.c

Change-Id: Ie01ece43b610867013f7d0e0a2a7be0b9077630f
2014-08-28 01:53:30 +00:00
Kees Cook
3497a88f55 seccomp: split filter prep from check and apply
In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c

Change-Id: I8d89f80a5b4f2826d90474dcea441c41f0af6594
2014-08-28 01:53:09 +00:00
Kees Cook
a75a29b16e seccomp: add "seccomp" syscall
This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	arch/x86/syscalls/syscall_32.tbl
	arch/x86/syscalls/syscall_64.tbl
	include/uapi/asm-generic/unistd.h
	kernel/seccomp.c

Change-Id: Id7a365079829fd9164315dec75d6ee415c29b176
2014-08-28 01:51:54 +00:00
Kees Cook
1a63bcec4f seccomp: split mode setting routines
Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:51:22 +00:00
Kees Cook
c208e4e9f1 seccomp: extract check/assign mode helpers
To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:50:55 +00:00
Kees Cook
6862b01436 seccomp: create internal mode-setting function
In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:50:03 +00:00
Kees Cook
c2da3eba6a seccomp: fix memory leak on filter attach
This sets the correct error code when final filter memory is unavailable,
and frees the raw filter no matter what.

unreferenced object 0xffff8800d6ea4000 (size 512):
  comm "sshd", pid 278, jiffies 4294898315 (age 46.653s)
  hex dump (first 32 bytes):
    21 00 00 00 04 00 00 00 15 00 01 00 3e 00 00 c0  !...........>...
    06 00 00 00 00 00 00 00 21 00 00 00 00 00 00 00  ........!.......
  backtrace:
    [<ffffffff8151414e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811a3a40>] __kmalloc+0x280/0x320
    [<ffffffff8110842e>] prctl_set_seccomp+0x11e/0x3b0
    [<ffffffff8107bb6b>] SyS_prctl+0x3bb/0x4a0
    [<ffffffff8152ef2d>] system_call_fastpath+0x1a/0x1f
    [<ffffffffffffffff>] 0xffffffffffffffff

Reported-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Masami Ichikawa <masami256@gmail.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts:
	kernel/seccomp.c

Change-Id: Ide3c27bf378397f8faf4218e75c31e4b8bc43c4c
2014-08-28 01:48:29 +00:00
Colin Cross
9bc0c15675 mm: fix prctl_set_vma_anon_name
prctl_set_vma_anon_name could attempt to set the name across
two vmas at the same time due to a typo, which might corrupt
the vma list.  Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.

Change-Id: Ie32d8ddb0fd547efbeedd6528acdab5ca5b308b4
Reported-by: Jed Davis <jld@mozilla.com>
Signed-off-by: Colin Cross <ccross@android.com>
2014-08-22 23:47:15 +00:00
Ruchi Kandoi
8ae872f1d5 prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID.
Adds a capable() check to make sure that arbitary apps do not change the
timer slack for other apps.

Bug: 15000427
Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-06-16 17:49:10 +00:00
Thomas Gleixner
b50939f89d futex: Make lookup_pi_state more robust
The current implementation of lookup_pi_state has ambigous handling of
the TID value 0 in the user space futex. We can get into the kernel
even if the TID value is 0, because either there is a stale waiters
bit or the owner died bit is set or we are called from the requeue_pi
path or from user space just for fun.

The current code avoids an explicit sanity check for pid = 0 in case
that kernel internal state (waiters) are found for the user space
address. This can lead to state leakage and worse under some
circumstances.

Handle the cases explicit:

     Waiter | pi_state | pi->owner | uTID      | uODIED | ?

[1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
[2]  NULL   | ---      | ---       | >0        | 0/1    | Valid

[3]  Found  | NULL     | --        | Any       | 0/1    | Invalid

[4]  Found  | Found    | NULL      | 0         | 1      | Valid
[5]  Found  | Found    | NULL      | >0        | 1      | Invalid

[6]  Found  | Found    | task      | 0         | 1      | Valid

[7]  Found  | Found    | NULL      | Any       | 0      | Invalid

[8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
[9]  Found  | Found    | task      | 0         | 0      | Invalid
[10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid

[1]  Indicates that the kernel can acquire the futex atomically. We
     came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

[2]  Valid, if TID does not belong to a kernel thread. If no matching
     thread is found then it indicates that the owner TID has died.

[3]  Invalid. The waiter is queued on a non PI futex

[4]  Valid state after exit_robust_list(), which sets the user space
     value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

[5]  The user space value got manipulated between exit_robust_list()
     and exit_pi_state_list()

[6]  Valid state after exit_pi_state_list() which sets the new owner in
     the pi_state but cannot access the user space value.

[7]  pi_state->owner can only be NULL when the OWNER_DIED bit is set.

[8]  Owner and user space value match

[9]  There is no transient state which sets the user space TID to 0
     except exit_robust_list(), but this is indicated by the
     FUTEX_OWNER_DIED bit. See [4]

[10] There is no transient state which leaves owner and user space
     TID out of sync.

Backport to 3.13
  conflicts: kernel/futex.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
2014-06-06 14:53:48 -07:00
Thomas Gleixner
a2ec8e3dcd futex: Always cleanup owner tid in unlock_pi
If the owner died bit is set at futex_unlock_pi, we currently do not
cleanup the user space futex. So the owner TID of the current owner
(the unlocker) persists. That's observable inconsistant state,
especially when the ownership of the pi state got transferred.

Clean it up unconditionally.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
2014-06-06 14:53:47 -07:00
Thomas Gleixner
550c7910f0 futex: Validate atomic acquisition in futex_lock_pi_atomic()
We need to protect the atomic acquisition in the kernel against rogue
user space which sets the user space futex to 0, so the kernel side
acquisition succeeds while there is existing state in the kernel
associated to the real owner.

Verify whether the futex has waiters associated with kernel state. If
it has, return -EINVAL. The state is corrupted already, so no point in
cleaning it up. Subsequent calls will fail as well. Not our problem.

[ tglx: Use futex_top_waiter() and explain why we do not need to try
  	restoring the already corrupted user space state. ]

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-06-06 14:53:47 -07:00
Thomas Gleixner
5b3bbe42f5 futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
then dangling pointers may be left for rt_waiter resulting in an
exploitable condition.

This change brings futex_requeue() into line with
futex_wait_requeue_pi() which performs the same check as per commit
6f7b0a2a5 (futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi())

[ tglx: Compare the resulting keys as well, as uaddrs might be
  	different depending on the mapping ]

Fixes CVE-2014-3153.

Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-06-06 14:53:47 -07:00
Ruchi Kandoi
70dc3ed48f Power: Changes the permission to read only for sysfs file
/sys/kernel/wakeup_reasons/last_resume_reason

Change-Id: I8ac568a7cb58c31decd379195de517ff3c6f9c65
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-24 22:14:30 +00:00
Ruchi Kandoi
37a591d407 prctl: adds PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the
slack is set to that value otherwise sets it to the default for the thread.

Takes PID of the thread as the third argument.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)

Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-22 17:31:53 -07:00
Greg Hackmann
c9331cabfd power: wakeup_reason: rename irq_count to irqcount
On x86, irq_count conflicts with a declaration in
arch/x86/include/asm/processor.h

Change-Id: I3e4fde0ff64ef59ff5ed2adc0ea3a644641ee0b7
Signed-off-by: Greg Hackmann <ghackmann@google.com>
2014-03-10 14:26:18 -07:00
Ruchi Kandoi
8b531976d5 Power: Add guard condition for maximum wakeup reasons
Ensure the array for the wakeup reason IRQs does not overflow.

Change-Id: Iddc57a3aeb1888f39d4e7b004164611803a4d37c
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
(cherry picked from commit b5ea40cdfcf38296535f931a7e5e7bf47b6fad7f)
2014-03-10 18:17:35 +00:00
Ruchi Kandoi
39129eff0f POWER: fix compile warnings in log_wakeup_reason
Change I81addaf420f1338255c5d0638b0d244a99d777d1 introduced compile
warnings, fix these.

Change-Id: I05482a5335599ab96c0a088a7d175c8d4cf1cf69
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-02-21 04:16:10 +00:00
Ruchi Kandoi
520bc0723d Power: add an API to log wakeup reasons
Add API log_wakeup_reason() and expose it to userspace via sysfs path
/sys/kernel/wakeup_reasons/last_resume_reason

Change-Id: I81addaf420f1338255c5d0638b0d244a99d777d1
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-02-20 02:43:15 +00:00
Arve Hjønnevåg
47e0f4d1dc ARM: Fix "Make low-level printk work" to use a separate config option
Signed-off-by: Arve Hjønnevåg <arve@android.com>
2013-11-13 17:34:12 -08:00
Colin Cross
d67a07b6ec anonymous vma names: fix build with !MMU
Disable PR_SET_VMA when building with !MMU

Change-Id: I896b6979b99aa61df85caf4c3ec22eb8a8204e64
Signed-off-by: Colin Cross <ccross@android.com>
2013-11-07 16:25:12 -08:00
Colin Cross
527aae5291 mm: fix anon vma naming
Fix two bugs caused by merging anon vma_naming, a typo in
mempolicy.c and a bad merge in sys.c.

Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90
Signed-off-by: Colin Cross <ccross@android.com>
2013-11-07 16:25:11 -08:00
Colin Cross
6ebfe5864a mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory.  When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.

This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma.  vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.

The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas.  If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".

The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen.  This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging.  The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.

Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
2013-09-19 14:14:28 -05:00