____call_usermodehelper() now erases any credentials set by the
subprocess_inf::init() function. The problem is that commit
17f60a7da1 ("capabilites: allow the application of capability limits
to usermode helpers") creates and commits new credentials with
prepare_kernel_cred() after the call to the init() function. This wipes
all keyrings after umh_keys_init() is called.
The best way to deal with this is to put the init() call just prior to
the commit_creds() call, and pass the cred pointer to init(). That
means that umh_keys_init() and suchlike can modify the credentials
_before_ they are published and potentially in use by the rest of the
system.
This prevents request_key() from working as it is prevented from passing
the session keyring it set up with the authorisation token to
/sbin/request-key, and so the latter can't assume the authority to
instantiate the key. This causes the in-kernel DNS resolver to fail
with ENOKEY unconditionally.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 7f81c8890c.
It turns out that it's not actually a build-time check on x86-64 UML,
which does some seriously crazy stuff with VM_STACK_FLAGS.
The VM_STACK_FLAGS define depends on the arch-supplied
VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have
arch/um/sys-x86_64/shared/sysdep/vm-flags.h:
#define VM_STACK_DEFAULT_FLAGS \
(test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)
#define VM_STACK_DEFAULT_FLAGS vm_stack_flags
(yes, seriously: two different #define's for that thing, with the first
one being inside an "#ifdef TIF_IA32")
It's possible that it is UML that should just be fixed in this area, but
for now let's just undo the (very small) optimization.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a8bef8ff6e ("mm: migration: avoid race between shift_arg_pages()
and rmap_walk() during migration by not migrating temporary stacks")
introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time
one, so BUILD_BUG_ON is more appropriate.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unconditionally changing the address limit to USER_DS and not restoring
it to its old value in the error path is wrong because it prevents us
using kernel memory on repeated calls to this function. This, in fact,
breaks the fallback of hard coded paths to the init program from being
ever successful if the first candidate fails to load.
With this patch applied switching to USER_DS is delayed until the point
of no return is reached which makes it possible to have a multi-arch
rootfs with one arch specific init binary for each of the (hard coded)
probed paths.
Since the address limit is already set to USER_DS when start_thread()
will be invoked, this redundancy can be safely removed.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, exe_file is not proc FS dependent, so we can use it to name core
file. So we add %E pattern for core file name cration which extract path
from mm_struct->exe_file. Then it converts slashes to exclamation marks
and pastes the result to the core file name itself.
This is useful for environments where binary names are longer than 16
character (the current->comm limitation). Also where there are binaries
with same name but in a different path. Further in case the binery itself
changes its current->comm after exec.
So by doing (s/$/#/ -- # is treated as git comment):
$ sysctl kernel.core_pattern='core.%p.%e.%E'
$ ln /bin/cat cat45678901234567890
$ ./cat45678901234567890
^Z
$ rm cat45678901234567890
$ fg
^\Quit (core dumped)
$ ls core*
we now get:
core.2434.cat456789012345.!root!cat45678901234567890 (deleted)
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe. Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), built this functionality always into the
kernel.
To achieve that move that out of proc FS to the kernel/ where in fact it
should belong. By doing that we can make dup_mm_exe_file static. Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the existing mmu_gather infrastructure.
The direct purpose of these patches was to allow preemptible mmu_gather,
but even without that I think these patches provide an improvement to the
status quo.
The first 9 patches rework the mmu_gather infrastructure. For review
purpose I've split them into generic and per-arch patches with the last of
those a generic cleanup.
The next patch provides generic RCU page-table freeing, and the followup
is a patch converting s390 to use this. I've also got 4 patches from
DaveM lined up (not included in this series) that uses this to implement
gup_fast() for sparc64.
Then there is one patch that extends the generic mmu_gather batching.
After that follow the mm preemptibility patches, these make part of the mm
a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to
mutexes which together with the mmu_gather rework makes mmu_gather
preemptible as well.
Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
This also allows for preemptible mmu_notifiers, something that XPMEM I
think wants.
Furthermore, it removes the new and universially detested unmap_mutex.
This patch:
Remove the first obstacle towards a fully preemptible mmu_gather.
The current scheme assumes mmu_gather is always done with preemption
disabled and uses per-cpu storage for the page batches. Change this to
try and allocate a page for batching and in case of failure, use a small
on-stack array to make some progress.
Preemptible mmu_gather is desired in general and usable once i_mmap_lock
becomes a mutex. Doing it before the mutex conversion saves us from
having to rework the code by moving the mmu_gather bits inside the
pte_lock.
Also avoid flushing the tlb batches from under the pte lock, this is
useful even without the i_mmap_lock conversion as it significantly reduces
pte lock hold times.
[akpm@linux-foundation.org: fix comment tpyo]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.
check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.
Let's clean this up by exporting both functions and make those names
consistent. Let's use expand_{upwards,downwards} because expanding
doesn't always involve stack manipulation (an example is
ia64_do_page_fault which uses expand_upwards for registers backing store
expansion). expand_downwards has to be defined for both
CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
version in the early process initialization phase for growsup
configuration.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
tty/serial: Fix break handling for PORT_TEGRA
tty/serial: Add explicit PORT_TEGRA type
n_tracerouter and n_tracesink ldisc additions.
Intel PTI implementaiton of MIPI 1149.7.
Kernel documentation for the PTI feature.
export kernel call get_task_comm().
tty: Remove to support serial for S5P6442
pch_phub: Support new device ML7223
8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
ASoC: Update cx20442 for TTY API change
pch_uart: Support new device ML7223 IOH
parport: Use request_muxed_region for IT87 probe and lock
tty/serial: add support for Xilinx PS UART
n_gsm: Use print_hex_dump_bytes
drivers/tty/moxa.c: Put correct tty value
TTY: tty_io, annotate locking functions
TTY: serial_core, remove superfluous set_task_state
TTY: serial_core, remove invalid test
Char: moxa, fix locking in moxa_write
...
Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
drivers/tty/serial/Makefile.
I did the hci_ldisc thing as an evil merge, cleaning things up.
This allows drivers who call this function to be compiled modularly.
Otherwise, a driver who is interested in this type of functionality
has to implement their own get_task_comm() call, causing code
duplication in the Linux source tree.
Signed-off-by: J Freyensee <james_p_freyensee@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add the appropriate members into struct user_arg_ptr and teach
get_user_arg_ptr() to handle is_compat = T case correctly.
This allows us to remove the compat_do_execve() code from fs/compat.c
and reimplement compat_do_execve() as the trivial wrapper on top of
do_execve_common(is_compat => true).
In fact, this fixes another (minor) bug. "compat_uptr_t str" can
overflow after "str += len" in compat_copy_strings() if a 64bit
application execs via sys32_execve().
Unexport acct_arg_size() and get_arg_page(), fs/compat.c doesn't
need them any longer.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
No functional changes, preparation.
Introduce struct user_arg_ptr, change do_execve() paths to use it
instead of "char __user * const __user *argv".
This makes the argv/envp arguments opaque, we are ready to handle the
compat case which needs argv pointing to compat_uptr_t.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Introduce get_user_arg_ptr() helper, convert count() and copy_strings()
to use it.
No functional changes, preparation. This helper is trivial, it just
reads the pointer from argv/envp user-space array.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Currently task->signal->group_stop_count is used to decide whether to
stop for group stop. However, if there is a task in the group which
is taking a long time to stop, other tasks which are continued by
ptrace would repeatedly stop for the same group stop until the group
stop is complete.
Conversely, if a ptraced task is in TASK_TRACED state, the debugger
won't get notified of group stops which is inconsistent compared to
the ptraced task in any other state.
This patch introduces GROUP_STOP_PENDING which tracks whether a task
is yet to stop for the group stop in progress. The flag is set when a
group stop starts and cleared when the task stops the first time for
the group stop, and consulted whenever whether the task should
participate in a group stop needs to be determined. Note that now
tasks in TASK_TRACED also participate in group stop.
This results in the following behavior changes.
* For a single group stop, a ptracer would see at most one stop
reported.
* A ptracee in TASK_TRACED now also participates in group stop and the
tracer would get the notification. However, as a ptraced task could
be in TASK_STOPPED state or any ptrace trap could consume group
stop, the notification may still be missing. These will be
addressed with further patches.
* A ptracee may start a group stop while one is still in progress if
the tracer let it continue with stop signal delivery. Group stop
code handles this correctly.
Oleg:
* Spotted that a task might skip signal check even when its
GROUP_STOP_PENDING is set. Fixed by updating
recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
group_stop_count.
* Pointed out that task->group_stop should be cleared whenever
task->signal->group_stop_count is cleared. Fixed accordingly.
* Pointed out the behavior inconsistency between TASK_TRACED and
RUNNING and the last behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Hi,
I was backporting the coredump over pipe feature and noticed this small typo,
I wish I would have something bigger to contribute...
>From 15d6080e0ed4267da103c706917a33b1015e8804 Mon Sep 17 00:00:00 2001
From: Holger Hans Peter Freyther <holger@moiji-mobile.com>
Date: Thu, 24 Feb 2011 17:42:50 +0100
Subject: [PATCH] fs: Fix a small typo in the comment
The function is called umh_pipe_setup not uhm_pipe_setup.
Signed-off-by: Holger Hans Peter Freyther <holger@moiji-mobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
take calculation of open_flags by open(2) arguments into new helper
in fs/open.c, move filp_open() over there, have it and do_sys_open()
use that helper, switch exec.c callers of do_filp_open() to explicit
(and constant) struct open_flags.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
FMODE_EXEC is a constant type of fmode_t but was used with normal integer
constants. This results in following warnings from sparse. Fix it using
new macro __FMODE_EXEC.
fs/exec.c:116:58: warning: restricted fmode_t degrades to integer
fs/exec.c:689:58: warning: restricted fmode_t degrades to integer
fs/fcntl.c:777:9: warning: restricted fmode_t degrades to integer
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.
bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.
$ uname -m
x86_64
$ cat /proc/sys/vm/mmap_min_addr
65536
$ cat install_special_mapping.s
section .bss
resb BSS_SIZE
section .text
global _start
_start:
mov eax, __NR_pause
int 0x80
$ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
$ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
$ ./install_special_mapping &
[1] 14303
$ cat /proc/14303/maps
0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
00011000-ffffe000 rwxp 00000000 00:00 0 [stack]
It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.
Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.
compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.
Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.
Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c
execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.
With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.
Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.
Reported-by: Brad Spengler <spender@grsecurity.net>
Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
THis has a theorical risk of PF_KTHREAD getting lost. We don't have to
turn PF_KTHREAD off in the ENOEXEC case.
This patch moves this flag modification to after the finding of the
executable file.
This is only a theorical issue because kthreads do not call do_execve()
directly. But fixing would be better.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We met a parameter truncated issue, consider following:
> echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
%s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
/proc/sys/kernel/core_pattern
This is okay because the strings is less than CORENAME_MAX_SIZE. "cat
/proc/sys/kernel/core_pattern" shows the whole string. but after we run
core_pattern_pipe_test in man page, we found last parameter was truncated
like below:
argc[10]=<12807486>
The root cause is core_pattern allows % specifiers, which need to be
replaced during parse time, but the replace may expand the strings to
larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers,
the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
will write out of corename array.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>