An unprivileged user can trigger an oops on a kernel with
CONFIG_CHECKPOINT_RESTORE.
proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
start/end values. These get sanity checked as follows:
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
These can be changed by prctl_set_mm. Turns out also takes the semaphore for
reading, effectively rendering it useless. This results in:
kernel BUG at fs/proc/base.c:240!
invalid opcode: 0000 [#1] SMP
Modules linked in: virtio_net
CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
RIP: proc_pid_cmdline_read+0x520/0x530
RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
Call Trace:
__vfs_read+0x37/0x100
vfs_read+0x82/0x130
SyS_read+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RIP proc_pid_cmdline_read+0x520/0x530
---[ end trace 97882617ae9c6818 ]---
Turns out there are instances where the code just reads aformentioned
values without locking whatsoever - namely environ_read and get_cmdline.
Interestingly these functions look quite resilient against bogus values,
but I don't believe this should be relied upon.
The first patch gets rid of the oops bug by grabbing mmap_sem for
writing.
The second patch is optional and puts locking around aformentioned
consumers for safety. Consumers of other fields don't seem to benefit
from similar treatment and are left untouched.
This patch (of 2):
The code was taking the semaphore for reading, which does not protect
against readers nor concurrent modifications.
The problem could cause a sanity checks to fail in procfs's cmdline
reader, resulting in an OOPS.
Note that some functions perform an unlocked read of various mm fields,
but they seem to be fine despite possible modificaton.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the stuff currently only used by the kexec file code within
CONFIG_KEXEC_FILE (and CONFIG_KEXEC_VERIFY_SIG).
Also move internal "struct kexec_sha_region" and "struct kexec_buf" into
"kexec_internal.h".
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change cpu_possible_bits and friends (online, present, active) from being
bitmaps that happen to have the right size to actually being struct
cpumasks. Also rename them to __cpu_xyz_mask. This is mostly a small
cleanup in preparation for exporting them and, eventually, eliminating the
extra indirection through the cpu_xyz_mask variables.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_stopped_code()->task_is_stopped_or_traced() doesn't look right, the
traced task must never be TASK_STOPPED.
We can not add WARN_ON(task_is_stopped(p)), but this is only because
do_wait() can race with PTRACE_ATTACH from another thread.
[akpm@linux-foundation.org: teeny cleanup]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Roland McGrath <roland@hack.frob.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_attach() can hang waiting for STOPPED -> TRACED transition if the
tracee gets frozen in between, change wait_on_bit() to use TASK_KILLABLE.
This doesn't really solve the problem(s) and we probably need to fix the
freezer. In particular, note that this means that pm freezer will fail if
it races attach-to-stopped-task.
And otoh perhaps we can just remove JOBCTL_TRAPPING_BIT altogether, it is
not clear if we really need to hide this transition from debugger, WNOHANG
after PTRACE_ATTACH can fail anyway if it races with SIGCONT.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Roland McGrath <roland@hack.frob.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security subsystem updates from James Morris:
- EVM gains support for loading an x509 cert from the kernel
(EVM_LOAD_X509), into the EVM trusted kernel keyring.
- Smack implements 'file receive' process-based permission checking for
sockets, rather than just depending on inode checks.
- Misc enhancments for TPM & TPM2.
- Cleanups and bugfixes for SELinux, Keys, and IMA.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (41 commits)
selinux: Inode label revalidation performance fix
KEYS: refcount bug fix
ima: ima_write_policy() limit locking
IMA: policy can be updated zero times
selinux: rate-limit netlink message warnings in selinux_nlmsg_perm()
selinux: export validatetrans decisions
gfs2: Invalid security labels of inodes when they go invalid
selinux: Revalidate invalid inode security labels
security: Add hook to invalidate inode security labels
selinux: Add accessor functions for inode->i_security
security: Make inode argument of inode_getsecid non-const
security: Make inode argument of inode_getsecurity non-const
selinux: Remove unused variable in selinux_inode_init_security
keys, trusted: seal with a TPM2 authorization policy
keys, trusted: select hash algorithm for TPM2 chips
keys, trusted: fix: *do not* allow duplicate key options
tpm_ibmvtpm: properly handle interrupted packet receptions
tpm_tis: Tighten IRQ auto-probing
tpm_tis: Refactor the interrupt setup
tpm_tis: Get rid of the duplicate IRQ probing code
...
Pull audit updates from Paul Moore:
"Seven audit patches for 4.5, all very minor despite the diffstat.
The diffstat churn for linux/audit.h can be attributed to needing to
reshuffle the linux/audit.h header to fix the seccomp auditing issue
(see the commit description for details).
Besides the seccomp/audit fix, most of the fixes are around trying to
improve the connection with the audit daemon and a Kconfig
simplification. Nothing crazy, and everything passes our little
audit-testsuite"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: always enable syscall auditing when supported and audit is enabled
audit: force seccomp event logging to honor the audit_enabled flag
audit: Delete unnecessary checks before two function calls
audit: wake up threads if queue switched from limited to unlimited
audit: include auditd's threads in audit_log_start() wait exception
audit: remove audit_backlog_wait_overflow
audit: don't needlessly reset valid wait time
Merge second patch-bomb from Andrew Morton:
- more MM stuff:
- Kirill's page-flags rework
- Kirill's now-allegedly-fixed THP rework
- MADV_FREE implementation
- DAX feature work (msync/fsync). This isn't quite complete but DAX
is new and it's good enough and the guys have a handle on what
needs to be done - I expect this to be wrapped in the next week or
two.
- some vsprintf maintenance work
- various other misc bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (145 commits)
printk: change recursion_bug type to bool
lib/vsprintf: factor out %pN[F] handler as netdev_bits()
lib/vsprintf: refactor duplicate code to special_hex_number()
printk-formats.txt: remove unimplemented %pT
printk: help pr_debug and pr_devel to optimize out arguments
lib/test_printf.c: test dentry printing
lib/test_printf.c: add test for large bitmaps
lib/test_printf.c: account for kvasprintf tests
lib/test_printf.c: add a few number() tests
lib/test_printf.c: test precision quirks
lib/test_printf.c: check for out-of-bound writes
lib/test_printf.c: don't BUG
lib/kasprintf.c: add sanity check to kvasprintf
lib/vsprintf.c: warn about too large precisions and field widths
lib/vsprintf.c: help gcc make number() smaller
lib/vsprintf.c: expand field_width to 24 bits
lib/vsprintf.c: eliminate potential race in string()
lib/vsprintf.c: move string() below widen_string()
lib/vsprintf.c: pull out padding code from dentry_name()
printk: do cond_resched() between lines while outputting to consoles
...
Pull sound updates from Takashi Iwai:
"We've had quite busy weeks in this cycle. Looking at ALSA core, the
significant changes are a few fixes wrt timer and sequencer ioctls
that have been revealed by fuzzer recently. Other than that, ASoC
core got a few updates about DAI link handling, but these are rather
straightforward refactoring.
In drivers scene, ASoC received quite lots of new drivers in addition
to bunch of updates for still ongoing Intel Skylake support and
topology API. HD-audio gained a new HDMI/DP hotplug notification via
component. FireWire got a pile of code refactoring/updates with
SCS.1x driver integration.
More highlights are shown below.
[ NOTE: this contains also many commits for DRM. This is due to the
pull of drm stable branch into sound tree, as the base of i915 audio
component work for HD-audio. The highlights below don't contain
these DRM changes, as these are supposed to be pulled via drm tree
in anyway sooner or later. ]
Core:
- Handful fixes to harden ALSA timer and sequencer ioctls against
races reported by syzkaller fuzzer
- Irq description string can be unique to each card; only for
HD-audio for now
ASoC:
- Conversion of the array of DAI links to a list for supporting
dynamically adding and removing DAI links
- Topology API enhancements to make everything more component based
and being able to specify PCM links via topology
- Some more fixes for the topology code, though it is still not final
and ready for enabling in production; we really need to get to the
point where that can be done
- A pile of changes for Intel SkyLake drivers which hopefully deliver
some useful initial functionality for systems with this chipset,
though there is more work still to come
- Lots of new features and cleanups for the Renesas drivers
- ANC support for WM5110
- New drivers: Imagination Technologies IPs, Atmel class D speaker,
Cirrus CS47L24 and WM1831, Dialog DA7128, Realtek RT5659 and
RT56156, Rockchip RK3036, TI PC3168A, and AMD ACP
- Rename PCM1792a driver to be generic pcm179x
HD-Audio:
- Use audio component for i915 HDMI/DP hotplug handling
- On-demand binding with i915 driver
- bdl_pos_adj parameter adjustment for Baytrail controllers
- Enable power_save_node for CX20722; this shouldn't lead to
regression, hopefully
- Kabylake HDMI/DP codec support
- Quirks for Lenovo E50-80, Dell Latitude E-series, and other Dell
machines
- A few code refactoring
FireWire:
- Lots of code cleanup and refactoring
- Integrate the support of SCS.1x devices into snd-oxfw driver;
snd-scs1x driver is obsoleted
USB-audio:
- Fix possible NULL dereference at disconnection
- A regression fix for Native Instruments devices
Misc:
- A few code cleanups of fm801 driver"
* tag 'sound-4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (722 commits)
ALSA: timer: Code cleanup
ALSA: timer: Harden slave timer list handling
ALSA: hda - Add fixup for Dell Latitidue E6540
ALSA: timer: Fix race among timer ioctls
ALSA: hda - add codec support for Kabylake display audio codec
ALSA: timer: Fix double unlink of active_list
ALSA: usb-audio: Fix mixer ctl regression of Native Instrument devices
ALSA: hda - fix the headset mic detection problem for a Dell laptop
ALSA: hda - Fix white noise on Dell Latitude E5550
ALSA: hda_intel: add card number to irq description
ALSA: seq: Fix race at timer setup and close
ALSA: seq: Fix missing NULL check at remove_events ioctl
ALSA: usb-audio: Avoid calling usb_autopm_put_interface() at disconnect
ASoC: hdac_hdmi: remove unused hdac_hdmi_query_pin_connlist
ASoC: AMD: Add missing include file
ALSA: hda - Fixup inverted internal mic for Lenovo E50-80
ALSA: usb: Add native DSD support for Oppo HA-1
ASoC: Make aux_dev more like a generic component
ASoC: bcm2835: cleanup includes by ordering them alphabetically
ASoC: AMD: Manage ACP 2.x SRAM banks power
...
@console_may_schedule tracks whether console_sem was acquired through
lock or trylock. If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched(). This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.
However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines. Also, only a few drivers call
console_conditional_schedule() to begin with. This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.
If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.
Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Kyle McMartin <kyle@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Boot consoles are typically replaced by proper consoles during the boot
process. This can be problematic if the boot console data is part of
the init section that is reclaimed late during boot. If the proper
console does not register before this point in time, the boot console
will need to be removed (so that the freed memory is not accessed),
leaving the system without output for some time.
There are various reasons why the proper console may not register early
enough, such as deferred probe or the driver being a loadable module.
If that happens, there is some amount of time where no console messages
are visible to the user, which in turn can mean that they won't see
crashes or other potentially useful information.
To avoid this situation, only remove the boot console when it resides in
the init section. Code exists to replace the boot console by the proper
console when it is registered, keeping a seamless transition between the
boot and proper consoles.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
stop_machine.o is only built if CONFIG_SMP=y, so this ifdef always
evaluates to true.
[akpm@linux-foundation.org: remove now-unneeded ifdef]
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.
The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault. Till now we never cared about retries, but as the
userfaultfd code kind of relies on it. this needs some fix.
This patchset does not take care of the futex code. I will now look
closer at this.
This patch (of 2):
With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.
This patch brings in the logic to handle retries as well as it cleans up
the current documentation. fixup_user_fault was not having the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use. Unlike get_page() it may fail if the device is, or
is in the process of being, disabled. While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.
devm_memremap_pages() now requires a reference counter to be specified
at init time. For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".
ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list. That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert. This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>