Commit Graph

1640 Commits

Author SHA1 Message Date
Linus Torvalds
e06fdaf40a Merge tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull structure randomization updates from Kees Cook:
 "Now that IPC and other changes have landed, enable manual markings for
  randstruct plugin, including the task_struct.

  This is the rest of what was staged in -next for the gcc-plugins, and
  comes in three patches, largest first:

   - mark "easy" structs with __randomize_layout

   - mark task_struct with an optional anonymous struct to isolate the
     __randomize_layout section

   - mark structs to opt _out_ of automated marking (which will come
     later)

  And, FWIW, this continues to pass allmodconfig (normal and patched to
  enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
  s390 for me"

* tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: opt-out externally exposed function pointer structs
  task_struct: Allow randomized layout
  randstruct: Mark various structs for randomization
2017-07-19 08:55:18 -07:00
Akinobu Mita
168c42bc56 fault-inject: add /proc/<pid>/fail-nth
fail-nth interface is only created in /proc/self/task/<current-tid>/.
This change also adds it in /proc/<pid>/.

This makes shell based tool a bit simpler.

	$ bash -c "builtin echo 100 > /proc/self/fail-nth && exec ls /"

Link: http://lkml.kernel.org/r/1491490561-10485-6-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
1203c8e6fb fault-inject: simplify access check for fail-nth
The fail-nth file is created with 0666 and the access is permitted if
and only if the task is current.

This file is owned by the currnet user.  So we can create it with 0644
and allow the owner to write it.  This enables to watch the status of
task->fail_nth from another processes.

[akinobu.mita@gmail.com: don't convert unsigned type value as signed int]
  Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com
[akinobu.mita@gmail.com: avoid unwanted data race to task->fail_nth]
  Link: http://lkml.kernel.org/r/1499962492-8931-1-git-send-email-akinobu.mita@gmail.com
Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
bfc740938d fault-inject: make fail-nth read/write interface symmetric
The read interface for fail-nth looks a bit odd.  Read from this file
returns "NYYYY..." or "YYYYY..." (this makes me surprise when cat this
file).  Because there is no EOF condition.  The first character
indicates current->fail_nth is zero or not, and then current->fail_nth
is reset to zero.

Just returning task->fail_nth value is more natural to understand.

Link: http://lkml.kernel.org/r/1491490561-10485-4-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
9049f2f6e7 fault-inject: parse as natural 1-based value for fail-nth write interface
The value written to fail-nth file is parsed as 0-based.  Parsing as
one-based is more natural to understand and it enables to cancel the
previous setup by simply writing '0'.

This change also converts task->fail_nth from signed to unsigned int.

Link: http://lkml.kernel.org/r/1491490561-10485-3-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
ecaad81ca0 fault-inject: automatically detect the number base for fail-nth write interface
Automatically detect the number base to use when writing to fail-nth
file instead of always parsing as a decimal number.

Link: http://lkml.kernel.org/r/1491490561-10485-2-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Linus Torvalds
ad51271afc Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

- various misc things

- kexec updates

- sysctl core updates

- scripts/gdb udpates

- checkpoint-restart updates

- ipc updates

- kernel/watchdog updates

- Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

- "stackprotector: ascii armor the stack canary"

- more MM bits

- checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  writeback: rework wb_[dec|inc]_stat family of functions
  ARM: samsung: usb-ohci: move inline before return type
  video: fbdev: omap: move inline before return type
  video: fbdev: intelfb: move inline before return type
  USB: serial: safe_serial: move __inline__ before return type
  drivers: tty: serial: move inline before return type
  drivers: s390: move static and inline before return type
  x86/efi: move asmlinkage before return type
  sh: move inline before return type
  MIPS: SMP: move asmlinkage before return type
  m68k: coldfire: move inline before return type
  ia64: sn: pci: move inline before type
  ia64: move inline before return type
  FRV: tlbflush: move asmlinkage before return type
  CRIS: gpio: move inline before return type
  ARM: HP Jornada 7XX: move inline before return type
  ARM: KVM: move asmlinkage before type
  checkpatch: improve the STORAGE_CLASS test
  mm, migration: do not trigger OOM killer when migrating memory
  drm/i915: use __GFP_RETRY_MAYFAIL
  ...
2017-07-13 12:38:49 -07:00
Linus Torvalds
4ca6df1348 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull sysctl fix from Eric Biederman:
 "A rather embarassing and hard to hit bug was merged into 4.11-rc1.

  Andrei Vagin tracked this bug now and after some staring at the code
  I came up with a fix"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix proc_sys_prune_dcache to hold a sb reference
2017-07-12 19:43:20 -07:00
Dmitry Vyukov
e41d58185f fault-inject: support systematic fault injection
Add /proc/self/task/<current-tid>/fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:

 "Write to this file of integer N makes N-th call in the current task
  fail (N is 0-based). Read from this file returns a single char 'Y' or
  'N' that says if the fault setup with a previous write to this file
  was injected or not, and disables the fault if it wasn't yet injected.
  Note that this file enables all types of faults (slab, futex, etc).
  This setting takes precedence over all other generic settings like
  probability, interval, times, etc. But per-capability settings (e.g.
  fail_futex/ignore-private) take precedence over it. This feature is
  intended for systematic testing of faults in a single system call. See
  an example below"

Why add a new setting:
1. Existing settings are global rather than per-task.
   So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
   which is non reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
   of all of probability, interval, times, space, task-filter and
   unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
   types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
   stress testing better be done from an unprivileged user.
   Similarly, this would require opening the debugfs files to the
   unprivileged user, as he would need to reopen at least times file
   (not possible to pre-open before dropping privs).

The proposed interface solves all of the above (see the example).

We want to integrate this into syzkaller fuzzer.  A prototype has found
10 bugs in kernel in first day of usage:

  https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned.  So I am fine with the current
version of the code.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Luis R. Rodriguez
61d9b56a89 sysctl: add unsigned int range support
To keep parity with regular int interfaces provide the an unsigned int
proc_douintvec_minmax() which allows you to specify a range of allowed
valid numbers.

Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
actual user for that.

Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez
4f2fec00af sysctl: simplify unsigned int support
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") added proc_douintvec() to start help adding support for
unsigned int, this however was only half the work needed.  Two fixes
have come in since then for the following issues:

  o Printing the values shows a negative value, this happens since
    do_proc_dointvec() and this uses proc_put_long()

This was fixed by commit 5380e5644a ("sysctl: don't print negative
flag for proc_douintvec").

  o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
  o We echo negative values in and they are accepted

This was fixed by commit 425fffd886 ("sysctl: report EINVAL if value
is larger than UINT_MAX for proc_douintvec").

It still also failed to be added to sysctl_check_table()...  instead of
adding it with the current implementation just provide a proper and
simplified unsigned int support without any array unsigned int support
with no negative support at all.

Historically sysctl proc helpers have supported arrays, due to the
complexity this adds though we've taken a step back to evaluate array
users to determine if its worth upkeeping for unsigned int.  An
evaluation using Coccinelle has been done to perform a grammatical
search to ask ourselves:

  o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
	Answer: about 8
	- Of these how many are array users ?
		Answer: Probably only 1
  o How many sysctl array users exist ?
	Answer: about 12

This last question gives us an idea just how popular arrays: they are not.
Array support should probably just be kept for strings.

The identified uint ports are:

  drivers/infiniband/core/ucma.c - max_backlog
  drivers/infiniband/core/iwcm.c - default_backlog
  net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
  net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
  net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
  net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
  net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
  net/phonet/sysctl.c proc_local_port_range()

The only possible array users is proc_local_port_range() but it does not
seem worth it to add array support just for this given the range support
works just as well.  Unsigned int support should be desirable more for
when you *need* more than INT_MAX or using int min/max support then does
not suffice for your ranges.

If you forget and by mistake happen to register an unsigned int proc
entry with an array, the driver will fail and you will get something as
follows:

sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
CPU: 2 PID: 1342 Comm: modprobe Tainted: G        W   E <etc>
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <etc>
Call Trace:
 dump_stack+0x63/0x81
 __register_sysctl_table+0x350/0x650
 ? kmem_cache_alloc_trace+0x107/0x240
 __register_sysctl_paths+0x1b3/0x1e0
 ? 0xffffffffc005f000
 register_sysctl_table+0x1f/0x30
 test_sysctl_init+0x10/0x1000 [test_sysctl]
 do_one_initcall+0x52/0x1a0
 ? kmem_cache_alloc_trace+0x107/0x240
 do_init_module+0x5f/0x200
 load_module+0x1867/0x1bd0
 ? __symbol_put+0x60/0x60
 SYSC_finit_module+0xdf/0x110
 SyS_finit_module+0xe/0x10
 entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f042b22d119
<etc>

Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Liping Zhang <zlpnobody@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez
89c5b53b16 sysctl: fix lax sysctl_check_table() sanity check
Patch series "sysctl: few fixes", v5.

I've been working on making kmod more deterministic, and as I did that I
couldn't help but notice a few issues with sysctl.  My end goal was just
to fix unsigned int support, which back then was completely broken.
Liping Zhang has sent up small atomic fixes, however it still missed yet
one more fix and Alexey Dobriyan had also suggested to just drop array
support given its complexity.

I have inspected array support using Coccinelle and indeed its not that
popular, so if in fact we can avoid it for new interfaces, I agree its
best.

I did develop a sysctl stress driver but will hold that off for another
series.

This patch (of 5):

Commit 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
improved sanity checks considerbly, however the enhancements on
sysctl_check_table() meant adding a functional change so that only the
last table entry's sanity error is propagated.  It also changed the way
errors were propagated so that each new check reset the err value, this
means only last sanity check computed is used for an error.  This has
been in the kernel since v3.4 days.

Fix this by carrying on errors from previous checks and iterations as we
traverse the table and ensuring we keep any error from previous checks.
We keep iterating on the table even if an error is found so we can
complain for all errors found in one shot.  This works as -EINVAL is
always returned on error anyway, and the check for error is any non-zero
value.

Fixes: 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Eric W. Biederman
2fd1d2c4ce proc: Fix proc_sys_prune_dcache to hold a sb reference
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
>  dump_stack+0x63/0x86
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  umount_check+0x6e/0x80
>  d_walk+0xc6/0x270
>  ? dentry_free+0x80/0x80
>  do_one_tree+0x26/0x40
>  shrink_dcache_for_umount+0x2d/0x90
>  generic_shutdown_super+0x1f/0xf0
>  kill_anon_super+0x12/0x20
>  proc_kill_sb+0x40/0x50
>  deactivate_locked_super+0x43/0x70
>  deactivate_super+0x5a/0x60
>  cleanup_mnt+0x3f/0x90
>  mntput_no_expire+0x13b/0x190
>  kern_unmount+0x3e/0x50
>  pid_ns_release_proc+0x15/0x20
>  proc_cleanup_work+0x15/0x20
>  process_one_work+0x197/0x450
>  worker_thread+0x4e/0x4a0
>  kthread+0x109/0x140
>  ? process_one_work+0x450/0x450
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0

Fix this by taking a reference to the super block in proc_sys_prune_dcache.

The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used.  This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list.  Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks.  The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.

Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to.  The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-11 11:01:24 -05:00
Heiner Kallweit
cde1b69389 fs/proc/generic.c: switch to ida_simple_get/remove
The code can be much simplified by switching to ida_simple_get/remove.

Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Vasily Averin
8c03cc85a0 fs/proc/task_mmu.c: remove obsolete comment in show_map_vma()
After commit 1be7107fbe ("mm: larger stack guard gap, between vmas")
we do not hide stack guard page in /proc/<pid>/maps

Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Linus Torvalds
55a7b2125c Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:

 - RAS reporting via GHES/APEI (ACPI)

 - Indirect ftrace trampolines for modules

 - Improvements to kernel fault reporting

 - Page poisoning

 - Sigframe cleanups and preparation for SVE context

 - Core dump fixes

 - Sparse fixes (mainly relating to endianness)

 - xgene SoC PMU v3 driver

 - Misc cleanups and non-critical fixes

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (75 commits)
  arm64: fix endianness annotation for 'struct jit_ctx' and friends
  arm64: cpuinfo: constify attribute_group structures.
  arm64: ptrace: Fix incorrect get_user() use in compat_vfp_set()
  arm64: ptrace: Remove redundant overrun check from compat_vfp_set()
  arm64: ptrace: Avoid setting compat FP[SC]R to garbage if get_user fails
  arm64: fix endianness annotation for __apply_alternatives()/get_alt_insn()
  arm64: fix endianness annotation in get_kaslr_seed()
  arm64: add missing conversion to __wsum in ip_fast_csum()
  arm64: fix endianness annotation in acpi_parking_protocol.c
  arm64: use readq() instead of readl() to read 64bit entry_point
  arm64: fix endianness annotation for reloc_insn_movw() & reloc_insn_imm()
  arm64: fix endianness annotation for aarch64_insn_write()
  arm64: fix endianness annotation in aarch64_insn_read()
  arm64: fix endianness annotation in call_undef_hook()
  arm64: fix endianness annotation for debug-monitors.c
  ras: mark stub functions as 'inline'
  arm64: pass endianness info to sparse
  arm64: ftrace: fix !CONFIG_ARM64_MODULE_PLTS kernels
  arm64: signal: Allow expansion of the signal frame
  acpi: apei: check for pending errors when probing GHES entries
  ...
2017-07-05 17:09:27 -07:00
Kees Cook
3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
Will Deacon
3edb1dd13c Merge branch 'aarch64/for-next/ras-apei' into aarch64/for-next/core
Merge in arm64 ACPI RAS support (APEI/GHES) from Tyler Baicar.
2017-06-26 10:54:27 +01:00
Ard Biesheuvel
737326aa51 fs/proc: kcore: use kcore_list type to check for vmalloc/module address
Instead of passing each start address into is_vmalloc_or_module_addr()
to decide whether it falls into either the VMALLOC or the MODULES region,
we can simply check the type field of the current kcore_list entry, since
it will be set to KCORE_VMALLOC based on exactly the same conditions.

As a bonus, when reading the KCORE_TEXT region on architectures that have
one, this will avoid using vread() on the region if it happens to intersect
with a KCORE_VMALLOC region. This is due the fact that the KCORE_TEXT
region is the first one to be added to the kcore region list.

Reported-by: Tan Xiaojun <tanxiaojun@huawei.com>
Tested-by: Tan Xiaojun <tanxiaojun@huawei.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-06-20 12:42:57 +01:00
Hugh Dickins
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
Linus Torvalds
f511c0b17b "Yes, people use FOLL_FORCE ;)"
This effectively reverts commit 8ee74a91ac ("proc: try to remove use
of FOLL_FORCE entirely")

It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:

Keno Fischer:
 "We used these semantics as a hardening mechanism in the julia JIT. By
  opening /proc/self/mem and using these semantics, we could avoid
  needing RWX pages, or a dual mapping approach. We do have fallbacks to
  these other methods (though getting EIO here actually causes an assert
  in released versions - we'll updated that to make sure to take the
  fall back in that case).

  Nevertheless the /proc/self/mem approach was our favored approach
  because it a) Required an attacker to be able to execute syscalls
  which is a taller order than getting memory write and b) didn't double
  the virtual address space requirements (as a dual mapping approach
  would).

  I think in general this feature is very useful for anybody who needs
  to precisely control the execution of some other process. Various
  debuggers (gdb/lldb/rr) certainly fall into that category, but there's
  another class of such processes (wine, various emulators) which may
  want to do that kind of thing.

  Now, I suspect most of these will have the other process under ptrace
  control, so maybe allowing (same_mm || ptraced) would be ok, but at
  least for the sandbox/remote-jit use case, it would be perfectly
  reasonable to not have the jit server be a ptracer"

Robert O'Callahan:
 "We write to readonly code and data mappings via /proc/.../mem in lots
  of different situations, particularly when we're adjusting program
  state during replay to match the recorded execution.

  Like Julia, we can add workarounds, but they could be expensive."

so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.

With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"

Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 12:38:59 -07:00
Linus Torvalds
8ee74a91ac proc: try to remove use of FOLL_FORCE entirely
We fixed the bugs in it, but it's still an ugly interface, so let's see
if anybody actually depends on it.  It's entirely possible that nothing
actually requires the whole "punch through read-only mappings"
semantics.

For example, gdb definitely uses the /proc/<pid>/mem interface, but it
looks like it mainly does it for regular reads of the target (that don't
need FOLL_FORCE), and looking at the gdb source code seems to fall back
on the traditional ptrace(PTRACE_POKEDATA) interface if it needs to.

If this breaks something, I do have a (more complex) version that only
enables FOLL_FORCE when somebody has PTRACE_ATTACH'ed to the target,
like the comment here used to say ("Maybe we should limit FOLL_FORCE to
actual ptrace users?").

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-09 08:45:16 -07:00
Kirill Tkhai
eaa0d190bf pidns: expose task pid_ns_for_children to userspace
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.

It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of
their work.

This patch solves the problem, and it exposes pid_ns_for_children to ns
directory in standard way with the name "pid_for_children":

  ~# ls /proc/5531/ns -l | grep pid
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Tobin C. Harding
f245e1c17a fs/proc/inode.c: remove cast from memory allocation
Coccinelle emits this warning:

  WARNING: casting value returned by memory allocation function to (struct proc_inode *) is useless.

Remove unnecessary cast.

Link: http://lkml.kernel.org/r/1487745720-16967-1-git-send-email-me@tobin.cc
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:10 -07:00
Linus Torvalds
e579dde654 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a set of small fixes that were mostly stumbled over during
  more significant development. This proc fix and the fix to
  posix-timers are the most significant of the lot.

  There is a lot of good development going on but unfortunately it
  didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix unbalanced hard link numbers
  signal: Make kill_proc_info static
  rlimit: Properly call security_task_setrlimit
  signal: Remove unused definition of sig_user_definied
  ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
  ipc: Remove unused declaration of recompute_msgmni
  posix-timers: Correct sanity check in posix_cpu_nsleep
  sysctl: Remove dead register_sysctl_root
2017-05-05 11:08:43 -07:00