Commit Graph

580 Commits

Author SHA1 Message Date
Oleg Nesterov
e6bc60b8fb tracing: Fix syscall_*regfunc() vs copy_process() race
commit 4af4206be2 upstream.

syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.

Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.

Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

Fixes: a871bd33a6 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:16 -07:00
Matthew Dempsky
1a2d973242 ptrace: fix fork event messages across pid namespaces
commit 4e52365f27 upstream.

When tracing a process in another pid namespace, it's important for fork
event messages to contain the child's pid as seen from the tracer's pid
namespace, not the parent's.  Otherwise, the tracer won't be able to
correlate the fork event with later SIGTRAP signals it receives from the
child.

We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value.  However, sending a
bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.

Signed-off-by: Matthew Dempsky <mdempsky@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Julien Tinnes <jln@chromium.org>
Cc: Roland McGrath <mcgrathr@chromium.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:42 -07:00
Rik van Riel
d303cf4624 mm: fix TLB flush race between migration, and change_protection_range
commit 2084140594 upstream.

There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.

This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.

[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:23 -08:00
Oleg Nesterov
f608ebd760 pidns: fix vfork() after unshare(CLONE_NEWPID)
commit e79f525e99 upstream.

Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() >= 0);
		_exit(0);
		return 0;
	}

fails without this patch.

Change this check to use CLONE_SIGHAND instead.  This also forbids
CLONE_THREAD automatically, and this is what the comment implies.

We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this.  The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.

Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Colin Walters <walters@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 17:18:28 -07:00
Michal Simek
4f01c72ef3 microblaze: fix clone syscall
commit dfa9771a7c upstream.

Fix inadvertent breakage in the clone syscall ABI for Microblaze that
was introduced in commit f3268edbe6 ("microblaze: switch to generic
fork/vfork/clone").

The Microblaze syscall ABI for clone takes the parent tid address in the
4th argument; the third argument slot is used for the stack size.  The
incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd
slot.

This commit restores the original ABI so that existing userspace libc
code will work correctly.

All kernel versions from v3.8-rc1 were affected.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20 08:43:02 -07:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Linus Torvalds
08d7676083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
 "Mostly about syscall wrappers this time; there will be another pile
  with patches in the same general area from various people, but I'd
  rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  syscalls.h: slightly reduce the jungles of macros
  get rid of union semop in sys_semctl(2) arguments
  make do_mremap() static
  sparc: no need to sign-extend in sync_file_range() wrapper
  ppc compat wrappers for add_key(2) and request_key(2) are pointless
  x86: trim sys_ia32.h
  x86: sys32_kill and sys32_mprotect are pointless
  get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
  merge compat sys_ipc instances
  consolidate compat lookup_dcookie()
  convert vmsplice to COMPAT_SYSCALL_DEFINE
  switch getrusage() to COMPAT_SYSCALL_DEFINE
  switch epoll_pwait to COMPAT_SYSCALL_DEFINE
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
  switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
  make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
  make HAVE_SYSCALL_WRAPPERS unconditional
  consolidate cond_syscall and SYSCALL_ALIAS declarations
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
  get rid of duplicate logics in __SC_....[1-6] definitions
2013-05-01 07:21:43 -07:00
Linus Torvalds
16fa94b532 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this development cycle were:

   - full dynticks preparatory work by Frederic Weisbecker

   - factor out the cpu time accounting code better, by Li Zefan

   - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

   - various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched: Fix init NOHZ_IDLE flag
  sched: Prevent to re-select dst-cpu in load_balance()
  sched: Rename load_balance_tmpmask to load_balance_mask
  sched: Move up affinity check to mitigate useless redoing overhead
  sched: Don't consider other cpus in our group in case of NEWLY_IDLE
  sched: Explicitly cpu_idle_type checking in rebalance_domains()
  sched: Change position of resched_cpu() in load_balance()
  sched: Fix wrong rq's runnable_avg update with rt tasks
  sched: Document task_struct::personality field
  sched/cpuacct/UML: Fix header file dependency bug on the UML build
  cgroup: Kill subsys.active flag
  sched/cpuacct: No need to check subsys active state
  sched/cpuacct: Initialize cpuacct subsystem earlier
  sched/cpuacct: Initialize root cpuacct earlier
  sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
  sched/cpuacct: Clean up cpuacct.h
  sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
  sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
  sched/cpuacct: Add cpuacct_acount_field()
  sched/cpuacct: Add cpuacct_init()
  ...
2013-04-30 07:43:28 -07:00
Kent Overstreet
cafe563591 bcache: A block layer cache
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.

See the wiki at http://bcache.evilpiepirate.org for more.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-03-23 16:11:31 -07:00
Eric W. Biederman
e66eded830 userns: Don't allow CLONE_NEWUSER | CLONE_FS
Don't allowing sharing the root directory with processes in a
different user namespace.  There doesn't seem to be any point, and to
allow it would require the overhead of putting a user namespace
reference in fs_struct (for permission checks) and incrementing that
reference count on practically every call to fork.

So just perform the inexpensive test of forbidding sharing fs_struct
acrosss processes in different user namespaces.  We already disallow
other forms of threading when unsharing a user namespace so this
should be no real burden in practice.

This updates setns, clone, and unshare to disallow multiple user
namespaces sharing an fs_struct.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:00:20 -07:00
Frederic Weisbecker
9fbc42eac1 cputime: Dynamically scale cputime for full dynticks accounting
The full dynticks cputime accounting is able to account either
using the tick or the context tracking subsystem. This way
the housekeeping CPU can keep the low overhead tick based
solution.

This latter mode has a low jiffies resolution granularity and
need to be scaled against CFS precise runtime accounting to
improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING,
now we also need to expand it to full dynticks accounting dynamic
off-case as well.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-07 17:10:32 +01:00
Al Viro
2cf0966683 make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03 22:58:33 -05:00
Alan Cox
6f977e6b2f fork: unshare: remove dead code
If new_nsproxy is set we will always call switch_task_namespaces and
then set new_nsproxy back to NULL so the reassignment and fall through
check are redundant

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:12 -08:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Ingo Molnar
b2c77a57e4 Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:

 "This implements the cputime accounting on full dynticks CPUs.

  Typical cputime stats infrastructure relies on the timer tick and
  its periodic polling on the CPU to account the amount of time
  spent by the CPUs and the tasks per high level domains such as
  userspace, kernelspace, guest, ...

  Now we are preparing to implement full dynticks capability on
  Linux for Real Time and HPC users who want full CPU isolation.
  This feature requires a cputime accounting that doesn't depend
  on the timer tick.

  To implement it, this new cputime infrastructure plugs into
  kernel/user/guest boundaries to take snapshots of cputime and
  flush these to the stats when needed. This performs pretty
  much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
  and cputime snaphots are synchronized between write and read
  side such that the latter can safely retrieve the pending tickless
  cputime of a task and add it to its latest cputime snapshot to
  return the correct result to the user."

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-05 13:10:33 +01:00
Frederic Weisbecker
6a61671bb2 cputime: Safely read cputime of full dynticks CPUs
While remotely reading the cputime of a task running in a
full dynticks CPU, the values stored in utime/stime fields
of struct task_struct may be stale. Its values may be those
of the last kernel <-> user transition time snapshot and
we need to add the tickless time spent since this snapshot.

To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then on top of this snapshot and the current
time, perform the fixup on the reader side from task_times()
accessors.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
2013-01-27 20:35:47 +01:00
Linus Torvalds
3a142ed962 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull misc syscall fixes from Al Viro:

 - compat syscall fixes (discussed back in December)

 - a couple of "make life easier for sigaltstack stuff by reducing
   inter-tree dependencies"

 - fix up compiler/asmlinkage calling convention disagreement of
   sys_clone()

 - misc

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  sys_clone() needs asmlinkage_protect
  make sure that /linuxrc has std{in,out,err}
  x32: fix sigtimedwait
  x32: fix waitid()
  switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
  switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
  CONFIG_GENERIC_SIGALTSTACK build breakage with asm-generic/syscalls.h
  Ensure that kernel_init_freeable() is not inlined into non __init code
2013-01-20 13:58:48 -08:00
Al Viro
b1e0318b8c sys_clone() needs asmlinkage_protect
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-01-19 22:13:34 -05:00
Eric W. Biederman
8382fcac1b pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

Creates a new process in the new pid namespace without setting
pid_ns->child_reaper.  After forking this results in a NULL
pointer dereference.

Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_prodcess.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Acked-by:  Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-24 22:53:14 -08:00
Linus Torvalds
54d46ea993 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "sigaltstack infrastructure + conversion for x86, alpha and um,
  COMPAT_SYSCALL_DEFINE infrastructure.

  Note that there are several conflicts between "unify
  SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
  resolution is trivial - just remove definitions of SS_ONSTACK and
  SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
  include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to generic sigaltstack
  new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
  generic compat_sys_sigaltstack()
  introduce generic sys_sigaltstack(), switch x86 and um to it
  new helper: compat_user_stack_pointer()
  new helper: restore_altstack()
  unify SS_ONSTACK/SS_DISABLE definitions
  new helper: current_user_stack_pointer()
  missing user_stack_pointer() instances
  Bury the conditionals from kernel_thread/kernel_execve series
  COMPAT_SYSCALL_DEFINE: infrastructure
2012-12-20 18:05:28 -08:00
Al Viro
ae903caae2 Bury the conditionals from kernel_thread/kernel_execve series
All architectures have
	CONFIG_GENERIC_KERNEL_THREAD
	CONFIG_GENERIC_KERNEL_EXECVE
	__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-19 18:07:38 -05:00
Glauber Costa
2ad306b17c fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if they
go over limit.

For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path.  Once the slab is also tracked by memcg, we can
get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 15:02:13 -08:00
Linus Torvalds
6a2b60b17b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "While small this set of changes is very significant with respect to
  containers in general and user namespaces in particular.  The user
  space interface is now complete.

  This set of changes adds support for unprivileged users to create user
  namespaces and as a user namespace root to create other namespaces.
  The tyranny of supporting suid root preventing unprivileged users from
  using cool new kernel features is broken.

  This set of changes completes the work on setns, adding support for
  the pid, user, mount namespaces.

  This set of changes includes a bunch of basic pid namespace
  cleanups/simplifications.  Of particular significance is the rework of
  the pid namespace cleanup so it no longer requires sending out
  tendrils into all kinds of unexpected cleanup paths for operation.  At
  least one case of broken error handling is fixed by this cleanup.

  The files under /proc/<pid>/ns/ have been converted from regular files
  to magic symlinks which prevents incorrect caching by the VFS,
  ensuring the files always refer to the namespace the process is
  currently using and ensuring that the ptrace_mayaccess permission
  checks are always applied.

  The files under /proc/<pid>/ns/ have been given stable inode numbers
  so it is now possible to see if different processes share the same
  namespaces.

  Through the David Miller's net tree are changes to relax many of the
  permission checks in the networking stack to allowing the user
  namespace root to usefully use the networking stack.  Similar changes
  for the mount namespace and the pid namespace are coming through my
  tree.

  Two small changes to add user namespace support were commited here adn
  in David Miller's -net tree so that I could complete the work on the
  /proc/<pid>/ns/ files in this tree.

  Work remains to make it safe to build user namespaces and 9p, afs,
  ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
  Kconfig guard remains in place preventing that user namespaces from
  being built when any of those filesystems are enabled.

  Future design work remains to allow root users outside of the initial
  user namespace to mount more than just /proc and /sys."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
  proc: Usable inode numbers for the namespace file descriptors.
  proc: Fix the namespace inode permission checks.
  proc: Generalize proc inode allocation
  userns: Allow unprivilged mounts of proc and sysfs
  userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
  procfs: Print task uids and gids in the userns that opened the proc file
  userns: Implement unshare of the user namespace
  userns: Implent proc namespace operations
  userns: Kill task_user_ns
  userns: Make create_new_namespaces take a user_ns parameter
  userns: Allow unprivileged use of setns.
  userns: Allow unprivileged users to create new namespaces
  userns: Allow setting a userns mapping to your current uid.
  userns: Allow chown and setgid preservation
  userns: Allow unprivileged users to create user namespaces.
  userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
  userns: fix return value on mntns_install() failure
  vfs: Allow unprivileged manipulation of the mount namespace.
  vfs: Only support slave subtrees across different user namespaces
  vfs: Add a user namespace reference from struct mnt_namespace
  ...
2012-12-17 15:44:47 -08:00