Commit Graph

1314 Commits

Author SHA1 Message Date
David Drysdale 51f39a1f0c syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts).  The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns.  The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.

This patch (of 4):

Add a new execveat(2) system call.  execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers.  This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).

Based on patches by Meredydd Luff.

Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Vladimir Davydov 6f185c290e memcg: turn memcg_kmem_skip_account into a bit field
It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
the task_struct.

Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
set/clear the flag inline.  Regarding the overwhelming comment to the
helpers, which is removed by this patch too, we already have a compact yet
accurate explanation in memcg_schedule_cache_create, no need in yet
another one.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Linus Torvalds 86c6a2fddf Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

     This instruments might_sleep() checks to catch places that nest
     blocking primitives - such as mutex usage in a wait loop.  Such
     bugs can result in hard to debug races/hangs.

     Another category of invalid nesting that this facility will detect
     is the calling of blocking functions from within schedule() ->
     sched_submit_work() -> blk_schedule_flush_plug().

     There's some potential for false positives (if secondary blocking
     primitives themselves are not ready yet for this facility), but the
     kernel will warn once about such bugs per bootup, so the warning
     isn't much of a nuisance.

     This feature comes with a number of fixes, for problems uncovered
     with it, so no messages are expected normally.

   - Another round of sched/numa optimizations and refinements, for
     CONFIG_NUMA_BALANCING=y.

   - Another round of sched/dl fixes and refinements.

  Plus various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched: Add missing rcu protection to wake_up_all_idle_cpus
  sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
  sched/numa: Init numa balancing fields of init_task
  sched/deadline: Remove unnecessary definitions in cpudeadline.h
  sched/cpupri: Remove unnecessary definitions in cpupri.h
  sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
  sched/fair: Fix stale overloaded status in the busiest group finding logic
  sched: Move p->nr_cpus_allowed check to select_task_rq()
  sched/completion: Document when to use wait_for_completion_io_*()
  sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
  sched/fair: Kill task_struct::numa_entry and numa_group::task_list
  sched: Refactor task_struct to use numa_faults instead of numa_* pointers
  sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
  sched/deadline: Reschedule from switched_from_dl() after a successful pull
  sched/deadline: Push task away if the deadline is equal to curr during wakeup
  sched/deadline: Add deadline rq status print
  sched/deadline: Fix artificial overrun introduced by yield_task_dl()
  sched/rt: Clean up check_preempt_equal_prio()
  sched/core: Use dl_bw_of() under rcu_read_lock_sched()
  sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
  ...
2014-12-09 21:21:34 -08:00
Iulia Manda 44dba3d5d6 sched: Refactor task_struct to use numa_faults instead of numa_* pointers
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).

A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.

All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:57 +01:00
Pranith Kumar 28f6569ab7 rcu: Remove redundant TREE_PREEMPT_RCU config option
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:20:05 -07:00
Peter Zijlstra 3427445afd sched: Exclude cond_resched() from nested sleep test
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.

In particular, preemption preserves task_struct::state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:56:57 +01:00
Peter Zijlstra 8eb23b9f35 sched: Debug nested sleeps
Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.

Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.

Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:56:52 +01:00
Juri Lelli f82f80426f sched/deadline: Ensure that updates to exclusive cpusets don't break AC
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.

This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:48:00 +01:00
Juri Lelli 7f51412a41 sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:

 - No check is performed when the user tries to attach a task to
   an exlusive cpuset (recall that exclusive cpusets have an
   associated maximum allowed bandwidth).

 - Bandwidths of source and destination cpusets are not correctly
   updated after a task is migrated between them.

This patch fixes both things at once, as they are opposite faces
of the same coin.

The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.

Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:58 +01:00
Linus Torvalds faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Linus Torvalds d6dd50e07c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...
2014-10-13 15:44:12 +02:00
Junxiao Bi 934f3072c1 mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set
commit 21caf2fc19 ("mm: teach mm by current context info to not do I/O
during memory allocation") introduces PF_MEMALLOC_NOIO flag to avoid doing
I/O inside memory allocation, __GFP_IO is cleared when this flag is set,
but __GFP_FS implies __GFP_IO, it should also be cleared.  Or it may still
run into I/O, like in superblock shrinker.  And this will make the kernel
run into the deadlock case described in that commit.

See Dave Chinner's comment about io in superblock shrinker:

Filesystem shrinkers do indeed perform IO from the superblock shrinker and
have for years.  Even clean inodes can require IO before they can be freed
- e.g.  on an orphan list, need truncation of post-eof blocks, need to
wait for ordered operations to complete before it can be freed, etc.

IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts
of IO in the superblock shrinker context.  XFS, in particular, has been
doing transactions and IO from the VFS inode cache shrinker since it was
first introduced....

Fix this by clearing __GFP_FS in memalloc_noio_flags(), this function has
masked all the gfp_mask that will be passed into fs for the processes
setting PF_MEMALLOC_NOIO in the direct reclaim path.

v1 thread at: https://lkml.org/lkml/2014/9/3/32

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: joyce.xue <xuejiufei@huawei.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Linus Torvalds 87d7bcee4f Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 - add multibuffer infrastructure (single_task_running scheduler helper,
   OKed by Peter on lkml.
 - add SHA1 multibuffer implementation for AVX2.
 - reenable "by8" AVX CTR optimisation after fixing counter overflow.
 - add APM X-Gene SoC RNG support.
 - SHA256/SHA512 now handles unaligned input correctly.
 - set lz4 decompressed length correctly.
 - fix algif socket buffer allocation failure for 64K page machines.
 - misc fixes

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (47 commits)
  crypto: sha - Handle unaligned input data in generic sha256 and sha512.
  Revert "crypto: aesni - disable "by8" AVX CTR optimization"
  crypto: aesni - remove unused defines in "by8" variant
  crypto: aesni - fix counter overflow handling in "by8" variant
  hwrng: printk replacement
  crypto: qat - Removed unneeded partial state
  crypto: qat - Fix typo in name of tasklet_struct
  crypto: caam - Dynamic allocation of addresses for various memory blocks in CAAM.
  crypto: mcryptd - Fix typos in CRYPTO_MCRYPTD description
  crypto: algif - avoid excessive use of socket buffer in skcipher
  arm64: dts: add random number generator dts node to APM X-Gene platform.
  Documentation: rng: Add X-Gene SoC RNG driver documentation
  hwrng: xgene - add support for APM X-Gene SoC RNG support
  crypto: mv_cesa - Add missing #define
  crypto: testmgr - add test for lz4 and lz4hc
  crypto: lz4,lz4hc - fix decompression
  crypto: qat - Use pci_enable_msix_exact() instead of pci_enable_msix()
  crypto: drbg - fix maximum value checks on 32 bit systems
  crypto: drbg - fix sparse warning for cpu_to_be[32|64]
  crypto: sha-mb - sha1_mb_alg_state can be static
  ...
2014-10-08 06:44:48 -04:00
Linus Torvalds 6111da3432 Merge branch 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This is quite late but these need to be backported anyway.

  This is the fix for a long-standing cpuset bug which existed from
  2009.  cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the
  task's memory allocation behavior according to the settings of the
  cpuset it belongs to; unfortunately, when those flags have to be
  changed, cpuset did so directly even whlie the target task is running,
  which is obviously racy as task->flags may be modified by the task
  itself at any time.  This obscure bug manifested as corrupt
  PF_USED_MATH flag leading to a weird crash.

  The bug is fixed by moving the flag to task->atomic_flags.  The first
  two are prepatory ones to help defining atomic_flags accessors and the
  third one is the actual fix"

* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
  sched: add macros to define bitops for task atomic flags
  sched: fix confusing PFA_NO_NEW_PRIVS constant
2014-09-27 16:45:33 -07:00
Zefan Li 2ad654bc5e cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Cc: <stable@vger.kernel.org> # 2.6.31+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-24 22:16:06 -04:00
Zefan Li e0e5070b20 sched: add macros to define bitops for task atomic flags
This will simplify code when we add new flags.

v3:
- Kees pointed out that no_new_privs should never be cleared, so we
shouldn't define task_clear_no_new_privs(). we define 3 macros instead
of a single one.

v2:
- updated scripts/tags.sh, suggested by Peter

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-24 22:16:06 -04:00
Zefan Li a2b86f7722 sched: fix confusing PFA_NO_NEW_PRIVS constant
Commit 1d4457f999 ("sched: move no_new_privs into new atomic flags")
defined PFA_NO_NEW_PRIVS as hexadecimal value, but it is confusing
because it is used as bit number. Redefine it as decimal bit number.

Note this changes the bit position of PFA_NOW_NEW_PRIVS from 1 to 0.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
[ lizf: slightly modified subject and changelog ]
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-24 22:16:06 -04:00
Chuck Ebbert 6a40281ab5 sched: Fix end_of_stack() and location of stack canary for architectures using CONFIG_STACK_GROWSUP
Aaron Tomlin recently posted patches [1] to enable checking the
stack canary on every task switch. Looking at the canary code, I
realized that every arch (except ia64, which adds some space for
register spill above the stack) shares a definition of
end_of_stack() that makes it the first long after the
threadinfo.

For stacks that grow down, this low address is correct because
the stack starts at the end of the thread area and grows toward
lower addresses. However, for stacks that grow up, toward higher
addresses, this is wrong. (The stack actually grows away from
the canary.) On these archs end_of_stack() should return the
address of the last long, at the highest possible address for the stack.

[1] http://lkml.org/lkml/2014/9/12/293

Signed-off-by: Chuck Ebbert <cebbert.lkml@gmail.com>
Link: http://lkml.kernel.org/r/20140920101751.6c5166b6@as
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: James Hogan <james.hogan@imgtec.com> [metag]
Acked-by: James Hogan <james.hogan@imgtec.com>
Acked-by: Aaron Tomlin <atomlin@redhat.com>
2014-09-20 19:44:04 +02:00
Aaron Tomlin a70857e46d sched: Add helper for task stack page overrun checking
This facility is used in a few places so let's introduce
a helper function to improve code readability.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: oleg@redhat.com
Cc: riel@redhat.com
Cc: prarit@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: mpe@ellerman.id.au
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-3-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:23 +02:00
Aaron Tomlin d4311ff1a8 init/main.c: Give init_task a canary
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.

Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:

  http://marc.info/?l=linux-kernel&m=127144305403241&w=2

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:22 +02:00
Chuansheng Liu f6be8af1c9 sched: Add new API wake_up_if_idle() to wake up the idle cpu
Implementing one new API wake_up_if_idle(), which is used to
wake up the idle CPU.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: daniel.lezcano@linaro.org
Cc: rjw@rjwysocki.net
Cc: linux-pm@vger.kernel.org
Cc: changcheng.liu@intel.com
Cc: xiaoming.wang@intel.com
Cc: souvik.k.chakravarty@intel.com
Cc: chuansheng.liu@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409815075-4180-1-git-send-email-chuansheng.liu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:14 +02:00
Rik van Riel e78c349679 time, signal: Protect resource use statistics with seqlock
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-08 08:17:01 +02:00
Paul E. McKenney 1d082fd061 rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch()
The rcu_preempt_note_context_switch() function is on a scheduling fast
path, so it would be good to avoid disabling irqs.  The reason that irqs
are disabled is to synchronize process-level and irq-handler access to
the task_struct ->rcu_read_unlock_special bitmask.  This commit therefore
makes ->rcu_read_unlock_special instead be a union of bools with a short
allowing single-access checks in RCU's __rcu_read_unlock().  This results
in the process-level and irq-handler accesses being simple loads and
stores, so that irqs need no longer be disabled.  This commit therefore
removes the irq disabling from rcu_preempt_note_context_switch().

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:34 -07:00
Paul E. McKenney 176f8f7a52 rcu: Make TASKS_RCU handle nohz_full= CPUs
Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution.  There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state.  The grace period would therefore
extend indefinitely.  This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:30 -07:00
Paul E. McKenney 8315f42295 rcu: Add call_rcu_tasks()
This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
context switch (not preemption!) and userspace execution (not the idle
loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks.  Note that unlike other RCU flavors, these quiescent states
occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters.  This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single task-private linked list containing all tasks that have not
yet passed through a quiescent state.  If experience shows this assumption
to be incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:19 -07:00