Commit Graph

13911 Commits

Author SHA1 Message Date
Paul E. McKenney c8020a67e6 rcu: Protect rcu_node accesses during CPU stall warnings
The print_other_cpu_stall() function accesses a number of rcu_node
fields without protection from the ->lock.  In theory, this is not
a problem because the fields accessed are all integers, but in
practice the compiler can get nasty.  Therefore, the commit extends
the existing critical section to cover the entire loop body.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:51 -07:00
Paul E. McKenney 5fd4dc068c rcu: Avoid rcu_print_detail_task_stall_rnp() segfault
The rcu_print_detail_task_stall_rnp() function invokes
rcu_preempt_blocked_readers_cgp() to verify that there are some preempted
RCU readers blocking the current grace period outside of the protection
of the rcu_node structure's ->lock.  This means that the last blocked
reader might exit its RCU read-side critical section and remove itself
from the ->blkd_tasks list before the ->lock is acquired, resulting in
a segmentation fault when the subsequent code attempts to dereference
the now-NULL gp_tasks pointer.

This commit therefore moves the test under the lock.  This will not
have measurable effect on lock contention because this code is invoked
only when printing RCU CPU stall warnings, in other words, in the common
case, never.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:50 -07:00
Paul E. McKenney 115f7a7ca0 rcu: Apply for_each_rcu_flavor() to increment_cpu_stall_ticks()
The increment_cpu_stall_ticks() function listed each RCU flavor
explicitly, with an ifdef to handle preemptible RCU.  This commit
therefore applies for_each_rcu_flavor() to save a line of code.

Because this commit switches from a code-based enumeration of the
flavors of RCU to an rcu_state-list-based enumeration, it is no longer
possible to apply __get_cpu_var() to the per-CPU rcu_data structures.
We instead use __this_cpu_var() on the rcu_state structure's ->rda field
that references the corresponding rcu_data structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:50 -07:00
Paul E. McKenney b065a85354 rcu: Fix obsolete rcu_initiate_boost() header comment
Commit 1217ed1b (rcu: permit rcu_read_unlock() to be called while holding
runqueue locks) made rcu_initiate_boost() restore irq state when releasing
the rcu_node structure's ->lock, but failed to update the header comment
accordingly.  This commit therefore brings the header comment up to date.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney a82dcc7602 rcu: Make offline-CPU checking allow for indefinite delays
The rcu_implicit_offline_qs() function implicitly assumed that execution
would progress predictably when interrupts are disabled, which is of course
not guaranteed when running on a hypervisor.  Furthermore, this function
is short, and is called from one place only in a short function.

This commit therefore ensures that the timing is checked before
checking the condition, which guarantees correct behavior even given
indefinite delays.  It also inlines rcu_implicit_offline_qs() into
rcu_implicit_dynticks_qs().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney 5cc900cf55 rcu: Improve boost selection when moving tasks to root rcu_node
The rcu_preempt_offline_tasks() moves all tasks queued on a given leaf
rcu_node structure to the root rcu_node, which is done when the last CPU
corresponding the the leaf rcu_node structure goes offline.  Now that
RCU-preempt's synchronize_rcu_expedited() implementation blocks CPU-hotplug
operations during the initialization of each rcu_node structure's
->boost_tasks pointer, rcu_preempt_offline_tasks() can do a better job
of setting the root rcu_node's ->boost_tasks pointer.

The key point is that rcu_preempt_offline_tasks() runs as part of the
CPU-hotplug process, so that a concurrent synchronize_rcu_expedited()
is guaranteed to either have not started on the one hand (in which case
there is no boosting on behalf of the expedited grace period) or to be
completely initialized on the other (in which case, in the absence of
other priority boosting, all ->boost_tasks pointers will be initialized).
Therefore, if rcu_preempt_offline_tasks() finds that the ->boost_tasks
pointer is equal to the ->exp_tasks pointer, it can be sure that it is
correctly placed.

In the case where there was boosting ongoing at the time that the
synchronize_rcu_expedited() function started, different nodes might start
boosting the tasks blocking the expedited grace period at different times.
In this mixed case, the root node will either be boosting tasks for
the expedited grace period already, or it will start as soon as it gets
done boosting for the normal grace period -- but in this latter case,
the root node's tasks needed to be boosted in any case.

This commit therefore adds a check of the ->boost_tasks pointer against
the ->exp_tasks pointer to the list that prevents updating ->boost_tasks.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:50 -07:00
Paul E. McKenney b4270ee356 rcu: Permit RCU_NONIDLE() to be used from interrupt context
There is a need to use RCU from interrupt context, but either before
rcu_irq_enter() is called or after rcu_irq_exit() is called.  If the
interrupt occurs from idle, then lockdep-RCU will complain about such
uses, as they appear to be illegal uses of RCU from the idle loop.
In other environments, RCU_NONIDLE() could be used to properly protect
the use of RCU, but RCU_NONIDLE() currently cannot be invoked except
from process context.

This commit therefore modifies RCU_NONIDLE() to permit its use more
globally.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:42:49 -07:00
Paul E. McKenney 1e3fd2b38c rcu: Properly initialize ->boost_tasks on CPU offline
When rcu_preempt_offline_tasks() clears tasks from a leaf rcu_node
structure, it does not NULL out the structure's ->boost_tasks field.
This commit therefore fixes this issue.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney 818615c4cd rcu: Pull TINY_RCU dyntick-idle tracing into non-idle region
Because TINY_RCU's idle detection keys directly off of the nesting
level, rather than from a separate variable as in TREE_RCU, the
TINY_RCU dyntick-idle tracing on transition to idle must happen
before the change to the nesting level.  This commit therefore makes
this change by passing the desired new value (rather than the old value)
of the nesting level in to rcu_idle_enter_common().

[ paulmck: Add fix for wrong-variable bug spotted by
  Michael Wang <wangyun@linux.vnet.ibm.com>. ]

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney e3ebfb96f3 rcu: Add PROVE_RCU_DELAY to provoke difficult races
There have been some recent bugs that were triggered only when
preemptible RCU's __rcu_read_unlock() was preempted just after setting
->rcu_read_lock_nesting to INT_MIN, which is a low-probability event.
Therefore, reproducing those bugs (to say nothing of gaining confidence
in alleged fixes) was quite difficult.  This commit therefore creates
a new debug-only RCU kernel config option that forces a short delay
in __rcu_read_unlock() to increase the probability of those sorts of
bugs occurring.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:42:49 -07:00
Paul E. McKenney a10d206ef1 rcu: Fix day-one dyntick-idle stall-warning bug
Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

That is, unless RCU CPU stall warnings are also enabled, as they are
in recent kernels.  In this case, if a CPU wakes up after at least one
minute of inactivity, an RCU CPU stall warning will result.  The reason
that no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that consistently get into this situation.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait for
another grace period, never mind the fact that the grace period that they
were waiting for just completed.  This CPU can therefore erroneously
decide to start a new grace period.  Note that this can happen in
TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
considerations mean that the CPU that detected the end of the grace
period is not necessarily officially informed of this fact for some time.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU is officially notified that the old grace period
has ended.

This change is to cpu_needs_another_gp(), which is called in a number
of places.  The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.

Reported-by: Becky Bruce <bgillbruce@gmail.com>
Reported-by: Subodh Nijsure <snijsure@grid-net.com>
Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org
2012-09-23 07:31:52 -07:00
John Stultz cee58483cf time: Move ktime_t overflow checking into timespec_valid_strict
Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b14526c ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.

Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.

This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.

Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-01 10:24:48 -07:00
Linus Torvalds 7ca63ee1b0 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree contains misc fixlets: a perf script python binding fix, a
  uprobes fix and a syscall tracing fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Add missing files to build the python binding
  uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
2012-08-23 21:48:41 -07:00
Linus Torvalds b5bc0c7054 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Mostly small fixes for the fallout of the timekeeping overhaul in 3.6
  along with stable fixes to address an accumulation problem and missing
  sanity checks for RTC readouts and user space provided values."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Avoid making adjustments if we haven't accumulated anything
  time: Avoid potential shift overflow with large shift values
  time: Fix casting issue in timekeeping_forward_now
  time: Ensure we normalize the timekeeper in tk_xtime_add
  time: Improve sanity checking of timekeeping inputs
2012-08-23 21:46:57 -07:00
John Stultz bf2ac31219 time: Avoid making adjustments if we haven't accumulated anything
If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
John Stultz 6ea565a9be time: Avoid potential shift overflow with large shift values
Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8be (time:
Condense timekeeper.xtime into xtime_sec)

Use 1ULL instead to ensure we don't overflow on the shift.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
Andreas Schwab 85dc8f05c9 time: Fix casting issue in timekeeping_forward_now
arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8be (time: Condense
timekeeper.xtime into xtime_sec)

Cast it to u64 first.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
John Stultz 784ffcbb96 time: Ensure we normalize the timekeeper in tk_xtime_add
Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dca (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)

After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:12 +02:00
Linus Torvalds 1456c75a80 Merge branch 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull audit-tree fixes from Miklos Szeredi:
 "The audit subsystem maintainers (Al and Eric) are not responding to
  repeated resends.  Eric did ack them a while ago, but no response
  since then.  So I'm sending these directly to you."

* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  audit: clean up refcounting in audit-tree
  audit: fix refcounting in audit-tree
  audit: don't free_chunk() after fsnotify_add_mark()
2012-08-21 12:25:24 -07:00
Eric Dumazet f341861fb0 task_work: add a scheduling point in task_work_run()
It seems commit 4a9d4b024a ("switch fput to task_work_add") re-
introduced the problem addressed in 944be0b224 ("close_files(): add
scheduling point")

If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 09:11:44 -07:00
Ingo Molnar 5c65ca7520 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull syscall tracing fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21 11:49:37 +02:00
Oleg Nesterov c7a3a88c93 uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
This patch fixes:

  https://bugzilla.redhat.com/show_bug.cgi?id=843640

If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.

Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().

For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.

dup_mmap()->uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.

And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().

Reported-and-tested-by: William Cohen <wcohen@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21 11:48:12 +02:00
Linus Torvalds 53795ced6e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix migration thread runtime bogosity
  sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
  sched,cgroup: Fix up task_groups list
  sched: fix divide by zero at {thread_group,task}_times
  sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
2012-08-20 10:35:05 -07:00
Linus Torvalds 90785be317 Merge branch 'alpha' (alpha architecture patches)
Merge alpha architecture update from Michael Cree:
 "The Alpha Maintainer, Matt Turner, is currently unavailable, so I have
  collected up patches that have been posted to the linux-alpha mailing
  list over the last couple of months, and are forwarding them to you in
  the hope that you are prepared to accept them via me.

  The patches by Al Viro and myself I have been running against kernels
  for two months now so have had quite a bit of testing.  All except one
  patch were intended for the 3.5 kernel but because of Matt's
  unavailability never got forwarded to you."

* emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits)
  alpha: Fix fall-out from disintegrating asm/system.h
  Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
  alpha: fix fpu.h usage in userspace
  alpha/mm/fault.c: Port OOM changes to do_page_fault
  alpha: take kernel_execve() out of entry.S
  alpha: take a bunch of syscalls into osf_sys.c
  alpha: Use new generic strncpy_from_user() and strnlen_user()
  alpha: Wire up cross memory attach syscalls
  alpha: Don't export SOCK_NONBLOCK to user space.
2012-08-19 08:41:29 -07:00
Al Viro be53db6e4e alpha: take a bunch of syscalls into osf_sys.c
New helper: current_thread_info().  Allows to do a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler.  We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-19 08:41:19 -07:00