Kirill noted the following deadlock cycle on shutdown involving padata:
> With commit 755609a908 I've got deadlock on
> poweroff.
>
> It guess it happens because of race for cpu_hotplug.lock:
>
> CPU A CPU B
> disable_nonboot_cpus()
> _cpu_down()
> cpu_hotplug_begin()
> mutex_lock(&cpu_hotplug.lock);
> __cpu_notify()
> padata_cpu_callback()
> __padata_remove_cpu()
> padata_replace()
> synchronize_rcu()
> rcu_gp_kthread()
> get_online_cpus();
> mutex_lock(&cpu_hotplug.lock);
It would of course be good to eliminate grace-period delays from
CPU-hotplug notifiers, but that is a separate issue. Deadlock is
not an appropriate diagnostic for excessive CPU-hotplug latency.
Fortunately, grace-period initialization does not actually need to
exclude all of the CPU-hotplug operation, but rather only RCU's own
CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers. This commit therefore
introduces a new per-rcu_state onoff_mutex that provides the required
concurrency control in place of the get_online_cpus() that was previously
in rcu_gp_init().
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Checking "user" before "is_idle_task()" allows better optimizations
in cases where inlining is possible. Also, "bool" should be passed
"true" or "false" rather than "1" or "0". This commit therefore makes
these changes, as noted in Josh's review.
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In some cases, it is necessary to enter or exit userspace-RCU-idle mode
from an interrupt handler, for example, if some other CPU sends this
CPU a resched IPI. In this case, the current CPU would enter the IPI
handler in userspace-RCU-idle mode, but would need to exit the IPI handler
after having exited that mode.
To allow this to work, this commit adds two new APIs to TREE_RCU:
- rcu_user_enter_after_irq(). This must be called from an interrupt between
rcu_irq_enter() and rcu_irq_exit(). After the irq calls rcu_irq_exit(),
the irq handler will return into an RCU extended quiescent state.
In theory, this interrupt is never a nested interrupt, but in practice
it might interrupt softirq, which looks to RCU like a nested interrupt.
- rcu_user_exit_after_irq(). This must be called from a non-nesting
interrupt, interrupting an RCU extended quiescent state, also
between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
state.
[ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU currently insists that only idle tasks can enter RCU idle mode, which
prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
would mean that usermode execution would always take scheduling-clock
interrupts, even when there is only one task runnable on the CPU in
question.
This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
allow non-idle tasks to enter RCU idle mode. These are quite similar
to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
omit the idle-task checks.
[ Updated to use "user" flag rather than separate check functions. ]
[ paulmck: Updated to drop exports of new functions based on Josh's patch
getting rid of the need for them. ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
were due to adjacent insertions and deletions, which were resolved
by simply accepting the changes on both branches.
bigrt.2012.09.23a contains additional commits to reduce scheduling latency
from RCU on huge systems (many hundrends or thousands of CPUs).
doctorture.2012.09.23a contains documentation changes and rcutorture fixes.
fixes.2012.09.23a contains miscellaneous fixes.
hotplug.2012.09.23a contains CPU-hotplug-related changes.
idle.2012.09.23a fixes architectures for which RCU no longer considered
the idle loop to be a quiescent state due to earlier
adaptive-dynticks changes. Affected architectures are alpha,
cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa,
and ia64.
Posting a callback after the CPU_DEAD notifier effectively leaks
that callback unless/until that CPU comes back online. Silence is
unhelpful when attempting to track down such leaks, so this commit emits
a WARN_ON_ONCE() and unconditionally leaks the callback when an offline
CPU attempts to register a callback. The rdp->nxttail[RCU_NEXT_TAIL] is
set to NULL in the CPU_DEAD notifier and restored in the CPU_UP_PREPARE
notifier, allowing _call_rcu() to determine exactly when posting callbacks
is illegal.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, _rcu_barrier() relies on preempt_disable() to prevent
any CPU from going offline, which in turn depends on CPU hotplug's
use of __stop_machine().
This patch therefore makes _rcu_barrier() use get_online_cpus() to
block CPU-hotplug operations. This has the added benefit of removing
the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug
operations are excluded, there can be no callbacks to adopt. This
commit simplifies the code accordingly.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The first memory barrier in __call_rcu() is supposed to order any
updates done beforehand by the caller against the actual queuing
of the callback. However, the second memory barrier (which is intended
to order incrementing the queue lengths before queuing the callback)
is also between the caller's updates and the queuing of the callback.
The second memory barrier can therefore serve both purposes.
This commit therefore removes the first memory barrier.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
If a given CPU avoids the idle loop but also avoids starting a new
RCU grace period for a full minute, RCU can issue spurious RCU CPU
stall warnings. This commit fixes this issue by adding a check for
ongoing grace period to avoid these spurious stall warnings.
Reported-by: Becky Bruce <bgillbruce@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_other_cpu_stall() function accesses a number of rcu_node
fields without protection from the ->lock. In theory, this is not
a problem because the fields accessed are all integers, but in
practice the compiler can get nasty. Therefore, the commit extends
the existing critical section to cover the entire loop body.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_implicit_offline_qs() function implicitly assumed that execution
would progress predictably when interrupts are disabled, which is of course
not guaranteed when running on a hypervisor. Furthermore, this function
is short, and is called from one place only in a short function.
This commit therefore ensures that the timing is checked before
checking the condition, which guarantees correct behavior even given
indefinite delays. It also inlines rcu_implicit_offline_qs() into
rcu_implicit_dynticks_qs().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, rcu_init_geometry() only reshapes RCU's combining trees
if the leaf fanout is changed at boot time. This means that by
default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
data structures, even when running on systems with (say) four CPUs.
This commit therefore checks to see if the maximum number of CPUs on
the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
reshapes the combining trees accordingly.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
to nr_cpu_ids) to require more than a single rcu_node structure, but if
NR_CPUS is larger than would fit into a single rcu_node structure, then
the current rcu_init_levelspread() code is subject to integer overflow
in the eight-bit ->levelspread[] array in the rcu_state structure.
In this case, the solution is -not- to increase the size of the
elements in this array because the values in that array should be
constrained to the number of bits in an unsigned long. Instead, this
commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
function's initialization of the cprv local variable. This results in
all of the arithmetic being consistently based off of the nr_cpu_ids
value, thus avoiding the overflow, which was caused by the mixing of
nr_cpu_ids and NR_CPUS.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current quiescent-state detection algorithm is needlessly
complex. It records the grace-period number corresponding to
the quiescent state at the time of the quiescent state, which
works, but it seems better to simply erase any record of previous
quiescent states at the time that the CPU notices the new grace
period. This has the further advantage of removing another piece
of RCU for which lockless reasoning is required.
Therefore, this commit makes this change.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Now that the rcu_node structures' ->completed fields are unconditionally
assigned at grace-period cleanup time, they should already have the
correct value for the new grace period at grace-period initialization
time. This commit therefore inserts a WARN_ON_ONCE() to verify this
invariant.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Preemption greatly raised the probability of certain types of race
conditions, so this commit adds an anti-heisenbug to greatly increase
the collision cross section, also known as the probability of occurrence.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The current approach to grace-period initialization is vulnerable to
extremely low-probability races. These races stem from the fact that
the old grace period is marked completed on the same traversal through
the rcu_node structure that is marking the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.
These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:
1. CPU 0 completes a grace period, but needs an additional
grace period, so starts initializing one, initializing all
the non-leaf rcu_node structures and the first leaf rcu_node
structure. Because CPU 0 is both completing the old grace
period and starting a new one, it marks the completion of
the old grace period and the start of the new grace period
in a single traversal of the rcu_node structures.
Therefore, CPUs corresponding to the first rcu_node structure
can become aware that the prior grace period has completed, but
CPUs corresponding to the other rcu_node structures will see
this same prior grace period as still being in progress.
2. CPU 1 passes through a quiescent state, and therefore informs
the RCU core. Because its leaf rcu_node structure has already
been initialized, this CPU's quiescent state is applied to the
new (and only partially initialized) grace period.
3. CPU 1 enters an RCU read-side critical section and acquires
a reference to data item A. Note that this CPU believes that
its critical section started after the beginning of the new
grace period, and therefore will not block this new grace period.
4. CPU 16 exits dyntick-idle mode. Because it was in dyntick-idle
mode, other CPUs informed the RCU core of its extended quiescent
state for the past several grace periods. This means that CPU 16
is not yet aware that these past grace periods have ended. Assume
that CPU 16 corresponds to the second leaf rcu_node structure --
which has not yet been made aware of the new grace period.
5. CPU 16 removes data item A from its enclosing data structure
and passes it to call_rcu(), which queues a callback in the
RCU_NEXT_TAIL segment of the callback queue.
6. CPU 16 enters the RCU core, possibly because it has taken a
scheduling-clock interrupt, or alternatively because it has
more than 10,000 callbacks queued. It notes that the second
most recent grace period has completed (recall that because it
corresponds to the second as-yet-uninitialized rcu_node structure,
it cannot yet become aware that the most recent grace period has
completed), and therefore advances its callbacks. The callback
for data item A is therefore in the RCU_NEXT_READY_TAIL segment
of the callback queue.
7. CPU 0 completes initialization of the remaining leaf rcu_node
structures for the new grace period, including the structure
corresponding to CPU 16.
8. CPU 16 again enters the RCU core, again, possibly because it has
taken a scheduling-clock interrupt, or alternatively because
it now has more than 10,000 callbacks queued. It notes that
the most recent grace period has ended, and therefore advances
its callbacks. The callback for data item A is therefore in
the RCU_DONE_TAIL segment of the callback queue.
9. All CPUs other than CPU 1 pass through quiescent states. Because
CPU 1 already passed through its quiescent state, the new grace
period completes. Note that CPU 1 is still in its RCU read-side
critical section, still referencing data item A.
10. Suppose that CPU 2 wais the last CPU to pass through a quiescent
state for the new grace period, and suppose further that CPU 2
did not have any callbacks queued, therefore not needing an
additional grace period. CPU 2 therefore traverses all of the
rcu_node structures, marking the new grace period as completed,
but does not initialize a new grace period.
11. CPU 16 yet again enters the RCU core, yet again possibly because
it has taken a scheduling-clock interrupt, or alternatively
because it now has more than 10,000 callbacks queued. It notes
that the new grace period has ended, and therefore advances
its callbacks. The callback for data item A is therefore in
the RCU_DONE_TAIL segment of the callback queue. This means
that this callback is now considered ready to be invoked.
12. CPU 16 invokes the callback, freeing data item A while CPU 1
is still referencing it.
This scenario represents a day-zero bug for TREE_RCU. This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.
That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Conflicts:
kernel/rcutree.c
The module parameters blimit, qhimark, and qlomark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs. This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system. This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.
Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Although almost everyone is well-served by the defaults, some uses of RCU
benefit from shorter grace periods, while others benefit more from the
greater efficiency provided by longer grace periods. Situations requiring
a large number of grace periods to elapse (and wireshark startup has
been called out as an example of this) are helped by lower-latency
grace periods. Furthermore, in some embedded applications, people are
willing to accept a small degradation in update efficiency (due to there
being more of the shorter grace-period operations) in order to gain the
lower latency.
In contrast, those few systems with thousands of CPUs need longer grace
periods because the CPU overhead of a grace period rises roughly
linearly with the number of CPUs. Such systems normally do not make
much use of facilities that require large numbers of grace periods to
elapse, so this is a good tradeoff.
Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states. They both default
to three jiffies, which is compatible with the old hard-coded behavior.
At some future time, it may be possible to automatically increase the
grace-period length with the number of CPUs, but we do not yet have
sufficient data to do a good job. Preliminary data indicates that we
should add an addiitonal jiffy to each of the delays for every 200 CPUs
in the system, but more experimentation is needed. For now, the number
of systems with more than 1,000 CPUs is small enough that this can be
relegated to boot-time hand tuning.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>