bigrtm: First steps towards getting RCU out of the way of
tens-of-microseconds real-time response on systems compiled
with NR_CPUS=4096. Also cleanups for and increased concurrency
of rcu_barrier() family of primitives.
doctorture: rcutorture and documentation improvements.
fixes: Miscellaneous fixes.
fnh: RCU_FAST_NO_HZ fixes and improvements.
If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
also disable itself. This commit therefore checks for tick_nohz_enabled
being zero, disabling rcu_prepare_for_idle() if so. This commit assumes
that tick_nohz_enabled can change at runtime: If this is not the case,
then a simpler approach suffices.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The arrival of TREE_PREEMPT_RCU some years back included some ugly
code involving either #ifdef or #ifdef'ed wrapper functions to iterate
over all non-SRCU flavors of RCU. This commit therefore introduces
a for_each_rcu_flavor() iterator over the rcu_state structures for each
flavor of RCU to clean up a bit of the ugliness.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The traditional rcu_barrier() implementation has serialized all requests,
regardless of RCU flavor, and also does not coalesce concurrent requests.
In the past, this has been good and sufficient.
However, systems are getting larger and use of rcu_barrier() has been
increasing. This commit therefore introduces a counter-based scheme
that allows _rcu_barrier() calls for the same flavor of RCU to take
advantage of each others' work.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure. This commit therefore moves the
rcu_barrier_mutex global variable to a new ->barrier_mutex field
in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure. This commit therefore moves the
rcu_barrier_completion global variable to a new ->barrier_completion
field in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In order to allow each RCU flavor to concurrently execute its rcu_barrier()
function, it is necessary to move the relevant state to the rcu_state
structure. This commit therefore moves the rcu_barrier_cpu_count global
variable to a new ->barrier_cpu_count field in the rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In order for multiple flavors of RCU to each concurrently run one
rcu_barrier(), each flavor needs its own per-CPU set of rcu_head
structures. This commit therefore moves _rcu_barrier()'s set of
per-CPU rcu_head structures from per-CPU variables to the existing
per-CPU and per-RCU-flavor rcu_data structures.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This is a preparatory commit for increasing rcu_barrier()'s concurrency.
It adds a pointer in the rcu_data structure to the corresponding call_rcu()
function. This allows a pointer to the rcu_data structure to imply the
function pointer, which allows _rcu_barrier() state to be placed in the
rcu_state structure.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Although making RCU_FANOUT_LEAF a kernel configuration parameter rather
than a fixed constant makes it easier for people to decrease cache-miss
overhead for large systems, it is of little help for people who must
run a single pre-built kernel binary.
This commit therefore allows the value of RCU_FANOUT_LEAF to be
increased (but not decreased!) via a boot-time parameter named
rcutree.rcu_fanout_leaf.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This reverts commit 616c310e83.
(Move PREEMPT_RCU preemption to switch_to() invocation).
Testing by Sasha Levin <levinsasha928@gmail.com> showed that this
can result in deadlock due to invoking the scheduler when one of
the runqueue locks is held. Because this commit was simply a
performance optimization, revert it.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h. This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
barrier: Reduce the amount of disturbance by rcu_barrier() to the rest of
the system. This branch also includes improvements to
RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes: Miscellaneous fixes.
inline: Remaining changes from an abortive attempt to inline
preemptible RCU's __rcu_read_lock(). These are (1) making
exit_rcu() avoid unnecessary work and (2) avoiding having
preemptible RCU record a blocked thread when the scheduler
declines to do a context switch.
srcu: Lai Jiangshan's algorithmic implementation of SRCU, including
call_srcu().
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU. Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.
However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state. This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks. Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
This is inefficient because enqueuing is required only if there is a
context switch, and entry to the scheduler does not guarantee a context
switch.
The commit therefore moves the enqueuing to immediately precede the
call to switch_to() from the scheduler.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE()
macro can cause RCU to momentarily pause out of idle without the rest
of the system being involved. This can cause rcu_prepare_for_idle()
to run through its state machine too quickly, which can in turn result
in needless scheduling-clock interrupts.
This commit therefore adds code to enable rcu_prepare_for_idle() to
distinguish between an initial entry to idle on the one hand (which needs
to advance the rcu_prepare_for_idle() state machine) and an idle reentry
due to idle-capable trace macros and RCU_NONIDLE() on the other hand
(which should avoid advancing the rcu_prepare_for_idle() state machine).
Additional state is maintained to allow the timer to be correctly reposted
when returning after a momentary pause out of idle, and even more state
is maintained to detect when new non-lazy callbacks have been enqueued
(which may require re-evaluation of the approach to idleness).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper
limit of 16 on the leaf-level fanout for the rcu_node tree. This was
needed to reduce lock contention that was induced by the synchronization
of scheduling-clock interrupts, which was in turn needed to improve
energy efficiency for moderate-sized lightly loaded servers.
However, reducing the leaf-level fanout means that there are more
leaf-level rcu_node structures in the tree, which in turn means that
RCU's grace-period initialization incurs more cache misses. This is
not a problem on moderate-sized servers with only a few tens of CPUs,
but becomes a major source of real-time latency spikes on systems with
many hundreds of CPUs. In addition, the workloads running on these large
systems tend to be CPU-bound, which eliminates the energy-efficiency
advantages of synchronizing scheduling-clock interrupts. Therefore,
these systems need maximal values for the rcu_node leaf-level fanout.
This commit addresses this problem by introducing a new kernel parameter
named RCU_FANOUT_LEAF that directly controls the leaf-level fanout.
This parameter defaults to 16 to handle the common case of a moderate
sized lightly loaded servers, but may be set higher on larger systems.
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because newly offlined CPUs continue executing after completing the
CPU_DYING notifiers, they legitimately enter the scheduler and use
RCU while appearing to be offline. This calls for a more sophisticated
approach as follows:
1. RCU marks the CPU online during the CPU_UP_PREPARE phase.
2. RCU marks the CPU offline during the CPU_DEAD phase.
3. Diagnostics regarding use of read-side RCU by offline CPUs use
RCU's accounting rather than the cpu_online_map. (Note that
__call_rcu() still uses cpu_online_map to detect illegal
invocations within CPU_DYING notifiers.)
4. Offline CPUs are prevented from hanging the system by
force_quiescent_state(), which pays attention to cpu_online_map.
Some additional work (in a later commit) will be needed to
guarantee that force_quiescent_state() waits a full jiffy before
assuming that a CPU is offline, for example, when called from
idle entry. (This commit also makes the one-jiffy wait
explicit, since the old-style implicit wait can now be defeated
by RCU_FAST_NO_HZ and by rcutorture.)
This approach avoids the false positives encountered when attempting to
use more exact classification of CPU online/offline state.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There have been situations where RCU CPU stall warnings were caused by
issues in scheduling-clock timer initialization. To make it easier to
track these down, this commit causes the RCU CPU stall-warning messages
to print out the number of scheduling-clock interrupts taken in the
current grace period for each stalled CPU.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The default CONFIG_RCU_CPU_STALL_TIMEOUT value of 60 seconds has served
Linux users well for production use for quite some time. However, for
debugging, there will be more than three minutes between subsequent
stall-warning messages. This can be an annoyingly long wait if you
are trying to work out where the offending infinite loop is hiding.
Therefore, this commit provides a rcu_cpu_stall_timeout sysfs
parameter that may be adjusted at boot time and at runtime to speed
up debugging.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The recent updates to RCU_CPU_FAST_NO_HZ have an rcu_needs_cpu() that
does more than just check for callbacks, so get the name for
rcu_preempt_needs_cpu() consistent with that change, now calling it
rcu_preempt_cpu_has_callbacks().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Move ->qsmaskinit and blkd_tasks[] manipulation to the CPU_DYING
notifier. This simplifies the code by eliminating a potential
deadlock and by reducing the responsibilities of force_quiescent_state().
Also rename functions to make their connection to the CPU-hotplug
stages explicit.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
enter dyntick-idle mode even if it still has RCU callbacks queued.
RCU avoids system hangs in this case by scheduling a timer for several
jiffies in the future. However, if all of the callbacks on that CPU
are from kfree_rcu(), there is no reason to wake the CPU up, as it is
not a problem to defer freeing of memory.
This commit therefore tracks the number of callbacks on a given CPU
that are from kfree_rcu(), and avoids scheduling the timer if all of
a given CPU's callbacks are from kfree_rcu().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_do_batch() function that invokes callbacks for TREE_RCU and
TREE_PREEMPT_RCU normally throttles callback invocation to avoid degrading
scheduling latency. However, as long as the CPU would otherwise be idle,
there is no downside to continuing to invoke any callbacks that have passed
through their grace periods. In fact, processing such callbacks in a
timely manner has the benefit of increasing the probability that the
CPU can enter the power-saving dyntick-idle mode.
Therefore, this commit allows callback invocation to continue beyond the
preset limit as long as the scheduler does not have some other task to
run and as long as context is that of the idle task or the relevant
RCU kthread.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering
dyntick-idle state if they have RCU callbacks pending. Unfortunately,
this has the side-effect of often preventing them from entering this
state, especially if at least one other CPU is not in dyntick-idle state.
However, the resulting per-tick wakeup is wasteful in many cases: if the
CPU has already fully responded to the current RCU grace period, there
will be nothing for it to do until this grace period ends, which will
frequently take several jiffies.
This commit therefore permits a CPU that has done everything that the
current grace period has asked of it (rcu_pending() == 0) even if it
still as RCU callbacks pending. However, such a CPU posts a timer to
wake it up several jiffies later (6 jiffies, based on experience with
grace-period lengths). This wakeup is required to handle situations
that can result in all CPUs being in dyntick-idle mode, thus failing
to ever complete the current grace period. If a CPU wakes up before
the timer goes off, then it cancels that timer, thus avoiding spurious
wakeups.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>