Mirror of https://github.com/armbian/linux-cix.git (synced 2026-01-06 12:30:45 -08:00)
Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes, perhaps most notably:
- Throttling callback invocation based on the number of callbacks
that are now ready to invoke instead of on the total number of
callbacks
- Several patches that suppress false-positive boot-time
diagnostics, for example, due to lockdep not yet being
initialized
- Make expedited RCU CPU stall warnings dump stacks of any tasks
that are blocking the stalled grace period. (Normal RCU CPU
stall warnings have done this for many years)
- Lazy-callback fixes to avoid delays during boot, suspend, and
resume. (Note that lazy callbacks must be explicitly enabled, so
this should not (yet) affect production use cases)
- Make kfree_rcu() and friends take advantage of polled grace periods,
thus reducing memory footprint by almost two orders of magnitude,
admittedly on a microbenchmark
This also begins the transition from kfree_rcu(p) to
kfree_rcu_mightsleep(p). This transition was motivated by bugs where
kfree_rcu(p), which can block, was typed instead of the intended
kfree_rcu(p, rh) (a sketch contrasting the two forms follows this list)
- SRCU updates, perhaps most notably fixing a bug that causes SRCU to
fail when booted on a system with a non-zero boot CPU. This
surprising situation actually happens for kdump kernels on the
powerpc architecture
This also adds an srcu_down_read() and srcu_up_read(), which act like
srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
critical section to be handed off from one task to another (a sketch
follows the commit list below)
- Clean up the now-useless SRCU Kconfig option
There are a few more commits that are not yet acked or pulled into
maintainer trees, and these will be in a pull request for a later
merge window
- RCU-tasks updates, perhaps most notably these fixes:
- A strange interaction between PID-namespace unshare and the
RCU-tasks grace period that results in a low-probability but
very real hang
- A race between an RCU tasks rude grace period on a single-CPU
system and CPU-hotplug addition of the second CPU that can
result in a too-short grace period
- A race between shrinking RCU tasks down to a single callback
list and queuing a new callback to some other CPU, but where
that queuing is delayed for more than an RCU grace period. This
can result in that callback being stranded on the non-boot CPU
- Torture-test updates and fixes
- Torture-test scripting updates and fixes
- Provide additional RCU CPU stall-warning information in kernels built
with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
timeout limit for expedited RCU CPU stall warnings
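
To make the kfree_rcu() transition mentioned above concrete, here is a
minimal sketch (editorial illustration only, not part of this pull
request; struct foo and its field names are hypothetical) contrasting
the two forms::

    #include <linux/slab.h>
    #include <linux/rcupdate.h>

    struct foo {
        int data;
        struct rcu_head rh;     /* used by the two-argument form */
    };

    static void release_foo_atomic(struct foo *fp)
    {
        /* Never blocks: uses the caller-supplied fp->rh, so this
         * is safe even in atomic context. */
        kfree_rcu(fp, rh);
    }

    static void release_foo_sleepable(struct foo *fp)
    {
        /* May block waiting for a grace period, so this is legal
         * only in sleepable context.  This is the new spelling of
         * the old, easily mistyped single-argument kfree_rcu(fp). */
        kfree_rcu_mightsleep(fp);
    }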
* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
kernel/notifier: Remove CONFIG_SRCU
init: Remove "select SRCU"
fs/quota: Remove "select SRCU"
fs/notify: Remove "select SRCU"
fs/btrfs: Remove "select SRCU"
fs: Remove CONFIG_SRCU
drivers/pci/controller: Remove "select SRCU"
drivers/net: Remove "select SRCU"
drivers/md: Remove "select SRCU"
drivers/hwtracing/stm: Remove "select SRCU"
drivers/dax: Remove "select SRCU"
drivers/base: Remove CONFIG_SRCU
rcu: Disable laziness if lazy-tracking says so
rcu: Track laziness during boot and suspend
rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
rcu: Align the output of RCU CPU stall warning messages
rcu: Add RCU stall diagnosis information
sched: Add helper nr_context_switches_cpu()
...
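
As a rough illustration of the srcu_down_read()/srcu_up_read() handoff
mentioned in the SRCU item above (the handoff structure and task
functions here are hypothetical)::

    #include <linux/srcu.h>

    DEFINE_SRCU(my_srcu);

    struct handoff {
        int idx;        /* SRCU index captured by task A */
    };

    /* Task A enters the read-side critical section and hands it off.
     * Unlike srcu_read_lock(), srcu_down_read() does not require the
     * matching srcu_up_read() to run in the same task. */
    static void task_a(struct handoff *h)
    {
        h->idx = srcu_down_read(&my_srcu);
        /* ... pass h to task B ... */
    }

    /* Task B finishes the critical section that task A began. */
    static void task_b(struct handoff *h)
    {
        /* ... access the SRCU-protected data ... */
        srcu_up_read(&my_srcu, h->idx);
    }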
Documentation/RCU/NMI-RCU.rst
@@ -8,7 +8,7 @@ Although RCU is usually used to protect read-mostly data structures,
it is possible to use RCU to provide dynamic non-maskable interrupt
handlers, as well as dynamic irq handlers.  This document describes
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
work in "arch/x86/kernel/traps.c".
work in an old version of "arch/x86/kernel/traps.c".

The relevant pieces of code are listed below, each followed by a
brief explanation::
@@ -116,7 +116,7 @@ Answer to Quick Quiz:

    This same sad story can happen on other CPUs when using
    a compiler with aggressive pointer-value speculation
    optimizations.
    optimizations.  (But please don't!)

    More important, the rcu_dereference_sched() makes it
    clear to someone reading the code that the pointer is
Documentation/RCU/UP.rst
@@ -38,7 +38,7 @@ by having call_rcu() directly invoke its arguments only if it was called
from process context.  However, this can fail in a similar manner.

Suppose that an RCU-based algorithm again scans a linked list containing
elements A, B, and C in process contexts, but that it invokes a function
elements A, B, and C in process context, but that it invokes a function
on each element as it is scanned.  Suppose further that this function
deletes element B from the list, then passes it to call_rcu() for deferred
freeing.  This may be a bit unconventional, but it is perfectly legal
@@ -59,7 +59,8 @@ Example 3: Death by Deadlock
Suppose that call_rcu() is invoked while holding a lock, and that the
callback function must acquire this same lock.  In this case, if
call_rcu() were to directly invoke the callback, the result would
be self-deadlock.
be self-deadlock *even if* this invocation occurred from a later
call_rcu() invocation a full grace period later.

In some cases, it would possible to restructure to code so that
the call_rcu() is delayed until after the lock is released.  However,
@@ -85,6 +86,14 @@ Quick Quiz #2:

    :ref:`Answers to Quick Quiz <answer_quick_quiz_up>`

It is important to note that userspace RCU implementations *do*
permit call_rcu() to directly invoke callbacks, but only if a full
grace period has elapsed since those callbacks were queued.  This is
the case because some userspace environments are extremely constrained.
Nevertheless, people writing userspace RCU implementations are strongly
encouraged to avoid invoking callbacks from call_rcu(), thus obtaining
the deadlock-avoidance benefits called out above.

Summary
-------
Documentation/RCU/lockdep.rst
@@ -69,9 +69,8 @@ checking of rcu_dereference() primitives:
    value of the pointer itself, for example, against NULL.

The rcu_dereference_check() check expression can be any boolean
expression, but would normally include a lockdep expression. However,
any boolean expression can be used.  For a moderately ornate example,
consider the following::
expression, but would normally include a lockdep expression. For a
moderately ornate example, consider the following::

    file = rcu_dereference_check(fdt->fd[fd],
                                 lockdep_is_held(&files->file_lock) ||
@@ -97,10 +96,10 @@ code, it could instead be written as follows::
                atomic_read(&files->count) == 1);

This would verify cases #2 and #3 above, and furthermore lockdep would
complain if this was used in an RCU read-side critical section unless one
of these two cases held.  Because rcu_dereference_protected() omits all
barriers and compiler constraints, it generates better code than do the
other flavors of rcu_dereference().  On the other hand, it is illegal
complain even if this was used in an RCU read-side critical section unless
one of these two cases held.  Because rcu_dereference_protected() omits
all barriers and compiler constraints, it generates better code than do
the other flavors of rcu_dereference().  On the other hand, it is illegal
to use rcu_dereference_protected() if either the RCU-protected pointer
or the RCU-protected data that it points to can change concurrently.
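As an editorial aside, the pattern this hunk documents can be sketched
as follows (gp, my_lock, and struct foo are hypothetical names, not from
the patch)::

    #include <linux/rcupdate.h>
    #include <linux/spinlock.h>

    struct foo {
        int a;
    };

    static DEFINE_SPINLOCK(my_lock);    /* update-side lock */
    static struct foo __rcu *gp;        /* RCU-protected pointer */

    static int get_a(void)
    {
        struct foo *p;

        /* Legal under rcu_read_lock() *or* while holding my_lock;
         * lockdep complains if neither condition holds. */
        p = rcu_dereference_check(gp, lockdep_is_held(&my_lock));
        return p ? p->a : -1;
    }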
Documentation/RCU/rcu.rst
@@ -77,15 +77,17 @@ Frequently Asked Questions
  search for the string "Patent" in Documentation/RCU/RTFP.txt to find them.
  Of these, one was allowed to lapse by the assignee, and the
  others have been contributed to the Linux kernel under GPL.
  Many (but not all) have long since expired.
  There are now also LGPL implementations of user-level RCU
  available (https://liburcu.org/).

- I hear that RCU needs work in order to support realtime kernels?

  Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
  Realtime-friendly RCU are enabled via the CONFIG_PREEMPTION
  kernel configuration parameter.

- Where can I find more information on RCU?

  See the Documentation/RCU/RTFP.txt file.
  Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
  Or point your browser at (https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit)
  or (https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing).
Documentation/RCU/rcu_dereference.rst
@@ -19,8 +19,9 @@ Follow these rules to keep your RCU code working properly:
    can reload the value, and won't your code have fun with two
    different values for a single pointer!  Without rcu_dereference(),
    DEC Alpha can load a pointer, dereference that pointer, and
    return data preceding initialization that preceded the store of
    the pointer.
    return data preceding initialization that preceded the store
    of the pointer.  (As noted later, in recent kernels READ_ONCE()
    also prevents DEC Alpha from playing these tricks.)

    In addition, the volatile cast in rcu_dereference() prevents the
    compiler from deducing the resulting pointer value.  Please see
@@ -34,7 +35,7 @@ Follow these rules to keep your RCU code working properly:
    takes on the role of the lockless_dereference() primitive that
    was removed in v4.15.

-   You are only permitted to use rcu_dereference on pointer values.
-   You are only permitted to use rcu_dereference() on pointer values.
    The compiler simply knows too much about integral values to
    trust it to carry dependencies through integer operations.
    There are a very few exceptions, namely that you can temporarily
@@ -240,6 +241,7 @@ precautions.  To see this, consider the following code fragment::
        struct foo *q;
        int r1, r2;

        rcu_read_lock();
        p = rcu_dereference(gp2);
        if (p == NULL)
            return;
@@ -248,7 +250,10 @@ precautions.  To see this, consider the following code fragment::
        if (p == q) {
            /* The compiler decides that q->c is same as p->c. */
            r2 = p->c; /* Could get 44 on weakly order system. */
        } else {
            r2 = p->c - r1; /* Unconditional access to p->c. */
        }
        rcu_read_unlock();
        do_something_with(r1, r2);
    }
@@ -297,6 +302,7 @@ Then one approach is to use locking, for example, as follows::
        struct foo *q;
        int r1, r2;

        rcu_read_lock();
        p = rcu_dereference(gp2);
        if (p == NULL)
            return;
@@ -306,7 +312,12 @@ Then one approach is to use locking, for example, as follows::
        if (p == q) {
            /* The compiler decides that q->c is same as p->c. */
            r2 = p->c; /* Locking guarantees r2 == 144. */
        } else {
            spin_lock(&q->lock);
            r2 = q->c - r1;
            spin_unlock(&q->lock);
        }
        rcu_read_unlock();
        spin_unlock(&p->lock);
        do_something_with(r1, r2);
    }
@@ -364,7 +375,7 @@ the exact value of "p" even in the not-equals case.  This allows the
compiler to make the return values independent of the load from "gp",
in turn destroying the ordering between this load and the loads of the
return values.  This can result in "p->b" returning pre-initialization
garbage values.
garbage values on weakly ordered systems.

In short, rcu_dereference() is *not* optional when you are going to
dereference the resulting pointer.
@@ -430,7 +441,7 @@ member of the rcu_dereference() to use in various situations:
SPARSE CHECKING OF RCU-PROTECTED POINTERS
-----------------------------------------

The sparse static-analysis tool checks for direct access to RCU-protected
The sparse static-analysis tool checks for non-RCU access to RCU-protected
pointers, which can result in "interesting" bugs due to compiler
optimizations involving invented loads and perhaps also load tearing.
For example, suppose someone mistakenly does something like this::
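To make the sparse discussion concrete, a minimal editorial sketch
(hypothetical names) of the __rcu annotation that lets sparse flag such
mistaken direct accesses::

    #include <linux/rcupdate.h>

    struct foo {
        int a;
    };

    struct foo __rcu *gp;   /* sparse now tracks gp as RCU-protected */

    static void updater(struct foo *newp)
    {
        rcu_assign_pointer(gp, newp);   /* OK: marked publication */
    }

    static int reader(void)
    {
        struct foo *p;
        int ret;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* OK: marked access */
        /* p = gp; would instead draw a sparse warning */
        ret = p ? p->a : -1;
        rcu_read_unlock();
        return ret;
    }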
Documentation/RCU/rcubarrier.rst
@@ -5,37 +5,12 @@ RCU and Unloadable Modules

[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]

RCU (read-copy update) is a synchronization mechanism that can be thought
of as a replacement for read-writer locking (among other things), but with
very low-overhead readers that are immune to deadlock, priority inversion,
and unbounded latency. RCU read-side critical sections are delimited
by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPTION
kernels, generate no code whatsoever.

This means that RCU writers are unaware of the presence of concurrent
readers, so that RCU updates to shared data must be undertaken quite
carefully, leaving an old version of the data structure in place until all
pre-existing readers have finished. These old versions are needed because
such readers might hold a reference to them. RCU updates can therefore be
rather expensive, and RCU is thus best suited for read-mostly situations.

How can an RCU writer possibly determine when all readers are finished,
given that readers might well leave absolutely no trace of their
presence? There is a synchronize_rcu() primitive that blocks until all
pre-existing readers have completed. An updater wishing to delete an
element p from a linked list might do the following, while holding an
appropriate lock, of course::

    list_del_rcu(p);
    synchronize_rcu();
    kfree(p);

But the above code cannot be used in IRQ context -- the call_rcu()
primitive must be used instead. This primitive takes a pointer to an
rcu_head struct placed within the RCU-protected data structure and
another pointer to a function that may be invoked later to free that
structure. Code to delete an element p from the linked list from IRQ
context might then be as follows::
RCU updaters sometimes use call_rcu() to initiate an asynchronous wait for
a grace period to elapse. This primitive takes a pointer to an rcu_head
struct placed within the RCU-protected data structure and another pointer
to a function that may be invoked later to free that structure. Code to
delete an element p from the linked list from IRQ context might then be
as follows::

    list_del_rcu(p);
    call_rcu(&p->rcu, p_callback);
@@ -54,7 +29,7 @@ IRQ context. The function p_callback() might be defined as follows::
Unloading Modules That Use call_rcu()
-------------------------------------

But what if p_callback is defined in an unloadable module?
But what if the p_callback() function is defined in an unloadable module?

If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
@@ -67,20 +42,21 @@ grace period to elapse, it does not wait for the callbacks to complete.

One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
heavy RCU-callback load, then some of the callbacks might be deferred
in order to allow other processing to proceed. Such deferral is required
in realtime kernels in order to avoid excessive scheduling latencies.
heavy RCU-callback load, then some of the callbacks might be deferred in
order to allow other processing to proceed. For but one example, such
deferral is required in realtime kernels in order to avoid excessive
scheduling latencies.


rcu_barrier()
-------------

We instead need the rcu_barrier() primitive.  Rather than waiting for
a grace period to elapse, rcu_barrier() waits for all outstanding RCU
callbacks to complete.  Please note that rcu_barrier() does **not** imply
synchronize_rcu(), in particular, if there are no RCU callbacks queued
anywhere, rcu_barrier() is within its rights to return immediately,
without waiting for a grace period to elapse.
This situation can be handled by the rcu_barrier() primitive.  Rather
than waiting for a grace period to elapse, rcu_barrier() waits for all
outstanding RCU callbacks to complete.  Please note that rcu_barrier()
does **not** imply synchronize_rcu(), in particular, if there are no RCU
callbacks queued anywhere, rcu_barrier() is within its rights to return
immediately, without waiting for anything, let alone a grace period.

Pseudo-code using rcu_barrier() is as follows:

@@ -89,83 +65,86 @@ Pseudo-code using rcu_barrier() is as follows:
3. Allow the module to be unloaded.

There is also an srcu_barrier() function for SRCU, and you of course
must match the flavor of rcu_barrier() with that of call_rcu().  If your
module uses multiple flavors of call_rcu(), then it must also use multiple
flavors of rcu_barrier() when unloading that module.  For example, if
it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
srcu_struct_2, then the following three lines of code will be required
when unloading::
must match the flavor of srcu_barrier() with that of call_srcu().
If your module uses multiple srcu_struct structures, then it must also
use multiple invocations of srcu_barrier() when unloading that module.
For example, if it uses call_rcu(), call_srcu() on srcu_struct_1, and
call_srcu() on srcu_struct_2, then the following three lines of code
will be required when unloading::

     1 rcu_barrier();
     2 srcu_barrier(&srcu_struct_1);
     3 srcu_barrier(&srcu_struct_2);

The rcutorture module makes use of rcu_barrier() in its exit function
as follows::
If latency is of the essence, workqueues could be used to run these
three functions concurrently.

An ancient version of the rcutorture module makes use of rcu_barrier()
in its exit function as follows::

     1 static void
     2 rcu_torture_cleanup(void)
     3 {
     4   int i;
     5
     6   fullstop = 1;
     7   if (shuffler_task != NULL) {
     8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
     9     kthread_stop(shuffler_task);
    10   }
    11   shuffler_task = NULL;
    12
    13   if (writer_task != NULL) {
    14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
    15     kthread_stop(writer_task);
    16   }
    17   writer_task = NULL;
    18
    19   if (reader_tasks != NULL) {
    20     for (i = 0; i < nrealreaders; i++) {
    21       if (reader_tasks[i] != NULL) {
    22         VERBOSE_PRINTK_STRING(
    23           "Stopping rcu_torture_reader task");
    24         kthread_stop(reader_tasks[i]);
    25       }
    26       reader_tasks[i] = NULL;
    27     }
    28     kfree(reader_tasks);
    29     reader_tasks = NULL;
    30   }
    31   rcu_torture_current = NULL;
    32
    33   if (fakewriter_tasks != NULL) {
    34     for (i = 0; i < nfakewriters; i++) {
    35       if (fakewriter_tasks[i] != NULL) {
    36         VERBOSE_PRINTK_STRING(
    37           "Stopping rcu_torture_fakewriter task");
    38         kthread_stop(fakewriter_tasks[i]);
    39       }
    40       fakewriter_tasks[i] = NULL;
    41     }
    42     kfree(fakewriter_tasks);
    43     fakewriter_tasks = NULL;
    44   }
    45
    46   if (stats_task != NULL) {
    47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
    48     kthread_stop(stats_task);
    49   }
    50   stats_task = NULL;
    51
    52   /* Wait for all RCU callbacks to fire. */
    53   rcu_barrier();
    54
    55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
    56
    57   if (cur_ops->cleanup != NULL)
    58     cur_ops->cleanup();
    59   if (atomic_read(&n_rcu_torture_error))
    60     rcu_torture_print_module_parms("End of test: FAILURE");
    61   else
    62     rcu_torture_print_module_parms("End of test: SUCCESS");
    63 }

Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves. This will not be necessary in most cases, since
@@ -190,16 +169,17 @@ Quick Quiz #1:
    :ref:`Answer to Quick Quiz #1 <answer_rcubarrier_quiz_1>`

Your module might have additional complications. For example, if your
module invokes call_rcu() from timers, you will need to first cancel all
the timers, and only then invoke rcu_barrier() to wait for any remaining
module invokes call_rcu() from timers, you will need to first refrain
from posting new timers, cancel (or wait for) all the already-posted
timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.

Of course, if you module uses call_rcu(), you will need to invoke
Of course, if your module uses call_rcu(), you will need to invoke
rcu_barrier() before unloading.  Similarly, if your module uses
call_srcu(), you will need to invoke srcu_barrier() before unloading,
and on the same srcu_struct structure.  If your module uses call_rcu()
**and** call_srcu(), then you will need to invoke rcu_barrier() **and**
srcu_barrier().
**and** call_srcu(), then (as noted above) you will need to invoke
rcu_barrier() **and** srcu_barrier().


Implementing rcu_barrier()
@@ -211,27 +191,40 @@ queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point, all earlier RCU callbacks are guaranteed to have completed.

The original code for rcu_barrier() was as follows::
The original code for rcu_barrier() was roughly as follows::

     1 void rcu_barrier(void)
     2 {
     3   BUG_ON(in_interrupt());
     4   /* Take cpucontrol mutex to protect against CPU hotplug */
     5   mutex_lock(&rcu_barrier_mutex);
     6   init_completion(&rcu_barrier_completion);
     7   atomic_set(&rcu_barrier_cpu_count, 0);
     8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
     9   wait_for_completion(&rcu_barrier_completion);
    10   mutex_unlock(&rcu_barrier_mutex);
    11 }
     1 void rcu_barrier(void)
     2 {
     3   BUG_ON(in_interrupt());
     4   /* Take cpucontrol mutex to protect against CPU hotplug */
     5   mutex_lock(&rcu_barrier_mutex);
     6   init_completion(&rcu_barrier_completion);
     7   atomic_set(&rcu_barrier_cpu_count, 1);
     8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
     9   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
    10     complete(&rcu_barrier_completion);
    11   wait_for_completion(&rcu_barrier_completion);
    12   mutex_unlock(&rcu_barrier_mutex);
    13 }

Line 3 verifies that the caller is in process context, and lines 5 and 10
Line 3 verifies that the caller is in process context, and lines 5 and 12
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
before on_each_cpu() returns. Line 9 then waits for the completion.
before on_each_cpu() returns. Line 9 removes the initial count from
rcu_barrier_cpu_count, and if this count is now zero, line 10 finalizes
the completion, which prevents line 11 from blocking.  Either way,
line 11 then waits (if needed) for the completion.

.. _rcubarrier_quiz_2:

Quick Quiz #2:
    Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
    thereby avoiding the need for lines 9 and 10?

    :ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`

This code was rewritten in 2008 and several times thereafter, but this
still gives the general idea.
@@ -239,21 +232,21 @@ still gives the general idea.
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows::

     1 static void rcu_barrier_func(void *notused)
     2 {
     3   int cpu = smp_processor_id();
     4   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
     5   struct rcu_head *head;
     6
     7   head = &rdp->barrier;
     8   atomic_inc(&rcu_barrier_cpu_count);
     9   call_rcu(head, rcu_barrier_callback);
    10 }

Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head that needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
8 increments a global counter. This counter will later be decremented
8 increments the global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.
@@ -261,33 +254,34 @@ The rcu_barrier_callback() function simply atomically decrements the
rcu_barrier_cpu_count variable and finalizes the completion when it
reaches zero, as follows::

     1 static void rcu_barrier_callback(struct rcu_head *notused)
     2 {
     3   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
     4     complete(&rcu_barrier_completion);
     5 }

.. _rcubarrier_quiz_2:
.. _rcubarrier_quiz_3:

Quick Quiz #2:
Quick Quiz #3:
    What happens if CPU 0's rcu_barrier_func() executes
    immediately (thus incrementing rcu_barrier_cpu_count to the
    value one), but the other CPU's rcu_barrier_func() invocations
    are delayed for a full grace period? Couldn't this result in
    rcu_barrier() returning prematurely?

    :ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
    :ref:`Answer to Quick Quiz #3 <answer_rcubarrier_quiz_3>`

The current rcu_barrier() implementation is more complex, due to the need
to avoid disturbing idle CPUs (especially on battery-powered systems)
and the need to minimally disturb non-idle CPUs in real-time systems.
However, the code above illustrates the concepts.
In addition, a great many optimizations have been applied.  However,
the code above illustrates the concepts.


rcu_barrier() Summary
---------------------

The rcu_barrier() primitive has seen relatively little use, since most
The rcu_barrier() primitive is used relatively infrequently, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.
@@ -302,7 +296,8 @@ Quick Quiz #1:
    Is there any other situation where rcu_barrier() might
    be required?

Answer: Interestingly enough, rcu_barrier() was not originally
Answer:
    Interestingly enough, rcu_barrier() was not originally
    implemented for module unloading. Nikita Danilov was using
    RCU in a filesystem, which resulted in a similar situation at
    filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
@@ -318,13 +313,48 @@ Answer: Interestingly enough, rcu_barrier() was not originally
.. _answer_rcubarrier_quiz_2:

Quick Quiz #2:
    Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
    thereby avoiding the need for lines 9 and 10?

Answer:
    Suppose that the on_each_cpu() function shown on line 8 was
    delayed, so that CPU 0's rcu_barrier_func() executed and
    the corresponding grace period elapsed, all before CPU 1's
    rcu_barrier_func() started executing.  This would result in
    rcu_barrier_cpu_count being decremented to zero, so that line
    11's wait_for_completion() would return immediately, failing to
    wait for CPU 1's callbacks to be invoked.

    Note that this was not a problem when the rcu_barrier() code
    was first added back in 2005.  This is because on_each_cpu()
    disables preemption, which acted as an RCU read-side critical
    section, thus preventing CPU 0's grace period from completing
    until on_each_cpu() had dealt with all of the CPUs.  However,
    with the advent of preemptible RCU, rcu_barrier() no longer
    waited on nonpreemptible regions of code in preemptible kernels,
    that being the job of the new rcu_barrier_sched() function.

    However, with the RCU flavor consolidation around v4.20, this
    possibility was once again ruled out, because the consolidated
    RCU once again waits on nonpreemptible regions of code.

    Nevertheless, that extra count might still be a good idea.
    Relying on these sort of accidents of implementation can result
    in later surprise bugs when the implementation changes.

    :ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`

.. _answer_rcubarrier_quiz_3:

Quick Quiz #3:
    What happens if CPU 0's rcu_barrier_func() executes
    immediately (thus incrementing rcu_barrier_cpu_count to the
    value one), but the other CPU's rcu_barrier_func() invocations
    are delayed for a full grace period? Couldn't this result in
    rcu_barrier() returning prematurely?

Answer: This cannot happen. The reason is that on_each_cpu() has its last
Answer:
    This cannot happen. The reason is that on_each_cpu() has its last
    argument, the wait flag, set to "1". This flag is passed through
    to smp_call_function() and further to smp_call_function_on_cpu(),
    causing this latter to spin until the cross-CPU invocation of
@@ -336,18 +366,15 @@ Answer: This cannot happen. The reason is that on_each_cpu() has its last

    Therefore, on_each_cpu() disables preemption across its call
    to smp_call_function() and also across the local call to
    rcu_barrier_func(). This prevents the local CPU from context
    switching, again preventing grace periods from completing. This
    rcu_barrier_func().  Because recent RCU implementations treat
    preemption-disabled regions of code as RCU read-side critical
    sections, this prevents grace periods from completing.  This
    means that all CPUs have executed rcu_barrier_func() before
    the first rcu_barrier_callback() can possibly execute, in turn
    preventing rcu_barrier_cpu_count from prematurely reaching zero.

    Currently, -rt implementations of RCU keep but a single global
    queue for RCU callbacks, and thus do not suffer from this
    problem. However, when the -rt RCU eventually does have per-CPU
    callback queues, things will have to change. One simple change
    is to add an rcu_read_lock() before line 8 of rcu_barrier()
    and an rcu_read_unlock() after line 8 of this same function. If
    you can think of a better change, please let me know!
    But if on_each_cpu() ever decides to forgo disabling preemption,
    as might well happen due to real-time latency considerations,
    initializing rcu_barrier_cpu_count to one will save the day.

    :ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
    :ref:`Back to Quick Quiz #3 <rcubarrier_quiz_3>`
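Pulling this file's advice together, a compressed editorial sketch of
the unload-time pattern (module, structure, and helper names are
hypothetical)::

    #include <linux/module.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
        struct rcu_head rh;
    };

    static bool exiting;

    static void my_cb(struct rcu_head *rhp)
    {
        kfree(container_of(rhp, struct foo, rh));
    }

    static void post_foo(struct foo *fp)
    {
        if (!READ_ONCE(exiting))    /* no new callbacks once exiting */
            call_rcu(&fp->rh, my_cb);
    }

    static void __exit my_exit(void)
    {
        /* 1. Prevent any new call_rcu() invocations. */
        WRITE_ONCE(exiting, true);
        /* 2. Wait for all outstanding callbacks to complete.  A
         *    module also using call_srcu() would need the matching
         *    srcu_barrier() here as well. */
        rcu_barrier();
        /* 3. Only now may the module text safely go away. */
    }
    module_exit(my_exit);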
Documentation/RCU/rculist_nulls.rst
@@ -14,19 +14,19 @@ Using 'nulls'
=============

Using special makers (called 'nulls') is a convenient way
to solve following problem :
to solve following problem.

A typical RCU linked list managing objects which are
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :
Without 'nulls', a typical RCU linked list managing objects which are
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use the following
algorithms:

1) Lookup algo
--------------
1) Lookup algorithm
-------------------

::

  rcu_read_lock()
  begin:
  rcu_read_lock()
  obj = lockless_lookup(key);
  if (obj) {
    if (!try_get_ref(obj)) // might fail for free objects
@@ -38,6 +38,7 @@ use following algos :
     */
    if (obj->key != key) { // not the object we expected
      put_ref(obj);
      rcu_read_unlock();
      goto begin;
    }
  }
@@ -52,9 +53,9 @@ but a version with an additional memory barrier (smp_rmb())
  {
    struct hlist_node *node, *next;
    for (pos = rcu_dereference((head)->first);
         pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
         ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
         pos = rcu_dereference(next))
      if (obj->key == key)
        return obj;
    return NULL;
@@ -64,9 +65,9 @@ And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::

  struct hlist_node *node;
  for (pos = rcu_dereference((head)->first);
       pos && ({ prefetch(pos->next); 1; }) &&
       ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
       pos = rcu_dereference(pos->next))
    if (obj->key == key)
      return obj;
  return NULL;
@@ -82,36 +83,32 @@ Quoting Corey Minyard::
  solved by pre-fetching the "next" field (with proper barriers) before
  checking the key."

2) Insert algo
--------------
2) Insertion algorithm
----------------------

We need to make sure a reader cannot read the new 'obj->obj_next' value
and previous value of 'obj->key'. Or else, an item could be deleted
and previous value of 'obj->key'. Otherwise, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.
before the move, 'next' pointer is NULL, and lockless reader can not
detect the fact that it missed following items in original chain.

::

  /*
   * Please note that new inserts are done at the head of list,
   * not in the middle or end.
   */
  obj = kmem_cache_alloc(...);
  lock_chain(); // typically a spin_lock()
  obj->key = key;
  /*
   * we need to make sure obj->key is updated before obj->next
   * or obj->refcnt
   */
  smp_wmb();
  atomic_set(&obj->refcnt, 1);
  atomic_set_release(&obj->refcnt, 1); // key before refcnt
  hlist_add_head_rcu(&obj->obj_node, list);
  unlock_chain(); // typically a spin_unlock()


3) Remove algo
--------------
3) Removal algorithm
--------------------

Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
@@ -133,7 +130,7 @@ Avoiding extra smp_rmb()
========================

With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
and extra smp_wmb() in insert function.
and extra _release() in insert function.

For example, if we choose to store the slot number as the 'nulls'
end-of-list marker for each slot of the hash table, we can detect
@@ -142,59 +139,61 @@ to another chain) checking the final 'nulls' value if
the lookup met the end of chain. If final 'nulls' value
is not the slot number, then we must restart the lookup at
the beginning. If the object was moved to the same chain,
then the reader doesn't care : It might eventually
then the reader doesn't care: It might occasionally
scan the list again without harm.


1) lookup algo
--------------
1) lookup algorithm
-------------------

::

  head = &table[slot];
  rcu_read_lock();
  begin:
  rcu_read_lock();
  hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
    if (obj->key == key) {
      if (!try_get_ref(obj)) // might fail for free objects
        goto begin;
      if (obj->key != key) { // not the object we expected
        put_ref(obj);
      if (!try_get_ref(obj)) { // might fail for free objects
        rcu_read_unlock();
        goto begin;
      }
      goto out;
      if (obj->key != key) { // not the object we expected
        put_ref(obj);
        rcu_read_unlock();
        goto begin;
      }
      goto out;
    }
  }

  // If the nulls value we got at the end of this lookup is
  // not the expected one, we must restart lookup.
  // We probably met an item that was moved to another chain.
  if (get_nulls_value(node) != slot) {
    put_ref(obj);
    rcu_read_unlock();
    goto begin;
  }
  /*
   * if the nulls value we got at the end of this lookup is
   * not the expected one, we must restart lookup.
   * We probably met an item that was moved to another chain.
   */
  if (get_nulls_value(node) != slot)
    goto begin;
  obj = NULL;

  out:
  rcu_read_unlock();

2) Insert function
------------------
2) Insert algorithm
-------------------

::

  /*
   * Please note that new inserts are done at the head of list,
   * not in the middle or end.
   */
  obj = kmem_cache_alloc(cachep);
  lock_chain(); // typically a spin_lock()
  obj->key = key;
  atomic_set_release(&obj->refcnt, 1); // key before refcnt
  /*
   * changes to obj->key must be visible before refcnt one
   */
  smp_wmb();
  atomic_set(&obj->refcnt, 1);
  /*
   * insert obj in RCU way (readers might be traversing chain)
   */
  hlist_nulls_add_head_rcu(&obj->obj_node, list);
  unlock_chain(); // typically a spin_unlock()
Documentation/RCU/stallwarn.rst
@@ -25,10 +25,10 @@ warnings:

- A CPU looping with bottom halves disabled.

- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel
  without invoking schedule().  If the looping in the kernel is
  really expected and desirable behavior, you might need to add
  some calls to cond_resched().
- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the
  kernel without potentially invoking schedule().  If the looping
  in the kernel is really expected and desirable behavior, you
  might need to add some calls to cond_resched().

- Booting Linux using a console connection that is too slow to
  keep up with the boot-time console-message rate.  For example,
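As a minimal editorial illustration of the cond_resched() advice in the
hunk above (the loop and its helpers are hypothetical)::

    #include <linux/sched.h>

    /* A long, fully expected in-kernel loop.  On a !CONFIG_PREEMPTION
     * kernel, omitting the cond_resched() would keep this CPU from
     * passing through a quiescent state, eventually triggering RCU
     * CPU stall warnings. */
    static void scrub_many_items(struct item *items, long n)
    {
        long i;

        for (i = 0; i < n; i++) {
            scrub_one(&items[i]);   /* hypothetical per-item work */
            cond_resched();         /* lets grace periods complete */
        }
    }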
@@ -108,16 +108,17 @@ warnings:

- A bug in the RCU implementation.

- A hardware failure.  This is quite unlikely, but has occurred
  at least once in real life.  A CPU failed in a running system,
  becoming unresponsive, but not causing an immediate crash.
  This resulted in a series of RCU CPU stall warnings, eventually
  leading the realization that the CPU had failed.
- A hardware failure.  This is quite unlikely, but is not at all
  uncommon in large datacenter.  In one memorable case some decades
  back, a CPU failed in a running system, becoming unresponsive,
  but not causing an immediate crash.  This resulted in a series
  of RCU CPU stall warnings, eventually leading the realization
  that the CPU had failed.

The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
Note that SRCU does *not* have CPU stall warnings.  Please note that
RCU only detects CPU stalls when there is a grace period in progress.
No grace period, no CPU stall warnings.
The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
CPU stall warning.  Note that SRCU does *not* have CPU stall warnings.
Please note that RCU only detects CPU stalls when there is a grace period
in progress.  No grace period, no CPU stall warnings.

To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
@@ -205,16 +206,21 @@ RCU_STALL_RAT_DELAY
rcupdate.rcu_task_stall_timeout
-------------------------------

This boot/sysfs parameter controls the RCU-tasks stall warning
interval.  A value of zero or less suppresses RCU-tasks stall
warnings.  A positive value sets the stall-warning interval
in seconds.  An RCU-tasks stall warning starts with the line:
This boot/sysfs parameter controls the RCU-tasks and
RCU-tasks-trace stall warning intervals.  A value of zero or less
suppresses RCU-tasks stall warnings.  A positive value sets the
stall-warning interval in seconds.  An RCU-tasks stall warning
starts with the line:

    INFO: rcu_tasks detected stalls on tasks:

And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.

An RCU-tasks-trace stall warning starts (and continues) similarly:

    INFO: rcu_tasks_trace detected stalls on tasks


Interpreting RCU's CPU Stall-Detector "Splats"
==============================================
@@ -248,7 +254,8 @@ dynticks counter, which will have an even-numbered value if the CPU
is in dyntick-idle mode and an odd-numbered value otherwise.  The hex
number between the two "/"s is the value of the nesting, which will be
a small non-negative number if in the idle loop (as shown above) and a
very large positive number otherwise.
very large positive number otherwise.  The number following the final
"/" is the NMI nesting, which will be a small non-negative number.

The "softirq=" portion of the message tracks the number of RCU softirq
handlers that the stalled CPU has executed.  The number before the "/"
@@ -383,3 +390,95 @@ for example, "P3421".

It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time during the same run.

RCU_CPU_STALL_CPUTIME
=====================

In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
rcupdate.rcu_cpu_stall_cputime=1, the following additional information
is supplied with each RCU CPU stall warning::

    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:      624         45            0
    rcu: cputime:       69          1         2425   ==> 2500(ms)

These statistics are collected during the sampling period.  The values
in row "number:" are the number of hard interrupts, number of soft
interrupts, and number of context switches on the stalled CPU.  The
first three values in row "cputime:" indicate the CPU time in
milliseconds consumed by hard interrupts, soft interrupts, and tasks
on the stalled CPU.  The last number is the measurement interval, again
in milliseconds.  Because user-mode tasks normally do not cause RCU CPU
stalls, these tasks are typically kernel tasks, which is why only the
system CPU time are considered.

The sampling period is shown as follows::

    |<------------first timeout---------->|<-----second timeout----->|
    |<--half timeout-->|<--half timeout-->|                          |
    |                  |<--first period-->|                          |
    |                  |<-----------second sampling period---------->|
    |                  |                  |                          |
              snapshot time point    1st-stall                  2nd-stall

The following describes four typical scenarios:

1. A CPU looping with interrupts disabled.

   ::

    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:        0          0            0
    rcu: cputime:        0          0            0   ==> 2500(ms)

   Because interrupts have been disabled throughout the measurement
   interval, there are no interrupts and no context switches.
   Furthermore, because CPU time consumption was measured using interrupt
   handlers, the system CPU consumption is misleadingly measured as zero.
   This scenario will normally also have "(0 ticks this GP)" printed on
   this CPU's summary line.

2. A CPU looping with bottom halves disabled.

   This is similar to the previous example, but with non-zero number of
   and CPU time consumed by hard interrupts, along with non-zero CPU
   time consumed by in-kernel execution::

    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:      624          0            0
    rcu: cputime:       49          0         2446   ==> 2500(ms)

   The fact that there are zero softirqs gives a hint that these were
   disabled, perhaps via local_bh_disable().  It is of course possible
   that there were no softirqs, perhaps because all events that would
   result in softirq execution are confined to other CPUs.  In this case,
   the diagnosis should continue as shown in the next example.

3. A CPU looping with preemption disabled.

   Here, only the number of context switches is zero::

    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:      624         45            0
    rcu: cputime:       69          1         2425   ==> 2500(ms)

   This situation hints that the stalled CPU was looping with preemption
   disabled.

4. No looping, but massive hard and soft interrupts.

   ::

    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:       xx         xx            0
    rcu: cputime:       xx         xx            0   ==> 2500(ms)

   Here, the number and CPU time of hard interrupts are all non-zero,
   but the number of context switches and the in-kernel CPU time consumed
   are zero.  The number and cputime of soft interrupts will usually be
   non-zero, but could be zero, for example, if the CPU was spinning
   within a single hard interrupt handler.

If this type of RCU CPU stall warning can be reproduced, you can
narrow it down by looking at /proc/interrupts or by writing code to
trace each interrupt, for example, by referring to show_interrupts().
Documentation/RCU/torture.rst
@@ -206,7 +206,11 @@ values for memory may require disabling the callback-flooding tests
using the --bootargs parameter discussed below.

Sometimes additional debugging is useful, and in such cases the --kconfig
parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``.
parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_RCU_EQS_DEBUG=y'``.
In addition, there are the --gdb, --kasan, and --kcsan parameters.
Note that --gdb limits you to one scenario per kvm.sh run and requires
that you have another window open from which to run ``gdb`` as instructed
by the script.

Kernel boot arguments can also be supplied, for example, to control
rcutorture's module parameters.  For example, to test a change to RCU's
@@ -219,10 +223,17 @@ require disabling rcutorture's callback-flooding tests::
    --bootargs 'rcutorture.fwd_progress=0'

Sometimes all that is needed is a full set of kernel builds.  This is
what the --buildonly argument does.
what the --buildonly parameter does.

Finally, the --trust-make argument allows each kernel build to reuse what
it can from the previous kernel build.
The --duration parameter can override the default run time of 30 minutes.
For example, ``--duration 2d`` would run for two days, ``--duration 3h``
would run for three hours, ``--duration 5m`` would run for five minutes,
and ``--duration 45s`` would run for 45 seconds.  This last can be useful
for tracking down rare boot-time failures.

Finally, the --trust-make parameter allows each kernel build to reuse what
it can from the previous kernel build.  Please note that without the
--trust-make parameter, your tags files may be demolished.

There are additional more arcane arguments that are documented in the
source code of the kvm.sh script.
@@ -291,3 +302,73 @@ the following summary at the end of the run on a 12-CPU system::
    TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
    CPU count limited from 16 to 12
    TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011


Repeated Runs
=============

Suppose that you are chasing down a rare boot-time failure.  Although you
could use kvm.sh, doing so will rebuild the kernel on each run.  If you
need (say) 1,000 runs to have confidence that you have fixed the bug,
these pointless rebuilds can become extremely annoying.

This is why kvm-again.sh exists.

Suppose that a previous kvm.sh run left its output in this directory::

    tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28

Then this run can be re-run without rebuilding as follow:

    kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28

A few of the original run's kvm.sh parameters may be overridden, perhaps
most notably --duration and --bootargs.  For example::

    kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28 \
        --duration 45s

would re-run the previous test, but for only 45 seconds, thus facilitating
tracking down the aforementioned rare boot-time failure.


Distributed Runs
================

Although kvm.sh is quite useful, its testing is confined to a single
system.  It is not all that hard to use your favorite framework to cause
(say) 5 instances of kvm.sh to run on your 5 systems, but this will very
likely unnecessarily rebuild kernels.  In addition, manually distributing
the desired rcutorture scenarios across the available systems can be
painstaking and error-prone.

And this is why the kvm-remote.sh script exists.

If you the following command works::

    ssh system0 date

and if it also works for system1, system2, system3, system4, and system5,
and all of these systems have 64 CPUs, you can type::

    kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
        --cpus 64 --duration 8h --configs "5*CFLIST"

This will build each default scenario's kernel on the local system, then
spread each of five instances of each scenario over the systems listed,
running each scenario for eight hours.  At the end of the runs, the
results will be gathered, recorded, and printed.  Most of the parameters
that kvm.sh will accept can be passed to kvm-remote.sh, but the list of
systems must come first.

The kvm.sh ``--dryrun scenarios`` argument is useful for working out
how many scenarios may be run in one batch across a group of systems.

You can also re-run a previous remote run in a manner similar to kvm.sh:

    kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
        tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \
        --duration 24h

In this case, most of the kvm-again.sh parmeters may be supplied following
the pathname of the old run-results directory.
@@ -16,18 +16,23 @@ to start learning about RCU:
 | 6.	The RCU API, 2019 Edition	https://lwn.net/Articles/777036/
 |	2019 Big API Table		https://lwn.net/Articles/777165/
 
+For those preferring video:
+
+| 1.	Unraveling RCU Mysteries: Fundamentals	https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
+| 2.	Unraveling RCU Mysteries: Additional Use Cases	https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases
+
 
 What is RCU?
 
 RCU is a synchronization mechanism that was added to the Linux kernel
 during the 2.5 development effort that is optimized for read-mostly
-situations.  Although RCU is actually quite simple once you understand it,
-getting there can sometimes be a challenge.  Part of the problem is that
-most of the past descriptions of RCU have been written with the mistaken
-assumption that there is "one true way" to describe RCU.  Instead,
-the experience has been that different people must take different paths
-to arrive at an understanding of RCU.  This document provides several
-different paths, as follows:
+situations.  Although RCU is actually quite simple, making effective use
+of it requires you to think differently about your code.  Another part
+of the problem is the mistaken assumption that there is "one true way" to
+describe and to use RCU.  Instead, the experience has been that different
+people must take different paths to arrive at an understanding of RCU,
+depending on their experiences and use cases.  This document provides
+several different paths, as follows:
 
 :ref:`1. RCU OVERVIEW <1_whatisRCU>`
@@ -157,34 +162,36 @@ rcu_read_lock()
 ^^^^^^^^^^^^^^^
 	void rcu_read_lock(void);
 
-	Used by a reader to inform the reclaimer that the reader is
-	entering an RCU read-side critical section.  It is illegal
-	to block while in an RCU read-side critical section, though
-	kernels built with CONFIG_PREEMPT_RCU can preempt RCU
-	read-side critical sections.  Any RCU-protected data structure
-	accessed during an RCU read-side critical section is guaranteed to
-	remain unreclaimed for the full duration of that critical section.
-	Reference counts may be used in conjunction with RCU to maintain
-	longer-term references to data structures.
+	This temporal primitive is used by a reader to inform the
+	reclaimer that the reader is entering an RCU read-side critical
+	section.  It is illegal to block while in an RCU read-side
+	critical section, though kernels built with CONFIG_PREEMPT_RCU
+	can preempt RCU read-side critical sections.  Any RCU-protected
+	data structure accessed during an RCU read-side critical section
+	is guaranteed to remain unreclaimed for the full duration of that
+	critical section.  Reference counts may be used in conjunction
+	with RCU to maintain longer-term references to data structures.
 
 rcu_read_unlock()
 ^^^^^^^^^^^^^^^^^
 	void rcu_read_unlock(void);
 
-	Used by a reader to inform the reclaimer that the reader is
-	exiting an RCU read-side critical section.  Note that RCU
-	read-side critical sections may be nested and/or overlapping.
+	This temporal primitive is used by a reader to inform the
+	reclaimer that the reader is exiting an RCU read-side critical
+	section.  Note that RCU read-side critical sections may be nested
+	and/or overlapping.
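As a concrete illustration of such nesting, consider the following sketch;
the reader() function and the do_something() helpers are hypothetical names
used only for this example::

	void reader(void)
	{
		rcu_read_lock();	/* Outermost read-side critical section. */
		do_something();
		rcu_read_lock();	/* Nested critical sections are legal... */
		do_something_else();
		rcu_read_unlock();	/* ...but grace periods wait for the */
		do_one_last_thing();
		rcu_read_unlock();	/* outermost rcu_read_unlock(). */
	}
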
synchronize_rcu()
 ^^^^^^^^^^^^^^^^^
 	void synchronize_rcu(void);
 
-	Marks the end of updater code and the beginning of reclaimer
-	code.  It does this by blocking until all pre-existing RCU
-	read-side critical sections on all CPUs have completed.
-	Note that synchronize_rcu() will **not** necessarily wait for
-	any subsequent RCU read-side critical sections to complete.
-	For example, consider the following sequence of events::
+	This temporal primitive marks the end of updater code and the
+	beginning of reclaimer code.  It does this by blocking until
+	all pre-existing RCU read-side critical sections on all CPUs
+	have completed.  Note that synchronize_rcu() will **not**
+	necessarily wait for any subsequent RCU read-side critical
+	sections to complete.  For example, consider the following
+	sequence of events::
 
 	         CPU 0                  CPU 1                 CPU 2
 	   ----------------- ------------------------- ---------------
@@ -211,13 +218,13 @@ synchronize_rcu()
 	to be useful in all but the most read-intensive situations,
 	synchronize_rcu()'s overhead must also be quite small.
 
-	The call_rcu() API is a callback form of synchronize_rcu(),
-	and is described in more detail in a later section.  Instead of
-	blocking, it registers a function and argument which are invoked
-	after all ongoing RCU read-side critical sections have completed.
-	This callback variant is particularly useful in situations where
-	it is illegal to block or where update-side performance is
-	critically important.
+	The call_rcu() API is an asynchronous callback form of
+	synchronize_rcu(), and is described in more detail in a later
+	section.  Instead of blocking, it registers a function and
+	argument which are invoked after all ongoing RCU read-side
+	critical sections have completed.  This callback variant is
+	particularly useful in situations where it is illegal to block
+	or where update-side performance is critically important.
 
 	However, the call_rcu() API should not be used lightly, as use
 	of the synchronize_rcu() API generally results in simpler code.
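To make the callback form concrete, here is a minimal sketch; struct foo,
foo_reclaim(), and foo_remove() are hypothetical names for illustration,
while call_rcu(), container_of(), and list_del_rcu() are the real APIs::

	struct foo {
		struct list_head list;
		int data;
		struct rcu_head rcu;	/* Storage used by call_rcu(). */
	};

	static void foo_reclaim(struct rcu_head *rhp)
	{
		/* Map the rcu_head back to its enclosing structure. */
		struct foo *fp = container_of(rhp, struct foo, rcu);

		kfree(fp);
	}

	static void foo_remove(struct foo *fp)
	{
		list_del_rcu(&fp->list);	/* Updater: unlink the element. */
		call_rcu(&fp->rcu, foo_reclaim);	/* Reclaim after a grace period. */
	}
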
@@ -236,11 +243,13 @@ rcu_assign_pointer()
 	would be cool to be able to declare a function in this manner.
 	(Compiler experts will no doubt disagree.)
 
-	The updater uses this function to assign a new value to an
+	The updater uses this spatial macro to assign a new value to an
 	RCU-protected pointer, in order to safely communicate the change
-	in value from the updater to the reader.  This macro does not
-	evaluate to an rvalue, but it does execute any memory-barrier
-	instructions required for a given CPU architecture.
+	in value from the updater to the reader.  This is a spatial (as
+	opposed to temporal) macro.  It does not evaluate to an rvalue,
+	but it does execute any memory-barrier instructions required
+	for a given CPU architecture.  Its ordering properties are that
+	of a store-release operation.
 
 	Perhaps just as important, it serves to document (1) which
 	pointers are protected by RCU and (2) the point at which a

@@ -255,14 +264,15 @@ rcu_dereference()
 	Like rcu_assign_pointer(), rcu_dereference() must be implemented
 	as a macro.
 
-	The reader uses rcu_dereference() to fetch an RCU-protected
-	pointer, which returns a value that may then be safely
-	dereferenced.  Note that rcu_dereference() does not actually
-	dereference the pointer, instead, it protects the pointer for
-	later dereferencing.  It also executes any needed memory-barrier
-	instructions for a given CPU architecture.  Currently, only Alpha
-	needs memory barriers within rcu_dereference() -- on other CPUs,
-	it compiles to nothing, not even a compiler directive.
+	The reader uses the spatial rcu_dereference() macro to fetch
+	an RCU-protected pointer, which returns a value that may
+	then be safely dereferenced.  Note that rcu_dereference()
+	does not actually dereference the pointer; instead, it
+	protects the pointer for later dereferencing.  It also
+	executes any needed memory-barrier instructions for a given
+	CPU architecture.  Currently, only Alpha needs memory barriers
+	within rcu_dereference() -- on other CPUs, it compiles to a
+	volatile load.
 
 	Common coding practice uses rcu_dereference() to copy an
 	RCU-protected pointer to a local variable, then dereferences
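Putting the spatial macros together with the temporal primitives, a minimal
publish/subscribe sketch might look as follows; gp, update_foo(), and
read_foo() are hypothetical names used only for this illustration::

	struct foo {
		int a;
	};
	static struct foo __rcu *gp;	/* RCU-protected pointer. */

	static int update_foo(int a)	/* Updater. */
	{
		struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);

		if (!new_fp)
			return -ENOMEM;
		new_fp->a = a;
		/* Initialization above, publication below: store-release. */
		rcu_assign_pointer(gp, new_fp);
		/* Freeing any old version must wait for a grace period. */
		return 0;
	}

	static int read_foo(void)	/* Reader. */
	{
		struct foo *p;
		int ret = -1;

		rcu_read_lock();
		p = rcu_dereference(gp);	/* Subscribe to the current version. */
		if (p)
			ret = p->a;
		rcu_read_unlock();
		return ret;
	}
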
@@ -355,12 +365,15 @@ reader, updater, and reclaimer.
             synchronize_rcu() & call_rcu()
 
 
-The RCU infrastructure observes the time sequence of rcu_read_lock(),
+The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
 rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
 order to determine when (1) synchronize_rcu() invocations may return
 to their callers and (2) call_rcu() callbacks may be invoked.  Efficient
 implementations of the RCU infrastructure make heavy use of batching in
 order to amortize their overhead over many uses of the corresponding APIs.
+The rcu_assign_pointer() and rcu_dereference() invocations communicate
+spatial changes via stores to and loads from the RCU-protected pointer in
+question.
 
 There are at least three flavors of RCU usage in the Linux kernel. The diagram
 above shows the most common one. On the updater side, the rcu_assign_pointer(),

@@ -392,7 +405,9 @@ b. RCU applied to networking data structures that may be subjected
 c.	RCU applied to scheduler and interrupt/NMI-handler tasks.
 
 Again, most uses will be of (a).  The (b) and (c) cases are important
-for specialized uses, but are relatively uncommon.
+for specialized uses, but are relatively uncommon.  The SRCU, RCU-Tasks,
+RCU-Tasks-Rude, and RCU-Tasks-Trace flavors have similar relationships
+among their assorted primitives.
 
 .. _3_whatisRCU:

@@ -468,7 +483,7 @@ So, to sum up:
 -	Within an RCU read-side critical section, use rcu_dereference()
 	to dereference RCU-protected pointers.
 
--	Use some solid scheme (such as locks or semaphores) to
+-	Use some solid design (such as locks or semaphores) to
 	keep concurrent updates from interfering with each other.
 
 -	Use rcu_assign_pointer() to update an RCU-protected pointer.
@@ -579,6 +594,14 @@ to avoid having to write your own callback::
 
 	kfree_rcu(old_fp, rcu);
 
+If the occasional sleep is permitted, the single-argument form may
+be used, omitting the rcu_head structure from struct foo::
+
+	kfree_rcu(old_fp);
+
+This variant of kfree_rcu() almost never blocks, but might do so by
+invoking synchronize_rcu() in response to memory-allocation failure.
+
 Again, see checklist.rst for additional rules governing the use of RCU.
 
 .. _5_whatisRCU:
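Tying the two-argument form into a complete update, a hedged sketch might
look like this; foo_lock and update_foo() are hypothetical names, while
list_replace_rcu() and kfree_rcu() are the real APIs::

	static void update_foo(struct foo *old_fp, struct foo *new_fp)
	{
		spin_lock(&foo_lock);	/* Hypothetical update-side lock. */
		list_replace_rcu(&old_fp->list, &new_fp->list);
		spin_unlock(&foo_lock);
		kfree_rcu(old_fp, rcu);	/* Names the rcu_head field; never blocks. */
	}
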
@@ -596,7 +619,7 @@ lacking both functionality and performance.  However, they are useful
 in getting a feel for how RCU works.  See kernel/rcu/update.c for a
 production-quality implementation, and see:
 
-	http://www.rdrop.com/users/paulmck/RCU
+	https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
 
 for papers describing the Linux kernel RCU implementation.  The OLS'01
 and OLS'02 papers are a good introduction, and the dissertation provides

@@ -929,6 +952,8 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
 initialized after each and every call to kmem_cache_alloc(), which renders
 reference-free spinlock acquisition completely unsafe.  Therefore, when
 using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
+(Those willing to use a kmem_cache constructor may also use locking,
+including cache-friendly sequence locking.)
 
 With traditional reference counting -- such as that implemented by the
 kref library in Linux -- there is typically code that runs when the last
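As a hedged sketch of the constructor approach mentioned above (foo_cache,
foo_ctor(), and struct foo are illustrative names): a constructor runs when
a slab object is first created rather than on every allocation, so a lock
initialized there remains valid across the type-safe reuse that
``SLAB_TYPESAFE_BY_RCU`` permits::

	static void foo_ctor(void *obj)
	{
		struct foo *fp = obj;

		/* Runs once per slab object, not once per kmem_cache_alloc(). */
		spin_lock_init(&fp->lock);
	}

	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
				      SLAB_TYPESAFE_BY_RCU, foo_ctor);
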
@@ -1047,6 +1072,30 @@ sched::
 	rcu_read_lock_sched_held
 
 
+RCU-Tasks::
+
+	Critical sections	Grace period		Barrier
+
+	N/A			call_rcu_tasks		rcu_barrier_tasks
+				synchronize_rcu_tasks
+
+
+RCU-Tasks-Rude::
+
+	Critical sections	Grace period		Barrier
+
+	N/A			call_rcu_tasks_rude	rcu_barrier_tasks_rude
+				synchronize_rcu_tasks_rude
+
+
+RCU-Tasks-Trace::
+
+	Critical sections	Grace period		Barrier
+
+	rcu_read_lock_trace	call_rcu_tasks_trace	rcu_barrier_tasks_trace
+	rcu_read_unlock_trace	synchronize_rcu_tasks_trace
+
+
 SRCU::
 
 	Critical sections	Grace period		Barrier
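For comparison with the tables above, a minimal SRCU sketch follows; my_srcu
and the two functions are hypothetical names, while srcu_read_lock(),
srcu_read_unlock(), and synchronize_srcu() are the real APIs::

	DEFINE_STATIC_SRCU(my_srcu);

	static void reader(void)	/* SRCU readers may block. */
	{
		int idx;

		idx = srcu_read_lock(&my_srcu);
		/* ... access SRCU-protected data, sleeping if necessary ... */
		srcu_read_unlock(&my_srcu, idx);
	}

	static void updater(void)
	{
		/* ... unlink the old data ... */
		synchronize_srcu(&my_srcu);	/* Waits only on my_srcu's readers. */
		/* ... now safe to free it ... */
	}
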
@@ -1087,35 +1136,43 @@ list can be helpful:
 
 a.	Will readers need to block?  If so, you need SRCU.
 
-b.	What about the -rt patchset?  If readers would need to block
-	in an non-rt kernel, you need SRCU.  If readers would block
-	in a -rt kernel, but not in a non-rt kernel, SRCU is not
-	necessary.  (The -rt patchset turns spinlocks into sleeplocks,
-	hence this distinction.)
+b.	Will readers need to block and are you doing tracing, for
+	example, ftrace or BPF?  If so, you need RCU-tasks,
+	RCU-tasks-rude, and/or RCU-tasks-trace.
 
-c.	Do you need to treat NMI handlers, hardirq handlers,
+c.	What about the -rt patchset?  If readers would need to block in
+	a non-rt kernel, you need SRCU.  If readers would block when
+	acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
+	SRCU is not necessary.  (The -rt patchset turns spinlocks into
+	sleeplocks, hence this distinction.)
+
+d.	Do you need to treat NMI handlers, hardirq handlers,
 	and code segments with preemption disabled (whether
 	via preempt_disable(), local_irq_save(), local_bh_disable(),
 	or some other mechanism) as if they were explicit RCU readers?
-	If so, RCU-sched is the only choice that will work for you.
+	If so, RCU-sched readers are the only choice that will work
+	for you, but since about v4.20 you can use the vanilla RCU
+	update primitives.
 
-d.	Do you need RCU grace periods to complete even in the face
-	of softirq monopolization of one or more of the CPUs?  For
-	example, is your code subject to network-based denial-of-service
-	attacks?  If so, you should disable softirq across your readers,
-	for example, by using rcu_read_lock_bh().
+e.	Do you need RCU grace periods to complete even in the face of
+	softirq monopolization of one or more of the CPUs?  For example,
+	is your code subject to network-based denial-of-service attacks?
+	If so, you should disable softirq across your readers, for
+	example, by using rcu_read_lock_bh().  Since about v4.20 you
+	can use the vanilla RCU update primitives.
 
-e.	Is your workload too update-intensive for normal use of
+f.	Is your workload too update-intensive for normal use of
 	RCU, but inappropriate for other synchronization mechanisms?
 	If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
 	named SLAB_DESTROY_BY_RCU).  But please be careful!
 
-f.	Do you need read-side critical sections that are respected
-	even though they are in the middle of the idle loop, during
-	user-mode execution, or on an offlined CPU?  If so, SRCU is the
-	only choice that will work for you.
+g.	Do you need read-side critical sections that are respected even
+	on CPUs that are deep in the idle loop, during entry to or exit
+	from user-mode execution, or on an offlined CPU?  If so, SRCU
+	and RCU Tasks Trace are the only choices that will work for you,
+	with SRCU being strongly preferred in almost all cases.
 
-g.	Otherwise, use RCU.
+h.	Otherwise, use RCU.
 
 Of course, this all assumes that you have determined that RCU is in fact
 the right tool for your job.
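Tying item (e) to code, a softirq-safe reader might be sketched as follows;
foo_list and do_something_with() are hypothetical names, and
rcu_read_lock_bh() also disables softirq processing on the local CPU::

	static void reader(void)
	{
		struct foo *p;

		rcu_read_lock_bh();
		list_for_each_entry_rcu(p, &foo_list, list)
			do_something_with(p);
		rcu_read_unlock_bh();
	}
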
@@ -5113,6 +5113,17 @@
 			rcupdate.rcu_cpu_stall_timeout to be used (after
 			conversion from seconds to milliseconds).
 
+	rcupdate.rcu_cpu_stall_cputime= [KNL]
+			Provide statistics on the cputime and count of
+			interrupts and tasks during the sampling period. For
+			multiple continuous RCU stalls, all sampling periods
+			begin at half of the first RCU stall timeout.
+
+	rcupdate.rcu_exp_stall_task_details= [KNL]
+			Print stack dumps of any tasks blocking the
+			current expedited RCU grace period during an
+			expedited RCU CPU stall warning.
+
 	rcupdate.rcu_expedited= [KNL]
 			Use expedited grace-period primitives, for
 			example, synchronize_rcu_expedited() instead
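As a usage note, and assuming the usual boolean kernel-parameter semantics
(this example command line is illustrative, not part of the patch), both
diagnostics can be enabled by appending the parameters to the kernel boot
command line::

	rcupdate.rcu_cpu_stall_cputime=1 rcupdate.rcu_exp_stall_task_details=1
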
drivers/base/core.c

@@ -181,7 +181,6 @@ void fw_devlink_purge_absent_suppliers(struct fwnode_handle *fwnode)
 }
 EXPORT_SYMBOL_GPL(fw_devlink_purge_absent_suppliers);
 
-#ifdef CONFIG_SRCU
 static DEFINE_MUTEX(device_links_lock);
 DEFINE_STATIC_SRCU(device_links_srcu);
 
@@ -220,47 +219,6 @@ static void device_link_remove_from_lists(struct device_link *link)
 	list_del_rcu(&link->s_node);
 	list_del_rcu(&link->c_node);
 }
-#else /* !CONFIG_SRCU */
-static DECLARE_RWSEM(device_links_lock);
-
-static inline void device_links_write_lock(void)
-{
-	down_write(&device_links_lock);
-}
-
-static inline void device_links_write_unlock(void)
-{
-	up_write(&device_links_lock);
-}
-
-int device_links_read_lock(void)
-{
-	down_read(&device_links_lock);
-	return 0;
-}
-
-void device_links_read_unlock(int not_used)
-{
-	up_read(&device_links_lock);
-}
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-int device_links_read_lock_held(void)
-{
-	return lockdep_is_held(&device_links_lock);
-}
-#endif
-
-static inline void device_link_synchronize_removal(void)
-{
-}
-
-static void device_link_remove_from_lists(struct device_link *link)
-{
-	list_del(&link->s_node);
-	list_del(&link->c_node);
-}
-#endif /* !CONFIG_SRCU */
 
 static bool device_is_ancestor(struct device *dev, struct device *target)
 {
drivers/dax/Kconfig

@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 menuconfig DAX
 	tristate "DAX: direct access to differentiated memory"
-	select SRCU
 	default m if NVDIMM_DAX
 
 if DAX
drivers/hwtracing/stm/Kconfig

@@ -2,7 +2,6 @@
 config STM
 	tristate "System Trace Module devices"
 	select CONFIGFS_FS
-	select SRCU
 	help
 	  A System Trace Module (STM) is a device exporting data in System
 	  Trace Protocol (STP) format as defined by MIPI STP standards.
drivers/md/Kconfig

@@ -6,7 +6,6 @@
 menuconfig MD
 	bool "Multiple devices driver support (RAID and LVM)"
 	depends on BLOCK
-	select SRCU
 	help
 	  Support multiple physical spindles through a single logical device.
 	  Required for RAID and logical volume management.
drivers/net/Kconfig

@@ -334,7 +334,6 @@ config NETCONSOLE_DYNAMIC
 
 config NETPOLL
 	def_bool NETCONSOLE
-	select SRCU
 
 config NET_POLL_CONTROLLER
 	def_bool NETPOLL
drivers/pci/controller/Kconfig

@@ -258,7 +258,7 @@ config PCIE_MEDIATEK_GEN3
 	  MediaTek SoCs.
 
 config VMD
-	depends on PCI_MSI && X86_64 && SRCU && !UML
+	depends on PCI_MSI && X86_64 && !UML
 	tristate "Intel Volume Management Device Driver"
 	help
 	  Adds support for the Intel Volume Management Device (VMD). VMD is a
fs/btrfs/Kconfig

@@ -17,7 +17,6 @@ config BTRFS_FS
 	select FS_IOMAP
 	select RAID6_PQ
 	select XOR_BLOCKS
-	select SRCU
 	depends on PAGE_SIZE_LESS_THAN_256KB
 
 	help
fs/locks.c
@@ -1890,7 +1890,6 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp,
 }
 EXPORT_SYMBOL(generic_setlease);
 
-#if IS_ENABLED(CONFIG_SRCU)
 /*
  * Kernel subsystems can register to be notified on any attempt to set
  * a new lease with the lease_notifier_chain. This is used by (e.g.) nfsd
@@ -1924,30 +1923,6 @@ void lease_unregister_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(lease_unregister_notifier);
 
-#else /* !IS_ENABLED(CONFIG_SRCU) */
-static inline void
-lease_notifier_chain_init(void)
-{
-}
-
-static inline void
-setlease_notifier(long arg, struct file_lock *lease)
-{
-}
-
-int lease_register_notifier(struct notifier_block *nb)
-{
-	return 0;
-}
-EXPORT_SYMBOL_GPL(lease_register_notifier);
-
-void lease_unregister_notifier(struct notifier_block *nb)
-{
-}
-EXPORT_SYMBOL_GPL(lease_unregister_notifier);
-
-#endif /* IS_ENABLED(CONFIG_SRCU) */
 
 /**
  * vfs_setlease - sets a lease on an open file
  * @filp: file pointer
fs/notify/Kconfig

@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config FSNOTIFY
 	def_bool n
-	select SRCU
 
 source "fs/notify/dnotify/Kconfig"
 source "fs/notify/inotify/Kconfig"
Some files were not shown because too many files have changed in this diff.