Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

      - Throttling callback invocation based on the number of callbacks
        that are now ready to invoke instead of on the total number of
        callbacks

      - Several patches that suppress false-positive boot-time
        diagnostics, for example, due to lockdep not yet being
        initialized

      - Make expedited RCU CPU stall warnings dump stacks of any tasks
        that are blocking the stalled grace period. (Normal RCU CPU
        stall warnings have done this for many years)

      - Lazy-callback fixes to avoid delays during boot, suspend, and
        resume. (Note that lazy callbacks must be explicitly enabled, so
        this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
kfree_rcu(p), which can block, was typed instead of the intended
kfree_rcu(p, rh), as sketched below
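
   A minimal sketch of the difference between the two forms, assuming a
   hypothetical struct foo whose rcu_head field is named rh:

      struct foo {
              struct rcu_head rh;
              /* ... payload ... */
      };

      kfree_rcu(p, rh);          /* Never blocks; usable under locks. */
      kfree_rcu_mightsleep(p);   /* Needs no rcu_head in struct foo, but
                                  * might block, for example, on
                                  * memory-allocation failure. */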

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like
   srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
critical section to be handed off from one task to another, as sketched below
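
   A sketch of such a handoff, assuming a hypothetical srcu_struct named
   my_srcu and a hypothetical handoff structure that carries the returned
   index to a workqueue handler running in some other task:

      struct handoff {
              struct work_struct work;
              int idx;
      };

      /* Task A: enter the SRCU read-side critical section. */
      state->idx = srcu_down_read(&my_srcu);
      schedule_work(&state->work);

      /* Task B: the work handler exits the critical section. */
      static void handoff_done(struct work_struct *work)
      {
              struct handoff *state = container_of(work, struct handoff, work);

              srcu_up_read(&my_srcu, state->idx);
      }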

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

      - A strange interaction between PID-namespace unshare and the
        RCU-tasks grace period that results in a low-probability but
        very real hang

      - A race between an RCU tasks rude grace period on a single-CPU
        system and CPU-hotplug addition of the second CPU that can
        result in a too-short grace period

      - A race between shrinking RCU tasks down to a single callback
        list and queuing a new callback to some other CPU, but where
        that queuing is delayed for more than an RCU grace period. This
        can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels built
   with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
   timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
Linus Torvalds, 2023-02-21 10:45:51 -08:00
54 changed files with 1739 additions and 844 deletions


@@ -8,7 +8,7 @@ Although RCU is usually used to protect read-mostly data structures,
it is possible to use RCU to provide dynamic non-maskable interrupt
handlers, as well as dynamic irq handlers. This document describes
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
work in "arch/x86/kernel/traps.c".
work in an old version of "arch/x86/kernel/traps.c".
The relevant pieces of code are listed below, each followed by a
brief explanation::
@@ -116,7 +116,7 @@ Answer to Quick Quiz:
This same sad story can happen on other CPUs when using
a compiler with aggressive pointer-value speculation
-optimizations.
+optimizations. (But please don't!)
More important, the rcu_dereference_sched() makes it
clear to someone reading the code that the pointer is
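
For reference, the pattern that this document builds up amounts to
something like the following sketch, with the dyn_nmi_fn_t typedef and
dyn_nmi_handler pointer being hypothetical stand-ins::

	typedef void (*dyn_nmi_fn_t)(struct pt_regs *regs);
	static dyn_nmi_fn_t __rcu dyn_nmi_handler;

	void do_dynamic_nmi(struct pt_regs *regs)
	{
		dyn_nmi_fn_t fn;

		/* NMI handlers run with preemption disabled, so the
		 * _sched flavor of rcu_dereference() is the right one. */
		fn = rcu_dereference_sched(dyn_nmi_handler);
		if (fn)
			fn(regs);
	}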


@@ -38,7 +38,7 @@ by having call_rcu() directly invoke its arguments only if it was called
from process context. However, this can fail in a similar manner.
Suppose that an RCU-based algorithm again scans a linked list containing
-elements A, B, and C in process contexts, but that it invokes a function
+elements A, B, and C in process context, but that it invokes a function
on each element as it is scanned. Suppose further that this function
deletes element B from the list, then passes it to call_rcu() for deferred
freeing. This may be a bit unconventional, but it is perfectly legal
@@ -59,7 +59,8 @@ Example 3: Death by Deadlock
Suppose that call_rcu() is invoked while holding a lock, and that the
callback function must acquire this same lock. In this case, if
call_rcu() were to directly invoke the callback, the result would
-be self-deadlock.
+be self-deadlock *even if* this invocation occurred from a later
+call_rcu() invocation a full grace period later.
In some cases, it would be possible to restructure the code so that
the call_rcu() is delayed until after the lock is released. However,
@@ -85,6 +86,14 @@ Quick Quiz #2:
:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`
It is important to note that userspace RCU implementations *do*
permit call_rcu() to directly invoke callbacks, but only if a full
grace period has elapsed since those callbacks were queued. This is
the case because some userspace environments are extremely constrained.
Nevertheless, people writing userspace RCU implementations are strongly
encouraged to avoid invoking callbacks from call_rcu(), thus obtaining
the deadlock-avoidance benefits called out above.
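
As a concrete illustration of Example 3, consider this sketch with a
hypothetical foo structure protected by a hypothetical foo_lock: if
call_rcu() were to invoke foo_reclaim() directly, the spin_lock() in the
callback would self-deadlock on the lock that foo_remove() still holds::

	struct foo {
		struct list_head list;
		struct rcu_head rcu;
	};

	static DEFINE_SPINLOCK(foo_lock);

	static void foo_reclaim(struct rcu_head *rhp)
	{
		struct foo *fp = container_of(rhp, struct foo, rcu);

		spin_lock(&foo_lock);	/* Deadlocks if invoked directly. */
		/* ... final cleanup under the lock ... */
		spin_unlock(&foo_lock);
		kfree(fp);
	}

	static void foo_remove(struct foo *fp)
	{
		spin_lock(&foo_lock);
		list_del_rcu(&fp->list);
		call_rcu(&fp->rcu, foo_reclaim);	/* Deferred, hence safe. */
		spin_unlock(&foo_lock);
	}
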
Summary
-------


@@ -69,9 +69,8 @@ checking of rcu_dereference() primitives:
value of the pointer itself, for example, against NULL.
The rcu_dereference_check() check expression can be any boolean
-expression, but would normally include a lockdep expression. However,
-any boolean expression can be used. For a moderately ornate example,
-consider the following::
+expression, but would normally include a lockdep expression. For a
+moderately ornate example, consider the following::
file = rcu_dereference_check(fdt->fd[fd],
lockdep_is_held(&files->file_lock) ||
@@ -97,10 +96,10 @@ code, it could instead be written as follows::
atomic_read(&files->count) == 1);
This would verify cases #2 and #3 above, and furthermore lockdep would
-complain if this was used in an RCU read-side critical section unless one
-of these two cases held. Because rcu_dereference_protected() omits all
-barriers and compiler constraints, it generates better code than do the
-other flavors of rcu_dereference(). On the other hand, it is illegal
+complain even if this was used in an RCU read-side critical section unless
+one of these two cases held. Because rcu_dereference_protected() omits
+all barriers and compiler constraints, it generates better code than do
+the other flavors of rcu_dereference(). On the other hand, it is illegal
to use rcu_dereference_protected() if either the RCU-protected pointer
or the RCU-protected data that it points to can change concurrently.
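
For instance, update-side code that is always invoked with the update
lock held might use the following sketch, assuming a hypothetical
RCU-protected pointer gp guarded by a hypothetical gp_lock::

	struct foo *p;

	p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
	if (p)
		p->a = 1;	/* gp_lock excludes concurrent updates. */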


@@ -77,15 +77,17 @@ Frequently Asked Questions
search for the string "Patent" in Documentation/RCU/RTFP.txt to find them.
Of these, one was allowed to lapse by the assignee, and the
others have been contributed to the Linux kernel under GPL.
Many (but not all) have long since expired.
There are now also LGPL implementations of user-level RCU
available (https://liburcu.org/).
- I hear that RCU needs work in order to support realtime kernels?
-Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
+Realtime-friendly RCU is enabled via the CONFIG_PREEMPTION
kernel configuration parameter.
- Where can I find more information on RCU?
See the Documentation/RCU/RTFP.txt file.
-Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
+Or point your browser at (https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit)
+or (https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing).


@@ -19,8 +19,9 @@ Follow these rules to keep your RCU code working properly:
can reload the value, and won't your code have fun with two
different values for a single pointer! Without rcu_dereference(),
DEC Alpha can load a pointer, dereference that pointer, and
-return data preceding initialization that preceded the store of
-the pointer.
+return data preceding initialization that preceded the store
+of the pointer. (As noted later, in recent kernels READ_ONCE()
+also prevents DEC Alpha from playing these tricks.)
In addition, the volatile cast in rcu_dereference() prevents the
compiler from deducing the resulting pointer value. Please see
@@ -34,7 +35,7 @@ Follow these rules to keep your RCU code working properly:
takes on the role of the lockless_dereference() primitive that
was removed in v4.15.
-- You are only permitted to use rcu_dereference on pointer values.
+- You are only permitted to use rcu_dereference() on pointer values.
The compiler simply knows too much about integral values to
trust it to carry dependencies through integer operations.
There are a very few exceptions, namely that you can temporarily
@@ -240,6 +241,7 @@ precautions. To see this, consider the following code fragment::
struct foo *q;
int r1, r2;
rcu_read_lock();
p = rcu_dereference(gp2);
if (p == NULL)
return;
@@ -248,7 +250,10 @@ precautions. To see this, consider the following code fragment::
if (p == q) {
/* The compiler decides that q->c is same as p->c. */
r2 = p->c; /* Could get 44 on weakly order system. */
} else {
r2 = p->c - r1; /* Unconditional access to p->c. */
}
rcu_read_unlock();
do_something_with(r1, r2);
}
@@ -297,6 +302,7 @@ Then one approach is to use locking, for example, as follows::
struct foo *q;
int r1, r2;
rcu_read_lock();
p = rcu_dereference(gp2);
if (p == NULL)
return;
@@ -306,7 +312,12 @@ Then one approach is to use locking, for example, as follows::
if (p == q) {
/* The compiler decides that q->c is same as p->c. */
r2 = p->c; /* Locking guarantees r2 == 144. */
} else {
spin_lock(&q->lock);
r2 = q->c - r1;
spin_unlock(&q->lock);
}
rcu_read_unlock();
spin_unlock(&p->lock);
do_something_with(r1, r2);
}
@@ -364,7 +375,7 @@ the exact value of "p" even in the not-equals case. This allows the
compiler to make the return values independent of the load from "gp",
in turn destroying the ordering between this load and the loads of the
return values. This can result in "p->b" returning pre-initialization
-garbage values.
+garbage values on weakly ordered systems.
In short, rcu_dereference() is *not* optional when you are going to
dereference the resulting pointer.
@@ -430,7 +441,7 @@ member of the rcu_dereference() to use in various situations:
SPARSE CHECKING OF RCU-PROTECTED POINTERS
-----------------------------------------
-The sparse static-analysis tool checks for direct access to RCU-protected
+The sparse static-analysis tool checks for non-RCU access to RCU-protected
pointers, which can result in "interesting" bugs due to compiler
optimizations involving invented loads and perhaps also load tearing.
For example, suppose someone mistakenly does something like this::
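
The mistake in question amounts to a plain C-language access to an
RCU-protected pointer, roughly as in this sketch (the gp pointer and its
__rcu annotation are assumed)::

	struct foo __rcu *gp;
	struct foo *p;

	p = gp;				/* Buggy: sparse complains, and the
					 * compiler may invent extra loads. */
	p = rcu_dereference(gp);	/* Correct. */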


@@ -5,37 +5,12 @@ RCU and Unloadable Modules
[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]
-RCU (read-copy update) is a synchronization mechanism that can be thought
-of as a replacement for read-writer locking (among other things), but with
-very low-overhead readers that are immune to deadlock, priority inversion,
-and unbounded latency. RCU read-side critical sections are delimited
-by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPTION
-kernels, generate no code whatsoever.
-This means that RCU writers are unaware of the presence of concurrent
-readers, so that RCU updates to shared data must be undertaken quite
-carefully, leaving an old version of the data structure in place until all
-pre-existing readers have finished. These old versions are needed because
-such readers might hold a reference to them. RCU updates can therefore be
-rather expensive, and RCU is thus best suited for read-mostly situations.
-How can an RCU writer possibly determine when all readers are finished,
-given that readers might well leave absolutely no trace of their
-presence? There is a synchronize_rcu() primitive that blocks until all
-pre-existing readers have completed. An updater wishing to delete an
-element p from a linked list might do the following, while holding an
-appropriate lock, of course::
-	list_del_rcu(p);
-	synchronize_rcu();
-	kfree(p);
-But the above code cannot be used in IRQ context -- the call_rcu()
-primitive must be used instead. This primitive takes a pointer to an
-rcu_head struct placed within the RCU-protected data structure and
-another pointer to a function that may be invoked later to free that
-structure. Code to delete an element p from the linked list from IRQ
-context might then be as follows::
+RCU updaters sometimes use call_rcu() to initiate an asynchronous wait for
+a grace period to elapse. This primitive takes a pointer to an rcu_head
+struct placed within the RCU-protected data structure and another pointer
+to a function that may be invoked later to free that structure. Code to
+delete an element p from the linked list from IRQ context might then be
+as follows::
list_del_rcu(p);
call_rcu(&p->rcu, p_callback);
@@ -54,7 +29,7 @@ IRQ context. The function p_callback() might be defined as follows::
Unloading Modules That Use call_rcu()
-------------------------------------
-But what if p_callback is defined in an unloadable module?
+But what if the p_callback() function is defined in an unloadable module?
If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
@@ -67,20 +42,21 @@ grace period to elapse, it does not wait for the callbacks to complete.
One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
-heavy RCU-callback load, then some of the callbacks might be deferred
-in order to allow other processing to proceed. Such deferral is required
-in realtime kernels in order to avoid excessive scheduling latencies.
+heavy RCU-callback load, then some of the callbacks might be deferred in
+order to allow other processing to proceed. For but one example, such
+deferral is required in realtime kernels in order to avoid excessive
+scheduling latencies.
rcu_barrier()
-------------
-We instead need the rcu_barrier() primitive. Rather than waiting for
-a grace period to elapse, rcu_barrier() waits for all outstanding RCU
-callbacks to complete. Please note that rcu_barrier() does **not** imply
-synchronize_rcu(), in particular, if there are no RCU callbacks queued
-anywhere, rcu_barrier() is within its rights to return immediately,
-without waiting for a grace period to elapse.
+This situation can be handled by the rcu_barrier() primitive. Rather
+than waiting for a grace period to elapse, rcu_barrier() waits for all
+outstanding RCU callbacks to complete. Please note that rcu_barrier()
+does **not** imply synchronize_rcu(), in particular, if there are no RCU
+callbacks queued anywhere, rcu_barrier() is within its rights to return
+immediately, without waiting for anything, let alone a grace period.
Pseudo-code using rcu_barrier() is as follows:
@@ -89,83 +65,86 @@ Pseudo-code using rcu_barrier() is as follows:
3. Allow the module to be unloaded.
There is also an srcu_barrier() function for SRCU, and you of course
-must match the flavor of rcu_barrier() with that of call_rcu(). If your
-module uses multiple flavors of call_rcu(), then it must also use multiple
-flavors of rcu_barrier() when unloading that module. For example, if
-it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
-srcu_struct_2, then the following three lines of code will be required
-when unloading::
+must match the flavor of srcu_barrier() with that of call_srcu().
+If your module uses multiple srcu_struct structures, then it must also
+use multiple invocations of srcu_barrier() when unloading that module.
+For example, if it uses call_rcu(), call_srcu() on srcu_struct_1, and
+call_srcu() on srcu_struct_2, then the following three lines of code
+will be required when unloading::
  1 rcu_barrier();
  2 srcu_barrier(&srcu_struct_1);
  3 srcu_barrier(&srcu_struct_2);
-The rcutorture module makes use of rcu_barrier() in its exit function
-as follows::
+If latency is of the essence, workqueues could be used to run these
+three functions concurrently.
+
+An ancient version of the rcutorture module makes use of rcu_barrier()
+in its exit function as follows::

  1 static void
  2 rcu_torture_cleanup(void)
  3 {
  4   int i;
  5
  6   fullstop = 1;
  7   if (shuffler_task != NULL) {
  8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
  9     kthread_stop(shuffler_task);
 10   }
 11   shuffler_task = NULL;
 12
 13   if (writer_task != NULL) {
 14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
 15     kthread_stop(writer_task);
 16   }
 17   writer_task = NULL;
 18
 19   if (reader_tasks != NULL) {
 20     for (i = 0; i < nrealreaders; i++) {
 21       if (reader_tasks[i] != NULL) {
 22         VERBOSE_PRINTK_STRING(
 23           "Stopping rcu_torture_reader task");
 24         kthread_stop(reader_tasks[i]);
 25       }
 26       reader_tasks[i] = NULL;
 27     }
 28     kfree(reader_tasks);
 29     reader_tasks = NULL;
 30   }
 31   rcu_torture_current = NULL;
 32
 33   if (fakewriter_tasks != NULL) {
 34     for (i = 0; i < nfakewriters; i++) {
 35       if (fakewriter_tasks[i] != NULL) {
 36         VERBOSE_PRINTK_STRING(
 37           "Stopping rcu_torture_fakewriter task");
 38         kthread_stop(fakewriter_tasks[i]);
 39       }
 40       fakewriter_tasks[i] = NULL;
 41     }
 42     kfree(fakewriter_tasks);
 43     fakewriter_tasks = NULL;
 44   }
 45
 46   if (stats_task != NULL) {
 47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
 48     kthread_stop(stats_task);
 49   }
 50   stats_task = NULL;
 51
 52   /* Wait for all RCU callbacks to fire. */
 53   rcu_barrier();
 54
 55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
 56
 57   if (cur_ops->cleanup != NULL)
 58     cur_ops->cleanup();
 59   if (atomic_read(&n_rcu_torture_error))
 60     rcu_torture_print_module_parms("End of test: FAILURE");
 61   else
 62     rcu_torture_print_module_parms("End of test: SUCCESS");
 63 }
Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves. This will not be necessary in most cases, since
@@ -190,16 +169,17 @@ Quick Quiz #1:
:ref:`Answer to Quick Quiz #1 <answer_rcubarrier_quiz_1>`
Your module might have additional complications. For example, if your
-module invokes call_rcu() from timers, you will need to first cancel all
-the timers, and only then invoke rcu_barrier() to wait for any remaining
+module invokes call_rcu() from timers, you will need to first refrain
+from posting new timers, cancel (or wait for) all the already-posted
+timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.
-Of course, if you module uses call_rcu(), you will need to invoke
+Of course, if your module uses call_rcu(), you will need to invoke
rcu_barrier() before unloading. Similarly, if your module uses
call_srcu(), you will need to invoke srcu_barrier() before unloading,
and on the same srcu_struct structure. If your module uses call_rcu()
-**and** call_srcu(), then you will need to invoke rcu_barrier() **and**
-srcu_barrier().
+**and** call_srcu(), then (as noted above) you will need to invoke
+rcu_barrier() **and** srcu_barrier().
Implementing rcu_barrier()
@@ -211,27 +191,40 @@ queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point, all earlier RCU callbacks are guaranteed to have completed.
-The original code for rcu_barrier() was as follows::
+The original code for rcu_barrier() was roughly as follows::
- 1 void rcu_barrier(void)
- 2 {
- 3   BUG_ON(in_interrupt());
- 4   /* Take cpucontrol mutex to protect against CPU hotplug */
- 5   mutex_lock(&rcu_barrier_mutex);
- 6   init_completion(&rcu_barrier_completion);
- 7   atomic_set(&rcu_barrier_cpu_count, 0);
- 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
- 9   wait_for_completion(&rcu_barrier_completion);
-10   mutex_unlock(&rcu_barrier_mutex);
-11 }
+ 1 void rcu_barrier(void)
+ 2 {
+ 3   BUG_ON(in_interrupt());
+ 4   /* Take cpucontrol mutex to protect against CPU hotplug */
+ 5   mutex_lock(&rcu_barrier_mutex);
+ 6   init_completion(&rcu_barrier_completion);
+ 7   atomic_set(&rcu_barrier_cpu_count, 1);
+ 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ 9   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+10     complete(&rcu_barrier_completion);
+11   wait_for_completion(&rcu_barrier_completion);
+12   mutex_unlock(&rcu_barrier_mutex);
+13 }
-Line 3 verifies that the caller is in process context, and lines 5 and 10
+Line 3 verifies that the caller is in process context, and lines 5 and 12
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
-before on_each_cpu() returns. Line 9 then waits for the completion.
+before on_each_cpu() returns. Line 9 removes the initial count from
+rcu_barrier_cpu_count, and if this count is now zero, line 10 finalizes
+the completion, which prevents line 11 from blocking. Either way,
+line 11 then waits (if needed) for the completion.
.. _rcubarrier_quiz_2:
Quick Quiz #2:
Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
thereby avoiding the need for lines 9 and 10?
:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
This code was rewritten in 2008 and several times thereafter, but this
still gives the general idea.
@@ -239,21 +232,21 @@ still gives the general idea.
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows::
  1 static void rcu_barrier_func(void *notused)
  2 {
  3   int cpu = smp_processor_id();
  4   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
  5   struct rcu_head *head;
  6
  7   head = &rdp->barrier;
  8   atomic_inc(&rcu_barrier_cpu_count);
  9   call_rcu(head, rcu_barrier_callback);
 10 }
Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head that is needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
-8 increments a global counter. This counter will later be decremented
+8 increments the global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.
@@ -261,33 +254,34 @@ The rcu_barrier_callback() function simply atomically decrements the
rcu_barrier_cpu_count variable and finalizes the completion when it
reaches zero, as follows::
  1 static void rcu_barrier_callback(struct rcu_head *notused)
  2 {
  3   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
  4     complete(&rcu_barrier_completion);
  5 }
-.. _rcubarrier_quiz_2:
+.. _rcubarrier_quiz_3:

-Quick Quiz #2:
+Quick Quiz #3:
What happens if CPU 0's rcu_barrier_func() executes
immediately (thus incrementing rcu_barrier_cpu_count to the
value one), but the other CPU's rcu_barrier_func() invocations
are delayed for a full grace period? Couldn't this result in
rcu_barrier() returning prematurely?
-:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
+:ref:`Answer to Quick Quiz #3 <answer_rcubarrier_quiz_3>`
The current rcu_barrier() implementation is more complex, due to the need
to avoid disturbing idle CPUs (especially on battery-powered systems)
and the need to minimally disturb non-idle CPUs in real-time systems.
-However, the code above illustrates the concepts.
+In addition, a great many optimizations have been applied. However,
+the code above illustrates the concepts.
rcu_barrier() Summary
---------------------
-The rcu_barrier() primitive has seen relatively little use, since most
+The rcu_barrier() primitive is used relatively infrequently, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.
@@ -302,7 +296,8 @@ Quick Quiz #1:
Is there any other situation where rcu_barrier() might
be required?
-Answer: Interestingly enough, rcu_barrier() was not originally
+Answer:
+Interestingly enough, rcu_barrier() was not originally
implemented for module unloading. Nikita Danilov was using
RCU in a filesystem, which resulted in a similar situation at
filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
@@ -318,13 +313,48 @@ Answer: Interestingly enough, rcu_barrier() was not originally
.. _answer_rcubarrier_quiz_2:
Quick Quiz #2:
Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
thereby avoiding the need for lines 9 and 10?
Answer:
Suppose that the on_each_cpu() function shown on line 8 was
delayed, so that CPU 0's rcu_barrier_func() executed and
the corresponding grace period elapsed, all before CPU 1's
rcu_barrier_func() started executing. This would result in
rcu_barrier_cpu_count being decremented to zero, so that line
11's wait_for_completion() would return immediately, failing to
wait for CPU 1's callbacks to be invoked.
Note that this was not a problem when the rcu_barrier() code
was first added back in 2005. This is because on_each_cpu()
disables preemption, which acted as an RCU read-side critical
section, thus preventing CPU 0's grace period from completing
until on_each_cpu() had dealt with all of the CPUs. However,
with the advent of preemptible RCU, rcu_barrier() no longer
waited on nonpreemptible regions of code in preemptible kernels,
that being the job of the new rcu_barrier_sched() function.
However, with the RCU flavor consolidation around v4.20, this
possibility was once again ruled out, because the consolidated
RCU once again waits on nonpreemptible regions of code.
Nevertheless, that extra count might still be a good idea.
Relying on these sorts of accidents of implementation can result
in later surprise bugs when the implementation changes.
:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
.. _answer_rcubarrier_quiz_3:
Quick Quiz #3:
What happens if CPU 0's rcu_barrier_func() executes
immediately (thus incrementing rcu_barrier_cpu_count to the
value one), but the other CPU's rcu_barrier_func() invocations
are delayed for a full grace period? Couldn't this result in
rcu_barrier() returning prematurely?
-Answer: This cannot happen. The reason is that on_each_cpu() has its last
+Answer:
+This cannot happen. The reason is that on_each_cpu() has its last
argument, the wait flag, set to "1". This flag is passed through
to smp_call_function() and further to smp_call_function_on_cpu(),
causing this latter to spin until the cross-CPU invocation of
@@ -336,18 +366,15 @@ Answer: This cannot happen. The reason is that on_each_cpu() has its last
Therefore, on_each_cpu() disables preemption across its call
to smp_call_function() and also across the local call to
-rcu_barrier_func(). This prevents the local CPU from context
-switching, again preventing grace periods from completing. This
+rcu_barrier_func(). Because recent RCU implementations treat
+preemption-disabled regions of code as RCU read-side critical
+sections, this prevents grace periods from completing. This
means that all CPUs have executed rcu_barrier_func() before
the first rcu_barrier_callback() can possibly execute, in turn
preventing rcu_barrier_cpu_count from prematurely reaching zero.
-Currently, -rt implementations of RCU keep but a single global
-queue for RCU callbacks, and thus do not suffer from this
-problem. However, when the -rt RCU eventually does have per-CPU
-callback queues, things will have to change. One simple change
-is to add an rcu_read_lock() before line 8 of rcu_barrier()
-and an rcu_read_unlock() after line 8 of this same function. If
-you can think of a better change, please let me know!
+But if on_each_cpu() ever decides to forgo disabling preemption,
+as might well happen due to real-time latency considerations,
+initializing rcu_barrier_cpu_count to one will save the day.
-:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
+:ref:`Back to Quick Quiz #3 <rcubarrier_quiz_3>`


@@ -14,19 +14,19 @@ Using 'nulls'
=============
Using special markers (called 'nulls') is a convenient way
-to solve following problem :
+to solve the following problem.
-A typical RCU linked list managing objects which are
-allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
-use following algos :
+Without 'nulls', a typical RCU linked list managing objects which are
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use the following
+algorithms:
-1) Lookup algo
---------------
+1) Lookup algorithm
+-------------------
::
-rcu_read_lock()
begin:
+rcu_read_lock()
obj = lockless_lookup(key);
if (obj) {
if (!try_get_ref(obj)) // might fail for free objects
@@ -38,6 +38,7 @@ use following algos :
*/
if (obj->key != key) { // not the object we expected
put_ref(obj);
rcu_read_unlock();
goto begin;
}
}
@@ -52,9 +53,9 @@ but a version with an additional memory barrier (smp_rmb())
{
struct hlist_node *node, *next;
for (pos = rcu_dereference((head)->first);
     pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
     pos = rcu_dereference(next))
if (obj->key == key)
return obj;
return NULL;
@@ -64,9 +65,9 @@ And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
struct hlist_node *node;
for (pos = rcu_dereference((head)->first);
     pos && ({ prefetch(pos->next); 1; }) &&
     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
     pos = rcu_dereference(pos->next))
if (obj->key == key)
return obj;
return NULL;
@@ -82,36 +83,32 @@ Quoting Corey Minyard::
solved by pre-fetching the "next" field (with proper barriers) before
checking the key."
-2) Insert algo
---------------
+2) Insertion algorithm
+----------------------
We need to make sure a reader cannot read the new 'obj->obj_next' value
-and previous value of 'obj->key'. Or else, an item could be deleted
+and previous value of 'obj->key'. Otherwise, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
-before the move, 'next' pointer is NULL, and lockless reader can
-not detect it missed following items in original chain.
+before the move, the 'next' pointer is NULL, and the lockless reader
+cannot detect that it missed the following items in the original chain.
::
/*
 * Please note that new inserts are done at the head of list,
 * not in the middle or end.
 */
obj = kmem_cache_alloc(...);
lock_chain(); // typically a spin_lock()
obj->key = key;
-/*
- * we need to make sure obj->key is updated before obj->next
- * or obj->refcnt
- */
-smp_wmb();
-atomic_set(&obj->refcnt, 1);
+atomic_set_release(&obj->refcnt, 1); // key before refcnt
hlist_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
-3) Remove algo
---------------
+3) Removal algorithm
+--------------------
Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
@@ -133,7 +130,7 @@ Avoiding extra smp_rmb()
========================
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
-and extra smp_wmb() in insert function.
+and extra _release() in insert function.
For example, if we choose to store the slot number as the 'nulls'
end-of-list marker for each slot of the hash table, we can detect
@@ -142,59 +139,61 @@ to another chain) checking the final 'nulls' value if
the lookup met the end of chain. If final 'nulls' value
is not the slot number, then we must restart the lookup at
the beginning. If the object was moved to the same chain,
-then the reader doesn't care : It might eventually
+then the reader doesn't care: It might occasionally
scan the list again without harm.
-1) lookup algo
---------------
+1) lookup algorithm
+-------------------
::
head = &table[slot];
-rcu_read_lock();
begin:
+rcu_read_lock();
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
if (obj->key == key) {
-		if (!try_get_ref(obj)) // might fail for free objects
-			goto begin;
-		if (obj->key != key) { // not the object we expected
-			put_ref(obj);
-			goto begin;
-		}
-		goto out;
+		if (!try_get_ref(obj)) { // might fail for free objects
+			rcu_read_unlock();
+			goto begin;
+		}
+		if (obj->key != key) { // not the object we expected
+			put_ref(obj);
+			rcu_read_unlock();
+			goto begin;
+		}
+		goto out;
}
}
-/*
- * if the nulls value we got at the end of this lookup is
- * not the expected one, we must restart lookup.
- * We probably met an item that was moved to another chain.
- */
-if (get_nulls_value(node) != slot)
-	goto begin;
+// If the nulls value we got at the end of this lookup is
+// not the expected one, we must restart lookup.
+// We probably met an item that was moved to another chain.
+if (get_nulls_value(node) != slot) {
+	put_ref(obj);
+	rcu_read_unlock();
+	goto begin;
+}
obj = NULL;
out:
rcu_read_unlock();
-2) Insert function
-------------------
+2) Insert algorithm
+-------------------
::
/*
 * Please note that new inserts are done at the head of list,
 * not in the middle or end.
 */
obj = kmem_cache_alloc(cachep);
lock_chain(); // typically a spin_lock()
obj->key = key;
-/*
- * changes to obj->key must be visible before refcnt one
- */
-smp_wmb();
-atomic_set(&obj->refcnt, 1);
+atomic_set_release(&obj->refcnt, 1); // key before refcnt
/*
 * insert obj in RCU way (readers might be traversing chain)
 */
hlist_nulls_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()


@@ -25,10 +25,10 @@ warnings:
- A CPU looping with bottom halves disabled.
-- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel
-without invoking schedule(). If the looping in the kernel is
-really expected and desirable behavior, you might need to add
-some calls to cond_resched().
+- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the
+kernel without potentially invoking schedule(). If the looping
+in the kernel is really expected and desirable behavior, you
+might need to add some calls to cond_resched().
- Booting Linux using a console connection that is too slow to
keep up with the boot-time console-message rate. For example,
@@ -108,16 +108,17 @@ warnings:
- A bug in the RCU implementation.
-- A hardware failure. This is quite unlikely, but has occurred
-at least once in real life. A CPU failed in a running system,
-becoming unresponsive, but not causing an immediate crash.
-This resulted in a series of RCU CPU stall warnings, eventually
-leading the realization that the CPU had failed.
+- A hardware failure. This is quite unlikely, but is not at all
+uncommon in large datacenters. In one memorable case some decades
+back, a CPU failed in a running system, becoming unresponsive,
+but not causing an immediate crash. This resulted in a series
+of RCU CPU stall warnings, eventually leading to the realization
+that the CPU had failed.
-The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
-Note that SRCU does *not* have CPU stall warnings. Please note that
-RCU only detects CPU stalls when there is a grace period in progress.
-No grace period, no CPU stall warnings.
+The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
+CPU stall warnings. Note that SRCU does *not* have CPU stall warnings.
+Please note that RCU only detects CPU stalls when there is a grace period
+in progress. No grace period, no CPU stall warnings.
To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
@@ -205,16 +206,21 @@ RCU_STALL_RAT_DELAY
rcupdate.rcu_task_stall_timeout
-------------------------------
-This boot/sysfs parameter controls the RCU-tasks stall warning
-interval. A value of zero or less suppresses RCU-tasks stall
-warnings. A positive value sets the stall-warning interval
-in seconds. An RCU-tasks stall warning starts with the line:
+This boot/sysfs parameter controls the RCU-tasks and
+RCU-tasks-trace stall warning intervals. A value of zero or less
+suppresses RCU-tasks stall warnings. A positive value sets the
+stall-warning interval in seconds. An RCU-tasks stall warning
+starts with the line:
INFO: rcu_tasks detected stalls on tasks:
And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.
An RCU-tasks-trace stall warning starts (and continues) similarly:
INFO: rcu_tasks_trace detected stalls on tasks
Interpreting RCU's CPU Stall-Detector "Splats"
==============================================
@@ -248,7 +254,8 @@ dynticks counter, which will have an even-numbered value if the CPU
is in dyntick-idle mode and an odd-numbered value otherwise. The hex
number between the two "/"s is the value of the nesting, which will be
a small non-negative number if in the idle loop (as shown above) and a
-very large positive number otherwise.
+very large positive number otherwise. The number following the final
+"/" is the NMI nesting, which will be a small non-negative number.
The "softirq=" portion of the message tracks the number of RCU softirq
handlers that the stalled CPU has executed. The number before the "/"
@@ -383,3 +390,95 @@ for example, "P3421".
It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time during the same run.
RCU_CPU_STALL_CPUTIME
=====================
In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
rcupdate.rcu_cpu_stall_cputime=1, the following additional information
is supplied with each RCU CPU stall warning::
	rcu:          hardirqs   softirqs   csw/system
	rcu:  number:      624         45            0
	rcu: cputime:       69          1         2425   ==> 2500(ms)
These statistics are collected during the sampling period. The values
in row "number:" are the number of hard interrupts, number of soft
interrupts, and number of context switches on the stalled CPU. The
first three values in row "cputime:" indicate the CPU time in
milliseconds consumed by hard interrupts, soft interrupts, and tasks
on the stalled CPU. The last number is the measurement interval, again
in milliseconds. Because user-mode tasks normally do not cause RCU CPU
stalls, these tasks are typically kernel tasks, which is why only the
system CPU time is considered.
The sampling period is shown as follows::
	|<------------first timeout---------->|<-----second timeout----->|
	|<--half timeout-->|<--half timeout-->|                          |
	|                  |<--first period-->|                          |
	|                  |<-----------second sampling period---------->|
	|                  |                  |                          |
	          snapshot time point     1st-stall              2nd-stall
The following describes four typical scenarios:
1. A CPU looping with interrupts disabled.
::
	rcu:          hardirqs   softirqs   csw/system
	rcu:  number:        0          0            0
	rcu: cputime:        0          0            0   ==> 2500(ms)
Because interrupts have been disabled throughout the measurement
interval, there are no interrupts and no context switches.
Furthermore, because CPU time consumption was measured using interrupt
handlers, the system CPU consumption is misleadingly measured as zero.
This scenario will normally also have "(0 ticks this GP)" printed on
this CPU's summary line.
2. A CPU looping with bottom halves disabled.
This is similar to the previous example, but with a non-zero number of
hard interrupts, non-zero CPU time consumed by them, and also non-zero
CPU time consumed by in-kernel execution::
	rcu:          hardirqs   softirqs   csw/system
	rcu:  number:      624          0            0
	rcu: cputime:       49          0         2446   ==> 2500(ms)
The fact that there are zero softirqs gives a hint that these were
disabled, perhaps via local_bh_disable(). It is of course possible
that there were no softirqs, perhaps because all events that would
result in softirq execution are confined to other CPUs. In this case,
the diagnosis should continue as shown in the next example.
3. A CPU looping with preemption disabled.
Here, only the number of context switches is zero::
	rcu:          hardirqs   softirqs   csw/system
	rcu:  number:      624         45            0
	rcu: cputime:       69          1         2425   ==> 2500(ms)
This situation hints that the stalled CPU was looping with preemption
disabled.
4. No looping, but massive hard and soft interrupts.
::
	rcu:          hardirqs   softirqs   csw/system
	rcu:  number:       xx         xx            0
	rcu: cputime:       xx         xx            0   ==> 2500(ms)
Here, the number and CPU time of hard interrupts are all non-zero,
but the number of context switches and the in-kernel CPU time consumed
are zero. The number and cputime of soft interrupts will usually be
non-zero, but could be zero, for example, if the CPU was spinning
within a single hard interrupt handler.
If this type of RCU CPU stall warning can be reproduced, you can
narrow it down by looking at /proc/interrupts or by writing code to
trace each interrupt, for example, by referring to show_interrupts().


@@ -206,7 +206,11 @@ values for memory may require disabling the callback-flooding tests
using the --bootargs parameter discussed below.
Sometimes additional debugging is useful, and in such cases the --kconfig
-parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``.
+parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_RCU_EQS_DEBUG=y'``.
In addition, there are the --gdb, --kasan, and --kcsan parameters.
Note that --gdb limits you to one scenario per kvm.sh run and requires
that you have another window open from which to run ``gdb`` as instructed
by the script.
Kernel boot arguments can also be supplied, for example, to control
rcutorture's module parameters. For example, to test a change to RCU's
@@ -219,10 +223,17 @@ require disabling rcutorture's callback-flooding tests::
--bootargs 'rcutorture.fwd_progress=0'
Sometimes all that is needed is a full set of kernel builds. This is
-what the --buildonly argument does.
+what the --buildonly parameter does.
-Finally, the --trust-make argument allows each kernel build to reuse what
-it can from the previous kernel build.
+The --duration parameter can override the default run time of 30 minutes.
+For example, ``--duration 2d`` would run for two days, ``--duration 3h``
+would run for three hours, ``--duration 5m`` would run for five minutes,
+and ``--duration 45s`` would run for 45 seconds. This last can be useful
+for tracking down rare boot-time failures.
+
+Finally, the --trust-make parameter allows each kernel build to reuse what
+it can from the previous kernel build. Please note that without the
+--trust-make parameter, your tags files may be demolished.
There are additional, more arcane parameters that are documented in the
source code of the kvm.sh script.
@@ -291,3 +302,73 @@ the following summary at the end of the run on a 12-CPU system::
TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
CPU count limited from 16 to 12
TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011
Repeated Runs
=============
Suppose that you are chasing down a rare boot-time failure. Although you
could use kvm.sh, doing so will rebuild the kernel on each run. If you
need (say) 1,000 runs to have confidence that you have fixed the bug,
these pointless rebuilds can become extremely annoying.
This is why kvm-again.sh exists.
Suppose that a previous kvm.sh run left its output in this directory::
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
Then this run can be re-run without rebuilding as follows::
kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
A few of the original run's kvm.sh parameters may be overridden, perhaps
most notably --duration and --bootargs. For example::
kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28 \
--duration 45s
would re-run the previous test, but for only 45 seconds, thus facilitating
tracking down the aforementioned rare boot-time failure.
Distributed Runs
================
Although kvm.sh is quite useful, its testing is confined to a single
system. It is not all that hard to use your favorite framework to cause
(say) 5 instances of kvm.sh to run on your 5 systems, but this will very
likely unnecessarily rebuild kernels. In addition, manually distributing
the desired rcutorture scenarios across the available systems can be
painstaking and error-prone.
And this is why the kvm-remote.sh script exists.
If the following command works::
ssh system0 date
and if it also works for system1, system2, system3, system4, and system5,
and all of these systems have 64 CPUs, you can type::
kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
--cpus 64 --duration 8h --configs "5*CFLIST"
This will build each default scenario's kernel on the local system, then
spread each of five instances of each scenario over the systems listed,
running each scenario for eight hours. At the end of the runs, the
results will be gathered, recorded, and printed. Most of the parameters
that kvm.sh will accept can be passed to kvm-remote.sh, but the list of
systems must come first.
The kvm.sh ``--dryrun scenarios`` argument is useful for working out
how many scenarios may be run in one batch across a group of systems.
You can also re-run a previous remote run in a manner similar to kvm.sh::
kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \
--duration 24h
In this case, most of the kvm-again.sh parameters may be supplied following
the pathname of the old run-results directory.


@@ -16,18 +16,23 @@ to start learning about RCU:
| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
| 2019 Big API Table https://lwn.net/Articles/777165/
For those preferring video:
| 1. Unraveling RCU Mysteries: Fundamentals https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
| 2. Unraveling RCU Mysteries: Additional Use Cases https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases
What is RCU?
RCU is a synchronization mechanism that was added to the Linux kernel
during the 2.5 development effort that is optimized for read-mostly
-situations. Although RCU is actually quite simple once you understand it,
-getting there can sometimes be a challenge. Part of the problem is that
-most of the past descriptions of RCU have been written with the mistaken
-assumption that there is "one true way" to describe RCU. Instead,
-the experience has been that different people must take different paths
-to arrive at an understanding of RCU. This document provides several
-different paths, as follows:
+situations. Although RCU is actually quite simple, making effective use
+of it requires you to think differently about your code. Another part
+of the problem is the mistaken assumption that there is "one true way" to
+describe and to use RCU. Instead, the experience has been that different
+people must take different paths to arrive at an understanding of RCU,
+depending on their experiences and use cases. This document provides
+several different paths, as follows:
:ref:`1. RCU OVERVIEW <1_whatisRCU>`
@@ -157,34 +162,36 @@ rcu_read_lock()
^^^^^^^^^^^^^^^
void rcu_read_lock(void);
-Used by a reader to inform the reclaimer that the reader is
-entering an RCU read-side critical section. It is illegal
-to block while in an RCU read-side critical section, though
-kernels built with CONFIG_PREEMPT_RCU can preempt RCU
-read-side critical sections. Any RCU-protected data structure
-accessed during an RCU read-side critical section is guaranteed to
-remain unreclaimed for the full duration of that critical section.
-Reference counts may be used in conjunction with RCU to maintain
-longer-term references to data structures.
+This temporal primitive is used by a reader to inform the
+reclaimer that the reader is entering an RCU read-side critical
+section. It is illegal to block while in an RCU read-side
+critical section, though kernels built with CONFIG_PREEMPT_RCU
+can preempt RCU read-side critical sections. Any RCU-protected
+data structure accessed during an RCU read-side critical section
+is guaranteed to remain unreclaimed for the full duration of that
+critical section. Reference counts may be used in conjunction
+with RCU to maintain longer-term references to data structures.
rcu_read_unlock()
^^^^^^^^^^^^^^^^^
void rcu_read_unlock(void);
-Used by a reader to inform the reclaimer that the reader is
-exiting an RCU read-side critical section. Note that RCU
-read-side critical sections may be nested and/or overlapping.
+This temporal primitive is used by a reader to inform the
+reclaimer that the reader is exiting an RCU read-side critical
+section. Note that RCU read-side critical sections may be nested
+and/or overlapping.
synchronize_rcu()
^^^^^^^^^^^^^^^^^
void synchronize_rcu(void);
-Marks the end of updater code and the beginning of reclaimer
-code. It does this by blocking until all pre-existing RCU
-read-side critical sections on all CPUs have completed.
-Note that synchronize_rcu() will **not** necessarily wait for
-any subsequent RCU read-side critical sections to complete.
-For example, consider the following sequence of events::
+This temporal primitive marks the end of updater code and the
+beginning of reclaimer code. It does this by blocking until
+all pre-existing RCU read-side critical sections on all CPUs
+have completed. Note that synchronize_rcu() will **not**
+necessarily wait for any subsequent RCU read-side critical
+sections to complete. For example, consider the following
+sequence of events::
CPU 0 CPU 1 CPU 2
----------------- ------------------------- ---------------
@@ -211,13 +218,13 @@ synchronize_rcu()
to be useful in all but the most read-intensive situations,
synchronize_rcu()'s overhead must also be quite small.
-The call_rcu() API is a callback form of synchronize_rcu(),
-and is described in more detail in a later section. Instead of
-blocking, it registers a function and argument which are invoked
-after all ongoing RCU read-side critical sections have completed.
-This callback variant is particularly useful in situations where
-it is illegal to block or where update-side performance is
-critically important.
+The call_rcu() API is an asynchronous callback form of
+synchronize_rcu(), and is described in more detail in a later
+section. Instead of blocking, it registers a function and
+argument which are invoked after all ongoing RCU read-side
+critical sections have completed. This callback variant is
+particularly useful in situations where it is illegal to block
+or where update-side performance is critically important.
However, the call_rcu() API should not be used lightly, as use
of the synchronize_rcu() API generally results in simpler code.
@@ -236,11 +243,13 @@ rcu_assign_pointer()
would be cool to be able to declare a function in this manner.
(Compiler experts will no doubt disagree.)
-The updater uses this function to assign a new value to an
+The updater uses this spatial macro to assign a new value to an
RCU-protected pointer, in order to safely communicate the change
-in value from the updater to the reader. This macro does not
-evaluate to an rvalue, but it does execute any memory-barrier
-instructions required for a given CPU architecture.
+in value from the updater to the reader. This is a spatial (as
+opposed to temporal) macro. It does not evaluate to an rvalue,
+but it does execute any memory-barrier instructions required
+for a given CPU architecture. Its ordering properties are those
+of a store-release operation.
Perhaps just as important, it serves to document (1) which
pointers are protected by RCU and (2) the point at which a
@@ -255,14 +264,15 @@ rcu_dereference()
Like rcu_assign_pointer(), rcu_dereference() must be implemented
as a macro.
-The reader uses rcu_dereference() to fetch an RCU-protected
-pointer, which returns a value that may then be safely
-dereferenced. Note that rcu_dereference() does not actually
-dereference the pointer, instead, it protects the pointer for
-later dereferencing. It also executes any needed memory-barrier
-instructions for a given CPU architecture. Currently, only Alpha
-needs memory barriers within rcu_dereference() -- on other CPUs,
-it compiles to nothing, not even a compiler directive.
+The reader uses the spatial rcu_dereference() macro to fetch
+an RCU-protected pointer, which returns a value that may
+then be safely dereferenced. Note that rcu_dereference()
+does not actually dereference the pointer, instead, it
+protects the pointer for later dereferencing. It also
+executes any needed memory-barrier instructions for a given
+CPU architecture. Currently, only Alpha needs memory barriers
+within rcu_dereference() -- on other CPUs, it compiles to a
+volatile load.
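
Pulling these two macros together, a minimal publish/subscribe
sketch, assuming a hypothetical global pointer gp and a one-field
struct foo, might look like this::

	struct foo {
		int a;
	};
	static struct foo __rcu *gp;

	/* Updater: initialize first, then publish with release semantics. */
	p = kmalloc(sizeof(*p), GFP_KERNEL);
	if (p) {
		p->a = 42;
		rcu_assign_pointer(gp, p);
	}

	/* Reader: subscribe, then dereference within the critical section. */
	rcu_read_lock();
	q = rcu_dereference(gp);
	if (q)
		do_something_with(q->a);
	rcu_read_unlock();
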
Common coding practice uses rcu_dereference() to copy an
RCU-protected pointer to a local variable, then dereferences
@@ -355,12 +365,15 @@ reader, updater, and reclaimer.
synchronize_rcu() & call_rcu()
-The RCU infrastructure observes the time sequence of rcu_read_lock(),
+The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.
The rcu_assign_pointer() and rcu_dereference() invocations communicate
spatial changes via stores to and loads from the RCU-protected pointer in
question.
There are at least three flavors of RCU usage in the Linux kernel. The diagram
above shows the most common one. On the updater side, the rcu_assign_pointer(),
@@ -392,7 +405,9 @@ b. RCU applied to networking data structures that may be subjected
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
Again, most uses will be of (a). The (b) and (c) cases are important
- for specialized uses, but are relatively uncommon.
+ for specialized uses, but are relatively uncommon. The SRCU, RCU-Tasks,
+ RCU-Tasks-Rude, and RCU-Tasks-Trace flavors have similar relationships
+ among their assorted primitives.
.. _3_whatisRCU:
@@ -468,7 +483,7 @@ So, to sum up:
- Within an RCU read-side critical section, use rcu_dereference()
to dereference RCU-protected pointers.
- - Use some solid scheme (such as locks or semaphores) to
+ - Use some solid design (such as locks or semaphores) to
keep concurrent updates from interfering with each other.
- Use rcu_assign_pointer() to update an RCU-protected pointer.
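Pulling these rules together, an update might look like the following
sketch, in which gp, foo_lock, and new are hypothetical::

	struct foo *old;

	spin_lock(&foo_lock);		/* exclude concurrent updaters */
	old = rcu_dereference_protected(gp, lockdep_is_held(&foo_lock));
	rcu_assign_pointer(gp, new);
	spin_unlock(&foo_lock);
	synchronize_rcu();		/* wait for pre-existing readers */
	kfree(old);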
@@ -579,6 +594,14 @@ to avoid having to write your own callback::
kfree_rcu(old_fp, rcu);
+ If the occasional sleep is permitted, the single-argument form may
+ be used, omitting the rcu_head structure from struct foo. ::
+
+	kfree_rcu(old_fp);
+
+ This variant of kfree_rcu() almost never blocks, but might do so by
+ invoking synchronize_rcu() in response to memory-allocation failure.
Again, see checklist.rst for additional rules governing the use of RCU.
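The difference between the two forms shows up in struct foo itself,
as in this sketch (old_fp plays the same role as above, and the two
calls are alternatives, not a sequence)::

	struct foo {
		int a;
		struct rcu_head rcu;	/* needed only by the two-argument form */
	};

	kfree_rcu(old_fp, rcu);		/* never blocks */

	/* Or, where occasional sleeping is permitted: */
	kfree_rcu(old_fp);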
.. _5_whatisRCU:
@@ -596,7 +619,7 @@ lacking both functionality and performance. However, they are useful
in getting a feel for how RCU works. See kernel/rcu/update.c for a
production-quality implementation, and see:
- http://www.rdrop.com/users/paulmck/RCU
+ https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
for papers describing the Linux kernel RCU implementation. The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
@@ -929,6 +952,8 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
initialized after each and every call to kmem_cache_alloc(), which renders
reference-free spinlock acquisition completely unsafe. Therefore, when
using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
+ (Those willing to use a kmem_cache constructor may also use locking,
+ including cache-friendly sequence locking.)
With traditional reference counting -- such as that implemented by the
kref library in Linux -- there is typically code that runs when the last
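To make the reference-counter requirement concrete, a lookup under
``SLAB_TYPESAFE_BY_RCU`` might follow this sketch, in which foo_lookup(),
foo_put(), and the key field are hypothetical::

	struct foo *f;

	rcu_read_lock();
	f = foo_lookup(key);			/* RCU-protected search */
	if (f && !refcount_inc_not_zero(&f->ref))
		f = NULL;			/* object was being freed */
	if (f && f->key != key) {
		foo_put(f);			/* object was recycled: bail */
		f = NULL;
	}
	rcu_read_unlock();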
@@ -1047,6 +1072,30 @@ sched::
rcu_read_lock_sched_held
+ RCU-Tasks::
+
+	Critical sections        Grace period                   Barrier
+
+	N/A                      call_rcu_tasks                 rcu_barrier_tasks
+	                         synchronize_rcu_tasks
+
+ RCU-Tasks-Rude::
+
+	Critical sections        Grace period                   Barrier
+
+	N/A                      call_rcu_tasks_rude            rcu_barrier_tasks_rude
+	                         synchronize_rcu_tasks_rude
+
+ RCU-Tasks-Trace::
+
+	Critical sections        Grace period                   Barrier
+
+	rcu_read_lock_trace      call_rcu_tasks_trace           rcu_barrier_tasks_trace
+	rcu_read_unlock_trace    synchronize_rcu_tasks_trace
+
SRCU::

	Critical sections        Grace period                   Barrier
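Because RCU-Tasks and RCU-Tasks-Rude have no explicit read-side markers
(hence the N/A entries above), typical usage touches only the update
side. A minimal sketch, with free_trampoline() and tramp as hypothetical
stand-ins::

	/* Wait for every task to pass through a voluntary context switch. */
	synchronize_rcu_tasks();
	free_trampoline(tramp);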
@@ -1087,35 +1136,43 @@ list can be helpful:
a. Will readers need to block? If so, you need SRCU.
- b. What about the -rt patchset? If readers would need to block
-    in an non-rt kernel, you need SRCU. If readers would block
-    in a -rt kernel, but not in a non-rt kernel, SRCU is not
-    necessary. (The -rt patchset turns spinlocks into sleeplocks,
-    hence this distinction.)
+ b. Will readers need to block and are you doing tracing, for
+    example, ftrace or BPF? If so, you need RCU-tasks,
+    RCU-tasks-rude, and/or RCU-tasks-trace.
- c. Do you need to treat NMI handlers, hardirq handlers,
+ c. What about the -rt patchset? If readers would need to block in
+    a non-rt kernel, you need SRCU. If readers would block when
+    acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
+    SRCU is not necessary. (The -rt patchset turns spinlocks into
+    sleeplocks, hence this distinction.)
+
+ d. Do you need to treat NMI handlers, hardirq handlers,
and code segments with preemption disabled (whether
via preempt_disable(), local_irq_save(), local_bh_disable(),
or some other mechanism) as if they were explicit RCU readers?
-    If so, RCU-sched is the only choice that will work for you.
+    If so, RCU-sched readers are the only choice that will work
+    for you, but since about v4.20 you can use the vanilla RCU
+    update primitives.
- d. Do you need RCU grace periods to complete even in the face
-    of softirq monopolization of one or more of the CPUs? For
-    example, is your code subject to network-based denial-of-service
-    attacks? If so, you should disable softirq across your readers,
-    for example, by using rcu_read_lock_bh().
+ e. Do you need RCU grace periods to complete even in the face of
+    softirq monopolization of one or more of the CPUs? For example,
+    is your code subject to network-based denial-of-service attacks?
+    If so, you should disable softirq across your readers, for
+    example, by using rcu_read_lock_bh(). Since about v4.20 you
+    can use the vanilla RCU update primitives.
- e. Is your workload too update-intensive for normal use of
+ f. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
named SLAB_DESTROY_BY_RCU). But please be careful!
- f. Do you need read-side critical sections that are respected
-    even though they are in the middle of the idle loop, during
-    user-mode execution, or on an offlined CPU? If so, SRCU is the
-    only choice that will work for you.
+ g. Do you need read-side critical sections that are respected even
+    on CPUs that are deep in the idle loop, during entry to or exit
+    from user-mode execution, or on an offlined CPU? If so, SRCU
+    and RCU Tasks Trace are the only choices that will work for you,
+    with SRCU being strongly preferred in almost all cases.
- g. Otherwise, use RCU.
+ h. Otherwise, use RCU.
Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
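For the SRCU cases in the list above, a minimal sketch, assuming a
hypothetical SRCU domain of its own::

	DEFINE_STATIC_SRCU(my_srcu);		/* hypothetical SRCU domain */

	int idx;

	idx = srcu_read_lock(&my_srcu);
	/* Read-side critical section: sleeping is permitted here. */
	srcu_read_unlock(&my_srcu, idx);

	/* Updater: waits only for readers in the my_srcu domain. */
	synchronize_srcu(&my_srcu);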


@@ -5113,6 +5113,17 @@
rcupdate.rcu_cpu_stall_timeout to be used (after
conversion from seconds to milliseconds).
+ rcupdate.rcu_cpu_stall_cputime= [KNL]
+		Provide statistics on the cputime and count of
+		interrupts and tasks during the sampling period. For
+		multiple continuous RCU stalls, all sampling periods
+		begin at half of the first RCU stall timeout.
+
+ rcupdate.rcu_exp_stall_task_details= [KNL]
+		Print stack dumps of any tasks blocking the
+		current expedited RCU grace period during an
+		expedited RCU CPU stall warning.
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
example, synchronize_rcu_expedited() instead
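Both new diagnostics might be enabled at boot with a kernel command-line
fragment such as the following, assuming a kernel built with the
corresponding stall-warning support::

	rcupdate.rcu_cpu_stall_cputime=1 rcupdate.rcu_exp_stall_task_details=1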


@@ -181,7 +181,6 @@ void fw_devlink_purge_absent_suppliers(struct fwnode_handle *fwnode)
}
EXPORT_SYMBOL_GPL(fw_devlink_purge_absent_suppliers);
- #ifdef CONFIG_SRCU
static DEFINE_MUTEX(device_links_lock);
DEFINE_STATIC_SRCU(device_links_srcu);
@@ -220,47 +219,6 @@ static void device_link_remove_from_lists(struct device_link *link)
list_del_rcu(&link->s_node);
list_del_rcu(&link->c_node);
}
- #else /* !CONFIG_SRCU */
- static DECLARE_RWSEM(device_links_lock);
-
- static inline void device_links_write_lock(void)
- {
-	down_write(&device_links_lock);
- }
-
- static inline void device_links_write_unlock(void)
- {
-	up_write(&device_links_lock);
- }
-
- int device_links_read_lock(void)
- {
-	down_read(&device_links_lock);
-	return 0;
- }
-
- void device_links_read_unlock(int not_used)
- {
-	up_read(&device_links_lock);
- }
-
- #ifdef CONFIG_DEBUG_LOCK_ALLOC
- int device_links_read_lock_held(void)
- {
-	return lockdep_is_held(&device_links_lock);
- }
- #endif
-
- static inline void device_link_synchronize_removal(void)
- {
- }
-
- static void device_link_remove_from_lists(struct device_link *link)
- {
-	list_del(&link->s_node);
-	list_del(&link->c_node);
- }
- #endif /* !CONFIG_SRCU */
static bool device_is_ancestor(struct device *dev, struct device *target)
{


@@ -1,7 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
menuconfig DAX
tristate "DAX: direct access to differentiated memory"
- select SRCU
default m if NVDIMM_DAX
if DAX


@@ -2,7 +2,6 @@
config STM
tristate "System Trace Module devices"
select CONFIGFS_FS
- select SRCU
help
A System Trace Module (STM) is a device exporting data in System
Trace Protocol (STP) format as defined by MIPI STP standards.


@@ -6,7 +6,6 @@
menuconfig MD
bool "Multiple devices driver support (RAID and LVM)"
depends on BLOCK
- select SRCU
help
Support multiple physical spindles through a single logical device.
Required for RAID and logical volume management.


@@ -334,7 +334,6 @@ config NETCONSOLE_DYNAMIC
config NETPOLL
def_bool NETCONSOLE
- select SRCU
config NET_POLL_CONTROLLER
def_bool NETPOLL


@@ -258,7 +258,7 @@ config PCIE_MEDIATEK_GEN3
MediaTek SoCs.
config VMD
- depends on PCI_MSI && X86_64 && SRCU && !UML
+ depends on PCI_MSI && X86_64 && !UML
tristate "Intel Volume Management Device Driver"
help
Adds support for the Intel Volume Management Device (VMD). VMD is a


@@ -17,7 +17,6 @@ config BTRFS_FS
select FS_IOMAP
select RAID6_PQ
select XOR_BLOCKS
- select SRCU
depends on PAGE_SIZE_LESS_THAN_256KB
help


@@ -1890,7 +1890,6 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp,
}
EXPORT_SYMBOL(generic_setlease);
- #if IS_ENABLED(CONFIG_SRCU)
/*
* Kernel subsystems can register to be notified on any attempt to set
* a new lease with the lease_notifier_chain. This is used by (e.g.) nfsd
@@ -1924,30 +1923,6 @@ void lease_unregister_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(lease_unregister_notifier);
- #else /* !IS_ENABLED(CONFIG_SRCU) */
- static inline void
- lease_notifier_chain_init(void)
- {
- }
-
- static inline void
- setlease_notifier(long arg, struct file_lock *lease)
- {
- }
-
- int lease_register_notifier(struct notifier_block *nb)
- {
-	return 0;
- }
- EXPORT_SYMBOL_GPL(lease_register_notifier);
-
- void lease_unregister_notifier(struct notifier_block *nb)
- {
- }
- EXPORT_SYMBOL_GPL(lease_unregister_notifier);
- #endif /* IS_ENABLED(CONFIG_SRCU) */
/**
* vfs_setlease - sets a lease on an open file
* @filp: file pointer


@@ -1,7 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
config FSNOTIFY
def_bool n
- select SRCU
source "fs/notify/dnotify/Kconfig"
source "fs/notify/inotify/Kconfig"

Some files were not shown because too many files have changed in this diff.