You've already forked linux-apfs
mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar: "The main changes are: - Debloat RCU headers - Parallelize SRCU callback handling (plus overlapping patches) - Improve the performance of Tree SRCU on a CPU-hotplug stress test - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Open-code the rcu_cblist_n_lazy_cbs() function rcu: Open-code the rcu_cblist_n_cbs() function rcu: Open-code the rcu_cblist_empty() function rcu: Separately compile large rcu_segcblist functions srcu: Debloat the <linux/rcu_segcblist.h> header srcu: Adjust default auto-expediting holdoff srcu: Specify auto-expedite holdoff time srcu: Expedite first synchronize_srcu() when idle srcu: Expedited grace periods with reduced memory contention srcu: Make rcutorture writer stalls print SRCU GP state srcu: Exact tracking of srcu_data structures containing callbacks srcu: Make SRCU be built by default srcu: Fix Kconfig botch when SRCU not selected rcu: Make non-preemptive schedule be Tasks RCU quiescent state srcu: Expedite srcu_schedule_cbs_snp() callback invocation srcu: Parallelize callback handling kvm: Move srcu_struct fields to end of struct kvm rcu: Fix typo in PER_RCU_NODE_PERIOD header comment rcu: Use true/false in assignment to bool rcu: Use bool value directly ...
This commit is contained in:
@@ -17,7 +17,7 @@ rcu_dereference.txt
|
||||
rcubarrier.txt
|
||||
- RCU and Unloadable Modules
|
||||
rculist_nulls.txt
|
||||
- RCU list primitives for use with SLAB_DESTROY_BY_RCU
|
||||
- RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
|
||||
rcuref.txt
|
||||
- Reference-count design for elements of lists/arrays protected by RCU
|
||||
rcu.txt
|
||||
|
||||
@@ -19,6 +19,8 @@ to each other.
|
||||
The <tt>rcu_state</tt> Structure</a>
|
||||
<li> <a href="#The rcu_node Structure">
|
||||
The <tt>rcu_node</tt> Structure</a>
|
||||
<li> <a href="#The rcu_segcblist Structure">
|
||||
The <tt>rcu_segcblist</tt> Structure</a>
|
||||
<li> <a href="#The rcu_data Structure">
|
||||
The <tt>rcu_data</tt> Structure</a>
|
||||
<li> <a href="#The rcu_dynticks Structure">
|
||||
@@ -841,6 +843,134 @@ for lockdep lock-class names.
|
||||
Finally, lines 64-66 produce an error if the maximum number of
|
||||
CPUs is too large for the specified fanout.
|
||||
|
||||
<h3><a name="The rcu_segcblist Structure">
|
||||
The <tt>rcu_segcblist</tt> Structure</a></h3>
|
||||
|
||||
The <tt>rcu_segcblist</tt> structure maintains a segmented list of
|
||||
callbacks as follows:
|
||||
|
||||
<pre>
|
||||
1 #define RCU_DONE_TAIL 0
|
||||
2 #define RCU_WAIT_TAIL 1
|
||||
3 #define RCU_NEXT_READY_TAIL 2
|
||||
4 #define RCU_NEXT_TAIL 3
|
||||
5 #define RCU_CBLIST_NSEGS 4
|
||||
6
|
||||
7 struct rcu_segcblist {
|
||||
8 struct rcu_head *head;
|
||||
9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
|
||||
10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
|
||||
11 long len;
|
||||
12 long len_lazy;
|
||||
13 };
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
The segments are as follows:
|
||||
|
||||
<ol>
|
||||
<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
|
||||
These callbacks are ready to be invoked.
|
||||
<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
|
||||
current grace period.
|
||||
Note that different CPUs can have different ideas about which
|
||||
grace period is current, hence the <tt>->gp_seq</tt> field.
|
||||
<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
|
||||
grace period to start.
|
||||
<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
|
||||
associated with a grace period.
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
The <tt>->head</tt> pointer references the first callback or
|
||||
is <tt>NULL</tt> if the list contains no callbacks (which is
|
||||
<i>not</i> the same as being empty).
|
||||
Each element of the <tt>->tails[]</tt> array references the
|
||||
<tt>->next</tt> pointer of the last callback in the corresponding
|
||||
segment of the list, or the list's <tt>->head</tt> pointer if
|
||||
that segment and all previous segments are empty.
|
||||
If the corresponding segment is empty but some previous segment is
|
||||
not empty, then the array element is identical to its predecessor.
|
||||
Older callbacks are closer to the head of the list, and new callbacks
|
||||
are added at the tail.
|
||||
This relationship between the <tt>->head</tt> pointer, the
|
||||
<tt>->tails[]</tt> array, and the callbacks is shown in this
|
||||
diagram:
|
||||
|
||||
</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
|
||||
|
||||
</p><p>In this figure, the <tt>->head</tt> pointer references the
|
||||
first
|
||||
RCU callback in the list.
|
||||
The <tt>->tails[RCU_DONE_TAIL]</tt> array element references
|
||||
the <tt>->head</tt> pointer itself, indicating that none
|
||||
of the callbacks is ready to invoke.
|
||||
The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback
|
||||
CB 2's <tt>->next</tt> pointer, which indicates that
|
||||
CB 1 and CB 2 are both waiting on the current grace period,
|
||||
give or take possible disagreements about exactly which grace period
|
||||
is the current one.
|
||||
The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element
|
||||
references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt>
|
||||
does, which indicates that there are no callbacks waiting on the next
|
||||
RCU grace period.
|
||||
The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references
|
||||
CB 4's <tt>->next</tt> pointer, indicating that all the
|
||||
remaining RCU callbacks have not yet been assigned to an RCU grace
|
||||
period.
|
||||
Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element
|
||||
always references the last RCU callback's <tt>->next</tt> pointer
|
||||
unless the callback list is empty, in which case it references
|
||||
the <tt>->head</tt> pointer.
|
||||
|
||||
<p>
|
||||
There is one additional important special case for the
|
||||
<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
|
||||
when this list is <i>disabled</i>.
|
||||
Lists are disabled when the corresponding CPU is offline or when
|
||||
the corresponding CPU's callbacks are offloaded to a kthread,
|
||||
both of which are described elsewhere.
|
||||
|
||||
</p><p>CPUs advance their callbacks from the
|
||||
<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
|
||||
<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
|
||||
as grace periods advance.
|
||||
|
||||
</p><p>The <tt>->gp_seq[]</tt> array records grace-period
|
||||
numbers corresponding to the list segments.
|
||||
This is what allows different CPUs to have different ideas as to
|
||||
which is the current grace period while still avoiding premature
|
||||
invocation of their callbacks.
|
||||
In particular, this allows CPUs that go idle for extended periods
|
||||
to determine which of their callbacks are ready to be invoked after
|
||||
reawakening.
|
||||
|
||||
</p><p>The <tt>->len</tt> counter contains the number of
|
||||
callbacks in <tt>->head</tt>, and the
|
||||
<tt>->len_lazy</tt> contains the number of those callbacks that
|
||||
are known to only free memory, and whose invocation can therefore
|
||||
be safely deferred.
|
||||
|
||||
<p><b>Important note</b>: It is the <tt>->len</tt> field that
|
||||
determines whether or not there are callbacks associated with
|
||||
this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt>
|
||||
pointer.
|
||||
The reason for this is that all the ready-to-invoke callbacks
|
||||
(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
|
||||
all at once at callback-invocation time.
|
||||
If callback invocation must be postponed, for example, because a
|
||||
high-priority process just woke up on this CPU, then the remaining
|
||||
callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment.
|
||||
Either way, the <tt>->len</tt> and <tt>->len_lazy</tt> counts
|
||||
are adjusted after the corresponding callbacks have been invoked, and so
|
||||
again it is the <tt>->len</tt> count that accurately reflects whether
|
||||
or not there are callbacks associated with this <tt>rcu_segcblist</tt>
|
||||
structure.
|
||||
Of course, off-CPU sampling of the <tt>->len</tt> count requires
|
||||
the use of appropriate synchronization, for example, memory barriers.
|
||||
This synchronization can be a bit subtle, particularly in the case
|
||||
of <tt>rcu_barrier()</tt>.
|
||||
|
||||
<h3><a name="The rcu_data Structure">
|
||||
The <tt>rcu_data</tt> Structure</a></h3>
|
||||
|
||||
@@ -983,62 +1113,18 @@ choice.
|
||||
as follows:
|
||||
|
||||
<pre>
|
||||
1 struct rcu_head *nxtlist;
|
||||
2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
|
||||
3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
|
||||
4 long qlen_lazy;
|
||||
5 long qlen;
|
||||
6 long qlen_last_fqs_check;
|
||||
1 struct rcu_segcblist cblist;
|
||||
2 long qlen_last_fqs_check;
|
||||
3 unsigned long n_cbs_invoked;
|
||||
4 unsigned long n_nocbs_invoked;
|
||||
5 unsigned long n_cbs_orphaned;
|
||||
6 unsigned long n_cbs_adopted;
|
||||
7 unsigned long n_force_qs_snap;
|
||||
8 unsigned long n_cbs_invoked;
|
||||
9 unsigned long n_cbs_orphaned;
|
||||
10 unsigned long n_cbs_adopted;
|
||||
11 long blimit;
|
||||
8 long blimit;
|
||||
</pre>
|
||||
|
||||
<p>The <tt>->nxtlist</tt> pointer and the
|
||||
<tt>->nxttail[]</tt> array form a four-segment list with
|
||||
older callbacks near the head and newer ones near the tail.
|
||||
Each segment contains callbacks with the corresponding relationship
|
||||
to the current grace period.
|
||||
The pointer out of the end of each of the four segments is referenced
|
||||
by the element of the <tt>->nxttail[]</tt> array indexed by
|
||||
<tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
|
||||
<tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
|
||||
<tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
|
||||
grace period), and
|
||||
<tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
|
||||
with a specific grace period)
|
||||
respectively, as shown in the following figure.
|
||||
|
||||
</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
|
||||
|
||||
</p><p>In this figure, the <tt>->nxtlist</tt> pointer references the
|
||||
first
|
||||
RCU callback in the list.
|
||||
The <tt>->nxttail[RCU_DONE_TAIL]</tt> array element references
|
||||
the <tt>->nxtlist</tt> pointer itself, indicating that none
|
||||
of the callbacks is ready to invoke.
|
||||
The <tt>->nxttail[RCU_WAIT_TAIL]</tt> array element references callback
|
||||
CB 2's <tt>->next</tt> pointer, which indicates that
|
||||
CB 1 and CB 2 are both waiting on the current grace period.
|
||||
The <tt>->nxttail[RCU_NEXT_READY_TAIL]</tt> array element
|
||||
references the same RCU callback that <tt>->nxttail[RCU_WAIT_TAIL]</tt>
|
||||
does, which indicates that there are no callbacks waiting on the next
|
||||
RCU grace period.
|
||||
The <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element references
|
||||
CB 4's <tt>->next</tt> pointer, indicating that all the
|
||||
remaining RCU callbacks have not yet been assigned to an RCU grace
|
||||
period.
|
||||
Note that the <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element
|
||||
always references the last RCU callback's <tt>->next</tt> pointer
|
||||
unless the callback list is empty, in which case it references
|
||||
the <tt>->nxtlist</tt> pointer.
|
||||
|
||||
</p><p>CPUs advance their callbacks from the
|
||||
<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
|
||||
<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
|
||||
as grace periods advance.
|
||||
<p>The <tt>->cblist</tt> structure is the segmented callback list
|
||||
described earlier.
|
||||
The CPU advances the callbacks in its <tt>rcu_data</tt> structure
|
||||
whenever it notices that another RCU grace period has completed.
|
||||
The CPU detects the completion of an RCU grace period by noticing
|
||||
@@ -1049,16 +1135,7 @@ Recall that each <tt>rcu_node</tt> structure's
|
||||
<tt>->completed</tt> field is updated at the end of each
|
||||
grace period.
|
||||
|
||||
</p><p>The <tt>->nxtcompleted[]</tt> array records grace-period
|
||||
numbers corresponding to the list segments.
|
||||
This allows CPUs that go idle for extended periods to determine
|
||||
which of their callbacks are ready to be invoked after reawakening.
|
||||
|
||||
</p><p>The <tt>->qlen</tt> counter contains the number of
|
||||
callbacks in <tt>->nxtlist</tt>, and the
|
||||
<tt>->qlen_lazy</tt> contains the number of those callbacks that
|
||||
are known to only free memory, and whose invocation can therefore
|
||||
be safely deferred.
|
||||
<p>
|
||||
The <tt>->qlen_last_fqs_check</tt> and
|
||||
<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
|
||||
states from <tt>call_rcu()</tt> and friends when callback
|
||||
@@ -1069,6 +1146,10 @@ lists grow excessively long.
|
||||
fields count the number of callbacks invoked,
|
||||
sent to other CPUs when this CPU goes offline,
|
||||
and received from other CPUs when those other CPUs go offline.
|
||||
The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks
|
||||
are offloaded to a kthread.
|
||||
|
||||
<p>
|
||||
Finally, the <tt>->blimit</tt> counter is the maximum number of
|
||||
RCU callbacks that may be invoked at a given time.
|
||||
|
||||
@@ -1104,6 +1185,9 @@ Its fields are as follows:
|
||||
1 int dynticks_nesting;
|
||||
2 int dynticks_nmi_nesting;
|
||||
3 atomic_t dynticks;
|
||||
4 bool rcu_need_heavy_qs;
|
||||
5 unsigned long rcu_qs_ctr;
|
||||
6 bool rcu_urgent_qs;
|
||||
</pre>
|
||||
|
||||
<p>The <tt>->dynticks_nesting</tt> field counts the
|
||||
@@ -1117,11 +1201,32 @@ NMIs are counted by the <tt>->dynticks_nmi_nesting</tt>
|
||||
field, except that NMIs that interrupt non-dyntick-idle execution
|
||||
are not counted.
|
||||
|
||||
</p><p>Finally, the <tt>->dynticks</tt> field counts the corresponding
|
||||
</p><p>The <tt>->dynticks</tt> field counts the corresponding
|
||||
CPU's transitions to and from dyntick-idle mode, so that this counter
|
||||
has an even value when the CPU is in dyntick-idle mode and an odd
|
||||
value otherwise.
|
||||
|
||||
</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used
|
||||
to record the fact that the RCU core code would really like to
|
||||
see a quiescent state from the corresponding CPU, so much so that
|
||||
it is willing to call for heavy-weight dyntick-counter operations.
|
||||
This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
|
||||
code, which provide a momentary idle sojourn in response.
|
||||
|
||||
</p><p>The <tt>->rcu_qs_ctr</tt> field is used to record
|
||||
quiescent states from <tt>cond_resched()</tt>.
|
||||
Because <tt>cond_resched()</tt> can execute quite frequently, this
|
||||
must be quite lightweight, as in a non-atomic increment of this
|
||||
per-CPU field.
|
||||
|
||||
</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record
|
||||
the fact that the RCU core code would really like to see a quiescent
|
||||
state from the corresponding CPU, with the various other fields indicating
|
||||
just how badly RCU wants this quiescent state.
|
||||
This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
|
||||
code, which, if nothing else, non-atomically increment <tt>->rcu_qs_ctr</tt>
|
||||
in response.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
|
||||
@@ -19,7 +19,7 @@
|
||||
id="svg2"
|
||||
version="1.1"
|
||||
inkscape:version="0.48.4 r9939"
|
||||
sodipodi:docname="nxtlist.fig">
|
||||
sodipodi:docname="segcblist.svg">
|
||||
<metadata
|
||||
id="metadata94">
|
||||
<rdf:RDF>
|
||||
@@ -28,7 +28,7 @@
|
||||
<dc:format>image/svg+xml</dc:format>
|
||||
<dc:type
|
||||
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
|
||||
<dc:title></dc:title>
|
||||
<dc:title />
|
||||
</cc:Work>
|
||||
</rdf:RDF>
|
||||
</metadata>
|
||||
@@ -241,61 +241,51 @@
|
||||
xml:space="preserve"
|
||||
x="225"
|
||||
y="675"
|
||||
fill="#000000"
|
||||
font-family="Courier"
|
||||
font-style="normal"
|
||||
font-weight="bold"
|
||||
font-size="324"
|
||||
text-anchor="start"
|
||||
id="text64">nxtlist</text>
|
||||
id="text64"
|
||||
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->head</text>
|
||||
<!-- Text -->
|
||||
<text
|
||||
xml:space="preserve"
|
||||
x="225"
|
||||
y="1800"
|
||||
fill="#000000"
|
||||
font-family="Courier"
|
||||
font-style="normal"
|
||||
font-weight="bold"
|
||||
font-size="324"
|
||||
text-anchor="start"
|
||||
id="text66">nxttail[RCU_DONE_TAIL]</text>
|
||||
id="text66"
|
||||
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_DONE_TAIL]</text>
|
||||
<!-- Text -->
|
||||
<text
|
||||
xml:space="preserve"
|
||||
x="225"
|
||||
y="2925"
|
||||
fill="#000000"
|
||||
font-family="Courier"
|
||||
font-style="normal"
|
||||
font-weight="bold"
|
||||
font-size="324"
|
||||
text-anchor="start"
|
||||
id="text68">nxttail[RCU_WAIT_TAIL]</text>
|
||||
id="text68"
|
||||
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_WAIT_TAIL]</text>
|
||||
<!-- Text -->
|
||||
<text
|
||||
xml:space="preserve"
|
||||
x="225"
|
||||
y="4050"
|
||||
fill="#000000"
|
||||
font-family="Courier"
|
||||
font-style="normal"
|
||||
font-weight="bold"
|
||||
font-size="324"
|
||||
text-anchor="start"
|
||||
id="text70">nxttail[RCU_NEXT_READY_TAIL]</text>
|
||||
id="text70"
|
||||
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_NEXT_READY_TAIL]</text>
|
||||
<!-- Text -->
|
||||
<text
|
||||
xml:space="preserve"
|
||||
x="225"
|
||||
y="5175"
|
||||
fill="#000000"
|
||||
font-family="Courier"
|
||||
font-style="normal"
|
||||
font-weight="bold"
|
||||
font-size="324"
|
||||
text-anchor="start"
|
||||
id="text72">nxttail[RCU_NEXT_TAIL]</text>
|
||||
id="text72"
|
||||
style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">->tails[RCU_NEXT_TAIL]</text>
|
||||
<!-- Text -->
|
||||
<text
|
||||
xml:space="preserve"
|
||||
|
||||
|
Before Width: | Height: | Size: 11 KiB After Width: | Height: | Size: 11 KiB |
@@ -284,6 +284,7 @@ Expedited Grace Period Refinements</a></h2>
|
||||
Funnel locking and wait/wakeup</a>.
|
||||
<li> <a href="#Use of Workqueues">Use of Workqueues</a>.
|
||||
<li> <a href="#Stall Warnings">Stall warnings</a>.
|
||||
<li> <a href="#Mid-Boot Operation">Mid-boot operation</a>.
|
||||
</ol>
|
||||
|
||||
<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>
|
||||
@@ -524,7 +525,7 @@ their grace periods and carrying out their wakeups.
|
||||
In earlier implementations, the task requesting the expedited
|
||||
grace period also drove it to completion.
|
||||
This straightforward approach had the disadvantage of needing to
|
||||
account for signals sent to user tasks,
|
||||
account for POSIX signals sent to user tasks,
|
||||
so more recent implemementations use the Linux kernel's
|
||||
<a href="https://www.kernel.org/doc/Documentation/workqueue.txt">workqueues</a>.
|
||||
|
||||
@@ -533,8 +534,8 @@ The requesting task still does counter snapshotting and funnel-lock
|
||||
processing, but the task reaching the top of the funnel lock
|
||||
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>
|
||||
so that a workqueue kthread does the actual grace-period processing.
|
||||
Because workqueue kthreads do not accept signals, grace-period-wait
|
||||
processing need not allow for signals.
|
||||
Because workqueue kthreads do not accept POSIX signals, grace-period-wait
|
||||
processing need not allow for POSIX signals.
|
||||
|
||||
In addition, this approach allows wakeups for the previous expedited
|
||||
grace period to be overlapped with processing for the next expedited
|
||||
@@ -586,6 +587,46 @@ blocking the current grace period are printed.
|
||||
Each stall warning results in another pass through the loop, but the
|
||||
second and subsequent passes use longer stall times.
|
||||
|
||||
<h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3>
|
||||
|
||||
<p>
|
||||
The use of workqueues has the advantage that the expedited
|
||||
grace-period code need not worry about POSIX signals.
|
||||
Unfortunately, it has the
|
||||
corresponding disadvantage that workqueues cannot be used until
|
||||
they are initialized, which does not happen until some time after
|
||||
the scheduler spawns the first task.
|
||||
Given that there are parts of the kernel that really do want to
|
||||
execute grace periods during this mid-boot “dead zone”,
|
||||
expedited grace periods must do something else during thie time.
|
||||
|
||||
<p>
|
||||
What they do is to fall back to the old practice of requiring that the
|
||||
requesting task drive the expedited grace period, as was the case
|
||||
before the use of workqueues.
|
||||
However, the requesting task is only required to drive the grace period
|
||||
during the mid-boot dead zone.
|
||||
Before mid-boot, a synchronous grace period is a no-op.
|
||||
Some time after mid-boot, workqueues are used.
|
||||
|
||||
<p>
|
||||
Non-expedited non-SRCU synchronous grace periods must also operate
|
||||
normally during mid-boot.
|
||||
This is handled by causing non-expedited grace periods to take the
|
||||
expedited code path during mid-boot.
|
||||
|
||||
<p>
|
||||
The current code assumes that there are no POSIX signals during
|
||||
the mid-boot dead zone.
|
||||
However, if an overwhelming need for POSIX signals somehow arises,
|
||||
appropriate adjustments can be made to the expedited stall-warning code.
|
||||
One such adjustment would reinstate the pre-workqueue stall-warning
|
||||
checks, but only during the mid-boot dead zone.
|
||||
|
||||
<p>
|
||||
With this refinement, synchronous grace periods can now be used from
|
||||
task context pretty much any time during the life of the kernel.
|
||||
|
||||
<h3><a name="Summary">
|
||||
Summary</a></h3>
|
||||
|
||||
|
||||
@@ -659,8 +659,9 @@ systems with more than one CPU:
|
||||
In other words, a given instance of <tt>synchronize_rcu()</tt>
|
||||
can avoid waiting on a given RCU read-side critical section only
|
||||
if it can prove that <tt>synchronize_rcu()</tt> started first.
|
||||
</font>
|
||||
|
||||
<p>
|
||||
<p><font color="ffffff">
|
||||
A related question is “When <tt>rcu_read_lock()</tt>
|
||||
doesn't generate any code, why does it matter how it relates
|
||||
to a grace period?”
|
||||
@@ -675,8 +676,9 @@ systems with more than one CPU:
|
||||
within the critical section, in which case none of the accesses
|
||||
within the critical section may observe the effects of any
|
||||
access following the grace period.
|
||||
</font>
|
||||
|
||||
<p>
|
||||
<p><font color="ffffff">
|
||||
As of late 2016, mathematical models of RCU take this
|
||||
viewpoint, for example, see slides 62 and 63
|
||||
of the
|
||||
@@ -1616,8 +1618,8 @@ CPUs should at least make reasonable forward progress.
|
||||
In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
|
||||
is permitted to impose modest degradation of real-time latency
|
||||
on non-idle online CPUs.
|
||||
That said, it will likely be necessary to take further steps to reduce this
|
||||
degradation, hopefully to roughly that of a scheduling-clock interrupt.
|
||||
Here, “modest” means roughly the same latency
|
||||
degradation as a scheduling-clock interrupt.
|
||||
|
||||
<p>
|
||||
There are a number of situations where even
|
||||
@@ -1913,12 +1915,9 @@ This requirement is another factor driving batching of grace periods,
|
||||
but it is also the driving force behind the checks for large numbers
|
||||
of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
|
||||
Finally, high update rates should not delay RCU read-side critical
|
||||
sections, although some read-side delays can occur when using
|
||||
sections, although some small read-side delays can occur when using
|
||||
<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
|
||||
of <tt>try_stop_cpus()</tt>.
|
||||
(In the future, <tt>synchronize_rcu_expedited()</tt> will be
|
||||
converted to use lighter-weight inter-processor interrupts (IPIs),
|
||||
but this will still disturb readers, though to a much smaller degree.)
|
||||
of <tt>smp_call_function_single()</tt>.
|
||||
|
||||
<p>
|
||||
Although all three of these corner cases were understood in the early
|
||||
@@ -2154,7 +2153,8 @@ as will <tt>rcu_assign_pointer()</tt>.
|
||||
<p>
|
||||
Although <tt>call_rcu()</tt> may be invoked at any
|
||||
time during boot, callbacks are not guaranteed to be invoked until after
|
||||
the scheduler is fully up and running.
|
||||
all of RCU's kthreads have been spawned, which occurs at
|
||||
<tt>early_initcall()</tt> time.
|
||||
This delay in callback invocation is due to the fact that RCU does not
|
||||
invoke callbacks until it is fully initialized, and this full initialization
|
||||
cannot occur until after the scheduler has initialized itself to the
|
||||
@@ -2167,8 +2167,10 @@ on what operations those callbacks could invoke.
|
||||
Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
|
||||
<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
|
||||
(<a href="#Bottom-Half Flavor">discussed below</a>),
|
||||
and
|
||||
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>
|
||||
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>,
|
||||
<tt>synchronize_rcu_expedited()</tt>,
|
||||
<tt>synchronize_rcu_bh_expedited()</tt>, and
|
||||
<tt>synchronize_sched_expedited()</tt>
|
||||
will all operate normally
|
||||
during very early boot, the reason being that there is only one CPU
|
||||
and preemption is disabled.
|
||||
@@ -2178,45 +2180,59 @@ state and thus a grace period, so the early-boot implementation can
|
||||
be a no-op.
|
||||
|
||||
<p>
|
||||
Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt>
|
||||
continue to operate normally through the remainder of boot, courtesy
|
||||
of the fact that preemption is disabled across their RCU read-side
|
||||
critical sections and also courtesy of the fact that there is still
|
||||
only one CPU.
|
||||
However, once the scheduler starts initializing, preemption is enabled.
|
||||
There is still only a single CPU, but the fact that preemption is enabled
|
||||
means that the no-op implementation of <tt>synchronize_rcu()</tt> no
|
||||
longer works in <tt>CONFIG_PREEMPT=y</tt> kernels.
|
||||
Therefore, as soon as the scheduler starts initializing, the early-boot
|
||||
fastpath is disabled.
|
||||
This means that <tt>synchronize_rcu()</tt> switches to its runtime
|
||||
mode of operation where it posts callbacks, which in turn means that
|
||||
any call to <tt>synchronize_rcu()</tt> will block until the corresponding
|
||||
callback is invoked.
|
||||
Unfortunately, the callback cannot be invoked until RCU's runtime
|
||||
grace-period machinery is up and running, which cannot happen until
|
||||
the scheduler has initialized itself sufficiently to allow RCU's
|
||||
kthreads to be spawned.
|
||||
Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler
|
||||
initialization can result in deadlock.
|
||||
However, once the scheduler has spawned its first kthread, this early
|
||||
boot trick fails for <tt>synchronize_rcu()</tt> (as well as for
|
||||
<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt>
|
||||
kernels.
|
||||
The reason is that an RCU read-side critical section might be preempted,
|
||||
which means that a subsequent <tt>synchronize_rcu()</tt> really does have
|
||||
to wait for something, as opposed to simply returning immediately.
|
||||
Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of
|
||||
its kthreads are spawned, which doesn't happen until some time during
|
||||
<tt>early_initcalls()</tt> time.
|
||||
But this is no excuse: RCU is nevertheless required to correctly handle
|
||||
synchronous grace periods during this time period.
|
||||
Once all of its kthreads are up and running, RCU starts running
|
||||
normally.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
So what happens with <tt>synchronize_rcu()</tt> during
|
||||
scheduler initialization for <tt>CONFIG_PREEMPT=n</tt>
|
||||
kernels?
|
||||
How can RCU possibly handle grace periods before all of its
|
||||
kthreads have been spawned???
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
In <tt>CONFIG_PREEMPT=n</tt> kernel, <tt>synchronize_rcu()</tt>
|
||||
maps directly to <tt>synchronize_sched()</tt>.
|
||||
Therefore, <tt>synchronize_rcu()</tt> works normally throughout
|
||||
boot in <tt>CONFIG_PREEMPT=n</tt> kernels.
|
||||
However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels,
|
||||
so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt>
|
||||
during scheduler initialization.
|
||||
Very carefully!
|
||||
</font>
|
||||
|
||||
<p><font color="ffffff">
|
||||
During the “dead zone” between the time that the
|
||||
scheduler spawns the first task and the time that all of RCU's
|
||||
kthreads have been spawned, all synchronous grace periods are
|
||||
handled by the expedited grace-period mechanism.
|
||||
At runtime, this expedited mechanism relies on workqueues, but
|
||||
during the dead zone the requesting task itself drives the
|
||||
desired expedited grace period.
|
||||
Because dead-zone execution takes place within task context,
|
||||
everything works.
|
||||
Once the dead zone ends, expedited grace periods go back to
|
||||
using workqueues, as is required to avoid problems that would
|
||||
otherwise occur when a user task received a POSIX signal while
|
||||
driving an expedited grace period.
|
||||
</font>
|
||||
|
||||
<p><font color="ffffff">
|
||||
And yes, this does mean that it is unhelpful to send POSIX
|
||||
signals to random tasks between the time that the scheduler
|
||||
spawns its first kthread and the time that RCU's kthreads
|
||||
have all been spawned.
|
||||
If there ever turns out to be a good reason for sending POSIX
|
||||
signals during that time, appropriate adjustments will be made.
|
||||
(If it turns out that POSIX signals are sent during this time for
|
||||
no good reason, other adjustments will be made, appropriate
|
||||
or otherwise.)
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
@@ -2295,12 +2311,61 @@ situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
|
||||
The need for <tt>rcu_barrier()</tt> for module unloading became
|
||||
apparent later.
|
||||
|
||||
<p>
|
||||
<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not,
|
||||
repeat, <i>not</i>, obligated to wait for a grace period.
|
||||
It is instead only required to wait for RCU callbacks that have
|
||||
already been posted.
|
||||
Therefore, if there are no RCU callbacks posted anywhere in the system,
|
||||
<tt>rcu_barrier()</tt> is within its rights to return immediately.
|
||||
Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not
|
||||
necessarily need to wait for a grace period.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
Wait a minute!
|
||||
Each RCU callbacks must wait for a grace period to complete,
|
||||
and <tt>rcu_barrier()</tt> must wait for each pre-existing
|
||||
callback to be invoked.
|
||||
Doesn't <tt>rcu_barrier()</tt> therefore need to wait for
|
||||
a full grace period if there is even one callback posted anywhere
|
||||
in the system?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Absolutely not!!!
|
||||
</font>
|
||||
|
||||
<p><font color="ffffff">
|
||||
Yes, each RCU callbacks must wait for a grace period to complete,
|
||||
but it might well be partly (or even completely) finished waiting
|
||||
by the time <tt>rcu_barrier()</tt> is invoked.
|
||||
In that case, <tt>rcu_barrier()</tt> need only wait for the
|
||||
remaining portion of the grace period to elapse.
|
||||
So even if there are quite a few callbacks posted,
|
||||
<tt>rcu_barrier()</tt> might well return quite quickly.
|
||||
</font>
|
||||
|
||||
<p><font color="ffffff">
|
||||
So if you need to wait for a grace period as well as for all
|
||||
pre-existing callbacks, you will need to invoke both
|
||||
<tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>.
|
||||
If latency is a concern, you can always use workqueues
|
||||
to invoke them concurrently.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
|
||||
|
||||
<p>
|
||||
The Linux kernel supports CPU hotplug, which means that CPUs
|
||||
can come and go.
|
||||
It is of course illegal to use any RCU API member from an offline CPU.
|
||||
It is of course illegal to use any RCU API member from an offline CPU,
|
||||
with the exception of <a href="#Sleepable RCU">SRCU</a> read-side
|
||||
critical sections.
|
||||
This requirement was present from day one in DYNIX/ptx, but
|
||||
on the other hand, the Linux kernel's CPU-hotplug implementation
|
||||
is “interesting.”
|
||||
@@ -2310,19 +2375,18 @@ The Linux-kernel CPU-hotplug implementation has notifiers that
|
||||
are used to allow the various kernel subsystems (including RCU)
|
||||
to respond appropriately to a given CPU-hotplug operation.
|
||||
Most RCU operations may be invoked from CPU-hotplug notifiers,
|
||||
including even normal synchronous grace-period operations
|
||||
such as <tt>synchronize_rcu()</tt>.
|
||||
However, expedited grace-period operations such as
|
||||
<tt>synchronize_rcu_expedited()</tt> are not supported,
|
||||
due to the fact that current implementations block CPU-hotplug
|
||||
operations, which could result in deadlock.
|
||||
including even synchronous grace-period operations such as
|
||||
<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
|
||||
|
||||
<p>
|
||||
In addition, all-callback-wait operations such as
|
||||
However, all-callback-wait operations such as
|
||||
<tt>rcu_barrier()</tt> are also not supported, due to the
|
||||
fact that there are phases of CPU-hotplug operations where
|
||||
the outgoing CPU's callbacks will not be invoked until after
|
||||
the CPU-hotplug operation ends, which could also result in deadlock.
|
||||
Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations
|
||||
during its execution, which results in another type of deadlock
|
||||
when invoked from a CPU-hotplug notifier.
|
||||
|
||||
<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
|
||||
|
||||
@@ -2863,6 +2927,27 @@ It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
|
||||
API, which, in combination with <tt>srcu_read_unlock()</tt>,
|
||||
guarantees a full memory barrier.
|
||||
|
||||
<p>
|
||||
Also unlike other RCU flavors, SRCU's callbacks-wait function
|
||||
<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers,
|
||||
though this is not necessarily a good idea.
|
||||
The reason that this is possible is that SRCU is insensitive
|
||||
to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt>
|
||||
need not exclude CPU-hotplug operations.
|
||||
|
||||
<p>
|
||||
As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
|
||||
a locking bottleneck present in prior kernel versions.
|
||||
Although this will allow users to put much heavier stress on
|
||||
<tt>call_srcu()</tt>, it is important to note that SRCU does not
|
||||
yet take any special steps to deal with callback flooding.
|
||||
So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
|
||||
you are probably totally OK, but if you intend to post (say) 1,000,000
|
||||
SRCU callbacks per second per CPU, please run some tests first.
|
||||
SRCU just might need a few adjustment to deal with that sort of load.
|
||||
Of course, your mileage may vary based on the speed of your CPUs and
|
||||
the size of your memory.
|
||||
|
||||
<p>
|
||||
The
|
||||
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
|
||||
@@ -3021,8 +3106,8 @@ to do some redesign to avoid this scalability problem.
|
||||
|
||||
<p>
|
||||
RCU disables CPU hotplug in a few places, perhaps most notably in the
|
||||
expedited grace-period and <tt>rcu_barrier()</tt> operations.
|
||||
If there is a strong reason to use expedited grace periods in CPU-hotplug
|
||||
<tt>rcu_barrier()</tt> operations.
|
||||
If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug
|
||||
notifiers, it will be necessary to avoid disabling CPU hotplug.
|
||||
This would introduce some complexity, so there had better be a <i>very</i>
|
||||
good reason.
|
||||
@@ -3096,9 +3181,5 @@ Andy Lutomirski for their help in rendering
|
||||
this article human readable, and to Michelle Rankin for her support
|
||||
of this effort.
|
||||
Other contributions are acknowledged in the Linux kernel's git archive.
|
||||
The cartoon is copyright (c) 2013 by Melissa Broussard,
|
||||
and is provided
|
||||
under the terms of the Creative Commons Attribution-Share Alike 3.0
|
||||
United States license.
|
||||
|
||||
</body></html>
|
||||
|
||||
@@ -138,6 +138,15 @@ o Be very careful about comparing pointers obtained from
|
||||
This sort of comparison occurs frequently when scanning
|
||||
RCU-protected circular linked lists.
|
||||
|
||||
Note that if checks for being within an RCU read-side
|
||||
critical section are not required and the pointer is never
|
||||
dereferenced, rcu_access_pointer() should be used in place
|
||||
of rcu_dereference(). The rcu_access_pointer() primitive
|
||||
does not require an enclosing read-side critical section,
|
||||
and also omits the smp_read_barrier_depends() included in
|
||||
rcu_dereference(), which in turn should provide a small
|
||||
performance gain in some CPUs (e.g., the DEC Alpha).
|
||||
|
||||
o The comparison is against a pointer that references memory
|
||||
that was initialized "a long time ago." The reason
|
||||
this is safe is that even if misordering occurs, the
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
Using hlist_nulls to protect read-mostly linked lists and
|
||||
objects using SLAB_DESTROY_BY_RCU allocations.
|
||||
objects using SLAB_TYPESAFE_BY_RCU allocations.
|
||||
|
||||
Please read the basics in Documentation/RCU/listRCU.txt
|
||||
|
||||
@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
|
||||
to solve following problem :
|
||||
|
||||
A typical RCU linked list managing objects which are
|
||||
allocated with SLAB_DESTROY_BY_RCU kmem_cache can
|
||||
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
|
||||
use following algos :
|
||||
|
||||
1) Lookup algo
|
||||
@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
|
||||
3) Remove algo
|
||||
--------------
|
||||
Nothing special here, we can use a standard RCU hlist deletion.
|
||||
But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
|
||||
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
|
||||
very very fast (before the end of RCU grace period)
|
||||
|
||||
if (put_last_reference_on(obj) {
|
||||
|
||||
+100
-90
@@ -1,9 +1,102 @@
|
||||
Using RCU's CPU Stall Detector
|
||||
|
||||
The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
|
||||
detector, which detects conditions that unduly delay RCU grace periods.
|
||||
This module parameter enables CPU stall detection by default, but
|
||||
may be overridden via boot-time parameter or at runtime via sysfs.
|
||||
This document first discusses what sorts of issues RCU's CPU stall
|
||||
detector can locate, and then discusses kernel parameters and Kconfig
|
||||
options that can be used to fine-tune the detector's operation. Finally,
|
||||
this document explains the stall detector's "splat" format.
|
||||
|
||||
|
||||
What Causes RCU CPU Stall Warnings?
|
||||
|
||||
So your kernel printed an RCU CPU stall warning. The next question is
|
||||
"What caused it?" The following problems can result in RCU CPU stall
|
||||
warnings:
|
||||
|
||||
o A CPU looping in an RCU read-side critical section.
|
||||
|
||||
o A CPU looping with interrupts disabled.
|
||||
|
||||
o A CPU looping with preemption disabled. This condition can
|
||||
result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
|
||||
stalls.
|
||||
|
||||
o A CPU looping with bottom halves disabled. This condition can
|
||||
result in RCU-sched and RCU-bh stalls.
|
||||
|
||||
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
|
||||
kernel without invoking schedule(). Note that cond_resched()
|
||||
does not necessarily prevent RCU CPU stall warnings. Therefore,
|
||||
if the looping in the kernel is really expected and desirable
|
||||
behavior, you might need to replace some of the cond_resched()
|
||||
calls with calls to cond_resched_rcu_qs().
|
||||
|
||||
o Booting Linux using a console connection that is too slow to
|
||||
keep up with the boot-time console-message rate. For example,
|
||||
a 115Kbaud serial console can be -way- too slow to keep up
|
||||
with boot-time message rates, and will frequently result in
|
||||
RCU CPU stall warning messages. Especially if you have added
|
||||
debug printk()s.
|
||||
|
||||
o Anything that prevents RCU's grace-period kthreads from running.
|
||||
This can result in the "All QSes seen" console-log message.
|
||||
This message will include information on when the kthread last
|
||||
ran and how often it should be expected to run.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
|
||||
happen to preempt a low-priority task in the middle of an RCU
|
||||
read-side critical section. This is especially damaging if
|
||||
that low-priority task is not permitted to run on any other CPU,
|
||||
in which case the next RCU grace period can never complete, which
|
||||
will eventually cause the system to run out of memory and hang.
|
||||
While the system is in the process of running itself out of
|
||||
memory, you might see stall-warning messages.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
|
||||
is running at a higher priority than the RCU softirq threads.
|
||||
This will prevent RCU callbacks from ever being invoked,
|
||||
and in a CONFIG_PREEMPT_RCU kernel will further prevent
|
||||
RCU grace periods from ever completing. Either way, the
|
||||
system will eventually run out of memory and hang. In the
|
||||
CONFIG_PREEMPT_RCU case, you might see stall-warning
|
||||
messages.
|
||||
|
||||
o A hardware or software issue shuts off the scheduler-clock
|
||||
interrupt on a CPU that is not in dyntick-idle mode. This
|
||||
problem really has happened, and seems to be most likely to
|
||||
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
|
||||
|
||||
o A bug in the RCU implementation.
|
||||
|
||||
o A hardware failure. This is quite unlikely, but has occurred
|
||||
at least once in real life. A CPU failed in a running system,
|
||||
becoming unresponsive, but not causing an immediate crash.
|
||||
This resulted in a series of RCU CPU stall warnings, eventually
|
||||
leading the realization that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
|
||||
warning. Note that SRCU does -not- have CPU stall warnings. Please note
|
||||
that RCU only detects CPU stalls when there is a grace period in progress.
|
||||
No grace period, no CPU stall warnings.
|
||||
|
||||
To diagnose the cause of the stall, inspect the stack traces.
|
||||
The offending function will usually be near the top of the stack.
|
||||
If you have a series of stall warnings from a single extended stall,
|
||||
comparing the stack traces can often help determine where the stall
|
||||
is occurring, which will usually be in the function nearest the top of
|
||||
that portion of the stack which remains the same from trace to trace.
|
||||
If you can reliably trigger the stall, ftrace can be quite helpful.
|
||||
|
||||
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
|
||||
and with RCU's event tracing. For information on RCU's event tracing,
|
||||
see include/trace/events/rcu.h.
|
||||
|
||||
|
||||
Fine-Tuning the RCU CPU Stall Detector
|
||||
|
||||
The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's
|
||||
CPU stall detector, which detects conditions that unduly delay RCU grace
|
||||
periods. This module parameter enables CPU stall detection by default,
|
||||
but may be overridden via boot-time parameter or at runtime via sysfs.
|
||||
The stall detector's idea of what constitutes "unduly delayed" is
|
||||
controlled by a set of kernel configuration variables and cpp macros:
|
||||
|
||||
@@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout
|
||||
And continues with the output of sched_show_task() for each
|
||||
task stalling the current RCU-tasks grace period.
|
||||
|
||||
|
||||
Interpreting RCU's CPU Stall-Detector "Splats"
|
||||
|
||||
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
|
||||
it will print a message similar to the following:
|
||||
|
||||
@@ -178,89 +274,3 @@ grace period is in flight.
|
||||
|
||||
It is entirely possible to see stall warnings from normal and from
|
||||
expedited grace periods at about the same time from the same run.
|
||||
|
||||
|
||||
What Causes RCU CPU Stall Warnings?
|
||||
|
||||
So your kernel printed an RCU CPU stall warning. The next question is
|
||||
"What caused it?" The following problems can result in RCU CPU stall
|
||||
warnings:
|
||||
|
||||
o A CPU looping in an RCU read-side critical section.
|
||||
|
||||
o A CPU looping with interrupts disabled. This condition can
|
||||
result in RCU-sched and RCU-bh stalls.
|
||||
|
||||
o A CPU looping with preemption disabled. This condition can
|
||||
result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
|
||||
stalls.
|
||||
|
||||
o A CPU looping with bottom halves disabled. This condition can
|
||||
result in RCU-sched and RCU-bh stalls.
|
||||
|
||||
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
|
||||
kernel without invoking schedule(). Note that cond_resched()
|
||||
does not necessarily prevent RCU CPU stall warnings. Therefore,
|
||||
if the looping in the kernel is really expected and desirable
|
||||
behavior, you might need to replace some of the cond_resched()
|
||||
calls with calls to cond_resched_rcu_qs().
|
||||
|
||||
o Booting Linux using a console connection that is too slow to
|
||||
keep up with the boot-time console-message rate. For example,
|
||||
a 115Kbaud serial console can be -way- too slow to keep up
|
||||
with boot-time message rates, and will frequently result in
|
||||
RCU CPU stall warning messages. Especially if you have added
|
||||
debug printk()s.
|
||||
|
||||
o Anything that prevents RCU's grace-period kthreads from running.
|
||||
This can result in the "All QSes seen" console-log message.
|
||||
This message will include information on when the kthread last
|
||||
ran and how often it should be expected to run.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
|
||||
happen to preempt a low-priority task in the middle of an RCU
|
||||
read-side critical section. This is especially damaging if
|
||||
that low-priority task is not permitted to run on any other CPU,
|
||||
in which case the next RCU grace period can never complete, which
|
||||
will eventually cause the system to run out of memory and hang.
|
||||
While the system is in the process of running itself out of
|
||||
memory, you might see stall-warning messages.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
|
||||
is running at a higher priority than the RCU softirq threads.
|
||||
This will prevent RCU callbacks from ever being invoked,
|
||||
and in a CONFIG_PREEMPT_RCU kernel will further prevent
|
||||
RCU grace periods from ever completing. Either way, the
|
||||
system will eventually run out of memory and hang. In the
|
||||
CONFIG_PREEMPT_RCU case, you might see stall-warning
|
||||
messages.
|
||||
|
||||
o A hardware or software issue shuts off the scheduler-clock
|
||||
interrupt on a CPU that is not in dyntick-idle mode. This
|
||||
problem really has happened, and seems to be most likely to
|
||||
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
|
||||
|
||||
o A bug in the RCU implementation.
|
||||
|
||||
o A hardware failure. This is quite unlikely, but has occurred
|
||||
at least once in real life. A CPU failed in a running system,
|
||||
becoming unresponsive, but not causing an immediate crash.
|
||||
This resulted in a series of RCU CPU stall warnings, eventually
|
||||
leading the realization that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
|
||||
warning. Note that SRCU does -not- have CPU stall warnings. Please note
|
||||
that RCU only detects CPU stalls when there is a grace period in progress.
|
||||
No grace period, no CPU stall warnings.
|
||||
|
||||
To diagnose the cause of the stall, inspect the stack traces.
|
||||
The offending function will usually be near the top of the stack.
|
||||
If you have a series of stall warnings from a single extended stall,
|
||||
comparing the stack traces can often help determine where the stall
|
||||
is occurring, which will usually be in the function nearest the top of
|
||||
that portion of the stack which remains the same from trace to trace.
|
||||
If you can reliably trigger the stall, ftrace can be quite helpful.
|
||||
|
||||
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
|
||||
and with RCU's event tracing. For information on RCU's event tracing,
|
||||
see include/trace/events/rcu.h.
|
||||
|
||||
@@ -562,7 +562,9 @@ This section presents a "toy" RCU implementation that is based on
|
||||
familiar locking primitives. Its overhead makes it a non-starter for
|
||||
real-life use, as does its lack of scalability. It is also unsuitable
|
||||
for realtime use, since it allows scheduling latency to "bleed" from
|
||||
one read-side critical section to another.
|
||||
one read-side critical section to another. It also assumes recursive
|
||||
reader-writer locks: If you try this with non-recursive locks, and
|
||||
you allow nested rcu_read_lock() calls, you can deadlock.
|
||||
|
||||
However, it is probably the easiest implementation to relate to, so is
|
||||
a good starting point.
|
||||
@@ -587,20 +589,21 @@ It is extremely simple:
|
||||
write_unlock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
[You can ignore rcu_assign_pointer() and rcu_dereference() without
|
||||
missing much. But here they are anyway. And whatever you do, don't
|
||||
forget about them when submitting patches making use of RCU!]
|
||||
[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
|
||||
much. But here are simplified versions anyway. And whatever you do,
|
||||
don't forget about them when submitting patches making use of RCU!]
|
||||
|
||||
#define rcu_assign_pointer(p, v) ({ \
|
||||
smp_wmb(); \
|
||||
(p) = (v); \
|
||||
})
|
||||
#define rcu_assign_pointer(p, v) \
|
||||
({ \
|
||||
smp_store_release(&(p), (v)); \
|
||||
})
|
||||
|
||||
#define rcu_dereference(p) ({ \
|
||||
typeof(p) _________p1 = p; \
|
||||
smp_read_barrier_depends(); \
|
||||
(_________p1); \
|
||||
})
|
||||
#define rcu_dereference(p) \
|
||||
({ \
|
||||
typeof(p) _________p1 = p; \
|
||||
smp_read_barrier_depends(); \
|
||||
(_________p1); \
|
||||
})
|
||||
|
||||
|
||||
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
|
||||
@@ -925,7 +928,8 @@ d. Do you need RCU grace periods to complete even in the face
|
||||
|
||||
e. Is your workload too update-intensive for normal use of
|
||||
RCU, but inappropriate for other synchronization mechanisms?
|
||||
If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
|
||||
If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
|
||||
named SLAB_DESTROY_BY_RCU). But please be careful!
|
||||
|
||||
f. Do you need read-side critical sections that are respected
|
||||
even though they are in the middle of the idle loop, during
|
||||
|
||||
@@ -3800,6 +3800,14 @@
|
||||
spia_pedr=
|
||||
spia_peddr=
|
||||
|
||||
srcutree.exp_holdoff [KNL]
|
||||
Specifies how many nanoseconds must elapse
|
||||
since the end of the last SRCU grace period for
|
||||
a given srcu_struct until the next normal SRCU
|
||||
grace period will be considered for automatic
|
||||
expediting. Set to zero to disable automatic
|
||||
expediting.
|
||||
|
||||
stacktrace [FTRACE]
|
||||
Enabled the stack tracer on boot up.
|
||||
|
||||
|
||||
@@ -768,7 +768,7 @@ equal to zero, in which case the compiler is within its rights to
|
||||
transform the above code into the following:
|
||||
|
||||
q = READ_ONCE(a);
|
||||
WRITE_ONCE(b, 1);
|
||||
WRITE_ONCE(b, 2);
|
||||
do_something_else();
|
||||
|
||||
Given this transformation, the CPU is not required to respect the ordering
|
||||
|
||||
@@ -324,6 +324,9 @@ config HAVE_CMPXCHG_LOCAL
|
||||
config HAVE_CMPXCHG_DOUBLE
|
||||
bool
|
||||
|
||||
config ARCH_WEAK_RELEASE_ACQUIRE
|
||||
bool
|
||||
|
||||
config ARCH_WANT_IPC_PARSE_VERSION
|
||||
bool
|
||||
|
||||
|
||||
@@ -146,6 +146,7 @@ config PPC
|
||||
select ARCH_USE_BUILTIN_BSWAP
|
||||
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
|
||||
select ARCH_WANT_IPC_PARSE_VERSION
|
||||
select ARCH_WEAK_RELEASE_ACQUIRE
|
||||
select BINFMT_ELF
|
||||
select BUILDTIME_EXTABLE_SORT
|
||||
select CLONE_BACKWARDS
|
||||
|
||||
@@ -4789,7 +4789,7 @@ i915_gem_load_init(struct drm_i915_private *dev_priv)
|
||||
dev_priv->requests = KMEM_CACHE(drm_i915_gem_request,
|
||||
SLAB_HWCACHE_ALIGN |
|
||||
SLAB_RECLAIM_ACCOUNT |
|
||||
SLAB_DESTROY_BY_RCU);
|
||||
SLAB_TYPESAFE_BY_RCU);
|
||||
if (!dev_priv->requests)
|
||||
goto err_vmas;
|
||||
|
||||
|
||||
@@ -521,7 +521,7 @@ static inline struct drm_i915_gem_request *
|
||||
__i915_gem_active_get_rcu(const struct i915_gem_active *active)
|
||||
{
|
||||
/* Performing a lockless retrieval of the active request is super
|
||||
* tricky. SLAB_DESTROY_BY_RCU merely guarantees that the backing
|
||||
* tricky. SLAB_TYPESAFE_BY_RCU merely guarantees that the backing
|
||||
* slab of request objects will not be freed whilst we hold the
|
||||
* RCU read lock. It does not guarantee that the request itself
|
||||
* will not be freed and then *reused*. Viz,
|
||||
|
||||
@@ -174,7 +174,7 @@ struct drm_i915_private *mock_gem_device(void)
|
||||
i915->requests = KMEM_CACHE(mock_request,
|
||||
SLAB_HWCACHE_ALIGN |
|
||||
SLAB_RECLAIM_ACCOUNT |
|
||||
SLAB_DESTROY_BY_RCU);
|
||||
SLAB_TYPESAFE_BY_RCU);
|
||||
if (!i915->requests)
|
||||
goto err_vmas;
|
||||
|
||||
|
||||
@@ -1115,7 +1115,7 @@ int ldlm_init(void)
|
||||
ldlm_lock_slab = kmem_cache_create("ldlm_locks",
|
||||
sizeof(struct ldlm_lock), 0,
|
||||
SLAB_HWCACHE_ALIGN |
|
||||
SLAB_DESTROY_BY_RCU, NULL);
|
||||
SLAB_TYPESAFE_BY_RCU, NULL);
|
||||
if (!ldlm_lock_slab) {
|
||||
kmem_cache_destroy(ldlm_resource_slab);
|
||||
return -ENOMEM;
|
||||
|
||||
+1
-1
@@ -2363,7 +2363,7 @@ static int jbd2_journal_init_journal_head_cache(void)
|
||||
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
|
||||
sizeof(struct journal_head),
|
||||
0, /* offset */
|
||||
SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
|
||||
SLAB_TEMPORARY | SLAB_TYPESAFE_BY_RCU,
|
||||
NULL); /* ctor */
|
||||
retval = 0;
|
||||
if (!jbd2_journal_head_cache) {
|
||||
|
||||
+1
-1
@@ -38,7 +38,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
|
||||
/*
|
||||
* The lockless check can race with remove_wait_queue() in progress,
|
||||
* but in this case its caller should run under rcu_read_lock() and
|
||||
* sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
|
||||
* sighand_cachep is SLAB_TYPESAFE_BY_RCU, we can safely return.
|
||||
*/
|
||||
if (likely(!waitqueue_active(wqh)))
|
||||
return;
|
||||
|
||||
@@ -229,7 +229,7 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence)
|
||||
*
|
||||
* Function returns NULL if no refcount could be obtained, or the fence.
|
||||
* This function handles acquiring a reference to a fence that may be
|
||||
* reallocated within the RCU grace period (such as with SLAB_DESTROY_BY_RCU),
|
||||
* reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU),
|
||||
* so long as the caller is using RCU on the pointer to the fence.
|
||||
*
|
||||
* An alternative mechanism is to employ a seqlock to protect a bunch of
|
||||
@@ -257,7 +257,7 @@ dma_fence_get_rcu_safe(struct dma_fence * __rcu *fencep)
|
||||
* have successfully acquire a reference to it. If it no
|
||||
* longer matches, we are holding a reference to some other
|
||||
* reallocated pointer. This is possible if the allocator
|
||||
* is using a freelist like SLAB_DESTROY_BY_RCU where the
|
||||
* is using a freelist like SLAB_TYPESAFE_BY_RCU where the
|
||||
* fence remains valid for the RCU grace period, but it
|
||||
* may be reallocated. When using such allocators, we are
|
||||
* responsible for ensuring the reference we get is to
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user