Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU updates from Paul E. McKenney:

 - Updates to use cond_resched() instead of cond_resched_rcu_qs()
   where feasible (currently everywhere except in kernel/rcu and
   in kernel/torture.c).  Also a couple of fixes to avoid sending
   IPIs to offline CPUs.

 - Updates to simplify RCU's dyntick-idle handling.

 - Updates to remove almost all uses of smp_read_barrier_depends()
   and read_barrier_depends().

 - Miscellaneous fixes.

 - Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-05-01 15:00:59 -07:00 · 2018-01-03 14:14:18 +01:00
parent 30a7acd573 1dfa55e019
commit 475c5ee193
81 changed files with 500 additions and 747 deletions
@@ -1097,7 +1097,8 @@ will cause the CPU to disregard the values of its counters on
 its next exit from idle.
 Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
 cases where a given operation has resulted in a quiescent state
-for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
+for all flavors of RCU, for example, <tt>cond_resched()</tt>
+when RCU has indicated a need for quiescent states.

 <h5>RCU Callback Handling</h5>

@@ -1182,8 +1183,8 @@ CPU (and from tracing) unless otherwise stated.
 Its fields are as follows:

 <pre>
-  1   int dynticks_nesting;
-  2   int dynticks_nmi_nesting;
+  1   long dynticks_nesting;
+  2   long dynticks_nmi_nesting;
  3   atomic_t dynticks;
  4   bool rcu_need_heavy_qs;
  5   unsigned long rcu_qs_ctr;
@@ -1191,15 +1192,31 @@ Its fields are as follows:
 </pre>

 <p>The <tt>-&gt;dynticks_nesting</tt> field counts the
-nesting depth of normal interrupts.
-In addition, this counter is incremented when exiting dyntick-idle
-mode and decremented when entering it.
+nesting depth of process execution, so that in normal circumstances
+this counter has value zero or one.
+NMIs, irqs, and tracers are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
+field.
+Because NMIs cannot be masked, changes to this variable have to be
+undertaken carefully using an algorithm provided by Andy Lutomirski.
+The initial transition from idle adds one, and nested transitions
+add two, so that a nesting level of five is represented by a
+<tt>-&gt;dynticks_nmi_nesting</tt> value of nine.
 This counter can therefore be thought of as counting the number
 of reasons why this CPU cannot be permitted to enter dyntick-idle
-mode, aside from non-maskable interrupts (NMIs).
-NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
-field, except that NMIs that interrupt non-dyntick-idle execution
-are not counted.
+mode, aside from process-level transitions.
+
+<p>However, it turns out that when running in non-idle kernel context,
+the Linux kernel is fully capable of entering interrupt handlers that
+never exit and perhaps also vice versa.
+Therefore, whenever the <tt>-&gt;dynticks_nesting</tt> field is
+incremented up from zero, the <tt>-&gt;dynticks_nmi_nesting</tt> field
+is set to a large positive number, and whenever the
+<tt>-&gt;dynticks_nesting</tt> field is decremented down to zero,
+the <tt>-&gt;dynticks_nmi_nesting</tt> field is set to zero.
+Assuming that the number of misnested interrupts is not sufficient
+to overflow the counter, this approach corrects the
+<tt>-&gt;dynticks_nmi_nesting</tt> field every time the corresponding
+CPU enters the idle loop from process context.

 </p><p>The <tt>-&gt;dynticks</tt> field counts the corresponding
 CPU's transitions to and from dyntick-idle mode, so that this counter
@@ -1231,14 +1248,16 @@ in response.
 <tr><th>&nbsp;</th></tr>
 <tr><th align="left">Quick Quiz:</th></tr>
 <tr><td>
-	Why not just count all NMIs?
-	Wouldn't that be simpler and less error prone?
+	Why not simply combine the <tt>-&gt;dynticks_nesting</tt>
+	and <tt>-&gt;dynticks_nmi_nesting</tt> counters into a
+	single counter that just counts the number of reasons that
+	the corresponding CPU is non-idle?
 </td></tr>
 <tr><th align="left">Answer:</th></tr>
 <tr><td bgcolor="#ffffff"><font color="ffffff">
-	It seems simpler only until you think hard about how to go about
-	updating the <tt>rcu_dynticks</tt> structure's
-	<tt>-&gt;dynticks</tt> field.
+	Because this would fail in the presence of interrupts whose
+	handlers never return and of handlers that manage to return
+	from a made-up interrupt.
 </font></td></tr>
 <tr><td>&nbsp;</td></tr>
 </table>
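
Reviewer note on the hunk above: the counting rule for ->dynticks_nmi_nesting (add one on the first transition from idle, add two on each nested transition) can be checked with a small user-space sketch. This is only an illustration under stated assumptions. The structure and the sketch_* helpers are hypothetical names, not kernel code, and the sketch models only the case that starts from idle (a value of zero). It ignores the "large positive number" that the documentation says is installed when ->dynticks_nesting leaves zero.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU counter; not the kernel's structure. */
struct nmi_nesting_sketch {
	long nmi_nesting;
};

/* Handler entry: add one when coming from idle (zero), otherwise add two. */
static void sketch_nmi_enter(struct nmi_nesting_sketch *s)
{
	s->nmi_nesting += (s->nmi_nesting == 0) ? 1 : 2;
}

/* Handler exit: mirror image of entry, so a value of one drops back to zero. */
static void sketch_nmi_exit(struct nmi_nesting_sketch *s)
{
	assert(s->nmi_nesting > 0);
	s->nmi_nesting -= (s->nmi_nesting == 1) ? 1 : 2;
}

/* Recover the nesting depth from the encoded value (idle-origin case only). */
static long sketch_depth(const struct nmi_nesting_sketch *s)
{
	return (s->nmi_nesting + 1) / 2;
}
```

Calling sketch_nmi_enter() five times on a zeroed structure walks the counter through 1, 3, 5, 7, 9, which matches the "nesting level of five is represented by nine" statement in the new text. Five matching sketch_nmi_exit() calls return it to zero.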
@@ -581,7 +581,8 @@ This guarantee was only partially premeditated.
 DYNIX/ptx used an explicit memory barrier for publication, but had nothing
 resembling <tt>rcu_dereference()</tt> for subscription, nor did it
 have anything resembling the <tt>smp_read_barrier_depends()</tt>
-that was later subsumed into <tt>rcu_dereference()</tt>.
+that was later subsumed into <tt>rcu_dereference()</tt> and later
+still into <tt>READ_ONCE()</tt>.
 The need for these operations made itself known quite suddenly at a
 late-1990s meeting with the DEC Alpha architects, back in the days when
 DEC was still a free-standing company.
@@ -2797,7 +2798,7 @@ RCU must avoid degrading real-time response for CPU-bound threads, whether
 executing in usermode (which is one use case for
 <tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel.
 That said, CPU-bound loops in the kernel must execute
-<tt>cond_resched_rcu_qs()</tt> at least once per few tens of milliseconds
+<tt>cond_resched()</tt> at least once per few tens of milliseconds
 in order to avoid receiving an IPI from RCU.

 <p>
@@ -3128,7 +3129,7 @@ The solution, in the form of
 is to have implicit
 read-side critical sections that are delimited by voluntary context
 switches, that is, calls to <tt>schedule()</tt>,
-<tt>cond_resched_rcu_qs()</tt>, and
+<tt>cond_resched()</tt>, and
 <tt>synchronize_rcu_tasks()</tt>.
 In addition, transitions to and from userspace execution also delimit
 tasks-RCU read-side critical sections.
@@ -122,11 +122,7 @@ o	Be very careful about comparing pointers obtained from
 		Note that if checks for being within an RCU read-side
 		critical section are not required and the pointer is never
 		dereferenced, rcu_access_pointer() should be used in place
-		of rcu_dereference(). The rcu_access_pointer() primitive
-		does not require an enclosing read-side critical section,
-		and also omits the smp_read_barrier_depends() included in
-		rcu_dereference(), which in turn should provide a small
-		performance gain in some CPUs (e.g., the DEC Alpha).
+		of rcu_dereference().

 	o	The comparison is against a pointer that references memory
 		that was initialized "a long time ago."  The reason
@@ -23,12 +23,10 @@ o	A CPU looping with preemption disabled.  This condition can
 o	A CPU looping with bottom halves disabled.  This condition can
 	result in RCU-sched and RCU-bh stalls.

-o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
-	kernel without invoking schedule().  Note that cond_resched()
-	does not necessarily prevent RCU CPU stall warnings.  Therefore,
-	if the looping in the kernel is really expected and desirable
-	behavior, you might need to replace some of the cond_resched()
-	calls with calls to cond_resched_rcu_qs().
+o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
+	without invoking schedule().  If the looping in the kernel is
+	really expected and desirable behavior, you might need to add
+	some calls to cond_resched().

 o	Booting Linux using a console connection that is too slow to
 	keep up with the boot-time console-message rate.  For example,
@@ -600,8 +600,7 @@ don't forget about them when submitting patches making use of RCU!]

 	#define rcu_dereference(p) \
 	({ \
-		typeof(p) _________p1 = p; \
-		smp_read_barrier_depends(); \
+		typeof(p) _________p1 = READ_ONCE(p); \
 		(_________p1); \
 	})
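
Reviewer note on the hunk above: the simplified rcu_dereference() now gets its ordering from READ_ONCE() alone. A rough user-space approximation is sketched below, assuming GCC or Clang (it relies on the __typeof__ and statement-expression extensions). SKETCH_READ_ONCE(), sketch_rcu_dereference(), struct foo, gp, and sketch_read_a() are hypothetical names used only for illustration. The volatile cast stands in for the compiler-facing part of READ_ONCE(); the Alpha-specific memory barrier described in the memory-barriers.txt hunk is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: force exactly one load via a volatile access. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Same shape as the documentation's simplified macro, with sketch names. */
#define sketch_rcu_dereference(p) \
({ \
	__typeof__(p) _p1 = SKETCH_READ_ONCE(p); \
	(_p1); \
})

/* Hypothetical RCU-protected structure and global pointer. */
struct foo {
	int a;
};

static struct foo *gp;

/* Reader: fetch the pointer once, then dereference the fetched copy. */
static int sketch_read_a(void)
{
	struct foo *p = sketch_rcu_dereference(gp);

	return p ? p->a : -1;
}
```

The point of the single fetch is that the NULL check and the later p->a access both use the same loaded value, so the compiler cannot reload gp between them.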

@@ -2053,9 +2053,6 @@
 			This tests the locking primitive's ability to
 			transition abruptly to and from idle.

-	locktorture.torture_runnable= [BOOT]
-			Start locktorture running at boot time.
-
 	locktorture.torture_type= [KNL]
 			Specify the locking implementation to test.

@@ -3471,9 +3468,6 @@
 			the same as for rcuperf.nreaders.
 			N, where N is the number of CPUs

-	rcuperf.perf_runnable= [BOOT]
-			Start rcuperf running at boot time.
-
 	rcuperf.perf_type= [KNL]
 			Specify the RCU implementation to test.

@@ -3607,9 +3601,6 @@
 			Test RCU's dyntick-idle handling.  See also the
 			rcutorture.shuffle_interval parameter.

-	rcutorture.torture_runnable= [BOOT]
-			Start rcutorture running at boot time.
-
 	rcutorture.torture_type= [KNL]
 			Specify the RCU implementation to test.

@@ -220,8 +220,7 @@ before it writes the new tail pointer, which will erase the item.

 Note the use of READ_ONCE() and smp_load_acquire() to read the
 opposition index.  This prevents the compiler from discarding and
-reloading its cached value - which some compilers will do across
-smp_read_barrier_depends().  This isn't strictly needed if you can
+reloading its cached value.  This isn't strictly needed if you can
 be sure that the opposition index will _only_ be used the once.
 The smp_load_acquire() additionally forces the CPU to order against
 subsequent memory references.  Similarly, smp_store_release() is used
@@ -57,11 +57,6 @@ torture_type	  Type of lock to torture. By default, only spinlocks will

 		     o "rwsem_lock": read/write down() and up() semaphore pairs.

-torture_runnable  Start locktorture at boot time in the case where the
-		  module is built into the kernel, otherwise wait for
-		  torture_runnable to be set via sysfs before starting.
-		  By default it will begin once the module is loaded.
-

 	    ** Torture-framework (RCU + locking) **

@@ -227,17 +227,20 @@ There are some minimal guarantees that may be expected of a CPU:
 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

-	Q = READ_ONCE(P); smp_read_barrier_depends(); D = READ_ONCE(*Q);
+	Q = READ_ONCE(P); D = READ_ONCE(*Q);

     the CPU will issue the following memory operations:

 	Q = LOAD P, D = LOAD *Q

-     and always in that order.  On most systems, smp_read_barrier_depends()
-     does nothing, but it is required for DEC Alpha.  The READ_ONCE()
-     is required to prevent compiler mischief.  Please note that you
-     should normally use something like rcu_dereference() instead of
-     open-coding smp_read_barrier_depends().
+     and always in that order.  However, on DEC Alpha, READ_ONCE() also
+     emits a memory-barrier instruction, so that a DEC Alpha CPU will
+     instead issue the following memory operations:
+
+	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER
+
+     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
+     mischief.

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:
@@ -1815,7 +1818,7 @@ The Linux kernel has eight basic CPU memory barriers:
 	GENERAL		mb()			smp_mb()
 	WRITE		wmb()			smp_wmb()
 	READ		rmb()			smp_rmb()
-	DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
+	DATA DEPENDENCY				READ_ONCE()


 All memory barriers except the data dependency barriers imply a compiler
@@ -2864,7 +2867,10 @@ access depends on a read, not all do, so it may not be relied on.

 Other CPUs may also have split caches, but must coordinate between the various
 cachelets for normal memory accesses.  The semantics of the Alpha removes the
-need for coordination in the absence of memory barriers.
+need for hardware coordination in the absence of memory barriers, which
+permitted Alpha to sport higher CPU clock rates back in the day.  However,
+please note that smp_read_barrier_depends() should not be used except in
+Alpha arch-specific code and within the READ_ONCE() macro.


 CACHE COHERENCY VS DMA
@@ -8193,6 +8193,7 @@ F:	arch/*/include/asm/rwsem.h
 F:	include/linux/seqlock.h
 F:	lib/locking*.[ch]
 F:	kernel/locking/
+X:	kernel/locking/locktorture.c

 LOGICAL DISK MANAGER SUPPORT (LDM, Windows 2000/XP/Vista Dynamic Disks)
 M:	"Richard Russon (FlatCap)" <ldm@flatcap.org>
@@ -11450,15 +11451,6 @@ L:	linux-wireless@vger.kernel.org
 S:	Orphan
 F:	drivers/net/wireless/ray*

-RCUTORTURE MODULE
-M:	Josh Triplett <josh@joshtriplett.org>
-M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
-L:	linux-kernel@vger.kernel.org
-S:	Supported
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
-F:	Documentation/RCU/torture.txt
-F:	kernel/rcu/rcutorture.c
-
 RCUTORTURE TEST FRAMEWORK
 M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
 M:	Josh Triplett <josh@joshtriplett.org>
@@ -13767,6 +13759,18 @@ L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
 F:	drivers/platform/x86/topstar-laptop.c

+TORTURE-TEST MODULES
+M:	Davidlohr Bueso <dave@stgolabs.net>
+M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
+M:	Josh Triplett <josh@joshtriplett.org>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
+F:	Documentation/RCU/torture.txt
+F:	kernel/torture.c
+F:	kernel/rcu/rcutorture.c
+F:	kernel/locking/locktorture.c
+
 TOSHIBA ACPI EXTRAS DRIVER
 M:	Azael Avalos <coproscefalo@gmail.com>
 L:	platform-driver-x86@vger.kernel.org
@@ -550,7 +550,7 @@ try_again:
 		return;
 	}

-	smp_read_barrier_depends();
+	/* READ_ONCE() enforces dependency, but dangerous through integer!!! */
 	ch = port->rx_buffer[ix++];
 	st = port->rx_buffer[ix++];
 	smp_mb();
@@ -1728,7 +1728,10 @@ static int mn10300_serial_poll_get_char(struct uart_port *_port)
 			if (CIRC_CNT(port->rx_inp, ix, MNSC_BUFFER_SIZE) == 0)
 				return NO_POLL_CHAR;

-			smp_read_barrier_depends();
+			/*
+			 * READ_ONCE() enforces dependency, but dangerous
+			 * through integer!!!
+			 */
 			ch = port->rx_buffer[ix++];
 			st = port->rx_buffer[ix++];
 			smp_mb();
@@ -597,7 +597,6 @@ static void __cleanup(struct ioatdma_chan *ioat_chan, dma_addr_t phys_complete)
 	for (i = 0; i < active && !seen_current; i++) {
 		struct dma_async_tx_descriptor *tx;

-		smp_read_barrier_depends();
 		prefetch(ioat_get_ring_ent(ioat_chan, idx + i + 1));
 		desc = ioat_get_ring_ent(ioat_chan, idx + i);
 		dump_desc_dbg(ioat_chan, desc);
@@ -715,7 +714,6 @@ static void ioat_abort_descs(struct ioatdma_chan *ioat_chan)
 	for (i = 1; i < active; i++) {
 		struct dma_async_tx_descriptor *tx;

-		smp_read_barrier_depends();
 		prefetch(ioat_get_ring_ent(ioat_chan, idx + i + 1));
 		desc = ioat_get_ring_ent(ioat_chan, idx + i);

@@ -4,6 +4,7 @@ menuconfig INFINIBAND
 	depends on NET
 	depends on INET
 	depends on m || IPV6 != m
+	depends on !ALPHA
 	select IRQ_POLL
 	---help---
 	  Core support for InfiniBand (IB).  Make sure to also select
@@ -302,7 +302,6 @@ int hfi1_make_rc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		if (!(ib_rvt_state_ops[qp->state] & RVT_FLUSH_SEND))
 			goto bail;
 		/* We are in the error state, flush the work request. */
-		smp_read_barrier_depends(); /* see post_one_send() */
 		if (qp->s_last == READ_ONCE(qp->s_head))
 			goto bail;
 		/* If DMAs are in progress, we can't flush immediately. */
@@ -346,7 +345,6 @@ int hfi1_make_rc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		newreq = 0;
 		if (qp->s_cur == qp->s_tail) {
 			/* Check if send work queue is empty. */
-			smp_read_barrier_depends(); /* see post_one_send() */
 			if (qp->s_tail == READ_ONCE(qp->s_head)) {
 				clear_ahg(qp);
 				goto bail;
@@ -900,7 +898,6 @@ void hfi1_send_rc_ack(struct hfi1_ctxtdata *rcd,
 	}

 	/* Ensure s_rdma_ack_cnt changes are committed */
-	smp_read_barrier_depends();
 	if (qp->s_rdma_ack_cnt) {
 		hfi1_queue_rc_ack(qp, is_fecn);
 		return;
@@ -1562,7 +1559,6 @@ static void rc_rcv_resp(struct hfi1_packet *packet)
 	trace_hfi1_ack(qp, psn);

 	/* Ignore invalid responses. */
-	smp_read_barrier_depends(); /* see post_one_send */
 	if (cmp_psn(psn, READ_ONCE(qp->s_next_psn)) >= 0)
 		goto ack_done;

@@ -362,7 +362,6 @@ static void ruc_loopback(struct rvt_qp *sqp)
 	sqp->s_flags |= RVT_S_BUSY;

 again:
-	smp_read_barrier_depends(); /* see post_one_send() */
 	if (sqp->s_last == READ_ONCE(sqp->s_head))
 		goto clr_busy;
 	wqe = rvt_get_swqe_ptr(sqp, sqp->s_last);
@@ -553,7 +553,6 @@ static void sdma_hw_clean_up_task(unsigned long opaque)

 static inline struct sdma_txreq *get_txhead(struct sdma_engine *sde)
 {
-	smp_read_barrier_depends(); /* see sdma_update_tail() */
 	return sde->tx_ring[sde->tx_head & sde->sdma_mask];
 }

@@ -79,7 +79,6 @@ int hfi1_make_uc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		if (!(ib_rvt_state_ops[qp->state] & RVT_FLUSH_SEND))
 			goto bail;
 		/* We are in the error state, flush the work request. */
-		smp_read_barrier_depends(); /* see post_one_send() */
 		if (qp->s_last == READ_ONCE(qp->s_head))
 			goto bail;
 		/* If DMAs are in progress, we can't flush immediately. */
@@ -119,7 +118,6 @@ int hfi1_make_uc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		    RVT_PROCESS_NEXT_SEND_OK))
 			goto bail;
 		/* Check if send work queue is empty. */
-		smp_read_barrier_depends(); /* see post_one_send() */
 		if (qp->s_cur == READ_ONCE(qp->s_head)) {
 			clear_ahg(qp);
 			goto bail;
@@ -486,7 +486,6 @@ int hfi1_make_ud_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		if (!(ib_rvt_state_ops[qp->state] & RVT_FLUSH_SEND))
 			goto bail;
 		/* We are in the error state, flush the work request. */
-		smp_read_barrier_depends(); /* see post_one_send */
 		if (qp->s_last == READ_ONCE(qp->s_head))
 			goto bail;
 		/* If DMAs are in progress, we can't flush immediately. */
@@ -500,7 +499,6 @@ int hfi1_make_ud_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 	}

 	/* see post_one_send() */
-	smp_read_barrier_depends();
 	if (qp->s_cur == READ_ONCE(qp->s_head))
 		goto bail;

@@ -246,7 +246,6 @@ int qib_make_rc_req(struct rvt_qp *qp, unsigned long *flags)
 		if (!(ib_rvt_state_ops[qp->state] & RVT_FLUSH_SEND))
 			goto bail;
 		/* We are in the error state, flush the work request. */
-		smp_read_barrier_depends(); /* see post_one_send() */
 		if (qp->s_last == READ_ONCE(qp->s_head))
 			goto bail;
 		/* If DMAs are in progress, we can't flush immediately. */
@@ -293,7 +292,6 @@ int qib_make_rc_req(struct rvt_qp *qp, unsigned long *flags)
 		newreq = 0;
 		if (qp->s_cur == qp->s_tail) {
 			/* Check if send work queue is empty. */
-			smp_read_barrier_depends(); /* see post_one_send() */
 			if (qp->s_tail == READ_ONCE(qp->s_head))
 				goto bail;
 			/*
@@ -1340,7 +1338,6 @@ static void qib_rc_rcv_resp(struct qib_ibport *ibp,
 		goto ack_done;

 	/* Ignore invalid responses. */
-	smp_read_barrier_depends(); /* see post_one_send */
 	if (qib_cmp24(psn, READ_ONCE(qp->s_next_psn)) >= 0)
 		goto ack_done;

@@ -367,7 +367,6 @@ static void qib_ruc_loopback(struct rvt_qp *sqp)
 	sqp->s_flags |= RVT_S_BUSY;

 again:
-	smp_read_barrier_depends(); /* see post_one_send() */
 	if (sqp->s_last == READ_ONCE(sqp->s_head))
 		goto clr_busy;
 	wqe = rvt_get_swqe_ptr(sqp, sqp->s_last);