Binding the grace-period kthreads to the timekeeping CPU resulted in
significant performance decreases for some workloads. For more detail,
see:
https://lkml.org/lkml/2014/6/3/395 for benchmark numbers
https://lkml.org/lkml/2014/6/4/218 for CPU statistics
It turns out that it is necessary to bind the grace-period kthreads
to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
In other cases, it suffices to bind the grace-period kthreads to the
set of non-nohz_full CPUs.
This commit therefore creates a tick_nohz_not_full_mask that is the
complement of tick_nohz_full_mask, and then binds the grace-period
kthread to the set of CPUs indicated by this new mask, which covers
the CONFIG_NO_HZ_FULL_SYSIDLE=n case. The CONFIG_NO_HZ_FULL_SYSIDLE=y
case still binds the grace-period kthreads to the timekeeping CPU.
This commit also includes the tick_nohz_full_enabled() check suggested
by Frederic Weisbecker.
Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Created housekeeping_affine() and housekeeping_mask per
fweisbec feedback. ]
RCU priority boosting currently checks for boosting via a pointer in
task_struct. However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it see if it owns the lock. This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
and ->completed twice, once without ACCESS_ONCE() and once with it.
Which is pointless because we hold that rcu_node's ->lock at that point.
The intent was to check the current rcu_node structure and the root
rcu_node structure, the latter locklessly with ACCESS_ONCE(). This
commit therefore makes that change.
The reason that it is safe to locklessly check the root rcu_nodes's
->gpnum and ->completed fields is that we hold the current rcu_node's
->lock, which constrains the root rcu_node's ability to change its
->gpnum and ->completed fields. Of course, if there is a single rcu_node
structure, then rnp_root==rnp, and holding the lock prevents all changes.
If there is more than one rcu_node structure, then the code updates the
fields in the following order:
1. Increment rnp_root->gpnum to start new grace period.
2. Increment rnp->gpnum to initialize the current rcu_node,
continuing initialization for the new grace period.
3. Increment rnp_root->completed to end the current grace period.
4. Increment rnp->completed to continue cleaning up after the
old grace period.
So there are four possible combinations of relative values of these
four fields:
N N N N: RCU idle, new grace period must be initiated.
Although rnp_root->gpnum might be incremented immediately
after we check, that will just result in unnecessary work.
The grace period already started, and we try to start it.
N+1 N N N: RCU grace period just started. No further change is
possible because we hold rnp->lock, so the checks of
rnp_root->gpnum and rnp_root->completed are stable.
We know that our request for a future grace period will
be seen during grace-period cleanup.
N+1 N N+1 N: RCU grace period is ongoing. Because rnp->gpnum is
different than rnp->completed, we won't even look at
rnp_root->gpnum and rnp_root->completed, so the possible
concurrent change to rnp_root->completed does not matter.
We know that our request for a future grace period will
be seen during grace-period cleanup, which cannot pass
this rcu_node because we hold its ->lock.
N+1 N+1 N+1 N: RCU grace period has ended, but not yet been cleaned up.
Because rnp->gpnum is different than rnp->completed, we
won't look at rnp_root->gpnum and rnp_root->completed, so
the possible concurrent change to rnp_root->completed does
not matter. We know that our request for a future grace
period will be seen during grace-period cleanup, which
cannot pass this rcu_node because we hold its ->lock.
Therefore, despite initial appearances, the lockless check is safe.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Update comment to say why the lockless check is safe. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects. The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee. The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee. When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex. The booster can then go on
to boost the next task that is blocking the current RCU grace period.
But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it. But the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it. This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.
This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The m68k architecture aligns only to 16-bit boundaries, which can cause
the align-to-32-bits check in __call_rcu() to trigger. Because there is
currently no known potential need for more than one low-order bit, this
commit loosens the check to 16-bit boundaries.
Reported-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
RCU contains code of the following forms:
ACCESS_ONCE(x)++;
ACCESS_ONCE(x) += y;
ACCESS_ONCE(x) -= y;
Now these constructs do operate correctly, but they really result in a
pair of volatile accesses, one to do the load and another to do the store.
This can be confusing, as the casual reader might well assume that (for
example) gcc might generate a memory-to-memory add instruction for each
of these three cases. In fact, gcc will do no such thing. Also, there
is a good chance that the kernel will move to separate load and store
variants of ACCESS_ONCE(), and constructs like the above could easily
confuse both people and scripts attempting to make that sort of change.
Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
only need the store to be volatile, so that the read-modify-write form
might be misleading.
This commit therefore changes the above forms in RCU so that each instance
of ACCESS_ONCE() either does a load or a store, but not both. In a few
cases, ACCESS_ONCE() was not critical, for example, for maintaining
statisitics. In these cases, ACCESS_ONCE() has been dispensed with
entirely.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
once boot completes. Thus, there is no need to wrap it in ACCESS_ONCE()
in code that is built only when CONFIG_NO_HZ_FULL. This commit therefore
removes the redundant ACCESS_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The explicit local_irq_save() in __lock_task_sighand() is needed to avoid
a potential deadlock condition, as noted in a841796f11 (signal:
align __lock_task_sighand() irq disabling and RCU). However, someone
reading the code might be forgiven for concluding that this separate
local_irq_save() was completely unnecessary. This commit therefore adds
a comment referencing the shiny new block comment on rcu_read_unlock().
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Commit ac1bea8578 (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls. Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.
This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable. However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs. These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
ktest build robot. Also fixed smp_mb() comment as noted by
Oleg Nesterov. ]
Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, call_rcu() relies on implicit allocation and initialization
for the debug-objects handling of RCU callbacks. If you hammer the
kernel hard enough with Sasha's modified version of trinity, you can end
up with the sl*b allocators recursing into themselves via this implicit
call_rcu() allocation.
This commit therefore exports the debug_init_rcu_head() and
debug_rcu_head_free() functions, which permits the allocators to allocated
and pre-initialize the debug-objects information, so that there no longer
any need for call_rcu() to do that initialization, which in turn prevents
the recursion into the memory allocators.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Looks-good-to: Christoph Lameter <cl@linux.com>
Pull networking fixes from David Miller:
1) Fix checksumming regressions, from Tom Herbert.
2) Undo unintentional permissions changes for SCTP rto_alpha and
rto_beta sysfs knobs, from Denial Borkmann.
3) VXLAN, like other IP tunnels, should advertize it's encapsulation
size using dev->needed_headroom instead of dev->hard_header_len.
From Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: sctp: fix permissions for rto_alpha and rto_beta knobs
vxlan: Checksum fixes
net: add skb_pop_rcv_encapsulation
udp: call __skb_checksum_complete when doing full checksum
net: Fix save software checksum complete
net: Fix GSO constants to match NETIF flags
udp: ipv4: do not waste time in __udp4_lib_mcast_demux_lookup
vxlan: use dev->needed_headroom instead of dev->hard_header_len
MAINTAINERS: update cxgb4 maintainer
Pull more clock framework updates from Mike Turquette:
"This contains the second half the of the clk changes for 3.16.
They are simply fixes and code refactoring for the OMAP clock drivers.
The sunxi clock driver changes include splitting out the one
mega-driver into several smaller pieces and adding support for the A31
SoC clocks"
* tag 'clk-for-linus-3.16-part2' of git://git.linaro.org/people/mike.turquette/linux: (25 commits)
clk: sunxi: document PRCM clock compatible strings
clk: sunxi: add PRCM (Power/Reset/Clock Management) clks support
clk: sun6i: Protect SDRAM gating bit
clk: sun6i: Protect CPU clock
clk: sunxi: Rework clock protection code
clk: sunxi: Move the GMAC clock to a file of its own
clk: sunxi: Move the 24M oscillator to a file of its own
clk: sunxi: Remove calls to clk_put
clk: sunxi: document new A31 USB clock compatible
clk: sunxi: Implement A31 USB clock
ARM: dts: OMAP5/DRA7: use omap5-mpu-dpll-clock capable of dealing with higher frequencies
CLK: TI: dpll: support OMAP5 MPU DPLL that need special handling for higher frequencies
ARM: OMAP5+: dpll: support Duty Cycle Correction(DCC)
CLK: TI: clk-54xx: Set the rate for dpll_abe_m2x2_ck
CLK: TI: Driver for DRA7 ATL (Audio Tracking Logic)
dt:/bindings: DRA7 ATL (Audio Tracking Logic) clock bindings
ARM: dts: dra7xx-clocks: Correct name for atl clkin3 clock
CLK: TI: gate: add composite interface clock to OMAP2 only build
ARM: OMAP2: clock: add DT boot support for cpufreq_ck
CLK: TI: OMAP2: add clock init support
...
Pull NVMe update from Matthew Wilcox:
"Mostly bugfixes again for the NVMe driver. I'd like to call out the
exported tracepoint in the block layer; I believe Keith has cleared
this with Jens.
We've had a few reports from people who're really pounding on NVMe
devices at scale, hence the timeout changes (and new module
parameters), hotplug cpu deadlock, tracepoints, and minor performance
tweaks"
[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
end up going away when mq conversion happens ]
* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
NVMe: Use Log Page constants in SCSI emulation
NVMe: Define Log Page constants
NVMe: Fix hot cpu notification dead lock
NVMe: Rename io_timeout to nvme_io_timeout
NVMe: Use last bytes of f/w rev SCSI Inquiry
NVMe: Adhere to request queue block accounting enable/disable
NVMe: Fix nvme get/put queue semantics
NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
NVMe: Make admin timeout a module parameter
NVMe: Make iod bio timeout a parameter
NVMe: Prevent possible NULL pointer dereference
NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
NVMe: Update data structures for NVMe 1.2
NVMe: Enable BUILD_BUG_ON checks
NVMe: Update namespace and controller identify structures to the 1.1a spec
NVMe: Flush with data support
NVMe: Configure support for block flush
NVMe: Add tracepoints
NVMe: Protect against badly formatted CQEs
...
Commit 3fd091e73b ("[SCTP]: Remove multiple levels of msecs
to jiffies conversions.") has silently changed permissions for
rto_alpha and rto_beta knobs from 0644 to 0444. The purpose of
this was to discourage users from tweaking rto_alpha and
rto_beta knobs in production environments since they are key
to correctly compute rtt/srtt.
RFC4960 under section 6.3.1. RTO Calculation says regarding
rto_alpha and rto_beta under rule C3 and C4:
[...]
C3) When a new RTT measurement R' is made, set
RTTVAR <- (1 - RTO.Beta) * RTTVAR + RTO.Beta * |SRTT - R'|
and
SRTT <- (1 - RTO.Alpha) * SRTT + RTO.Alpha * R'
Note: The value of SRTT used in the update to RTTVAR
is its value before updating SRTT itself using the
second assignment. After the computation, update
RTO <- SRTT + 4 * RTTVAR.
C4) When data is in flight and when allowed by rule C5
below, a new RTT measurement MUST be made each round
trip. Furthermore, new RTT measurements SHOULD be
made no more than once per round trip for a given
destination transport address. There are two reasons
for this recommendation: First, it appears that
measuring more frequently often does not in practice
yield any significant benefit [ALLMAN99]; second,
if measurements are made more often, then the values
of RTO.Alpha and RTO.Beta in rule C3 above should be
adjusted so that SRTT and RTTVAR still adjust to
changes at roughly the same rate (in terms of how many
round trips it takes them to reflect new values) as
they would if making only one measurement per
round-trip and using RTO.Alpha and RTO.Beta as given
in rule C3. However, the exact nature of these
adjustments remains a research issue.
[...]
While it is discouraged to adjust rto_alpha and rto_beta
and not further specified how to adjust them, the RFC also
doesn't explicitly forbid it, but rather gives a RECOMMENDED
default value (rto_alpha=3, rto_beta=2). We have a couple
of users relying on the old permissions before they got
changed. That said, if someone really has the urge to adjust
them, we could allow it with a warning in the log.
Fixes: 3fd091e73b ("[SCTP]: Remove multiple levels of msecs to jiffies conversions.")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert says:
====================
Fixes related to some recent checksum modifications.
- Fix GSO constants to match NETIF flags
- Fix logic in saving checksum complete in __skb_checksum_complete
- Call __skb_checksum_complete from UDP if we are checksumming over
whole packet in order to save checksum.
- Fixes to VXLAN to work correctly with checksum complete
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_pop_rcv_encapsulation and postpull_rcsum for the Ethernet
header to work properly with checksum complete.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is used by UDP encapsulation protocols in RX when
crossing encapsulation boundary. If ip_summed is set to
CHECKSUM_UNNECESSARY and encapsulation is not set, change to
CHECKSUM_NONE since the checksum has not been validated within the
encapsulation. Clears csum_valid by the same rationale.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __udp_lib_checksum_complete check if checksum is being done over all
the data (len is equal to skb->len) and if it is call
__skb_checksum_complete instead of __skb_checksum_complete_head. This
allows checksum to be saved in checksum complete.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert reported issues regarding checksum complete and UDP.
The logic introduced in commit 7e3cead517
("net: Save software checksum complete") is not correct.
This patch:
1) Restores code in __skb_checksum_complete_header except for setting
CHECKSUM_UNNECESSARY. This function may be calculating checksum on
something less than skb->len.
2) Adds saving checksum to __skb_checksum_complete. The full packet
checksum 0..skb->len is calculated without adding in pseudo header.
This value is saved in skb->csum and then the pseudo header is added
to that to derive the checksum for validation.
3) In both __skb_checksum_complete_header and __skb_checksum_complete,
set skb->csum_valid to whether checksum of zero was computed. This
allows skb_csum_unnecessary to return true without changing to
CHECKSUM_UNNECESSARY which was done previously.
4) Copy new csum related bits in __copy_skb_header.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joseph Gasparakis reported that VXLAN GSO offload stopped working with
i40e device after recent UDP changes. The problem is that the
SKB_GSO_* bits are out of sync with the corresponding NETIF flags. This
patch fixes that. Also, we add BUILD_BUG_ONs in net_gso_ok for several
GSO constants that were missing to avoid the problem in the future.
Reported-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more SCSI updates from James Bottomley:
"This is just a couple of drivers (hpsa and lpfc) that got left out for
further testing in linux-next. We also have one fix to a prior
submission (qla2xxx sparse)"
* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (36 commits)
qla2xxx: fix sparse warnings introduced by previous target mode t10-dif patch
lpfc: Update lpfc version to driver version 10.2.8001.0
lpfc: Fix ExpressLane priority setup
lpfc: mark old devices as obsolete
lpfc: Fix for initializing RRQ bitmap
lpfc: Fix for cleaning up stale ring flag and sp_queue_event entries
lpfc: Update lpfc version to driver version 10.2.8000.0
lpfc: Update Copyright on changed files from 8.3.45 patches
lpfc: Update Copyright on changed files
lpfc: Fixed locking for scsi task management commands
lpfc: Convert runtime references to old xlane cfg param to fof cfg param
lpfc: Fix FW dump using sysfs
lpfc: Fix SLI4 s abort loop to process all FCP rings and under ring_lock
lpfc: Fixed kernel panic in lpfc_abort_handler
lpfc: Fix locking for postbufq when freeing
lpfc: Fix locking for lpfc_hba_down_post
lpfc: Fix dynamic transitions of FirstBurst from on to off
hpsa: fix handling of hpsa_volume_offline return value
hpsa: return -ENOMEM not -1 on kzalloc failure in hpsa_get_device_id
hpsa: remove messages about volume status VPD inquiry page not supported
...