Pull scheduler updates from Ingo Molnar:
"Core scheduler updates:
- Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
preempt=none/voluntary/full boot options (default: full), to allow
distros to build a PREEMPT kernel but fall back to close to
PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
a boot time selection.
There's also the /debug/sched_preempt switch to do this at runtime.
This feature is implemented via runtime patching (a new variant of
static calls).
The scope of the runtime patching can be best reviewed by looking
at the sched_dynamic_update() function in kernel/sched/core.c.
( Note that the dynamic none/voluntary mode isn't 100% identical to
the corresponding build-time models: for example, preempt-RCU is
available in all cases, and the preempt count is maintained in all
models, which has runtime overhead even with the code patching. )
The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
majority of distributions, are supposed to be unaffected.
- Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
was found via rcutorture triggering a hang. The bug is that
rcu_idle_enter() may wake up a NOCB kthread, but this happens after
the last generic need_resched() check. Some cpuidle drivers fix it
by chance but many others don't.
In true 2020 fashion the original bug fix has grown into a 5-patch
scheduler/RCU fix series plus another 16 RCU patches to address the
underlying issue of missed preemption events. These are the initial
fixes that should fix current incarnations of the bug.
- Clean up rbtree usage in the scheduler, by providing & using the
  following consistent set of rbtree APIs:

    partial-order; less() based:
      - rb_add(): add a new entry to the rbtree
      - rb_add_cached(): like rb_add(), but for a rb_root_cached

    total-order; cmp() based:
      - rb_find(): find an entry in an rbtree
      - rb_find_add(): find an entry, and add if not found
      - rb_find_first(): find the first (leftmost) matching entry
      - rb_next_match(): continue from rb_find_first()
      - rb_for_each(): iterate a sub-tree using the previous two
- Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
single pass. This is a 4-commit series where each commit improves
one aspect of the idle sibling scan logic.
- Improve the cpufreq cooling driver by getting the effective CPU
utilization metrics from the scheduler
- Improve the fair scheduler's active load-balancing logic by
reducing the number of active LB attempts & lengthen the
load-balancing interval. This improves stress-ng mmapfork
performance.
- Fix CFS's estimated utilization (util_est) calculation bug that can
result in too high utilization values
Misc updates & fixes:
- Fix the HRTICK reprogramming & optimization feature
- Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code
- Reduce dl_add_task_root_domain() overhead
- Fix uprobes refcount bug
- Process pending softirqs in flush_smp_call_function_from_idle()
- Clean up task priority related defines, remove *USER_*PRIO and
USER_PRIO()
- Simplify the sched_init_numa() deduplication sort
- Documentation updates
- Fix EAS bug in update_misfit_status(), which degraded the quality
of energy-balancing
- Smaller cleanups"
* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
sched,x86: Allow !PREEMPT_DYNAMIC
entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
entry: Explicitly flush pending rcuog wakeup before last rescheduling point
rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
rcu/nocb: Perform deferred wake up before last idle's need_resched() check
rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
sched/features: Distinguish between NORMAL and DEADLINE hrtick
sched/features: Fix hrtick reprogramming
sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
uprobes: (Re)add missing get_uprobe() in __find_uprobe()
smp: Process pending softirqs in flush_smp_call_function_from_idle()
sched: Harden PREEMPT_DYNAMIC
static_call: Allow module use without exposing static_call_key
sched: Add /debug/sched_preempt
preempt/dynamic: Support dynamic preempt with preempt= boot option
preempt/dynamic: Provide irqentry_exit_cond_resched() static call
preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
preempt/dynamic: Provide cond_resched() and might_resched() static calls
preempt: Introduce CONFIG_PREEMPT_DYNAMIC
static_call: Provide DEFINE_STATIC_CALL_RET0()
...
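
The boot-time selection described for CONFIG_PREEMPT_DYNAMIC above can be pictured with a small userspace sketch. It uses plain function pointers where the kernel patches static call sites, so it only models the idea. All names here (the mode enum, the hooks, dyn_update(), parse_preempt()) are hypothetical, and the mapping of modes to hooks is an assumption based on the description above, not a copy of sched_dynamic_update():

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace model; function pointers stand in for static calls. */
enum preempt_mode { MODE_NONE, MODE_VOLUNTARY, MODE_FULL };

int resched_points;	/* explicit rescheduling points that did real work */
int preempt_points;	/* involuntary preemption points that did real work */

static int  real_cond_resched(void) { resched_points++; return 1; }
static int  return0(void)           { return 0; }
static void real_preempt(void)      { preempt_points++; }
static void nop(void)               { }

/* The "patchable" call sites; the default mirrors preempt=full. */
int  (*cond_resched_hook)(void)     = return0;
int  (*might_resched_hook)(void)    = return0;
void (*preempt_schedule_hook)(void) = real_preempt;

/* Assumed mode-to-hook mapping; the kernel rewrites call sites instead. */
void dyn_update(enum preempt_mode mode)
{
	switch (mode) {
	case MODE_NONE:
		cond_resched_hook     = real_cond_resched;
		might_resched_hook    = return0;
		preempt_schedule_hook = nop;
		break;
	case MODE_VOLUNTARY:
		cond_resched_hook     = real_cond_resched;
		might_resched_hook    = real_cond_resched;
		preempt_schedule_hook = nop;
		break;
	case MODE_FULL:
		cond_resched_hook     = return0;
		might_resched_hook    = return0;
		preempt_schedule_hook = real_preempt;
		break;
	}
}

/* Parse a preempt= style value; returns a mode, or -1 if unrecognized. */
int parse_preempt(const char *s)
{
	if (!strcmp(s, "none"))      return MODE_NONE;
	if (!strcmp(s, "voluntary")) return MODE_VOLUNTARY;
	if (!strcmp(s, "full"))      return MODE_FULL;
	return -1;
}
```

A caller would pass the result of parse_preempt() to dyn_update() once at startup and then invoke the hooks at the usual call sites. Unlike this model, the patched kernel pays no indirect-call cost.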
Reduce rbtree boiler plate by using the new helpers.
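
As a rough illustration of the split between less()-based insertion and cmp()-based lookup, here is a userspace sketch. It uses a plain unbalanced binary tree rather than the kernel's rbtree, and struct node, struct root, tree_add() and tree_find() are hypothetical stand-ins that only mirror the shape of rb_add() and rb_find():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for rb_node / rb_root; no rebalancing is done. */
struct node { struct node *left, *right; };
struct root { struct node *top; };

/* A user structure embedding a tree node, as kernel users typically do. */
struct item { int key; struct node n; };

#define item_of(p) ((struct item *)((char *)(p) - offsetof(struct item, n)))

/* Partial order: insertion only needs to know "a sorts before b". */
bool item_less(struct node *a, const struct node *b)
{
	return item_of(a)->key < item_of(b)->key;
}

/* Total order: lookup needs <0, 0 or >0 to recognize an exact match. */
int item_cmp(const void *key, const struct node *n)
{
	int k = *(const int *)key;
	int v = item_of(n)->key;

	return (k > v) - (k < v);
}

/* Insert @node, descending left whenever less() says it sorts first. */
void tree_add(struct node *node, struct root *tree,
	      bool (*less)(struct node *, const struct node *))
{
	struct node **link = &tree->top;

	while (*link)
		link = less(node, *link) ? &(*link)->left : &(*link)->right;
	node->left = node->right = NULL;
	*link = node;
}

/* Return the node matching @key, or NULL if there is none. */
struct node *tree_find(const void *key, const struct root *tree,
		       int (*cmp)(const void *, const struct node *))
{
	struct node *n = tree->top;

	while (n) {
		int c = cmp(key, n);

		if (!c)
			return n;
		n = c < 0 ? n->left : n->right;
	}
	return NULL;
}
```

The point of the split is that callers only supply the comparison callback; the tree descent, which used to be open-coded at every call site, lives in one helper.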
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
addresses to more easily locate bugs. This has a couple of RCU-related commits,
but is mostly MM. Was pulled in with akpm's agreement.
- Per-callback-batch tracking of numbers of callbacks,
which enables better debugging information and smarter
reactions to large numbers of callbacks.
- The first round of changes to allow CPUs to be runtime switched from and to
callback-offloaded state.
- CONFIG_PREEMPT_RT-related changes.
- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.
- Torture-test and torture-test scripting updates, including a "torture everything"
script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale,
and also does an allmodconfig build.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To fix the following issues:
kernel/locking/rtmutex.c:1612: warning: Function parameter or member
'lock' not described in '__rt_mutex_futex_unlock'
kernel/locking/rtmutex.c:1612: warning: Function parameter or member
'wake_q' not described in '__rt_mutex_futex_unlock'
kernel/locking/rtmutex.c:1675: warning: Function parameter or member
'name' not described in '__rt_mutex_init'
kernel/locking/rtmutex.c:1675: warning: Function parameter or member
'key' not described in '__rt_mutex_init'
[ tglx: Change rt lock to rt_mutex for consistency's sake ]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/1605257895-5536-2-git-send-email-alex.shi@linux.alibaba.com
We generally expect local_irq_save() and local_irq_restore() to be
paired and sanely nested, and so local_irq_restore() expects to be
called with irqs disabled. Thus, within local_irq_restore() we only
trace irq flag changes when unmasking irqs.
This means that a sequence such as:
| local_irq_disable();
| local_irq_save(flags);
| local_irq_enable();
| local_irq_restore(flags);
... is liable to break things, as the local_irq_restore() would mask
irqs without tracing this change. Similar problems may exist for
architectures whose arch_irq_restore() function depends on being called
with irqs disabled.
We don't consider such sequences to be a good idea, so let's define
those as forbidden, and add tooling to detect such broken cases.
This patch adds debug code to WARN() when raw_local_irq_restore() is
called with irqs enabled. As raw_local_irq_restore() is expected to pair
with raw_local_irq_save(), it should never be called with irqs enabled.
To avoid the possibility of circular header dependencies between
irqflags.h and bug.h, the warning is handled in a separate C file.
The new code is all conditional on a new CONFIG_DEBUG_IRQFLAGS symbol
which is independent of CONFIG_TRACE_IRQFLAGS. As noted above such cases
will confuse lockdep, so CONFIG_DEBUG_LOCKDEP now selects
CONFIG_DEBUG_IRQFLAGS.
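
The check described above can be modeled in userspace as follows. This is only a sketch: the sim_* names and the boolean "flags" value are hypothetical, the interrupt state is a plain variable, and a counter plus a message on stderr stand in for WARN():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

bool sim_irqs_enabled = true;	/* simulated interrupt-enable state */
int  sim_warnings;		/* how many times the debug check fired */

void sim_irq_disable(void) { sim_irqs_enabled = false; }
void sim_irq_enable(void)  { sim_irqs_enabled = true; }

/* Save the current state, then mask. */
bool sim_irq_save(void)
{
	bool flags = sim_irqs_enabled;

	sim_irqs_enabled = false;
	return flags;
}

/* Debug check: restore is only valid while irqs are still masked. */
void sim_irq_restore(bool flags)
{
	if (sim_irqs_enabled) {
		sim_warnings++;
		fprintf(stderr, "warning: restore called with irqs enabled\n");
	}
	sim_irqs_enabled = flags;
}

/* Replays the forbidden sequence; returns the number of new warnings. */
int forbidden_sequence(void)
{
	int before = sim_warnings;
	bool flags;

	sim_irq_disable();
	flags = sim_irq_save();		/* saves "disabled" */
	sim_irq_enable();
	sim_irq_restore(flags);		/* masks irqs again: flagged */
	return sim_warnings - before;
}
```

A correctly paired save/restore never trips the check, while forbidden_sequence() trips it once and leaves the simulated irqs masked, which is the untraced state change the changelog is worried about.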
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210111153707.10071-1-mark.rutland@arm.com
The purpose of local_lock_t is to abstract: preempt_disable() /
local_bh_disable() / local_irq_disable(). These are the traditional
means of gaining access to per-cpu data, but are fundamentally
non-preemptible.
local_lock_t provides a per-cpu lock, that on !PREEMPT_RT reduces to
no-ops, just like regular spinlocks do on UP.
This gives rise to:
    CPU0                    CPU1

    local_lock(B)           spin_lock_irq(A)
    <IRQ>
      spin_lock(A)          local_lock(B)
Where lockdep then figures things will lock up; which would be true if
B were any other kind of lock. However this is a false positive, no
such deadlock actually exists.
For !RT the above local_lock(B) is preempt_disable(), and there's
obviously no deadlock; alternatively, CPU0's B != CPU1's B.
For RT the argument is that since local_lock() nests inside
spin_lock(), it cannot be used in hardirq context, and therefore CPU0
cannot in fact happen. Even though B is a real lock, it is a
preemptible lock and any threaded-irq would simply schedule out and
let the preempted task (which holds B) continue such that the task on
CPU1 can make progress, after which the threaded-irq resumes and can
finish.
This means that we can never form an IRQ inversion on a local_lock
dependency, so terminate the graph walk when looking for IRQ
inversions when we encounter one.
One consequence is that (for LOCKDEP_SMALL) when we look for redundant
dependencies, A -> B is not redundant in the presence of A -> L -> B.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
In preparation for adding a TRACE_IRQFLAGS dependent skip function to
check_redundant(), move it below the TRACE_IRQFLAGS #ifdef.
While there, provide a stub function to reduce #ifdef usage.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Some __bfs() walks will have additional iteration constraints (beyond
the path being strong). Provide an additional function to allow
terminating graph walks.
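
To make the idea concrete, here is a userspace sketch of a breadth-first walk with an optional skip callback. The graph representation and every name in it are hypothetical and much simpler than lockdep's __bfs(). The choice that a skipped node is neither matched nor expanded is an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4

/* Hypothetical dependency graph: edge[a][b] means "a -> b". */
struct graph {
	bool edge[NR_NODES][NR_NODES];
	bool is_local_lock[NR_NODES];
};

typedef bool (*skip_fn)(const struct graph *g, int node);

/* Example skip rule: stop the walk at local_lock-like nodes. */
bool skip_local_lock(const struct graph *g, int node)
{
	return g->is_local_lock[node];
}

/* Returns true if @dst is reachable from @src without passing a skipped node. */
bool bfs_reaches(const struct graph *g, int src, int dst, skip_fn skip)
{
	int queue[NR_NODES], head = 0, tail = 0;
	bool seen[NR_NODES] = { false };

	queue[tail++] = src;
	seen[src] = true;

	while (head < tail) {
		int cur = queue[head++];

		/* A skipped node is neither matched nor expanded. */
		if (skip && skip(g, cur))
			continue;
		if (cur == dst)
			return true;

		/* Each node is enqueued at most once, so the queue cannot overflow. */
		for (int next = 0; next < NR_NODES; next++) {
			if (g->edge[cur][next] && !seen[next]) {
				seen[next] = true;
				queue[tail++] = next;
			}
		}
	}
	return false;
}
```

With edges 0 -> 1 -> 2 and node 1 marked as a local lock, node 2 is reachable in a plain walk but not once skip_local_lock() is passed in. A direct 0 -> 2 edge makes it reachable again.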
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The local_lock_t's are special, because they cannot form IRQ
inversions, make sure we can tell them apart from the rest of the
locks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Pull locking updates from Thomas Gleixner:
"A moderate set of locking updates:
- A few extensions to the rwsem API and support for opportunistic
spinning and lock stealing
- lockdep selftest improvements
- Documentation updates
- Cleanups and small fixes all over the place"
* tag 'locking-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
seqlock: kernel-doc: Specify when preemption is automatically altered
seqlock: Prefix internal seqcount_t-only macros with a "do_"
Documentation: seqlock: s/LOCKTYPE/LOCKNAME/g
locking/rwsem: Remove reader optimistic spinning
locking/rwsem: Enable reader optimistic lock stealing
locking/rwsem: Prevent potential lock starvation
locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath()
locking/rwsem: Fold __down_{read,write}*()
locking/rwsem: Introduce rwsem_write_trylock()
locking/rwsem: Better collate rwsem_read_trylock()
rwsem: Implement down_read_interruptible
rwsem: Implement down_read_killable_nested
refcount: Fix a kernel-doc markup
completion: Drop init_completion define
atomic: Update MAINTAINERS
atomic: Delete obsolete documentation
seqlock: Rename __seqprop() users
lockdep/selftest: Add spin_nest_lock test
lockdep/selftests: Fix PROVE_RAW_LOCK_NESTING
seqlock: avoid -Wshadow warnings
...
Pull RCU updates from Thomas Gleixner:
"RCU, LKMM and KCSAN updates collected by Paul McKenney.
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs
- Lockdep-RCU updates reducing the need for __maybe_unused
- Tasks-RCU updates
- Miscellaneous fixes
- Documentation updates
- Torture-test updates
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests"
* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
srcu: Take early exit on memory-allocation failure
rcu/tree: Defer kvfree_rcu() allocation to a clean context
rcu: Do not report strict GPs for outgoing CPUs
rcu: Fix a typo in rcu_blocking_is_gp() header comment
rcu: Prevent lockdep-RCU splats on lock acquisition/release
rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
rcu,ftrace: Fix ftrace recursion
rcu/tree: Make struct kernel_param_ops definitions const
rcu/tree: Add a warning if CPU being onlined did not report QS already
rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
rcu: Fix single-CPU check in rcu_blocking_is_gp()
rcu: Implement rcu_segcblist_is_offloaded() config dependent
list.h: Update comment to explicitly note circular lists
rcu: Panic after fixed number of stalls
x86/smpboot: Move rcu_cpu_starting() earlier
rcu: Allow rcu_irq_enter_check_tick() from NMI
tools/memory-model: Label MP tests' producers and consumers
tools/memory-model: Use "buf" and "flag" for message-passing tests
tools/memory-model: Add types to litmus tests
tools/memory-model: Add a glossary of LKMM terms
...
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It also improves
the chance that a reader can get the lock, as writer optimistic spinning
disproportionately favors writers over readers.
Since commit d3681e269f ("locking/rwsem: Wake up almost all readers
in wait queue"), all the waiting readers are woken up so that they can
all get the read lock and run in parallel. When the number of contending
readers is large, allowing reader optimistic spinning will likely cause
reader fragmentation where multiple smaller groups of readers can get
the read lock in a sequential manner separated by writers. That reduces
reader parallelism.
One possible way to address that drawback is to limit the number of
readers (preferably one) that can do optimistic spinning. These readers
act as representatives of all the waiting readers in the wait queue as
they will wake up all those waiting readers once they get the lock.
Alternatively, as reader optimistic lock stealing has already enhanced
fairness to readers, it may be easier to just remove reader optimistic
spinning and simplify the optimistic spinning code as a result.
Performance measurements (locking throughput kops/s) using a locking
microbenchmark with 50/50 reader/writer distribution and turbo-boost
disabled were done on a 2-socket Cascade Lake system (48-core 96-thread)
to see the impacts of these changes:
1) Vanilla - 5.10-rc3 kernel
2) Before - 5.10-rc3 kernel with previous patches in this series
3) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
4) no-rspin - 5.10-rc3 kernel with reader spinning disabled
# of threads  CS Load  Vanilla  Before  limit-rspin  no-rspin
------------  -------  -------  ------  -----------  --------
           2        1    5,185   5,662        5,214     5,077
           4        1    5,107   4,983        5,188     4,760
           8        1    4,782   4,564        4,720     4,628
          16        1    4,680   4,053        4,567     3,402
          32        1    4,299   1,115        1,118     1,098
          64        1    3,218     983        1,001       957
          96        1    1,938     944          957       930
           2       20    2,008   2,128        2,264     1,665
           4       20    1,390   1,033        1,046     1,101
           8       20    1,472   1,155        1,098     1,213
          16       20    1,332   1,077        1,089     1,122
          32       20      967     914          917       980
          64       20      787     874          891       858
          96       20      730     836          847       844
           2      100      372     356          360       355
           4      100      492     425          434       392
           8      100      533     537          529       538
          16      100      548     572          568       598
          32      100      499     520          527       537
          64      100      466     517          526       512
          96      100      406     497          506       509
The column "CS Load" represents the number of pause instructions issued
in the locking critical section. A CS load of 1 is extremely short and
is not likely in real situations. Loads of 20 (moderate) and 100 (long)
are more realistic.
It can be seen that the previous patches in this series have reduced
performance in general, except in highly contended cases with moderate
or long critical sections, where performance improves a bit. This change
is mostly caused by the "Prevent potential lock starvation" patch, which
reduces reader optimistic spinning and hence reduces reader fragmentation.
The patch that further limits reader optimistic spinning doesn't seem to
have much impact on overall performance, as shown in the benchmark
data.
The patch that disables reader optimistic spinning shows reduced
performance in lightly loaded cases, but comparable or slightly better
performance with heavier contention.
This patch just removes reader optimistic spinning for now. As readers
are not going to do optimistic spinning anymore, we don't need to
consider if the OSQ is empty or not when doing lock stealing.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
If the optimistic spinning queue is empty and the rwsem does not have
the handoff or write-lock bits set, it is actually not necessary to
call rwsem_optimistic_spin() to spin on it. Instead, it can steal the
lock directly as its reader bias is in the count already. If it is
the first reader in this state, it will try to wake up other readers
in the wait queue.
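
The stealing condition can be sketched as a pair of predicates over the lock count. The SK_* bit layout and both function names below are hypothetical, chosen only for this example and not taken from the kernel's rwsem definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit layout for this sketch only. */
#define SK_WRITER_LOCKED	(1UL << 0)
#define SK_FLAG_WAITERS		(1UL << 1)
#define SK_FLAG_HANDOFF		(1UL << 2)
#define SK_READER_SHIFT		8
#define SK_READER_BIAS		(1UL << SK_READER_SHIFT)

/*
 * @count:     value observed after this reader already added its own bias
 * @osq_empty: true when nobody is optimistically spinning on the lock
 *
 * The reader may take the lock directly if no writer holds it and no
 * handoff is pending.
 */
bool reader_can_steal(unsigned long count, bool osq_empty)
{
	return osq_empty && !(count & (SK_WRITER_LOCKED | SK_FLAG_HANDOFF));
}

/* Only the first reader to steal is expected to wake queued readers. */
bool reader_should_wake_waiters(unsigned long count)
{
	return (count >> SK_READER_SHIFT) == 1 && (count & SK_FLAG_WAITERS);
}
```

Because the reader bias is already included in the count, a successful steal needs no further atomic update in this model; the reader simply proceeds as a lock holder.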
With this patch applied, the following were the lock event counts
after rebooting a 2-socket system and a "make -j96" kernel rebuild.
rwsem_opt_rlock=4437
rwsem_rlock=29
rwsem_rlock_steal=19
So lock stealing represents about 0.4% of all the read locks acquired
in the slow path.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
The lock handoff bit was added in commit 4f23dbc1e6 ("locking/rwsem:
Implement lock handoff to prevent lock starvation") to avoid lock
starvation. However, allowing readers to do optimistic spinning does
introduce an unlikely scenario where lock starvation can happen.
The lock handoff bit may only be set when a waiter is being woken up.
In the case of reader unlock, wakeup happens only when the reader count
reaches 0. If there is a continuous stream of incoming readers acquiring
read lock via optimistic spinning, it is possible that the reader count
may never reach 0 and so the handoff bit will never be asserted.
One way to prevent this scenario from happening is to disallow optimistic
spinning if the rwsem is currently owned by readers. If the previous
or current owner is a writer, optimistic spinning will be allowed.
If the previous owner is a reader but the reader count has reached 0
before, a wakeup should have been issued. So the handoff mechanism
will kick in to prevent lock starvation. As a result, it should
be OK to do optimistic spinning in this case.
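
The rule in the three paragraphs above boils down to a single predicate. The sketch below uses a hypothetical owner_kind enum and a hypothetical spin_allowed() helper; it models only the decision described here, not the actual rwsem code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of who last owned (or still owns) the lock. */
enum owner_kind { OWNER_NONE, OWNER_WRITER, OWNER_READER };

/*
 * @owner:          kind of the previous or current owner
 * @active_readers: number of readers holding the lock right now
 */
bool spin_allowed(enum owner_kind owner, unsigned int active_readers)
{
	/*
	 * Readers currently hold the lock: spinning could keep the reader
	 * count above 0 forever, so the handoff bit would never be set.
	 */
	if (owner == OWNER_READER && active_readers > 0)
		return false;

	/*
	 * Writer-owned, or reader-owned with the count back at 0 (a wakeup
	 * has been issued by then): spinning is fine.
	 */
	return true;
}
```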
This patch may have some impact on reader performance as it reduces
reader optimistic spinning, especially if the lock critical sections
are short and the number of contending readers is small.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com