A circular lock dependency splat has been seen involving down_trylock():
======================================================
WARNING: possible circular locking dependency detected
6.12.0-41.el10.s390x+debug
------------------------------------------------------
dd/32479 is trying to acquire lock:
0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
but task is already holding lock:
000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console_sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
-> #4 (&zone->lock){-.-.}-{2:2}:
__lock_acquire+0xe86/0x1cc0
lock_acquire.part.0+0x258/0x630
lock_acquire+0xb8/0xe0
_raw_spin_lock_irqsave+0xb4/0x120
rmqueue_bulk+0xac/0x8f0
__rmqueue_pcplist+0x580/0x830
rmqueue_pcplist+0xfc/0x470
rmqueue.isra.0+0xdec/0x11b0
get_page_from_freelist+0x2ee/0xeb0
__alloc_pages_noprof+0x2c2/0x520
alloc_pages_mpol_noprof+0x1fc/0x4d0
alloc_pages_noprof+0x8c/0xe0
allocate_slab+0x320/0x460
___slab_alloc+0xa58/0x12b0
__slab_alloc.isra.0+0x42/0x60
kmem_cache_alloc_noprof+0x304/0x350
fill_pool+0xf6/0x450
debug_object_activate+0xfe/0x360
enqueue_hrtimer+0x34/0x190
__run_hrtimer+0x3c8/0x4c0
__hrtimer_run_queues+0x1b2/0x260
hrtimer_interrupt+0x316/0x760
do_IRQ+0x9a/0xe0
do_irq_async+0xf6/0x160
Normally a raw_spinlock to spinlock dependency is not legitimate
and will be warned if CONFIG_PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernel without causing
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legitimate and not a bug.
Anyway, semaphore is the only locking primitive left that is still
using try_to_wake_up() to do wakeup inside critical section, all the
other locking primitives had been migrated to use wake_q to do wakeup
outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let just do the migration now to wake_q to avoid headache
like this.
Reported-by: yzbot+ed801a886dfdbfe7136d@syzkaller.appspotmail.com
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pull RCU updates from Uladzislau Rezki:
"Misc fixes:
- check if IRQs are disabled in rcu_exp_need_qs()
- instrument KCSAN exclusive-writer assertions
- add extra WARN_ON_ONCE() check
- set the cpu_no_qs.b.exp under lock
- warn if callback enqueued on offline CPU
Torture-test updates:
- add rcutorture.preempt_duration kernel module parameter
- make the TREE03 scenario do preemption
- improve pooling timeouts for rcu_torture_writer()
- improve output of "Failure/close-call rcutorture reader segments"
- add some reader-state debugging checks
- update doc of polled APIs
- add extra diagnostics for per-reader-segment preemption
- add an extra test for sched_clock()
- improve testing on unresponsive systems
SRCU updates:
- improve doc for srcu_read_lock() in terms of return value
- fix typo in comments
- remove redundant GP sequence checks in the srcu_funnel_gp_start"
* tag 'rcu.release.v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits)
srcu: Remove redundant GP sequence checks in srcu_funnel_gp_start
srcu: Fix typo s/srcu_check_read_flavor()/__srcu_check_read_flavor()/
srcu: Guarantee non-negative return value from srcu_read_lock()
MAINTAINERS: Update RCU git tree
rcu: Add lockdep_assert_irqs_disabled() to rcu_exp_need_qs()
rcu: Add KCSAN exclusive-writer assertions for rdp->cpu_no_qs.b.exp
rcu: Make preemptible rcu_exp_handler() check idempotency
rcu: Replace open-coded rcu_exp_need_qs() from rcu_exp_handler() with call
rcu: Move rcu_report_exp_rdp() setting of ->cpu_no_qs.b.exp under lock
rcu: Make rcu_report_exp_cpu_mult() caller acquire lock
rcu: Report callbacks enqueued on offline CPU blind spot
rcutorture: Use symbols for SRCU reader flavors
rcutorture: Add per-reader-segment preemption diagnostics
rcutorture: Read CPU ID for decoration protected by both reader types
rcutorture: Add preempt_count() to rcutorture_one_extend_check() diagnostics
rcutorture: Add parameters to control polled/conditional wait interval
rcutorture: Add documentation for recent conditional and polled APIs
rcutorture: Ignore attempts to test preemption and forward progress
rcutorture: Make rcutorture_one_extend() check reader state
rcutorture: Pretty-print rcutorture reader segments
...
Lockdep changes for v6.14:
- Use swap() macro in the ww_mutex test.
- Minor fixes and documentation for lockdep configs on internal data structure sizes.
- Some "-Wunused-function" warning fixes for Clang.
Rust locking changes for v6.14:
- Add Rust locking files into LOCKING PRIMITIVES maintainer entry.
- Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking.
- Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards.
- Add lockdep checking when creating a new lock `Guard`.
A common pattern seen when wake_qs are used to defer a wakeup
until after a lock is released is something like:
preempt_disable();
raw_spin_unlock(lock);
wake_up_q(wake_q);
preempt_enable();
So create some raw_spin_unlock*_wake() helper functions to clean
this up.
Applies on top of the fix I submitted here:
https://lore.kernel.org/lkml/20241212222138.2400498-1-jstultz@google.com/
NOTE: I recognise the unlock()/unlock_irq()/unlock_irqrestore()
variants creates its own duplication, which we could use a macro
to generate the similar functions, but I often dislike how those
generation macros making finding the actual implementation
harder, so I left the three functions as is. If folks would
prefer otherwise, let me know and I'll switch it.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241217040803.243420-1-jstultz@google.com
When lockdep_assert_locked() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and
CONFIG_PROVE_LOCKING=n:
kernel/locking/lockdep.c:160:20: error: unused function 'lockdep_assert_locked' [-Werror,-Wunused-function]
Fix this by moving it under the respective ifdeffery.
See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
[Boqun: add more config information of the error]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20241202193445.769567-1-andriy.shevchenko@linux.intel.com
When chain_hlock_class_idx() is unused, it prevents kernel builds with
clang, `make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and
CONFIG_PROVE_LOCKING=n:
kernel/locking/lockdep.c:435:28: error: unused function 'chain_hlock_class_idx' [-Werror,-Wunused-function]
Fix this by marking it with __maybe_unused.
See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
[Boqun: add more config information of the error]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20241209170810.1485183-1-andriy.shevchenko@linux.intel.com
Bert reported seeing occasional boot hangs when running with
PREEPT_RT and bisected it down to commit 894d1b3db4
("locking/mutex: Remove wakeups from under mutex::wait_lock").
It looks like I missed a few spots where we drop the wait_lock and
potentially call into schedule without waking up the tasks on the
wake_q structure. Since the tasks being woken are ww_mutex tasks
they need to be able to run to release the mutex and unblock the
task that currently is planning to wake them. Thus we can deadlock.
So make sure we wake the wake_q tasks when we unlock the wait_lock.
Closes: https://lore.kernel.org/lkml/20241211182502.2915-1-spasswolf@web.de
Fixes: 894d1b3db4 ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241212222138.2400498-1-jstultz@google.com
Current use cases of torture_sched_setaffinity() are well served by its
unconditional warning on error. However, an upcoming use case for a
preemption kthread needs to avoid warnings that might otherwise arise
when that kthread attempted to bind itself to a CPU on its way offline.
This commit therefore adds a dowarn argument that, when false, suppresses
the warning.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Anders had bisected a crash using PREEMPT_RT with linux-next and
isolated it down to commit 894d1b3db4 ("locking/mutex: Remove
wakeups from under mutex::wait_lock"), where it seemed the
wake_q structure was somehow getting corrupted causing a null
pointer traversal.
I was able to easily repoduce this with PREEMPT_RT and managed
to isolate down that through various call stacks we were
actually calling wake_up_q() twice on the same wake_q.
I found that in the problematic commit, I had added the
wake_up_q() call in task_blocks_on_rt_mutex() around
__ww_mutex_add_waiter(), following a similar pattern in
__mutex_lock_common().
However, its just wrong. We haven't dropped the lock->wait_lock,
so its contrary to the point of the original patch. And it
didn't match the __mutex_lock_common() logic of re-initializing
the wake_q after calling it midway in the stack.
Looking at it now, the wake_up_q() call is incorrect and should
just be removed. So drop the erronious logic I had added.
Fixes: 894d1b3db4 ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Closes: https://lore.kernel.org/lkml/6afb936f-17c7-43fa-90e0-b9e780866097@app.fastmail.com/
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241114190051.552665-1-jstultz@google.com
Replace this pattern in osq_unlock():
atomic_cmpxchg(*ptr, old, new) == old
... with the simpler and faster:
atomic_try_cmpxchg(*ptr, &old, new)
The x86 CMPXCHG instruction returns success in the ZF flag,
so this change saves a compare after the CMPXCHG. The code
in the fast path of osq_unlock() improves from:
11b: 31 c9 xor %ecx,%ecx
11d: 8d 50 01 lea 0x1(%rax),%edx
120: 89 d0 mov %edx,%eax
122: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
126: 39 c2 cmp %eax,%edx
128: 75 05 jne 12f <...>
to:
12b: 31 d2 xor %edx,%edx
12d: 83 c0 01 add $0x1,%eax
130: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
134: 75 05 jne 13b <...>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241001114606.820277-1-ubizjak@gmail.com
Every lock, that becomes a sleeping lock on PREEMPT_RT, starts a RCU read
side critical section. There is no sparse annotation for this and sparse
complains about unbalanced locking.
Add __acquires/ __releases for the RCU lock. This covers all but the
trylock functions. A __cond_acquires() annotation didn't work.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812104200.2239232-4-bigeasy@linutronix.de
Add a function to check that an offline CPU has left the tracing
infrastructure in a sane state.
Commit 9bb69ba4c1 ("ACPI: processor_idle: use raw_safe_halt() in
acpi_idle_play_dead()") fixed an issue where the acpi_idle_play_dead()
function called safe_halt() instead of raw_safe_halt(), which had the
side-effect of setting the hardirqs_enabled flag for the offline CPU.
On x86 this triggered warnings from lockdep_assert_irqs_disabled() when
the CPU was brought back online again later. These warnings were too
early for the exception to be handled correctly, leading to a
triple-fault.
Add lockdep_cleanup_dead_cpu() to check for this kind of failure mode,
print the events leading up to it, and correct it so that the CPU can
come online again correctly. Re-introducing the original bug now merely
results in this warning instead:
[ 61.556652] smpboot: CPU 1 is now offline
[ 61.556769] CPU 1 left hardirqs enabled!
[ 61.556915] irq event stamp: 128149
[ 61.556965] hardirqs last enabled at (128149): [<ffffffff81720a36>] acpi_idle_play_dead+0x46/0x70
[ 61.557055] hardirqs last disabled at (128148): [<ffffffff81124d50>] do_idle+0x90/0xe0
[ 61.557117] softirqs last enabled at (128078): [<ffffffff81cec74c>] __do_softirq+0x31c/0x423
[ 61.557199] softirqs last disabled at (128065): [<ffffffff810baae1>] __irq_exit_rcu+0x91/0x100
[boqun: Capitalize the title and reword the message a bit]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/f7bd2b3b999051bb3ef4be34526a9262008285f5.camel@infradead.org
With the proxy-execution series, we traverse the task->mutex->task
blocked_on/owner chain in the scheduler core. We do this while holding
the rq::lock to keep the structures in place while taking and
releasing the alternating lock types.
Since the mutex::wait_lock is one of the locks we will take in this
way under the rq::lock in the scheduler core, we need to make sure
that its usage elsewhere is irq safe.
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-3-jstultz@google.com
When using mutex_acquire_nest() with a nest_lock, lockdep refcounts the
number of acquired lockdep_maps of mutexes of the same class, and also
keeps a pointer to the first acquired lockdep_map of a class. That pointer
is then used for various comparison-, printing- and checking purposes,
but there is no mechanism to actively ensure that lockdep_map stays in
memory. Instead, a warning is printed if the lockdep_map is freed and
there are still held locks of the same lock class, even if the lockdep_map
itself has been released.
In the context of WW/WD transactions that means that if a user unlocks
and frees a ww_mutex from within an ongoing ww transaction, and that
mutex happens to be the first ww_mutex grabbed in the transaction,
such a warning is printed and there might be a risk of a UAF.
Note that this is only problem when lockdep is enabled and affects only
dereferences of struct lockdep_map.
Adjust to this by adding a fake lockdep_map to the acquired context and
make sure it is the first acquired lockdep map of the associated
ww_mutex class. Then hold it for the duration of the WW/WD transaction.
This has the side effect that trying to lock a ww mutex *without* a
ww_acquire_context but where a such context has been acquire, we'd see
a lockdep splat. The test-ww_mutex.c selftest attempts to do that, so
modify that particular test to not acquire a ww_acquire_context if it
is not going to be used.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241009092031.6356-1-thomas.hellstrom@linux.intel.com