Marc Hartmayer reported:
[ 23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
[ 23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
[ 23.133954] Fault in home space mode while using kernel ASCE.
[ 23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
[ 23.134207] Oops: 0004 ilc:2 [#1] SMP
(snip)
[ 23.134516] Call Trace:
[ 23.134520] [<0000024e326caf28>] worker_thread+0x48/0x430
[ 23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
[ 23.134528] [<0000024e326d3a3e>] kthread+0x11e/0x130
[ 23.134533] [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
[ 23.134536] [<0000024e333fb37a>] ret_from_fork+0xa/0x38
[ 23.134552] Last Breaking-Event-Address:
[ 23.134553] [<0000024e333f4c04>] mutex_unlock+0x24/0x30
[ 23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops
With debuging and analysis, worker_thread() accesses to the nullified
worker->pool when the newly created worker is destroyed before being
waken-up, in which case worker_thread() can see the result detach_worker()
reseting worker->pool to NULL at the begining.
Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.
worker->pool had been designed to be constant for regular workers and
changeable for rescuer. To share attaching/detaching code for regular
and rescuer workers and to avoid worker->pool being accessed inadvertently
when the worker has been detached, worker->pool is reset to NULL when
detached no matter the worker is rescuer or not.
To maintain worker->pool being reset after detached, move the code
"worker->pool = NULL;" in the worker thread context after detached.
It is either be in the regular worker thread context after PF_WQ_WORKER
is cleared or in rescuer worker thread context with wq_pool_attach_mutex
held. So it is safe to do so.
Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c94 ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make tags always produces below annoying warnings:
ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1"
In commit 25528213fe ("tags: Fix DEFINE_PER_CPU expansions"), codes in
places have been adjusted including cpu_worker_pools definition. I noticed
in commit 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually
replace tasklets"), cpu_worker_pools definition was unfolded back. Not
sure if it was intentionally done or ignored carelessly.
Makes change to mute them specifically.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq->lockdep_map is set only after __alloc_workqueue()
successfully returns. However, on its error path
__alloc_workqueue() may call destroy_workqueue() which
expects wq->lockdep_map to be already set, which results
in a null-ptr-deref in touch_wq_lockdep_map().
Add a simple NULL-check to touch_wq_lockdep_map().
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:__lock_acquire+0x81/0x7800
[..]
Call Trace:
<TASK>
? __die_body+0x66/0xb0
? die_addr+0xb2/0xe0
? exc_general_protection+0x300/0x470
? asm_exc_general_protection+0x22/0x30
? __lock_acquire+0x81/0x7800
? mark_lock+0x94/0x330
? __lock_acquire+0x12fd/0x7800
? __lock_acquire+0x3439/0x7800
lock_acquire+0x14c/0x3e0
? __flush_workqueue+0x167/0x13a0
? __init_swait_queue_head+0xaf/0x150
? __flush_workqueue+0x167/0x13a0
__flush_workqueue+0x17d/0x13a0
? __flush_workqueue+0x167/0x13a0
? lock_release+0x50f/0x830
? drain_workqueue+0x94/0x300
drain_workqueue+0xe3/0x300
destroy_workqueue+0xac/0xc40
? workqueue_sysfs_register+0x159/0x2f0
__alloc_workqueue+0x1506/0x1760
alloc_workqueue+0x61/0x150
...
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Calling va_start / va_end multiple times is undefined and causes
problems with certain compiler / platforms.
Change alloc_ordered_workqueue_lockdep_map to a macro and updated
__alloc_workqueue to take a va_list argument.
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add an interface for a user-defined workqueue lockdep map, which is
helpful when multiple workqueues are created for the same purpose. This
also helps avoid leaking lockdep maps on each workqueue creation.
v2:
- Add alloc_workqueue_lockdep_map (Tejun)
v3:
- Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
- static inline alloc_ordered_workqueue_lockdep_map (Tejun)
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we want to debug the workqueue stall, we can immediately make
a panic to get the information we want.
In some systems, it may be necessary to quickly reboot the system to
escape from a workqueue lockup situation. In this case, we can control
the number of stall detections to generate panic.
workqueue.panic_on_stall sets the number times of the stall to trigger
panic. 0 disables the panic on stall.
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpu_pwq is used in various percpu functions that expect variable in
__percpu address space. Correct the declaration of cpu_pwq to
struct pool_workqueue __rcu * __percpu *cpu_pwq
to declare the variable as __percpu pointer.
The patch also fixes following sparse errors:
workqueue.c:380:37: warning: duplicate [noderef]
workqueue.c:380:37: error: multiple address spaces given: __rcu & __percpu
workqueue.c:2271:15: error: incompatible types in comparison expression (different address spaces):
workqueue.c:2271:15: struct pool_workqueue [noderef] __rcu *
workqueue.c:2271:15: struct pool_workqueue [noderef] __percpu *
and uncovers a couple of exisiting "incorrect type in assignment"
warnings (from __rcu address space), which this patch does not address.
Found by GCC's named address space checks.
There were no changes in the resulting object files.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When flushing a work item for cancellation, __flush_work() knows that it
exclusively owns the work item through its PENDING bit. 134874e2ee
("workqueue: Allow cancel_work_sync() and disable_work() from atomic
contexts on BH work items") added a read of @work->data to determine whether
to use busy wait for BH work items that are being canceled. While the read
is safe when @from_cancel, @work->data was read before testing @from_cancel
to simplify code structure:
data = *work_data_bits(work);
if (from_cancel &&
!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && (data & WORK_OFFQ_BH)) {
While the read data was never used if !@from_cancel, this could trigger
KCSAN data race detection spuriously:
==================================================================
BUG: KCSAN: data-race in __flush_work / __flush_work
write to 0xffff8881223aa3e8 of 8 bytes by task 3998 on cpu 0:
instrument_write include/linux/instrumented.h:41 [inline]
___set_bit include/asm-generic/bitops/instrumented-non-atomic.h:28 [inline]
insert_wq_barrier kernel/workqueue.c:3790 [inline]
start_flush_work kernel/workqueue.c:4142 [inline]
__flush_work+0x30b/0x570 kernel/workqueue.c:4178
flush_work kernel/workqueue.c:4229 [inline]
...
read to 0xffff8881223aa3e8 of 8 bytes by task 50 on cpu 1:
__flush_work+0x42a/0x570 kernel/workqueue.c:4188
flush_work kernel/workqueue.c:4229 [inline]
flush_delayed_work+0x66/0x70 kernel/workqueue.c:4251
...
value changed: 0x0000000000400000 -> 0xffff88810006c00d
Reorganize the code so that @from_cancel is tested before @work->data is
accessed. The only problem is triggering KCSAN detection spuriously. This
shouldn't need READ_ONCE() or other access qualifiers.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+b3e4f2f51ed645fd5df2@syzkaller.appspotmail.com
Fixes: 134874e2ee ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items")
Link: http://lkml.kernel.org/r/000000000000ae429e061eea2157@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Pull workqueue updates from Tejun Heo:
- Lai fixed a bug where CPU hotplug and workqueue attribute changes
race leaving some workqueues not fully updated. This involved
refactoring and changing how online CPUs are tracked. The resulting
code is cleaner.
- Workqueue watchdog touch operation was causing too much cacheline
contention on very large machines. Nicholas improved scalabililty by
avoiding unnecessary global updates.
- Code cleanups and minor rescuer behavior improvement.
- The last commit 58629d4871 ("workqueue: Always queue work items to
the newest PWQ for order workqueues") is a cherry-picked straggler
commit from for-6.10-fixes, a fix for a bug which may not actually
trigger.
* tag 'wq-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (24 commits)
workqueue: Always queue work items to the newest PWQ for order workqueues
workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()
workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()
workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()
workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()
workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()
workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask
workqueue: Add wq_online_cpumask
workqueue: Init rescuer's affinities as the wq's effective cpumask
workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.
workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()
workqueue: Make rescuer initialization as the last step of the creation of a new wq
workqueue: Register sysfs after the whole creation of the new wq
workqueue: Simplify goto statement
workqueue: Update cpumasks after only applying it successfully
workqueue: Improve scalability of workqueue watchdog touch
workqueue: wq_watchdog_touch is always called with valid CPU
workqueue: Remove useless pool->dying_workers
workqueue: Detach workers directly in idle_cull_fn()
workqueue: Don't bind the rescuer in the last working cpu
...
To ensure non-reentrancy, __queue_work() attempts to enqueue a work
item to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently suggests
non-reentrancy, but it could also disrupt the sequence if the item is
not enqueued on the newest PWQ.
Just queue it to the newest PWQ and let order management guarantees
non-reentrancy.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 4c065dbce1 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org # v6.9+
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
What wq_update_pod() does is just to update the pwq of the specific
cpu. Rename it and update the comments.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The arguments @hotplug_cpu and @online are not used in wq_update_pod()
since the functions called by wq_update_pod() don't need them.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_calc_pod_cpumask() uses wq_online_cpumask, which excludes the cpu
going down, so the argument cpu_going_down is unused and can be removed.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpumask empty check in wq_calc_pod_cpumask() has long been useless.
It just works purely as documents which states that the cpumask is not
possible empty after the function returns.
Now the code above is even more explicit that the cpumask is not empty,
so the document-only empty check can be removed.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same
lock C.S.") led to the following possible deadlock:
WARNING: possible recursive locking detected
6.10.0-rc5-00004-g1d4c6111406c #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730)
but task is already holding lock:
c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: padata_alloc (kernel/padata.c:1007)
...
stack backtrace:
...
cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:488)
alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730)
padata_alloc (kernel/padata.c:1007 (discriminator 1))
pcrypt_init_padata (crypto/pcrypt.c:327 (discriminator 1))
pcrypt_init (crypto/pcrypt.c:353)
do_one_initcall (init/main.c:1267)
do_initcalls (init/main.c:1328 (discriminator 1) init/main.c:1345 (discriminator 1))
kernel_init_freeable (init/main.c:1364)
kernel_init (init/main.c:1469)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
entry_INT80_32 (arch/x86/entry/entry_32.S:944)
This is caused by pcrypt allocating a workqueue while holding
cpus_read_lock(), so workqueue code can't do it again as that can lead to
deadlocks if down_write starts after the first down_read.
The pwq creations and installations have been reworked based on
wq_online_cpumask rather than cpu_online_mask making cpus_read_lock() is
unneeded during wqattrs changes. Fix the deadlock by removing
cpus_read_lock() from apply_wqattrs_lock().
tj: Updated changelog.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 1726a17135 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.")
Link: http://lkml.kernel.org/r/202407081521.83b627c1-lkp@intel.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Avoid relying on cpu_online_mask for wqattrs changes so that
cpus_read_lock() can be removed from apply_wqattrs_lock().
And with wq_online_cpumask, attrs->__pod_cpumask doesn't need to be
reused as a temporary storage to calculate if the pod have any online
CPUs @attrs wants since @cpu_going_down is not in the wq_online_cpumask.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The new wq_online_mask mirrors the cpu_online_mask except during
hotplugging; specifically, it differs between the hotplugging stages
of workqueue_offline_cpu() and workqueue_online_cpu(), during which
the transitioning CPU is not represented in the mask.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The PWQ allocation and WQ enlistment are not within the same lock-held
critical section; therefore, their states can become out of sync when
the user modifies the unbound mask or if CPU hotplug events occur in
the interim since those operations only update the WQs that are already
in the list.
Make the PWQ allocation and WQ enlistment atomic.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>