Pull workqueue updates from Tejun Heo:
"Nothing major. I introduced a flag collsion bug during v4.13 cycle
which is fixed in this pull request. Fortunately, the flag is for
debugging / verification and the bug isn't critical"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix flag collision
workqueue: Use TASK_IDLE
workqueue: fix path to documentation
workqueue: doc change for ST behavior on NUMA systems
The new completion/crossrelease annotations interact unfavourable with
the extant flush_work()/flush_workqueue() annotations.
The problem is that when a single work class does:
wait_for_completion(&C)
and
complete(&C)
in different executions, we'll build dependencies like:
lock_map_acquire(W)
complete_acquire(C)
and
lock_map_acquire(W)
complete_release(C)
which results in the dependency chain: W->C->W, which lockdep thinks
spells deadlock, even though there is no deadlock potential since
works are ran concurrently.
One possibility would be to change the work 'lock' to recursive-read,
but that would mean hitting a lockdep limitation on recursive locks.
Also, unconditinoally switching to recursive-read here would fail to
detect the actual deadlock on single-threaded workqueues, which do
have a problem with this.
For now, forcefully disregard these locks for crossrelease.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The flush_work() annotation as introduced by commit:
e159489baa ("workqueue: relax lockdep annotation on flush_work()")
hits on the lockdep problem with recursive read locks.
The situation as described is:
Work W1: Work W2: Task:
ARR(Q) ARR(Q) flush_workqueue(Q)
A(W1) A(W2) A(Q)
flush_work(W2) R(Q)
A(W2)
R(W2)
if (special)
A(Q)
else
ARR(Q)
R(Q)
where: A - acquire, ARR - acquire-read-recursive, R - release.
Where under 'special' conditions we want to trigger a lock recursion
deadlock, but otherwise allow the flush_work(). The allowing is done
by using recursive read locks (ARR), but lockdep is broken for
recursive stuff.
However, there appears to be no need to acquire the lock if we're not
'special', so if we remove the 'else' clause things become much
simpler and no longer need the recursion thing at all.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Workqueues don't use signals, it (ab)uses TASK_INTERRUPTIBLE to avoid
increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE
for this case:
80ed87c8a9 ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:
> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done
Such a dead lock can not happen because Task C's barr->done and Worker D's
barr->done can not be the same instance.
The reason of this false positive is we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class, therefore, impossible circles are reported.
To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct, as a result we won't build
a dependency between a wq_barrier with a unrelated work, and we can
differ wq barriers based on the related works, so the false positive
above is avoided.
Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is an underlying assumption/trade-off in many layers of the Linux
system that CPU <-> node mapping is static. This is despite the presence
of features like NUMA and 'hotplug' that support the dynamic addition/
removal of fundamental system resources like CPUs and memory. PowerPC
systems, however, do provide extensive features for the dynamic change
of resources available to a system.
Currently, there is little or no synchronization protection around the
updating of the CPU <-> node mapping, and the export/update of this
information for other layers / modules. In systems which can change
this mapping during 'hotplug', like PowerPC, the information is changing
underneath all layers that might reference it.
This patch attempts to ensure that a valid, usable cpumask attribute
is used by the workqueue infrastructure when setting up new resource
pools. It prevents a crash that has been observed when an 'empty'
cpumask is passed along to the worker/task scheduling code. It is
intended as a temporary workaround until a more fundamental review and
correction of the issue can be done.
[With additions to the patch provided by Tejun Hao <tj@kernel.org>]
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1. Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.
This patch distinguishes explicit and implicit ordered setting and
overrides from attribute changes if implict.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
The combination of WQ_UNBOUND and max_active == 1 used to imply
ordered execution. After NUMA affinity 4c16bd327c ("workqueue:
implement NUMA affinity for unbound workqueues"), this is no longer
true due to per-node worker pools.
While the right way to create an ordered workqueue is
alloc_ordered_workqueue(), the documentation has been misleading for a
long time and people do use WQ_UNBOUND and max_active == 1 for ordered
workqueues which can lead to subtle bugs which are very difficult to
trigger.
It's unlikely that we'd see noticeable performance impact by enforcing
ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's
automatically set __WQ_ORDERED for those workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: Alexei Potashnik <alexei@purestorage.com>
Fixes: 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues")
Cc: stable@vger.kernel.org # v3.10+
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull workqueue update from Tejun Heo:
"One trivial patch to use setup_deferrable_timer() instead of
open-coding the initialization"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use setup_deferrable_timer
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early(). Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.
Unable to handle kernel paging request for data at address 0x00000038
Faulting instruction address: 0xc0000000000fc0cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
task: c0000007f5400000 task.stack: c000001ffc084000
NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
LR [c0000000000ed928] activate_task+0x78/0xe0
Call Trace:
[c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
[c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
[c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
[c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
[c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
[c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
[c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
---[ end trace 0000000000000000 ]---
Fix it by moving wq_numa_init() to workqueue_init(). As this means
that the early intialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue is currently initialized in an early init call; however,
there are cases where early boot code has to be split and reordered to
come after workqueue initialization or the same code path which makes
use of workqueues is used both before workqueue initailization and
after. The latter cases have to gate workqueue usages with
keventd_up() tests, which is nasty and easy to get wrong.
Workqueue usages have become widespread and it'd be a lot more
convenient if it can be used very early from boot. This patch splits
workqueue initialization into two steps. workqueue_init_early() which
sets up the basic data structures so that workqueues can be created
and work items queued, and workqueue_init() which actually brings up
workqueues online and starts executing queued work items. The former
step can be done very early during boot once memory allocation,
cpumasks and idr are initialized. The latter right after kthreads
become available.
This allows work item queueing and canceling from very early boot
which is what most of these use cases want.
* As systemd_wq being initialized doesn't indicate that workqueue is
fully online anymore, update keventd_up() to test wq_online instead.
The follow-up patches will get rid of all its usages and the
function itself.
* Flushing doesn't make sense before workqueue is fully initialized.
The flush functions trigger WARN and return immediately before fully
online.
* Work items are never in-flight before fully online. Canceling can
always succeed by skipping the flush step.
* Some code paths can no longer assume to be called with irq enabled
as irq is disabled during early boot. Use irqsave/restore
operations instead.
v2: Watchdog init, which requires timer to be running, moved from
workqueue_init_early() to workqueue_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com