Currently, rebind_workers() and idle_worker_rebind() are two-way
interlocked. rebind_workers() waits for idle workers to finish
rebinding and rebound idle workers wait for rebind_workers() to finish
rebinding busy workers before proceeding.
Unfortunately, this isn't enough. The second wait from idle workers
is implemented as follows.
wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
then returns. If CPU hotplug cycle happens again before one of the
idle workers finishes the above wait_event(), rebind_workers() will
repeat the first part of the handshake - set WORKER_REBIND again and
wait for the idle worker to finish rebinding - and this leads to
deadlock because the idle worker would be waiting for WORKER_REBIND to
clear.
This is fixed by adding another interlocking step at the end -
rebind_workers() now waits for all the idle workers to finish the
above WORKER_REBIND wait before returning. This ensures that all
rebinding steps are complete on all idle workers before the next
hotplug cycle can happen.
This problem was diagnosed by Lai Jiangshan who also posted a patch to
fix the issue, upon which this patch is based.
This is the minimal fix and further patches are scheduled for the next
merge window to simplify the CPU hotplug path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
This doesn't make any functional difference and is purely to help the
next patch to be simpler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
The compiler may compile the following code into TWO write/modify
instructions.
worker->flags &= ~WORKER_UNBOUND;
worker->flags |= WORKER_REBIND;
so the other CPU may temporarily see worker->flags which doesn't have
either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
prematurely.
Fix it by using single explicit assignment via ACCESS_ONCE().
Because idle workers have another WORKER_NOT_RUNNING flag, this bug
doesn't exist for them; however, update it to use the same pattern for
consistency.
tj: Applied the change to idle workers too and updated comments and
patch description a bit.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.
Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue. If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.
This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Up to now, for delayed_works, try_to_grab_pending() couldn't be used
from IRQ handlers because IRQs may happen while
delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
This patch makes delayed_work use the new TIMER_IRQSAFE flag for
delayed_work->timer. This makes try_to_grab_pending() and thus
mod_delayed_work_on() safe to call from IRQ handlers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq(). Replace the latter with
wrappers around system[_freezable]_wq(). The wrapping goes through
inline functions so that __deprecated can be added easily.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work(). Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().
* start_flush_work() no longer takes @wait_executing as the only left
user - flush_work() - always sets it to %true.
* __cancel_work_timer() uses flush_work() instead of wait_on_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.
tj: CPU up/downs care quite a bit about latency these days. This
shouldn't hurt anything and makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
Currently, in this case, we use only system_wq. This makes a possible
error situation as there is mismatch between cwq->pool and worker->pool.
To prevent this, we should use system_highpri_wq for highpri worker
to match theses. This implements it.
tj: Rephrased comment a bit.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
or highpri worker. But, we don't consider this difference in rebind_workers(),
we use just system_wq for highpri worker. It makes mismatch between
cwq->pool and worker->pool.
It doesn't make error in current implementation, but possible in the future.
Now, we introduce system_highpri_wq to use proper cwq for highpri workers
in rebind_workers(). Following patch fix this issue properly.
tj: Even apart from rebinding, having system_highpri_wq generally
makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We assign cpu id into work struct's data field in __queue_delayed_work_on().
In current implementation, when work is come in first time,
current running cpu id is assigned.
If we do __queue_delayed_work_on() with CPU A on CPU B,
__queue_work() invoked in delayed_work_timer_fn() go into
the following sub-optimal path in case of WQ_NON_REENTRANT.
gcwq = get_gcwq(cpu);
if (wq->flags & WQ_NON_REENTRANT &&
(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND.
It is sufficient to prevent to go into sub-optimal path.
tj: Slightly rephrased the comment.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we do tracing workqueue_queue_work(), it records requested cpu.
But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
requested cpu is changed as local cpu.
In case of @wq->flag & WQ_UNBOUND, above change is not occured,
therefore it is reasonable to correct it.
Use temporary local variable for storing requested cpu.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
size of pools, definition of worker_pool in gcwq doesn't use it.
Using it makes code robust and prevent future mistakes.
So change code to use this enum value.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.
There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
the former already does smp_wmb() but the latter doesn't have any.
Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer. The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.
Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer. Currently, it chooses the local CPU
flush is running on. This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.
There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off. This patch adds delayed_work->cpu to
remember the CPU it's queued for.
Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change. As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work. delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.
This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.
v2: Updated to reflect try_to_grab_pending() changes. Now safe to be
called from bh context.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or deqeueing the work item. With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.
The other is if multiple __cancel_work_timer() invocations are racing
one another. __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued. try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.
Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails. For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem. For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens. Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.
While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.
This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop. It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
arbitrary amount of time.
__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().
v2: try_to_grab_pending() was testing preempt_count() to ensure that
the caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Updated so that try_to_grab_pending() disables irq on success
rather than requiring preemption disabled by the caller. This
makes busy-looping easier and will allow try_to_grap_pending() to
be used from bh/irq contexts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
* Use bool @is_dwork instead of @timer and let try_to_grab_pending()
use to_delayed_work() to determine the delayed_work address.
* Move timer handling from __cancel_work_timer() to
try_to_grab_pending().
* Make try_to_grab_pending() use -EAGAIN instead of -1 for
busy-looping and drop the ret local variable.
* Add proper function comment to try_to_grab_pending().
This makes the code a bit easier to understand and will ease further
changes. This patch doesn't make any functional change.
v2: Use @is_dwork instead of @timer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.
Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue. There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue. This patch doesn't
define any actual OFFQ flag yet.
Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.
To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accomodate off-queue CPU number
is added.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on](). Move try_to_grab_pending() and related
functions above queueing functions.
This patch only moves functions around.
Signed-off-by: Tejun Heo <tj@kernel.org>
If @delay is zero and the dealyed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.
This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Queueing functions have been using different methods to determine the
local CPU.
* queue_work() superflously uses get/put_cpu() to acquire and hold the
local CPU across queue_work_on().
* delayed_work_timer_fn() uses smp_processor_id().
* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
which is interpreted as the local CPU.
* flush_delayed_work[_sync]() were using raw_smp_processor_id().
* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
target workqueue is bound one but nobody uses this.
This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().
Signed-off-by: Tejun Heo <tj@kernel.org>