This function will be used in a later patch to switch the struct
request_queue q_usage_counter from killed back to live. In contrast
to percpu_ref_reinit(), this new function does not require that the
refcount is zero.
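
The difference from reinit can be illustrated with a toy, single-threaded
model. This is only a sketch: struct toy_ref and the toy_* helpers are
hypothetical names invented for illustration, not the kernel API, and the
model ignores percpu counters and RCU entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; hypothetical names, not the kernel implementation. */
struct toy_ref {
	unsigned long count;	/* outstanding references, base ref included */
	bool dead;		/* true between kill and revival */
};

static void toy_init(struct toy_ref *r)
{
	r->count = 1;		/* the base reference */
	r->dead = false;
}

/* Kill: mark the ref dead and drop the base reference. */
static void toy_kill(struct toy_ref *r)
{
	r->dead = true;
	r->count--;
}

/* Reinit-style revival: refused unless every reference has been dropped. */
static bool toy_reinit(struct toy_ref *r)
{
	if (!r->dead || r->count != 0)
		return false;
	r->dead = false;
	r->count = 1;
	return true;
}

/* Resurrect-style revival: allowed while references are still held. */
static bool toy_resurrect(struct toy_ref *r)
{
	if (!r->dead)
		return false;
	r->dead = false;
	r->count++;		/* take the base reference again */
	return true;
}
```

In this model, a ref that is killed while one user still holds a reference
cannot be revived through toy_reinit() but can through toy_resurrect().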
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
percpu_ref internally uses sched-RCU to implement the percpu -> atomic
mode switching and the documentation suggested that this could be
depended upon. This doesn't seem like a good idea.
* percpu_ref uses sched-RCU which has different grace periods from regular
RCU. Users may combine percpu_ref with regular RCU usage and
incorrectly believe that regular RCU grace periods are performed by
percpu_ref. This can lead to, for example, use-after-free due to
premature freeing.
* percpu_ref has a grace period when switching from percpu to atomic
mode. It doesn't have one between the last put and release. This
distinction is subtle and can lead to surprising bugs.
* percpu_ref allows starting in and switching to atomic mode manually
for debugging and other purposes. This means that there may not be
any grace periods from kill to release.
This patch makes it clear that the grace periods are percpu_ref's
internal implementation detail and can't be depended upon by the
users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Because READ_ONCE() now implies smp_read_barrier_depends(), this commit
removes the now-redundant smp_read_barrier_depends() following the
READ_ONCE() in __ref_is_percpu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then
waits for it to complete.
Also export percpu_ref_switch_to_* so they can be used from modules.
This will be used in md/raid to count the number of pending write
requests to an array.
We occasionally need to check if the count is zero, but most often
we don't care.
We always want updates to the counter to be fast, as in some cases
we count every 4K page.
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
This patch targets two things which are related to ->confirm_switch:
1. Init ->confirm_switch pointer with NULL on percpu_ref_init() or
kernel frightfully complains with WARN_ON_ONCE(ref->confirm_switch)
at __percpu_ref_switch_to_atomic if memory chunk was not properly
zeroed.
2. Warn if RCU callback is still in progress on percpu_ref_exit().
The race still exists, because percpu_ref_call_confirm_rcu()
drops ->confirm_switch to NULL early, but that is only a warning
and still the caller is responsible that ref is no longer in
active use. Hopefully that can help to catch incorrect usage
of percpu-refcount.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
percpu_ref initially didn't have explicit mode switching operations.
It started out in percpu mode and switched to atomic mode on kill and
then released. Ensuring that kill operation is initiated only after
init completes was naturally the caller's responsibility.
percpu_ref_reinit() was introduced later but it didn't shift the
synchronization responsibility. Reinit can't be performed until kill
is confirmed, so there was nothing to worry about
synchronization-wise. Also, as both reinit and kill manipulate the
base reference, invocations of the same function couldn't be allowed
to race each other.
The latest additions of percpu_ref_switch_to_atomic/percpu() changed
the situation. These two functions can be called any time as long as
the percpu_ref is between init and exit and thus there are valid
usage scenarios where these new functions race with each other or
against reinit/kill. Mostly from inertia, f47ad45784 ("percpu_ref:
decouple switching to percpu mode and reinit") still left
synchronization among percpu mode switching operations to its users.
That the new switch functions can be freely mixed with kill/reinit but
the operations themselves should be synchronized is too subtle a
requirement and led to a very subtle race condition in blk-mq freezing
path.
This patch fixes the situation by introducing percpu_ref_switch_lock
to protect mode switching operations. This ensures that percpu-ref
users don't have to worry about mode changing operations racing
against each other, e.g. switch_to_percpu against kill, as long as the
sequence of operations is valid.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: http://lkml.kernel.org/g/1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
Fixes: f47ad45784 ("percpu_ref: decouple switching to percpu mode and reinit")
Restructure atomic/percpu mode switching.
* The users of __percpu_ref_switch_to_atomic/percpu() now call a new
function __percpu_ref_switch_mode() which calls either of the
original switching functions depending on the current state of
ref->force_atomic and the __PERCPU_REF_DEAD flag. The callers no
longer check whether switching is necessary but always invoke
__percpu_ref_switch_mode().
* !ref->confirm_switch waiting is collected into
__percpu_ref_switch_mode().
This patch doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
When an atomic or percpu switching starts before the previous atomic
switching finishes, the resulting behaviors are:
* If the new atomic switching has confirmation callback, it waits
for the previous atomic switching to complete.
* If the new percpu switching is the first percpu switching following
the previous atomic switching, it waits for the previous atomic
switching to complete.
No percpu_ref user depends on these subtleties. The only meaningful
part is that, if the caller ensures that atomic switching isn't in
progress, mode switching operations can be issued from any context.
This patch pulls the wait logic to the top of both switching functions
so that they always wait for the previous atomic switching to
complete. This makes the behavior simpler and consistent for both
directions and will help allow concurrent invocations of mode
switching functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reorganize __percpu_ref_switch_to_atomic() so that it looks
structurally similar to __percpu_ref_switch_to_percpu() and relocate
percpu_ref_switch_to_atomic so that the two internal functions are
co-located.
This patch doesn't introduce any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
At the beginning, percpu_ref guaranteed an RCU grace period between a
call to percpu_ref_kill_and_confirm() and the invocation of the
confirmation callback. This guarantee exposed internal implementation
details and got rescinded while switching over to sched RCU; however,
__percpu_ref_switch_to_atomic() still inserts a full sched RCU grace
period even when it can simply wait for the previous attempt.
Remove the unnecessary grace period and perform the confirmation
synchronously for staggered atomic switching attempts. Update
comments accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
All are in comments.
Signed-off-by: Bogdan Sikora <bsikora@redhat.com>
Cc: <linux-mm@kvack.org>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jan Kara <jack@suse.cz>
[jkosina@suse.cz: more fixup]
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently, a percpu_ref which is initialized with
PERCPU_REF_INIT_ATOMIC or switched to atomic mode via
switch_to_atomic() automatically reverts to percpu mode on the first
percpu_ref_reinit(). This makes the atomic mode difficult to use for
cases where a percpu_ref is used as a persistent on/off switch which
may be cycled multiple times.
This patch makes such atomic state sticky so that it survives through
kill/reinit cycles. After this patch, atomic state is cleared only by
an explicit percpu_ref_switch_to_percpu() call.
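
The intended behavior can be sketched with a small state model. The names
below (struct sticky_ref and the sticky_* helpers) are hypothetical and
exist only for illustration; the model assumes the operating mode is
atomic whenever atomic mode was explicitly requested or the ref is dead.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state model; hypothetical names, not the kernel implementation. */
struct sticky_ref {
	bool force_atomic;	/* sticky request for atomic mode */
	bool dead;		/* set by kill, cleared by reinit */
	bool atomic;		/* current operating mode */
};

/* A dead ref is always atomic; otherwise the sticky request decides. */
static void sticky_update_mode(struct sticky_ref *r)
{
	r->atomic = r->force_atomic || r->dead;
}

static void sticky_switch_to_atomic(struct sticky_ref *r)
{
	r->force_atomic = true;
	sticky_update_mode(r);
}

static void sticky_switch_to_percpu(struct sticky_ref *r)
{
	r->force_atomic = false;
	sticky_update_mode(r);
}

static void sticky_kill(struct sticky_ref *r)
{
	r->dead = true;
	sticky_update_mode(r);
}

/* Reinit clears only the dead state; it leaves force_atomic alone. */
static void sticky_reinit(struct sticky_ref *r)
{
	r->dead = false;
	sticky_update_mode(r);
}
```

In this model a switch_to_atomic, kill, reinit sequence leaves the ref in
atomic mode, and only sticky_switch_to_percpu() brings it back.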
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.
Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.
This patch adds @flags to percpu_ref_init() and implements the
following flags.
* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained
These flags should be able to serve the above two use cases.
v2: target_core_tpg.c conversion was missing. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
percpu_ref has treated the dropping of the base reference and
switching to atomic mode as an integral operation; however, there's
nothing inherent tying the two together.
The use cases for percpu_ref have been expanding continuously. While
the current init/kill/reinit/exit model can cover a lot, the coupling
of kill/reinit with atomic/percpu mode switching is turning out to be
too restrictive for use cases where many percpu_refs are created and
destroyed back-to-back with only some of them reaching extended
operation. The coupling also makes implementing always-atomic debug
mode difficult.
This patch separates out percpu mode switching into
percpu_ref_switch_to_percpu() and reimplements percpu_ref_reinit() on
top of it.
* DEAD still requires ATOMIC. A dead ref can't be switched to percpu
mode w/o going through reinit.
v2: __percpu_ref_switch_to_percpu() was missing static. Fixed.
Reported by Fengguang aka kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
percpu_ref has treated the dropping of the base reference and
switching to atomic mode as an integral operation; however, there's
nothing inherent tying the two together.
The use cases for percpu_ref have been expanding continuously. While
the current init/kill/reinit/exit model can cover a lot, the coupling
of kill/reinit with atomic/percpu mode switching is turning out to be
too restrictive for use cases where many percpu_refs are created and
destroyed back-to-back with only some of them reaching extended
operation. The coupling also makes implementing always-atomic debug
mode difficult.
This patch separates out atomic mode switching into
percpu_ref_switch_to_atomic() and reimplements
percpu_ref_kill_and_confirm() on top of it.
* The handling of __PERCPU_REF_ATOMIC and __PERCPU_REF_DEAD is now
differentiated. Among get/put operations, percpu_ref_tryget_live()
is the only one which cares about DEAD.
* percpu_ref_switch_to_atomic() can be called multiple times on the
same ref. This means that multiple @confirm_switch may get queued
up which we can't do reliably without extra memory area. This is
handled by making the later invocation synchronously wait for the
completion of the previous one. This isn't particularly desirable
but such synchronous waits shouldn't happen in most cases.
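
The ATOMIC/DEAD distinction for get operations can be sketched as follows.
This is a simplified single-threaded model with hypothetical names
(struct live_ref, live_ref_tryget(), live_ref_tryget_live()); it does not
reproduce the percpu fast path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; hypothetical names, not the kernel implementation. */
struct live_ref {
	unsigned long count;	/* outstanding references */
	bool atomic;		/* operating mode, irrelevant to the gets below */
	bool dead;		/* set once the ref has been killed */
};

/* Plain tryget: fails only when the count has already reached zero. */
static bool live_ref_tryget(struct live_ref *r)
{
	if (r->count == 0)
		return false;
	r->count++;
	return true;
}

/* tryget_live: additionally refuses once the ref has been killed. */
static bool live_ref_tryget_live(struct live_ref *r)
{
	if (r->dead)
		return false;
	return live_ref_tryget(r);
}
```

A ref that is merely in atomic mode still hands out live references in
this model; only the dead state makes live_ref_tryget_live() fail.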
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled. In preparation, add
PCPU_REF_DEAD and PCPU_REF_ATOMIC_DEAD which is OR of ATOMIC and DEAD.
For now, ATOMIC and DEAD are changed together and all PCPU_REF_ATOMIC
uses are converted to PCPU_REF_ATOMIC_DEAD without causing any
behavior changes.
percpu_ref_init() now specifies an explicit alignment when allocating
the percpu counters so that the pointer has enough unused low bits to
accommodate the flags. Note that one flag was fine as min alignment
for percpu memory is 2 bytes but two flags are already too many for
the natural alignment of unsigned longs on archs like cris and m68k.
v2: The original patch had BUILD_BUG_ON() which triggers if unsigned
long's alignment isn't enough to accommodate the flags, which
triggered on cris and m68k. percpu_ref_init() updated to specify
the required alignment explicitly. Reported by Fengguang.
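
The low-bit trick described above can be sketched in userspace C. This is
an illustration under assumptions, not the kernel code: the REF_* names
and helper functions are hypothetical, and C11 aligned_alloc() stands in
for the percpu allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag names modelled on the text above. */
enum {
	REF_ATOMIC	= 1 << 0,
	REF_DEAD	= 1 << 1,
	REF_FLAG_BITS	= 2,
	REF_FLAG_MASK	= (1 << REF_FLAG_BITS) - 1,
};

/* Allocate a counter whose address has REF_FLAG_BITS zero low bits. */
static unsigned long *alloc_counter(void)
{
	size_t align = (size_t)1 << REF_FLAG_BITS;
	size_t size;
	unsigned long *p;

	if (align < _Alignof(unsigned long))
		align = _Alignof(unsigned long);
	/* aligned_alloc() wants the size to be a multiple of the alignment */
	size = (sizeof(unsigned long) + align - 1) / align * align;
	p = aligned_alloc(align, size);
	if (p)
		*p = 0;
	return p;
}

/* Store the flags in the unused low bits of the counter address. */
static uintptr_t pack(unsigned long *counter, unsigned int flags)
{
	assert(((uintptr_t)counter & REF_FLAG_MASK) == 0);
	return (uintptr_t)counter | (flags & REF_FLAG_MASK);
}

static unsigned long *unpack_ptr(uintptr_t word)
{
	return (unsigned long *)(word & ~(uintptr_t)REF_FLAG_MASK);
}

static unsigned int unpack_flags(uintptr_t word)
{
	return (unsigned int)(word & REF_FLAG_MASK);
}
```

With only 2-byte alignment the second flag bit would overlap the address
itself, which is why the alignment is requested explicitly.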
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled. In preparation, do the following
renames.
* percpu_ref->confirm_kill -> percpu_ref->confirm_switch
* __PERCPU_REF_DEAD -> __PERCPU_REF_ATOMIC
* __percpu_ref_alive() -> __ref_is_percpu()
This patch is pure rename and doesn't introduce any functional
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
percpu_ref uses pcpu_ prefix for internal stuff and percpu_ for
externally visible ones. This is the same convention used in the
percpu allocator implementation. It works fine there but percpu_ref
doesn't have too much internal-only stuff and scattered usages of
pcpu_ prefix are more confusing than helpful.
This patch replaces all pcpu_ prefixes with percpu_. This is pure
rename and there's no functional change. Note that PCPU_REF_DEAD is
renamed to __PERCPU_REF_DEAD to signify that the flag is internal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
* Some comments became stale. Updated.
* percpu_ref_tryget() unnecessarily initializes @ret. Removed.
* A blank line removed from percpu_ref_kill_rcu().
* Explicit function name in a WARN format string replaced with __func__.
* WARN_ON() in percpu_ref_reinit() converted to WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
percpu_ref is gonna go through restructuring. Move
percpu_ref_reinit() after percpu_ref_kill_and_confirm(). This will
make later changes easier to follow and result in cleaner
organization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
This reverts commit 0a30288da1, which
was a temporary fix for SCSI blk-mq stall issue. The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit will be reverted and patches implementing a proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and is used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measurable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measurable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.
This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda9 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
percpu_ref is currently based on ints and the number of refs it can
cover is (1 << 31). This makes it impossible to use a percpu_ref to
count memory objects or pages on 64bit machines as it may overflow.
This forces those users to somehow aggregate the references before
contributing to the percpu_ref, which is often cumbersome and sometimes
makes it challenging to reach the same level of performance as using the
percpu_ref directly.
While using ints for the percpu counters makes them pack tighter on
64bit machines, the possible gain from using ints instead of longs is
extremely small compared to the overall gain from per-cpu operation.
This patch makes percpu_ref based on longs so that it can be used to
directly count memory objects or pages.
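
To put the (1 << 31) limit in perspective, the following sketch computes
how much memory can be covered when each page holds one reference. The
4K page size is an assumption made for the example, and countable_bytes()
is a hypothetical helper, not part of percpu_ref.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of memory coverable with one reference per page before a counter
 * with `bits` usable bits runs out. Only meaningful while the product
 * fits in 64 bits.
 */
static uint64_t countable_bytes(unsigned int bits, uint64_t page_size)
{
	return (UINT64_C(1) << bits) * page_size;
}
```

For 31 usable bits and 4096-byte pages this gives 2^43 bytes, i.e. 8 TiB,
and the limit is reached sooner when several references are taken per
page or when smaller objects are counted.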
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
percpu_ref's WARN messages can be a lot more helpful by indicating
who's the culprit. Make them report the release function that the
offending percpu-refcount is associated with. This should make it a
lot easier to track down the reported invalid refcnting operations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>