It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.
But blk_mq_quiesce_queue() still may not block/drain dispatch in the
the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.
Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses, and easy to cause race, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_start_stopped_hw_queues() is used implictly
as counterpart of blk_mq_quiesce_queue() for unquiescing queue,
so we introduce blk_mq_unquiesce_queue() and make it
as counterpart of blk_mq_quiesce_queue() explicitly.
This function is for improving the current quiescing mechanism
in the following patches.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move most code into blk_mq_rq_ctx_init, and the rest into
blk_mq_get_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers. For Kyber and additional limit_depth method
is added that is called before allocating the request.
Also because none of the intializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.
Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_sched_assign_ioc now only handles the assigned of the ioc if
the schedule needs it (bfq only at the moment). The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge three functions only tail-called by blk_mq_free_request into
blk_mq_free_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having these as separate helpers in a header really does not help
readability, or my chances to refactor this code sanely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having them out of line in blk-mq-sched.c just makes the code flow
unnecessarily complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When direct issue is done on request picked up from plug list,
the hctx need to be updated with the actual hw queue, otherwise
wrong hctx is used and may hurt performance, especially when
wrong SRCU readlock is acquired/released
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The tagset lock needs to be held when iterating the tag_list, so a
lockdep assert was added when updating number of hardware queues. The
drivers calling this API, however, were unaware of the new requirement,
so are failing the assertion.
This patch takes the lock within the blk-mq function so the drivers do
not have to be modified in order to be safe.
Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because what the per-sw-queue bio merge does is basically same with
scheduler's .bio_merge(), this patch makes per-sw-queue bio merge
as the default .bio_merge if no scheduler is used or io scheduler
doesn't provide .bio_merge().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Before blk-mq is introduced, I/O is merged to elevator
before being putted into plug queue, but blk-mq changed the
order and makes merging to sw queue basically impossible.
Then it is observed that throughput of sequential I/O is degraded
about 10%~20% on virtio-blk in the test[1] if mq-deadline isn't used.
This patch moves the bio merging per sw queue before plugging,
like what blk_queue_bio() does, and the performance regression is
fixed under this situation.
[1]. test script:
sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json
RW=read or write
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.
Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error
We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.
1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
With the fix, It showed us both of the first op and the second op have
correct bi and pi address.
1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Making __blk_mq_stop_hw_queues static fixes sparse warning:
block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
declared. Should it be static?
Fixes: 2719aa217e ("blk-mq: don't use sync workqueue flushing from drivers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This can be triggered by hot-unplug one cpu.
======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0+ #17 Not tainted
-------------------------------------------------------
step_after_susp/2640 is trying to acquire lock:
(all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug.lock){+.+.+.}:
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
get_online_cpus+0x64/0x80
blk_mq_init_allocated_queue+0x3a0/0x4e0
blk_mq_init_queue+0x3a/0x60
loop_add+0xe5/0x280
loop_init+0x124/0x177
do_one_initcall+0x53/0x1c0
kernel_init_freeable+0x1e3/0x27f
kernel_init+0xe/0x100
ret_from_fork+0x31/0x40
-> #0 (all_q_mutex){+.+...}:
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
do_syscall_64+0x8f/0x710
return_from_SYSCALL_64+0x0/0x7a
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(all_q_mutex);
lock(cpu_hotplug.lock);
lock(all_q_mutex);
*** DEADLOCK ***
8 locks held by step_after_susp/2640:
#0: (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
#1: (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
#2: (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
#3: (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
#4: (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
#5: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
#6: (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
#7: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
stack backtrace:
CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
Call Trace:
dump_stack+0x99/0xce
print_circular_bug+0x1fa/0x270
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
? lock_acquire+0x11c/0x230
? blk_mq_queue_reinit_work+0x18/0x110
? blk_mq_queue_reinit_work+0x18/0x110
__mutex_lock+0x92/0x990
? blk_mq_queue_reinit_work+0x18/0x110
? kmem_cache_free+0x2cb/0x330
? anon_transport_class_unregister+0x20/0x20
? blk_mq_queue_reinit_work+0x110/0x110
mutex_lock_nested+0x1b/0x20
? mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
? __flow_cache_shrink+0x160/0x160
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
? rcu_read_lock_sched_held+0x79/0x80
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
? rcu_read_lock_sched_held+0x79/0x80
? rcu_sync_lockdep_assert+0x2f/0x60
? __sb_start_write+0xd9/0x1c0
? vfs_write+0x1ad/0x1d0
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
? rcu_read_lock_sched_held+0x79/0x80
do_syscall_64+0x8f/0x710
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25
The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting
queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will
contend these two locks in the inversion order. This is due to commit eabe06595d
(blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion
issue because of hotplug rework, however the hotplug rework is still work-in-progress
and lives in a -tip branch and mainline cannot yet trigger that splat. The commit
breaks the linus's tree in the merge window, so this patch reverts the lock order
and avoids to splat linus's tree.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>