Commit Graph

565 Commits

Author SHA1 Message Date
huhai b4f6f38d9f blk-mq: remove wrong 'unlikely' check
When dispatch_rq_from_ctx is called, in the vast majority of cases
the ctx->rq_list is not empty.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-22 08:38:04 -06:00
huhai d416c92c5d blk-mq: clear hctx->dispatch_from when mappings change
When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
the ctx mappings, but the current code doesn't clear the
->dispatch_from hint. This can result in dispatch_from pointing to
a ctx that isn't mapped to the hctx anymore.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

Moved the placement of the clearing to where we clear other items
pertaining to the existing mapping, added Fixes line, and reworded
the commit message.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-18 08:35:38 -06:00
huhai 8fa9f55645 blk-mq: remove redundant insert case in blk_mq_make_request()
We can use blk_mq_sched_insert_request() even if we don't have
an IO scheduler attached, since that case will end up being
exactly the same as what blk_mq_queue_io() was doing now.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-16 08:22:47 -06:00
Jens Axboe 17a5119932 blk-mq: don't call into depth limiting for reserved tags
It's not useful, they are internal and/or error handling recovery
commands.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:12 -06:00
Omar Sandoval 522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
Omar Sandoval 4bc6339a58 block: move blk_stat_add() to __blk_mq_end_request()
We want this next to blk_account_io_done() for the next change so that
we can call ktime_get() only once for both.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:07 -06:00
Omar Sandoval 544ccc8dc9 block: get rid of struct blk_issue_stat
struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:05 -06:00
Omar Sandoval a8a4594170 block: pass struct request instead of struct blk_issue_stat to wbt
issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:02 -06:00
Omar Sandoval bf0ddaba65 blk-mq: fix sysfs inflight counter
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-26 09:02:01 -06:00
Omar Sandoval 6131837b1d blk-mq: count allocated but not started requests in iostats inflight
In the legacy block case, we increment the counter right after we
allocate the request, not when the driver handles it. In both the legacy
and blk-mq cases, part_inc_in_flight() is called from
blk_account_io_start() right after we've allocated the request. blk-mq
only considers requests started requests as inflight, but this is
inconsistent with the legacy definition and the intention in the code.
This removes the started condition and instead counts all allocated
requests.

Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-26 09:02:00 -06:00
Ming Lei 4412efecf7 Revert "blk-mq: remove code for dealing with remapping queue"
This reverts commit 37c7c6c76d.

Turns out some drivers(most are FC drivers) may not use managed
IRQ affinity, and has their customized .map_queues meantime, so
still keep this code for avoiding regression.

Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-25 09:49:22 -06:00
Jianchao Wang f4560231ec blk-mq: start request gstate with gen 1
rq->gstate and rq->aborted_gstate both are zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that are never allocated, and also there could
be rq that has been allocated but not initialized and started. At
the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
the blk_mq_terminate_expired will identify the rq is timed out and
invoke .timeout early.

For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
the moment, then we will get crash.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-16 21:56:41 -06:00
Ming Lei 37c7c6c76d blk-mq: remove code for dealing with remapping queue
Firstly, from commit 4b855ad371 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue any more after CPU topo is changed.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map
all possible CPUs to hw queues, so at least one CPU is mapped to each hctx.

So queue mapping has became static and fixed just like percpu variable, and
we don't need to handle queue remapping any more.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei efea8450c3 blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() returns true always now since each hctx
may be mapped by one CPU at least

2) when there isn't any online CPU mapped to this hctx, there won't
be any IO queued to this CPU, blk_mq_run_hw_queue() only runs queue
if there is IO queued to this hctx

3) If __blk_mq_delay_run_hw_queue() is called by blk_mq_delay_run_hw_queue(),
which is run from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(), and
the hctx to be handled has to be mapped.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei 15fe8a90bb blk-mq: remove blk_mq_delay_queue()
No driver uses this interface any more, so remove it.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei f82ddf1923 blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
This patch introduces helper of blk_mq_hw_queue_first_cpu() for
figuring out the hctx's first cpu, and code duplication can be
avoided.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei 476f8c98a9 blk-mq: avoid to write intermediate result to hctx->next_cpu
This patch figures out the final selected CPU, then writes
it to hctx->next_cpu once, then we can avoid to intermediate
next cpu observed from other dispatch paths.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei a1c735fb79 blk-mq: make sure that correct hctx->next_cpu is set
From commit 20e4d81393 (blk-mq: simplify queue mapping & schedule
with each possisble CPU), one hctx can be mapped from all offline CPUs,
then hctx->next_cpu can be set as wrong.

This patch fixes this issue by making hctx->next_cpu pointing to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Ming Lei 0bca799b92 blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:

1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.

2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.

3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B

4) both two IO pathes can't move on, and IO stall is caused.

This issue can be observed when running dbench on USB storage.

This patch fixes this issue by always getting budget before getting
driver tag.

Cc: stable@vger.kernel.org
Fixes: de14829740 ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-10 08:38:46 -06:00
Bart Van Assche 7dfdbc7367 block: Protect queue flag changes with the queue lock
Since the queue flags may be changed concurrently from multiple
contexts after a queue becomes visible in sysfs, make these changes
safe by protecting these with the queue lock.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche 8814ce8a0f block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()
Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche f78bac2c8e block: Use the queue_flag_*() functions instead of open-coding these
Except for changing the atomic queue flag manipulations that are
protected by the queue lock into non-atomic manipulations, this
patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche 5ee0524ba1 block: Add 'lock' as third argument to blk_alloc_queue_node()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28 12:23:35 -07:00
Omar Sandoval e9a99a6388 block: clear ctx pending bit under ctx lock
When we insert a request, we set the software queue pending bit while
holding the software queue lock. However, we clear it outside of the
lock, so it's possible that a concurrent insert could reset the bit
after we clear it but before we empty the request list. Afterwards, the
bit would still be set but the software queue wouldn't have any requests
in it, leading us to do a spurious run in the future. This is mostly a
benign/theoretical issue, but it makes the following change easier to
justify.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28 12:23:35 -07:00
Ming Lei 105976f517 blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
__blk_mq_requeue_request() covers two cases:

- one is that the requeued request is added to hctx->dispatch, such as
blk_mq_dispatch_rq_list()

- another case is that the request is requeued to io scheduler, such as
blk_mq_requeue_request().

We should call io sched's .requeue_request callback only for the 2nd
case.

Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Omar Sandoval <osandov@fb.com>
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-24 15:55:54 -07:00