Commit Graph

2380 Commits

Author SHA1 Message Date
Bart Van Assche 959f5f5b2f blk-mq: Use all available hardware queues
Suppose that a system has two CPU sockets, three cores per socket,
that it does not support hyperthreading and that four hardware
queues are provided by a block driver. With the current algorithm
this will lead to the following assignment of CPU cores to hardware
queues:

  HWQ 0: 0 1
  HWQ 1: 2 3
  HWQ 2: 4 5
  HWQ 3: (none)

This patch changes the queue assignment into:

  HWQ 0: 0 1
  HWQ 1: 2
  HWQ 2: 3 4
  HWQ 3: 5

In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the
  above example the range of the number of CPU cores associated
  with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQ's is a multiple of the number of CPU sockets
  it is now guaranteed that all CPU cores associated with a single
  HWQ reside on the same CPU socket.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 09:08:21 -07:00
Bart Van Assche 52f7eb945f blk-mq: Micro-optimize bt_get()
Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
calls into a single call.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 09:07:28 -07:00
Bart Van Assche c38d185d4a blk-mq: Fix a race between bt_clear_tag() and bt_get()
What we need is the following two guarantees:
* Any thread that observes the effect of the test_and_set_bit() by
  __bt_get_word() also observes the preceding addition of 'current'
  to the appropriate wait list. This is guaranteed by the semantics
  of the spin_unlock() operation performed by prepare_and_wait().
  Hence the conversion of test_and_set_bit_lock() into
  test_and_set_bit().
* The wait lists are examined by bt_clear() after the tag bit has
  been cleared. clear_bit_unlock() guarantees that any thread that
  observes that the bit has been cleared also observes the store
  operations preceding clear_bit_unlock(). However,
  clear_bit_unlock() does not prevent that the wait lists are examined
  before that the tag bit is cleared. Hence the addition of a memory
  barrier between clear_bit() and the wait list examination.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 09:07:16 -07:00
Bart Van Assche 9e98e9d7cf blk-mq: Avoid that __bt_get_word() wraps multiple times
If __bt_get_word() is called with last_tag != 0, if the first
find_next_zero_bit() fails, if after wrap-around the
test_and_set_bit() call fails and find_next_zero_bit() succeeds,
if the next test_and_set_bit() call fails and subsequently
find_next_zero_bit() does not find a zero bit, then another
wrap-around will occur. Avoid this by introducing an additional
local variable.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 09:07:14 -07:00
Bart Van Assche 45a9c9d909 blk-mq: Fix a use-after-free
blk-mq users are allowed to free the memory request_queue.tag_set
points at after blk_cleanup_queue() has finished but before
blk_release_queue() has started. This can happen e.g. in the SCSI
core. The SCSI core namely embeds the tag_set structure in a SCSI
host structure. The SCSI host structure is freed by
scsi_host_dev_release(). This function is called after
blk_cleanup_queue() finished but can be called before
blk_release_queue().

This means that it is not safe to access request_queue.tag_set from
inside blk_release_queue(). Hence remove the blk_sync_queue() call
from blk_release_queue(). This call is not necessary - outstanding
requests must have finished before blk_release_queue() is
called. Additionally, move the blk_mq_free_queue() call from
blk_release_queue() to blk_cleanup_queue() to avoid that struct
request_queue.tag_set gets accessed after it has been freed.

This patch avoids that the following kernel oops can be triggered
when deleting a SCSI host for which scsi-mq was enabled:

Call Trace:
 [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0
 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
 [<ffffffff8126c140>] kobject_put+0x30/0x70
 [<ffffffff81245895>] blk_put_queue+0x15/0x20
 [<ffffffff8125c409>] disk_release+0x99/0xd0
 [<ffffffff8133d056>] device_release+0x36/0xb0
 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
 [<ffffffff8126c140>] kobject_put+0x30/0x70
 [<ffffffff8125a78a>] put_disk+0x1a/0x20
 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
 [<ffffffff811d56a0>] blkdev_put+0x50/0x160
 [<ffffffff81199eb4>] kill_block_super+0x44/0x70
 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70
 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90
 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
 [<ffffffff8107252c>] task_work_run+0xac/0xe0
 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0
 [<ffffffff814d2c58>] int_signal+0x12/0x17

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-09 09:07:13 -07:00
Ming Lei 19c66e59ce blk-mq: prevent unmapped hw queue from being scheduled
When one hardware queue has no mapped software queues, it
shouldn't have been scheduled. Otherwise WARNING or OOPS
can triggered.

blk_mq_hw_queue_mapped() helper is introduce for fixing
the problem.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 21:37:08 -07:00
Jens Axboe 080ff35114 blk-mq: re-check for available tags after running the hardware queue
If we run out of tags and have to sleep, we run the hardware queue
to kick pending IO into gear. During that run, we may have completed
requests, so re-check if we have free tags before going to sleep.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 08:49:06 -07:00
Bart Van Assche b32232073e blk-mq: fix hang in bt_get()
Avoid that if there are fewer hardware queues than CPU threads that
bt_get() can hang. The symptoms of the hang were as follows:

* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that
  hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-08 08:46:34 -07:00
Shaohua Li 6637fadf25 blk-mq: move the kdump check to blk_mq_alloc_tag_set
We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are
allocated in the former function. So the kdump check should be moved to there
to really save memory.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-30 18:35:39 -07:00
Jens Axboe 70114c393c blk-mq: cleanup tag free handling
We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag()
from blk_mq_put_tag(), so just inline the two calls instead of
having them as separate functions.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 15:52:30 -07:00
Jens Axboe a33c1ba291 blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
We currently use num_possible_cpus(), but that breaks on sparc64 where
the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID
instead, so we don't end up reading from invalid memory.

Cc: stable@kernel.org # 3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 15:02:42 -07:00
Gu Zheng 394ffa503b blk: introduce generic io stat accounting help function
Many block drivers accounting io stat based on bio (e.g. NVMe...),
the blk_account_io_start/end() which is based on request
does not make sense to them, so here we introduce the similar help
function named generic_start/end_io_acct base on raw sectors, and it can
simplify some driver's open io accounting code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:04:44 -07:00
Christoph Hellwig b657d7e632 blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
Don't duplicate the code to handle the not cpu bounce case in the
caller, do it inside blk_mq_hctx_next_cpu instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:02:07 -07:00
Jens Axboe 5fabcb4c33 genhd: check for int overflow in disk_expand_part_tbl()
We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
with a user passed in partno value. If we pass in 0x7fffffff, the
new target in disk_expand_part_tbl() overflows the 'int' and we
access beyond the end of ptbl->part[] and even write to it when we
do the rcu_assign_pointer() to assign the new partition.

Reported-by: David Ramos <daramos@stanford.edu>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-19 13:09:07 -07:00
Jens Axboe 7c7f2f2bc9 blk-mq: add blk_mq_free_hctx_request()
It's silly to use blk_mq_free_request() which in turn maps the
request to the hardware queue, for places where we already know
what the hardware queue is. This saves us an extra mapping of a
hardware queue on request completion, if the caller knows this
information already.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-17 10:41:57 -07:00
Jens Axboe 1a3b595a28 blk-mq: export blk_mq_free_request()
Drivers that know they are blk-mq should just use this function
instead of calling through blk_put_request().

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-17 10:40:48 -07:00
Paolo Bonzini 2a90d4aae5 blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
blk-mq is using preempt_disable/enable in order to ensure that the
queue runners are placed on the right CPU.  This does not work with
the RT patches, because __blk_mq_run_hw_queue takes a non-raw
spinlock with the preemption-disabled region.  If there is contention
on the lock, this violates the rules for preemption-disabled regions.

While this should be easily fixable within the RT patches just by doing
migrate_disable/enable, we can do better and document _why_ this
particular region runs with disabled preemption.  After the previous
patch, it is trivial to switch it to get/put_cpu; the RT patches then
can change it to get_cpu_light, which lets virtio-blk run under RT
kernels.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-11 11:04:49 -07:00
Paolo Bonzini 398205b839 blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed
preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
except the one in blk-flush.c.  In fact that one is always asynchronous,
and it does not need smp_processor_id().

We can do the same for all other calls, avoiding preempt_disable when
async is true.  This avoids peppering blk-mq.c with preemption-disabled
regions.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-11 11:04:47 -07:00
Jens Axboe e167dfb53c blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag
Drivers can now tell blk-mq if they take advantage of the deferred
issue through 'last' or not. If they do, don't do queue-direct
for sync IO. This is a preparation patch for the nvme conversion.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-29 11:18:26 -06:00
Jens Axboe 74c450521d blk-mq: add a 'list' parameter to ->queue_rq()
Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent.

Instead of adding yet another argument to the hot ->queue_rq path,
encapsulate the passed arguments in a blk_mq_queue_data structure.
This is passed as a constant, and has been tested as faster than
passing 4 (or even 3) args through ->queue_rq. Update drivers for
the new ->queue_rq() prototype. There are no functional changes
in this patch for drivers - if they don't use the passed in list,
then they will just queue requests individually like before.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-29 11:14:52 -06:00
Christoph Hellwig 34b48db66e block: remove artifical max_hw_sectors cap
Set max_sectors to the value the drivers provides as hardware limit by
default.  Linux had proper I/O throttling for a long time and doesn't
rely on a artifically small maximum I/O size anymore.  By not limiting
the I/O size by default we remove an annoying tuning step required for
most Linux installation.

Note that both the user, and if absolutely required the driver can still
impose a limit for FS requests below max_hw_sectors_kb.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-21 14:02:54 -06:00
Linus Torvalds d3dc366bba Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block
Pull core block layer changes from Jens Axboe:
 "This is the core block IO pull request for 3.18.  Apart from the new
  and improved flush machinery for blk-mq, this is all mostly bug fixes
  and cleanups.

   - blk-mq timeout updates and fixes from Christoph.

   - Removal of REQ_END, also from Christoph.  We pass it through the
     ->queue_rq() hook for blk-mq instead, freeing up one of the request
     bits.  The space was overly tight on 32-bit, so Martin also killed
     REQ_KERNEL since it's no longer used.

   - blk integrity updates and fixes from Martin and Gu Zheng.

   - Update to the flush machinery for blk-mq from Ming Lei.  Now we
     have a per hardware context flush request, which both cleans up the
     code should scale better for flush intensive workloads on blk-mq.

   - Improve the error printing, from Rob Elliott.

   - Backing device improvements and cleanups from Tejun.

   - Fixup of a misplaced rq_complete() tracepoint from Hannes.

   - Make blk_get_request() return error pointers, fixing up issues
     where we NULL deref when a device goes bad or missing.  From Joe
     Lawrence.

   - Prep work for drastically reducing the memory consumption of dm
     devices from Junichi Nomura.  This allows creating clone bio sets
     without preallocating a lot of memory.

   - Fix a blk-mq hang on certain combinations of queue depths and
     hardware queues from me.

   - Limit memory consumption for blk-mq devices for crash dump
     scenarios and drivers that use crazy high depths (certain SCSI
     shared tag setups).  We now just use a single queue and limited
     depth for that"

* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
  block: Remove REQ_KERNEL
  blk-mq: allocate cpumask on the home node
  bio-integrity: remove the needless fail handle of bip_slab creating
  block: include func name in __get_request prints
  block: make blk_update_request print prefix match ratelimited prefix
  blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
  block: fix alignment_offset math that assumes io_min is a power-of-2
  blk-mq: Make bt_clear_tag() easier to read
  blk-mq: fix potential hang if rolling wakeup depth is too high
  block: add bioset_create_nobvec()
  block: use bio_clone_fast() in blk_rq_prep_clone()
  block: misplaced rq_complete tracepoint
  sd: Honor block layer integrity handling flags
  block: Replace strnicmp with strncasecmp
  block: Add T10 Protection Information functions
  block: Don't merge requests if integrity flags differ
  block: Integrity checksum flag
  block: Relocate bio integrity flags
  block: Add a disk flag to block integrity profile
  block: Add prefix to block integrity profile flags
  ...
2014-10-18 11:53:51 -07:00
Jens Axboe a86073e48a blk-mq: allocate cpumask on the home node
All other allocs are done on the specific node, somehow the
cpumask for hw queue runs was missed. Fix that by using
zalloc_cpumask_var_node() in blk_mq_init_queue().

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 15:41:54 -06:00
Gu Zheng b65c7491cb bio-integrity: remove the needless fail handle of bip_slab creating
bip_slab is created with SLAB_PANIC, so the fail handler is unneeded.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 15:09:38 -06:00
Robert Elliott 7b2b10e0e2 block: include func name in __get_request prints
In __get_request calls to printk_ratelimited, include the function name so
the callbacks suppressed message matches the messages that are printed,
and add "dev" before the device name so it matches other block layer
messages.

Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13 08:34:23 -06:00