We can safely merge anything that won't generate an SG list entry,
so if the bio is data-less (discard), don't look at potential
SG gaps.
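A minimal sketch of the intended short-circuit (the helper and its name are
illustrative, not the exact merge-path code):

    /* Sketch: skip the SG-gap check entirely for data-less bios. */
    static bool bio_needs_gap_check(struct bio *bio)
    {
            /*
             * A discard carries no data, so it generates no SG list entry
             * and cannot create a gap -- it is always safe to merge.
             */
            return bio_has_data(bio);
    }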
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
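The core of the change in req_bio_endio() is roughly the following (a
sketch; the exact surrounding code may differ):

    bio_advance(bio, nbytes);

    /*
     * Don't actually finish the bio if it's part of a flush sequence or
     * attached to a cloned request -- dm-multipath will either resubmit
     * the I/O or complete the bios on the original request itself.
     */
    if (bio->bi_iter.bi_size == 0 &&
        !(rq->cmd_flags & (REQ_FLUSH_SEQ | REQ_CLONE)))
            bio_endio(bio, error);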
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
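In code, the pattern looks roughly like this (an illustrative sketch; the
context structure and helper names are made up for the example):

    static void intermediate_end_io(struct bio *bio, int error)
    {
            struct hook_ctx *ctx = bio->bi_private;   /* hypothetical ctx */

            /* step 3: restore the original completion handler */
            bio->bi_end_io  = ctx->orig_end_io;
            bio->bi_private = ctx->orig_private;

            /* step 4: run the restored original bi_end_io */
            bio_endio(bio, error);
    }

    static void submit_with_hook(struct bio *bio, struct hook_ctx *ctx)
    {
            /* step 1: save the bio's original bi_end_io */
            ctx->orig_end_io  = bio->bi_end_io;
            ctx->orig_private = bio->bi_private;

            /* step 2: wire up the intermediate bi_end_io */
            bio->bi_end_io  = intermediate_end_io;
            bio->bi_private = ctx;
            generic_make_request(bio);
    }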
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
Also, the bio_inc_remaining() interface has been moved local to bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The only possible problem with using mutex_lock() instead of trylock
is deadlock.
If there aren't any locks held before calling blkdev_reread_part(),
deadlock can't be caused by this conversion.
If there are locks held before calling blkdev_reread_part(), and these
locks aren't required in the open/close handlers or the I/O path,
deadlock shouldn't be caused either.
Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belong to the first case, so the conversion is safe
for both.
For loop, the previous patches in this patchset have fixed the ABBA lock
dependency, so the conversion is OK.
For nbd, tx_lock is held when calling the function:
- neither open nor release takes the lock
- when blkdev_reread_part() is run, the I/O thread has already been
stopped, so tx_lock won't be acquired in the I/O path at that time
- so the conversion won't cause a deadlock for nbd
For dasd, dasd_open(), dasd_release() and the request function don't
acquire any mutex/semaphore, so the conversion should be safe.
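The conversion itself is mechanical; roughly (a sketch of the body of
blkdev_reread_part(), with the surrounding checks abbreviated):

    int blkdev_reread_part(struct block_device *bdev)
    {
            struct gendisk *disk = bdev->bd_disk;
            int res;

            if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
                    return -EINVAL;
            if (!capable(CAP_SYS_ADMIN))
                    return -EACCES;

            /* was: if (!mutex_trylock(&bdev->bd_mutex)) return -EBUSY; */
            mutex_lock(&bdev->bd_mutex);
            res = rescan_partitions(disk, bdev);
            mutex_unlock(&bdev->bd_mutex);
            return res;
    }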
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
that helped propagate those failures to the buffer_head layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
lockdep gets unhappy about not disabling irqs when taking the queue_lock
around it. Instead of trying to fix that up, just switch to an atomic_t
and get rid of the lock.
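A sketch of the shape of the change, assuming the counter in question is
the mq freeze depth (names and call sites are illustrative):

    /* before: plain int protected by q->queue_lock, which lockdep flags
     * when irqs aren't disabled around it */
    spin_lock_irq(q->queue_lock);
    freeze_depth = ++q->mq_freeze_depth;
    spin_unlock_irq(q->queue_lock);

    /* after: atomic_t, no lock (and no irq juggling) needed */
    freeze_depth = atomic_inc_return(&q->mq_freeze_depth);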
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The previous patch makes plugging work for the multi-queue case. However,
it only works for the single-disk case, because it assumes there is only
one request in the plug list. If a task is accessing multiple disks,
e.g. MD/DM, that assumption is wrong. Let blk_attempt_plug_merge() record
a request from the same queue.
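Conceptually, the walk over the plug list now remembers an already-plugged
request that belongs to the same queue (a simplified sketch; the real code
hands this back to the caller through an extra parameter):

    struct request *rq, *same_queue_rq = NULL;

    list_for_each_entry_reverse(rq, &plug->mq_list, queuelist) {
            /*
             * With MD/DM on top, the plug list can hold requests for
             * several queues; remember one from our own queue so the
             * caller can use it for its merge/issue decision.
             */
            if (rq->q == q)
                    same_queue_rq = rq;
    }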
V2: use a NULL parameter in the !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge() to make it (hopefully) less confusing.
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Plugging is still helpful for workloads with IO merges, but it can be
harmful otherwise, especially with multiple hardware queues, as there is
(supposedly) no lock contention in this case and plugging can introduce
latency. For multiple queues, we do limited plugging, e.g. plug only if
there is a request merge. If a request doesn't merge with the following
request, it will be dispatched immediately.
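A simplified sketch of the resulting policy in the multi-queue submit path
(the direct-issue helper name is illustrative):

    if (plug && !blk_queue_nomerges(q)) {
            struct request *old_rq = NULL;

            /*
             * Keep at most one request plugged: the new request stays on
             * the plug list so a following bio gets a chance to merge
             * with it, while the previously plugged request (which has
             * had its merge window) is sent to hardware right away.
             */
            blk_mq_bio_to_request(rq, bio);
            if (!list_empty(&plug->mq_list)) {
                    old_rq = list_first_entry(&plug->mq_list,
                                              struct request, queuelist);
                    list_del_init(&old_rq->queuelist);
            }
            list_add_tail(&rq->queuelist, &plug->mq_list);

            if (old_rq)
                    issue_directly(old_rq);         /* illustrative helper */
    }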
V2: check blk_queue_nomerges() as suggested by Jeff.
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we have already assigned the bio to a request
in blk_mq_bio_to_request(), so blk_mq_merge_queue_io() shouldn't run
blk_mq_bio_to_request() again.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The following appears in blk_sq_make_request:
/*
* If we have multiple hardware queues, just go directly to
* one of those for sync IO.
*/
We clearly don't have multiple hardware queues here! This comment was
introduced with commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue), which says:
We want slightly different behavior from them:
- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.
- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.
The old code had this:
use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
and that was converted to:
use_plug = !is_flush_fua && !is_sync;
which is not equivalent. For the single-queue case, the second half of
the && expression is always true. So what I think was actually intended
follows (and this more closely matches what is done in blk_queue_bio).
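For the single-queue path q->nr_hw_queues is 1, so the old expression
reduces to:

    use_plug = !is_flush_fua;

i.e. sync IO should still use the per-process plug on a single-queue device.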
V2: delete the 'likely', which should not be a big deal
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The current code makes it look like the inner plug gets flushed with
blk_finish_plug(). Actually it's a nop: all requests/callbacks are added
to current->plug, and only the outermost plug is ever assigned to
current->plug. So an inner plug always has an empty request/callback
list, which makes blk_flush_plug_list() a nop. This makes the code
clearer.
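The relevant behaviour of blk_start_plug() that makes this a nop, roughly:

    void blk_start_plug(struct blk_plug *plug)
    {
            INIT_LIST_HEAD(&plug->list);
            INIT_LIST_HEAD(&plug->mq_list);
            INIT_LIST_HEAD(&plug->cb_list);

            /*
             * Only the outermost plug is published in current->plug, so
             * the submission paths never see a nested plug; its lists
             * stay empty and flushing it does nothing.
             */
            if (!current->plug)
                    current->plug = plug;
    }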
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This moves the request types and hacks out of the block code and into the
old IDE driver. There is a small amount of code duplication due to this,
but it's not too bad.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has a reference count that controls when it can be freed.
The most common use case is allocating the bio, which then returns with
a single reference to it, doing IO, and then dropping that single
reference. We can remove the atomic_dec_and_test() in the completion
path if nobody else is holding a reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as
now having a valid count, and we must properly honor the reference
count when it's being put.
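In rough pseudocode (a sketch; the flag and field names are illustrative):

    static inline void bio_get(struct bio *bio)
    {
            /* mark the bio as having a real, externally visible refcount */
            bio->bi_flags |= (1 << BIO_REFFED);
            smp_mb__before_atomic();
            atomic_inc(&bio->__bi_cnt);
    }

    void bio_put(struct bio *bio)
    {
            if (!bio_flagged(bio, BIO_REFFED))
                    bio_free(bio);          /* sole owner, no atomic dec */
            else if (atomic_dec_and_test(&bio->__bi_cnt))
                    bio_free(bio);
    }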
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has an atomic ref count for chained bios, and we use this
to know when to end IO on the bio. However, most bios are not chained,
so we don't need to always incur this atomic operation as part of
ending IO.
Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.
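A sketch of the helper and the corresponding fast path used by bio_endio()
(close to, but not necessarily identical to, the actual patch):

    static inline void bio_inc_remaining(struct bio *bio)
    {
            bio->bi_flags |= (1 << BIO_CHAIN);
            smp_mb__before_atomic();
            atomic_inc(&bio->__bi_remaining);
    }

    /* non-chained bios skip the atomic dec entirely */
    static inline bool bio_remaining_done(struct bio *bio)
    {
            if (!bio_flagged(bio, BIO_CHAIN))
                    return true;

            if (atomic_dec_and_test(&bio->__bi_remaining)) {
                    clear_bit(BIO_CHAIN, &bio->bi_flags);
                    return true;
            }
            return false;
    }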
For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.
Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 889fa31f00 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.
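i.e. something along the lines of (variable names illustrative):

    /* a plain division rounds down and can miss the last word: */
    map_size = nr_ctx / bits_per_word;

    /* round up so the final, partially filled word is still covered: */
    map_size = DIV_ROUND_UP(nr_ctx, bits_per_word);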
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 889fa31f00 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer core bits from Jens Axboe:
"This is the core pull request for 4.1. Not a lot of stuff in here for
this round, mostly little fixes or optimizations. This pull request
contains:
- An optimization that speeds up queue runs on blk-mq, especially for
the case where there's a large difference between nr_cpu_ids and
the actual mapped software queues on a hardware queue. From Chong
Yuan.
- Honor node local allocations for requests on legacy devices. From
David Rientjes.
- Cleanup of blk_mq_rq_to_pdu() from me.
- exit_aio() fixup from me, greatly speeding up exiting multiple IO
contexts off exit_group(). For my particular test case, fio exit
took ~6 seconds. A typical case of both exposing RCU grace periods
to user space, and serializing exit of them.
- Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
wait if __GFP_WAIT is set. From Keith Busch.
- blk-mq exports and two added helpers from Mike Snitzer, which will
be used by the dm-mq code.
- Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"
* 'for-4.1/core' of git://git.kernel.dk/linux-block:
blk-mq: reduce unnecessary software queue looping
aio: fix serial draining in exit_aio()
blk-mq: cleanup blk_mq_rq_to_pdu()
blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
block: allocate request memory local to request queue
blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
blk-mq: export blk_mq_run_hw_queues
blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
ctxs are assigned to one hctx, they will always loop hctx->ctx_map.map_size
times, where hctx->ctx_map.map_size is the constant ALIGN(nr_cpu_ids, 8) / 8.
In particular, flush_busy_ctxs() is in a hot code path, and the extra
looping is unnecessary.
Change ->map_size to contain the actually mapped software queues, so we
only loop for as many iterations as we have to.
Also remove the cpumask setting and nr_ctx count in blk_mq_init_cpu_queues(),
since they are all redone in blk_mq_map_swqueue().
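The loops in question have the shape (sketch; the per-word handling is
elided):

    for (i = 0; i < hctx->ctx_map.map_size; i++)
            handle_word(&hctx->ctx_map.map[i]);     /* illustrative */

so making ->map_size reflect only the ctxs actually mapped to this hctx,
instead of the nr_cpu_ids-derived constant, directly cuts the iteration
count.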
Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>
Updated by me for formatting and commenting.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs update from Al Viro:
"Part one:
- struct filename-related cleanups
- saner iov_iter_init() replacements (and switching the syscalls to
use of those)
- ntfs switch to ->write_iter() (Anton)
- aio cleanups and splitting iocb into common and async parts
(Christoph)
- assorted fixes (me, bfields, Andrew Elble)
There's a lot more, including the completion of switchover to
->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
race fixes, etc, but that goes after #for-davem merge. David has
pulled it, and once it's in I'll send the next vfs pull request"
* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
sg_start_req(): use import_iovec()
sg_start_req(): make sure that there's not too many elements in iovec
blk_rq_map_user(): use import_single_range()
sg_io(): use import_iovec()
process_vm_access: switch to {compat_,}import_iovec()
switch keyctl_instantiate_key_common() to iov_iter
switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
vmsplice_to_user(): switch to import_iovec()
kill aio_setup_single_vector()
aio: simplify arguments of aio_setup_..._rw()
aio: lift iov_iter_init() into aio_setup_..._rw()
lift iov_iter into {compat_,}do_readv_writev()
NFS: fix BUG() crash in notify_change() with patch to chown_common()
dcache: return -ESTALE not -EBUSY on distributed fs race
NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
VFS: Add iov_iter_fault_in_multipages_readable()
drop bogus check in file_open_root()
switch security_inode_getattr() to struct path *
constify tomoyo_realpath_from_path()
...
Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
pointer in scsi_init_cmd_errh() with the blk-mq code.
The sense_buffer pointer should have been initialized by the call to
scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
some non-repeatable memory corruptor.
This patch makes sure we initialize the whole struct request allocation
(and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
using __GFP_ZERO in the allocation. The old code initialized a couple
of individual fields, leaving the rest undefined (although many of them
are then initialized in later phases, like blk_mq_rq_ctx_init() etc).
It's not entirely clear why this matters, but it's the right thing to do
regardless, and with 4.0 imminent this is the defensive "let's just make
sure everything is initialized properly" patch.
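The change is essentially one extra flag at the allocation site in
blk_mq_init_rq_map() (a sketch; the other gfp flags and variable names are
taken from memory and may differ):

    /* zero the whole allocation, including the driver pdu
     * (e.g. struct scsi_cmnd) that follows struct request */
    page = alloc_pages_node(set->numa_node,
                            GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY |
                            __GFP_ZERO, this_order);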
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
caused blk_stack_limits() to not properly stack queue_limits for stacked
devices (e.g. DM).
Fix this regression by establishing lcm_not_zero() and switching
blk_stack_limits() over to using it.
DM uses blk_set_stacking_limits() to establish the initial top-level
queue_limits that are then built up based on underlying devices' limits
using blk_stack_limits(). In the case of optimal_io_size (io_opt)
blk_set_stacking_limits() establishes a default value of 0. With commit
69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
the underlying devices' io_opt.
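lcm_not_zero() treats a zero argument as 'no constraint' rather than
forcing the result to zero; a sketch of the helper:

    static inline unsigned long lcm_not_zero(unsigned long a, unsigned long b)
    {
            unsigned long l = lcm(a, b);

            if (l)
                    return l;
            return (b ? : a);       /* fall back to whichever is non-zero */
    }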
Test:
$ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
$ cat /sys/block/sde/queue/optimal_io_size
786432
$ dmsetup create node --table "0 100 linear /dev/sde 0"
Before this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
0
After this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
786432
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>