Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.
This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq. percpu_ref can serve the same purpose after the recent
changes. This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.
blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.
However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.
Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.
The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.
* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.
* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.
* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was
!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))
mq_freeze_depth is incremented right after dying is set and
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen, which only leaves the blk_queue_bypass() test which
can be replaced by @q->mq_freeze_depth test.
This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Another restriction inherited for NVMe - those devices don't support
SG lists that have "gaps" in them. Gaps refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 762380ad93 inadvertently changed a check for max_sectors
to max_hw_sectors. Revert that part, so we still compare against
max_sectors.
Signed-off-by: Jens Axboe <axboe@fb.com>
With the optimizations around not clearing the full request at alloc
time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
up to the user allocating the request.
Add a blk_rq_set_block_pc() that sets the command type to
REQ_TYPE_BLOCK_PC, and properly initializes the members associated
with this type of request. Update callers to use this function instead
of manipulating rq->cmd_type directly.
Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers have different limits on what size a request should
optimally be, depending on the offset of the request. Similar to
dividing a device into chunks. Add a setting that allows the driver
to inform the block layer of such a chunk size. The block layer will
then prevent merging across the chunks.
This is needed to optimally support NVMe with a non-zero stripe size.
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge misc updates from Andrew Morton:
- a few fixes for 3.16. Cc'ed to stable so they'll get there somehow.
- various misc fixes and cleanups
- most of the ocfs2 queue. Review is slow...
- most of MM. The MM queue is pretty huge this time, but not much in
the way of feature work.
- some tweaks under kernel/
- printk maintenance work
- updates to lib/
- checkpatch updates
- tweaks to init/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits)
fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init
fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul
init/main.c: remove an ifdef
kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
init/main.c: add initcall_blacklist kernel parameter
init/main.c: don't use pr_debug()
fs/binfmt_flat.c: make old_reloc() static
fs/binfmt_elf.c: fix bool assignements
fs/efs: convert printk(KERN_DEBUG to pr_debug
fs/efs: add pr_fmt / use __func__
fs/efs: convert printk to pr_foo()
scripts/checkpatch.pl: device_initcall is not the only __initcall substitute
checkpatch: check stable email address
checkpatch: warn on unnecessary void function return statements
checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar);
checkpatch: add warning for kmalloc/kzalloc with multiply
checkpatch: warn on #defines ending in semicolon
checkpatch: make --strict a default for files in drivers/net and net/
checkpatch: always warn on missing blank line after variable declaration block
checkpatch: fix wildcard DT compatible string checking
...
A block device driver may choose to provide a rw_page operation. These
will be called when the filesystem is attempting to do page sized I/O to
page cache pages (ie not for direct I/O). This does preclude I/Os that
are larger than page size, so this may only be a performance gain for
some devices.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.
This patch makes generic_file_fsync() issue proper cache flush to fix the
problem. Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."
[akpm@linux-foundation.org: nuke ifdef]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'struct blk_mq_ctx' is __percpu, so add the annotation
and fix the sparse warning reported from Fengguang:
[block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect
type in initializer (different address spaces)
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.
Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.
Signed-off-by: Jens Axboe <axboe@fb.com>
I don't think we've ever caught any bugs with this, and there's the
list poisoning for the plug lists to catch uninitialized cases.
So remove the magic member and save 8 bytes in the struct.
Signed-off-by: Jens Axboe <axboe@fb.com>
mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the
core changes so we have a properly merged end result.
Signed-off-by: Jens Axboe <axboe@fb.com>
Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes. Replace them with a per-request_queue list of requests to
be requeued.
Based on an earlier test by Ming Lei.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.
If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.
The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.
Signed-off-by: Jens Axboe <axboe@fb.com>
This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
bsg currently checks ->request_fn to check whether a queue can
handle struct request. But with blk-mq, we don't have a request_fn
yet are request based. Add a queue_is_rq_based() helper and use
that in bsg, I'm guessing this is not the last place we need to
update for this. Besides, it better explains what is being
checked.
Signed-off-by: Jens Axboe <axboe@fb.com>
This allows to mirror the blk-mq code flow for more a more readable I/O
completion handler in SCSI.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will use this work_struct to requeue scsi commands from the
completion handler as well, so give it a more generic name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of setting the REQ_QUEUED flag on each of them just take it into
account in the only macro checking it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was used in the olden days, back when onions were proper
yellow. Basically it mapped to the current buffer to be
transferred. With highmem being added more than a decade ago,
most drivers map pages out of a bio, and rq->buffer isn't
pointing at anything valid.
Convert old style drivers to just use bio_data().
For the discard payload use case, just reference the page
in the bio.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...