Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
The math in both blk_stack_limits() and queue_limit_alignment_offset()
assume that a block device's io_min (aka minimum_io_size) is always a
power-of-2. Fix the math such that it works for non-power-of-2 io_min.
This issue (of alignment_offset != 0) became apparent when testing
dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
block size") unlocked the potential for alignment_offset != 0 due to
the dm-thin-pool's io_min possibly being a non-power-of-2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We'd occasionally merge requests with conflicting integrity flags.
Introduce a merge helper which checks that the requests have compatible
integrity payloads.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make the choice of checksum a per-I/O property by introducing a flag
that can be inspected by the SCSI layer. There are several reasons for
this:
1. It allows us to switch choice of checksum without unloading and
reloading the HBA driver.
2. During error recovery we need to be able to tell the HBA that
checksums read from disk should not be verified and converted to IP
checksums.
3. For error injection purposes we need to be able to write a bad guard
tag to storage. Since the storage device only supports T10 CRC we
need to be able to disable IP checksum conversion on the HBA.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
So far we have relied on the app tag size to determine whether a disk
has been formatted with T10 protection information or not. However, not
all target devices provide application tag storage.
Add a flag to the block integrity profile that indicates whether the
disk has been formatted with protection information.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a BLK_ prefix to the integrity profile flags. Also rename the flags
to be more consistent with the generate/verify terminology in the rest
of the integrity code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of the "operate" parameter we pass in a seed value and a pointer
to a function that can be used to process the integrity metadata. The
generation function is changed to have a return value to fit into this
scheme.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The protection interval is not necessarily tied to the logical block
size of a block device. Stop using the terms "sector" and "sectors".
Going forward we will use the term "seed" to describe the initial
reference tag value for a given I/O. "Interval" will be used to describe
the portion of the data buffer that a given piece of protection
information is associated with.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
None of the filesystems appear interested in using the integrity tagging
feature. Potentially because very few storage devices actually permit
using the application tag space.
Remove the tagging functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For commands like REQ_COPY we need a way to pass extra information along
with each bio. Like integrity metadata this information must be
available at the bottom of the stack so bi_private does not suffice.
Rename the existing bi_integrity field to bi_special and make it a union
so we can have different bio extensions for each class of command.
We previously used bi_integrity != NULL as a way to identify whether a
bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the
indicator now that bi_special can contain different things.
In addition, bio_integrity(bio) will now return a pointer to the
integrity payload (when applicable).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch introduces 'struct blk_flush_queue' and puts all
flush machinery related fields into this structure, so that
- flush implementation details aren't exposed to driver
- it is easy to convert to per dispatch-queue flush machinery
This patch is basically a mechanical replacement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev_get_queue() returns the request_queue associated with the
specified block_device. blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.
All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().
Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.
This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq. percpu_ref can serve the same purpose after the recent
changes. This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.
blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.
However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.
Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.
The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.
* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.
* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.
* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was
!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))
mq_freeze_depth is incremented right after dying is set and
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen, which only leaves the blk_queue_bypass() test which
can be replaced by @q->mq_freeze_depth test.
This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Another restriction inherited for NVMe - those devices don't support
SG lists that have "gaps" in them. Gaps refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 762380ad93 inadvertently changed a check for max_sectors
to max_hw_sectors. Revert that part, so we still compare against
max_sectors.
Signed-off-by: Jens Axboe <axboe@fb.com>
With the optimizations around not clearing the full request at alloc
time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
up to the user allocating the request.
Add a blk_rq_set_block_pc() that sets the command type to
REQ_TYPE_BLOCK_PC, and properly initializes the members associated
with this type of request. Update callers to use this function instead
of manipulating rq->cmd_type directly.
Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers have different limits on what size a request should
optimally be, depending on the offset of the request. Similar to
dividing a device into chunks. Add a setting that allows the driver
to inform the block layer of such a chunk size. The block layer will
then prevent merging across the chunks.
This is needed to optimally support NVMe with a non-zero stripe size.
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge misc updates from Andrew Morton:
- a few fixes for 3.16. Cc'ed to stable so they'll get there somehow.
- various misc fixes and cleanups
- most of the ocfs2 queue. Review is slow...
- most of MM. The MM queue is pretty huge this time, but not much in
the way of feature work.
- some tweaks under kernel/
- printk maintenance work
- updates to lib/
- checkpatch updates
- tweaks to init/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits)
fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init
fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul
init/main.c: remove an ifdef
kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
init/main.c: add initcall_blacklist kernel parameter
init/main.c: don't use pr_debug()
fs/binfmt_flat.c: make old_reloc() static
fs/binfmt_elf.c: fix bool assignements
fs/efs: convert printk(KERN_DEBUG to pr_debug
fs/efs: add pr_fmt / use __func__
fs/efs: convert printk to pr_foo()
scripts/checkpatch.pl: device_initcall is not the only __initcall substitute
checkpatch: check stable email address
checkpatch: warn on unnecessary void function return statements
checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar);
checkpatch: add warning for kmalloc/kzalloc with multiply
checkpatch: warn on #defines ending in semicolon
checkpatch: make --strict a default for files in drivers/net and net/
checkpatch: always warn on missing blank line after variable declaration block
checkpatch: fix wildcard DT compatible string checking
...
A block device driver may choose to provide a rw_page operation. These
will be called when the filesystem is attempting to do page sized I/O to
page cache pages (ie not for direct I/O). This does preclude I/Os that
are larger than page size, so this may only be a performance gain for
some devices.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.
This patch makes generic_file_fsync() issue proper cache flush to fix the
problem. Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."
[akpm@linux-foundation.org: nuke ifdef]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'struct blk_mq_ctx' is __percpu, so add the annotation
and fix the sparse warning reported from Fengguang:
[block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect
type in initializer (different address spaces)
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.
Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.
Signed-off-by: Jens Axboe <axboe@fb.com>