Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since 018a17bdc8 ("bdi: reimplement bdev_inode_switch_bdi()") the
block device code writes out all dirty data whenever switching the
backing_dev_info for a block device inode. But a block device inode can
only be dirtied when it is in use, which means we only have to write it
out on the final blkdev_put, but not when doing a blkdev_get.
Factoring out the write out from the bdi list switch prepares from
removing the list switch later in the series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.
Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.
In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.
To solve this problem, this patch adds the freeze_super and thaw_super
hooks. If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device. Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.
The raw driver needs the same end of device handling as was added for normal
block devices. Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem. But it is not correct due to the lack of
effective readpages() in the address space operations for block device.
This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.
Install 1GB of RAM disk storage:
# modprobe scsi_debug dev_size_mb=1024 delay=0
Sequential read from file on a filesystem:
# mkfs.ext4 /dev/$DEV
# mount /dev/$DEV /mnt
# fio --name=t --size=512m --rw=read --filename=/mnt/file
...
read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec
Sequential read from a block device:
# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
...
(Without this commit)
read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec
(With this commit)
read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A block_device may be attached to different gendisks and thus
different bdis over time. bdev_inode_switch_bdi() is used to switch
the associated bdi. The function assumes that the inode could be
dirty and transfers it between bdis if so. This is a bit nasty in
that it reaches into bdi internals.
This patch reimplements the function so that it writes out the inode
if dirty. This is a lot simpler and can be implemented without
exposing bdi internals.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev_get_queue() returns the request_queue associated with the
specified block_device. blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.
All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().
Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...
[AV: fixed the braino spotted by Cyrill]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A block device driver may choose to provide a rw_page operation. These
will be called when the filesystem is attempting to do page sized I/O to
page cache pages (ie not for direct I/O). This does preclude I/Os that
are larger than page size, so this may only be a performance gain for
some devices.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
Pull writeback fix from Wu Fengguang:
"A trivial writeback fix"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Do not sort b_io list only because of block device inode
Pull aio changes from Ben LaHaise:
"First off, sorry for this pull request being late in the merge window.
Al had raised a couple of concerns about 2 items in the series below.
I addressed the first issue (the race introduced by Gu's use of
mm_populate()), but he has not provided any further details on how he
wants to rework the anon_inode.c changes (which were sent out months
ago but have yet to be commented on).
The bulk of the changes have been sitting in the -next tree for a few
months, with all the issues raised being addressed"
* git://git.kvack.org/~bcrl/aio-next: (22 commits)
aio: rcu_read_lock protection for new rcu_dereference calls
aio: fix race in ring buffer page lookup introduced by page migration support
aio: fix rcu sparse warnings introduced by ioctx table lookup patch
aio: remove unnecessary debugging from aio_free_ring()
aio: table lookup: verify ctx pointer
staging/lustre: kiocb->ki_left is removed
aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
aio: convert the ioctx list to table lookup v3
aio: double aio_max_nr in calculations
aio: Kill ki_dtor
aio: Kill ki_users
aio: Kill unneeded kiocb members
aio: Kill aio_rw_vect_retry()
aio: Don't use ctx->tail unnecessarily
aio: io_cancel() no longer returns the io_event
aio: percpu ioctx refcount
aio: percpu reqs_available
aio: reqs_active -> reqs_available
aio: fix build when migration is disabled
...
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>