This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull vfs thaw updates from Al Viro:
"An ancient series that has fallen through the cracks in the previous
cycle"
* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
buffer.c: call thaw_super during emergency thaw
vfs: factor sb iteration out of do_emergency_remount
There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thaw of only one of these does not thaw the other.
In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained
unaffected.
The patch is a hack which adds blind unfreezing.
In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is
employed.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull misc vfs updates from Al Viro:
"All kinds of misc stuff, without any unifying topic, from various
people.
Neil's d_anon patch, several bugfixes, introduction of kvmalloc
analogue of kmemdup_user(), extending bitfield.h to deal with
fixed-endians, assorted cleanups all over the place..."
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
alpha: osf_sys.c: use timespec64 where appropriate
alpha: osf_sys.c: fix put_tv32 regression
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
dcache: delete unused d_hash_mask
dcache: subtract d_hash_shift from 32 in advance
fs/buffer.c: fold init_buffer() into init_page_buffers()
fs: fold __inode_permission() into inode_permission()
fs: add RWF_APPEND
sctp: use vmemdup_user() rather than badly open-coding memdup_user()
snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
replace_user_tlv(): switch to vmemdup_user()
new primitive: vmemdup_user()
memdup_user(): switch to GFP_USER
eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
eventfd: fold eventfd_ctx_read() into eventfd_read()
eventfd: convert to use anon_inode_getfd()
nfs4file: get rid of pointless include of btrfs.h
uvc_v4l2: clean copyin/copyout up
vme_user: don't use __copy_..._user()
usx2y: don't bother with memdup_user() for 16-byte structure
...
Since commit e76004093d ("fs/buffer.c: remove unnecessary init
operation after allocating buffer_head"), there are no callers of
init_buffer() outside of init_page_buffers(). So just fold it into
init_page_buffers().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
Pull ext4 updates from Ted Ts'o:
- Add support for online resizing of file systems with bigalloc
- Fix a two data corruption bugs involving DAX, as well as a corruption
bug after a crash during a racing fallocate and delayed allocation.
- Finally, a number of cleanups and optimizations.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve smp scalability for inode generation
ext4: add support for online resizing with bigalloc
ext4: mention noload when recovering on read-only device
Documentation: fix little inconsistencies
ext4: convert timers to use timer_setup()
jbd2: convert timers to use timer_setup()
ext4: remove duplicate extended attributes defs
ext4: add ext4_should_use_dax()
ext4: add sanity check for encryption + DAX
ext4: prevent data corruption with journaling + DAX
ext4: prevent data corruption with inline data + DAX
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ext4: retry allocations conservatively
ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
ext4: Add iomap support for inline data
iomap: Add IOMAP_F_DATA_INLINE flag
iomap: Switch from blkno to disk offset
guard_bio_eod() needs to look at the partition capacity, not just the
capacity of the whole device, when determining if truncation is
necessary.
[ 60.268688] attempt to access beyond end of device
[ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
[ 60.268693] buffer_io_error: 2 callbacks suppressed
[ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read
Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Cc: stable@vger.kernel.org # v4.13
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.
Since this is also the last user of free_more_memory(), kill
it off completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.
After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace iomap->blkno, the sector number, with iomap->addr, the disk
offset in bytes. For invalid disk offsets, use the special value
IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK.
This allows to use iomap for mappings which are not block aligned, such
as inline data on ext4.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> # iomap, xfs
Reviewed-by: Jan Kara <jack@suse.cz>
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
We want only pages from given range in page_cache_seek_hole_data(). Use
pagevec_lookup_range() instead of pagevec_lookup() and remove
unnecessary code.
Note that the check for getting less pages than desired can be removed
because index gets updated by pagevec_lookup_range().
Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e64855c6cf ("fs: Add helper to clean bdev aliases under a bh
and use it") added a wrapper for clean_bdev_aliases() that invalidates
bdev aliases underlying a single buffer head.
However this has caused a performance regression for bonnie++ benchmark
on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
average of 3 runs:
Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%)
Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%)
The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because pagevec_lookup() it uses will end
up iterating through the radix tree until it finds a page (which may
take a while) but we are only interested whether there's a page at a
particular index.
Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.
Fixes: e64855c6cf ("fs: Add helper to clean bdev aliases under a bh and use it")
Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To install a buffer_head into the cpu's LRU queue, bh_lru_install()
would construct a new copy of the queue and then memcpy it over the real
queue. But it's easily possible to do the update in-place, which is
faster and simpler. Some work can also be skipped if the buffer_head
was already in the queue.
As a microbenchmark I timed how long it takes to run sb_getblk()
10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
Effectively, this benchmarks looking up buffer_heads that are in the
page cache but not in the LRU:
Before this patch: 1.758s
After this patch: 1.653s
This patch also removes about 350 bytes of compiled code (on x86_64),
partly due to removal of the memcpy() which was being inlined+unrolled.
Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>