Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
Pull virtio updates from Rusty Russell:
"One cc: stable commit, the rest are a series of minor cleanups which
have been sitting in MST's tree during my vacation. I changed a
function name and made one trivial change, then they spent two days in
linux-next"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (25 commits)
virtio-rng: refactor probe error handling
virtio_scsi: drop scan callback
virtio_balloon: enable VQs early on restore
virtio_scsi: fix race on device removal
virito_scsi: use freezable WQ for events
virtio_net: enable VQs early on restore
virtio_console: enable VQs early on restore
virtio_scsi: enable VQs early on restore
virtio_blk: enable VQs early on restore
virtio_scsi: move kick event out from virtscsi_init
virtio_net: fix use after free on allocation failure
9p/trans_virtio: enable VQs early
virtio_console: enable VQs early
virtio_blk: enable VQs early
virtio_net: enable VQs early
virtio: add API to enable VQs early
virtio_net: minor cleanup
virtio-net: drop config_mutex
virtio_net: drop config_enable
virtio-blk: drop config_mutex
...
Pull Ceph updates from Sage Weil:
"There is the long-awaited discard support for RBD (Guangliang Zhao,
Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
(Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
performance and debugging improvements (Yan, Zheng, John Spray), and a
smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
ceph: fix divide-by-zero in __validate_layout()
rbd: rbd workqueues need a resque worker
libceph: ceph-msgr workqueue needs a resque worker
ceph: fix bool assignments
libceph: separate multiple ops with commas in debugfs output
libceph: sync osd op definitions in rados.h
libceph: remove redundant declaration
ceph: additional debugfs output
ceph: export ceph_session_state_name function
ceph: include the initial ACL in create/mkdir/mknod MDS requests
ceph: use pagelist to present MDS request data
libceph: reference counting pagelist
ceph: fix llistxattr on symlink
ceph: send client metadata to MDS
ceph: remove redundant code for max file size verification
ceph: remove redundant io_iter_advance()
ceph: move ceph_find_inode() outside the s_mutex
ceph: request xattrs if xattr_version is zero
rbd: set the remaining discard properties to enable support
rbd: use helpers to handle discard for layered images correctly
...
virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after restore returns, virtio block violated
this rule on restore by restarting queues, which might in theory
cause the VQ to be used directly within restore.
To fix, call virtio_device_ready before using starting queues.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after probe returns, virtio block violated this
rule by calling add_disk, which causes the VQ to be used directly within
probe.
To fix, call virtio_device_ready before using VQs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
config_mutex served two purposes: prevent multiple concurrent config
change handlers, and synchronize access to config_enable flag.
Since commit dbf2576e37
workqueue: make all workqueues non-reentrant
all workqueues are non-reentrant, and config_enable
is now gone.
Get rid of the unnecessary lock.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that virtio core ensures config changes don't
arrive during probing, drop config_enable flag
in virtio blk.
On removal, flush is now sufficient to guarantee that
no change work is queued.
This help simplify the driver, and will allow
setting DRIVER_OK earlier without losing config
change notifications.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.
Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes. Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.
For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.
op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discards take a reference to the snapshot context of an image when
they are created. This reference needs to be cleaned up when the
request is done just as it is for regular writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This patch add the discard support for rbd driver.
There are three types operation in the driver:
1. The objects would be removed if they completely contained
within the discard range.
2. The objects would be truncated if they partly contained within
the discard range, and align with their boundary.
3. Others would be zeroed.
A discard request from blkdev_issue_discard() is defined which
REQ_WRITE and REQ_DISCARD both marked and no data, so we must
check the REQ_DISCARD first when getting the request type.
This resolve:
http://tracker.ceph.com/issues/190
[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
commits by Josh Durgin for refinements and fixes which weren't
folded in to preserve authorship. ]
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It could only handle the read and write operations now,
extend it for the coming discard support.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It need to copyup the parent's content when layered writing,
but an entire object write would overwrite it, so skip it.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.
Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors. (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
Pull sparc updates from David Miller:
1) Move to 4-level page tables on sparc64 and support up to 53-bits of
physical addressing. Kernel static image BSS size reduced by
several megabytes.
2) M6/M7 cpu support, from Allan Pais.
3) Move to sparse IRQs, handle hypervisor TLB call errors more
gracefully, and add T5 perf_event support. From Bob Picco.
4) Recognize cdroms and compute geometry from capacity in virtual disk
driver, also from Allan Pais.
5) Fix memset() return value on sparc32, from Andreas Larsson.
6) Respect gfp flags in dma_alloc_coherent on sparc32, from Daniel
Hellstrom.
7) Fix handling of compound pages in virtual disk driver, from Dwight
Engen.
8) Fix lockdep warnings in LDC layer by moving IRQ requesting to
ldc_alloc() from ldc_bind().
9) Increase boot string length to 1024 bytes, from Dave Kleikamp.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: (31 commits)
sparc64: Fix lockdep warnings on reboot on Ultra-5
sparc64: Increase size of boot string to 1024 bytes
sparc64: Kill unnecessary tables and increase MAX_BANKS.
sparc64: sparse irq
sparc64: Adjust vmalloc region size based upon available virtual address bits.
sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53.
sparc64: Use kernel page tables for vmemmap.
sparc64: Fix physical memory management regressions with large max_phys_bits.
sparc64: Adjust KTSB assembler to support larger physical addresses.
sparc64: Define VA hole at run time, rather than at compile time.
sparc64: Switch to 4-level page tables.
sparc64: Fix reversed start/end in flush_tlb_kernel_range()
sparc64: Add vio_set_intr() to enable/disable Rx interrupts
vio: fix reuse of vio_dring slot
sunvdc: limit each sg segment to a page
sunvdc: compute vdisk geometry from capacity
sunvdc: add cdrom and v1.1 protocol support
sparc: VIO protocol version 1.6
sparc64: Fix hibernation code refrence to PAGE_OFFSET.
sparc64: Move request_irq() from ldc_bind() to ldc_alloc()
...
Pull Xen updates from David Vrabel:
"Features and fixes:
- Add pvscsi frontend and backend drivers.
- Remove _PAGE_IOMAP PTE flag, freeing it for alternate uses.
- Try and keep memory contiguous during PV memory setup (reduces
SWIOTLB usage).
- Allow front/back drivers to use threaded irqs.
- Support large initrds in PV guests.
- Fix PVH guests in preparation for Xen 4.5"
* tag 'stable/for-linus-3.18-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (22 commits)
xen: remove DEFINE_XENBUS_DRIVER() macro
xen/xenbus: Remove BUG_ON() when error string trucated
xen/xenbus: Correct the comments for xenbus_grant_ring()
x86/xen: Set EFER.NX and EFER.SCE in PVH guests
xen: eliminate scalability issues from initrd handling
xen: sync some headers with xen tree
xen: make pvscsi frontend dependant on xenbus frontend
arm{,64}/xen: Remove "EXPERIMENTAL" in the description of the Xen options
xen-scsifront: don't deadlock if the ring becomes full
x86: remove the Xen-specific _PAGE_IOMAP PTE flag
x86/xen: do not use _PAGE_IOMAP PTE flag for I/O mappings
x86: skip check for spurious faults for non-present faults
xen/efi: Directly include needed headers
xen-scsiback: clean up a type issue in scsiback_make_tpg()
xen-scsifront: use GFP_ATOMIC under spin_lock
MAINTAINERS: Add xen pvscsi maintainer
xen-scsiback: Add Xen PV SCSI backend driver
xen-scsifront: Add Xen PV SCSI frontend driver
xen: Add Xen pvSCSI protocol description
xen/events: support threaded irqs for interdomain event channels
...