Pull virtio updates from Michael Tsirkin:
"Fixes, cleanups, performance
A bunch of changes to virtio, most affecting virtio net. Also ptr_ring
batched zeroing - first of batching enhancements that seems ready."
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
s390/virtio: change maintainership
tools/virtio: fix spelling mistake: "wakeus" -> "wakeups"
virtio_net: tidy a couple debug statements
ptr_ring: support testing different batching sizes
ringtest: support test specific parameters
ptr_ring: batch ring zeroing
virtio: virtio_driver doc
virtio_net: don't reset twice on XDP on/off
virtio_net: fix support for small rings
virtio_net: reduce alignment for buffers
virtio_net: rework mergeable buffer handling
virtio_net: allow specifying context for rx
virtio: allow extra context per descriptor
tools/virtio: fix build breakage
virtio: add context flag to find vqs
virtio: wrap find_vqs
ringtest: fix an assert statement
Pull ceph updates from Ilya Dryomov:
"The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling
series from Jeff.
The former will allow rbd users to take advantage of exclusive lock's
built-in blacklist/break-lock functionality while staying in control
of who owns the lock. With the latter in place, we will abort
filesystem writes on -ENOSPC instead of having them block
indefinitely.
Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander"
* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
ceph: fix memory leak in __ceph_setxattr()
ceph: fix file open flags on ppc64
ceph: choose readdir frag based on previous readdir reply
rbd: exclusive map option
rbd: return ResponseMessage result from rbd_handle_request_lock()
rbd: kill rbd_is_lock_supported()
rbd: support updating the lock cookie without releasing the lock
rbd: store lock cookie
rbd: ignore unlock errors
rbd: fix error handling around rbd_init_disk()
rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
ceph: when seeing write errors on an inode, switch to sync writes
Revert "ceph: SetPageError() for writeback pages if writepages fails"
ceph: handle epoch barriers in cap messages
libceph: add an epoch_barrier field to struct ceph_osd_client
libceph: abort already submitted but abortable requests when map or pool goes full
libceph: allow requests to return immediately on full conditions if caller wishes
libceph: remove req->r_replay_version
ceph: make seeky readdir more efficient
...
CURRENT_TIME is not y2038 safe. The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.
struct timespec is also not y2038 safe. Retain timespec for timestamp
representation here as ceph uses it internally everywhere. These
references will be changed to use struct timespec64 in a separate patch.
The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.
Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().
Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.
This api will be transitioned to be y2038 safe along with vfs.
Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M: Ilya Dryomov <idryomov@gmail.com>
M: "Yan, Zheng" <zyan@redhat.com>
M: Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vmalloc* allows users to provide gfp flags for the underlying
allocation. This API is quite popular
$ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
77
The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space. About half of users don't
use this flag, though. This signals that we make the API unnecessarily
too complex.
This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
are simplified and drop the flag.
Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes and updates from Jens Axboe:
"Some fixes and followup features/changes that should go in, in this
merge window. This contains:
- Two fixes for lightnvm from Javier, fixing problems in the new code
merge previously in this merge window.
- A fix from Jan for the backing device changes, fixing an issue in
NFS that causes a failure to mount on certain setups.
- A change from Christoph, cleaning up the blk-mq init and exit
request paths.
- Remove elevator_change(), which is now unused. From Bart.
- A fix for queue operation invocation on a dead queue, from Bart.
- A series fixing up mtip32xx for blk-mq scheduling, removing a
bandaid we previously had in place for this. From me.
- A regression fix for this series, fixing a case where we wait on
workqueue flushing from an invalid (non-blocking) context. From me.
- A fix/optimization from Ming, ensuring that we don't both quiesce
and freeze a queue at the same time.
- A fix from Peter on lock ordering for CPU hotplug. Not a real
problem right now, but will be once the CPU hotplug rework goes in.
- A series from Omar, cleaning up out blk-mq debugfs support, and
adding support for exporting info from schedulers in debugfs as
well. This is really useful in debugging stalls or livelocks. From
Omar"
* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
mq-deadline: add debugfs attributes
kyber: add debugfs attributes
blk-mq-debugfs: allow schedulers to register debugfs attributes
blk-mq: untangle debugfs and sysfs
blk-mq: move debugfs declarations to a separate header file
blk-mq: Do not invoke queue operations on a dead queue
blk-mq-debugfs: get rid of a bunch of boilerplate
blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
blk-mq-debugfs: don't open code strstrip()
blk-mq-debugfs: error on long write to queue "state" file
blk-mq-debugfs: clean up flag definitions
blk-mq-debugfs: separate flags with |
nfs: Fix bdi handling for cloned superblocks
block/mq: Cure cpu hotplug lock inversion
lightnvm: fix bad back free on error path
lightnvm: create cmd before allocating request
blk-mq: don't use sync workqueue flushing from drivers
mtip32xx: convert internal commands to regular block infrastructure
mtip32xx: cleanup internal tag assumptions
block: don't call blk_mq_quiesce_queue() after queue is frozen
...
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.
Change summary:
- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.
This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.
- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.
- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.
- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
- commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang <yizhan@redhat.com>
- commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...
Support disabling automatic exclusive lock transfers to allow users
to be in charge of which node should own the lock while being able to
reuse exclusive lock's built-in blacklist/break-lock functionality.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Right now it's just 0, but "no automatic exclusive lock transfers" mode
code will need -EROFS.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Currently the exclusive lock is acquired only if the mapping is
writable, i.e. an image HEAD mapped in rw mode. This means that we
don't acquire the lock for executing a read from a snapshot or an image
HEAD mapped in ro mode, even if lock_on_read is set. This is somewhat
weird and inconsistent with "no automatic exclusive lock transfers"
mode, where the lock is acquired unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
In preparation for supporting set_cookie method (or rather set_cookie
fallback for older OSDs), store the lock cookie on lock and use it on
unlock instead of recalculating from rbd_dev->watch_cookie.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Currently the lock_state is set to UNLOCKED (preventing further I/O),
but RELEASED_LOCK notification isn't sent. Be consistent with userspace
and treat ceph_cls_unlock() errors as the image is unlocked.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
add_disk() takes an extra reference on disk->queue, which is put in
put_disk() -> disk_release(). Avoiding blk_cleanup_queue() (which also
puts the queue) until add_disk() sets GENHD_FL_UP works for the queue
itself, but leaks various queue internals. Conditioning tag_set freeing
on GENHD_FL_UP is wrong too: all error paths after rbd_init_disk() leak
the tag_set.
Move the final "announce" steps out of rbd_dev_device_setup() so that
it can be unwound like any other function. Leave "announce" steps to
do_rbd_add/remove().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
rbd_dev->disk tear down vs rbd_watch_cb() race shouldn't be a problem
anymore thanks to EXISTS and REMOVING checks in rbd_dev_update_size().
A similar race could occur on "rbd map", see commit 811c668877
("rbd: fix rbd map vs notify races").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
No reason to hide CephFS-specific features in the rbd case. Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Patch series "zram clean up", v2.
This patchset aims to clean up zram .
[1] clean up multiple pages's bvec handling.
[2] clean up partial IO handling
[3-6] clean up zram via using accessor and removing pointless structure.
With [2-6] applied, we can get a few hundred bytes as well as huge
readibility enhance.
x86: 708 byte save
add/remove: 1/1 grow/shrink: 0/11 up/down: 478/-1186 (-708)
function old new delta
zram_special_page_read - 478 +478
zram_reset_device 317 314 -3
mem_used_max_store 131 128 -3
compact_store 96 93 -3
mm_stat_show 203 197 -6
zram_add 719 712 -7
zram_slot_free_notify 229 214 -15
zram_make_request 819 803 -16
zram_meta_free 128 111 -17
zram_free_page 180 151 -29
disksize_store 432 361 -71
zram_decompress_page.isra 504 - -504
zram_bvec_rw 2592 2080 -512
Total: Before=25350773, After=25350065, chg -0.00%
ppc64: 231 byte save
add/remove: 2/0 grow/shrink: 1/9 up/down: 681/-912 (-231)
function old new delta
zram_special_page_read - 480 +480
zram_slot_lock - 200 +200
vermagic 39 40 +1
mm_stat_show 256 248 -8
zram_meta_free 200 184 -16
zram_add 944 912 -32
zram_free_page 348 308 -40
disksize_store 572 492 -80
zram_decompress_page 664 564 -100
zram_slot_free_notify 292 160 -132
zram_make_request 1132 1000 -132
zram_bvec_rw 2768 2396 -372
Total: Before=17565825, After=17565594, chg -0.00%
This patch (of 6):
Johannes Thumshirn reported system goes the panic when using NVMe over
Fabrics loopback target with zram.
The reason is zram expects each bvec in bio contains a single page
but nvme can attach a huge bulk of pages attached to the bio's bvec
so that zram's index arithmetic could be wrong so that out-of-bound
access makes system panic.
[1] in mainline solved solved the problem by limiting max_sectors with
SECTORS_PER_PAGE but it makes zram slow because bio should split with
each pages so this patch makes zram aware of multiple pages in a bvec
so it could solve without any regression(ie, bio split).
[1] 0bc315381f, zram: set physical queue limits to avoid array out of
bounds accesses
Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 1647b9b9 "brd: add dax_operations support" introduced the allocation
and freeing of a dax_device, but the allocated dax_device is not stored
into the brd_device, so brd_del_one() will eventually operate on an
uninitialized brd->dax_dev.
Fix this by storing the allocated dax_device to brd->dax_dev.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>