Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This contains:
- NVMe pull request from Christoph, one uuid attribute fix, and one
fix for the controller memory buffer address for remapped BARs.
- use-after-free fix for bsg, from Benjamin Block.
- bcache race/use-after-free fix for a list traversal, fixing a
regression in this merge window. From Coly Li.
- null_blk change configfs dependency change from a 'depends' to a
'select'. This is a change from this merge window as well. From me.
- nbd signal fix from Josef, fixing a regression introduced with the
status code changes.
- nbd MAINTAINERS mailing list entry update.
- blk-throttle stall fix from Joseph Qi.
- blk-mq-debugfs fix from Omar, fixing an issue where we don't
register the IO scheduler debugfs directory, if the driver is
loaded with it. Only shows up if you switch through the sysfs
interface"
* 'for-linus' of git://git.kernel.dk/linux-block:
bsg-lib: fix use-after-free under memory-pressure
nvme-pci: Use PCI bus address for data/queues in CMB
blk-mq-debugfs: fix device sched directory for default scheduler
null_blk: change configfs dependency to select
blk-throttle: fix possible io stall when upgrade to max
MAINTAINERS: update list for NBD
nbd: fix -ERESTARTSYS handling
nvme: fix visibility of "uuid" ns attribute
bcache: use llist_for_each_entry_safe() in __closure_wake_up()
Pull device mapper fixes from Mike Snitzer:
- a stable fix for the alignment of the event number reported at the
end of the 'DM_LIST_DEVICES' ioctl.
- a couple stable fixes for the DM crypt target.
- a DM raid health status reporting fix.
* tag 'for-4.14/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm raid: fix incorrect status output at the end of a "recover" process
dm crypt: reject sector_size feature if device length is not aligned to it
dm crypt: fix memory leak in crypt_ctr_cipher_old()
dm ioctl: fix alignment of event number in the device list
There are three important fields that indicate the overall health and
status of an array: dev_health, sync_ratio, and sync_action. They tell
us the condition of the devices in the array, and the degree to which
the array is synchronized.
This commit fixes a condition that is reported incorrectly. When a member
of the array is being rebuilt or a new device is added, the "recover"
process is used to synchronize it with the rest of the array. When the
process is complete, but the sync thread hasn't yet been reaped, it is
possible for the state of MD to be:
mddev->recovery = [ MD_RECOVERY_RUNNING MD_RECOVERY_RECOVER MD_RECOVERY_DONE ]
curr_resync_completed = <max dev size> (but not MaxSector)
and all rdevs to be In_sync.
This causes the 'array_in_sync' output parameter that is passed to
rs_get_progress() to be computed incorrectly and reported as 'false' --
or not in-sync. This in turn causes the dev_health status characters to
be reported as all 'a', rather than the proper 'A'.
This can cause erroneous output for several seconds at a time when tools
will want to be checking the condition due to events that are raised at
the end of a sync process. Fix this by properly calculating the
'array_in_sync' return parameter in rs_get_progress().
Also, remove an unnecessary intermediate 'recovery_cp' variable in
rs_get_progress().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a crypt mapping uses optional sector_size feature, additional
restrictions to mapped device segment size must be applied in
constructor, otherwise the device activation will fail later.
Fixes: 8f0009a225 ("dm crypt: optionally support larger encryption sector size")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
static checker reports a potential integer overflow. Cap the worker count to
avoid the overflow.
Reported:-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
raid_map calls pers->make_request, which missed the suspend check. Fix it with
the new md_handle_request API.
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
md_submit_flush_data calls pers->make_request, which missed the suspend check.
Fix it with the new md_handle_request API.
Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
With commit cc27b0c78c, pers->make_request could bail out without handling
the bio. If that happens, we should retry. The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.
Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c78c79(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
API") replaces the following while loop by llist_for_each_entry(),
-
- while (reverse) {
- cl = container_of(reverse, struct closure, list);
- reverse = llist_next(reverse);
-
+ llist_for_each_entry(cl, reverse, list) {
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
This modification introduces a potential race by iterating a corrupted
list. Here is how it happens.
In the above modification, closure_sub() may wake up a process which is
waiting on reverse list. If this process decides to wait again by calling
closure_wait(), its cl->list will be added to another wait list. Then
when llist_for_each_entry() continues to iterate next node, it will travel
on another new wait list which is added in closure_wait(), not the
original reverse list in __closure_wake_up(). It is more probably to
happen on UP machine because the waked up process may preempt the process
which wakes up it.
Use llist_for_each_entry_safe() will fix the issue, the safe version fetch
next node before waking up a process. Then the copy of next node will make
sure list iteration stays on original reverse list.
Fixes: 09b3efec81 ("bcache: Don't reinvent the wheel but use existing llist API")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The size of struct dm_name_list is different on 32-bit and 64-bit
kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).
This mismatch caused some harmless difference in padding when using 32-bit
or 64-bit kernel. Commit 23d70c5e52 ("dm ioctl: report event number in
DM_LIST_DEVICES") added reporting event number in the output of
DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
userspace to determine the location of the event number (the location
would be different when running on 32-bit and 64-bit kernels).
Fix the padding by using offsetof(struct dm_name_list, name) instead of
sizeof(struct dm_name_list) to determine the location of entries.
Also, the ioctl version number is incremented to 37 so that userspace
can use the version number to determine that the event number is present
and correctly located.
In addition, a global event is now raised when a DM device is created,
removed, renamed or when table is swapped, so that the user can monitor
for device changes.
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Fixes: 23d70c5e52 ("dm ioctl: report event number in DM_LIST_DEVICES")
Cc: stable@vger.kernel.org # 4.13
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull MD fixes from Shaohua Li:
"Two small patches to fix long-lived raid5 stripe batch bugs, one from
Dennis and the other from me"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
md/raid5: fix a race condition in stripe batch
Pull device mapper updates from Mike Snitzer:
- Some request-based DM core and DM multipath fixes and cleanups
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM
integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush
abstraction that was stood up for DM's use.
* tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dax: remove the pmem_dax_ops->flush abstraction
dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
dm integrity: make blk_integrity_profile structure const
dm integrity: do not check integrity for failed read operations
dm log writes: fix >512b sectorsize support
dm log writes: don't use all the cpu while waiting to log blocks
dm ioctl: constify ioctl lookup table
dm: constify argument arrays
dm integrity: count and display checksum failures
dm integrity: optimize writing dm-bufio buffers that are partially changed
dm rq: do not update rq partially in each ending bio
dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
dm mpath: complain about unsupported __multipath_map_bio() return values
dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
Commit abebfbe2f7 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.
It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().
It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.
Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.
Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The new lockdep support for completions causeed the stack usage
in dm-integrity to explode, in case of write_journal from 504 bytes
to 1120 (using arm gcc-7.1.1):
drivers/md/dm-integrity.c: In function 'write_journal':
drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
The problem is that not only the size of 'struct completion' grows
significantly, but we end up having multiple copies of it on the stack
when we assign it from a local variable after the initial declaration.
COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we
want to declare and initialize a completion on the stack. However,
this driver doesn't do that and instead initializes the completion
just before it is used.
In this case, init_completion() does the same thing more efficiently,
and drops the stack usage for the function above down to 496 bytes.
While the other functions in this file are not bad enough to cause
a warning, they benefit equally from the change, so I do the change
across the entire file. In the one place where we reuse a completion,
I picked the cheaper reinit_completion() over init_completion().
Fixes: cd8084f91c ("locking/lockdep: Apply crossrelease to completions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make this structure const as it is only stored in the profile field of a
blk_integrity structure. This field is of type const, so make structure
as const.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Even though read operations fail, dm_integrity_map_continue() calls
integrity_metadata() to check integrity. In this case, just complete
these.
This also makes it so read I/O errors do not generate integrity warnings
in the kernel log.
Cc: stable@vger.kernel.org
Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
512b sectors vs device's physical sectorsize was not maintained
consistently and as such the support for >512b sector devices has bugs.
The log metadata expects native sectorsize but 512b sectors were being
stored. Also, device's sectorsize was assumed when assigning the
bi_sector for blocks that were being logged.
Fix this up by adding two helpers to convert between bio and dev
sectors, and use these in the appropriate places to fix the problem and
make it clear which units go where. Doing so allows dm-log-writes use
with 4k devices.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The check to see if the logging kthread needs to go to sleep is wrong,
it checks lc->pending_blocks, which will be non-0 if there are any
blocks that are pending, whether they are ready to be logged or not.
What we really want is to go to sleep until it's time to log blocks, so
change this check so we do actually go to sleep in between flushes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull followup block layer updates from Jens Axboe:
"I ended up splitting the main pull request for this series into two,
mainly because of clashes between NVMe fixes that went into 4.13 after
the for-4.14 branches were split off. This pull request is mostly
NVMe, but not exclusively. In detail, it contains:
- Two pull request for NVMe changes from Christoph. Nothing new on
the feature front, basically just fixes all over the map for the
core bits, transport, rdma, etc.
- Series from Bart, cleaning up various bits in the BFQ scheduler.
- Series of bcache fixes, which has been lingering for a release or
two. Coly sent this in, but patches from various people in this
area.
- Set of patches for BFQ from Paolo himself, updating both
documentation and fixing some corner cases in performance.
- Series from Omar, attempting to now get the 4k loop support
correct. Our confidence level is higher this time.
- Series from Shaohua for loop as well, improving O_DIRECT
performance and fixing a use-after-free"
* 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits)
bcache: initialize dirty stripes in flash_dev_run()
loop: set physical block size to logical block size
bcache: fix bch_hprint crash and improve output
bcache: Update continue_at() documentation
bcache: silence static checker warning
bcache: fix for gc and write-back race
bcache: increase the number of open buckets
bcache: Correct return value for sysfs attach errors
bcache: correct cache_dirty_target in __update_writeback_rate()
bcache: gc does not work when triggering by manual command
bcache: Don't reinvent the wheel but use existing llist API
bcache: do not subtract sectors_to_gc for bypassed IO
bcache: fix sequential large write IO bypass
bcache: Fix leak of bdev reference
block/loop: remove unused field
block/loop: fix use after free
bfq: Use icq_to_bic() consistently
bfq: Suppress compiler warnings about comparisons
bfq: Check kstrtoul() return value
bfq: Declare local functions static
...
Pull MD updates from Shaohua Li:
"This update mainly fixes bugs:
- Make raid5 ppl support several ppl from Pawel
- Several raid5-cache bug fixes from Song
- Bitmap fixes from Neil and Me
- One raid1/10 regression fix since 4.12 from Me
- Other small fixes and cleanup"
* tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/bitmap: disable bitmap_resize for file-backed bitmaps.
raid5-ppl: Recovery support for multiple partial parity logs
md: Runtime support for multiple ppls
md/raid0: attach correct cgroup info in bio
lib/raid6: align AVX512 constants to 512 bits, not bytes
raid5: remove raid5_build_block
md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
md: replace seq_release_private with seq_release
md: notify about new spare disk in the container
md/raid1/10: reset bio allocated from mempool
md/raid5: release/flush io in raid5_do_work()
md/bitmap: copy correct data for bitmap super
bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.
Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.
This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().
(Commit log is composed by Coly Li)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
Most importantly, solve a crash where %llu was used to format signed
numbers. This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.
Always use the units mechanism rather than having different output
paths for simplicity.
Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100. (Remainders of >= 1000 would print as .10).
Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown. Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>