Commit adc0daad36 ("dm: report suspended
device during destroy") broke integrity recalculation.
The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.
To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().
Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad36 ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Only triggering reclaim based on the percentage of unmapped cache
zones can fail to detect cases where reclaim is needed, e.g. if the
target has only 2 or 3 cache zones and only one unmapped cache zone,
the percentage of free cache zones is higher than
DMZ_RECLAIM_LOW_UNMAP_ZONES (30%) and reclaim does not trigger.
This problem, combined with the fact that dmz_schedule_reclaim() is
called from dmz_handle_bio() without the map lock held, leads to a
race between zone allocation and dmz_should_reclaim() result.
Depending on the workload applied, this race can lead to the write
path waiting forever for a free zone without reclaim being triggered.
Fix this by moving dmz_schedule_reclaim() inside dmz_alloc_zone()
under the map lock. This results in checking the need for zone reclaim
whenever a new data or buffer zone needs to be allocated.
Also fix dmz_reclaim_percentage() to always return 0 if the number of
unmapped cache (or random) zones is less than or equal to 1.
Suggested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix unused but set variable warnings:
drivers/md/dm-zoned-reclaim.c:504:42: warning:
variable nr_rnd set but not used [-Wunused-but-set-variable]
504 | unsigned int p_unmap, nr_unmap_rnd = 0, nr_rnd = 0;
| ^~~~~~
drivers/md/dm-zoned-reclaim.c:504:24: warning:
variable nr_unmap_rnd set but not used [-Wunused-but-set-variable]
504 | unsigned int p_unmap, nr_unmap_rnd = 0, nr_rnd = 0;
| ^~~~~~~~~~~~
Fixes: f97809aec5 ("dm zoned: per-device reclaim")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail. Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Given request-based DM now uses blk-mq's blk_mq_queue_inflight() to
determine if outstanding IO has completed (and DM has no control over
the blk-mq state machine used to track outstanding IO) it is unsafe to
wakeup waiter (dm_wait_for_completion) before blk-mq has cleared a
request's state bits (e.g. MQ_RQ_IN_FLIGHT or MQ_RQ_COMPLETE). As
such dm_wait_for_completion() could be left to wait indefinitely if no
other requests complete.
Fix this by eliminating request-based DM's use of waitqueue to wait
for blk-mq requests to complete in dm_wait_for_completion.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Depends-on: 3c94d83cb3 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull device mapper fixes from Mike Snitzer:
- Quite a few DM zoned target fixes and a Zone append fix in DM core.
Considering the amount of dm-zoned changes that went in during the
5.8 merge window these fixes are not that surprising.
- A few DM writecache target fixes.
- A fix to Documentation index to include DM ebs target docs.
- Small cleanup to use struct_size() in DM core's retrieve_deps().
* tag 'for-5.8/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm writecache: add cond_resched to loop in persistent_memory_claim()
dm zoned: Fix reclaim zone selection
dm zoned: Fix random zone reclaim selection
dm: update original bio sector on Zone Append
dm zoned: Fix metadata zone size check
docs: device-mapper: add dm-ebs.rst to an index file
dm ioctl: use struct_size() helper in retrieve_deps()
dm writecache: skip writecache_wait when using pmem mode
dm writecache: correct uncommitted_block when discarding uncommitted entry
dm zoned: assign max_io_len correctly
dm zoned: fix uninitialized pointer dereference
Add cond_resched() to a loop that fills in the mapper memory area
because the loop can be executed many times.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When dm zoned has multiple devices, random zones are never selected for
reclaim if all reserved sequential write zones are in use and no
sequential write required zones can be selected for reclaim. This can
lead to deadlocks as selecting a cache zone allows reclaiming a
sequential zone, ensuring forward progress.
Fix this by always defaulting to selecting a random zone when no
sequential write required zone can be selected.
[Damien: fix commit message]
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 2094045fe5 ("dm zoned: prefer full zones for reclaim")
modified dmz_get_rnd_zone_for_reclaim() to add a search for the buffer
zone with the heaviest weight as an optimal candidate for reclaim. This
modification uses the zone pointer variabl "last" which is set only once
and never modified as zones are scanned, resulting in the search being
inefective. Furthermore, if the selected buffer zone at the end of the
search loop is active or already locked for reclaim,
dmz_get_rnd_zone_for_reclaim() returns NULL even if other random zones
with a lesser weight can be reclaimed.
To fix the search and to guarantee that reclaim can make forward
progress, fix dmz_get_rnd_zone_for_reclaim() loop to correctly find
the buffer zone with the heaviest weight using the variable maxw_z.
Also make sure to fallback to finding the first random zone that can
be reclaimed if this best candidate zone cannot be reclaimed.
While at it, also fix the device index check to consider only random
zones, ignoring cache zones belonging to the cache device if one is
used as that device does not have a reclaim process.
Fixes: 2094045fe5 ("dm zoned: prefer full zones for reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Naohiro reported that issuing zone-append bios to a zoned block device
underneath a dm-linear device does not work as expected.
This because we forgot to reverse-map the sector the device wrote to the
original bio.
For zone-append bios, get the offset in the zone of the written sector
from the clone bio and add that to the original bio's sector position.
Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When dm zoned has multiple devices, metadata is on the cache device, not
in random zones of the zoned devices. Then the number of metadata zones
shall be checked with the number of cache zones, not random zones.
Fixes: 34f5affd04 ("dm zoned: separate random and cache zones")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct dm_target_deps {
...
__u64 dev[0]; /* out */
};
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The array bio_in_progress is only used with ssd mode. So skip
writecache_wait_for_ios in writecache_discard when pmem mode.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make sure that the local variable rzone in dmz_do_reclaim() is always
initialized before being used for printing debug messages.
Fixes: f97809aec5 ("dm zoned: per-device reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
scripts/checkpatch.pl reports following warning for patch
("bcache: check and adjust logical block size for backing devices"),
WARNING: quoted string split across lines
#146: FILE: drivers/md/bcache/super.c:896:
+ pr_info("%s: sb/logical block size (%u) greater than page size "
+ "(%lu) falling back to device logical block size (%u)",
There are two things to fix up,
- The kernel message print should be in a single line.
- pr_info() won't automatically add new line since v5.8, a '\n' should
be added.
This patch just does the above cleanup in bcache_device_init().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch changes the asynchronous registration kworker to a delayed
kworker. There is probability queue_work() queues the async registration
kworker to the same CPU (even though very little), then the process
which writing sysfs interface to reigster bcache device may won't return
immeidately. queue_delayed_work() in this patch will delay 10 jiffies
before insert the kworker to run queue, which makes sure the registering
process may always returns to user space in time.
Fixes: 9e23ccf8f0 ("bcache: asynchronous devices registration")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's possible for a block driver to set logical block size to
a value greater than page size incorrectly; e.g. bcache takes
the value from the superblock, set by the user w/ make-bcache.
This causes a BUG/NULL pointer dereference in the path:
__blkdev_get()
-> set_init_blocksize() // set i_blkbits based on ...
-> bdev_logical_block_size()
-> queue_logical_block_size() // ... this value
-> bdev_disk_changed()
...
-> blkdev_readpage()
-> block_read_full_page()
-> create_page_buffers() // size = 1 << i_blkbits
-> create_empty_buffers() // give size/take pointer
-> alloc_page_buffers() // return NULL
.. BUG!
Because alloc_page_buffers() is called with size > PAGE_SIZE,
thus it initializes head = NULL, skips the loop, return head;
then create_empty_buffers() gets (and uses) the NULL pointer.
This has been around longer than commit ad6bf88a6c ("block:
fix an integer overflow in logical block size"); however, it
increased the range of values that can trigger the issue.
Previously only 8k/16k/32k (on x86/4k page size) would do it,
as greater values overflow unsigned short to zero, and queue_
logical_block_size() would then use the default of 512.
Now the range with unsigned int is much larger, and users w/
the 512k value, which happened to be zero'ed previously and
work fine, started to hit this issue -- as the zero is gone,
and queue_logical_block_size() does return 512k (>PAGE_SIZE.)
Fix this by checking the bcache device's logical block size,
and if it's greater than page size, fallback to the backing/
cached device's logical page size.
This doesn't affect cache devices as those are still checked
for block/page size in read_super(); only the backing/cached
devices are not.
Apparently it's a regression from commit 2903381fce ("bcache:
Take data offset from the bdev superblock."), moving the check
into BCACHE_SB_VERSION_CDEV only. Now that we have superblocks
of backing devices out there with this larger value, we cannot
refuse to load them (i.e., have a similar check in _BDEV.)
Ideally perhaps bcache should use all values from the backing
device (physical/logical/io_min block size)? But for now just
fix the problematic case.
Test-case:
# IMG=/root/disk.img
# dd if=/dev/zero of=$IMG bs=1 count=0 seek=1G
# DEV=$(losetup --find --show $IMG)
# make-bcache --bdev $DEV --block 8k
< see dmesg >
Before:
# uname -r
5.7.0-rc7
[ 55.944046] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[ 55.949742] CPU: 3 PID: 610 Comm: bcache-register Not tainted 5.7.0-rc7 #4
...
[ 55.952281] RIP: 0010:create_empty_buffers+0x1a/0x100
...
[ 55.966434] Call Trace:
[ 55.967021] create_page_buffers+0x48/0x50
[ 55.967834] block_read_full_page+0x49/0x380
[ 55.972181] do_read_cache_page+0x494/0x610
[ 55.974780] read_part_sector+0x2d/0xaa
[ 55.975558] read_lba+0x10e/0x1e0
[ 55.977904] efi_partition+0x120/0x5a6
[ 55.980227] blk_add_partitions+0x161/0x390
[ 55.982177] bdev_disk_changed+0x61/0xd0
[ 55.982961] __blkdev_get+0x350/0x490
[ 55.983715] __device_add_disk+0x318/0x480
[ 55.984539] bch_cached_dev_run+0xc5/0x270
[ 55.986010] register_bcache.cold+0x122/0x179
[ 55.987628] kernfs_fop_write+0xbc/0x1a0
[ 55.988416] vfs_write+0xb1/0x1a0
[ 55.989134] ksys_write+0x5a/0xd0
[ 55.989825] do_syscall_64+0x43/0x140
[ 55.990563] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 55.991519] RIP: 0033:0x7f7d60ba3154
...
After:
# uname -r
5.7.0.bcachelbspgsz
[ 31.672460] bcache: bcache_device_init() bcache0: sb/logical block size (8192) greater than page size (4096) falling back to device logical block size (512)
[ 31.675133] bcache: register_bdev() registered backing device loop0
# grep ^ /sys/block/bcache0/queue/*_block_size
/sys/block/bcache0/queue/logical_block_size:512
/sys/block/bcache0/queue/physical_block_size:8192
Reported-by: Ryan Finnie <ryan@finnie.org>
Reported-by: Sebastian Marsching <sebastian@marsching.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
coccicheck reports:
drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417
In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->
write_lock for freeing new_nodes[i] before return.
btree_gc_coalesce func details as follows:
if alloc new_nodes[i] fails:
goto out_nocoalesce;
// obtain new_nodes[i]->write_lock
mutex_lock(&new_nodes[i]->write_lock)
// main coalescing process
for (i = nodes - 1; i > 0; --i)
[snipped]
if coalescing process fails:
// Here, directly goto out_nocoalesce
// tag will cause a deadlock
goto out_nocoalesce;
[snipped]
// release new_nodes[i]->write_lock
mutex_unlock(&new_nodes[i]->write_lock)
// coalesing succ, return
return;
out_nocoalesce:
btree_node_free(new_nodes[i]) // free new_nodes[i]
// obtain new_nodes[i]->write_lock
mutex_lock(&new_nodes[i]->write_lock);
// set flag for reuse
clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
// release new_nodes[i]->write_lock
mutex_unlock(&new_nodes[i]->write_lock);
To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]->write_lock before free new_nodes[i] in
out_nocoalesce tag.
(Coly Li helps to clean up commit log format.)
Fixes: 2a285686c1 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull device mapper updates from Mike Snitzer:
- The largest change for this cycle is the DM zoned target's metadata
version 2 feature that adds support for pairing regular block devices
with a zoned device to ease the performance impact associated with
finite random zones of zoned device.
The changes came in three batches: the first prepared for and then
added the ability to pair a single regular block device, the second
was a batch of fixes to improve zoned's reclaim heuristic, and the
third removed the limitation of only adding a single additional
regular block device to allow many devices.
Testing has shown linear scaling as more devices are added.
- Add new emulated block size (ebs) target that emulates a smaller
logical_block_size than a block device supports
The primary use-case is to emulate "512e" devices that have 512 byte
logical_block_size and 4KB physical_block_size. This is useful to
some legacy applications that otherwise wouldn't be able to be used
on 4K devices because they depend on issuing IO in 512 byte
granularity.
- Add discard interfaces to DM bufio. First consumer of the interface
is the dm-ebs target that makes heavy use of dm-bufio.
- Fix DM crypt's block queue_limits stacking to not truncate
logic_block_size.
- Add Documentation for DM integrity's status line.
- Switch DMDEBUG from a compile time config option to instead use
dynamic debug via pr_debug.
- Fix DM multipath target's hueristic for how it manages
"queue_if_no_path" state internally.
DM multipath now avoids disabling "queue_if_no_path" unless it is
actually needed (e.g. in response to configure timeout or explicit
"fail_if_no_path" message).
This fixes reports of spurious -EIO being reported back to userspace
application during fault tolerance testing with an NVMe backend.
Added various dynamic DMDEBUG messages to assist with debugging
queue_if_no_path in the future.
- Add a new DM multipath "Historical Service Time" Path Selector.
- Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.
- Improve DM writecache target performance by using explicit cache
flushing for target's single-threaded usecase and a small cleanup to
remove unnecessary test in persistent_memory_claim.
- Other small cleanups in DM core, dm-persistent-data, and DM
integrity.
* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
dm crypt: avoid truncating the logical block size
dm mpath: add DM device name to Failing/Reinstating path log messages
dm mpath: enhance queue_if_no_path debugging
dm mpath: restrict queue_if_no_path state machine
dm mpath: simplify __must_push_back
dm zoned: check superblock location
dm zoned: prefer full zones for reclaim
dm zoned: select reclaim zone based on device index
dm zoned: allocate zone by device index
dm zoned: support arbitrary number of devices
dm zoned: move random and sequential zones into struct dmz_dev
dm zoned: per-device reclaim
dm zoned: add metadata pointer to struct dmz_dev
dm zoned: add device pointer to struct dm_zone
dm zoned: allocate temporary superblock for tertiary devices
dm zoned: convert to xarray
dm zoned: add a 'reserved' zone flag
dm zoned: improve logging messages for reclaim
dm zoned: avoid unnecessary device recalulation for secondary superblock
dm zoned: add debugging message for reading superblocks
...
queue_limits::logical_block_size got changed from unsigned short to
unsigned int, but it was forgotten to update crypt_io_hints() to use the
new type. Fix it.
Fixes: ad6bf88a6c ("block: fix an integer overflow in logical block size")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>