[ Upstream commit 2d669ceb69 ]
Commit 24f6b6036c ("dm table: fix zoned iterate_devices based device
capability checks") triggered dm table load failure when dm-zoned device
is set up for zoned block devices and a regular device for cache.
The commit inverted logic of two callback functions for iterate_devices:
device_is_zoned_model() and device_matches_zone_sectors(). The logic of
device_is_zoned_model() was inverted then all destination devices of all
targets in dm table are required to have the expected zoned model. This
is fine for dm-linear, dm-flakey and dm-crypt on zoned block devices
since each target has only one destination device. However, this results
in failure for dm-zoned with regular cache device since that target has
both regular block device and zoned block devices.
As for device_matches_zone_sectors(), the commit inverted the logic to
require all zoned block devices in each target have the specified
zone_sectors. This check also fails for regular block device which does
not have zones.
To avoid the check failures, fix the zone model check and the zone
sectors check. For zone model check, introduce the new feature flag
DM_TARGET_MIXED_ZONED_MODEL, and set it to dm-zoned target. When the
target has this flag, allow it to have destination devices with any
zoned model. For zone sectors check, skip the check if the destination
device is not a zoned block device. Also add comments and improve an
error message to clarify expectations to the two checks.
Fixes: 24f6b6036c ("dm table: fix zoned iterate_devices based device capability checks")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a666e5c05e upstream.
The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.
This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target set
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
Non-swap bios are not affected becuase taking the semaphore would cause
performance degradation.
This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 5a6c35f9af ("block: remove direct_make_request") there
is no benefit to DM special-casing NVMe. Remove all code used to
establish DM_TYPE_NVME_BIO_BASED.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add DM target feature flag DM_TARGET_NOWAIT which advertises that
target works with REQ_NOWAIT bios.
Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
Commit adc0daad36 ("dm: report suspended
device during destroy") broke integrity recalculation.
The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.
To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().
Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad36 ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch to use dynamic debug to avoid having recompile the kernel
just to enable debugging messages.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The HST path selector needs this information to perform path
prediction. For request-based mpath, struct request's io_start_time_ns
is used, while for bio-based, use the start_time stored in dm_io.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull device mapper updates from Mike Snitzer:
- Fix DM core to disallow stacking request-based DM on partitions.
- Fix DM raid target to properly resync raidset even if bitmap needed
additional pages.
- Fix DM crypt performance regression due to use of WQ_HIGHPRI for the
IO and crypt workqueues.
- Fix DM integrity metadata layout that was aligned on 128K boundary
rather than the intended 4K boundary (removes 124K of wasted space
for each metadata block).
- Improve the DM thin, cache and clone targets to use spin_lock_irq
rather than spin_lock_irqsave where possible.
- Fix DM thin single thread performance that was lost due to needless
workqueue wakeups.
- Fix DM zoned target performance that was lost due to excessive
backing device checks.
- Add ability to trigger write failure with the DM dust test target.
- Fix whitespace indentation in drivers/md/Kconfig.
- Various smalls fixes and cleanups (e.g. use struct_size, fix
uninitialized variable, variable renames, etc).
* tag 'for-5.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
dm: Fix Kconfig indentation
dm thin: wakeup worker only when deferred bios exist
dm integrity: fix excessive alignment of metadata runs
dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout
dm zoned: reduce overhead of backing device checks
dm dust: add limited write failure mode
dm dust: change ret to r in dust_map_read and dust_map
dm dust: change result vars to r
dm cache: replace spin_lock_irqsave with spin_lock_irq
dm bio prison: replace spin_lock_irqsave with spin_lock_irq
dm thin: replace spin_lock_irqsave with spin_lock_irq
dm clone: add bucket_lock_irq/bucket_unlock_irq helpers
dm clone: replace spin_lock_irqsave with spin_lock_irq
dm writecache: handle REQ_FUA
dm writecache: fix uninitialized variable warning
dm stripe: use struct_size() in kmalloc()
dm raid: streamline rs_get_progress() and its raid_status() caller side
dm raid: simplify rs_setup_recovery call chain
dm raid: to ensure resynchronization, perform raid set grow in preresume
...
Avoid the need to allocate a potentially large array of struct blk_zone
in the block layer by switching the ->report_zones method interface to
a callback model. Now the caller simply supplies a callback that is
executed on each reported zone, and private data for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct stripe_c {
...
struct stripe stripe[0];
};
In this case alloc_context() and dm_array_too_big() are removed and
replaced by the direct use of the struct_size() helper in kmalloc().
Notice that open-coded form is prone to type mistakes.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull more device mapper updates from Mike Snitzer:
- Fix zone state management race in DM zoned target by eliminating the
unnecessary DMZ_ACTIVE state.
- A couple fixes for issues the DM snapshot target's optional discard
support added during first week of the 5.3 merge.
- Increase default size of outstanding IO that is allowed for a each
dm-kcopyd client and introduce tunable to allow user adjust.
- Update DM core to use printk ratelimiting functions rather than
duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
messages.
* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: use printk ratelimiting functions
dm kcopyd: Increase default sub-job size to 512KB
dm snapshot: fix oversights in optional discard support
dm zoned: fix zone state management race
DM provided its own ratelimiting printk wrapper but given printk
advances this is no longer needed.
Also, switching DMDEBUG_LIMIT to using pr_debug_ratelimited() fixes the
reported issue where DMDEBUG_LIMIT() still caused a flood of "callbacks
suppressed" messages.
Reported-by: Milan Broz <gmazyland@gmail.com>
Depends-on: 29fc2bc753 ("printk: pr_debug_ratelimited: check state first to reduce "callbacks suppressed" messages")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE. That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.
In fact, even without commit 396eaf21ee, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.
The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice. In addition,
queue-length path selector will also be affected.
To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue. map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add a "create" module parameter, which allows device-mapper targets to
be configured at boot time. This enables early use of DM targets in the
boot process (as the root device or otherwise) without the need of an
initramfs.
The syntax used in the boot param is based on the concise format from
the dmsetup tool to follow the rule of least surprise:
dmsetup table --concise /dev/mapper/lroot
Which is:
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
Where,
<name> ::= The device name.
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
<minor> ::= The device minor number | ""
<flags> ::= "ro" | "rw"
<table> ::= <start_sector> <num_sectors> <target_type> <target_args>
<target_type> ::= "verity" | "linear" | ...
For example, the following could be added in the boot parameters:
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
Only the targets that were tested are allowed and the ones that don't
change any block device when the device is create as read-only. For
example, mirror and cache targets are not allowed. The rationale behind
this is that if the user makes a mistake, choosing the wrong device to
be the mirror or the cache can corrupt data.
The only targets initially allowed are:
* crypt
* delay
* linear
* snapshot-origin
* striped
* verity
Co-developed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A dm-raid array with devices larger than 4GB won't assemble on
a 32 bit host since _check_data_dev_sectors() was added in 4.16.
This is because to_sector() treats its argument as an "unsigned long"
which is 32bits (4GB) on a 32bit host. Using "unsigned long long"
is more correct.
Kernels as early as 4.2 can have other problems due to to_sector()
being used on the size of a device.
Fixes: 0cf4503174 ("dm raid: add support for the MD RAID0 personality")
cc: stable@vger.kernel.org (v4.2+)
Reported-and-tested-by: Guillaume Perréal <gperreal@free.fr>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits. A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull device mapper updates from Mike Snitzer:
- The biggest change this cycle is to remove support for the legacy IO
path (.request_fn) from request-based DM.
Jens has already started preparing for complete removal of the legacy
IO path in 4.21 but this earlier removal of support from DM has been
coordinated with Jens (as evidenced by the commit being attributed to
him).
Making request-based DM exclussively blk-mq only cleans up that
portion of DM core quite nicely.
- Convert the thinp and zoned targets over to using refcount_t where
applicable.
- A couple fixes to the DM zoned target for refcounting and other races
buried in the implementation of metadata block creation and use.
- Small cleanups to remove redundant unlikely() around a couple
WARN_ON_ONCE().
- Simplify how dm-ioctl copies from userspace, eliminating some
potential for a malicious user trying to change the executed ioctl
after its processing has begun.
- Tweaked DM crypt target to use the DM device name when naming the
various workqueues created for a particular DM crypt device (makes
the N workqueues for a DM crypt device more easily understood and
enhances user's accounting capabilities at a glance via "ps")
- Small fixup to remove dead branch in DM writecache's memory_entry().
* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm writecache: remove disabled code in memory_entry()
dm zoned: fix various dmz_get_mblock() issues
dm zoned: fix metadata block ref counting
dm raid: avoid bitmap with raid4/5/6 journal device
dm crypt: make workqueue names device-specific
dm: add dm_table_device_name()
dm ioctl: harden copy_params()'s copy_from_user() from malicious users
dm: remove unnecessary unlikely() around WARN_ON_ONCE()
dm zoned: target: use refcount_t for dm zoned reference counters
dm thin: use refcount_t for thin_c reference counting
dm table: require that request-based DM be layered on blk-mq devices
dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
dm: remove legacy request-based IO path
Dispatching a report zones command through the request queue is a major
pain due to the command reply payload rewriting necessary. Given that
blkdev_report_zones() is executing everything synchronously, implement
report zones as a block device file operation instead, allowing major
simplification of the code in many places.
sd, null-blk, dm-linear and dm-flakey being the only block device
drivers supporting exposing zoned block devices, these drivers are
modified to provide the device side implementation of the
report_zones() block device file operation.
For device mappers, a new report_zones() target type operation is
defined so that the upper block layer calls blkdev_report_zones() can
be propagated down to the underlying devices of the dm targets.
Implementation for this new operation is added to the dm-linear and
dm-flakey targets.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Changed method block_device argument to gendisk
* Various bug fixes and improvements
* Added support for null_blk, dm-linear and dm-flakey.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that request-based DM is only using blk-mq, there is no need to
differentiate between legacy "rq" and new "mq". We're back to a single
request-based DM -- and there was much rejoicing!
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Similar to the ->copy_from_iter() operation, a platform may want to
deploy an architecture or device specific routine for handling reads
from a dax_device like /dev/pmemX. On x86 this routine will point to a
machine check safe version of copy_to_iter(). For now, add the plumbing
to device-mapper and the dax core.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>