Pull device mapper fixes from Mike Snitzer:
"Apologies for not pressing this request-based DM partial completion
issue further, it was an oversight on my part. We'll have to get it
fixed up properly and revisit for a future release.
- Revert block and DM core changes the removed request-based DM's
ability to handle partial request completions -- otherwise with the
current SCSI LLDs these changes could lead to silent data
corruption.
- Fix two DM version bumps that were missing from the initial 4.2 DM
pull request (enabled userspace lvm2 to know certain changes have
been made)"
* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache policy smq: fix "default" version to be 1.4.0
dm: bump the ioctl version to 4.32.0
Revert "block, dm: don't copy bios for request clones"
Revert "dm: do not allocate any mempools for blk-mq request-based DM"
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Request-based DM's blk-mq support defaults to off; but a user can easily
change the default using the dm_mod.use_blk_mq module/boot option.
Also, you can check what mode a given request-based DM device is using
with: cat /sys/block/dm-X/dm/use_blk_mq
This change enabled further cleanup and reduced work (e.g. the
md->io_pool and md->rq_pool isn't created if using blk-mq).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise, for sequential workloads, the dm_request_fn can allow
excessive request merging at the expense of increased service time.
Add a per-device sysfs attribute to allow the user to control how long a
request, that is a reasonable merge candidate, can be queued on the
request queue. The resolution of this request dispatch deadline is in
microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline:
echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline
The dm_request_fn's merge heuristic and associated extra accounting is
disabled by default (rq_based_seq_io_merge_deadline is 0).
This sysfs attribute is not applicable to bio-based DM devices so it
will only ever report 0 for them.
By allowing a request to remain on the queue it will block others
requests on the queue. But introducing a short dequeue delay has proven
very effective at enabling certain sequential IO workloads on really
fast, yet IOPS constrained, devices to build up slightly larger IOs --
yielding 90+% throughput improvements. Having precise control over the
time taken to wait for larger requests to build affords control beyond
that of waiting for certain IO sizes to accumulate (which would require
a deadline anyway). This knob will only ever make sense with sequential
IO workloads and the particular value used is storage configuration
specific.
Given the expected niche use-case for when this knob is useful it has
been deemed acceptable to expose this relatively crude method for
crafting optimal IO on specific storage -- especially given the solution
is simple yet effective. In the context of DM multipath, it is
advisable to tune this sysfs attribute to a value that offers the best
performance for the common case (e.g. if 4 paths are expected active,
tune for that; if paths fail then performance may be slightly reduced).
Alternatives were explored to have request-based DM autotune this value
(e.g. if/when paths fail) but they were quickly deemed too fragile and
complex to warrant further design and development time. If this problem
proves more common as faster storage emerges we'll have to look at
elevating a generic solution into the block core.
Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
DM multipath is the only caller of blk_lld_busy() -- which calls a
queue's lld_busy_fn hook. Request-based DM doesn't support stacking
multipath devices so there is no reason to register the lld_busy_fn hook
on a multipath device's queue using blk_queue_lld_busy().
As such, remove functions dm_lld_busy and dm_table_any_busy_target.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise replacing the multipath target with the error target fails:
device-mapper: ioctl: can't change device type after initial table load.
The error target was mistakenly considered to be target type
DM_TYPE_REQUEST_BASED rather than DM_TYPE_MQ_REQUEST_BASED even if the
target it was to replace was of type DM_TYPE_MQ_REQUEST_BASED.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type. Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).
Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path. As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
non-blk-mq devices -- the preallocated clone request is passed in
from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
blk-mq devices -- blk_get_request() and blk_put_request() are used
respectively from these hooks.
dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set. This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.
[This patch was started by Keith and later heavily modified by Mike]
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.
Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared. Add lockdep annotation to dm_suspend() and dm_resume().
The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().
Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend". A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.
But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Until this change, when loading a new DM table, DM core would re-open
all of the devices in the DM table. Now, DM core will avoid redundant
device opens (and closes when destroying the old table) if the old
table already has a device open using the same mode. This is achieved
by managing reference counts on the table_devices that DM core now
stores in the mapped_device structure (rather than in the dm_table
structure). So a mapped_device's active and inactive dm_tables' dm_dev
lists now just point to the dm_devs stored in the mapped_device's
table_devices list.
This improvement in DM core's device reference counting has the
side-effect of fixing a long-standing limitation of the multipath
target: a DM multipath table couldn't include any paths that were unusable
(failed). For example: if all paths have failed and you add a new,
working, path to the table; you can't use it since the table load would
fail due to it still containing failed paths. Now a re-load of a
multipath table can include failed devices and when those devices become
active again they can be used instantly.
The device list code in dm.c isn't a straight copy/paste from the code in
dm-table.c, but it's very close (aside from some variable renames). One
subtle difference is that find_table_device for the tables_devices list
will only match devices with the same name and mode. This is because we
don't want to upgrade a device's mode in the active table when an
inactive table is loaded.
Access to the mapped_device structure's tables_devices list requires a
mutex (tables_devices_lock), so that tables cannot be created and
destroyed concurrently.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The function dm_table_supports_discards is only called from
dm-table.c:dm_table_set_restrictions(). So move it above
dm_table_set_restrictions and make it static.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce dm_table_run_md_queue_async() to run the request_queue of the
mapped_device associated with a request-based DM table.
Also add dm_md_get_queue() wrapper to extract the request_queue from a
mapped_device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Make the function dm_table_alloc_md_mempools static because it is not
called from another file.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
This patch allows the removal of an open device to be deferred until
it is closed. (Previously such a removal attempt would fail.)
The deferred remove functionality is enabled by setting the flag
DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or
DM_REMOVE_ALL ioctl.
On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if
the device was removed immediately or flagged to be removed on close -
if the flag is clear, the device was removed.
On return from DM_DEV_STATUS and other ioctls, the flag
DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on
closure.
A device that is scheduled to be deleted can be revived using the
message "@cancel_deferred_remove". This message clears the
DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allow user to change the number of IOs that are reserved by
bio-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_bio_based_ios
The default value is RESERVED_BIO_BASED_IOS (16). The maximum allowed
value is RESERVED_MAX_IOS (1024).
Export dm_get_reserved_bio_based_ios() for use by DM targets and core
code. Switch to sizing dm-io's mempool and bioset using DM core's
configurable 'reserved_bio_based_ios'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Allow user to change the number of IOs that are reserved by
request-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_rq_based_ios
The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum
allowed value is RESERVED_MAX_IOS (1024).
Export dm_get_reserved_rq_based_ios() for use by DM targets and core
code. Switch to sizing dm-mpath's mempool using DM core's configurable
'reserved_rq_based_ios'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Support the collection of I/O statistics on user-defined regions of
a DM device. If no regions are defined no statistics are collected so
there isn't any performance impact. Only bio-based DM devices are
currently supported.
Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
It may be useful to switch a request-based table to the "error" target.
Enhance the DM core to allow a hybrid target_type which is capable of
handling either bios (via .map) or requests (via .map_rq).
Add a request-based map function (.map_rq) to the "error" target_type;
making it DM's first hybrid target. Train dm_table_set_type() to prefer
the mapped device's established type (request-based or bio-based). If
the mapped device doesn't have an established type default to making the
table with the hybrid target(s) bio-based.
Tested 'dmsetup wipe_table' to work on both bio-based and request-based
devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a field per_bio_data_size in struct dm_target.
Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.
Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.
If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a safety net that will re-use the DM device's existing limits in the
event that DM device has a temporary table that doesn't have any
component devices. This is to reduce the chance that requests not
respecting the hardware limits will reach the device.
DM recalculates queue limits based only on devices which currently exist
in the table. This creates a problem in the event all devices are
temporarily removed such as all paths being lost in multipath. DM will
reset the limits to the maximum permissible, which can then assemble
requests which exceed the limits of the paths when the paths are
restored. The request will fail the blk_rq_check_limits() test when
sent to a path with lower limits, and will be retried without end by
multipath. This became a much bigger issue after v3.6 commit fe86cdcef
("block: do not artificially constrain max_sectors for stacking
drivers").
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>