Commit Graph

149 Commits

Author SHA1 Message Date
Andy Grover
63c8ecb626 dm thin: include metadata_low_watermark threshold in pool status
The metadata low watermark threshold is set by the kernel.  But the
kernel depends on userspace to extend the thinpool metadata device when
the threshold is crossed.

Since the metadata low watermark threshold is not visible to userspace,
upon receiving an event, userspace cannot tell that the kernel wants the
metadata device extended, instead of some other eventing condition.
Making it visible (but not settable) enables userspace to affirmatively
know the kernel is asking for a metadata device extension, by comparing
metadata_low_watermark against nr_free_blocks_metadata, also reported in
status.

Current solutions like dmeventd have their own thresholds for extending
the data and metadata devices, and both devices are checked against
their thresholds on each event.  This lessens the value of the kernel-set
threshold, since userspace will either extend the metadata device sooner,
when receiving another event; or will receive the metadata lowater event
and do nothing, if dmeventd's threshold is less than the kernel's.
(This second case is dangerous. The metadata lowater event will not be
re-sent, so no further event will be generated before the metadata
device is out if space, unless some other event causes userspace to
recheck its thresholds.)

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-30 11:49:08 -04:00
Mikulas Patocka
a3fcf72531 dm integrity: recalculate checksums on creation
When using external metadata device and internal hash, recalculate the
checksums when the device is created - so that dm-integrity doesn't
have to overwrite the device.  The superblock stores the last position
when the recalculation ended, so that it is properly restarted.

Integrity tags that haven't been recalculated yet are ignored.

Also bump the target version.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:27 -04:00
Mikulas Patocka
cda6b5ab7f dm delay: add flush as a third class of IO
Add a new class for dm-delay that delays flush requests.  Previously,
flushes were delayed as writes, but it caused problems if the user
needed to create a device with one or a few slow sectors for the purpose
of testing - all flushes would be forwarded to this device and delayed,
and that skews the test results.  Fix this by allowing to select 0 delay
for flushes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:19 -04:00
Mike Snitzer
6c7413c0f5 dm thin: update stale "Status" Documentation
Documentation/device-mapper-/thin-provisioning.txt's "Status" section no
longer reflected the current fitness level of DM thin-provisioning.
That is, DM thinp is no longer "EXPERIMENTAL".  It has since seen
considerable improvement, has been fairly widely deployed and has
performed in a robust manner.

Update Documentation to dispel concern raised by potential DM thinp
users.

Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:03 -04:00
Mikulas Patocka
d284f8248c dm writecache: support optional offset for start of device
Add an optional parameter "start_sector" to allow the start of the
device to be offset by the specified number of 512-byte sectors.  The
sectors below this offset are not used by the writecache device and are
left to be used for disk labels and/or userspace metadata (e.g. lvm).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-02 16:14:02 -04:00
Mikulas Patocka
48debafe4f dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.

The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.

If persistent memory isn't available this target can still be used in
SSD mode.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix missing goto
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> # fix compilation issue with !DAX
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # use msecs_to_jiffies
Acked-by: Dan Williams <dan.j.williams@intel.com> # reworks to unify ARM and x86 flushing
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
2018-06-08 11:59:51 -04:00
Mike Snitzer
28700a3623 dm thin: update Documentation to clarify when "read_only" is valid
Due to user confusion, clarify that it doesn't make sense to try to
create a thin-pool with "read_only" mode enabled.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-05-10 11:18:49 -04:00
Patrik Torstensson
843f38d382 dm verity: add 'check_at_most_once' option to only validate hashes once
This allows platforms that are CPU/memory contrained to verify data
blocks only the first time they are read from the data device, rather
than every time.  As such, it provides a reduced level of security
because only offline tampering of the data device's content will be
detected, not online tampering.

Hash blocks are still verified each time they are read from the hash
device, since verification of hash blocks is less performance critical
than data blocks, and a hash block will not be verified any more after
all the data blocks it covers have been verified anyway.

This option introduces a bitset that is used to check if a block has
been validated before or not.  A block can be validated more than once
as there is no thread protection for the bitset.

These changes were developed and tested on entry-level Android Go
devices.

Signed-off-by: Patrik Torstensson <totte@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:29 -04:00
John Pittman
9614e2ba91 dm cache: Documentation: update default migration_throttling value
In commit f8350daf7a ("dm cache: tune migration throttling") the
value for DEFAULT_MIGRATION_THRESHOLD was decreased from 204800 to
2048.  Edit device-mapper/cache.txt to reflect the correct default
value for migration_threshold.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-30 16:55:47 -05:00
mulhern
7efd5fed6f dm thin: extend thinpool status format string with omitted fields
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:12 -05:00
mulhern
cc3ff0af19 dm thin: fixes in thin-provisioning.txt
Make the format string for thinpool status more correct.

Swap the order of two items to correspond with reality.

Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:12 -05:00
mulhern
2bc8a61c69 dm thin: document representation of <highest mapped sector> when there is none
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:11 -05:00
mulhern
9b28a1102e dm thin: fix documentation relative to low water mark threshold
Fixes:
1. The use of "exceeds" when the opposite of exceeds, falls below,
was meant.
2. Properly speaking, a table can not exceed a threshold.

It emphasizes the important point, which is that it is the userspace
daemon's responsibility to check for low free space when a device
is resumed, since it won't get a special event indicating low free
space in that situation.

Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:10 -05:00
mulhern
1346638e5f dm cache: be consistent in specifying sectors and SI units in cache.txt
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:09 -05:00
mulhern
3716e20af5 dm cache: delete obsoleted paragraph in cache.txt
The 'mq' policy is no longer the default policy, and the default policy,
'smq', does not store hit counts.

Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:08 -05:00
mulhern
677210462d dm cache: fix grammar in cache-policies.txt
Use possessive pronoun where appropriate, instead of contraction.

Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:07 -05:00
Mikulas Patocka
424da29c5a dm snapshot: improve documentation relative to origin suspend requirements
Add a note to snapshot.txt that the origin target must be suspended when
loading or unloading the snapshot target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:06 -05:00
Scott Bauer
18a5bf2705 dm: add unstriped target
This device mapper "unstriped" target remaps and unstripes I/O so it
is issued solely on a single drive in a HW RAID0 or dm-striped target.

In a 4 drive HW RAID0 the striped target exposes 1/4th of the LBA range
as a virtual drive.  Each I/O to that virtual drive will only be issued
to the 1 drive that was selected of the 4 drives in the HW RAID0.

This unstriped target is most useful for Intel NVMe drives that have
multiple cores but that do not have firmware control to pin separate LBA
ranges to each discrete cpu core.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-17 09:16:00 -05:00
Heinz Mauelshagen
11e4723206 dm raid: stop keeping raid set frozen altogether
In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared.  The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.

Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether.  This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.

Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 11:52:02 -05:00
Mike Snitzer
b84cf26924 dm raid: bump target version to reflect numerous fixes
Also update Documentation accordingly.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08 10:59:58 -05:00
Jonathan Brassow
41dcf197ad dm raid: fix incorrect status output at the end of a "recover" process
There are three important fields that indicate the overall health and
status of an array: dev_health, sync_ratio, and sync_action.  They tell
us the condition of the devices in the array, and the degree to which
the array is synchronized.

This commit fixes a condition that is reported incorrectly.  When a member
of the array is being rebuilt or a new device is added, the "recover"
process is used to synchronize it with the rest of the array.  When the
process is complete, but the sync thread hasn't yet been reaped, it is
possible for the state of MD to be:
 mddev->recovery = [ MD_RECOVERY_RUNNING MD_RECOVERY_RECOVER MD_RECOVERY_DONE ]
 curr_resync_completed = <max dev size> (but not MaxSector)
 and all rdevs to be In_sync.
This causes the 'array_in_sync' output parameter that is passed to
rs_get_progress() to be computed incorrectly and reported as 'false' --
or not in-sync.  This in turn causes the dev_health status characters to
be reported as all 'a', rather than the proper 'A'.

This can cause erroneous output for several seconds at a time when tools
will want to be checking the condition due to events that are raised at
the end of a sync process.  Fix this by properly calculating the
'array_in_sync' return parameter in rs_get_progress().

Also, remove an unnecessary intermediate 'recovery_cp' variable in
rs_get_progress().

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-10-05 16:21:30 -04:00
Heinz Mauelshagen
ac6a318888 dm raid: bump target version
Bumo dm-raid target version to 1.12.1 to reflect that commit cc27b0c78c
("md: fix deadlock between mddev_suspend() and md_write_start()") is
available.

This version change allows userspace to detect that MD fix is available.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25 14:54:20 -04:00
Damien Le Moal
3b1a94c88b dm zoned: drive-managed zoned block device target
The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides to the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.

Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in a least-recently-written order and first targeting the
oldest buffer zones. Doing so, blocks under regular write access (such
as metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.

dm-zoned implementation focus on simplicity and on minimizing overhead
(CPU, memory and storage overhead). For a 14TB host-managed disk with
256 MB zones, dm-zoned memory usage per disk instance is at most about
3 MB and as little as 5 zones will be used internally for storing metadata
and performing buffer zone reclaim operations. This is achieved using
zone level indirection rather than a full block indirection system for
managing block movement between zones.

dm-zoned primary target is host-managed zoned block devices but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.

Zoned block devices can be formatted and checked for use with the dm-zoned
target using the dmzadm utility available at:

https://github.com/hgst/dm-zoned-tools

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-19 11:05:20 -04:00
Linus Torvalds
d35a878ae1 Merge tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:

 - A major update for DM cache that reduces the latency for deciding
   whether blocks should migrate to/from the cache. The bio-prison-v2
   interface supports this improvement by enabling direct dispatch of
   work to workqueues rather than having to delay the actual work
   dispatch to the DM cache core. So the dm-cache policies are much more
   nimble by being able to drive IO as they see fit. One immediate
   benefit from the improved latency is a cache that should be much more
   adaptive to changing workloads.

 - Add a new DM integrity target that emulates a block device that has
   additional per-sector tags that can be used for storing integrity
   information.

 - Add a new authenticated encryption feature to the DM crypt target
   that builds on the capabilities provided by the DM integrity target.

 - Add MD interface for switching the raid4/5/6 journal mode and update
   the DM raid target to use it to enable aid4/5/6 journal write-back
   support.

 - Switch the DM verity target over to using the asynchronous hash
   crypto API (this helps work better with architectures that have
   access to off-CPU algorithm providers, which should reduce CPU
   utilization).

 - Various request-based DM and DM multipath fixes and improvements from
   Bart and Christoph.

 - A DM thinp target fix for a bio structure leak that occurs for each
   discard IFF discard passdown is enabled.

 - A fix for a possible deadlock in DM bufio and a fix to re-check the
   new buffer allocation watermark in the face of competing admin
   changes to the 'max_cache_size_bytes' tunable.

 - A couple DM core cleanups.

* tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits)
  dm bufio: check new buffer allocation watermark every 30 seconds
  dm bufio: avoid a possible ABBA deadlock
  dm mpath: make it easier to detect unintended I/O request flushes
  dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
  dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
  dm: introduce enum dm_queue_mode to cleanup related code
  dm mpath: verify __pg_init_all_paths locking assumptions at runtime
  dm: verify suspend_locking assumptions at runtime
  dm block manager: remove an unused argument from dm_block_manager_create()
  dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
  dm mpath: delay requeuing while path initialization is in progress
  dm mpath: avoid that path removal can trigger an infinite loop
  dm mpath: split and rename activate_path() to prepare for its expanded use
  dm ioctl: prevent stack leak in dm ioctl call
  dm integrity: use previously calculated log2 of sectors_per_block
  dm integrity: use hex2bin instead of open-coded variant
  dm crypt: replace custom implementation of hex2bin()
  dm crypt: remove obsolete references to per-CPU state
  dm verity: switch to using asynchronous hash crypto API
  dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
  ...
2017-05-03 10:31:20 -07:00
Mikulas Patocka
9d609f85b7 dm integrity: support larger block sizes
The DM integrity block size can now be 512, 1k, 2k or 4k.  Using larger
blocks reduces metadata handling overhead.  The block size can be
configured at table load time using the "block_size:<value>" option;
where <value> is expressed in bytes (defult is still 512 bytes).

It is safe to use larger block sizes with DM integrity, because the
DM integrity journal makes sure that the whole block is updated
atomically even if the underlying device doesn't support atomic writes
of that size (e.g. 4k block ontop of a 512b device).

Depends-on: 2859323e ("block: fix blk_integrity_register to use template's interval_exp if not 0")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:33 -04:00