commit a9d45396f5 upstream.
The persistent-data library used by dm-thin, dm-cache, etc is
transactional. If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.
Atomicity when committing a transaction is achieved by:
a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
disk.
This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.
Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.
As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now
the unlocking must be done separately.
This issue was discovered by forcing io errors at the crucial time
using dm-flakey.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fca028438f upstream.
This bug was introduced in commit 7e664b3dec ("dm space map metadata:
fix extending the space map").
When extending a dm-thin metadata volume we:
- Switch the space map into a simple bootstrap mode, which allocates
all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
entries
- Commit changes to disk
- Switch back out of bootstrap mode.
But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.
The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit. This would cause the disk commit to fail, and thinp
to enter read_only mode. The metadata was not damaged (thin_check
passed).
The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space. In practise this loop only
runs twice.
With this fix the following device mapper testsuite test passes:
dmtest run --suite thin-provisioning -n thin_remove_works_after_resize
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e664b3dec upstream.
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.
That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.
With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
dmtest run --suite thin-provisioning -n /resize_metadata_no_io/
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12c91a5c2d upstream.
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ed9571f0cf upstream.
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a threshold callback to dm persistent data space maps.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a threshold callback function to the persistent data space map
interface for a subsequent patch to use.
dm-thin and dm-cache are interested in knowing when they're getting
low on metadata or data blocks. This patch introduces a new method
for registering a callback against a threshold.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support extending a dm persistent data metadata space map.
The extend itself is implemented by switching back to the boostrap
allocator and pointing to the new space. The extra bitmap indexes are
then allocated from the new space, and finally we switch back to the
proper space map ops and tweak the reference counts.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix some typos in dm-space-map-metadata.c error messages.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix a bug in dm_btree_remove that could leave leaf values with incorrect
reference counts. The effect of this was that removal of a shared block
could result in the space maps thinking the block was no longer used.
More concretely, if you have a thin device and a snapshot of it, sending
a discard to a shared region of the thin could corrupt the snapshot.
Thinp uses a 2-level nested btree to store it's mappings. This first
level is indexed by thin device, and the second level by logical
block.
Often when we're removing an entry in this mapping tree we need to
rebalance nodes, which can involve shadowing them, possibly creating a
copy if the block is shared. If we do create a copy then children of
that node need to have their reference counts incremented. In this
way reference counts percolate down the tree as shared trees diverge.
The rebalance functions were incrementing the children at the
appropriate time, but they were always assuming the children were
internal nodes. This meant the leaf values (in our case packed
block/flags entries) were not being incremented.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.
A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules. The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch. Other modules are
under development.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add dm_btree_walk to iterate through the contents of a btree.
This will be used by the dm cache target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mark some constant parameters constant in some dm-btree functions.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the warning:
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."
* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...
Improve space map error message when unable to allocate a new
metadata block.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When deleting nested btrees, the code forgets to delete the innermost
btree. The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.
This patch corrects both problems.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>