Pull md fixes from NeilBrown:
"Several bug fixes for md in 3.7:
- raid5 discard has problems
- raid10 replacement devices have problems
- bad block lock seqlock usage has problems
- dm-raid doesn't free everything"
* tag 'md-3.7-fixes' of git://neil.brown.name/md:
md/raid10: decrement correct pending counter when writing to replacement.
md/raid10: close race that lose writes lost when replacement completes.
md/raid5: Make sure we clear R5_Discard when discard is finished.
md/raid5: move resolving of reconstruct_state earlier in stripe_handle.
md/raid5: round discard alignment up to power of 2.
md: make sure everything is freed when dm-raid stops an array.
md: Avoid write invalid address if read_seqretry returned true.
md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
Request based dm attempts to re-run the request queue off the
request completion path. If used with a driver that potentially does
end_io from its request_fn, we could deadlock trying to recurse
back into request dispatch. Fix this by punting the request queue
run to kblockd.
Tested to fix a quickly reproducible deadlock in such a scenario.
Cc: stable@kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a write to a replacement device completes, we carefully
and correctly found the rdev that the write actually went to
and the blithely called rdev_dec_pending on the primary rdev,
even if this write was to the replacement.
This means that any writes to an array while a replacement
was ongoing would cause the nr_pending count for the primary
device to go negative, so it could never be removed.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement. The faulty should be removed and
the replacement moved in place very quickly, bit it isn't instant.
So the code write out to the array must handle the possibility that
the only working device for some slot in the replacement - but it
doesn't. If the primary device is faulty it just gives up. This
can lead to corruption.
So make the code more robust: if either the primary or the
replacement is present and working, write to them. Only when
neither are present do we give up.
This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.
Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
commit 9e44476851
MD: raid5 avoid unnecessary zero page for trim
change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.
This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.
Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared. But R5_Uptodate really should be clear at this point.
So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.
Signed-off-by: NeilBrown <neilb@suse.de>
stripe_handle.
The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().
After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.
Signed-off-by: NeilBrown <neilb@suse.de>
blkdev_issue_discard currently assumes that the granularity
is a power of 2. So in raid5, round the chosen number up to
avoid embarrassment.
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.
This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.
Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 2863b9eb didn't take into account the changes to add TRIM support to
RAID10 (commit 532a2a3fb). That is, when using dm-raid.c to create the
RAID10 arrays, there is no mddev->gendisk or mddev->queue. The code added
to support TRIM simply assumes that mddev->queue is available without
checking. The result is an oops any time dm-raid.c attempts to create a
RAID10 device.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
setup_conf in raid1.c uses conf->raid_disks before assigning
a value. It is used when including 'Replacement' devices.
The consequence is that assembling an array which contains a
replacement will misbehave and either not include the replacement, or
not include the device being replaced.
Though this doesn't lead directly to data corruption, it could lead to
reduced data safety.
So use mddev->raid_disks, which is initialised, instead.
Bug was introduced by commit c19d57980b
md/raid1: recognise replacements when assembling arrays.
in 3.3, so fix is suitable for 3.3.y thru 3.6.y.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
in:
fe86cdce block: do not artificially constrain max_sectors for stacking drivers
max_sectors defaults to UINT_MAX. md faulty wasn't using
disk_stack_limits(), so inherited this large value as well.
This triggered a bug in XFS when stressed over md_faulty, when
a very large bio_alloc() failed.
That was on an older kernel, and I can't reproduce exactly the
same thing upstream, but I think the fix is appropriate in any
case.
Thanks to Mike Snitzer for pointing out the problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md updates from NeilBrown:
- "discard" support, some dm-raid improvements and other assorted bits
and pieces.
* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
md: refine reporting of resync/reshape delays.
md/raid5: be careful not to resize_stripes too big.
md: make sure manual changes to recovery checkpoint are saved.
md/raid10: use correct limit variable
md: writing to sync_action should clear the read-auto state.
Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
md/raid5: make sure to_read and to_write never go negative.
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
md/raid5: protect debug message against NULL derefernce.
md/raid5: add some missing locking in handle_failed_stripe.
MD: raid5 avoid unnecessary zero page for trim
MD: raid5 trim support
md/bitmap:Don't use IS_ERR to judge alloc_page().
md/raid1: Don't release reference to device while handling read error.
raid: replace list_for_each_continue_rcu with new interface
add further __init annotations to crypto/xor.c
DM RAID: Fix for "sync" directive ineffectiveness
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Add rebuild capability for RAID10
DM RAID: Move 'rebuild' checking code to its own function
...
Use the recently-added bio front_pad field to allocate struct dm_target_io.
Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.
This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.
_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.
This idea was introduced by Kent Overstreet.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to other future DM targets so
move it to a separate module.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to share with future DM targets.
Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().
spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.
So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.
Signed-off-by: NeilBrown <neilb@suse.de>
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large. This particularly happens
when md_check_recovery calls ->check_reshape().
If we use ->previous_raid_disks, we don't risk this.
Signed-off-by: NeilBrown <neilb@suse.de>
If you make an array bigger but suppress resync of the new region with
mdadm --grow /dev/mdX --size=max --assume-clean
then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>