Commit Graph

2540 Commits

Author SHA1 Message Date
Jonathan Brassow
ed30be077e MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c
Commit 2863b9eb didn't take into account the changes to add TRIM support to
RAID10 (commit 532a2a3fb).  That is, when using dm-raid.c to create the
RAID10 arrays, there is no mddev->gendisk or mddev->queue.  The code added
to support TRIM simply assumes that mddev->queue is available without
checking.  The result is an oops any time dm-raid.c attempts to create a
RAID10 device.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:30 +11:00
NeilBrown
02b898f2f0 md/raid1: Fix assembling of arrays containing Replacements.
setup_conf in raid1.c uses conf->raid_disks before assigning
a value.  It is used when including 'Replacement' devices.

The consequence is that assembling an array which contains a
replacement will misbehave and either not include the replacement, or
not include the device being replaced.

Though this doesn't lead directly to data corruption, it could lead to
reduced data safety.

So use mddev->raid_disks, which is initialised, instead.

Bug was introduced by commit c19d57980b
      md/raid1: recognise replacements when assembling arrays.

in 3.3, so fix is suitable for 3.3.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:03 +11:00
Eric Sandeen
0be1fecd7e md faulty: use disk_stack_limits()
in:
fe86cdce block: do not artificially constrain max_sectors for stacking drivers

max_sectors defaults to UINT_MAX.  md faulty wasn't using
disk_stack_limits(), so inherited this large value as well.
This triggered a bug in XFS when stressed over md_faulty, when
a very large bio_alloc() failed.

That was on an older kernel, and I can't reproduce exactly the
same thing upstream, but I think the fix is appropriate in any
case.

Thanks to Mike Snitzer for pointing out the problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-22 10:44:55 +11:00
Linus Torvalds
9db908806b Merge tag 'md-3.7' of git://neil.brown.name/md
Pull md updates from NeilBrown:
 - "discard" support, some dm-raid improvements and other assorted bits
   and pieces.

* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
  md: refine reporting of resync/reshape delays.
  md/raid5: be careful not to resize_stripes too big.
  md: make sure manual changes to recovery checkpoint are saved.
  md/raid10: use correct limit variable
  md: writing to sync_action should clear the read-auto state.
  Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
  md/raid5: make sure to_read and to_write never go negative.
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
  md/raid5: protect debug message against NULL derefernce.
  md/raid5: add some missing locking in handle_failed_stripe.
  MD: raid5 avoid unnecessary zero page for trim
  MD: raid5 trim support
  md/bitmap:Don't use IS_ERR to judge alloc_page().
  md/raid1: Don't release reference to device while handling read error.
  raid: replace list_for_each_continue_rcu with new interface
  add further __init annotations to crypto/xor.c
  DM RAID: Fix for "sync" directive ineffectiveness
  DM RAID: Fix comparison of index and quantity for "rebuild" parameter
  DM RAID: Add rebuild capability for RAID10
  DM RAID: Move 'rebuild' checking code to its own function
  ...
2012-10-13 13:22:01 -07:00
Mikulas Patocka
dba141601d dm: store dm_target_io in bio front_pad
Use the recently-added bio front_pad field to allocate struct dm_target_io.

Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.

This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.

_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.

This idea was introduced by Kent Overstreet.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:15 +01:00
Mike Snitzer
4f81a41762 dm thin: move bio_prison code to separate module
The bio prison code will be useful to other future DM targets so
move it to a separate module.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:13 +01:00
Mike Snitzer
44feb387f6 dm thin: prepare to separate bio_prison code
The bio prison code will be useful to share with future DM targets.

Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:10 +01:00
Mike Snitzer
28eed34e76 dm thin: support discard with non power of two block size
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.

This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize").  That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:07 +01:00
Wei Yongjun
0bcf08798e dm persistent data: convert to use le32_add_cpu
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:47 +01:00
Mikulas Patocka
fe5fe90639 dm: use ACCESS_ONCE for sysfs values
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:46 +01:00
Wei Yongjun
54499afbb8 dm bufio: use list_move
Use list_move() instead of list_del() + list_add().

spatch with a semantic match was used to find this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:44 +01:00
Wei Yongjun
a71a261f5c dm mpath: fix check for null mpio in end_io fn
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().

spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:42 +01:00
NeilBrown
72f36d5972 md: refine reporting of resync/reshape delays.
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.

So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:25:57 +11:00
NeilBrown
e56108d65f md/raid5: be careful not to resize_stripes too big.
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large.  This particularly happens
when md_check_recovery calls ->check_reshape().

If we use ->previous_raid_disks, we don't risk this.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:24:13 +11:00
NeilBrown
db07d85ef6 md: make sure manual changes to recovery checkpoint are saved.
If you make an array bigger but suppress resync of the new region with
  mdadm --grow /dev/mdX --size=max --assume-clean

then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.

Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:22:17 +11:00
Dan Carpenter
91502f099d md/raid10: use correct limit variable
Clang complains that we are assigning a variable to itself.  This should
be using bad_sectors like the similar earlier check does.

Bug has been present since 3.1-rc1.  It is minor but could
conceivably cause corruption or other bad behaviour.

Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:20:58 +11:00
NeilBrown
48c26ddc9f md: writing to sync_action should clear the read-auto state.
In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to.  The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).

Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.

So allow any successful write to sync_action to clear any read-auto
state.

Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:19:39 +11:00
Jianpeng Ma
7f7583d420 Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:17:59 +11:00
NeilBrown
1ed850f356 md/raid5: make sure to_read and to_write never go negative.
to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero.  Thus it is not a problem if they are a
little bit larger than they should be.

So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.

So remove those decrements.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:13 +11:00
Alexander Lyakas
a7854487cd md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Suggested-by: Yair Hershko <yair@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
NeilBrown
b97390aec4 md/raid5: protect debug message against NULL derefernce.
The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
NeilBrown
143c4d0573 md/raid5: add some missing locking in handle_failed_stripe.
We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.

Reported-by: "Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:50:12 +11:00
Shaohua Li
9e44476851 MD: raid5 avoid unnecessary zero page for trim
We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.

To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:49:49 +11:00
Shaohua Li
620125f2bf MD: raid5 trim support
Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk.  To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.

Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.

The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.

For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard.  Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.

If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.

If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:49:05 +11:00
Jianpeng Ma
582e2e056a md/bitmap:Don't use IS_ERR to judge alloc_page().
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 13:45:36 +11:00