Commit Graph

1363 Commits

Author SHA1 Message Date
Linus Torvalds
435a71d9ef Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  Fix new incorrect error return from do_md_stop.
2009-08-18 13:54:08 -07:00
NeilBrown
80ffb3ccea Fix new incorrect error return from do_md_stop.
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough.  There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-18 10:35:26 +10:00
Randy Dunlap
894ef820b1 dm-log-userspace: fix printk format warning
drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t'

Previously posted and acked, but apparently lost.
http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-16 08:35:58 -07:00
NeilBrown
4d484a4a7a md: allow upper limit for resync/reshape to be set when array is read-only
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races.  But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:50 +10:00
NeilBrown
1a67dde0ab md/raid5: Properly remove excess drives after shrinking a raid5/6
We were removing the drives, from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:49 +10:00
NeilBrown
a639755cf8 md/raid5: make sure a reshape restarts at the correct address.
This "if" don't allow for the possibility that the number of devices
doesn't change, and so sector_nr isn't set correctly in that case.
So change '>' to '>='.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:13:00 +10:00
NeilBrown
67ac6011db md/raid5: allow new reshape modes to be restarted in the middle.
md/raid5 doesn't allow a reshape to restart if it involves writing
over the same part of disk that it would be reading from.
This happens at the beginning of a reshape that increases the number
of devices, at the end of a reshape that decreases the number of
devices, and continuously for a reshape that does not change the
number of devices.

The current code is correct for the "increase number of devices"
case as the critical section at the start is handled by userspace
performing a backup.

It does not work for reducing the number of devices, or the
no-change case.
For 'reducing', we need to invert the test.  For no-change we cannot
really be sure things will be safe, so simply require the array
to be read-only, which is how the user-space code which carefully
starts such arrays works.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:06:24 +10:00
NeilBrown
51d5668cb2 md: never advance 'events' counter by more than 1.
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'.  This is to cope with
a system failure between updating the metadata on two difference
devices.

However there are currently times when we update the event count by
2.  This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.

This is bad for the above reason.  So change it to never increase by
two.  This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost.  The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 09:54:02 +10:00
NeilBrown
c8c00a6915 Remove deadlock potential in md_open
A recent commit:
  commit 449aad3e25

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call the revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
   commit d63a5a74de
do avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL.  So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-10 12:50:52 +10:00
NeilBrown
449aad3e25 md: Use revalidate_disk to effect changes in size of device.
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown
64bd660b51 md: allow raid5_quiesce to work properly when reshape is happening.
The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.

However reshape keeps at least some stripe busy for an extended period
of time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown
e516402c0d md/raid5: set reshape_position correctly when reshape starts.
As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect.  It is visible in sysfs and that
is all.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown
70471dafe3 md: Handle growth of v1.x metadata correctly.
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown
3673f305fa md: avoid array overflow with bad v1.x metadata
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:56 +10:00
NeilBrown
3a981b03f3 md: when a level change reduces the number of devices, remove the excess.
When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:55 +10:00
Andre Noll
ac5e7113e7 md: Push down data integrity code to personalities.
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:47 +10:00
Dan Williams
95fc17aac4 md/raid6: release spare page at ->stop()
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-31 12:39:15 +10:00
Mike Snitzer
5dea271b6d dm table: pass correct dev area size to device_area_is_valid
Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes).  An example of a
consequent incorrect error message is:

  device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:42 +01:00
Mike Snitzer
a732c207d1 dm: remove queue next_ordered workaround for barriers
This patch removes DM's bio-based vs request-based conditional setting
of next_ordered.  For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path).  For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:40 +01:00
Mikulas Patocka
69885683d2 dm raid1: wake kmirrord when requeueing delayed bios after remote recovery
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing).  Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:37 +01:00
Jens Axboe
8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Joe Perches
ad361c9884 Remove multiple KERN_ prefixes from printk formats
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 10:30:03 -07:00
Linus Torvalds
2027bd9f92 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
  blocK: Restore barrier support for md and probably other virtual devices.
  block: get rid of queue-private command filter
  block: Create bip slabs with embedded integrity vectors
  cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
  cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
  Trivial typo fixes in Documentation/block/data-integrity.txt.
2009-07-01 10:41:09 -07:00
Linus Torvalds
544ae5f96e Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: use interruptible wait when duration is controlled by userspace.
  md/raid5: suspend shouldn't affect read requests.
  md: tidy up error paths in md_alloc
  md: fix error path when duplicate name is found on md device creation.
  md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
  md: Use new topology calls to indicate alignment and I/O sizes
2009-07-01 10:31:26 -07:00
Martin K. Petersen
7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00