Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough. There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races. But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.
If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.
So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.
Signed-off-by: NeilBrown <neilb@suse.de>
We were removing the drives, from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.
Signed-off-by: NeilBrown <neilb@suse.de>
This "if" don't allow for the possibility that the number of devices
doesn't change, and so sector_nr isn't set correctly in that case.
So change '>' to '>='.
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid5 doesn't allow a reshape to restart if it involves writing
over the same part of disk that it would be reading from.
This happens at the beginning of a reshape that increases the number
of devices, at the end of a reshape that decreases the number of
devices, and continuously for a reshape that does not change the
number of devices.
The current code is correct for the "increase number of devices"
case as the critical section at the start is handled by userspace
performing a backup.
It does not work for reducing the number of devices, or the
no-change case.
For 'reducing', we need to invert the test. For no-change we cannot
really be sure things will be safe, so simply require the array
to be read-only, which is how the user-space code which carefully
starts such arrays works.
Signed-off-by: NeilBrown <neilb@suse.de>
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'. This is to cope with
a system failure between updating the metadata on two difference
devices.
However there are currently times when we update the event count by
2. This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.
This is bad for the above reason. So change it to never increase by
two. This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost. The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.
Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent commit:
commit 449aad3e25
introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.
__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call the revalidate_disk.
This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
commit d63a5a74de
do avoid a warning of an impossible deadlock.
It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.
So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.
This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d
Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode. So use that instead of mucking about with locks and
i_size_write.
Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.
Signed-off-by: NeilBrown <neilb@suse.de>
The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.
However reshape keeps at least some stripe busy for an extended period
of time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.
Signed-off-by: NeilBrown <neilb@suse.de>
As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect. It is visible in sysfs and that
is all.
Signed-off-by: NeilBrown <neilb@suse.de>
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.
Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array. This isn't really safe.
So range-check the value first.
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is changed from RAID6 to RAID5, fewer drives are
needed. So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.
md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().
The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.
For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Incorrect device area lengths are being passed to device_area_is_valid().
The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.
With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes). An example of a
consequent incorrect error message is:
device-mapper: table: 254:0: sdb too small for target
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch removes DM's bio-based vs request-based conditional setting
of next_ordered. For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path). For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.
bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18
request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366
The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.
The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing). Otherwise the bios could sit on the list indefinitely.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
blocK: Restore barrier support for md and probably other virtual devices.
block: get rid of queue-private command filter
block: Create bip slabs with embedded integrity vectors
cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
Trivial typo fixes in Documentation/block/data-integrity.txt.
* 'for-linus' of git://neil.brown.name/md:
md: use interruptible wait when duration is controlled by userspace.
md/raid5: suspend shouldn't affect read requests.
md: tidy up error paths in md_alloc
md: fix error path when duplicate name is found on md device creation.
md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
md: Use new topology calls to indicate alignment and I/O sizes
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs. Each bip slab
has an embedded bio_vec array at the end. This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version. Only the largest bip slab is backed by a mempool. The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>