It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close). And it
is quite possible for do_md_stop to be called when the device is not
open. So discard the call to blkdev_ioctl(BLKRRPART) which was
added in commit 934d9c23b4.
It is just as easy to call this ioctl from userspace when needed (on
mdadm -S), so leave it out of the kernel.
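As a rough illustration of the userspace side, the following sketch opens a block device and issues BLKRRPART on it. The helper name reread_partitions() is hypothetical and this is not how mdadm itself is written; it only shows that the ioctl is issued on an open file descriptor, which is the condition the kernel-side call could not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKRRPART */

/*
 * Ask the kernel to re-read the partition table of the block device at
 * 'path'. Returns 0 on success or a negative errno value on failure.
 * reread_partitions() is a hypothetical helper name.
 */
int reread_partitions(const char *path)
{
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* The device is held open here, so the ioctl has a disk to act on. */
	if (ioctl(fd, BLKRRPART) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

A caller would pass the array's device node (for example a path such as /dev/md0, used here only as an assumed name) after stopping the array.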
Signed-off-by: NeilBrown <neilb@suse.de>
md arrays are not currently destroyed when they are stopped - they
remain in /sys/block. Last time I tried this I tripped over locking
too much.
A consequence of this is that udev doesn't remove anything from /dev.
This is rather ugly.
As an interim measure until proper device removal can be achieved,
make sure all partitions are removed using the BLKRRPART ioctl, and
send a KOBJ_CHANGE when an md array is stopped.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: allow extended partitions on md devices.
md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state
md: use sysfs_notify_dirent to notify changes to md/array_state
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remains. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
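Step 1 above can be pictured with a small userspace sketch. All type and member names below (ops_sketch, old_open, call_open and so on) are stand-ins invented for illustration, not the kernel's definitions; the point is only that a caller can prefer the new-prototype method and fall back to the renamed old one while drivers are converted one at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; all hypothetical. */
struct bdev_sketch { int opens; };
struct inode_sketch { struct bdev_sketch *bdev; };
typedef unsigned int mode_sketch_t;

struct ops_sketch {
	/* The old method survives under a new (renamed) name ... */
	int (*old_open)(struct inode_sketch *inode);
	/* ... next to the method with the correct prototype. */
	int (*open)(struct bdev_sketch *bdev, mode_sketch_t mode);
};

/*
 * A caller that handles both: prefer the new method and fall back to
 * the renamed old one for drivers that are not converted yet.
 */
int call_open(const struct ops_sketch *ops, struct inode_sketch *inode,
	      mode_sketch_t mode)
{
	assert(ops != NULL && inode != NULL);

	if (ops->open)
		return ops->open(inode->bdev, mode);
	if (ops->old_open)
		return ops->old_open(inode);
	return 0;	/* no method: nothing to do */
}

/* An unconverted example driver: still takes the inode. */
int legacy_open(struct inode_sketch *inode)
{
	inode->bdev->opens++;
	return 1;
}

/* A converted example driver: takes the block device and the mode. */
int new_open(struct bdev_sketch *bdev, mode_sketch_t mode)
{
	(void)mode;
	bdev->opens++;
	return 2;
}
```

Once every driver provides the new method, the fallback branch and the old_open member can be deleted, which corresponds to step 3.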
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new extended partition support provides a much nicer way
to have partitions on md devices than the 'mdp' alternate major.
We cannot really get rid of 'mdp' at this time, but we can
enable extended partitions as that will probably make life
easier for sysadmins.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.
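For context, this is roughly how a userspace monitor could wait for such a notification: sysfs attributes that are notified wake up poll() with POLLPRI and POLLERR. The helper name wait_for_sysfs_change() is made up for this sketch, and the path a caller passes (something of the form md/dev-xxx/state under the array's sysfs directory) is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until the sysfs attribute at 'path' signals a change, or until
 * 'timeout_ms' expires. Returns 1 on change, 0 on timeout, -1 on error.
 * wait_for_sysfs_change() is a hypothetical helper name.
 */
int wait_for_sysfs_change(const char *path, int timeout_ms)
{
	char buf[64];
	struct pollfd pfd;
	int fd, n;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Read once so that only later changes are reported. */
	if (read(fd, buf, sizeof(buf)) < 0) {
		close(fd);
		return -1;
	}

	pfd.fd = fd;
	pfd.events = POLLPRI | POLLERR;
	pfd.revents = 0;
	n = poll(&pfd, 1, timeout_ms);
	close(fd);

	if (n < 0)
		return -1;
	return n > 0 ? 1 : 0;
}
```

After a wakeup the caller would reopen or seek to the start of the file and read the new state.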
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.
Signed-off-by: NeilBrown <neilb@suse.de>
safe_delay_store() currently truncates the last character of input since
it tells strlcpy that the buffer can only hold 'len' characters, off by
one. sysfs already null terminates the buffer, so just increase the
last argument to strlcpy.
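The off-by-one can be reproduced in userspace with a local function that follows the strlcpy contract, where the size argument counts the terminating NUL. Everything below is a sketch: sketch_strlcpy(), copy_buggy() and copy_fixed() are hypothetical names, and passing len + 1 is just one assumed way of "increasing the last argument".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the strlcpy() contract: 'size' counts the NUL byte too. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);

	len = strlen(src);
	if (size > 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Copy 'len' input characters the buggy way: the size equals 'len'. */
void copy_buggy(char *dst, const char *src, size_t len)
{
	sketch_strlcpy(dst, src, len);
}

/* Copy 'len' input characters with room for the NUL: size is 'len + 1'. */
void copy_fixed(char *dst, const char *src, size_t len)
{
	sketch_strlcpy(dst, src, len + 1);
}
```

With a five-character input such as "0.250", copy_buggy() stores only "0.25", while copy_fixed() stores the whole string.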
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.
This means that moving an array to a machine with a larger PAGE_SIZE,
or changing the kernel to use a larger PAGE_SIZE, can stop an array
from working.
For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.
So remove the test on chunk_size from common code, and add it in just
the places where it is needed: raid10 and raid4/5/6.
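The resulting rule can be summarised by the sketch below. It is not the kernel code: the function names are hypothetical, the RAID level is passed as a plain integer, and the page size is a parameter rather than the PAGE_SIZE constant so that the example runs in userspace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: does resync for this level work on whole pages? */
bool level_needs_page_sized_chunks(int level)
{
	return level == 4 || level == 5 || level == 6 || level == 10;
}

/* Sketch of the per-personality test, with the page size passed in. */
bool chunk_size_ok(int level, unsigned long chunk_size,
		   unsigned long page_size)
{
	assert(page_size > 0);

	if (level_needs_page_sized_chunks(level))
		return chunk_size >= page_size;
	return true;	/* e.g. raid0: this test no longer applies */
}
```

For example, a 4 KiB chunk on a machine with 64 KiB pages is rejected for level 5 but accepted for level 0.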
Signed-off-by: NeilBrown <neilb@suse.de>
Having
function (args)
instead of
function(args)
makes it harder to search for calls of particular functions.
So remove all those spaces.
Signed-off-by: NeilBrown <neilb@suse.de>
'read-auto' is a variant of 'readonly' which will switch to writable
on the first write attempt.
Calling do_md_stop to set the array readonly when it is already readonly
returns an error. So make sure not to do that.
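The guard can be sketched as follows. The enum, struct and function names are hypothetical, and the error value -1 is a placeholder rather than the code the kernel returns.

```c
#include <assert.h>
#include <stddef.h>

enum ro_state { ARRAY_RW, ARRAY_READONLY, ARRAY_READ_AUTO };

struct ro_array { enum ro_state state; };

/* Stand-in for the stop-to-readonly path: fails if already read-only. */
int make_readonly(struct ro_array *a)
{
	assert(a != NULL);

	if (a->state == ARRAY_READONLY)
		return -1;	/* placeholder error value */
	a->state = ARRAY_READONLY;
	return 0;
}

/* Switch to read-auto, skipping the readonly step when it would fail. */
int set_read_auto(struct ro_array *a)
{
	int err = 0;

	assert(a != NULL);

	if (a->state != ARRAY_READONLY)
		err = make_readonly(a);
	if (err == 0)
		a->state = ARRAY_READ_AUTO;
	return err;
}
```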
Signed-off-by: NeilBrown <neilb@suse.de>
For externally managed metadata, the 'metadata_version' sysfs
attribute is really just a channel for user-space programs to
communicate about how the array is being managed.
It can be useful for this to be changed while the array is active.
Normally changes to metadata_version are not permitted while the array
is active. Change that so that if the metadata is externally managed,
the metadata_version can be changed to a different flavour of external
management.
Signed-off-by: NeilBrown <neilb@suse.de>
Fix rdev_size_store with size == 0.
size == 0 means to use the largest size allowed by the
underlying device and is used when modifying an active array.
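A minimal sketch of the intended semantics, with hypothetical names and unspecified units (the real rdev_size_store does considerably more):

```c
#include <assert.h>

/*
 * Resolve a requested component size. A request of 0 means "as large as
 * the underlying device allows". Returns the size to use, or 0 if the
 * request exceeds what the device can hold.
 */
unsigned long long resolve_rdev_size(unsigned long long requested,
				     unsigned long long device_max)
{
	assert(device_max > 0);

	if (requested == 0)
		return device_max;
	if (requested > device_max)
		return 0;
	return requested;
}
```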
This fixes a regression introduced by commit d7027458d6.
Cc: <stable@kernel.org>
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0, i.e. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similarly. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_flight() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
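Two of the points above, the sector lookup that falls back to part0 and the stat update that also accounts to part0, can be illustrated in userspace. The structures and function names are simplified stand-ins chosen for this sketch, with a single counter in place of the real per-CPU statistics.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PARTS 4

struct part_sketch {
	unsigned long long start, nr_sects;
	unsigned long ios;	/* one example counter */
};

struct disk_sketch {
	/* part[0] covers the whole disk and carries the disk-wide stats. */
	struct part_sketch part[SKETCH_MAX_PARTS];
};

/* Return the partition containing 'sector', or part0 if none matches. */
struct part_sketch *map_sector(struct disk_sketch *disk,
			       unsigned long long sector)
{
	size_t i;

	assert(disk != NULL);

	for (i = 1; i < SKETCH_MAX_PARTS; i++) {
		struct part_sketch *p = &disk->part[i];

		if (sector >= p->start && sector < p->start + p->nr_sects)
			return p;
	}
	return &disk->part[0];
}

/* Account one I/O to 'part' and, if it is not part0, to part0 as well. */
void part_ios_inc(struct disk_sketch *disk, struct part_sketch *part)
{
	assert(disk != NULL && part != NULL);

	part->ios++;
	if (part != &disk->part[0])
		disk->part[0].ios++;
}
```

Because map_sector() never returns NULL, a caller can pass its result straight to part_ios_inc() without a special case.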
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Till now, bdev->bd_part is set only if the bdev was for parts other
than part0. This patch makes bdev->bd_part always set so that code
paths don't have to differentiate common handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Implement {disk|part}_to_dev() and use them to access generic device
instead of directly dereferencing {disk|part}->dev. To make sure no
user is left behind, rename the generic device fields to __dev.
This is in preparation of unifying partition 0 handling with other
partitions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When two md arrays share some block device (e.g. each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.
This can take a long time and, as it currently waits in TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.
So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).
However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that update requires taking the mddev
lock - which is already held.
This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.
All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev. Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.
So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.
So remove it.
This patch possibly addresses the regression recorded in
http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
http://bugzilla.kernel.org/show_bug.cgi?id=11452
though it isn't yet clear how it ever worked.
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'repair' is requested when an array is in a position to 'recover', raid1
will perform the repair while md believes a recovery is happening. Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device. Then it can be removed
by a request from user space.
The first step is currently not performed for read-only arrays,
so the second step can never succeed.
So allow readonly arrays to remove failed devices (which aren't
blocked).
Signed-off-by: NeilBrown <neilb@suse.de>
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.
For now, simply reject any request to change the size of an array
which has a bitmap. mdadm can remove the bitmap and add a new one
after the array has changed size.
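The interim behaviour amounts to an early rejection, sketched here with hypothetical names. The use of -EBUSY is an assumption made for the example, not something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct array_sketch {
	unsigned long long size;
	bool has_bitmap;
};

/* Refuse to resize while a write-intent bitmap is attached. */
int update_size_sketch(struct array_sketch *a, unsigned long long new_size)
{
	assert(a != NULL);

	if (a->has_bitmap)
		return -EBUSY;	/* assumed error code */
	a->size = new_size;
	return 0;
}
```

Userspace would then remove the bitmap, resize, and add a new bitmap, in that order.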
Signed-off-by: NeilBrown <neilb@suse.de>