It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue. Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.
The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a6 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.
This policy of disallowing changes the integrity profile once one has
been established is shared with DM.
Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277]
[ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[ 59.078302] md: data integrity enabled on md0
[..]
[ 90.489209] md0: incompatible integrity profile for pmem1m
[..]
[ 205.671277] md: super_written gets error=-5
[ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[ 205.677386] md/raid1:md0: Operation continuing on 1 devices.
[ 205.683037] RAID1 conf printout:
[ 205.684699] --- wd:1 rd:2
[ 205.685972] disk 0, wo:0, o:1, dev:pmem0s
[ 205.687562] disk 1, wo:1, o:1, dev:pmem1s
[ 205.691717] md: recovery of RAID array md0
Fixes: c7bfced9a6 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: <stable@vger.kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Synchronize pending i/o against a change in the integrity profile to
avoid the possibility of spurious integrity errors. Given linear_add()
is suspending the mddev before manipulating the mddev, do the same for
the other personalities.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that the ->stop function only frees the private data,
rename is accordingly.
Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().
Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).
Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>
Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.
All the steps in '1' can apply to all arrays and so can be
performed in common code.
This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.
So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>
There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.
So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.
When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>
Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.
So make it non-optional and always explicitly pass the name
of the level that the array will be.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.
Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.
So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.
Signed-off-by: NeilBrown <neilb@suse.de>
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.
Signed-off-by: NeilBrown <neilb@suse.de>
Two related problems:
1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.
2/ Most calls to md_wakeup_thread are protected against the thread
disappeared either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.
So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.
Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.
Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
- there is no need to test_bit Faulty, as that was already done in
md_error which is the only caller of these functions.
- MD_CHANGE_DEVS should be set *after* faulty is set to ensure
metadata is updated correctly.
- spinlock should be held while updating ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>