If we discover a bad block when reading we split the request and
potentially read some of it from a different device.
The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
seen by comparison with raid1.c
This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.
Cc: stable@vger.kernel.org (v3.1+)
Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.
So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.
Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.
Signed-off-by: NeilBrown <neilb@suse.de>
Since:
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.
If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.
This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).
Bug was introduced in 3.10 and causes data corruption.
Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.
Here is one place I wasn't careful enough. If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.
This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
oops, so fix is suitable for any -stable since then.
Cc: stable@vger.kernel.org (3.3+)
Signed-off-by: NeilBrown <neilb@suse.de>
1/ When an different between blocks is found, data is copied from
one bio to the other. However bv_len is used as the length to
copy and this could be zero. So use r10_bio->sectors to calculate
length instead.
Using bv_len was probably always a bit dubious, but the introduction
of bio_advance made it much more likely to be a problem.
2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
except on bios that we schedule for a read. This ensures that
missing/failed devices don't confuse the loop at the top of
sync_request write.
Commit 8be185f2c9 "raid10: Use bio_reset()"
removed a loop which set BIO_UPTDATE on all appropriate bios.
So we need to re-add that flag.
These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.
Cc: stable@vger.kernel.org (3.10)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The recent comment:
commit 7e83ccbecd
md/raid10: Allow skipping recovery when clean arrays are assembled
Causes raid10 to skip a recovery in certain cases where it is safe to
do so. Unfortunately it also causes a reshape to be skipped which is
never safe. The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).
Bug was introduced in 3.10, so this is suitable for 3.10-stable.
Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
1/ If a RAID10 is being reshaped to a fewer number of devices
and is stopped while this is ongoing, then when the array is
reassembled the 'mirrors' array will be allocated too small.
This will lead to an access error or memory corruption.
2/ A sanity test for a reshaping RAID10 array is restarted
is slightly incorrect.
Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.
Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>
It isn't really enough to check that the rdev is present, we need to
also be sure that the device is still In_sync.
Doing this requires using rcu_dereference to access the rdev, and
holding the rcu_read_lock() to ensure the rdev doesn't disappear while
we look at it.
Signed-off-by: NeilBrown <neilb@suse.de>
As 'enough' accesses conf->prev and conf->geo, which can change
spontanously, it should guard against changes.
This can be done with device_lock as start_reshape holds device_lock
while updating 'geo' and end_reshape holds it while updating 'prev'.
So 'error' needs to hold 'device_lock'.
On the other hand, raid10_end_read_request knows which of the two it
really wants to access, and as it is an active request on that one,
the value cannot change underneath it.
So change _enough to take flag rather than a pointer, pass the
appropriate flag from raid10_end_read_request(), and remove the locking.
All other calls to 'enough' are made with reconfig_mutex held, so
neither 'prev' nor 'geo' can change.
Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Add ability to restore transiently failed devices on resume
This patch adds code to the resume function to check over the devices
in the RAID array. If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array. This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.
Using raise_barrier will sometimes deadlock. Using freeze_array
should not.
As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.
The deadlock was made particularly noticeable by commits
050b66152f (raid10) and 6b740b8d79 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.
This patch probably won't apply directly to some early kernels and
will need to be applied by hand.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
In SSD/hard disk hybid storage, discard request should be ignored for hard
disk. We used to be doing this way, but the unplug path forgets it.
This is suitable for stable tree since v3.6.
Cc: stable@vger.kernel.org
Reported-and-tested-by: Markus <M4rkusXXL@web.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Hi.
Raid1 and raid10 devices leak memory every time they stop.
This is a patch for linux-3.9.0-rc7 to fix this problem.
Thanks,
Hirokazu Takahashi.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is assembled incrementally with mdadm -I -R
and the array switches to "active" mode, md starts a recovery.
If the array was clean, the "fullsync" flag will be 0. Skip
the full recovery in this case, as RAID1 does (the code was
actually copied from the sync_request() method of RAID1).
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
More prep work for immutable bio vecs, mainly getting rid of references
to bi_idx.
bio_reset was being open coded in a few places. The one in sync_request
was a bit nontrivial to convert, so could use some extra eyeballs.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: NeilBrown <neilb@suse.de>
Random cleanup - this code was duplicated and it's not really specific
to md.
Also added the ability to return the actual error code.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
For immutable bvecs, all bi_idx usage needs to be audited - so here
we're removing all the unnecessary uses.
Most of these are places where it was being initialized on a bio that
was just allocated, a few others are conversions to standard macros.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
In the current code bio_split() won't be seeing partially completed bios
so this doesn't change any behaviour, but this makes the code a bit
clearer as to what bio_split() actually requires.
The immediate purpose of the patch is removing unnecessary bi_idx
references, but the end goal is to allow partial completed bios to be
submitted, which along with immutable biovecs enables effecient bio
splitting.
Some of the callers were (double) checking that bios could be split, so
update their checks too.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Martin K. Petersen <martin.petersen@oracle.com>