A clear_region function is permitted to block (in practice, rare) but gets
called in rh_update_states() with a spinlock held.
The bits being marked and cleared by the above functions are used
to update the on-disk log, but are never read directly. We can
perform these operations outside the spinlock since the
bits are only changed within one thread viz.
- mark_region in rh_inc()
- clear_region in rh_update_states().
So, we grab the clean_regions list items via list_splice() within the
spinlock and defer clear_region() until we iterate over the list for
deletion - similar to how the recovered_regions list is already handled.
We then move the flush() call down to ensure it encapsulates the changes
which are done by the later calls to clear_region().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow invalid snapshots to be activated instead of failing.
This allows userspace to reinstate any given snapshot state - for
example after an unscheduled reboot - and clean up the invalid snapshot
at its leisure.
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Process persistent exception store metadata IOs in a separate thread.
A snapshot may become invalid while inside generic_make_request().
A synchronous write is then needed to update the metadata while still
inside that function. Since the introduction of
md-dm-reduce-stack-usage-with-stacked-block-devices.patch this has to
be performed by a separate thread to avoid deadlock.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix mirror status line broken in dm-log-report-fault-status.patch:
- space missing between two words
- placeholder ("0") required for compatibility with a subsequent patch
- incorrect offset parameter
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove explicit module name from messages as the macro now includes it
automatically.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new KMEM_CACHE() macro and make the newly-exposed structure names more
meaningful. Also remove some superfluous casts and inlines (let a modern
compiler be the judge).
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If raid1/repair (which reads all block and fixes any differences it finds)
hits a read error, it doesn't reset the bio for writing before writing
correct data back, so the read error isn't fixed, and the device probably
gets a zero-length write which it might complain about.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
garbage can get synced on top of good data.
2/ We round the wrong way in part of the device size calculation, which
can cause confusion.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding a drive to a linear array seems to have stopped working, due to changes
elsewhere in md, and insufficient ongoing testing...
So the patch to make linear hot-add work in the first place introduced a
subtle bug elsewhere that interracts poorly with older version of mdadm.
This fixes it all up.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible that real data or metadata follows the bitmap without full page
alignment.
So limit the last write to be only the required number of bytes, rounded up to
the hard sector size of the device.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a raid0 has a component device larger than 4TB, and is accessed on a 32bit
machines, then as 'chunk' is unsigned long,
chunk << chunksize_bits
can overflow (this can be as high as the size of the device in KB). chunk
itself will not overflow (without triggering a BUG).
So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes
impossible to hit.
Cc: "Jeff Zheng" <Jeff.Zheng@endace.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they are.
This test in done in is_mddev_idle, and it is somewhat fragile - it
sometimes thinks there is non-sync io when there isn't.
The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io). This has problems because total sectors gets
updated when a request completes, while resync io gets updated when the
request is submitted. The time difference can cause large differenced
between the two which do not actually imply non-resync activity. The test
currently allows for some fuzz (+/- 4096) but there are some cases when it
is not enough.
The test currently looks for any (non-fuzz) difference, either positive or
negative. This clearly is not needed. Any non-sync activity will cause
the total sectors to grow faster than the sync_io count (never slower) so
we only need to look for a positive differences.
If we do this then the amount of in-flight sync io will never cause the
appearance of non-sync IO. Once enough non-sync IO to worry about starts
happening, resync will be slowed down and the measurements will thus be
more precise (as there is less in-flight) and control of resync will still
be suitably responsive.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a raid1 has only one working drive, we want read error to propagate up
to the filesystem as there is no point failing the last drive in an array.
Currently the code perform this check is racy. If a write and a read a
both submitted to a device on a 2-drive raid1, and the write fails followed
by the read failing, the read will see that there is only one working drive
and will pass the failure up, even though the one working drive is actually
the *other* one.
So, tighten up the locking.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 5b479c91da.
Quoth Neil Brown:
"It causes an oops when auto-detecting raid arrays, and it doesn't
seem easy to fix.
The array may not be 'open' when do_md_run is called, so
bdev->bd_disk might be NULL, so bd_set_size can oops.
This whole approach of opening an md device before it has been
assembled just seems to get more and more painful. I think I'm going
to have to come up with something clever to provide both backward
comparability with usage expectation, and sane integration into the
rest of the kernel."
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.
However that doesn't happen until the array is opened, which is later
than some people would like.
So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.
This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).
When it is set, the number of drives, layout and chunksize can have
two possible values, an old an a new.
So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it. We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.
So replace it with C code to do the same thing. Speed is not crucial here, so
something simple and correct is best.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.
With the test in the wrong place, a badly created array will confuse md rather
an produce sensible errors.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the possibility of having uninitialized log state if the
log device has failed.
When a mirror resumes operation, it calls 'resume' on the logging module. If
disk based logging is being used, the log device is read to fill in the log
state. If the log device has failed, we cannot simply return, because this
would leave the in-memory log state uninitialized. Instead, we assume all
regions are out-of-sync and reset the log state. Failure to do this could
result in the logging code reporting a region as in-sync, even though it
isn't; which could result in a corrupted mirror.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>