commit faa5673617 upstream.
The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.
Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 29ebf465b9 upstream.
Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.
This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c9502ea442 upstream.
If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.
So keep track of whether we've removed stuff from sysfs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5caa52afc5 upstream.
Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54d12f2b4f upstream.
Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6aa8f1a6ca upstream.
In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.
Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is needed...
that check for a destructor was racy. Argh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7bb23c4934 upstream.
1/ When an different between blocks is found, data is copied from
one bio to the other. However bv_len is used as the length to
copy and this could be zero. So use r10_bio->sectors to calculate
length instead.
Using bv_len was probably always a bit dubious, but the introduction
of bio_advance made it much more likely to be a problem.
2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
except on bios that we schedule for a read. This ensures that
missing/failed devices don't confuse the loop at the top of
sync_request write.
Commit 8be185f2c9 "raid10: Use bio_reset()"
removed a loop which set BIO_UPTDATE on all appropriate bios.
So we need to re-add that flag.
These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 78eaa0d4cb upstream.
1/ If a RAID10 is being reshaped to a fewer number of devices
and is stopped while this is ongoing, then when the array is
reassembled the 'mirrors' array will be allocated too small.
This will lead to an access error or memory corruption.
2/ A sanity test for a reshaping RAID10 array is restarted
is slightly incorrect.
Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1376512065 upstream.
The recent comment:
commit 7e83ccbecd
md/raid10: Allow skipping recovery when clean arrays are assembled
Causes raid10 to skip a recovery in certain cases where it is safe to
do so. Unfortunately it also causes a reshape to be skipped which is
never safe. The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).
Bug was introduced in 3.10, so this is suitable for 3.10-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.
Using raise_barrier will sometimes deadlock. Using freeze_array
should not.
As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.
The deadlock was made particularly noticeable by commits
050b66152f (raid10) and 6b740b8d79 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.
This patch probably won't apply directly to some early kernels and
will need to be applied by hand.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block layer fixes from Jens Axboe:
"Outside of bcache (which really isn't super big), these are all
few-liners. There are a few important fixes in here:
- Fix blk pm sleeping when holding the queue lock
- A small collection of bcache fixes that have been done and tested
since bcache was included in this merge window.
- A fix for a raid5 regression introduced with the bio changes.
- Two important fixes for mtip32xx, fixing an oops and potential data
corruption (or hang) due to wrong bio iteration on stacked devices."
* 'for-linus' of git://git.kernel.dk/linux-block:
scatterlist: sg_set_buf() argument must be in linear mapping
raid5: Initialize bi_vcnt
pktcdvd: silence static checker warning
block: remove refs to XD disks from documentation
blkpm: avoid sleep when holding queue lock
mtip32xx: Correctly handle bio->bi_idx != 0 conditions
mtip32xx: Fix NULL pointer dereference during module unload
bcache: Fix error handling in init code
bcache: clarify free/available/unused space
bcache: drop "select CLOSURES"
bcache: Fix incompatible pointer type warning
Fix detection of the need to resize the dm thin metadata device.
The code incorrectly tried to extend the metadata device when it
didn't need to due to a merging error with patch 24347e9 ("dm thin:
detect metadata device resizing").
device-mapper: transaction manager: couldn't open metadata space map
device-mapper: thin metadata: tm_open_with_sm failed
device-mapper: thin: aborting transaction failed
device-mapper: thin: switching pool to failure mode
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
symbol CLOSURES. That symbol was used in development versions of bcache,
but was removed when the closures code was no longer provided as a
kernel library. It can safely be dropped.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
The function pointer release in struct block_device_operations
should point to functions declared as void.
Sparse warnings:
drivers/md/bcache/super.c:656:27: warning:
incorrect type in initializer (different base types)
drivers/md/bcache/super.c:656:27:
expected void ( *release )( ... )
drivers/md/bcache/super.c:656:27:
got int ( static [toplevel] *<noident> )( ... )
drivers/md/bcache/super.c:656:2: warning:
initialization from incompatible pointer type [enabled by default]
drivers/md/bcache/super.c:656:2: warning:
(near initialization for ‘bcache_ops.release’) [enabled by default]
Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Share configuration option processing code between the dm cache
ctr and message functions.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Generate a dm event when the amount of remaining thin pool metadata
space falls below a certain level.
The threshold is taken to be a quarter of the size of the metadata
device with a minimum threshold of 4MB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a threshold callback to dm persistent data space maps.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>