Commit Graph

2279 Commits

Author SHA1 Message Date
NeilBrown afbaa90b80 md/bitmap: prevent bitmap_daemon_work running while initialising bitmap
If a bitmap is added while the array is active, it is possible
for bitmap_daemon_work to run while the bitmap is being
initialised.
This is particularly a problem if bitmap_daemon_work sees
bitmap->filemap as non-NULL before it has been filled in properly.
So hold bitmap_info.mutex while filling in ->filemap
to prevent problems.

This patch is suitable for any -stable kernel, though it might not
apply cleanly before about 3.1.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:05:06 +10:00
majianpeng f4380a9158 md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.
If r1bio->sectors % 8 != 0,then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:04:47 +10:00
Andrei Warkentin 9e41dd35b3 MD: Bitmap version cleanup.
bitmap_new_disk_sb() would still create V3 bitmap superblock
with host-endian layout.

Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be
creating a V4 bitmap superblock instead, that is portable,
as per comment in bitmap.h?

Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 15:55:21 +10:00
NeilBrown 5020ad7d14 md/raid1,raid10: don't compare excess byte during consistency check.
When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd sizes devices we might read fewer than that.

This bug has been present "forever" but at worst it might cause
a report of two many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:39:23 +10:00
majianpeng c6d2e084c7 md/raid5: Fix a bug about judging if the operation is syncing or replacing
When create a raid5 using assume-clean and echo check or repair to
sync_action.Then component disks did not operated IO but the raid
check/resync faster than normal.
Because the judgement in function analyse_stripe():
		if (do_recovery ||
		    sh->sector >= conf->mddev->recovery_cp)
			s->syncing = 1;
		else
			s->replacing = 1;
When check or repair,the recovery_cp == MaxSectore,so syncing equal zero
not one.

This bug was introduced by commit 9a3e1101b8
    md/raid5:  detect and handle replacements during recovery.
so this patch is suitable for 3.3-stable.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:38 +10:00
majianpeng a42f9d83b5 md/raid1:Remove unnecessary rcu_dereference(conf->mirrors[i].rdev).
Because rde->nr_pending > 0,so can not remove this disk.
And in any case, we aren't holding rcu_read_lock()

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:33 +10:00
Jes Sorensen 24b961f811 md: Avoid OOPS when reshaping raid1 to raid0
raid1 arrays do not have the notion of chunk size. Calculate the
largest chunk sector size we can use to avoid a divide by zero OOPS
when aligning the size of the new array to the chunk size.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:26 +10:00
NeilBrown 18b9837ea0 md/raid5: fix handling of bad blocks during recovery.
1/ We can only treat a known-bad-block like a read-error if we
   have the data that belongs in that block.  So fix that test.

2/ If we cannot recovery a stripe due to insufficient data,
   don't tell "md_done_sync" that the sync failed unless we really
   did fail something.  If we successfully record bad blocks,
   that is success.

Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:36:17 +10:00
majianpeng 5220ea1e64 md/raid1: If md_integrity_register() failed,run() must free the mem
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:38 +10:00
majianpeng 0366ef8475 md/raid0: If md_integrity_register() fails, raid0_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
majianpeng 98d5561bfb md/linear: If md_integrity_register() fails, linear_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
Mikulas Patocka a4ffc15219 dm: add verity target
This device-mapper target creates a read-only device that transparently
validates the data on one underlying device against a pre-generated tree
of cryptographic checksums stored on a second device.

Two checksum device formats are supported: version 0 which is already
shipping in Chromium OS and version 1 which incorporates some
improvements.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Cc: Milan Broz <mbroz@redhat.com>
Cc: Olof Johansson <olofj@chromium.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:43:38 +01:00
Mikulas Patocka a66cc28f53 dm bufio: prefetch
This patch introduces a new function dm_bufio_prefetch. It prefetches
the specified range of blocks into dm-bufio cache without waiting
for i/o completion.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 67e2e2b281 dm thin: add pool target flags to control discard
Add dm thin target arguments to control discard support.

ignore_discard: Disables discard support

no_discard_passdown: Don't pass discards down to the underlying data
device, but just remove the mapping within the thin provisioning target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 104655fd4d dm thin: support discards
Support discards in the thin target.

On discard the corresponding mapping(s) are removed from the thin
device.  If the associated block(s) are no longer shared the discard
is passed to the underlying device.

All bios other than discards now have an associated deferred_entry
that is saved to the 'all_io_entry' in endio_hook.  When non-discard
IO completes and associated mappings are quiesced any discards that
were deferred, via ds_add_work() in process_discard(), will be queued
for processing by the worker thread.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

drivers/md/dm-thin.c |  173 ++++++++++++++++++++++++++++++++++++++++++++++----
 drivers/md/dm-thin.c |  172 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 158 insertions(+), 14 deletions(-)
2012-03-28 18:41:28 +01:00
Joe Thornber eb2aa48d4e dm thin: prepare to support discard
This patch contains the ground work needed for dm-thin to support discard.

  - Adds endio function that replaces shared_read_endio.

  - Introduce an explicit 'quiesced' flag into the new_mapping structure.
    Before, this was implicitly indicated by m->list being empty.

  - The map_info->ptr remains constant for the duration of a bio's trip
    through the thin target.  Make it easier to reason about it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Alasdair G Kergon 6efd6e8309 dm thin: use dm_target_offset
Use dm_target_offset wrapper instead of referencing the awkward ti->begin
explicitly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 2dd9c257fb dm thin: support read only external snapshot origins
Support the use of an external _read only_ device as an origin for a thin
device.

Any read to an unprovisioned area of the thin device will be passed
through to the origin.  Writes trigger allocation of new blocks as
usual.

One possible use case for this would be VM hosts that want to run
guests on thinly-provisioned volumes but have the base image on another
device (possibly shared between many VMs).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Mike Snitzer c4a69ecdb4 dm thin: relax hard limit on the maximum size of a metadata device
The thin metadata format can only make use of a device that is <=
THIN_METADATA_MAX_SECTORS (currently 15.9375 GB).  Therefore, there is no
practical benefit to using a larger device.

However, it may be that other factors impose a certain granularity for
the space that is allocated to a device (E.g. lvm2 can impose a coarse
granularity through the use of large, >= 1 GB, physical extents).

Rather than reject a larger metadata device, during thin-pool device
construction, switch to allowing it but issue a warning if a device
larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
provided.  Any space over 15.9375 GB will not be used.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 71fd5ae25d dm persistent data: remove space map ref_count entries if redundant
Save space by removing entries from the space map ref_count tree if
they're no longer needed.

Ref counts are stored in two places: a bitmap if the ref_count is
below 3, or a btree of uint32_t if 3 or above.

When a ref_count that was above 3 drops below we can remove it from
the tree and save some metadata space.  This removal was commented out
before because I was unsure why this was causing under-populated btree
nodes.  Earlier patches have fixed this issue.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Joe Thornber 905e51b39a dm thin: commit outstanding data every second
Commit unwritten data every second to prevent too much building up.

Released blocks don't become available until after the next commit
(for crash resilience).  Prior to this patch commits were only
triggered by a message to the target or a REQ_{FLUSH,FUA} bio.  This
allowed far too big a position to build up.

The interval is hard-coded to 1 second.  This is a sensible setting.
I'm not making this user configurable, since there isn't much to be
gained by tweaking this - and a lot lost by setting it far too high.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Mikulas Patocka 31998ef193 dm: reject trailing characters in sccanf input
Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.

For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".

As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.

This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.

The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Jonathan E Brassow 0447568fc5 dm raid: handle failed devices during start up
The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read.  This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.

With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table.  That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Joe Thornber fef838cc1a dm thin metadata: pass correct space map to dm_sm_root_size
Fix a harmless typo.

The root is a chunk of data that gets written to the superblock.  This
data is used to recreate the space map when opening a metadata area.
We have two space maps; one tracking space on the metadata device and
one of the data device.  Both of these use the same format for their
root, so this typo was harmless.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Joe Thornber a3aefb395e dm persistent data: remove redundant value_size arg from value_ptr
Now that the value_size is held within every node of the btrees we can
remove this argument from value_ptr().

For the last few months a BUG_ON has been checking this argument is
the same as that held in the node.  No issues were reported.  So this
is a safe change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00