Commit Graph

2215 Commits

Author SHA1 Message Date
Linus Torvalds 5d0edf2915 Merge tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper fixes for 3.3 from Alasdair Kergon

Eight small device-mapper bug fixes.

* tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm raid: fix flush support
  dm raid: set MD_CHANGE_DEVS when rebuilding
  dm thin metadata: decrement counter after removing mapped block
  dm thin metadata: unlock superblock in init_pmd error path
  dm thin metadata: remove incorrect close_device on creation error paths
  dm flakey: fix crash on read when corrupt_bio_byte not set
  dm io: fix discard support
  dm ioctl: do not leak argv if target message only contains whitespace
2012-03-08 17:21:51 -08:00
Jonathan E Brassow 0ca93de9b7 dm raid: fix flush support
Fix dm-raid flush support.

Both md and dm have support for flush, but the dm-raid target
forgot to set the flag to indicate that flushes should be
passed on.  (Important for data integrity e.g. with writeback cache
enabled.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:48 +00:00
Jonathan E Brassow 3aa3b2b2b1 dm raid: set MD_CHANGE_DEVS when rebuilding
The 'rebuild' parameter is used to rebuild individual devices in an
array (e.g. resynchronize a RAID1 device or recalculate a parity device
in higher RAID).  The MD_CHANGE_DEVS flag must be set when this
parameter is given in order to write out the superblocks and make the
change take immediate effect.  The code that handles new devices in
super_load already sets MD_CHANGE_DEVS and 'FirstUse'.  (The 'FirstUse'
flag was being set as a special case for rebuilds in
super_init_validation.)

Add a condition for rebuilds in super_load to take care of both flags
without the special case in 'super_init_validation'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:47 +00:00
Joe Thornber af63bcb817 dm thin metadata: decrement counter after removing mapped block
Correct the number of mapped sectors shown on a thin device's
status line by decrementing td->mapped_blocks in __remove() each time
a block is removed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:44 +00:00
Joe Thornber 4469a5f387 dm thin metadata: unlock superblock in init_pmd error path
If dm_sm_disk_create() fails the superblock must be unlocked.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:43 +00:00
Mike Snitzer 1f3db25d8b dm thin metadata: remove incorrect close_device on creation error paths
The __open_device() error paths in __create_thin() and __create_snap()
incorrectly call __close_device() even if td was not initialized by
__open_device().  Remove this.

Also document __open_device() return values, remove a redundant
td->changed = 1 in __create_thin(), and insert an additional
safeguard against creating an already-existing device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:41 +00:00
Mike Snitzer 1212268fd9 dm flakey: fix crash on read when corrupt_bio_byte not set
The following BUG is hit on the first read that is submitted to a dm
flakey test device while the device is "down" if the corrupt_bio_byte
feature wasn't requested when the device's table was loaded.

Example DM table that will hit this BUG:
0 2097152 flakey 8:0 2048 0 30

This bug was introduced by commit a3998799fb
(dm flakey: add corrupt_bio_byte feature) in v3.1-rc1.

BUG: unable to handle kernel paging request at ffff8801cfce3fff
IP: [<ffffffffa008c233>] corrupt_bio_data+0x6e/0xae [dm_flakey]
PGD 1606063 PUD 0
Oops: 0002 [#1] SMP
...
Call Trace:
 <IRQ>
 [<ffffffffa008c2b5>] flakey_end_io+0x42/0x48 [dm_flakey]
 [<ffffffffa00dca98>] clone_endio+0x54/0xb6 [dm_mod]
 [<ffffffff81130587>] bio_endio+0x2d/0x2f
 [<ffffffff811c819a>] req_bio_endio+0x96/0x9f
 [<ffffffff811c94b9>] blk_update_request+0x1dc/0x3a9
 [<ffffffff812f5ee2>] ? rcu_read_unlock+0x21/0x23
 [<ffffffff811c96a6>] blk_update_bidi_request+0x20/0x6e
 [<ffffffff811c9713>] blk_end_bidi_request+0x1f/0x5d
 [<ffffffff811c978d>] blk_end_request+0x10/0x12
 [<ffffffff8128f450>] scsi_io_completion+0x1e5/0x4b1
 [<ffffffff812882a9>] scsi_finish_command+0xec/0xf5
 [<ffffffff8128f830>] scsi_softirq_done+0xff/0x108
 [<ffffffff811ce284>] blk_done_softirq+0x84/0x98
 [<ffffffff81048d19>] __do_softirq+0xe3/0x1d5
 [<ffffffff8138f83f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffff810997cf>] ? handle_irq_event+0x4c/0x61
 [<ffffffff8139833c>] call_softirq+0x1c/0x30
 [<ffffffff81003b37>] do_softirq+0x4b/0xa3
 [<ffffffff81048a39>] irq_exit+0x53/0xca
 [<ffffffff81398acd>] do_IRQ+0x9d/0xb4
 [<ffffffff81390333>] common_interrupt+0x73/0x73
...

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:39 +00:00
Milan Broz 0c535e0d6f dm io: fix discard support
This patch fixes a crash by recognising discards in dm_io.

Currently dm_mirror can send REQ_DISCARD bios if running over a
discard-enabled device and without support in dm_io the system
crashes badly.

BUG: unable to handle kernel paging request at 00800000
IP:  __bio_add_page.part.17+0xf5/0x1e0
...
 bio_add_page+0x56/0x70
 dispatch_io+0x1cf/0x240 [dm_mod]
 ? km_get_page+0x50/0x50 [dm_mod]
 ? vm_next_page+0x20/0x20 [dm_mod]
 ? mirror_flush+0x130/0x130 [dm_mirror]
 dm_io+0xdc/0x2b0 [dm_mod]
...

Introduced in 2.6.38-rc1 by commit 5fc2ffeabb
(dm raid1: support discard).

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@kernel.org
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:37 +00:00
Jesper Juhl 902c6a96a7 dm ioctl: do not leak argv if target message only contains whitespace
If 'argc' is zero we jump to the 'out:' label, but this leaks the
(unused) memory that 'dm_split_args()' allocated for 'argv' if the
string being split consisted entirely of whitespace.  Jump to the
'out_argv:' label instead to free up that memory.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:34 +00:00
Linus Torvalds a2e5f13ce8 Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
 "Three fixes for md in 3.3-rc: Two relate to the recently added drive
  replacement.  One fixes the problem where a read error in RAID10 would
  sometimes be retried indefinitely."

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md/raid10: fix assembling of arrays with replacement devices.
  md/raid10: fix handling of error on last working device in array.
  md/raid1: fix buglet in md_raid1_contested.
2012-03-05 16:01:25 -08:00
NeilBrown 7a90484825 md/raid10: fix assembling of arrays with replacement devices.
commit 56a2559bb6 (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status if the device, but it didn't remove the
old unconditional setting of 'rdev'.  So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-06 10:12:45 +11:00
NeilBrown fae8cc5ed0 md/raid10: fix handling of error on last working device in array.
If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
   was the last usable device.  This is efficient but a little racy
   and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
   retry which we tried to mark as faulty (that might have failed if
   it was the last device).  This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-14 11:10:10 +11:00
NeilBrown f53e29fc87 md/raid1: fix buglet in md_raid1_contested.
Since we added 'replacement' capability, RAID1 can have twice
as many devices as ->raid_disks indicates.
So md_raid1_congested needs to check that many possible devices,
not just ->raid_disks many.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-13 14:24:05 +11:00
Linus Torvalds 4d39aa1b99 Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Some simple md-related fixes.

1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md: two small fixes to handling interrupt resync.
  Prevent DM RAID from loading bitmap twice.
2012-02-08 19:06:30 -08:00
NeilBrown db91ff55bd md: two small fixes to handling interrupt resync.
1/ If a resync is aborted we should record how far we got
 (recovery_cp) the last request that we know has completed
 (->curr_resync_completed) rather than the last request that was
 submitted (->curr_resync).

2/ When a resync aborts we still want to update the metadata with
 any changes, so set MD_CHANGE_DEVS even if we 'skip'.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-07 12:01:51 +11:00
Jonathan Brassow 34f8ac6d79 Prevent DM RAID from loading bitmap twice.
The life cycle of a device-mapper target is:
1) create
2) resume
3) suspend
*) possibly repeat from 2
4) destroy

The dm-raid target is unconditionally calling MD's bitmap_load function upon
every resume.  If steps 2 & 3 above are repeated, bitmap_load is called
multiple times.  It is only written to be called once; otherwise, it allocates
new memory for the bitmap (without freeing the old) and incrementing the number
of pages it thinks it has without zeroing first.  This ultimately leads to
access beyond allocated memory and lost memory.

Simply avoiding the bitmap_load call upon resume is not sufficient.  If the
target was suspended while the initial recovery was only partially complete,
it needs to be restarted when the target is resumed.  This is why
'md_wakeup_thread' is called before issuing the 'mddev_resume'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-31 09:43:41 +11:00
Linus Torvalds b3c9dd182e Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
  Revert "block: recursive merge requests"
  block: Stop using macro stubs for the bio data integrity calls
  blockdev: convert some macros to static inlines
  fs: remove unneeded plug in mpage_readpages()
  block: Add BLKROTATIONAL ioctl
  block: Introduce blk_set_stacking_limits function
  block: remove WARN_ON_ONCE() in exit_io_context()
  block: an exiting task should be allowed to create io_context
  block: ioc_cgroup_changed() needs to be exported
  block: recursive merge requests
  block, cfq: fix empty queue crash caused by request merge
  block, cfq: move icq creation and rq->elv.icq association to block core
  block, cfq: restructure io_cq creation path for io_context interface cleanup
  block, cfq: move io_cq exit/release to blk-ioc.c
  block, cfq: move icq cache management to block core
  block, cfq: move io_cq lookup to blk-ioc.c
  block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
  block, cfq: reorganize cfq_io_context into generic and cfq specific parts
  block: remove elevator_queue->ops
  block: reorder elevator switch sequence
  ...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)
2012-01-15 12:24:45 -08:00
Paolo Bonzini ec8013bedd dm: do not forward ioctls from logical volumes to the underlying device
A logical volume can map to just part of underlying physical volume.
In this case, it must be treated like a partition.

Based on a patch from Alasdair G Kergon.

Cc: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-14 15:07:24 -08:00
Linus Torvalds c086ae4ed9 Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Two bugfixes for md.

One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON.  Has been tagged for -stable.
The other is minor missing functionality.

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md/raid1: perform bad-block tests for WriteMostly devices too.
  md: notify the 'degraded' sysfs attribute on failure.
2012-01-11 18:51:55 -08:00
Martin K. Petersen b1bd055d39 block: Introduce blk_set_stacking_limits function
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11 16:27:11 +01:00
NeilBrown 307729c8bc md/raid1: perform bad-block tests for WriteMostly devices too.
We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.

With the current code, best_good_sectors might not get set and that
causes zero-length read requests to be send down which is very
confusing.

This bug was introduced in commit d2eb35acfd and so the patch
is suitable for 3.1.x and 3.2.x

Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
2012-01-11 08:35:17 +11:00
NeilBrown f2a371c5e7 md: notify the 'degraded' sysfs attribute on failure.
We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.

Notifying on failure is a little awkward as it happen in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.

Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-11 08:35:14 +11:00
Linus Torvalds 2943c83322 Merge tag 'md-3.3' of git://neil.brown.name/md
md update for 3.3

Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.

* tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
  md/raid1: Mark device want_replacement when we see a write error.
  md/raid1: If there is a spare and a want_replacement device, start replacement.
  md/raid1: recognise replacements when assembling arrays.
  md/raid1: handle activation of replacement device when recovery completes.
  md/raid1: Allow a failed replacement device to be removed.
  md/raid1: Allocate spare to store replacement devices and their bios.
  md/raid1:  Replace use of mddev->raid_disks with conf->raid_disks.
  md/raid10: If there is a spare and a want_replacement device, start replacement.
  md/raid10: recognise replacements when assembling array.
  md/raid10: Allow replacement device to be replace old drive.
  md/raid10: handle recovery of replacement devices.
  md/raid10:  Handle replacement devices during resync.
  md/raid10: writes should get directed to replacement as well as original.
  md/raid10: allow removal of failed replacement devices.
  md/raid10: preferentially read from replacement device if possible.
  md/raid10:  change read_balance to return an rdev
  md/raid10: prepare data structures for handling replacement.
  md/raid5: Mark device want_replacement when we see a write error.
  md/raid5: If there is a spare and a want_replacement device, start replacement.
  md/raid5: recognise replacements when assembling array.
  ...
2012-01-08 13:28:33 -08:00
Al Viro ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
NeilBrown 19d671695e md/raid1: Mark device want_replacement when we see a write error.
Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error.  It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Signed-off-by:  NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00