Pull device mapper fixes from Mike Snitzer:
- Fix DM multipath IO hang regression from 3.15 due to logic bug in
multipath_busy. This impacted cable-pull testing and also the
ability to boot with IPR SCSI on a POWER8 box.
- Fix possible deadlock with deferred device removal by using a new
dedicated workqueue rather than using the system workqueue.
- Fix NULL pointer crash due to race condition in dm-io's wake up code
for sync_io by using a completion.
- Update dm-crypt and dm-zero author name following legal name change;
this is important to Jana so I didn't see any reason to hold it back.
* tag 'dm-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm mpath: fix IO hang due to logic bug in multipath_busy
dm io: fix a race condition in the wake up code for sync_io
dm crypt, dm zero: update author name following legal name change
dm: allocate a special workqueue for deferred device removal
Commit e80991773 ("dm mpath: push back requests instead of queueing")
modified multipath_busy() to return true if !pg_ready(). pg_ready()
checks the current state of the multipath device and may return false
even if a new IO is needed to change the state.
Bart Van Assche reported that he had multipath IO lockup when he was
performing cable pull tests. Analysis showed that the multipath
device had a single path group with both paths active, but that the
path group itself was not active. During the multipath device state
transitions 'queue_io' got set but nothing could clear it. Clearing
'queue_io' only happens in __choose_pgpath(), but it won't be called
if multipath_busy() returns true due to pg_ready() returning false
when 'queue_io' is set.
As such the !pg_ready() check in multipath_busy() is wrong because new
IO will not be sent to multipath target and the multipath state change
won't happen. That results in multipath IO lockup.
The intent of multipath_busy() is to avoid unnecessary cycles of
dequeue + request_fn + requeue if it is known that the multipath
device will requeue.
Such "busy" situations would be:
- path group is being activated
- there is no path and the multipath is setup to requeue if no path
Fix multipath_busy() to return "busy" early only for these specific
situations.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.15
There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread. If the thread
is spuriously woken immediately after the decrement it may exit,
making the on stack io struct invalid, yet the dec_count could still
be using it.
Fix this race by using a completion in sync_io() and dec_count().
Reported-by: Minfei Huang <huangminfei@ucloud.cn>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
The commit 2c140a246d ("dm: allow remove to be deferred") introduced a
deferred removal feature for the device mapper. When this feature is
used (by passing a flag DM_DEFERRED_REMOVE to DM_DEV_REMOVE_CMD ioctl)
and the user tries to remove a device that is currently in use, the
device will be removed automatically in the future when the last user
closes it.
Device mapper used the system workqueue to perform deferred removals.
However, some targets (dm-raid1, dm-mpath, dm-stripe) flush work items
scheduled for the system workqueue from their destructor. If the
destructor itself is called from the system workqueue during deferred
removal, it introduces a possible deadlock - the workqueue tries to flush
itself.
Fix this possible deadlock by introducing a new workqueue for deferred
removals. We allocate just one workqueue for all dm targets. The
ability of dm targets to process IOs isn't dependent on deferred removal
of unused targets, so a deadlock due to shared workqueue isn't possible.
Also, cleanup local_init() to eliminate potential for returning success
on failure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).
If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.
Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.
We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not. That
depends on details of the particular raid level and particular write
request.
This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel. It applied correctly to 3.0 at
least and will minor editing to earlier kernels.
Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
Pull device mapper updates from Mike Snitzer:
"This pull request is later than I'd have liked because I was waiting
for some performance data to help finally justify sending the
long-standing dm-crypt cpu scalability improvements upstream.
Unfortunately we came up short, so those dm-crypt changes will
continue to wait, but it seems we're not far off.
. Add dm_accept_partial_bio interface to DM core to allow DM targets
to only process a portion of a bio, the remainder being sent in the
next bio. This enables the old dm snapshot-origin target to only
split write bios on chunk boundaries, read bios are now sent to the
origin device unchanged.
. Add DM core support for disabling WRITE SAME if the underlying SCSI
layer disables it due to command failure.
. Reduce lock contention in DM's bio-prison.
. A few small cleanups and fixes to dm-thin and dm-era"
* tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: update discard_granularity to reflect the thin-pool blocksize
dm bio prison: implement per bucket locking in the dm_bio_prison hash table
dm: remove symbol export for dm_set_device_limits
dm: disable WRITE SAME if it fails
dm era: check for a non-NULL metadata object before closing it
dm thin: return ENOSPC instead of EIO when error_if_no_space enabled
dm thin: cleanup noflush_work to use a proper completion
dm snapshot: do not split read bios sent to snapshot-origin target
dm snapshot: allocate a per-target structure for snapshot-origin target
dm: introduce dm_accept_partial_bio
dm: change sector_count member in clone_info from sector_t to unsigned
DM thinp already checks whether the discard_granularity of the data
device is a factor of the thin-pool block size. But when using the
dm-thin-pool's discard passdown support, DM thinp was not selecting the
max of the underlying data device's discard_granularity and the
thin-pool's block size.
Update set_discard_limits() to set discard_granularity to the max of
these values. This enables blkdev_issue_discard() to properly align the
discards that are sent to the DM thin device on a full block boundary.
As such each discard will now cover an entire DM thin-pool block and the
block will be reclaimed.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Split the single per bio-prison lock by using per bucket locking. Per
bucket locking benefits both dm-thin and dm-cache targets by reducing
bio-prison lock contention.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull md updates from Neil Brown:
"Assorted md fixes for 3.16
Mostly performance improvements with a few corner-case bug fixes"
* tag 'md/3.16' of git://neil.brown.name/md:
raid5: speedup sync_request processing
md/raid5: deadlock between retry_aligned_read with barrier io
raid5: add an option to avoid copy data from bio to stripe cache
md/bitmap: remove confusing code from filemap_get_page.
raid5: avoid release list until last reference of the stripe
md: md_clear_badblocks should return an error code on failure.
md/raid56: Don't perform reads to support writes until stripe is ready.
md: refuse to change shape of array if it is active but read-only
The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread. The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe. This
creates some more parallelism between the resync-thread and raid5 kernel daemon.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs, the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0
Signed-off-by: NeilBrown <neilb@suse.de>
There is no need for code other than DM core to use dm_set_device_limits
so remove its EXPORT_SYMBOL_GPL. Also, cleanup a couple whitespace nits.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add DM core support for disabling WRITE SAME on first failure to both
request-based and bio-based targets. The need to disable WRITE SAME
stems from SCSI enabling it by default but then disabling it when it
fails. When SCSI does this it returns "permanent target failure, do
not retry" using -EREMOTEIO. Update DM core to only disable WRITE SAME
on failure if the returned error is -EREMOTEIO.
Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails")
implemented multipath specific disabling of WRITE SAME if it fails.
However, as that commit detailed, the multipath-only solution doesn't go
far enough if bio-based DM targets are stacked ontop of the
request-based dm-multipath target (as is commonly done using dm-linear
to support partitions on multipath devices, via kpartx).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Alex Chen <alex.chen@huawei.com>
era_ctr() may call era_destroy() before era->md is initialized so
era_destory() must only close the metadata object if it is not NULL.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Naohiro Aota <naota@elisp.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.15+
Update the DM thin provisioning target's allocation failure error to be
consistent with commit a9d6ceb8 ("[SCSI] return ENOSPC on thin
provisioning failure").
The DM thin target now returns -ENOSPC rather than -EIO when
block allocation fails due to the pool being out of data space (and
the 'error_if_no_space' thin-pool feature is enabled).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-By: Joe Thornber <ejt@redhat.com>
Factor out a pool_work interface that noflush_work makes use of to wait
for and complete work items (in terms of a proper completion struct).
Allows discontinuing the use of a custom completion in terms of atomic_t
and wait_event.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change the snapshot-origin target so that only write bios are split on
chunk boundary. Read bios are passed unchanged to the underlying
device, so they don't have to be split.
Later, we could change the target so that it accepts a larger write bio
if it spans an area that is completely covered by snapshot exceptions.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allocate a per-target dm_origin structure. This is a prerequisite for
the next commit ("dm snapshot: do not split read bios sent to
snapshot-origin target") which adds a new member to this structure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The function dm_accept_partial_bio allows the target to specify how many
sectors of the current bio it will process. If the target only wants to
accept part of the bio, it calls dm_accept_partial_bio and the DM core
sends the rest of the data in next bio.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It is impossible to create bios with 2^23 or more sectors (the size is
stored as a 32-bit byte count in the bio). So we convert some sector_t
values to unsigned integers.
This is needed for the next commit ("dm: introduce
dm_accept_partial_bio") that replaces integer value arguments with
pointers, so the size of the integer must match.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull two md bugfixes from Neil Brown:
"Two md bugfixes for possible corruption when restarting reshape
If a raid5/6 reshape is restarted (After stopping and re-assembling
the array) and the array is marked read-only (or read-auto), then the
reshape will appear to complete immediately, without actually moving
anything around. This can result in corruption.
There are two patches which do much the same thing in different
places. They are separate because one is an older bug and so can be
applied to more -stable kernels"
* tag 'md/3.15-fixes' of git://neil.brown.name/md:
md: always set MD_RECOVERY_INTR when interrupting a reshape thread.
md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".