* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cciss: fix cciss_revalidate panic
block: max hardware sectors limit wrapper
block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
blk-throttle: Correct the placement of smp_rmb()
blk-throttle: Trim/adjust slice_end once a bio has been dispatched
block: check for proper length of iov entries earlier in blk_rq_map_user_iov()
drbd: fix for spin_lock_irqsave in endio callback
drbd: don't recvmsg with zero length
Implement blk_limits_max_hw_sectors() and make
blk_queue_max_hw_sectors() a wrapper around it.
DM needs this to avoid setting queue_limits' max_hw_sectors and
max_sectors directly. dm_set_device_limits() now leverages
blk_limits_max_hw_sectors() logic to establish the appropriate
max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
incorrectly setting max_sectors rather than max_hw_sectors (which
caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.
There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.
The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.
Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.
Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.
So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.
This is applicable to 2.6.35 and later.
Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
With v0.90 metadata, a hot-spare does not become a full member of the
array until recovery is complete. So if we re-add such a device to
the array, we know that all of it is as up-to-date as the event count
would suggest, and so it a bitmap-based recovery is possible.
However with v1.x metadata, the hot-spare immediately becomes a full
member of the array, but it record how much of the device has been
recovered. If the array is stopped and re-assembled recovery starts
from this point.
When such a device is hot-added to an array we currently lose the 'how
much is recovered' information and incorrectly included it as a full
in-sync member (after bitmap-based fixup).
This is wrong and unsafe and could corrupt data.
So be more careful about setting saved_raid_disk - which is what
guides the re-adding of devices back into an array.
The new code matches the code in slot_store which does a similar
thing, which is encouraging.
This is suitable for any -stable kernel.
Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
As recorded in
https://bugzilla.kernel.org/show_bug.cgi?id=24012
it is possible for a flush request through md to hang. This is due to
an interaction between the recursion avoidance in
generic_make_request, the insistence in md of only having one flush
active at a time, and the possibility of dm (or md) submitting two
flush requests to a device from the one generic_make_request.
If a generic_make_request call into dm causes two flush requests to be
queued (as happens if the dm table has two targets - they get one
each), these two will be queued inside generic_make_request.
Assume they are for the same md device.
The first is processed and causes 1 or more flush requests to be sent
to lower devices. These get queued within generic_make_request too.
Then the second flush to the md device gets handled and it blocks
waiting for the first flush to complete. But it won't complete until
the two lower-device requests complete, and they haven't even been
submitted yet as they are on the generic_make_request queue.
The deadlock can be broken by using a separate thread to submit the
requests to lower devices. md has such a thread readily available:
md_wq.
So use it to submit these requests.
Reported-by: Giacomo Catenazzi <cate@cateee.net>
Tested-by: Giacomo Catenazzi <cate@cateee.net>
Signed-off-by: NeilBrown <neilb@suse.de>
submit_flushes is called from exactly one place.
Move the code that is before and after that call into
submit_flushes.
This has not functional change, but will make the next patch
smaller and easier to follow.
Signed-off-by: NeilBrown <neilb@suse.de>
None of the functions called between setting flush_pending to 1, and
atomic_dec_and_test can change flush_pending, or will anything
running in any other thread (as ->flush_bio is not NULL). So the
atomic_dec_and_test will always succeed.
So remove the atomic_sec and the atomic_dec_and_test.
Signed-off-by: NeilBrown <neilb@suse.de>
Before 2.6.37, the md layer had a mechanism for catching I/Os with the
barrier flag set, and translating the barrier into barriers for all
the underlying devices. With 2.6.37, I/O barriers have become plain
old flushes, and the md code was updated to reflect this. However,
one piece was left out -- the md layer does not tell the block layer
that it supports flushes or FUA access at all, which results in md
silently dropping flush requests.
Since the support already seems there, just add this one piece of
bookkeeping.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 4044ba58dd supposedly fixed a
problem where if a raid1 with just one good device gets a read-error
during recovery, the recovery would abort and immediately restart in
an infinite loop.
However it depended on raid1_remove_disk removing the spare device
from the array. But that does not happen in this case. So add a test
so that in the 'recovery_disabled' case, the device will be removed.
This suitable for any kernel since 2.6.29 which is when
recovery_disabled was introduced.
Cc: stable@kernel.org
Reported-by: Sebastian Färber <faerber@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When trying to grow an array by enlarging component devices,
rdev_size_store() expects the return value of rdev_size_change() to be
in sectors, but the actual value is returned in KBs.
This functionality was broken by commit
dd8ac336c1
so this patch is suitable for any kernel since 2.6.30.
Cc: stable@kernel.org
Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Convert direct reads of an inode's i_size to using i_size_read().
i_size_{read,write} use a seqcount to protect reads from accessing
incomple writes. Concurrent i_size_write()s require mutual exclussion
to protect the seqcount that is used by i_size_{read,write}. But
i_size_read() callers do not need to use additional locking.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The code for searching through the device list to read-balance in
raid1 is rather clumsy and hard to follow. Try to simplify it a bit.
No important functionality change here.
Signed-off-by: NeilBrown <neilb@suse.de>
When writing to an 'external' bitmap we don't currently unplug the
device before waiting, so we can get a 3msec delay each time;
So use REQ_UNPLUG to force and unplug.
Signed-off-by: NeilBrown <neilb@suse.de>
bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.
So allocate a local bio pool for each md array and use that rather
than the common pool.
This pool is used both for regular IO and metadata updates.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.
Signed-off-by: NeilBrown <neilb@suse.de>
Though this mem alloc is GFP_NOIO an so will not deadlock, it seems
better to do the allocation before 'raise_barrier' which stops any IO
requests while the resync proceeds.
raid10 always uses this order, so it is at least consistent to do the
same in raid1.
Signed-off-by: NeilBrown <neilb@suse.de>
bio_alloc can never fail (as it uses a mempool) but an block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.
So these to places which both handle failure and hold multiple bios
should not use bio_alloc, they should use bio_kmalloc.
Signed-off-by: NeilBrown <neilb@suse.de>
It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.
So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.
This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possible handle the write completion at that point.
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Workqueue usage in md has two problems.
* Flush can be used during or depended upon by memory reclaim, but md
uses the system workqueue for flush_work which may lead to deadlock.
* md depends on flush_scheduled_work() to achieve exclusion against
completion of removal of previous instances. flush_scheduled_work()
may incur unexpected amount of delay and is scheduled to be removed.
This patch adds two workqueues to md - md_wq and md_misc_wq. The
former is guaranteed to make forward progress under memory pressure
and serves flush_work. The latter serves as the flush domain for
other works.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.
Signed-off-by: NeilBrown <neilb@suse.de>
lock_kernel calls were recently pushed down into open/release
functions.
md doesn't need that protection.
Then the BKL calls were change to md_mutex. We don't need those
either.
So remove it all.
Signed-off-by: NeilBrown <neilb@suse.de>
A RAID1 which has no persistent metadata, whether internal or
external, will hang on the first write.
This is caused by commit 070dc6dd71
In that case, MD_CHANGE_PENDING never gets cleared.
So during md_update_sb, is neither persistent or external,
clear MD_CHANGE_PENDING.
This is suitable for 2.6.36-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org