When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.
Apply this pattern 3 times, and wrap some other long lines.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the rdev on which a read error happened could be removed
before we perform the fix_error handling. This requires extra tests
for NULL.
So delay the rdev_dec_pending call until after the call to
fix_read_error so that we can be sure that the rdev still exists.
This allows an 'if' clause to be removed so the body gets re-indented
back one level.
Signed-off-by: NeilBrown <neilb@suse.de>
raid10 read balance has two different loop for looking through
possible devices to chose the best.
Collapse those into one loop and generally make the code more
readable.
Signed-off-by: NeilBrown <neilb@suse.de>
We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.
Also remove some plug-related code that is no longer needed.
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid submits a lot of IO from the various raid threads.
So adding start/finish plug calls to those so that some
plugging happens.
Signed-off-by: NeilBrown <neilb@suse.de>
MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.
Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.
Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snizer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_throtl_exit assumes that ->queue_lock still exists,
so make sure that it does.
To do this, we stop redirecting ->queue_lock to conf->device_lock
and leave it pointing where it is initialised - __queue_lock.
As the blk_plug functions check the ->queue_lock is held, we now
take that spin_lock explicitly around the plug functions. We don't
need the locking, just the warning removal.
This is needed for any kernel with the blk_throtl code, which is
which is 2.6.37 and later.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Following symptoms were observed:
1. After raid0->raid10 takeover operation we have array with 2
missing disks.
When we add disk for rebuild, recovery process starts as expected
but it does not finish- it stops at about 90%, md126_resync process
hangs in "D" state.
2. Similar behavior is when we have mounted raid0 array and we
execute takeover to raid10. After this when we try to unmount array-
it causes process umount hangs in "D"
In scenarios above processes hang at the same function- wait_barrier
in raid10.c.
Process waits in macro "wait_event_lock_irq" until the
"!conf->barrier" condition will be true.
In scenarios above it never happens.
Reason was that at the end of level_store, after calling pers->run,
we call mddev_resume. This calls pers->quiesce(mddev, 0) with
RAID10, that calls lower_barrier.
However raise_barrier hadn't been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.
This patch introduces setting conf->barrier=1 after takeover
operation. It prevents to become barrier negative after call
lower_barrier().
Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add new parameter to 'sync_page_io'.
The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.
Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.
So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.
This is applicable to 2.6.35 and later.
Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.
So allocate a local bio pool for each md array and use that rather
than the common pool.
This pool is used both for regular IO and metadata updates.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.
Signed-off-by: NeilBrown <neilb@suse.de>
bio_alloc can never fail (as it uses a mempool) but an block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.
So these to places which both handle failure and hold multiple bios
should not use bio_alloc, they should use bio_kmalloc.
Signed-off-by: NeilBrown <neilb@suse.de>
It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.
So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.
This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possible handle the write completion at that point.
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.
* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.
* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.
* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.
For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.
* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.
* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe resconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.
* raid10: FUA bit needs to be propagated to write clones.
linear, raid0, 1, 5 and 10 tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
commit 7b6d91daee changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.
Change them (back) to unsigned long.
(historical note: see 1ef04fefe2)
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.
So count the number of spares activated, subtract it from 'degraded'
just once, and return it.
Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.
(raid5/raid10 added by neilb)
Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>