If a device fails in a way that causes pending request to take a while
to complete, md will not be able to immediately remove it from the
array in remove_and_add_spares.
It will then incorrectly look like a spare device and md will try to
recover it even though it is failed.
This leads to a recovery process starting and instantly aborting over
and over again.
We should check if the device is faulty before considering it to be a
spare. This will avoid trying to start a recovery that cannot
proceed.
This bug was introduced in 2.6.26 so that patch is suitable for any
kernel since then.
Cc: stable@kernel.org
Reported-by: Jim Paradis <james.paradis@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In the bio_for_each_segment loop, bvl always points current
bio_vec, so the same as bio_iovec_idx(, i). Let's get rid of
it.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit e9c7469bb4 ("md: implment REQ_FLUSH/FUA support")
introduced R5_WantFUA flag and set rw to WRITE_FUA in that case.
However remaining code still checks whether rw is exactly same
as WRITE or not, so FUAed-write ends up with being treated as
READ. Fix it.
This bug has been present since 2.6.37 and the fix is suitable for any
-stable kernel since then. It is not clear why this has not caused
more problems.
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The @bio->bi_phys_segments consists of active stripes count in the
lower 16 bits and processed stripes count in the upper 16 bits. So
logical-OR operator should be bitwise one.
This bug has been present since 2.6.27 and the fix is suitable for any
-stable kernel since then. Fortunately the bad code is only used on
error paths and is relatively unlikely to be hit.
Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Get rid of ->syncchunk and ->counter_bits since they're never used.
Also discard COUNTER_BYTE_RATIO which is unused.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add check to determine if a device needs full resync or if partial resync will do
RAID 5 was assuming that if a device was not In_sync, it must undergo a full
resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'.
If it is, we can safely skip the full resync and rely on the bitmap for
partial recovery instead. This is the legitimate purpose of 'saved_raid_disk',
from md.h:
int saved_raid_disk; /* role that device used to have in the
* array and could again if we did a partial
* resync from the bitmap
*/
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add bitmap support to the device-mapper specific metadata area.
This patch allows the creation of the bitmap metadata area upon
initial array creation via device-mapper.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)
Added the necessary congestion function and conditionalize calls requiring an
array 'queue' or 'gendisk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better alligned with device-mapper primatives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambigous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reserve just the minimum of pages needed to process one job.
Because we allocate pages from page allocator, we don't need to reserve
a large number of pages. The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.
The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4. This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.
Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures. One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch changes dm-kcopyd so that it allocates pages from the main
page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
fail in case of memory pressure). If the allocation fails, dm-kcopyd
allocates pages from its own reserve.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a parameter for gfp flags to alloc_pl() for use in following
patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the spinlock protecting the pages allocation. The spinlock is only
taken on initialization or from single-threaded workqueue. Therefore, the
spinlock is useless.
The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.
kcopyd_get_pages is only called from run_pages_job, which is only
called from process_jobs called from do_work.
kcopyd_put_pages is called from client_alloc_pages (which is initialization
function) or from run_complete_job. run_complete_job is only called from
process_jobs called from do_work.
Another spinlock, kc->job_lock is taken each time someone pushes or pops
some work for the worker thread. Once we take kc->job_lock, we
guarantee that any written memory is visible to the other CPUs.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There's a possible theoretical deadlock in dm-kcopyd because multiple
allocations from the same mempool are required to finish a request.
Avoid this by preallocating sub jobs.
There is a mempool of 512 entries. Each request requires up to 9
entries from the mempool. If we have at least 57 concurrent requests
running, the mempool may overflow and mempool allocations may start
blocking until another entry is freed to the mempool. Because the same
thread is used to free entries to the mempool and allocate entries from
the mempool, this may result in a deadlock.
This patch changes it so that one mempool entry contains all 9 "struct
kcopyd_job" required to fulfill the whole request. The allocation is
done only once in dm_kcopyd_copy and no further mempool allocations are
done during request processing.
If dm_kcopyd_copy is not run in the completion thread, this
implementation is deadlock-free.
MIN_JOBS needs reducing accordingly and we've chosen to reduce it
further to 8.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>