Add check to determine if a device needs full resync or if partial resync will do
RAID 5 was assuming that if a device was not In_sync, it must undergo a full
resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'.
If it is, we can safely skip the full resync and rely on the bitmap for
partial recovery instead. This is the legitimate purpose of 'saved_raid_disk',
from md.h:
int saved_raid_disk; /* role that device used to have in the
* array and could again if we did a partial
* resync from the bitmap
*/
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add bitmap support to the device-mapper specific metadata area.
This patch allows the creation of the bitmap metadata area upon
initial array creation via device-mapper.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)
Added the necessary congestion function and conditionalize calls requiring an
array 'queue' or 'gendisk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better alligned with device-mapper primatives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambigous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reserve just the minimum of pages needed to process one job.
Because we allocate pages from page allocator, we don't need to reserve
a large number of pages. The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.
The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4. This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.
Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures. One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch changes dm-kcopyd so that it allocates pages from the main
page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
fail in case of memory pressure). If the allocation fails, dm-kcopyd
allocates pages from its own reserve.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a parameter for gfp flags to alloc_pl() for use in following
patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the spinlock protecting the pages allocation. The spinlock is only
taken on initialization or from single-threaded workqueue. Therefore, the
spinlock is useless.
The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.
kcopyd_get_pages is only called from run_pages_job, which is only
called from process_jobs called from do_work.
kcopyd_put_pages is called from client_alloc_pages (which is initialization
function) or from run_complete_job. run_complete_job is only called from
process_jobs called from do_work.
Another spinlock, kc->job_lock is taken each time someone pushes or pops
some work for the worker thread. Once we take kc->job_lock, we
guarantee that any written memory is visible to the other CPUs.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There's a possible theoretical deadlock in dm-kcopyd because multiple
allocations from the same mempool are required to finish a request.
Avoid this by preallocating sub jobs.
There is a mempool of 512 entries. Each request requires up to 9
entries from the mempool. If we have at least 57 concurrent requests
running, the mempool may overflow and mempool allocations may start
blocking until another entry is freed to the mempool. Because the same
thread is used to free entries to the mempool and allocate entries from
the mempool, this may result in a deadlock.
This patch changes it so that one mempool entry contains all 9 "struct
kcopyd_job" required to fulfill the whole request. The allocation is
done only once in dm_kcopyd_copy and no further mempool allocations are
done during request processing.
If dm_kcopyd_copy is not run in the completion thread, this
implementation is deadlock-free.
MIN_JOBS needs reducing accordingly and we've chosen to reduce it
further to 8.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Don't split SUB_JOB_SIZE jobs
If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
Splitting it just unnecessarily wastes time, because the split job size
is SUB_JOB_SIZE too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Integrity errors need to be passed to the owner of the integrity
metadata for processing. Consequently EILSEQ should be passed up the
stack.
Cc: stable@kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds a check that a block device has a request function
defined before it is used. Otherwise, misconfiguration can cause an oops.
Because we are allowing devices with zero size e.g. an offline multipath
device as in commit 2cd54d9bed
("dm: allow offline devices") there needs to be an additional check
to ensure devices are initialised. Some block devices, like a loop
device without a backing file, exist but have no request function.
Reproducer is trivial: dm-mirror on unbound loop device
(no backing file on loop devices)
dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0"
and mirror resync will immediatelly cause OOps.
BUG: unable to handle kernel NULL pointer dereference at (null)
? generic_make_request+0x2bd/0x590
? kmem_cache_alloc+0xad/0x190
submit_bio+0x53/0xe0
? bio_add_page+0x3b/0x50
dispatch_io+0x1ca/0x210 [dm_mod]
? read_callback+0x0/0xd0 [dm_mirror]
dm_io+0xbb/0x290 [dm_mod]
do_mirror+0x1e0/0x748 [dm_mirror]
Signed-off-by: Milan Broz <mbroz@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Permit a target to support discards regardless of whether or not all its
underlying devices do.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
b43: fix comment typo reqest -> request
Haavard Skinnemoen has left Atmel
cris: typo in mach-fs Makefile
Kconfig: fix copy/paste-ism for dell-wmi-aio driver
doc: timers-howto: fix a typo ("unsgined")
perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
treewide: fix a few typos in comments
regulator: change debug statement be consistent with the style of the rest
Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
audit: acquire creds selectively to reduce atomic op overhead
rtlwifi: don't touch with treewide double semicolon removal
treewide: cleanup continuations and remove logging message whitespace
ath9k_hw: don't touch with treewide double semicolon removal
include/linux/leds-regulator.h: fix syntax in example code
tty: fix typo in descripton of tty_termios_encode_baud_rate
xtensa: remove obsolete BKL kernel option from defconfig
m68k: fix comment typo 'occcured'
arch:Kconfig.locks Remove unused config option.
treewide: remove extra semicolons
...
The sysfs attribute 'resync_start' (known internally as recovery_cp),
records where a resync is up to. A value of 0 means the array is
not known to be in-sync at all. A value of MaxSector means the array
is believed to be fully in-sync.
When the size of member devices of an array (RAID1,RAID4/5/6) is
increased, the array can be increased to match. This process sets
resync_start to the old end-of-device offset so that the new part of
the array gets resynced.
However with RAID1 (and RAID6) a resync is not technically necessary
and may be undesirable. So it would be good if the implied resync
after the array is resized could be avoided.
So: change 'resync_start' so the value can be changed while the array
is active, and as a precaution only allow it to be changed while
resync/recovery is 'frozen'. Changing it once resync has started is
not going to be useful anyway.
This allows the array to be resized without a resync by:
write 'frozen' to 'sync_action'
write new size to 'component_size' (this will set resync_start)
write 'none' to 'resync_start'
write 'idle' to 'sync_action'.
Also slightly improve some tests on recovery_cp when resizing
raid1/raid5. Now that an arbitrary value could be set we should be
more careful in our tests.
Signed-off-by: NeilBrown <neilb@suse.de>
When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.
Apply this pattern 3 times, and wrap some other long lines.
Signed-off-by: NeilBrown <neilb@suse.de>