[ Based on an original patch by Yuri Tikhonov ]
The raid_run_ops routine uses the asynchronous offload api and
the stripe_operations member of a stripe_head to carry out xor+pq+copy
operations asynchronously, outside the lock.
The operations performed by RAID-6 are the same as in the RAID-5 case
except for no support of STRIPE_OP_PREXOR operations. All the others
are supported:
STRIPE_OP_BIOFILL
- copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
- generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
- copy data out of request buffers to satisfy a write request
STRIPE_OP_RECONSTRUCT
- recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
- verify that the parity is correct
The flow is the same as in the RAID-5 case, and reuses some routines, namely:
1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
2/ ops_complete_compute (updated to set up to 2 targets uptodate)
3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)
[neilb@suse.de: fixes to get it to pass mdadm regression suite]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ops_complete_compute5 can be reused in the raid6 path if it is updated to
generically handle a second target.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use percpu memory rather than stack for storing the buffer lists used in
parity calculations. Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.
[ Impact: move memory pressure from stack to heap ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.
Make this allocation cpu hotplug aware to maximize allocation
efficiency.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Prepare the api for the arrival of a new parameter, 'scribble'. This
will allow callers to identify scratchpad memory for dma address or page
address conversions. As this adds yet another parameter, take this
opportunity to convert the common submission parameters (flags,
dependency, callback, and callback argument) into an object that is
passed by reference.
Also, take this opportunity to fix up the kerneldoc and add notes about
the relevant ASYNC_TX_* flags for each routine.
[ Impact: moves api pass-by-value parameters to a pass-by-reference struct ]
Signed-off-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In support of inter-channel chaining async_tx utilizes an ack flag to
gate whether a dependent operation can be chained to another. While the
flag is not set the chain can be considered open for appending. Setting
the ack flag closes the chain and flags the descriptor for garbage
collection. The ASYNC_TX_DEP_ACK flag essentially means "close the
chain after adding this dependency". Since each operation can only have
one child the api now implicitly sets the ack flag at dependency
submission time. This removes an unnecessary management burden from
clients of the api.
[ Impact: clean up and enforce one dependency per operation ]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
'zero_sum' does not properly describe the operation of generating parity
and checking that it validates against an existing buffer. Change the
name of the operation to 'val' (for 'validate'). This is in
anticipation of the p+q case where it is a requirement to identify the
target parity buffers separately from the source buffers, because the
target parity buffers will not have corresponding pq coefficients.
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We currently update the metadata :
1/ every 3Megabytes
2/ When the place we will write new-layout data to is recorded in
the metadata as still containing old-layout data.
Rule one exists to avoid having to re-do too much reshaping in the
face of a crash/restart. So it should really be time based rather
than size based. So change it to "every 10 seconds".
Rule two turns out to be too harsh when restriping an array
'in-place', as in that case the metadata much be updates for every
stripe.
For the in-place update, it can only possibly be safe from a crash if
some user-space program data a backup of every e.g. few hundred
stripes before allowing them to be reshaped. In that case, the
constant metadata update is pointless.
So only update the metadata if the new metadata will report that the
end of the 'old-layout' data is beyond where we are currently
writing 'new-layout' data.
Signed-off-by: NeilBrown <neilb@suse.de>
... and to be certain the that make_request doesn't wait forever,
add a 'wake_up' when ->reshape_progress has been set to MaxSector
Signed-off-by: NeilBrown <neilb@suse.de>
This was only needed when the code was experimental. Most of it
is well tested now, so the option is no longer useful.
Signed-off-by: NeilBrown <neilb@suse.de>
When we are reshaping an array, it is very important that we read
the data from a particular sector offset before writing new data
at that offset.
In most cases when growing or shrinking an array we read long before
we even consider writing. But when restriping an array without
changing it size, there is a small possibility that we might have
some data to available write before the read has happened at the same
location. This would require some stripes to be in cache already.
To guard against this small possibility, we check, before writing,
that the 'old' stripe at the same location is not in the process of
being read. And we ensure that we mark all 'source' stripes as such
before allowing new 'destination' stripes to proceed.
Signed-off-by: NeilBrown <neilb@suse.de>
If an array has 3 or more devices, we allow the chunksize or layout
to be changed and when a reshape starts, we use these as the 'new'
values.
Signed-off-by: NeilBrown <neilb@suse.de>
This ensures that even when old and new stripes are overlapping,
we will try to read all of the old before having to write any
of the new.
Signed-off-by: NeilBrown <neilb@suse.de>
Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
remember what the chunk size was before the reshape that is currently
underway.
This seems like duplication with "chunk_size" and "new_chunk" in
mddev_t, and to some extent it is, but there are differences.
The values in mddev_t are always defined and often the same.
The prev* values are only defined if a reshape is underway.
Also (and more significantly) the raid5_conf_t values will be changed
at the same time (inside an appropriate lock) that the reshape is
started by setting reshape_position. In contrast, the new_chunk value
is set when the sysfs file is written which could be well before the
reshape starts.
Signed-off-by: NeilBrown <neilb@suse.de>
During a raid5 reshape, we have some stripes in the cache that are
'before' the reshape (and are still to be processed) and some that are
'after'. They are currently differentiated by having different
->disks values as the only reshape current supported involves changing
the number of disks.
However we will soon support reshapes that do not change the number
of disks (chunk parity or chunk size). So make the difference more
explicit with a 'generation' number.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping a raid5 to have fewer devices, we work from the end of
the array to the beginning.
md_do_sync gives addresses to sync_request that go from the beginning
to the end. So largely ignore them use the internal state variable
"reshape_progress" to keep track of what to do next.
Never allow the size to be reduced below the minimum (4 for raid6,
3 otherwise).
We require that the size of the array has already been reduced before
the array is reshaped to a smaller size. This is because simply
reducing the size is an easily reversible operation, while the reshape
is immediately destructive and so is not reversible for the blocks at
the ends of the devices.
Thus to reshape an array to have fewer devices, you must first write
an appropriately small size to md/array_size.
When reshape finished, we remove any drives that are no longer
needed and fix up ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
When reducing the number of devices in a raid4/5/6, the reshape
process has to start at the end of the array and work down to the
beginning. So we need to handle expand_progress and expand_lo
differently.
This patch renames "expand_progress" and "expand_lo" to avoid the
implication that anything is getting bigger (expand->reshape) and
every place they are used, we make sure that they are used the right
way depending on whether delta_disks is positive or negative.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid5 (the only module that supports restriping)
notices that the reshape has finished be sync_request being
given a large value, and handles any cleanup them.
This patch changes it so md_check_recovery calls into an
explicit finish_reshape method as well.
The clean-up from sync_request can do things that need to be
done promptly, typically things local to the raid5_conf_t
structure.
The "finish_reshape" method is called under the mddev_lock
so it can do things involving reconfiguring the device.
This allows us to get rid of md_set_array_sectors_locked, which
would have caused a deadlock if you tried to stop and array
while a reshape was happening.
Signed-off-by: NeilBrown <neilb@suse.de>
This is the first of four patches which combine to allow md/raid5 to
reduce the number of devices in the array by restriping the data over
a subset of the devices.
If the number of disks in a raid4/5/6 is being reduced, then the
default size must be based on the new number, not the old number
of devices.
In general, it should be based on the smaller of new and old.
Signed-off-by: NeilBrown <neilb@suse.de>
Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules. This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.
To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
compile
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>