Most often we have the page number, not the page. And that is what
the *_page_attr() functions really want. So change the arguments to
take that number.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of allocating pages in read_sb_page, read_page and
bitmap_read_sb, allocate them all in bitmap_init_from disk.
Also replace the hack of calling "attach_page_buffers(page, NULL)" to
ensure that free_buffer() won't complain, by putting a test for
PagePrivate in free_buffer().
Signed-off-by: NeilBrown <neilb@suse.de>
An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk
The second requires the first, but the first doesn't require the
second.
Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.
So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.
A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.
We don't actually activate these bitmaps yet - that will come
in a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.
So create an attribute to store this information and set appropriate
defaults.
It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.
Signed-off-by: NeilBrown <neilb@suse.de>
There are two different 'pending' concepts in the handling of the
write intent bitmap.
Firstly, a 'page' from the bitmap (which container PAGE_SIZE*8 bits)
may have changes (bits cleared) that should be written in due course.
There is no hurry for these and the page will transition from
PENDING to NEEDWRITE and will then be written, though if it ever
becomes DIRTY it will be written much sooner and PENDING will be
cleared.
Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
for each bit, can usefully have a 'pending' flag which indicates if
any of the counters are low (2 or 1) and ready to be processed by
bitmap_daemon_work(). If this flag is clear we can skip the whole
page.
These two concepts are currently combined in the bitmap-file flag.
This causes a tighter connection between the counters and the bitmap
file than I would like - as I want to add some flexibility to the
bitmap file.
So introduce a new flag with the page-of-counters, and rewrite
bitmap_daemon_work() so that it handles the two different 'pending'
concepts separately.
This also allows us to clear BITMAP_PAGE_PENDING when we write out
a dirty page, which may occasionally reduce the number of times we
write a page.
Signed-off-by: NeilBrown <neilb@suse.de>
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do
policy,
for example ioscheduler. This patch adds it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If the allocation of rep1_bio fails, we currently don't free the 'bio'
of the same dev.
Reported by kmemleak.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When attempting to fix a read error, it is acceptable to read from a
device that is recovering, provided the recovery has got past the
place we are reading from. This makes the test for "can we read from
here" the same as the test in read_balance.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This ensures that it is always freed - there were case where
we failed to free the page.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.
So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c
Signed-off-by: NeilBrown <neilb@suse.de>
Reorders functions in raid6_algos as well as the preference check
to reduce the number of functions tested on initialization.
Also, creates symmetry between choosing the gen_syndrome functions
and choosing the recovery functions.
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add SSSE3 optimized recovery functions, as well as a system
for selecting the most appropriate recovery functions to use.
Originally-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
<linux/module.h> drags in headers which are not visible to userspace,
thus breaking the build for the test program.
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
With CONFIG_PREEMPT=y, we need to disable preemption while benchmarking
RAID5 xor checksumming to ensure we're actually measuring what we think
we're measuring.
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In the existing do_xor_speed(), there is no guarantee that we actually
run do_2() for a full jiffy. We get the current jiffy, then run do_2()
until the next jiffy.
Instead, let's get the current jiffy, then wait until the next jiffy
to start our test.
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A 'near' or 'offset' lay RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.
Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'. Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.
This means that RAID10 reshape is not supportable on v0.90 metadata.
The different between the old data_offset and the new_offset must be
at least the larger of the chunksize multiplied by offset copies of
each of the old and new layout. (for 'near' mode, offset_copies == 1).
A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer graded devices at least).
Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.
[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape. So split it out into separate
function.
Signed-off-by: NeilBrown <neilb@suse.de>
When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it. At other times, use both as
appropriate.
For now, both are identical (And reshape_position is never set).
When we use the 'prev' geometry, we must use the old data_offset.
When we use the current (And a reshape is happening) we must use
the new_data_offset.
Signed-off-by: NeilBrown <neilb@suse.de>
Some resync type operations need to act on the address space of the
device, others on the address space of the array.
This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only. However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.
Signed-off-by: NeilBrown <neilb@suse.de>
Some code in raid1 and raid10 use sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.
So add that understanding.
Signed-off-by: NeilBrown <neilb@suse.de>
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there. Then we will more easily be able to add
a second set of fields.
Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.
Signed-off-by: NeilBrown <neilb@suse.de>
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.
To this end we find the minimum offset difference across all devices
and add that where appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>