Conflicts:
drivers/md/md.c
- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.
Signed-off-by: NeilBrown <neilb@suse.de>
Many 'printk' messages from the raid456 module mention 'raid5' even
though it may be a 'raid6' or even 'raid4' array. This can cause
confusion.
Also the actual array name is not always reported and when it is
it is not reported consistently.
So change all the messages to start:
md/raid:%s:
where '%s' becomes e.g. md3 to identify the particular array.
Signed-off-by: NeilBrown <neilb@suse.de>
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.
Signed-off-by: NeilBrown <neilb@suse.de>
We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.
Now that we call revalidate_disk directly, which sets bd_invalidates,
that indirection is no longer needed and can be removed.
Signed-off-by: NeilBrown <neilb@suse.de>
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.
Signed-off-by: NeilBrown <neilb@suse.de>
Diving through ->queue to find mddev is unnecessarily complex - there
is an easier path to finding mddev, so use that.
Signed-off-by: NeilBrown <neilb@suse.de>
void pointers do not need to be cast to other pointer types.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.
This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.
When changing levels, we may need to add or remove attributes. When
changing RAID5 -> RAID6, we both add and remove the same thing. It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Previous patch changes stripe and chunk_number to sector_t but
mistakenly did not update all of the divisions to use sector_dev().
This patch changes all the those divisions (actually the '%' operator)
to sector_div.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
With many large drives and small chunk sizes it is possible
to create a RAID5 with more than 2^31 chunks. Make sure this
works.
Reported-by: Brett King <king.br@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Except for SCSI no device drivers distinguish between physical and
hardware segment limits. Consolidate the two into a single segment
limit.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add __percpu sparse annotations to places which didn't make it in one
of the previous patches. All converions are trivial.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Neil Brown <neilb@suse.de>
======
This fix is related to
http://bugzilla.kernel.org/show_bug.cgi?id=15142
but does not address that exact issue.
======
sysfs does like attributes being removed while they are being accessed
(i.e. read or written) and waits for the access to complete.
As accessing some md attributes takes the same lock that is held while
removing those attributes a deadlock can occur.
This patch addresses 3 issues in md that could lead to this deadlock.
Two relate to calling flush_scheduled_work while the lock is held.
This is probably a bad idea in general and as we use schedule_work to
delete various sysfs objects it is particularly bad.
In one case flush_scheduled_work is called from md_alloc (called by
md_probe) called from do_md_run which holds the lock. This call is
only present to ensure that ->gendisk is set. However we can be sure
that gendisk is always set (though possibly we couldn't when that code
was originally written. This is because do_md_run is called in three
different contexts:
1/ from md_ioctl. This requires that md_open has succeeded, and it
fails if ->gendisk is not set.
2/ from writing a sysfs attribute. This can only happen if the
mddev has been registered in sysfs which happens in md_alloc
after ->gendisk has been set.
3/ from autorun_array which is only called by autorun_devices, which
checks for ->gendisk to be set before calling autorun_array.
So the call to md_probe in do_md_run can be removed, and the check on
->gendisk can also go.
In the other case flush_scheduled_work is being called in do_md_stop,
purportedly to wait for all md_delayed_delete calls (which delete the
component rdevs) to complete. However there really isn't any need to
wait for them - they have already been disconnected in all important
ways.
The third issue is that raid5->stop() removes some attribute names
while the lock is held. There is already some infrastructure in place
to delay attribute removal until after the lock is released (using
schedule_work). So extend that infrastructure to remove the
raid5_attrs_group.
This does not address all lockdep issues related to the sysfs
"s_active" lock. The rest can be address by splitting that lockdep
context between symlinks and non-symlinks which hopefully will happen.
Signed-off-by: NeilBrown <neilb@suse.de>
This code was written long ago when it was not possible to
reshape a degraded array. Now it is so the current level of
degraded-ness needs to be taken in to account. Also newly addded
devices should only reduce degradedness if they are deemed to be
in-sync.
In particular, if you convert a RAID5 to a RAID6, and increase the
number of devices at the same time, then the 5->6 conversion will
make the array degraded so the current code will produce a wrong
value for 'degraded' - "-1" to be precise.
If the reshape runs to completion end_reshape will calculate a correct
new value for 'degraded', but if a device fails during the reshape an
incorrect decision might be made based on the incorrect value of
"degraded".
This patch is suitable for 2.6.32-stable and if they are still open,
2.6.31-stable and 2.6.30-stable as well.
Cc: stable@kernel.org
Reported-by: Michael Evans <mjevans1983@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The post-barrier-flush is sent by md as soon as make_request on the
barrier write completes. For raid5, the data might not be in the
per-device queues yet. So for barrier requests, wait for any
pre-reading to be done so that the request will be in the per-device
queues.
We use the 'preread_active' count to check that nothing is still in
the preread phase, and delay the decrement of this count until after
write requests have been submitted to the underlying devices.
Signed-off-by: NeilBrown <neilb@suse.de>
Previously barriers were only supported on RAID1. This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.
When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.
The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.
The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.
RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.
Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.
That will be done in later patches.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>