Commit Graph

69 Commits

Author SHA1 Message Date
NeilBrown 19fdb9eefb Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus
Conflicts:
	drivers/md/md.c

- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-22 08:31:36 +10:00
NeilBrown 2dc40f8094 md/linear: standardise all printk messages
md/linear:mdname:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown 21a52c6d05 md: pass mddev to make_request functions rather than request_queue
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown 490773268c md: move io accounting out of personalities into md_make_request
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:52 +10:00
NeilBrown ef2f80ff73 md/linear: avoid possible oops and array stop
Since commit ef286f6fa6
it has been important that each personality clears
->private in the ->stop() function, or sets it to a
attribute group to be removed.
linear.c doesn't.  This can sometimes lead to an oops,
though it doesn't always.

Suitable for 2.6.33-stable and 2.6.34.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-05-17 14:38:18 +10:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
NeilBrown 627a2d3c29 md: deal with merge_bvec_fn in component devices better.
If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to.  Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-03-16 17:04:24 +11:00
Martin K. Petersen 086fa5ff08 block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors
The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compability.  This can
be removed after the merge window is closed.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-26 13:58:08 +01:00
NeilBrown 0efb9e6191 md: add MODULE_DESCRIPTION for all md related modules.
Suggested by  Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>
2009-12-14 12:51:41 +11:00
NeilBrown a2826aa92e md: support barrier requests on all personalities.
Previously barriers were only supported on RAID1.  This is because
other levels requires synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device.  When that completes - and if the original request was not
empty -  we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail.  If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail.  That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet.  So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted.  Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
2009-12-14 12:49:49 +11:00
NeilBrown 3fa841d7e7 md: report device as congested when suspended
This should writeback from coming when the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-09-23 18:10:29 +10:00
Jens Axboe 1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
NeilBrown 449aad3e25 md: Use revalidate_disk to effect changes in size of device.
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
Andre Noll ac5e7113e7 md: Push down data integrity code to personalities.
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:47 +10:00
Martin K. Petersen 8f6c2e4b32 md: Use new topology calls to indicate alignment and I/O sizes
Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:13:45 +10:00
NeilBrown 495d357301 md/linear: use call_rcu to free obsolete 'conf' structures.
Current, when we update the 'conf' structure, when adding a
drive to a linear array, we keep the old version around until
the array is finally stopped, as it is not safe to free it
immediately.

Now that we have rcu protection on all accesses to 'conf',
we can use call_rcu to free it more promptly.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:42 +10:00
SandeepKsinha af11c397fd md linear: Protecting mddev with rcu locks to avoid races
Due to the lack of memory ordering guarantees, we may have races around
mddev->conf.

In particular, the correct contents of the structure we get from
dereferencing ->private might not be visible to this CPU yet, and
they might not be correct w.r.t mddev->raid_disks.

This patch addresses the problem using rcu protection to avoid
such race conditions.

Signed-off-by: SandeepKsinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:35 +10:00
Andre Noll 0894cc3066 md: Move check for bitmap presence to personality code.
If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.

Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.

This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.

A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:49:23 +10:00
NeilBrown 13f2682b72 md: raid0/linear: ensure device sizes are rounded to chunk size.
This is currently ensured by common code, but it is more reliable to
ensure it where it is needed in personality code.
All the other personalities that care already round the size to
the chunk_size.  raid0 and linear are the only hold-outs.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:48:55 +10:00
Andre Noll 9d8f036362 md: Make mddev->chunk_size sector-based.
This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics.  Since

	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
				  = is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 08:45:01 +10:00
Sandeep K Sinha aece3d1f40 md: Binary search in linear raid
Replace the linear search with binary search in which_dev.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:57:08 +10:00
Sandeep K Sinha 4db7cdc859 md: Removing num_sector and replacing start_sector with end_sector
Remove num_sectors from dev_info and replace start_sector with
end_sector.  This makes a lot of comparisons much simpler.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:56:13 +10:00
Sandeep K Sinha 45d4582f21 md: Removal of hash table in linear raid
Get rid of sector_div and hash table for linear raid and replace
with a linear search in which_dev.
The hash table adds a lot of complexity for little if any gain.
Ultimately a binary search will be used which will have smaller
cache foot print, a similar number of memory access, and no
divisions.

Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:55:26 +10:00
NeilBrown 070ec55d07 md: remove mddev_to_conf "helper" macro
Having a macro just to cast a void* isn't really helpful.
I would must rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-16 16:54:21 +10:00
Martin K. Petersen ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00