Commit Graph

1664 Commits

Author SHA1 Message Date
Akinobu Mita
55af6bb509 md: convert cpu notifier to return encapsulate errno value
By the previous modification, the cpu notifier can return encapsulate
errno value.  This converts the cpu notifiers for raid5.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:48 -07:00
Linus Torvalds
e8bebe2f71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 19:37:45 -07:00
NeilBrown
19fdb9eefb Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus
Conflicts:
	drivers/md/md.c

- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-22 08:31:36 +10:00
Christoph Hellwig
8018ab0574 sanitize vfs_fsync calling conventions
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.

The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:21 -04:00
Eric W. Biederman
3ff195b011 sysfs: Implement sysfs tagged directory support.
The problem.  When implementing a network namespace I need to be able
to have multiple network devices with the same name.  Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.

What this patch does is to add an additional tag field to the
sysfs dirent structure.  For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible.  Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.

I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.

For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded.  Which means I need
a simple race free way to setup those directories as tagged.

To achieve a reace free design all tagged directories are created
and managed by sysfs itself.

Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid

- Implement mount_ns() which returns the ns of the calling process
  so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.

Everything else is left up to sysfs and the driver layer.

For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.

Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.

The work needed in sysfs is more extensive.  At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent.  Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.

Currently only directories which hold kobjects, and
symlinks are supported.  There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 09:37:31 -07:00
NeilBrown
be6800a73a md: don't insist on valid event count for spare devices.
Devices which know that they are spares do not really need to have
an event count that matches the rest of the array, so there are no
data-in-sync issues. It is enough that the uuid matches.
So remove the requirement that the event count is up-to-date.

We currently still write out and event count on spares, but this
allows us in a year or 3 to stop doing that completely.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:01 +10:00
NeilBrown
a8707c08f4 md: simplify updating of event count to sometimes avoid updating spares.
When updating the event count for a simple clean <-> dirty transition,
we try to avoid updating the spares so they can safely spin-down.
As the event_counts across an array must be +/- 1, this means
decrementing the event_count on a dirty->clean transition.
This is not always safe and we have to avoid the unsafe time.
We current do this with a misguided idea about it being safe or
not depending on whether the event_count is odd or even.  This
approach only works reliably in a few common instances, but easily
falls down.

So instead, simply keep internal state concerning whether it is safe
or not, and always assume it is not safe when an array is first
assembled.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:01 +10:00
Gabriele A. Trombetti
7b0bb5368a md/raid6: Fix raid-6 read-error correction in degraded state
Fix: Raid-6 was not trying to correct a read-error when in
singly-degraded state and was instead dropping one more device, going to
doubly-degraded state. This patch fixes this behaviour.

Tested-by: Janos Haar <janos.haar@netcenter.hu>
Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com>
Reported-by: Janos Haar <janos.haar@netcenter.hu>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-05-18 15:28:00 +10:00
NeilBrown
75a73a29e5 md: restore ability of spare drives to spin down.
Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and say spun down.  Device failure/removal
etc are still recorded on spares.

However commit 51d5668cb2 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:

   This means that the alignment between 'odd/even' and
    'clean/dirty' might take a little longer to attain,

how ever the code makes no attempt to create that alignment, so it
could take arbitrarily long.

So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately.  There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update).  We just piggy-back on that.

Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-05-18 15:28:00 +10:00
NeilBrown
af3a2cd6b8 md: Fix read balancing in RAID1 and RAID10 on drives > 2TB
read_balance uses a "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync.  This has a
very small chance of returning wrong data.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:00 +10:00
NeilBrown
2dc40f8094 md/linear: standardise all printk messages
md/linear:mdname:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
b5a20961f3 md/raid0: tidy up printk messages.
All messages now start
   md/raid0:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
128595ed6f md/raid10: tidy up printk messages.
All raid10 printk messages now start
   md/raid10:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
9dd1e2faf7 md/raid1: improve printk messages
Make sure the array name is included in a uniform way in all printk
messages.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
0c55e02259 md/raid5: improve consistency of error messages.
Many 'printk' messages from the raid456 module mention 'raid5' even
though it may be a 'raid6' or even 'raid4' array.  This can cause
confusion.
Also the actual array name is not always reported and when it is
it is not reported consistently.

So change all the messages to start:
    md/raid:%s:
where '%s' becomes e.g. md3 to identify the particular array.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:58 +10:00
NeilBrown
08fb730ca3 md: remove EXPERIMENTAL designation from RAID10
RAID10 has been available for quite a while now and is quite well
tested, so we can remove the EXPERIMENTAL designation.

Reported-by: Eric MSP Veith <eveith@wwweb-library.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:58 +10:00
Dan Williams
f2859af671 md: allow integers to be passed to md/level
e.g. allow md to interpret 'echo 4 > md/level' as a request for raid4.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:58 +10:00
Dan Williams
bb7f8d2217 md: notify mdstat waiters of level change
Level modifications change the output of mdstat.  The mdmon manager
thread is interested in these events for external metadata management.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:57 +10:00
Dan Williams
f1b29bcae1 md/raid4: permit raid0 takeover
For consistency allow raid4 to takeover raid0 in addition to raid5 (with a
raid4 layout).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:57 +10:00
NeilBrown
e555190d82 md/raid1: delay reads that could overtake behind-writes.
When a raid1 array is configured to support write-behind
on some devices, it normally only reads from other devices.
If all devices are write-behind (because the rest have failed)
it is possible for a read request to be serviced before a
behind-write request, which would appear as data corruption.

So when forced to read from a WriteMostly device, wait for any
write-behind to complete, and don't start any more behind-writes.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:57 +10:00
NeilBrown
d754c5ae1f md/raid1: fix confusing 'redirect sector' message.
This message seems to suggest the named device is the one on which a
read failed, however it is actually the device that the read will be
redirected to.
So make the message a little clearer.

Reported-by: Tim Burgess <ozburgess@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:56 +10:00
NeilBrown
9e35b99c7e md: don't unregister the thread in mddev_suspend
This is
 - unnecessary because mddev_suspend is always followed by a call to
   ->stop, and each ->stop unregisters the thread, and
 - a problem as it makes it awkwards to suspend and then resume a
   device as we will want later.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:56 +10:00
NeilBrown
fafd7fb052 md: factor out init code for an mddev
This is a simple factorisation that makes mddev_find easier to read.


Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown
21a52c6d05 md: pass mddev to make_request functions rather than request_queue
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown
cca9cf90c5 md: call md_stop_writes from md_stop
This moves the call to the other side of set_readonly, but that should
not be an issue.
This encapsulates in 'md_stop' all of the functionality for internally
stopping the array, leaving all the interactions with externalities
(sysfs, request_queue, gendisk) in do_md_stop.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:54 +10:00