This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using new scsi fault injection framework.
Abstract:
When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.
This case, the raid array has already been broken and it may not matter. But
I think stall is not preferable. If it occurs, even shutdown or reboot will
fail because of resource busy.
The deadlock mechanism:
The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering. The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finish a BIO in
end_sync_write().
If building a BIO fails for some reasons in sync_request(), the "remaining"
should be decremented if it has already been incremented. I found a case
where this decrement is forgotten. This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.
For example, this problem would be reproduced in the following case:
Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec
This case, sdf1 is recovering, sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec
2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]
The resync thread keeps working, but actually it is deadlocked.
Patch:
By this patch, the remaining counter will be decremented if needed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.
If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.
This patch fixes the problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.
Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.
Some advantages:
The fastest part which is the outer sectors of the disks involved will be
used. The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.
Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.
Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.
If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.
Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes. We
currently do that when we change an attribute, but not when we read an
attribute. We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.
So add appropriate locking (mddev_lock) to rdev_attr_show.
rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions. We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain. So take a copy of rdev->mddev for use at the end of the
function.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only. So whenever it mark it not read-only, it is important to
wake up thread resync thread. There is one place we didn't do this.
The problem manifests if the start_ro module parameters is set, and a raid5
array that is in the middle of a reshape (restripe) is started. The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write). So the reshape will not proceed.
On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a raid1 array is stopped, all components currently get added to the list
for auto-detection. However we should really only add components that were
found by autodetection in the first place. So add a flag to record that
information, and use it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On an md array with a write-intent bitmap, a thread wakes up every few seconds
and scans the bitmap looking for work to do. If the array is idle, there will
be no work to do, but a lot of scanning is done to discover this.
So cache the fact that the bitmap is completely clean, and avoid scanning the
whole bitmap when the cache is known to be clean.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.
This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).
However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.
So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds struct device argument to sba_alloc_range and ccio_alloc_range, a
preparation for modifications to fix the IOMMU segment boundary problem. This
change enables ccio_alloc_range to access to LLD's segment boundary limits.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iommu_is_span_boundary is used internally in the IOMMU helper
(lib/iommu-helper.c), a primitive function that judges whether a memory area
spans LLD's segment boundary or not.
It's difficult to convert some IOMMUs to use the IOMMU helper but
iommu_is_span_boundary is still useful for them. So this patch exports it.
This is needed for the parisc iommu fixes.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After a quick glance at the code, we're getting the DEBUG_SHIRQ spurious
interrupt before we have the adapter template filled in. Real interrupts
appear to be turned on by fcpci*_init(), so move request_irq until just before
that.
Signed-off-by: Kyle McMartin <kmcmartin@redhat.com>
Cc: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the page is not up to date, ecryptfs_prepare_write() should be
acting much like ecryptfs_readpage(). This includes the painfully
obvious step of actually decrypting the page contents read from the
lower encrypted file.
Note that this patch resolves a bug in eCryptfs in 2.6.24 that one can
produce with these steps:
# mount -t ecryptfs /secret /secret
# echo "abc" > /secret/file.txt
# umount /secret
# mount -t ecryptfs /secret /secret
# echo "def" >> /secret/file.txt
# cat /secret/file.txt
Without this patch, the resulting data returned from cat is likely to
be something other than "abc\ndef\n".
(Thanks to Benedikt Driessen for reporting this.)
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Benedikt Driessen <bdriessen@escrypt.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last commit for unistd was not correct, it only had a partial update of
syscall numbers for __NR_timerfd_settime and __NR_timerfd_gettime. Also,
NR_syscalls was not incremented for the new syscalls.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Mikael Starvik <mikael.starvik@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function __copy_user_zeroing in arch/lib/usercopy.c had the wrong parameter
set as __user, and in include/asm-cris/uaccess.h, it was not set at all for
some of the calling functions.
This will cut the number of warnings quite dramatically when using sparse.
While we're here, remove useless CVS log and correct confusing typo.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Mikael Starvik <mikael.starvik@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e6bafba5b4 ("wmi: (!x & y)
strikes again"), a bug was fixed that involved converting !x & y to !(x
& y). The code below shows the same pattern, and thus should perhaps be
fixed in the same way.
This is not tested and clearly changes the semantics, so it is only
something to consider.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@ expression E1,E2; @@
(
!E1 & !E2
|
- !E1 & E2
+ !(E1 & E2)
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e6bafba5b4 ("wmi: (!x & y)
strikes again"), a bug was fixed that involved converting !x & y to !(x
& y). The code below shows the same pattern, and thus should perhaps be
fixed in the same way.
This is not tested and clearly changes the semantics, so it is only
something to consider.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@ expression E1,E2; @@
(
!E1 & !E2
|
- !E1 & E2
+ !(E1 & E2)
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e6bafba5b4 ("wmi: (!x & y)
strikes again"), a bug was fixed that involved converting !x & y to !(x
& y). The code below shows the same pattern, and thus should perhaps be
fixed in the same way.
This is not tested and clearly changes the semantics, so it is only
something to consider.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@ expression E1,E2; @@
(
!E1 & !E2
|
- !E1 & E2
+ !(E1 & E2)
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e6bafba5b4 ("wmi: (!x & y)
strikes again"), a bug was fixed that involved converting !x & y to !(x
& y). The code below shows the same pattern, and thus should perhaps be
fixed in the same way.
This is not tested and clearly changes the semantics, so it is only
something to consider.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@ expression E1,E2; @@
(
!E1 & !E2
|
- !E1 & E2
+ !(E1 & E2)
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
I couldn't see what racing tasks on other cpus were doing, but surmise that
another must have been in mem_cgroup_charge_common on the same page, between
its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
which I'd almost changed to a kmalloc).
Normally such a race cannot happen, the ref_cnt prevents it, the final
uncharge cannot race with the initial charge. But force_empty buggers the
ref_cnt, that's what it's all about; and thereafter forced pages are
vulnerable to races such as this (just think of a shared page also mapped into
an mm of another mem_cgroup than that just emptied). And remain vulnerable
until they're freed indefinitely later.
This patch just fixes the oops by moving the unlock_page_cgroups down below
adding to and removing from the list (only possible given the previous patch);
and while we're at it, we might as well make it an invariant that
page->page_cgroup is always set while pc is on lru.
But this behaviour of force_empty seems highly unsatisfactory to me: why have
a ref_cnt if we always have to cope with it being violated (as in the earlier
page migration patch). We may prefer force_empty to move pages to an orphan
mem_cgroup (could be the root, but better not), from which other cgroups could
recover them; we might need to reverse the locking again; but no time now for
such concerns.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As for force_empty, though this may not be the main topic here,
mem_cgroup_force_empty_list() can be implemented simpler. It is possible to
make the function just call mem_cgroup_uncharge_page() instead of releasing
page_cgroups by itself. The tip is to call get_page() before invoking
mem_cgroup_uncharge_page(), so the page won't be released during this
function.
Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
the page might have been reassigned to an lru of a different mem_cgroup, and
now be emptied from that; but Hugh claims that's okay, the end state is the
same as when it hasn't gone to another list.
And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
holding page_cgroup lock (but still has to use try_lock_page_cgroup).
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
into page freeing, I've hit it from time to time in testing on some machines,
sometimes only after many days. Recently found a machine which could usually
produce it within a few hours, which got me there at last.
The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
arrangement of structures was such that you got page_cgroups from the lru list
neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
independently.
The main problem was that it was missing the lock_page_cgroup it needs to
safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
comments on the constraints.
This patch immediately gets replaced by a simpler one from Hirokazu-san; but
is it just foolish pride that tells me to put this one on record, in case we
need to come back to it later?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
it, and before removing page_cgroup from one of its lru lists: isn't there a
danger that struct mem_cgroup memory could be freed and reused before
completing that, so corrupting something? Never seen it, and for all I know
there may be other constraints which make it impossible; but let's be
defensive and reverse the ordering there.
mem_cgroup_force_empty_list is safe because there's an extra css_get around
all its works; but even so, change its ordering the same way round, to help
get in the habit of doing it like this.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>