* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Unlike on most other architectures ino_t is an unsigned int on s390. So
add an explicit cast to avoid this compile warning:
fs/ext2/namei.c: In function 'ext2_lookup':
fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable removing of corrupted pages through truncation
for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
These should cover most server needs.
I chose the set of migration aware file systems for this
for now, assuming they have been especially audited.
But in general it should be safe for all file systems
on the data area that support read/write and truncate.
Caveat: the hardware error handler does not take i_mutex
for now before calling the truncate function. Is that ok?
Cc: tytso@mit.edu
Cc: hch@infradead.org
Cc: mfasheh@suse.com
Cc: aia21@cantab.net
Cc: hugh.dickins@tiscali.co.uk
Cc: swhiteho@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ext2_rename(), dir_page is acquired through ext2_dotdot(). It is
then released through ext2_set_link() but only if old_dir != new_dir.
Failing that, the pkmap reference count is never decremented and the
page remains pinned forever. Repeat that a couple times with highmem
pages and all pkmap slots get exhausted, and every further kmap() calls
end up stalling on the pkmap_map_wait queue at which point the whole
system comes to a halt.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext2_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode. This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.
The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.
This patch thus changes ext2_lookup to return -EIO if it receives -ESTALE
from ext2_iget(), as ext2 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.
Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).
ubifs/xattr.c needed includes reordered, the rest is a plain switchover.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One of our users is complaining that his backup tool is upset on ext2
(while it's happy on ext3, xfs, ...) because of the mtime change.
The problem is:
mkdir foo
mkdir bar
mkdir foo/a
Now under ext2:
mv foo/a foo/b
changes mtime of 'foo/a' (foo/b after the move). That does not really
make sense and it does not happen under any other filesystem I've seen.
More complicated is:
mv foo/a bar/a
This changes mtime of foo/a (bar/a after the move) and it makes some
sense since we had to update parent directory pointer of foo/a. But
again, no other filesystem does this. So after some thoughts I'd vote
for consistency and change ext2 to behave the same as other filesystems.
Do not update mtime of a moved directory. Specs don't say anything
about it (neither that it should, nor that it should not be updated) and
other common filesystems (ext3, ext4, xfs, reiserfs, fat, ...) don't do
it. So let's become more consistent.
Spotted by ronny.pretzsch@dfs.de, initial fix by Jörn Engel.
Reported-by: <ronny.pretzsch@dfs.de>
Cc: <hare@suse.de>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kill ext2_sync_file() (along with ext2/fsync.c), get rid of
ext2_update_inode() - it's an alias of ext2_write_inode().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move BKL into ->put_super from the only caller. A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.
[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.
Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.
Exceptions:
- affs already has identical copy & pasted code at the beginning of
affs_put_super so no need to do it twice.
- xfs does the right thing without it and I have changes pending for
the xfs tree touching this are so I don't really need conflicts
here..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If two writers allocating blocks to file race with each other (e.g.
because writepages races with ordinary write or two writepages race with
each other), ext2_getblock() can be called on the same inode in parallel.
Before we are going to allocate new blocks, we have to recheck the block
chain we have obtained so far without holding truncate_mutex. Otherwise
we could overwrite the indirect block pointer set by the other writer
leading to data loss.
The below test program by Ying is able to reproduce the data loss with ext2
on in BRD in a few minutes if the machine is under memory pressure:
long kMemSize = 50 << 20;
int kPageSize = 4096;
int main(int argc, char **argv) {
int status;
int count = 0;
int i;
char *fname = "/mnt/test.mmap";
char *mem;
unlink(fname);
int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
status = ftruncate(fd, kMemSize);
mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// Fill the memory with 1s.
memset(mem, 1, kMemSize);
sleep(2);
for (i = 0; i < kMemSize; i++) {
int byte_good = mem[i] != 0;
if (!byte_good && ((i % kPageSize) == 0)) {
//printf("%d ", i / kPageSize);
count++;
}
}
munmap(mem, kMemSize);
close(fd);
unlink(fname);
if (count > 0) {
printf("Running %d bad page\n", count);
return 1;
}
return 0;
}
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around. According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:
commit 70688e4dd1
Author: Nick Piggin <npiggin@suse.de>
Date: Mon Apr 28 02:13:02 2008 -0700
xip: support non-struct page backed memory
In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x. During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork). Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.
Both parent and child process crash some time later due to inconsistencies
in their address spaces. The issue seems to be timing sensitive, various
attempts to recreate it have failed.
This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated. This patch keeps users from running into
that issue.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>