commit 720e774927 upstream.
gfs2_lock() will skip locks on file which have mode set to 02666. This is a problem in cases where the mode of the file is changed after a process has obtained a lock on the file. Such a lock will be skipped and will result in a BUG in locks_remove_flock().
gfs2_lock() should skip the check for mandatory locks when unlocking a file.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
GFS2 has been altered to pass THIS_MODULE to slow_work_register_user(), but
hasn't been altered to #include <linux/module.h> to provide it, resulting in
the following error:
fs/gfs2/recovery.c:596: error: 'THIS_MODULE' undeclared here (not in a function)
Add the missing #include.
Signed-off-by: David Howells <dhowells@redhat.com>
Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility. This prevents the put_ref
code of a work item from being taken away before it returns.
Signed-off-by: David Howells <dhowells@redhat.com>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable removing of corrupted pages through truncation
for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
These should cover most server needs.
I chose the set of migration aware file systems for this
for now, assuming they have been especially audited.
But in general it should be safe for all file systems
on the data area that support read/write and truncate.
Caveat: the hardware error handler does not take i_mutex
for now before calling the truncate function. Is that ok?
Cc: tytso@mit.edu
Cc: hch@infradead.org
Cc: mfasheh@suse.com
Cc: aia21@cantab.net
Cc: hugh.dickins@tiscali.co.uk
Cc: swhiteho@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
block: use blkdev_issue_discard in blk_ioctl_discard
Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
block: don't assume device has a request list backing in nr_requests store
block: Optimal I/O limit wrapper
cfq: choose a new next_req when a request is dispatched
Seperate read and write statistics of in_flight requests
aoe: end barrier bios with EOPNOTSUPP
block: trace bio queueing trial only when it occurs
block: enable rq CPU completion affinity by default
cfq: fix the log message after dispatched a request
block: use printk_once
cciss: memory leak in cciss_init_one()
splice: update mtime and atime on files
block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
cfq-iosched: get rid of must_alloc flag
block: use interrupts disabled version of raise_softirq_irqoff()
block: fix comment in blk-iopoll.c
block: adjust default budget for blk-iopoll
block: fix long lines in block/blk-iopoll.c
block: add blk-iopoll, a NAPI like approach for block devices
...
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request. To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour. This will be very useful later on for using the waiting
funcitonality for other callers.
Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The /sys/fs/gfs2/<fsname>/lock_module/id file has been unused for
some time now, so we can remove it. We still accept the mount option
though, as userspace still sends that.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is a potential race in the inode deallocation code if two
nodes try to deallocate the same inode at the same time. Most of
the issue is solved by the iopen locking. There is still a small
window which is not covered by the iopen lock. This patches fixes
that and also makes the deallocation code more robust in the face of
any errors in the rgrp bitmaps, or erroneous iopen callbacks from
other nodes.
This does introduce one extra disk read, but that is generally not
an issue since its the same block that must be written to later
in the deallocation process. The total disk accesses therefore stay
the same,
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The inum structure used throughout GFS2 has two fields. One
no_addr is the disk block number of the inode in question and
is used everywhere as the inode number. The other, no_formal_ino,
is used only as the generation number for NFS.
Historically the no_formal_ino field was set using a complicated
system of one global and one per-node file containing inode numbers
in order to ensure that each no_formal_ino was unique. Also this
code made no provision for what would happen when eventually the
(64 bit) numbers ran out. Now I know that is pretty unlikely to
happen given the large space of numbers, but it is possible
nevertheless.
The only guarantee required for no_formal_ino is that, for any
single inode, the same number doesn't get reused too quickly.
We already have a generation number which is kept in the inode
and initialised from a counter in the resource group (almost
no overhead, since we have to touch the resource group anyway
in order to allocate an inode in the first place). Aside from
ensuring that we never use the value 0 in the no_formal_ino
field, we can use that counter directly.
As a result of that change, we lose about 200 lines of code and
also gain about 10 creates/sec on the postmark benchmark (on
my test machine).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Use the more conventional name for the extended attribute
support code. Update all the places which care.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This has been on my list for some time. We need to change the way
in which we handle extended attributes to allow faster file creation
times (by reducing the number of transactions required) and the
extended attribute code is the main obstacle to this.
In addition to that, the VFS provides a way to demultiplex the xattr
calls which we ought to be using, rather than rolling our own. This
patch changes the GFS2 code to use that VFS feature and as a result
the code shrinks by a couple of hundred lines or so, and becomes
easier to read.
I'm planning on doing further clean up work in this area, but this
patch is a good start. The cleaned up code also uses the more usual
"xattr" shorthand, I plan to eliminate the use of "eattr" eventually
and in the mean time it serves as a flag as to which bits of the code
have been updated.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds "-o errors=panic" and "-o errors=withdraw" to the
gfs2 mount options. The "errors=withdraw" option is today's
current behaviour, meaning to withdraw from the file system if a
non-serious gfs2 error occurs. The new "errors=panic" option
tells gfs2 to force a kernel panic if a non-serious gfs2 file
system error occurs. This may be useful, for example, where
fabric-level fencing is used that has no way to reboot (such as
fence_scsi).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
this patch is for the same problem that Benjamin Marzinski fixes at commit
b94a170e96
quotation of the original problem:
---cut here---
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.
---end cut---
after applying Benjamin's patch, I think there is still a case in which the disk
inode remains even when "no space" is hit. the case is that when running
d_prune_aliases() against the inode, there are one or more dentries(aliases)
which have reference count number > 0. in this case the dentries won't be pruned.
and even later, the reference count becomes to 0, the dentries can still be
cached in memory. unfortunately, no callback come again, things come back to
the state before the callback runs. thus the on disk inode remains there until
in memoryinode is removed for some other reason(shrinking inode cache or unmount
the volume..).
this patch is to remove those dentries when their reference count becomes to 0 and
the inode is deleted by remote node. for implementation, gfs2_dentry_delete() is
added as dentry_operations.d_delete. the function returns true when the inode is
deleted by remote node. in dput(), gfs2_dentry_delete() is called and since it
returns true, the dentry is unhashed from dcache and then removed. when all dentries
are removed, the in memory inode get removed so that the on disk inode is freed.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds a link from the per-gfs2 sb sysfs directory to
the block device upon which the filesystem is mounted. The
link is called "device", strangely enough :-)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A little while back, block allocation was given some improved
error handling which meant that -EIO was returned in the case
of there being a problem in the resource group data. In addition
a message is printed explaning what went wrong and how to fix it.
This extends that error handling so that it also covers inode
allocation too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With each uevent, we now always include the journal ID. We
can't call it JID since that is already in use by some of
the individual events relating to recovery, so we use
JOURNALID instead. We don't send the JOURNALID for spectator
mounts, since there isn't one.
Also the ADD event now has both RDONLY and SPECTATOR information
to match that of the ONLINE event.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We already have an offline uevent (used when a withdraw occurs)
but no online uevent. This adds an online uevent so that userspace
will be able to detect a successful mount by means other than
not receiving a remove event after the add & recovery (change)
uevents.
It has also been added to the remount path as well - we can't use
a change uevent there as older GFS2 userspace acts on change uevents
according to the state that it thinks the fs is in, so we can't
easily add any new ones.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Although this file is only ever written and not read by
userspace, it seems that the utils are opening this
file O_RDWR, so we need to allow that.
Also fixes the whitespace which seemed to be broken.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>