Commit Graph

804 Commits

Author SHA1 Message Date
Linus Torvalds 355bbd8cb8 Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
  block: use blkdev_issue_discard in blk_ioctl_discard
  Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
  block: don't assume device has a request list backing in nr_requests store
  block: Optimal I/O limit wrapper
  cfq: choose a new next_req when a request is dispatched
  Seperate read and write statistics of in_flight requests
  aoe: end barrier bios with EOPNOTSUPP
  block: trace bio queueing trial only when it occurs
  block: enable rq CPU completion affinity by default
  cfq: fix the log message after dispatched a request
  block: use printk_once
  cciss: memory leak in cciss_init_one()
  splice: update mtime and atime on files
  block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
  cfq-iosched: get rid of must_alloc flag
  block: use interrupts disabled version of raise_softirq_irqoff()
  block: fix comment in blk-iopoll.c
  block: adjust default budget for blk-iopoll
  block: fix long lines in block/blk-iopoll.c
  block: add blk-iopoll, a NAPI like approach for block devices
  ...
2009-09-14 17:55:15 -07:00
Steven Whitehouse 86d0063656 GFS2: Whitespace fixes
Reported-by: Daniel Walker <dwalker@fifo99.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-09-14 09:50:57 +01:00
Christoph Hellwig 746cd1e7e4 block: use blkdev_issue_discard in blk_ioctl_discard
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request.  To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour.  This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:53 +02:00
Steven Whitehouse 2b88f7c535 GFS2: Remove unused sysfs file
The /sys/fs/gfs2/<fsname>/lock_module/id file has been unused for
some time now, so we can remove it. We still accept the mount option
though, as userspace still sends that.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-09-09 15:59:35 +01:00
Steven Whitehouse acf7e2444a GFS2: Be extra careful about deallocating inodes
There is a potential race in the inode deallocation code if two
nodes try to deallocate the same inode at the same time. Most of
the issue is solved by the iopen locking. There is still a small
window which is not covered by the iopen lock. This patches fixes
that and also makes the deallocation code more robust in the face of
any errors in the rgrp bitmaps, or erroneous iopen callbacks from
other nodes.

This does introduce one extra disk read, but that is generally not
an issue since its the same block that must be written to later
in the deallocation process. The total disk accesses therefore stay
the same,

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-09-08 18:00:30 +01:00
Steven Whitehouse 8d8291ae93 GFS2: Remove no_formal_ino generating code
The inum structure used throughout GFS2 has two fields. One
no_addr is the disk block number of the inode in question and
is used everywhere as the inode number. The other, no_formal_ino,
is used only as the generation number for NFS.

Historically the no_formal_ino field was set using a complicated
system of one global and one per-node file containing inode numbers
in order to ensure that each no_formal_ino was unique. Also this
code made no provision for what would happen when eventually the
(64 bit) numbers ran out. Now I know that is pretty unlikely to
happen given the large space of numbers, but it is possible
nevertheless.

The only guarantee required for no_formal_ino is that, for any
single inode, the same number doesn't get reused too quickly.

We already have a generation number which is kept in the inode
and initialised from a counter in the resource group (almost
no overhead, since we have to touch the resource group anyway
in order to allocate an inode in the first place). Aside from
ensuring that we never use the value 0 in the no_formal_ino
field, we can use that counter directly.

As a result of that change, we lose about 200 lines of code and
also gain about 10 creates/sec on the postmark benchmark (on
my test machine).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-27 15:51:07 +01:00
Steven Whitehouse 307cf6e63c GFS2: Rename eattr.[ch] as xattr.[ch]
Use the more conventional name for the extended attribute
support code. Update all the places which care.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-26 18:51:04 +01:00
Steven Whitehouse 40b78a3223 GFS2: Clean up of extended attribute support
This has been on my list for some time. We need to change the way
in which we handle extended attributes to allow faster file creation
times (by reducing the number of transactions required) and the
extended attribute code is the main obstacle to this.

In addition to that, the VFS provides a way to demultiplex the xattr
calls which we ought to be using, rather than rolling our own. This
patch changes the GFS2 code to use that VFS feature and as a result
the code shrinks by a couple of hundred lines or so, and becomes
easier to read.

I'm planning on doing further clean up work in this area, but this
patch is a good start. The cleaned up code also uses the more usual
"xattr" shorthand, I plan to eliminate the use of "eattr" eventually
and in the mean time it serves as a flag as to which bits of the code
have been updated.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-26 18:41:32 +01:00
Bob Peterson d34843d0c4 GFS2: Add "-o errors=panic|withdraw" mount options
This patch adds "-o errors=panic" and "-o errors=withdraw" to the
gfs2 mount options.  The "errors=withdraw" option is today's
current behaviour, meaning to withdraw from the file system if a
non-serious gfs2 error occurs.  The new "errors=panic" option
tells gfs2 to force a kernel panic if a non-serious gfs2 file
system error occurs.  This may be useful, for example, where
fabric-level fencing is used that has no way to reboot (such as
fence_scsi).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-24 10:44:18 +01:00
Roel Kluin cd0120751d GFS2: jumping to wrong label?
Also a gfs2_glock_dq() is required here.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-24 10:41:44 +01:00
Wengang Wang 970343cd49 GFS2: free disk inode which is deleted by remote node -V2
this patch is for the same problem that Benjamin Marzinski fixes at commit
b94a170e96

quotation of the original problem:

---cut here---
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.
---end cut---

after applying Benjamin's patch, I think there is still a case in which the disk
inode remains even when "no space" is hit. the case is that when running
d_prune_aliases() against the inode, there are one or more dentries(aliases)
which have reference count number > 0. in this case the dentries won't be pruned.
and even later, the reference count becomes to 0, the dentries can still be
cached in memory. unfortunately, no callback come again, things come back to
the state before the callback runs. thus the on disk inode remains there until
in memoryinode is removed for some other reason(shrinking inode cache or unmount
the volume..).

this patch is to remove those dentries when their reference count becomes to 0 and
the inode is deleted by remote node. for implementation, gfs2_dentry_delete() is
added as dentry_operations.d_delete. the function returns true when the inode is
deleted by remote node. in dput(), gfs2_dentry_delete() is called and since it
returns true, the dentry is unhashed from dcache and then removed. when all dentries
are removed, the in memory inode get removed so that the on disk inode is freed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-18 10:29:39 +01:00
Steven Whitehouse 31e54b01f3 GFS2: Add sysfs link to device
This adds a link from the per-gfs2 sb sysfs directory to
the block device upon which the filesystem is mounted. The
link is called "device", strangely enough :-)

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-17 11:11:18 +01:00
Steven Whitehouse 05164e5b37 GFS2: Replace assertion with proper error handling
One fewer assert, one more place we can recover gracefully
if there is an error.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-17 11:06:43 +01:00
Steven Whitehouse 6050b9c74f GFS2: Improve error handling in inode allocation
A little while back, block allocation was given some improved
error handling which meant that -EIO was returned in the case
of there being a problem in the resource group data. In addition
a message is printed explaning what went wrong and how to fix it.
This extends that error handling so that it also covers inode
allocation too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-17 11:05:31 +01:00
Steven Whitehouse 440d6da207 GFS2: Add some more info to uevents
With each uevent, we now always include the journal ID. We
can't call it JID since that is already in use by some of
the individual events relating to recovery, so we use
JOURNALID instead. We don't send the JOURNALID for spectator
mounts, since there isn't one.

Also the ADD event now has both RDONLY and SPECTATOR information
to match that of the ONLINE event.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-17 11:05:00 +01:00
Steven Whitehouse 8633ecfaba GFS2: Add online uevent to GFS2
We already have an offline uevent (used when a withdraw occurs)
but no online uevent. This adds an online uevent so that userspace
will be able to detect a successful mount by means other than
not receiving a remove event after the add & recovery (change)
uevents.

It has also been added to the remount path as well - we can't use
a change uevent there as older GFS2 userspace acts on change uevents
according to the state that it thinks the fs is in, so we can't
easily add any new ones.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-08-17 11:04:42 +01:00
Steven Whitehouse d7e623da1a GFS2: Fix permissions on "recover" file
Although this file is only ever written and not read by
userspace, it seems that the utils are opening this
file O_RDWR, so we need to allow that.

Also fixes the whitespace which seemed to be broken.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
2009-08-14 14:04:46 +01:00
Benjamin Marzinski b94a170e96 GFS2: remove dcache entries for remote deleted inodes
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:01:03 +01:00
Benjamin Marzinski 6b94617024 GFS2: Fix incorrent statfs consistency check
Since both linked and unlinked inodes are counted by rgd->rd_dinodes, It
makes no sense to count them with the used data blocks (first check that
I changed), it makes sense to count them with the linked inodes (second
check), and it makes no sense to care if there are more unlinked inodes
than linked ones. This fixes these errors.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:00:28 +01:00
Benjamin Marzinski 8ff22a6f9b GFS2: Don't put unlikely reclaim candidates on the reclaim list.
GFS2 was placing far too many glocks on the reclaim list that were not good
candidates for freeing up from cache.  These locks would sit there and
repeatedly get scanned to see if they could be reclaimed, wasting a lot
of time when there was memory pressure. This fix does more checks on the
locks to see if they are actually likely to be removable from cache.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:00:09 +01:00
Steven Whitehouse 1e19a19584 GFS2: Don't try and dealloc own inode
When searching for unlinked, but still allocated inodes during block
allocation, avoid the block relating to the inode that is doing the
allocation. This fixes a hang caused when an unlinked, but still
open, inode tries to allocate some more blocks and lands up
finding itself during the search for deallocatable inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:59:50 +01:00
Benjamin Marzinski a51b56fff3 GFS2: Fix panic in glock memory shrinker
It is possible for gfs2_shrink_glock_memory() to check a glock for
demotion
that's in the process of being freed by gfs2_glock_put().  In this case,
gfs2_shrink_glock_memory() will acquire a new reference to this glock,
and
then try to free the glock itself when it drops the refernce.  To solve
this, gfs2_shrink_glock_memory() just needs to check if the glock is in
the process of being freed, and if so skip it without ever unlocking the
lru_lock.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:59:28 +01:00
Benjamin Marzinski 1946f70ab5 GFS2: keep statfs info in sync on grows
GFS2 wasn't syncing its statfs info on grows.  This causes a problem
when you grow the filesystem on multiple nodes.  GFS2 would calculate
the new space based on the resource groups (which are always current),
and then assume that the filesystem had grown the from the existing
statfs size.  If you grew the filesystem on two different nodes in a
short time, the second node wouldn't see the statfs size change from the
first node, and would assume that it was grown by a larger amount than
it was.  When all these changes were synced out, the total fileystem
size would be incorrect (the first grow would be counted twice).

This patch syncs makes GFS2 read in the statfs changes from disk before
a grow, and write them out after the grow, while the master statfs inode
is locked.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:52:33 +01:00
Steven Whitehouse 2163b1e616 GFS2: Shrink the shrinker
This patch removes some of the special cases that the shrinker
was trying to deal with. As a result we leave fewer items on
the list and none at all which cannot be demoted. This makes
the list scanning more efficient and solves some issues seen
with large numbers of inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:52:14 +01:00
Li Zefan d0b6e04a4c tracing/events: Move TRACE_SYSTEM outside of include guard
If TRACE_INCLDUE_FILE is defined, <trace/events/TRACE_INCLUDE_FILE.h>
will be included and compiled, otherwise it will be
<trace/events/TRACE_SYSTEM.h>

So TRACE_SYSTEM should be defined outside of #if proctection,
just like TRACE_INCLUDE_FILE.

Imaging this scenario:

 #include <trace/events/foo.h>
    -> TRACE_SYSTEM == foo
 ...
 #include <trace/events/bar.h>
    -> TRACE_SYSTEM == bar
 ...
 #define CREATE_TRACE_POINTS
 #include <trace/events/foo.h>
    -> TRACE_SYSTEM == bar !!!

and then bar.h will be included and compiled.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A5A9CF1.2010007@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-07-13 10:59:55 +02:00