Commit Graph

6274 Commits

Author SHA1 Message Date
Mark Fasheh 415cb80037 ocfs2: Allow smaller allocations during large writes
The ocfs2 write code loops through a page much like the block code, except
that ocfs2 allocation units can be any size, including larger than page
size. Typically it's equal to or larger than page size - most kernels run 4k
pages, the minimum ocfs2 allocation (cluster) size.

Some changes introduced during 2.6.23 changed the way writes to pages are
handled, and inadvertantly broke support for > 4k page size. Instead of just
writing one cluster at a time, we now handle the whole page in one pass.

This means that multiple (small) seperate allocations might happen in the
same pass. The allocation code howver typically optimizes by getting the
maximum which was reserved. This triggered a BUG_ON in the extend code where
it'd ask for a single bit (for one part of a > 4k page) and get back more
than it asked for.

Fix this by providing a variant of the high level allocation function which
allows the caller to specify a maximum. The traditional function remains and
just calls the new one with a maximum determined from the initial
reservation.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2007-09-20 15:06:09 -07:00
Linus Torvalds a78feb7c8a Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
  [XFS] Avoid replaying inode buffer initialisation log items if on-disk version is newer.
  [XFS] Ensure file size updates have been completed before writing inode to disk.
  [XFS] On-demand reaping of the MRU cache
2007-09-19 11:40:13 -07:00
Eric Sandeen ef2b02d3e6 ext34: ensure do_split leaves enough free space in both blocks
The do_split() function for htree dir blocks is intended to split a leaf
block to make room for a new entry.  It sorts the entries in the original
block by hash value, then moves the last half of the entries to the new
block - without accounting for how much space this actually moves.  (IOW,
it moves half of the entry *count* not half of the entry *space*).  If by
chance we have both large & small entries, and we move only the smallest
entries, and we have a large new entry to insert, we may not have created
enough space for it.

The patch below stores each record size when calculating the dx_map, and
then walks the hash-sorted dx_map, calculating how many entries must be
moved to more evenly split the existing entries between the old block and
the new block, guaranteeing enough space for the new entry.

The dx_map "offs" member is reduced to u16 so that the overall map size
does not change - it is temporarily stored at the end of the new block, and
if it grows too large it may be overwritten.  By making offs and size both
u16, we won't grow the map size.

Also add a few comments to the functions involved.

This fixes the testcase reported by hooanon05@yahoo.co.jp on the
linux-ext4 list, "ext3 dir_index causes an error"

Thanks to Andreas Dilger for discussing the problem & solution with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Tested-by: Junjiro Okajima <hooanon05@yahoo.co.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19 11:24:18 -07:00
Alexey Dobriyan 49af7ee181 nfs: fix oops re sysctls and V4 support
NFS unregisters sysctls only if V4 support is compiled in.  However, sysctl
table is not V4 specific, so unregister it always.

Steps to reproduce:

	[build nfs.ko with CONFIG_NFS_V4=n]
	modrobe nfs
	rmmod nfs
	ls /proc/sys

Unable to handle kernel paging request at ffffffff880661c0 RIP:
 [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350
PGD 203067 PUD 207063 PMD 7e216067 PTE 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: lockd nfs_acl sunrpc
Pid: 3335, comm: ls Not tainted 2.6.23-rc3-bloat #2
RIP: 0010:[<ffffffff802af8e3>]  [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350
RSP: 0018:ffff81007fd93e78  EFLAGS: 00010286
RAX: ffffffff880661c0 RBX: ffffffff80466370 RCX: ffffffff880661c0
RDX: 00000000000014c0 RSI: ffff81007f3ad020 RDI: ffff81007efd8b40
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffffffff802a8570 R12: ffffffff880661c0
R13: ffff81007e219640 R14: ffff81007efd8b40 R15: ffff81007ded7280
FS:  00002ba25ef03060(0000) GS:ffff81007ff81258(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff880661c0 CR3: 000000007dfaf000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ls (pid: 3335, threadinfo ffff81007fd92000, task ffff81007d8a0000)
Stack:  ffff81007f3ad150 ffffffff80283f30 ffff81007fd93f48 ffff81007efd8b40
 ffff81007ee00440 0000000422222222 0000000200035593 ffffffff88037e9a
 2222222222222222 ffffffff80466500 ffff81007e416400 ffff81007e219640
Call Trace:
 [<ffffffff80283f30>] filldir+0x0/0xf0
 [<ffffffff80283f30>] filldir+0x0/0xf0
 [<ffffffff802840c7>] vfs_readdir+0xa7/0xc0
 [<ffffffff80284376>] sys_getdents+0x96/0xe0
 [<ffffffff8020bb3e>] system_call+0x7e/0x83

Code: 41 8b 14 24 85 d2 74 dc 49 8b 44 24 08 48 85 c0 74 e7 49 3b
RIP  [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350
 RSP <ffff81007fd93e78>
CR2: ffffffff880661c0
Kernel panic - not syncing: Fatal exception

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19 11:24:18 -07:00
Eric Sandeen 3d82abae95 dir_index: error out instead of BUG on corrupt dx dirs
Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable
errors with helpful warnings.  With help catching other asserts from Duane
Griffin <duaneg@dghda.com>

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19 11:24:18 -07:00
Lachlan McIlroy b394e43e99 [XFS] Avoid replaying inode buffer initialisation log items if on-disk version is newer.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29676a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-18 20:16:00 +10:00
Lachlan McIlroy 776a75fa5c [XFS] Ensure file size updates have been completed before writing inode to disk.
SGI-PV: 968767
SGI-Modid: xfs-linux-melb:xfs-kern:29675a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-18 20:12:51 +10:00
David Chinner 65de556756 [XFS] On-demand reaping of the MRU cache
Instead of running the mru cache reaper all the time based on a timeout,
we should only run it when the cache has active objects. This allows CPUs
to sleep when there is no activity rather than be woken repeatedly just to
check if there is anything to do.

SGI-PV: 968554
SGI-Modid: xfs-linux-melb:xfs-kern:29305a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-17 16:42:02 +10:00
Jeff Garzik a2ca44c30d Merge branch 'fixes-jgarzik' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream-fixes 2007-09-15 19:29:07 -04:00
Masakazu Mokuno 53c5725581 As struct iw_point is bi-directional payload, we should copy back the content
on return from ioctl calls

Signed-off-by: Masakazu Mokuno <mokuno@sm.sony.co.jp>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-09-14 14:35:38 -04:00
Linus Torvalds 577107e8e4 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: Fix calculation of i_blocks during truncate
  [PATCH] ocfs2: Fix a wrong cluster calculation.
  [PATCH] ocfs2: fix mount option parsing
  ocfs2: update docs for new features
2007-09-11 17:23:16 -07:00
Pavel Emelyanov 0e2f6db88a Leases can be hidden by flocks
The inode->i_flock list contains the leases, flocks and posix
locks in the specified order. However, the flocks are added in
the head of this list thus hiding the leases from F_GETLEASE
command, from time_out_leases() and other code that expects
the leases to come first.

The following example will demonstrate this:

#define _GNU_SOURCE

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>

static void show_lease(int fd)
{
        int res;

        res = fcntl(fd, F_GETLEASE);
        switch (res) {
                case F_RDLCK:
                        printf("Read lease\n");
                        break;
                case F_WRLCK:
                        printf("Write lease\n");
                        break;
                case F_UNLCK:
                        printf("No leases\n");
                        break;
                default:
                        printf("Some shit\n");
                        break;
        }
}

int main(int argc, char **argv)
{
        int fd, res;

        fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
                perror("Can't open file");
                return 1;
        }

        res = fcntl(fd, F_SETLEASE, F_WRLCK);
        if (res == -1) {
                perror("Can't set lease");
                return 1;
        }

        show_lease(fd);

        if (flock(fd, LOCK_SH) == -1) {
                perror("Can't flock shared");
                return 1;
        }

        show_lease(fd);

        return 0;
}

The first call to show_lease() will show the write lease set, but
the second will show no leases.

Fix the flock adding so that the leases always stay in the head
of this list.

Found during making the flocks pid-namespaces aware.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:27 -07:00
Alexey Dobriyan dd23aae4f5 Fix select on /proc files without ->poll
Taneli Vähäkangas <vahakang@cs.helsinki.fi> reported that commit
786d7e1612 aka "Fix rmmod/read/write races
in /proc entries" broke SBCL + SLIME combo.

The old code in do_select() used DEFAULT_POLLMASK, if couldn't find
->poll handler.  The new code makes ->poll always there and returns 0 by
default, which is not correct.  Return DEFAULT_POLLMASK instead.

Steps to reproduce:

	install emacs, SBCL, SLIME
	emacs
	M-x slime	in *inferior-lisp* buffer
	[watch it doing "Connecting to Swank on port X.."]

Please, apply before 2.6.23.

P.S.: why SBCL can't just read(2) /proc/cpuinfo is a mystery.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: T Taneli Vahakangas <vahakang@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:20 -07:00
Andreas Gruenbacher 1a1a1a758b afs: mntput called before dput
dput must be called before mntput here.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-By: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:19 -07:00
Jan Kara 9c3013e9b9 quota: fix infinite loop
If we fail to start a transaction when releasing dquot, we have to call
dquot_release() anyway to mark dquot structure as inactive.  Otherwise we
end in an infinite loop inside dqput().

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: xb <xavier.bru@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:19 -07:00
Mark Fasheh e535e2efd2 ocfs2: Fix calculation of i_blocks during truncate
We were setting i_blocks too early - before truncating any allocation.
Correct things to set i_blocks after the allocation change.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2007-09-11 11:39:46 -07:00
tao.ma@oracle.com 30b8548f2c [PATCH] ocfs2: Fix a wrong cluster calculation.
In ocfs2_alloc_write_write_ctxt, the written clusters length is calculated
by the byte length only. This may cause some problems if we start to write
at some position in the end of one cluster and last to a second cluster
while the "len" is smaller than a cluster size. In that case, we have to
write 2 clusters actually.
So we have to take the start position into consideration also.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2007-09-11 11:39:05 -07:00
Tiger Yang c0123adef6 [PATCH] ocfs2: fix mount option parsing
For some mount option types, ocfs2_parse_options() will try to access
sb->s_fs_info to get at the ocfs2 private superblock. Unfortunately, that
hasn't been allocated yet and will cause a kernel crash.

Fix this by storing options in a struct which can then get pushed into the
ocfs2_super once it's been allocated later. If we need more options which
store to the ocfs2_super in the future, we can just fields to this struct.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2007-09-11 11:38:48 -07:00
Mark Fasheh 10b0845bed ocfs2: update docs for new features
Update documentation listing ocfs2 features to reflect the current state of
the file system. Add missing descriptions for some mount options which ocfs2
supports.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
2007-09-11 11:38:25 -07:00
Neil Brown b8da0d1c27 knfsd: Validate filehandle type in fsid_source
fsid_source decided where to get the 'fsid' number to
return for a GETATTR based on the type of filehandle.
It can be from the device, from the fsid, or from the
UUID.

It is possible for the filehandle to be inconsistent
with the export information, so make sure the export information
actually has the info implied by the value returned by
fsid_source.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Luiz Fernando N. Capitulino" <lcapitulino@gmail.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-10 18:57:47 -07:00
Neil Brown a1033be72c knfsd: Fixed problem with NFS exporting directories which are mounted on.
Recent changes in NFSd cause a directory which is mounted-on
to not appear properly when the filesystem containing it is exported.

*exp_get* now returns -ENOENT rather than NULL and when
  commit 5d3dbbeaf5
removed the NULL checks, it didn't add a check for -ENOENT.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-10 18:57:47 -07:00
Eric Sandeen 5995cb7d80 [XFS] fix nasty quota hashtable allocation bug
This git mod: 77e4635ae1
converted to a "greedy" allocation interface, but for the quota hashtables
it switched from allocating XFS_QM_HASHSIZE (nr of elements)
xfs_dqhash_t's to allocating only XFS_QM_HASHSIZE *bytes* - quite a lot
smaller! Then when we converted hsize "back" to nr of elements (the
division line) hsize went to 0. This was leading to oopses when running
any quota tests on the Fedora 8 test kernel, but the problem has been
there for almost a year.

SGI-PV: 968837
SGI-Modid: xfs-linux-melb:xfs-kern:29354a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:51:04 +10:00
Christoph Hellwig 265c1fac38 [XFS] fix sparse shadowed variable warnings
- in xfs_probe_cluster rename the inner len to pg_len. There's no harm
  here because the outer len isn't used after the inner len comes into
  existence but it keeps the code clean.
- in xfs_da_do_buf remove the inner i because they don't overlap
  and they are both the same type.

SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29311a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:50:26 +10:00
Christoph Hellwig ee5c80239d [XFS] fix ASSERT and ASSERT_ALWAYS
- remove the != 0 inside the unlikely in ASSERT_ALWAYS because sparse now
  complains about comparisons between pointers and 0
- add a standalone ASSERT implementation because defining it to
  ASSERT_ALWAYS means the string is expanded before the token passing
  stringification. This way we get the actual content of the
  assertion in the assfail message and don't overflow sparse's
  stringification buffer leading to sparse error messages.

SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29310a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:49:30 +10:00
Christoph Hellwig 34521c5e49 [XFS] Fix sparse warning in kmem_shake_allow
We can't return a masked result of a __bitwise type. Compare it to 0 first
to keep the behaviour without the warning.

SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29309a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-09-05 14:48:00 +10:00