Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file. This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code. This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When renaming a file such that a link to another inode is overwritten,
force any delay allocated blocks that to be allocated so that if the
filesystem is mounted with data=ordered, the data blocks will be
pushed out to disk along with the journal commit. Many application
programs expect this, so we do this to avoid zero length files if the
system crashes unexpectedly.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.
The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.
This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.
Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The rec_len field in the directory entry is 16 bits, so to encode
blocksizes larger than 64k becomes problematic. This patch allows us
to supprot block sizes up to 256k, by using the low 2 bits to extend
the range of rec_len to 2**18-1 (since valid rec_len sizes must be a
multiple of 4). We use the convention that a rec_len of 0 or 65535
means the filesystem block size, for compatibility with older kernels.
It's unlikely we'll see VM pages of up to 256k, but at some point we
might find that the Linux VM has been enhanced to support filesystem
block sizes > than the VM page size, at which point it might be useful
for some applications to allow very large filesystem block sizes.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.
Thanks to Sami Liedes for reporting this bug.
http://bugzilla.kernel.org/show_bug.cgi?id=12430
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
ext4: Remove "extents" mount option
block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
ext4: Make printk's consistently prefixed with "EXT4-fs: "
ext4: Add sanity checks for the superblock before mounting the filesystem
ext4: Add mount option to set kjournald's I/O priority
jbd2: Submit writes to the journal using WRITE_SYNC
jbd2: Add pid and journal device name to the "kjournald2 starting" message
ext4: Add markers for better debuggability
ext4: Remove code to create the journal inode
ext4: provide function to release metadata pages under memory pressure
ext3: provide function to release metadata pages under memory pressure
add releasepage hooks to block devices which can be used by file systems
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
ext4: Init the complete page while building buddy cache
ext4: Don't allow new groups to be added during block allocation
ext4: mark the blocks/inode bitmap beyond end of group as used
ext4: Use new buffer_head flag to check uninit group bitmaps initialization
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
ext4: code cleanup
...
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.
Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal. Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs. We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal. It just so happens that running without a journal is
slightly faster for most everything.
We did
iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2
which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache. Results:
ext2 ext4, default ext4, no journal
initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s
rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s
reads 15.2 MB/s 16.9 MB/s 17.2 MB/s
re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s
random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s
random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s
So it seems that, so far, this was a useful exercise.
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This fixes a gcc warning but it doesn't appear able to result in a
failure, since the primary way the loop is exited is the first
conditional in the for loop, and at least for a consistent filesystem,
the signed/unsigned should in practice never be exposed.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The original ext3 hash algorithms assumed that variables of type char
were signed, as God and K&R intended. Unfortunately, this assumption
is not true on some architectures. Userspace support for marking
filesystems with non-native signed/unsigned chars was added two years
ago, but the kernel-side support was never added (until now).
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ All users removed in "switch all filesystems over to d_obtain_alias",
aka commit 440037287c ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix. Also remove the requirement for the TEST_FILESYS
flag.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This is a port of a patch from Linus which fixes a 200+ byte stack
usage problem in ext4_get_parent().
It's more efficient to pass down only the actual parts of the dentry
that matter: the parent inode and the name, instead of allocating a
struct dentry on the stack.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
dx_root_limit() will had some dead code which forced it to always return
20, and dx_node_limit to always return 22 for debugging purposes.
Remove it.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_next_entry() is used by the debugging function dx_show_leaf(), so
it must be defined before that function.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is
valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop
to check the validity of entries before checking whether they match and
moving onto the next one.
There are other uses of ext4_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a test-case
that triggers them.
This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>