This moves the function around so that it can be called from
ext4_mb_load_buddy().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode: If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
instead of ext4_handle_dirty_metadata().
This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync(). It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list via mark_buffer_dirty_inode() so that
ext4_sync_file() --- which is called to service fsync() and
fdatasync() system calls --- can write out the inode's metadata blocks
by calling sync_mapping_buffers().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be dropped from the
current transaction and/or revoked. This is done by calling the
functions ext4_journal_forget() and ext4_journal_revoke(), which call
jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
Since the jbd2_journal_forget() and jbd2_journal_revoke() call
bforget(), if ext4 is not using a journal, ext4_journal_forget() and
ext4_journal_revoke() must call bforget() to avoid a dirty metadata
block overwriting a block after it has been reallocated and reused for
another inode's data block.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Drop the WARN_ON(1), as he stack trace is not appropriate, since it is
triggered by file system corruption, and it misleads users into
thinking there is a kernel bug. In addition, change the message
displayed by ext4_error() to make it clear that this is a file system
corruption problem.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In order to check whether the buffer_heads are mapped we need to hold
page lock. Otherwise a reclaim can cleanup the attached buffer_heads.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function means moving extents every page, so change its name from
move_exgtent_par_page().
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.co.jp>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Return exchanged blocks count (moved_len) to user space,
if ext4_move_extents() failed on the way.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext4_move_extents() functions checks with BUG_ON() whether the
exchanged blocks count accords with request blocks count. But, if the
target range (orig_start + len) includes sparse block(s), 'moved_len'
(exchanged blocks count) does not agree with 'len' (request blocks
count), since sparse block is not counted in 'moved_len'. This causes
us to hit the BUG_ON(), even though the function succeeded.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The mext_check_arguments() function in move_extents.c has wrong
comparisons. orig_start which is passed from user-space is block
unit, but i_size of inode is byte unit, therefore the checks do not
work fine. This mis-check leads to the overflow of 'len' and then
hits BUG_ON() in ext4_move_extents(). The patch fixes this issue.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Greg Freemyer <greg.freemyer@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We need to flush the write cache unconditionally in ->fsync, otherwise
writes into already allocated blocks can get lost. Writes into fully
allocated files are very common when using disk images for
virtualization, and without this fix can easily lose data after
an fdatasync, which is the typical implementation for a cache flush on
the virtual drive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add a new tracepoint which shows the pages that will be written using
write_cache_pages() by ext4_da_writepages().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
To solve a lock inversion problem, we implement part of the
range_cyclic algorithm in ext4_da_writepages(). (See commit 2acf2c26
for more details.)
As part of that change wbc->range_start was modified by ext4's
writepages function, which causes its callers to get confused since
they aren't expecting the filesystem to modify it. The simplest fix
is to save and restore wbc->range_start in ext4_da_writepages.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_link we need to check using EXT4_LINK_MAX, and not
EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of
regular files, and not directories.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new
parent directory without running into the EXT4_LINK_MAX limit.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The extents sanity-checking code depends on the ext4_ext_space_*()
functions returning the maximum alloable size for eh_max; however,
when the debugging #ifdef AGGRESSIVE_TEST is enabled to test the
extent tree handling code, this prevents a normally created ext4
filesystem from being mounted with the errors:
Aug 26 15:43:50 bsd086 kernel: [ 96.070277] EXT4-fs error (device sda8): ext4_ext_check_inode: bad header/extent in inode #8: too large eh_max - magic f30a, entries 1, max 4(3), depth 0(0)
Aug 26 15:43:50 bsd086 kernel: [ 96.070526] EXT4-fs (sda8): no journal found
Bug reported by Akira Fujita.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
unsigned short is potentially too small to track blocks within
a group; today it is safe due to restrictions in e2fsprogs but
we have _lo / _hi bits for group blocks with the intent to go
up to 32 bits, so clean this up now.
There are many more places where we use unsigned/int/unsigned int
to contain a group block but this should at least fix all the
short types.
I added a few comments to the struct ext4_group_info definition
as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Precursor to changing some types; to keep things in sync, it
seems better to allocate/memset based on the size of the
variables we are using rather than on some disconnected
basic type like "unsigned short"
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
A user reported that although his root ext4 filesystem was mounting
fine, other filesystems would not mount, with the:
"Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF"
error on his 32-bit box built without CONFIG_LBDAF. This is because
the test at mount time for this situation was not being re-checked
on remount, and the normal boot process makes an ro->rw transition,
so this was being missed.
Refactor to make a common helper function to test the filesystem
features against the type of mount request (RO vs. RW) so that we
stay consistent.
Addresses Red-Hat-Bugzilla: #517650
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
While reading through some of the mballoc code it seems that a couple
spots in the size normalization function could be streamlined.
The test for non-overlapping PAs can be or'd for the start & end
conditions, and the tests for adjacent PAs can be else-if'd -
it's essentially independently testing:
if (A + B <= C)
...
if (A > C)
...
These cannot both be true so it seems like the else-if might
be slightly more efficient and/or informative.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mb_update_group_info is only called in one place, and it's
extremely simple. There's no reason to have it in a separate function
in a separate file as far as I can tell, it just obfuscates what's
really going on.
Perhaps it was intended to keep the grp->bb_* manipulation local to
mballoc.c but we're already accessing other grp-> fields in balloc.c
directly so this seems ok.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4 will happily mount a > 16T filesystem on a 32-bit box, but
this is not safe; writes to the block device will wrap past 16T
and the page cache can't index past 16T (232 index * 4k pages).
Adding another test to the existing "too many sectors" test
should do the trick.
Add a comment, a relevant return value, and fix the reference
to the CONFIG_LBD(AF) option as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>