In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.
Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
Reduce the object size ~10% could be useful for embedded systems.
Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.
$ size fs/ext4/built-in.o*
text data bss dec hex filename
239375 610 888 240873 3ace9 fs/ext4/built-in.o.new
264167 738 888 265793 40e41 fs/ext4/built-in.o.old
$ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
# CONFIG_PRINTK is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_POSIX_ACL=y
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_DEBUG is not set
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure. For
keeping this order, a spin lock is used to protect this list. But this
lock burns a lot of CPU time. We can use the following steps to trigger
it.
% cd /dev/shm
% dd if=/dev/zero of=ext4-img bs=1M count=2k
% mkfs.ext4 ext4-img
% mount -t ext4 -o loop ext4-img /mnt
% cd /mnt
% for ((i=0;i<160;i++)); do truncate -s 64g $i; done
% for ((i=0;i<160;i++)); do cp $i /dev/null &; done
% perf record -a -g
% perf report
This commit tries to fix this problem. Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode. Meanwhile we never need to keep a proper in-order
LRU list. So this can avoid to burns some CPU time. When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list. Then we traverse this list to discard some
entries. In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list. When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list. When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.
In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.
Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.
An observed result was:
- allocation fail means ext4_mb_new_group_pa() does not update
ext4_allocation_context;
- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
of number of blocks requested (1);
- that activates update cycle in ext4_splice_branch():
for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
*(where->p + i) = cpu_to_le32(current_block++);
- it iterates 511 times and corrupts a chunk of memory including inode
structure;
- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();
- system hangs with 'scheduling while atomic' BUG.
The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.
Found by Linux File System Verification project (linuxtesting.org).
[ Patch restructed by tytso to make the flow of control easier to follow. ]
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.
Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state
Test case: http://patchwork.ozlabs.org/patch/244297
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.
Also fix data integrity for nojournal mode.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 18888cf088: "ext4: speed up truncate/unlink by not using
bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
the most important codepath for file systems using extents, but a
similar optimization also can be done for file systems using indirect
blocks, and for the two special cases in the ext4 extents code.
Cc: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match. This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number. So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.
For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more. So it's better to add a
cond_resched() here a well.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename ext4_da_writepages() to ext4_writepages() and use it for all
modes. We still need to iterate over all the pages in the case of
data=journalling, but in the case of nodelalloc/data=ordered (which is
what file systems mounted using ext3 backwards compatibility will use)
this will allow us to use a much more efficient I/O submission path.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The test_root() function could potentially loop forever due to
overflow issues. So rewrite test_root() to avoid this issue; as a
bonus, it is 38% faster when benchmarked via a test loop:
int main(int argc, char **argv)
{
int i;
for (i = 0; i < 1 << 24; i++) {
if (test_root(i, 7))
printf("%d\n", i);
}
}
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The group number passed to ext4_get_group_info() should be valid, but
let's add an assert to check this before we start creating a pointer
based on that group number and dereferencing it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Check the group number for sanity earilier, before calling routines
such as ext4_bg_has_super() or ext4_group_overhead_blocks().
Reported-by: Jonathan Salwan <jonathan.salwan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now that we clear PageWriteback after extent conversion, there's no
need to wait for io_end processing in ext4_evict_inode(). Running
AIO/DIO keeps file reference until aio_complete() is called so
ext4_evict_inode() cannot be called. For io_end structures resulting
from buffered IO waiting is happening because we wait for
PageWriteback in truncate_inode_pages().
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We don't have to wait for extent conversion in ext4_punch_hole() as
buffered IO for the punched range has been flushed and waited upon
(thus all extent conversions for that range have completed). Also we
wait for all DIO to finish using inode_dio_wait() so there cannot be
any extent conversions pending due to direct IO.
Also remove ext4_flush_unwritten_io() since it's unused now.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We don't have to wait for unwritten extent conversion in
ext4_ind_direct_IO() as all writes that happened before DIO are
flushed by the generic code and extent conversion has happened before
we cleared PageWriteback bit.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
doesn't need i_mutex anymore. Forcing of transaction commits doesn't
need i_mutex as there's nothing inode specific in that code apart from
grabbing transaction ids from the inode. So remove the lock.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Just use the generic function instead of duplicating it. We only need
to reshuffle the read-only check a bit (which is there to prevent
writing to a filesystem which has been remounted read-only after error
I assume).
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Since PageWriteback bit is now cleared after extents are converted
from unwritten to written ones, we have full exclusion of writeback
path from truncate (truncate_inode_pages() waits for PageWriteback
bits to get cleared on all invalidated pages). Exclusion from DIO
path is achieved by inode_dio_wait() call in ext4_setattr(). So
there's no need to wait for extent convertion in ext4_truncate()
anymore.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make sure extent conversion after DIO happens while i_dio_count is
still elevated so that inode_dio_wait() waits until extent conversion
is done. This removes the need for explicit waiting for extent
conversion in some cases.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently PageWriteback bit gets cleared from put_io_page() called
from ext4_end_bio(). This is somewhat inconvenient as extent tree is
not fully updated at that time (unwritten extents are not marked as
written) so we cannot read the data back yet. This design was
dictated by lock ordering as we cannot start a transaction while
PageWriteback bit is set (we could easily deadlock with
ext4_da_writepages()). But now that we use transaction reservation
for extent conversion, locking issues are solved and we can move
PageWriteback bit clearing after extent conversion is done. As a
result we can remove wait for unwritten extent conversion from
ext4_sync_file() because it already implicitely happens through
wait_on_page_writeback().
We implement deferring of PageWriteback clearing by queueing completed
bios to appropriate io_end and processing all the pages when io_end is
going to be freed instead of at the moment ext4_io_end() is called.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did). So split lists, work items, and work queues to
reserved and unreserved parts.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>