The f2fs always scans the next chain of direct node blocks.
But some garbage blocks are able to be remained due to no discard support or
SSR triggers.
This occasionally wreaks recovering wrong inodes that were used or BUG_ONs
due to reallocating node ids as follows.
When mount this f2fs image:
http://linuxtesting.org/downloads/f2fs_fault_image.zip
BUG_ON is triggered in f2fs driver (messages below are generated on
kernel 3.13.2; for other kernels output is similar):
kernel BUG at fs/f2fs/node.c:215!
Call Trace:
[<ffffffffa032ebad>] recover_inode_page+0x1fd/0x3e0 [f2fs]
[<ffffffff811446e7>] ? __lock_page+0x67/0x70
[<ffffffff81089990>] ? autoremove_wake_function+0x50/0x50
[<ffffffffa0337788>] recover_fsync_data+0x1398/0x15d0 [f2fs]
[<ffffffff812b9e5c>] ? selinux_d_instantiate+0x1c/0x20
[<ffffffff811cb20b>] ? d_instantiate+0x5b/0x80
[<ffffffffa0321044>] f2fs_fill_super+0xb04/0xbf0 [f2fs]
[<ffffffff811b861e>] ? mount_bdev+0x7e/0x210
[<ffffffff811b8769>] mount_bdev+0x1c9/0x210
[<ffffffffa0320540>] ? validate_superblock+0x210/0x210 [f2fs]
[<ffffffffa031cf8d>] f2fs_mount+0x1d/0x30 [f2fs]
[<ffffffff811b9497>] mount_fs+0x47/0x1c0
[<ffffffff81166e00>] ? __alloc_percpu+0x10/0x20
[<ffffffff811d4032>] vfs_kern_mount+0x72/0x110
[<ffffffff811d6763>] do_mount+0x493/0x910
[<ffffffff811615cb>] ? strndup_user+0x5b/0x80
[<ffffffff811d6c70>] SyS_mount+0x90/0xe0
[<ffffffff8166f8d9>] system_call_fastpath+0x16/0x1b
Found by Linux File System Verification project (linuxtesting.org).
Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is prettry much high. In such
the case, it needs to merge cache_flush commands as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
the overhead.
If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.
Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduce is_merged_page() to check whether current page is merged
in f2fs bio cache. When page is not in cache, we can avoid submitting bio cache,
resulting in having more chance to merge pages.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If multiple redundant fsync calls are triggered, we don't need to write its
node pages with fsync mark continuously.
So, this patch adds FI_NEED_FSYNC to track whether the latest node block is
written with the fsync mark or not.
If the mark was set, a new fsync doesn't need to write a node block.
Otherwise, we should do a new node block with the mark for roll-forward
recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces fi->i_sem to protect fi's info that includes xattr_ver,
pino, i_nlink.
This enables to remove i_mutex during f2fs_sync_file, resulting in performance
improvement when a number of fsync calls are triggered from many concurrent
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces ram_thresh, a sysfs entry, which controls the memory
footprint used by the free nid list and the nat cache.
Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat
cache was managed by NM_WOUT_THRESHOLD.
However, this approach cannot be applied dynamically according to the system.
So, this patch adds ram_thresh that users can specify the threshold, which is
in order of 1 / 1024.
For example, if the total ram size is 4GB and the value is set to 10 by default,
f2fs tries to control the number of free nids and nat caches not to consume over
10 * (4GB / 1024) = 10MB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_xattr_block for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_inline_xattr for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch use existing macro F2FS_INODE/NEXT_FREE_BLKADDR to clean up some
codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If there are multi segments in one section, we will read those SSA blocks which
have contiguous address one by one in f2fs_gc. It may lost performance, let's
read ahead SSA blocks by merge multi read request.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds an sysfs entry to control dir_level used by the large directory.
The description of this entry is:
dir_level This parameter controls the directory level to
support large directory. If a directory has a
number of files, it can reduce the file lookup
latency by increasing this dir_level value.
Otherwise, it needs to decrease this value to
reduce the space overhead. The default value is 0.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces an i_dir_level field to support large directory.
Previously, f2fs maintains multi-level hash tables to find a dentry quickly
from a bunch of chiild dentries in a directory, and the hash tables consist of
the following tree structure as below.
In Documentation/filesystems/f2fs.txt,
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
But, if we can guess that a directory will handle a number of child files,
we don't need to traverse the tree from level #0 to #N all the time.
Since the lower level tables contain relatively small number of dentries,
the miss ratio of the target dentry is likely to be high.
In order to avoid that, we can configure the hash tables sparsely from level #0
like this.
level #0 | A(2B) - A(2B) - A(2B) - A(2B)
level #1 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
With this structure, we can skip the ineffective tree searches in lower level
hash tables.
This patch adds just a facility for this by introducing i_dir_level in
f2fs_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The stat_show is just to show the current status of f2fs.
So, we can remove all the there-in locks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a radix tree for the list of free_nids, which enhances
the performance on free nid management.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help macro on_build_free_nids() which just uses build_lock
to judge whether the building free nid is going, so that we can remove
the on_build_free_nids field from f2fs_sb_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: remove an unnecessary white line removal]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch help us to cleanup the readahead code by merging ra_{sit,nat}_pages
function into ra_meta_pages.
Additionally the new function is used to readahead cp block in
recover_orphan_inodes.
Change log from v1:
o fix a deadloop bug pointed by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch modifies the use of bi_private to remove pointer chasing for sbi.
Previously, we had a bi_private structure, but it needs memory allocation.
So this patch uses bi_private by the sbi pointer and adds a completion pointer
into the sbi.
This can achieve no memory allocation and nice use of the bi_private.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a new xattr node page was allocated and its inode is fsynced, we should
recover the xattr node page during the roll-forward process after power-cut.
But, previously, f2fs didn't handle that case, resulting in kernel panic as
follows reported by Tom Li.
BUG: unable to handle kernel paging request at ffffc9001c861a98
IP: [<ffffffffa0295236>] check_index_in_prev_nodes+0x86/0x2d0 [f2fs]
Call Trace:
[<ffffffff815ece9b>] ? printk+0x48/0x4a
[<ffffffffa029626a>] recover_fsync_data+0xdca/0xf50 [f2fs]
[<ffffffffa02873ae>] f2fs_fill_super+0x92e/0x970 [f2fs]
[<ffffffff8112c9f8>] mount_bdev+0x1b8/0x200
[<ffffffffa0286a80>] ? f2fs_remount+0x130/0x130 [f2fs]
[<ffffffffa0285e40>] f2fs_mount+0x10/0x20 [f2fs]
[<ffffffff8112d4de>] mount_fs+0x3e/0x1b0
[<ffffffff810ef4eb>] ? __alloc_percpu+0xb/0x10
[<ffffffff8114761f>] vfs_kern_mount+0x6f/0x120
[<ffffffff811497b9>] do_mount+0x259/0xa90
[<ffffffff810ead1d>] ? memdup_user+0x3d/0x80
[<ffffffff810eadb3>] ? strndup_user+0x53/0x70
[<ffffffff8114a2c9>] SyS_mount+0x89/0xd0
[<ffffffff815feae2>] system_call_fastpath+0x16/0x1b
This patch adds a recovery function of xattr node pages.
Reported-by: Tom Li <biergaizi@members.fsf.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In order to make fs consistency, update_inode_page should not be failed all
the time. Otherwise, it is possible to lose some metadata in the inode like
a link count.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...