On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the
commit 263d90cefc ("nilfs2: remove own inode hash used for GC"),
and leading to filesystem corruption.
The patch doesn't queue gc-inodes for log writer if they are reused
through the vfs inode cache. Here, gc-inode is the inode which
buffers blocks to be relocated on GC. That patch queues gc-inodes in
nilfs_init_gcinode() function, but this function is not called when
they don't have I_NEW flag. Thus, some of live blocks are wrongly
overrode without being moved to new logs.
This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_iget_for_gc() returns an ERR_PTR() on failure and doesn't return
NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.
# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618
These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance. Currently there is no
way to inspect dirty and writeback speed over time. It's not possible for
nr_dirty/nr_writeback.
These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats which lets us understand the io in the block
layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system, if
it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers to
see when writeback is not able to keep up with incoming traffic and the
rate of dirty memory being sent to the IO back end. This allows folks to
understand their io workloads and track kernel issues. Non kernel
engineers at Google often use these counters to solve puzzling performance
problems.
Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written
Patch #5 add writeback thresholds to /proc/vmstat
Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.
The output is as below:
# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223
This patch:
This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using these
routines means that some code will lose track of the accounting and we get
bugs.
Modify nilfs2 to use interface.
Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (36 commits)
nilfs2: eliminate sparse warning - "context imbalance"
nilfs2: eliminate sparse warnings - "symbol not declared"
nilfs2: get rid of bdi from nilfs object
nilfs2: change license of exported header file
nilfs2: add bdev freeze/thaw support
nilfs2: accept 64-bit checkpoint numbers in cp mount option
nilfs2: remove own inode allocator and destructor for metadata files
nilfs2: get rid of back pointer to writable sb instance
nilfs2: get rid of mi_nilfs back pointer to nilfs object
nilfs2: see state of root dentry for mount check of snapshots
nilfs2: use iget for all metadata files
nilfs2: get rid of GCDAT inode
nilfs2: add routines to redirect access to buffers of DAT file
nilfs2: add routines to roll back state of DAT file
nilfs2: add routines to save and restore bmap state
nilfs2: do not allocate nilfs_mdt_info structure to gc-inodes
nilfs2: allow nilfs_clear_inode to clear metadata file inodes
nilfs2: get rid of snapshot mount flag
nilfs2: simplify life cycle management of nilfs object
nilfs2: do not allocate multiple super block instances for a device
...
insert sparse annotations to fix following sparse warning.
fs/nilfs2/segment.c:2681:3: warning: context imbalance in 'nilfs_segctor_kill_thread' - unexpected unlock
nilfs_segctor_kill_thread is only called inside sc_state_lock lock.
sparse doesn't detect the context and warn "unexpected unlock".
__acquires/__releases pretend to lock/unlock the sc_state_lock for sparse.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
change nilfs_dat_commit_free and nilfs_inode_cachep static
to fix following warnings
fs/nilfs2/super.c:72:19: warning: symbol 'nilfs_inode_cachep' was not declared. Should it be static?
fs/nilfs2/dat.c:106:6: warning: symbol 'nilfs_dat_commit_free' was not declared. Should it be static?
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs now can use sb->s_bdi to get backing_dev_info, so we use it
instead of ns_bdi on the nilfs object and remove ns_bdi.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs hasn't supported the freeze/thaw feature because it didn't work
due to the peculiar design that multiple super block instances could
be allocated for a device. This limitation was removed by the patch
"nilfs2: do not allocate multiple super block instances for a device".
So now this adds the freeze/thaw support to nilfs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current implementation doesn't mount snapshots with checkpoint
numbers larger than INT_MAX since it uses match_int() for parsing
"cp=" mount option.
This uses simple_strtoull() for the conversion to resolve the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This finally removes own inode allocator and destructor functions for
metadata files. Several routines, nilfs_mdt_new(),
nilfs_mdt_new_common(), nilfs_mdt_clear(), nilfs_mdt_destroy(), and
nilfs_alloc_inode_common() will be gone.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs object holds a back pointer to a writable super block instance
in nilfs->ns_writer, and this became eliminable since sb is now made
per device and all inodes have a valid pointer to it.
This deletes the ns_writer pointer and a reader/writer semaphore
protecting it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes a back pointer to nilfs object from nilfs_mdt_info
structure that is attached to metadata files.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
After applied the patch that unified sb instances, root dentry of
snapshots can be left in dcache even after their trees are unmounted.
The orphan root dentry/inode keeps a root object, and this causes
false positive of nilfs_checkpoint_is_mounted function.
This resolves the issue by having nilfs_checkpoint_is_mounted test
whether the root dentry is busy or not.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This makes use of iget5_locked to allocate or get inode for metadata
files to stop using own inode allocator.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This applies prepared rollback function and redirect function of
metadata file to DAT file, and eliminates GCDAT inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
During garbage collection (GC), DAT file, which converts virtual block
number to real block number, may return disk block number that is not
yet written to the device.
To avoid access to unwritten blocks, the current implementation stores
changes to the caches of GCDAT during GC and atomically commit the
changes into the DAT file after they are written to the device.
This patch, instead, adds a function that makes a copy of specified
buffer and stores it in nilfs_shadow_map, and a function to get the
backup copy as needed (nilfs_mdt_freeze_buffer and
nilfs_mdt_get_frozen_buffer respectively).
Before DAT changes block number in an entry block, it makes a copy and
redirect access to the buffer so that address conversion function
(i.e. nilfs_dat_translate) refers to the old address saved in the
copy.
This patch gives requisites for such redirection.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds optional function to metadata files which makes a copy of
bmap, page caches, and b-tree node cache, and rolls back to the copy
as needed.
This enhancement is intended to displace gcdat inode that provides a
similar function in a different way.
In this patch, nilfs_shadow_map structure is added to store a copy of
the foregoing states. nilfs_mdt_setup_shadow_map relates this
structure to a metadata file. And, nilfs_mdt_save_to_shadow_map() and
nilfs_mdt_restore_from_shadow_map() provides save and restore
functions respectively. Finally, nilfs_mdt_clear_shadow_map() clears
states of nilfs_shadow_map.
The copy of b-tree node cache and page cache is made by duplicating
only dirty pages into corresponding caches in nilfs_shadow_map. Their
restoration is done by clearing dirty pages from original caches and
by copying dirty pages back from nilfs_shadow_map.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds routines to save and restore the state of bmap structure.
The bmap state is stored in a given nilfs_bmap_store object.
These routines will be used to roll back the state of dat inode
without using gcdat inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
GC-inode now doesn't need the nilfs_mdt_info structure and there is no
reason that it is a sort of metadata files.
This stops the allocation and makes them not dependent on metadata
file routines.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Allows clear inode function (nilfs_clear_inode) to handle metadata
files that uses bitmap-based object alloctor. DAT and ifile
correspond to this.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This stops pre-allocating nilfs object in nilfs_get_sb routine, and
stops managing its life cycle by reference counting.
nilfs_find_or_create_nilfs() function, nilfs->ns_mount_mutex,
nilfs_objects list, and the reference counter will be removed through
the simplification.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>