* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
Btrfs: deal with errors from updating the tree log
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Btrfs: make SNAP_DESTROY async
Btrfs: add SNAP_CREATE_ASYNC ioctl
Btrfs: add START_SYNC, WAIT_SYNC ioctls
Btrfs: async transaction commit
Btrfs: fix deadlock in btrfs_commit_transaction
Btrfs: fix lockdep warning on clone ioctl
Btrfs: fix clone ioctl where range is adjacent to extent
Btrfs: fix delalloc checks in clone ioctl
Btrfs: drop unused variable in block_alloc_rsv
Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
Btrfs: Use ERR_CAST helpers
Btrfs: use memdup_user helpers
Btrfs: fix raid code for removing missing drives
Btrfs: Switch the extent buffer rbtree into a radix tree
Btrfs: restructure try_release_extent_buffer()
Btrfs: use the flusher threads for delalloc throttling
Btrfs: tune the chunk allocation to 5% of the FS as metadata
...
Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
The btrfs merge looks like hell, because it changes fs-writeback.c, and
the crazy code has this repeated "estimate number of dirty pages"
counting that involves three different helper functions. And it's done
in two different places.
Just unify that whole calculation as a "get_nr_dirty_pages()" helper
function, and the merge result will look half-way decent.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk. It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.
writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
I had to go back to a 2.6.20 tree to work out why we're adding a
number-of-inodes into a number-of-pages count. Restore the lost comment.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not a user expected behavior. And it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.
Let's remove the internal bound.
At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case. This allows applications to proceed when
dirty+writeback pages are all cleaned.
And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
it is an aesthetic improvement as well as a functional one :)
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Proposed-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
writeback lists. Currenly this can't happen, but once we move from
inode_lock to more fine grained locking we can have an inode that's
still on the writeback lists but has I_FREEING set, and we absolutely
need to skip it here, just like we do for all other inode list walks.
Based on a patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation. A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We currently use struct backing_dev_info for various different purposes.
Originally it was introduced to describe a backing device which includes
an unplug and congestion function and various bits of readahead information
and VM-relevant flags. We're also using for tracking dirty inodes for
writeback.
To make writeback properly find all inodes we need to only access the
per-filesystem backing_device pointed to by the superblock in ->s_bdi
inside the writeback code, and not the instances pointeded to by
inode->i_mapping->backing_dev which can be overriden by special devices
or might not be set at all by some filesystems.
Long term we should split out the writeback-relevant bits of struct
backing_device_info (which includes more than the current bdi_writeback)
and only point to it from the superblock while leaving the traditional
backing device as a separate structure that can be overriden by devices.
The one exception for now is the block device filesystem which really
wants different writeback contexts for it's different (internal) inodes
to handle the writeout more efficiently. For now we do this with
a hack in fs-writeback.c because we're so late in the cycle, but in
the future I plan to replace this with a superblock method that allows
for multiple writeback contexts per filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Inodes of devices such as /dev/zero can get dirty for example via
utime(2) syscall or due to atime update. Backing device of such inodes
(zero_bdi, etc.) is however unable to handle dirty inodes and thus
__mark_inode_dirty complains. In fact, inode should be rather dirtied
against backing device of the filesystem holding it. This is generally a
good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
in these pseudofilesystems are referenced from ordinary filesystem
inodes and carry mapping with real data of the device. Thus for these
inodes we have to use inode->i_mapping->backing_dev_info as we did so
far. We distinguish these filesystems by checking whether sb->s_bdi
points to a non-trivial backing device or not.
Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
There's a device inode A described by a path "/dev/sdb" on this
filesystem. This inode will be dirtied against backing device "8:0"
after this patch. bdev filesystem contains block device inode B coupled
with our inode A. When someone modifies a page of /dev/sdb, it's B that
gets dirtied and the dirtying happens against the backing device "8:16".
Thus both inodes get filed to a correct bdi list.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Setting the task state here may cause us to miss the wake up from
kthread_stop(), so we need to recheck kthread_should_stop() or risk
sleeping forever in the following schedule().
Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create
multiple mounts in a temporary namespace while traversing the mount
path, and since the temporary namespace is immediately destroyed, it may
end up destroying a mount very soon after it was created, possibly
making this race more likely.)
INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000
ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
Call Trace:
[<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0
[<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0
[<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0
[<ffffffff8195fc80>] wait_for_common+0x120/0x190
[<ffffffff81033c70>] ? default_wake_function+0x0/0x20
[<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20
[<ffffffff810595fa>] kthread_stop+0x4a/0x150
[<ffffffff81061a60>] ? thaw_process+0x70/0x80
[<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0
[<ffffffff81229dc9>] nfs_put_super+0x19/0x20
[<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0
[<ffffffff810ee9b6>] kill_anon_super+0x16/0x60
[<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90
[<ffffffff810eda45>] deactivate_locked_super+0x45/0x60
[<ffffffff810edfb9>] deactivate_super+0x49/0x70
[<ffffffff81108294>] mntput_no_expire+0x84/0xe0
[<ffffffff811084ef>] release_mounts+0x9f/0xc0
[<ffffffff81108575>] put_mnt_ns+0x65/0x80
[<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420
[<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0
[<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360
[<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0
[<ffffffff810ede92>] do_kern_mount+0x52/0x130
[<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170
[<ffffffff81108e9e>] do_mount+0x26e/0x7f0
[<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190
[<ffffffff811094b8>] sys_mount+0x98/0xf0
[<ffffffff810024d8>] system_call_fastpath+0x16/0x1b
1 lock held by mount.nfs4/4314:
#0: (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Commit 83ba7b071f ("writeback: simplify the write back thread queue")
broke writeback_in_progress() as in that commit we started to remove work
items from the list at the moment we start working on them and not at the
moment they are finished. Thus if the flusher thread was doing some work
but there was no other work queued, writeback_in_progress() returned
false. This could in particular cause unnecessary queueing of background
writeback from balance_dirty_pages() or writeout work from
writeback_sb_if_idle().
This patch fixes the problem by introducing a bit in the bdi state which
indicates that the flusher thread is processing some work and uses this
bit for writeback_in_progress() test.
NOTE: Both callsites of writeback_in_progress() (namely,
writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
need a different information than what writeback_in_progress() provides.
They would need to know whether *the kind of writeback they are going to
submit* is already queued. But this information isn't that simple to
provide so let's fix writeback_in_progress() for the time being.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unify the logic for kupdate and non-kupdate cases. There won't be
starvation because the inodes requeued into b_more_io will later be
spliced _after_ the remaining inodes in b_io, hence won't stand in the way
of other inodes in the next run.
It avoids unnecessary redirty_tail() calls, hence the update of
i_dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or may exclude the inode from
the data integrity sync operation (which checks timestamp to avoid extra
work and livelock).
===
How the redirty_tail() comes about:
It was a long story.. This redirty_tail() was introduced with
wbc.more_io. The initial patch for more_io actually does not have the
redirty_tail(), and when it's merged, several 100% iowait bug reports
arised:
reiserfs:
http://lkml.org/lkml/2007/10/23/93
jfs:
commit 29a424f283
JFS: clear PAGECACHE_TAG_DIRTY for no-write pages
ext2:
http://www.spinics.net/linux/lists/linux-ext4/msg04762.html
They are all old bugs hidden in various filesystems that become "visible"
with the more_io patch. At the time, the ext2 bug is thought to be
"trivial", so not fixed. Instead the following updated more_io patch with
redirty_tail() is merged:
http://www.spinics.net/linux/lists/linux-ext4/msg04507.html
This will in general prevent 100% on ext2 and possibly other unknown FS bugs.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid delaying writeback for an expire inode with lots of dirty pages, but
no active dirtier at the moment. Previously we only do that for the
kupdate case.
Any filesystem that does delayed allocation or unwritten extent conversion
after IO completion will cause this - for example, XFS.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
write out some huge inode for too long while starving writeout of other
inodes. To avoid livelocks, we record time we started writeback in
wbc->wb_start and do not write out inodes which were dirtied after this
time. But currently, writeback_inodes_wb() resets wb_start each time it
is called thus effectively invalidating this logic and making any
WB_SYNC_NONE writeback prone to livelocks.
This patch makes sure wb_start is set only once when we start writeback.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information. As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>