This fourth patch of eight in this cleancache series provides the
core hooks in VFS for: initializing cleancache per filesystem;
capturing clean pages reclaimed by page cache; attempting to get
pages from cleancache before filesystem read; and ensuring coherency
between pagecache, disk, and cleancache. Note that the placement
of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
change was required due to a patchset in 2.6.39.
All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
a check of a boolean global if CONFIG_CLEANCACHE is set but no
cleancache "backend" has claimed cleancache_ops.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: simplify iget & friends
fs: pull inode->i_lock up out of writeback_single_inode
fs: rename inode_lock to inode_hash_lock
fs: move i_wb_list out from under inode_lock
fs: move i_sb_list out from under inode_lock
fs: remove inode_lock from iput_final and prune_icache
fs: Lock the inode LRU list separately
fs: factor inode disposal
fs: protect inode->i_state with inode->i_lock
autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
autofs4 - remove autofs4_lock
autofs4 - fix d_manage() return on rcu-walk
autofs4 - fix autofs4_expire_indirect() traversal
autofs4 - fix dentry leak in autofs4_expire_direct()
autofs4 - reinstate last used update on access
vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.
This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.
Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
truncate_inode_pages_range() - stop looking further when it returns 0.
But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
especially if we have preemptible RCU enabled, isn't it conceivable that
all 14 pages returned could be removed from the page cache by
shrink_page_list(), before find_get_pages() gets to process them? So
causing it to return 0 although there may be plenty more pages beyond.
Make find_get_pages() and find_get_pages_tag() check for this unlikely
case, and restart should it occur; but callers of find_get_pages_contig()
have no such expectation, it's okay for that to return 0 early.
I have not seen this in practice, just worried by the possibility.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Salman Qazi <sqazi@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The radix_tree_deref_retry() case in find_get_pages() has a strange little
excrescence, not seen in the other gang lookups: it looks like the start
of an abandoned attempt to guarantee forward progress in a case that
cannot arise.
ret should always be 0 here: if it isn't, then going back to restart will
leak references to pages already gotten. There used to be a comment
saying nr_found is necessarily 1 here: that's not quite true, but the
radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
race with it being moved out of the radix_tree root or back.
Remove the worrisome two lines, add a brief comment here and in
find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
find_get_pages() should it ever be seen elsewhere than at 0.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Salman Qazi <sqazi@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
requiring that callers notice it and a comment explaining why they release
a page reference. It's not a good API.
A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from page cache and doing the page_cache_release(). But now
the situation is changed. I think at least things in current mainline
don't have any obstacles. The problem is for out-of-mainline filesystems
- if they have done such things as reiser4, this patch could be a problem
but they will discover this at compile time since we remove
remove_from_page_cache().
This patch:
This function works as just wrapper remove_from_page_cache(). The
difference is that it decreases page references in itself. So caller have
to make sure it has a page reference before calling.
This patch is ready for removing remove_from_page_cache().
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function basically does:
remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);
Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.
If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.
This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.
[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GUP user may want to try to acquire a reference to a page if it is already
in memory, but not if IO, to bring it in, is needed. For example KVM may
tell vcpu to schedule another guest process if current one is trying to
access swapped out page. Meanwhile, the page will be swapped in and the
guest process, that depends on it, will be able to run again.
This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
instead.
[akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206
Adding a trace_printk() and running the function tracer limited to
just this function I can see:
gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8
gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9
gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10
gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11
gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin
Which shows that a lot of calls from ext3_write_begin will result in the
page returned by "find_lock_page" will be NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Temporary IO failures, eg. due to loss of both multipath paths, can
permanently leave the PageError bit set on a page, resulting in msync or
fsync returning -EIO over and over again, even if IO is now getting to the
disk correctly.
We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
filemap_fdatawait_range function. Also clearing the PageError bit on the
page allows subsequent msync or fsync calls on this file to return without
an error, if the subsequent IO succeeds.
Unfortunately data written out in the msync or fsync call that returned
-EIO can still get lost, because the page dirty bit appears to not get
restored on IO error. However, the alternative could be potentially all
of memory filling up with uncleanable dirty pages, hanging the system, so
there is no nice choice here...
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Valerie Aurora <vaurora@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.
This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.
Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
[trondmy: cover the cases of invalidate_inode_pages2() and
truncate_inode_pages()]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Salman Qazi describes the following radix-tree bug:
In the following case, we get can get a deadlock:
0. The radix tree contains two items, one has the index 0.
1. The reader (in this case find_get_pages) takes the rcu_read_lock.
2. The reader acquires slot(s) for item(s) including the index 0 item.
3. The non-zero index item is deleted, and as a consequence the other item is
moved to the root of the tree. The place where it used to be is queued for
deletion after the readers finish.
3b. The zero item is deleted, removing it from the direct slot, it remains in
the rcu-delayed indirect node.
4. The reader looks at the index 0 slot, and finds that the page has 0 ref
count
5. The reader looks at it again, hoping that the item will either be freed or
the ref count will increase. This never happens, as the slot it is looking
at will never be updated. Also, this slot can never be reclaimed because
the reader is holding rcu_read_lock and is in an infinite loop.
The fix is to re-use the same "indirect" pointer case that requires a slot
lookup retry into a general "retry the lookup" bit.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reported-by: Salman Qazi <sqazi@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
ran into a NULL dereference in here:
int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
unsigned long from)
{
----> struct inode *inode = page->mapping->host;
It looks like page->mapping was the culprit. (xmon trace is below).
After closer examination, I realized that do_generic_file_read() does a
find_get_page(), and eventually locks the page before calling
block_is_partially_uptodate(). However, it doesn't revalidate the
page->mapping after the page is locked. So, there's a small window
between the find_get_page() and ->is_partially_uptodate() where the page
could get truncated and page->mapping cleared.
We _have_ a reference, so it can't get reclaimed, but it certainly
can be truncated.
I think the correct thing is to check page->mapping after the
trylock_page(), and jump out if it got truncated. This patch has been
running in the test environment for a month or so now, and we have not
seen this bug pop up again.
xmon info:
1f:mon> e
cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
sp: c0000002ae36f9f0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc000000378f99e30
paca = 0xc000000000f66300
pid = 21946, comm = bash
1f:mon> r
R00 = 0025c0500000006d R16 = 0000000000000000
R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
R02 = c000000000e8cd80 R18 = ffffffffffffffff
R03 = c0000000031d0f88 R19 = 0000000000000001
R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
R05 = 0000000000000000 R21 = c0000002ae36fa68
R06 = 0000000000000000 R22 = 0000000000000000
R07 = 0000000000000001 R23 = c0000002ae36fbb0
R08 = 0000000000000002 R24 = 0000000000000000
R09 = 0000000000000000 R25 = c000000362cd3a80
R10 = 0000000000000000 R26 = 0000000000000002
R11 = c0000000001e7b60 R27 = 0000000000000000
R12 = 0000000042000484 R28 = 0000000000000001
R13 = c000000000f66300 R29 = c0000003bb97b9b8
R14 = 0000000000000001 R30 = c000000000e28a08
R15 = 000000000000ffff R31 = c0000000031d0f88
pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
msr = 8000000000009032 cr = 22000488
ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
1f:mon> t
[link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
[c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
[c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
[c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
[c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
[c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 00000080a840bc54
SP (fffca15df30) is in userspace
1f:mon> di c0000000001e7a6c
c0000000001e7a6c e9290000 ld r9,0(r9)
c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
c0000000001e7a74 e9440008 ld r10,8(r4)
c0000000001e7a78 78a80020 clrldi r8,r5,32
c0000000001e7a7c 3c000001 lis r0,1
c0000000001e7a80 812900a8 lwz r9,168(r9)
c0000000001e7a84 39600001 li r11,1
c0000000001e7a88 7c080050 subf r0,r8,r0
c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
c0000000001e7a90 7d6b4830 slw r11,r11,r9
c0000000001e7a94 796b0020 clrldi r11,r11,32
c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
c0000000001e7aa0 7d004214 add r8,r0,r8
c0000000001e7aa4 79080020 clrldi r8,r8,32
c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <arunabal@in.ibm.com>
Cc: <sbest@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'end' shadows earlier one and is not necessary at all. Remove it and use
'pos' instead. This removes following sparse warnings:
mm/filemap.c:2180:24: warning: symbol 'end' shadows an earlier one
mm/filemap.c:2132:25: originally declared here
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file mapped VMAs.
It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
site wants mmap_sem to be released if blocking on a pending disk transfer.
In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
do_page_fault() will then re-acquire mmap_sem and retry the page fault.
It is expected that the retry will hit the same page which will now be
cached, and thus it will complete with a low mmap_sem hold time.
Tests:
- microbenchmark: thread A mmaps a large file and does random read accesses
to the mmaped area - achieves about 55 iterations/s. Thread B does
mmap/munmap in a loop at a separate location - achieves 55 iterations/s
before, 15000 iterations/s after.
- We are seeing related effects in some applications in house, which show
significant performance regressions when running without this change.
[akpm@linux-foundation.org: fix warning & crash]
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>