Nothing uses prepare_write or commit_write. Remove them from the tree
completely.
[akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not". This patch also adds BUG_ON() to
mem_cgroup_uncharge_cache_page();
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
trylock_page, unlock_page open and close a critical section. Hence,
we can use the lock bitops to get the desired memory ordering.
Also, mark trylock as likely to succeed (and remove the annotation from
callers).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.
The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.
This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.
[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.
This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.
However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.
Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.
Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dio write returns EIO when try_to_release_page fails because bh is
still referenced.
The patch
commit 3f31fddfa2
Author: Mingming Cao <cmm@us.ibm.com>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.
I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.
The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race, freeing
buffer(dio) and ext3/4_ordered_writepage.
: background_writeout()
->write_cache_pages()
->ext3_ordered_writepage()
walk_page_buffers() -> take a bh ref
block_write_full_page() -> unlock_page
: <- end_page_writeback
: <- race! (dio write->try_to_release_page fails)
walk_page_buffers() ->release a bh ref
ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.
To fix this race, I used the approach of falling back to buffered
writes if try_to_release_page() fails on a page.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).
This also facilitates lockdeping of page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The iov_iter_advance() function would look at the iov->iov_len entry
even though it might have iterated over the whole array, and iov was
pointing past the end. This would cause DEBUG_PAGEALLOC to trigger a
kernel page fault if the allocation was at the end of a page, and the
next page was unallocated.
The quick fix is to just change the order of the tests: check that there
is any iovec data left before we check the iov entry itself.
Thanks to Alexey Dobriyan for finding this case, and testing the fix.
Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.
Clean up callers by passing in a file instead of a dentry.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on the
read-side).
The only atomicity changes this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages. In
practice, this atomicity guarantee is not used anyway, and it is to
replace individual lookups, so these semantics are natural.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.
This can be done if get_page/put_page behaves consistently throughout the
whole tree (ie. if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).
Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page. We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.
This patch introduces the core locking protocol to the pagecache (ie.
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).
Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.
[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk. If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller. We have seen the directo IO failed
due to this race. Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()->journal_try_to_free_buffers().
With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As akpm points out, there's really no need for generic_file_aio_read to
make a special case of count 0: just loop through nr_segs doing nothing.
And as Harvey Harrison points out, there's no need to reset retval to 0
where it's already 0.
Setting count (or ocount) to 0 before calling generic_segment_checks is
unnecessary too; but reluctantly I'll leave that removal to someone with a
wider range of gcc versions to hand - 4.1.2 and 4.2.1 don't warn about it,
but perhaps others do - I forget which are the warniest versions.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Tested-by: Lawrence Greenfield <leg@google.com>
Cc: Christoph Rohland <hans-christoph.rohland@sap.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_file_direct_IO is a common helper around the invocation of
->direct_IO. But there's almost nothing shared between the read and write
side, so we're better off without this helper.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make filemap_fdatawrite_range() function public, so that it can later
be used in ordered mode rewrite by JBD/JBD2.
Signed-off-by: Jan Kara <jack@suse.cz>
filemap_fault will go into an infinite loop if ->readpage() fails
asynchronously.
AFAICS the bug was introduced by this commit, which removed the wait after the
final readpage:
commit d00806b183
Author: Nick Piggin <npiggin@suse.de>
Date: Thu Jul 19 01:46:57 2007 -0700
mm: fix fault vs invalidate race for linear mappings
Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
the page is up-to-date before jumping back to the beginning of the function.
I've noticed this while testing nfs exporting on fuse. The patch
fixes it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_file_splice_write() duplicates remove_suid() just because it
doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
anyway, so this is rather pointless.
Move locking to generic_file_splice_write() and call remove_suid() and
__splice_from_pipe() instead.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Clean up messy conditional calling of test_clear_page_writeback() from both
rotate_reclaimable_page() and end_page_writeback().
The only user of rotate_reclaimable_page() is end_page_writeback() so this is
OK.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array. Fix this by checking against
i->count before we skip a zero-length iov.
The bug was reproduced with a test program that continually randomly creates
iovs to writev. The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: "Kevin Coffman" <kwc@citi.umich.edu>
Cc: "Alexey Dobriyan" <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>