Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead. But there are many more and it's impractical to annotate
them all.
First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well. This simplifies the code quite a bit for
added bonus.
Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts. If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.
Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These VM_<READfoo> macros aren't used very often and three of them
aren't used at all.
Expand the ones that are used in-place, and remove all the now unused
#define VM_<foo> macros.
VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added just
before 2.4 and appears have never been used.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
There are two convenient ways to report errors to userspace
1) retun error to original syscall for example write(2)
2) mark mapping with error flag and return it on later fsync(2)
Second one is broken if (mapping->nrpages == 0) This is real-life
situation because after error pages are likey to be truncated or
invalidated.
We have to return an error regardless to number of pages in the mapping.
#Original testcase: git@github.com:dmonakhov/xfstests.git
MOUNT_OPTIONS="-b1024"
./check shared/305
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the events API to trace filemap loading and unloading of file pieces
into the page cache.
This patch aims at tracing the eviction reload cycle of executable and
shared libraries pages in a memory constrained environment.
The typical usage is to spot a specific device and inode (for example
/lib/libc.so) to see the eviction cycles, and find out if frequently
used code is rather spread across many pages (bad) or coallesced (good).
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function does not modify iov_iter which 'i' points to.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
.fault now can retry. The retry can break state machine of .fault. In
filemap_fault, if page is miss, ra->mmap_miss is increased. In the second
try, since the page is in page cache now, ra->mmap_miss is decreased. And
these are done in one fault, so we can't detect random mmap file access.
Add a new flag to indicate .fault is tried once. In the second try, skip
ra->mmap_miss decreasing. The filemap_fault state machine is ok with it.
I only tested x86, didn't test other archs, but looks the change for other
archs is obvious, but who knows :)
Signed-off-by: Shaohua Li <shaohua.li@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
CC: Li Shaohua <shli@fusionio.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Buffered write(2) is not directly tied to IO, so it's not suitable to
handle plug in generic_file_aio_write().
Note that plugging for O_SYNC writes is also removed. The user may pass
arbitrary @size arguments, which may be much larger than the preferable
I/O size, or may cross extent/device boundaries. Let the lower layers
handle the plugging. The plugging code here actually turns them into
no-ops.
CC: Li Shaohua <shli@fusionio.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are several entry points which dirty pages in a filesystem. mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().
->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.
Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"
Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.
I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
file_remove_suid() is a generic function operates on struct file,
it almost has no relations with file mapping, so move it to fs/inode.c.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace radix_tree_gang_lookup_slot() and
radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with
brand-new radix-tree direct iterating. This avoids the double-scanning
and pointer copying.
Iterator don't stop after nr_pages page-get fails in a row, it continue
lookup till the radix-tree end. Thus we can safely remove these restart
conditions.
Unfortunately, old implementation didn't forbid nr_pages == 0, this corner
case does not fit into new code, so the patch adds an extra check at the
beginning.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>