Commit Graph

6044 Commits

Author SHA1 Message Date
David Rientjes
1e11ad8dc4 mm, oom: fix badness score underflow
If the privileges given to root threads (3% of allowable memory) or a
negative value of /proc/pid/oom_score_adj happen to exceed the amount of
rss of a thread, its badness score overflows as a result of commit
a7f638f999 ("mm, oom: normalize oom scores to oom_score_adj scale only
for userspace").

Fix this by making the type signed and return 1, meaning the thread is
still eligible for kill, if the value is negative.

Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-08 15:07:35 -07:00
Hugh Dickins
0142ef6cdc shmem: replace_page must flush_dcache and others
Commit bde05d1ccd ("shmem: replace page if mapping excludes its zone")
is not at all likely to break for anyone, but it was an earlier version
from before review feedback was incorporated.  Fix that up now.

* shmem_replace_page must flush_dcache_page after copy_highpage [akpm]
* Expand comment on why shmem_unuse_inode needs page_swapcount [akpm]
* Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong]
* Check page_private matches swap before calling shmem_replace_page [hughd]
* shmem_replace_page allow for unexpected race in radix_tree lookup [hughd]

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephane Marchesin <marcheu@chromium.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Rob Clark <rob.clark@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:54 -07:00
Linus Torvalds
99becf1328 Pull 'for-linus' branches of git://git.kernel.org/pub/scm/linux/kernel/git/viro/{signal,vfs}
Pull signal and vfs compile breakage fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  fixups for signal breakage

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nommu: fix compilation of nommu.c
2012-06-04 15:09:08 -07:00
Greg Ungerer
ad1ed2937e nommu: fix compilation of nommu.c
Compiling 3.5-rc1 for nommu targets gives:

  CC      mm/nommu.o
mm/nommu.c: In function ‘sys_mmap_pgoff’:
mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function)
mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in

It is trivially fixed by replacing 'ret' with the local variable that is
already defined for the return value 'retval'.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-04 17:17:31 -04:00
Linus Torvalds
a3fe778c78 Merge tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm
Pull frontswap feature from Konrad Rzeszutek Wilk:
 "Frontswap provides a "transcendent memory" interface for swap pages.
  In some environments, dramatic performance savings may be obtained
  because swapped pages are saved in RAM (or a RAM-like device) instead
  of a swap disk.  This tag provides the basic infrastructure along with
  some changes to the existing backends."

Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.

This pull request came in before the merge window even opened, it got
delayed to after the merge window by me just wanting to make sure it had
actual users.  Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and OpenSUSE
users.

Also acked by Rik van Riel, and Konrad points to other people liking it
too.  So in it goes.

By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
  frontswap: s/put_page/store/g s/get_page/load
  MAINTAINER: Add myself for the frontswap API
  mm: frontswap: config and doc files
  mm: frontswap: core frontswap functionality
  mm: frontswap: core swap subsystem hooks and headers
  mm: frontswap: add frontswap header file
2012-06-04 12:28:45 -07:00
Linus Torvalds
68e3e92620 Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"
This reverts commit 5ceb9ce6fe.

That commit seems to be the cause of the mm compation list corruption
issues that Dave Jones reported.  The locking (or rather, absense
there-of) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.

So revert it for now, we can re-visit this for 3.6.  If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03 20:05:57 -07:00
Hugh Dickins
752dc185da mm: fix warning in __set_page_dirty_nobuffers
New tmpfs use of !PageUptodate pages for fallocate() is triggering the
WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
is called from migrate_page_copy() for compaction.

It is anomalous that migration should use __set_page_dirty_nobuffers()
on an address_space that does not participate in dirty and writeback
accounting; and this has also been observed to insert surprising dirty
tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

We should probably give migrate_page_copy() a better way to preserve the
tag and migrate accounting info, when mapping_cap_account_dirty().  But
that needs some more work: so in the interim, avoid the warning by using
a simple SetPageDirty on PageSwapBacked pages.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03 20:05:47 -07:00
Linus Torvalds
af4f8ba31a Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
Pull slab updates from Pekka Enberg:
 "Mainly a bunch of SLUB fixes from Joonsoo Kim"

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slub: use __SetPageSlab function to set PG_slab flag
  slub: fix a memory leak in get_partial_node()
  slub: remove unused argument of init_kmem_cache_node()
  slub: fix a possible memory leak
  Documentations: Fix slabinfo.c directory in vm/slub.txt
  slub: fix incorrect return type of get_any_partial()
2012-06-01 16:50:23 -07:00
Linus Torvalds
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Josef Bacik
c3b2da3148 fs: introduce inode operation ->update_time
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail.  We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates.  So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made.  The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.

I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01 12:07:25 -04:00
Al Viro
17d1587f55 unexport do_munmap()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:18 -04:00
Al Viro
eb36c5873b new helper: vm_mmap_pgoff()
take it to mm/util.c, convert vm_mmap() to use of that one and
take it to mm/util.c as well, convert both sys_mmap_pgoff() to
use of vm_mmap_pgoff()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:18 -04:00
Al Viro
dc982501d9 kill do_mmap() completely
just pull into vm_mmap()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:17 -04:00
Al Viro
e3fc629d7b switch aio and shm to do_mmap_pgoff(), make do_mmap() static
after all, 0 bytes and 0 pages is the same thing...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:17 -04:00
Al Viro
9ac4ed4bd0 move security_mmap_addr() to saner place
it really should be done by get_unmapped_area(); that cuts down on
the amount of callers considerably and it's the right place for
that stuff anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:16 -04:00
Al Viro
8b3ec6814c take security_mmap_file() outside of ->mmap_sem
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:01 -04:00
Linus Torvalds
08615d7d85 Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:

 - the "misc" tree - stuff from all over the map

 - checkpatch updates

 - fatfs

 - kmod changes

 - procfs

 - cpumask

 - UML

 - kexec

 - mqueue

 - rapidio

 - pidns

 - some checkpoint-restore feature work.  Reluctantly.  Most of it
   delayed a release.  I'm still rather worried that we don't have a
   clear roadmap to completion for this work.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
  kconfig: update compression algorithm info
  c/r: prctl: add ability to set new mm_struct::exe_file
  c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
  c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  syscalls, x86: add __NR_kcmp syscall
  fs, proc: introduce /proc/<pid>/task/<tid>/children entry
  sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
  aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
  eventfd: change int to __u64 in eventfd_signal()
  fs/nls: add Apple NLS
  pidns: make killed children autoreap
  pidns: use task_active_pid_ns in do_notify_parent
  rapidio/tsi721: add DMA engine support
  rapidio: add DMA engine support for RIO data transfers
  ipc/mqueue: add rbtree node caching support
  tools/selftests: add mq_perf_tests
  ipc/mqueue: strengthen checks on mqueue creation
  ipc/mqueue: correct mq_attr_ok test
  ipc/mqueue: improve performance of send/recv
  selftests: add mq_open_tests
  ...
2012-05-31 18:10:18 -07:00
Christopher Yeoh
ac34ebb3a6 aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
changes made to support CMA in an earlier patch.

Rather than having an additional check_access parameter to these
functions, the first paramater type is overloaded to allow the caller to
specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
are valid, but do not check the memory that they point to.  This is used
by process_vm_readv/writev where we need to validate that a iovec passed
to the syscall is valid but do not want to check the memory that it points
to at this point because it refers to an address space in another process.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Al Viro
e5467859f7 split ->file_mmap() into ->mmap_addr()/->mmap_file()
... i.e. file-dependent and address-dependent checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-31 13:11:54 -04:00
Al Viro
cf74d14c4f unexport do_mmap()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-30 21:04:57 -04:00
Al Viro
63a81db132 merge do_mremap() into sys_mremap()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-30 21:04:57 -04:00
Cong Wang
3ed37648e1 fs: move file_remove_suid() to fs/inode.c
file_remove_suid() is a generic function operates on struct file,
it almost has no relations with file mapping, so move it to fs/inode.c.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-30 21:04:52 -04:00
Dave Hansen
4523e14585 mm: fix vma_resv_map() NULL pointer
hugetlb_reserve_pages() can be used for either normal file-backed
hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
mode, there is not a VMA around.  The new call to resv_map_put() assumed
that there was, and resulted in a NULL pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  IP: vma_resv_map+0x9/0x30
  PGD 141453067 PUD 1421e1067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  ...
  Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
  RIP: vma_resv_map+0x9/0x30
  ...
  Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
  Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

This was reported by Dave Jones, but was reproducible with the
libhugetlbfs test cases, so shame on me for not running them in the
first place.

With this, the oops is gone, and the output of libhugetlbfs's
run_tests.py is identical to plain 3.4 again.

[ Marked for stable, since this was introduced by commit c50ac05081
  ("hugetlb: fix resv_map leak in error path") which was also marked for
  stable ]

Reported-by: Dave Jones <davej@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>        [2.6.32+]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-30 08:48:13 -07:00
Al Viro
b0b0382bb4 ->encode_fh() API change
pass inode + parent's inode or NULL instead of dentry + bool saying
whether we want the parent or not.

NOTE: that needs ceph fix folded in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-29 23:28:33 -04:00
Glauber Costa
3f13461939 memcg: decrement static keys at real destroy time
We call the destroy function when a cgroup starts to be removed, such as
by a rmdir event.

However, because of our reference counters, some objects are still
inflight.  Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference, some
objects will still have charges, but the code to properly uncharge them
won't be run.

This becomes a problem specially if it is ever enabled again, because now
new charges will be added to the staled charges making keeping it pretty
much impossible.

We just need to be careful with the static branch activation: since there
is no particular preferred order of their activation, we need to make sure
that we only start using it after all call sites are active.  This is
achieved by having a per-memcg flag that is only updated after
static_key_slow_inc() returns.  At this time, we are sure all sites are
active.

This is made per-memcg, not global, for a reason: it also has the effect
of making socket accounting more consistent.  The first memcg to be
limited will trigger static_key() activation, therefore, accounting.  But
all the others will then be accounted no matter what.  After this patch,
only limited memcgs will have its sockets accounted.

[akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                            document enum sock_flag_bits,
                            convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                            redo tcp_update_limit() comment to 80 cols]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:28 -07:00