The numbers of ptes and swap entries are used in the oom killer's badness
heuristic, so they should be shown in the tasklist dump.
This patch adds those fields and replaces the cpu and oom_adj values that
are currently emitted. Cpu isn't interesting and oom_adj is deprecated and
will be removed later this year; the same information is already displayed
as oom_score_adj, which is used internally.
At the same time, make the documentation a little clearer by stating that
this information helps determine why the oom killer chose the task it did
to kill.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/sys/vm/oom_kill_allocating_task will immediately kill current when
the oom killer is called to avoid a potentially expensive tasklist scan
for large systems.
Currently, however, it is not checking current's oom_score_adj value which
may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
killing.
This patch avoids killing current in such a condition and simply falls
back to the tasklist scan since memory still needs to be freed.
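The decision described above can be sketched as a small userspace model.
This is only an illustration, not the kernel code: struct task_model and
can_kill_allocating_task() are hypothetical names, and the only value taken
from the kernel is OOM_SCORE_ADJ_MIN (-1000).

```c
#include <stdbool.h>

/* Same value as the kernel's lower bound for oom_score_adj. */
#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical stand-in for the task fields that matter here. */
struct task_model {
	int pid;
	int oom_score_adj;
};

/*
 * Decide whether the allocating task may be killed directly.
 * Returns false when the sysctl is off or when the task is exempt
 * from oom killing; the caller then falls back to the tasklist scan.
 */
static bool can_kill_allocating_task(const struct task_model *current_task,
				     bool oom_kill_allocating_task)
{
	if (!oom_kill_allocating_task)
		return false;
	return current_task->oom_score_adj != OOM_SCORE_ADJ_MIN;
}
```

A task with oom_score_adj set to OOM_SCORE_ADJ_MIN is never selected by this
shortcut, even when the sysctl is enabled.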
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Wong reported that his test suite failed when /tmp is tmpfs.
https://lkml.org/lkml/2012/2/24/479
Currently the input check of POSIX_FADV_WILLNEED has two problems.
- requires a_ops->readpage. But in fact, force_page_cache_readahead()
requires that the target filesystem has either ->readpage or ->readpages.
- returns -EINVAL when the filesystem doesn't have ->readpage. But
POSIX says that fadvise is merely a hint. Thus fadvise() should return
0 if the filesystem has no means of implementing fadvise(). The userland
application should neither know nor care which type of filesystem backs the
TMPDIR directory, as Eric pointed out. There is nothing userspace can do
to solve this error.
So change the return value to 0 when the filesystem doesn't support
readahead.
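From the userspace side, the "merely a hint" argument looks roughly like the
sketch below. hint_willneed() is a hypothetical helper, not part of any
library. It swallows EINVAL so that the caller behaves the same on kernels
with and without this change. posix_fadvise() returns the error number
directly rather than setting errno.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <fcntl.h>

/*
 * Ask the kernel to prefetch a whole file (offset 0, len 0 means "to the
 * end of the file"). The advice is only a hint, so EINVAL, which older
 * kernels return for filesystems without readahead support, is treated
 * as success. Other errors such as EBADF are passed through.
 */
static int hint_willneed(int fd)
{
	int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	if (err == EINVAL)
		return 0;
	return err;
}
```

With the kernel returning 0 for filesystems that cannot do readahead, the
EINVAL special case becomes unnecessary for this situation.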
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Wong <normalperson@yhbt.net>
Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_COMPACTION is enabled, compaction_deferred() tries to
recalculate the deferred limit again, which isn't necessary.
When CONFIG_COMPACTION is disabled, compaction_deferred() should return
"true" or "false" since it has "bool" for its return value.
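Both points can be illustrated with a userspace model. This is a sketch
under assumptions, not the kernel source: struct zone_model is a
hypothetical subset of struct zone, and MODEL_NO_COMPACTION stands in for
the CONFIG_COMPACTION=n case.

```c
#include <stdbool.h>

/* Hypothetical subset of struct zone: only the deferral counters. */
struct zone_model {
	unsigned long compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
};

#ifndef MODEL_NO_COMPACTION
/* Returns true if compaction should be skipped this time. */
static bool compaction_deferred(struct zone_model *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	if (order < zone->compact_order_failed)
		return false;

	/* Clamp the counter so it cannot overflow. */
	if (++zone->compact_considered > defer_limit)
		zone->compact_considered = defer_limit;

	/* Reuse defer_limit instead of recomputing the shift. */
	return zone->compact_considered < defer_limit;
}
#else
/* Stub for the disabled case: a bool function returns a bool literal. */
static bool compaction_deferred(struct zone_model *zone, int order)
{
	(void)zone;
	(void)order;
	return true;
}
#endif
```

With compact_defer_shift set to 2, the model defers the first three
attempts and then allows compaction once the counter reaches the limit.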
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY
indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return
a bool to simplify the logic.
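The simplification on the caller side can be sketched with a toy model.
Everything here is hypothetical (struct lru_model, force_empty_list() and
drain() are illustration names). The sketch assumes that "true" means
"some entries could not be moved, retry".

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy list: 'busy' entries cannot be moved on the current pass. */
struct lru_model {
	size_t movable;
	size_t busy;
};

/* Move what can be moved; return true if the caller must retry. */
static bool force_empty_list(struct lru_model *lru)
{
	lru->movable = 0;
	return lru->busy != 0;
}

/* Caller: a plain loop on the bool replaces the -EBUSY comparison. */
static unsigned int drain(struct lru_model *lru)
{
	unsigned int retries = 0;

	while (force_empty_list(lru)) {
		/* Model: busy entries become movable before the next pass. */
		lru->movable += lru->busy;
		lru->busy = 0;
		retries++;
	}
	return retries;
}
```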
[akpm@linux-foundation.org: rework mem_cgroup_force_empty_list()'s comment]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()"
mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove
the -ENOMEM and -EINTR checks.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim
will occur when removing a memory cgroup. If -EINTR is returned here,
cgroup will show a warning.
We don't need to handle any user interruption signal. Remove this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer currently schedules away from current in an uninterruptible
sleep if it does not have access to memory reserves. However, it's possible
that current was killed because it shares memory with the oom killed thread
or because it was killed by the user in the interim.
This patch only schedules away from current if it does not have a pending
kill, i.e. if it does not share memory with the oom killed thread. It's
possible that it will immediately retry its memory allocation and fail,
but it will immediately be given access to memory reserves if it calls the
oom killer again.
This prevents the delay of memory freeing when threads that share memory
with the oom killed thread get unnecessarily scheduled.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A page's hugetlb cgroup assignment and movement to the active list should
occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup
we will iterate the active list and find pages with NULL hugetlb cgroup
values.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we fail to allocate pages from the reserve pool, hugetlb tries to
allocate huge pages using alloc_buddy_huge_page. Add these to the active
list. We also need to add the huge page we allocate when we soft offline
the oldpage to the active list.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement a new controller that allows us to control HugeTLB allocations.
The extension allows limiting HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that
the application will get a SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
many HugeTLB pages it would require for its use.
The charge/uncharge calls will be added to HugeTLB code in a later patch.
Support for cgroup removal will be added in later patches.
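The limit enforcement described above can be modeled in userspace as
follows. This is a sketch only: struct hugetlb_cgroup_model and both
helpers are hypothetical names. The mapping of a failed charge to SIGBUS
is an assumption about how the fault path would use such a result, since
the charge/uncharge hooks are not part of this patch.

```c
#include <errno.h>

/* Hypothetical per-cgroup counter, measured in huge pages. */
struct hugetlb_cgroup_model {
	unsigned long limit;
	unsigned long usage;
};

/*
 * Try to account nr_pages to the group. Nothing can be reclaimed to make
 * room, so exceeding the limit fails immediately. In this model the fault
 * handler would turn -ENOMEM into SIGBUS for the faulting task.
 */
static int hugetlb_charge(struct hugetlb_cgroup_model *cg,
			  unsigned long nr_pages)
{
	if (cg->usage > cg->limit || nr_pages > cg->limit - cg->usage)
		return -ENOMEM;
	cg->usage += nr_pages;
	return 0;
}

/* Return pages to the group, e.g. when a huge page is freed. */
static void hugetlb_uncharge(struct hugetlb_cgroup_model *cg,
			     unsigned long nr_pages)
{
	if (nr_pages > cg->usage)
		cg->usage = 0;
	else
		cg->usage -= nr_pages;
}
```

Because a failed charge cannot be retried after reclaim, an application has
to size its limit for its peak HugeTLB usage up front.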
[akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
[akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g]
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>