The memory.numa_stat file was not hierarchical. Memory charged to the
children was not shown in parent's numa_stat.
This change adds the "hierarchical_" stats to the existing stats. The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.
Tested: Create cgroup a, a/b and run workload under b. The values of
b are included in the "hierarchical_*" under a.
$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b
Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &
The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat. The units are in
bytes.
This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.
The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_charge_common() is invoked as the entry point for cgroup limits
charge rather than mem_cgroup_charge(), as the later has been removed for
years. Update the cgroup/memory.txt to reflect this change.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_badness() takes a totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).
This is usually correct but since fe35004fbf ("mm: avoid swapping out
with swappiness==0") we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.
The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While reading through Documentation/cgroups/memory.txt, I found a number
of minor wordos and typos. The patch below is a conservative handling of
some of these: it provides just a number of "obviously correct" fixes to
the English that improve the readability of the document somewhat.
Obviously some more significant fixes need to be made to the document, but
some of those may not be in the "obvious correct" category.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently, at removal of cgroup, ->pre_destroy() is called and moves
charges to the parent cgroup. A major reason for returning -EBUSY from
->pre_destroy() is that the 'moving' hits the parent's resource
limitation. It happens only when use_hierarchy=0.
Considering use_hierarchy=0, all cgroups should be flat. So, no one
cannot justify moving charges to parent...parent and children are in flat
configuration, not hierarchical.
This patch modifes the code to move charges to the root cgroup at
rmdir/force_empty if use_hierarchy==0. This will much simplify rmdir()
and reduce error in ->pre_destroy.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes memcg's behavior at task_move().
At task_move(), the kernel scans a task's page table and move the changes
for mapped pages from source cgroup to target cgroup. There has been a
bug at handling shared anonymous pages for a long time.
Before patch:
- The spec says 'shared anonymous pages are not moved.'
- The implementation was 'shared anonymoys pages may be moved'.
If page_mapcount <=2, shared anonymous pages's charge were moved.
After patch:
- The spec says 'all anonymous pages are moved'.
- The implementation is 'all anonymous pages are moved'.
Considering usage of memcg, this will not affect user's experience.
'shared anonymous' pages only exists between a tree of processes which
don't do exec(). Moving one of process without exec() seems not sane.
For example, libcgroup will not be affected by this change. (Anyway, no
one noticed the implementation for a long time...)
Below is a discussion log:
- current spec/implementation are complex
- Now, shared file caches are moved
- It adds unclear check as page_mapcount(). To do correct check,
we should check swap users, etc.
- No one notice this implementation behavior. So, no one get benefit
from the design.
- In general, once task is moved to a cgroup for running, it will not
be moved....
- Finally, we have control knob as memory.move_charge_at_immigrate.
Here is a patch to allow moving shared pages, completely. This makes
memcg simpler and fix current broken code.
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit e5671dfae5.
After a follow up discussion with Michal, it was agreed it would
be better to leave the kmem controller with just the tcp files,
deferring the behavior of the other general memory.kmem.* files
for a later time, when more caches are controlled. This is because
generic kmem files are not used by tcp accounting and it is
not clear how other slab caches would fit into the scheme.
We are reverting the original commit so we can track the reference.
Part of the patch is kept, because it was used by the later tcp
code. Conflicts are shown in the bottom. init/Kconfig is removed from
the revert entirely.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
CC: Johannes Weiner <jweiner@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/cgroups/memory.txt
mm/memcontrol.c
Signed-off-by: David S. Miller <davem@davemloft.net>