Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By this patch, when a transparent hugepage is charged, not only the head
page but also all the tail pages are committed, IOW pc->mem_cgroup and
pc->flags of tail pages are set.
Without this patch:
- Tail pages are not linked to any memcg's LRU at splitting. This causes many
problems, for example, the charged memcg's directory can never be rmdir'ed
because it doesn't have enough pages to scan to make the usage decrease to 0.
- "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in
root cgroup is calculated by the stat not by res_counter(since 2.6.32),
it would be incorrect too.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked.
But as commented in mem_cgroup_from_task(), mm->owner can be NULL
in some racy case. This check of VM_BUG_ON() is bad.
A possible story to hit this is at swapoff()->try_to_unuse(). It passes
mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we
can't get proper mem_cgroup from swap_cgroup information, mm->owner is used
as charge target and we see NULL.
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP
configuration option and then it is turned on by default. There is a boot
option (noswapaccount) which can disable this feature.
This makes it hard for distributors to enable the configuration option as
this feature leads to a bigger memory consumption and this is a no-go for
general purpose distribution kernel. On the other hand swap accounting
may be very usuful for some workloads.
This patch adds a new configuration option which controls the default
behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected
then the feature is turned on by default.
It also adds a new boot parameter swapaccount[=1|0] which enhances the
original noswapaccount parameter semantic by means of enable/disable logic
(defaults to 1 if no value is provided to be still consistent with
noswapaccount).
The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__mem_cgroup_try_charge() can be called under down_write(&mmap_sem)(e.g.
mlock does it). This means it can cause deadlock if it races with move charge:
Ex.1)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() | down_write(&mmap_sem)
mc.moving_task = current | ..
mem_cgroup_precharge_mc() | __mem_cgroup_try_charge()
mem_cgroup_count_precharge() | prepare_to_wait()
down_read(&mmap_sem) | if (mc.moving_task)
-> cannot aquire the lock | -> true
| schedule()
Ex.2)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() |
mc.moving_task = current |
mem_cgroup_precharge_mc() |
mem_cgroup_count_precharge() |
down_read(&mmap_sem) |
.. |
up_read(&mmap_sem) |
| down_write(&mmap_sem)
mem_cgroup_move_task() | ..
mem_cgroup_move_charge() | __mem_cgroup_try_charge()
down_read(&mmap_sem) | prepare_to_wait()
-> cannot aquire the lock | if (mc.moving_task)
| -> true
| schedule()
To avoid this deadlock, we do all the move charge works (both can_attach() and
attach()) under one mmap_sem section.
And after this patch, we set/clear mc.moving_task outside mc.lock, because we
use the lock only to check mc.from/to.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch extracts the core logic from mem_cgroup_update_file_mapped() as
mem_cgroup_update_file_stat() and adds a wrapper.
As a planned future update, memory cgroup has to count dirty pages to
implement dirty_ratio/limit. And more, the number of dirty pages is
required to kick flusher thread to start writeback. (Now, no kick.)
This patch is preparation for it and makes other statistics implementation
clearer. Just a clean up.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An event counter MEM_CGROUP_ON_MOVE is used for quick check whether file
stat update can be done in async manner or not. Now, it use percpu
counter and for_each_possible_cpu to update.
This patch replaces for_each_possible_cpu to for_each_online_cpu and adds
necessary synchronization logic at CPU HOTPLUG.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, memcgroup's per cpu coutner uses for_each_possible_cpu() to get the
value. It's better to use for_each_online_cpu() and a cpu hotplug
handler.
This patch only handles statistics counter. MEM_CGROUP_ON_MOVE will be
handled in another patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In memory cgroup management, we sometimes have to walk through
subhierarchy of cgroup to gather informaiton, or lock something, etc.
Now, to do that, mem_cgroup_walk_tree() function is provided. It calls
given callback function per cgroup found. But the bad thing is that it
has to pass a fixed style function and argument, "void*" and it adds much
type casting to memcontrol.c.
To make the code clean, this patch replaces walk_tree() with
for_each_mem_cgroup_tree(iter, root)
An iterator style call. The good point is that iterator call doesn't have
to assume what kind of function is called under it. A bad point is that
it may cause reference-count leak if a caller use "break" from the loop by
mistake.
I think the benefit is larger. The modified code seems straigtforward and
easy to read because we don't have misterious callbacks and pointer cast.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At accounting file events per memory cgroup, we need to find memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup() for guarantee
pc->mem_cgroup is not overwritten while we make use of it.
But, considering the context which page-cgroup for files are accessed,
we can use alternative light-weight mutual execusion in the most case.
At handling file-caches, the only race we have to take care of is "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. (See comment in the
patch)
Unlike charge/uncharge, "move" happens not so frequently. It happens only when
rmdir() and task-moving (with a special settings.)
This patch adds a race-checker for file-cache-status accounting v.s. account
moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move
1. Increment it before start moving
2. Call synchronize_rcu()
3. Decrement it after the end of moving.
By this, file-status-counting routine can check it needs to call
lock_page_cgroup(). In most case, I doesn't need to call it.
Following is a perf data of a process which mmap()/munmap 32MB of file cache
in a minute.
Before patch:
28.25% mmap mmap [.] main
22.64% mmap [kernel.kallsyms] [k] page_fault
9.96% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped
3.67% mmap [kernel.kallsyms] [k] filemap_fault
3.50% mmap [kernel.kallsyms] [k] unmap_vmas
2.99% mmap [kernel.kallsyms] [k] __do_fault
2.76% mmap [kernel.kallsyms] [k] find_get_page
After patch:
30.00% mmap mmap [.] main
23.78% mmap [kernel.kallsyms] [k] page_fault
5.52% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped
3.81% mmap [kernel.kallsyms] [k] unmap_vmas
3.26% mmap [kernel.kallsyms] [k] find_get_page
3.18% mmap [kernel.kallsyms] [k] __do_fault
3.03% mmap [kernel.kallsyms] [k] filemap_fault
2.40% mmap [kernel.kallsyms] [k] handle_mm_fault
2.40% mmap [kernel.kallsyms] [k] do_page_fault
This patch reduces memcg's cost to some extent.
(mem_cgroup_update_file_mapped is called by both of map/unmap)
Note: It seems some more improvements are required..but no idea.
maybe removing set/unset flag is required.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently memory cgroup accounts file-mapped by counter and flag. counter
is working in the same way with zone_stat but FileMapped flag only exists
in memcg (for helping move_account).
This flag can be updated wrongly in a case. Assume CPU0 and CPU1 and a
thread mapping a page on CPU0, another thread unmapping it on CPU1.
CPU0 CPU1
rmv rmap (mapcount 1->0)
add rmap (mapcount 0->1)
lock_page_cgroup()
memcg counter+1 (some delay)
set MAPPED FLAG.
unlock_page_cgroup()
lock_page_cgroup()
memcg counter-1
clear MAPPED flag
In the above sequence counter is properly updated but FLAG is not. This
means that representing a state by a flag which is maintained by counter
needs some special care.
To handle this, when clearing a flag, this patch check mapcount directly
and clear the flag only when mapcount == 0. (if mapcount >0, someone will
make it to zero later and flag will be cleared.)
Reverse case, dec-after-inc cannot be a problem because page_table_lock()
works well for it. (IOW, to make above sequence, 2 processes should touch
the same page at once with map/unmap.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, memory cgroup increments css(cgroup subsys state)'s reference count
per a charged page. And the reference count is kept until the page is
uncharged. But this has 2 bad effect.
1. Because css_get/put calls atomic_inc()/dec, heavy call of them
on large smp will not scale well.
2. Because css's refcnt cannot be in a state as "ready-to-release",
cgroup's notify_on_release handler can't work with memcg.
3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small.
This has been a problem since the 1st merge of memcg.
This is a trial to remove css's refcnt per a page. Even if we remove
refcnt, pre_destroy() does enough synchronization as
- check res->usage == 0.
- check no pages on LRU.
This patch removes css's refcnt per page. Even after this patch, at the
1st look, it seems css_get() is still called in try_charge().
But the logic is.
- If a memcg of mm->owner is cached one, consume_stock() will work.
At success, return immediately.
- If consume_stock returns false, css_get() is called and go to
slow path which may be blocked. At the end of slow path,
css_put() is called and restart from the start if necessary.
So, in the fast path, we don't call css_get() and can avoid access to
shared counter. This patch can make the most possible case fast.
Here is a result of multi-threaded page fault benchmark.
[Before]
25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c
9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm <=====(*)
7.83% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
5.38% multi-fault-all [kernel.kallsyms] [k] __css_put
5.29% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
4.92% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
4.24% multi-fault-all [kernel.kallsyms] [k] up_read
3.53% multi-fault-all [kernel.kallsyms] [k] css_put
2.11% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
1.76% multi-fault-all [kernel.kallsyms] [k] __rmqueue
1.64% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
[After]
28.41% multi-fault-all [kernel.kallsyms] [k] clear_page_c
10.08% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
9.58% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
9.38% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.86% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
5.65% multi-fault-all [kernel.kallsyms] [k] up_read
2.82% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
2.64% multi-fault-all [kernel.kallsyms] [k] mem_cgroup_add_lru_list
2.48% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch
removes css_tryget() in it. (But yes, this is an extreme case.)
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
don't have to call them in task_in_mem_cgroup().
- *mz is not used in __mem_cgroup_uncharge_common().
- we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
after we've cleared PCG_MIGRATION of @oldpage.
- remove empty comment.
- remove redundant empty line in mem_cgroup_cache_charge().
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, for checking a memcg is under task-account-moving, we do css_tryget()
against mc.to and mc.from. But this is just complicating things. This
patch makes the check easier.
This patch adds a spinlock to move_charge_struct and guard modification of
mc.to and mc.from. By this, we don't have to think about complicated
races arount this not-critical path.
[balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_try_charge() has a big loop in it and seems to be hard to read.
Most of routines are for slow path. This patch moves codes out from the
loop and make it clear what's done.
Summary:
- refactoring a function to detect a memcg is under acccount move or not.
- refactoring a function to wait for the end of moving task acct.
- refactoring a main loop('s slow path) as a function and make it clear
why we retry or quit by return code.
- add fatal_signal_pending() check for bypassing charge loops.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>