While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled. This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.
[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html
Ulrich said:
glibc needs right after startup a bit of random data for internal
protections (stack canary etc). What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it. For
every process startup. That's slow.
...
The solution is to provide a limited amount of random data to the
starting process in the aux vector. I suggested 16 bytes and this is
what the patch implements. If we need only 16 bytes or less we use the
data directly. If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose. It might not
be the same pool as that for /dev/urandom.
Concerns were expressed about the depletion of the randomness pool. But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently task_active_pid_ns is not safe to call after a task becomes a
zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
reading the pid namespace from the pid of the task we can trivially solve
this problem at the cost of one extra memory read in what should be the
same cacheline as we read the namespace from.
When moving things around I have made task_active_pid_ns out of line
because keeping it in pid_namespace.h would require adding includes of
pid.h and sched.h that I don't think we want.
This change does make task_active_pid_ns unsafe to call during
copy_process until we attach a pid on the task_struct which seems to be a
reasonable trade off.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Bastian Blank <bastian@waldi.eu.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A current problem with the pid namespace is that it is easy to do pid
related work after exit_task_namespaces which drops the nsproxy pointer.
However if we are doing pid namespace related work we are always operating
on some struct pid which retains the pid_namespace pointer of the pid
namespace it was allocated in.
So provide ns_of_pid which allows us to find the pid namespace a pid was
allocated in.
Using this we have the needed infrastructure to do pid namespace related
work at anytime we have a struct pid, removing the chance of accidentally
having a NULL pointer dereference when accessing current->nsproxy.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Bastian Blank <bastian@waldi.eu.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanups, use new cpumask API
Final trivial cleanups: mainly s/cpumask_t/struct cpumask
Note there is a FIXME in generate_sched_domains(). A future patch will
change struct cpumask *doms to struct cpumask *doms[].
(I suppose Rusty will do this.)
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mike Travis <travis@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add css_tryget(), that obtains a counted reference on a CSS. It is used
in situations where the caller has a "weak" reference to the CSS, i.e.
one that does not protect the cgroup from removal via a reference count,
but would instead be cleaned up by a destroy() callback.
css_tryget() will return true on success, or false if the cgroup is being
removed.
This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
with the difference that in the event of css_tryget() racing with a
cgroup_rmdir(), css_tryget() will only return false if the cgroup really
does get removed.
This implementation is done by biasing css->refcnt, so that a refcnt of 1
means "releasable" and 0 means "released or releasing". In the event of a
race, css_tryget() distinguishes between "released" and "releasing" by
checking for the CSS_REMOVED flag in css->flags.
Signed-off-by: Paul Menage <menage@google.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches introduce new locking/refcount support for cgroups to
reduce the need for subsystems to call cgroup_lock(). This will
ultimately allow the atomicity of cgroup_rmdir() (which was removed
recently) to be restored.
These three patches give:
1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
use to prevent changes to its own cgroup tree
2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
memory controller
3/3 - introduce a css_tryget() function similar to the one recently
proposed by Kamezawa, but avoiding spurious refcount failures in
the event of a race between a css_tryget() and an unsuccessful
cgroup_rmdir()
Future patches will likely involve:
- using hierarchy mutex in place of cgroup_lock() in more subsystems
where appropriate
- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()
This patch:
Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
the hierarchy observed by that subsystem. It is taken by the cgroup
subsystem (in addition to cgroup_mutex) for the following operations:
- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
bind() callback)
Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.
Signed-off-by: Paul Menage <menage@google.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, you can see following even when swap accounting is enabled.
1. Create Group 01, and 02.
2. allocate a "file" on tmpfs by a task under 01.
3. swap out the "file" (by memory pressure)
4. Read "file" from a task in group 02.
5. the charge of "file" is moved to group 02.
This is not ideal behavior. This is because SwapCache which was loaded
by read-ahead is not taken into account..
This is a patch to fix shmem's swapcache behavior.
- remove mem_cgroup_cache_charge_swapin().
- Add SwapCache handler routine to mem_cgroup_cache_charge().
By this, shmem's file cache is charged at add_to_page_cache()
with GFP_NOWAIT.
- pass the page of swapcache to shrink_mem_cgroup.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, inactive_ratio of memcg is calculated at setting limit.
because page_alloc.c does so and current implementation is straightforward
porting.
However, memcg introduced hierarchy feature recently. In hierarchy
restriction, memory limit is not only decided memory.limit_in_bytes of
current cgroup, but also parent limit and sibling memory usage.
Then, The optimal inactive_ratio is changed frequently. So, everytime
calculation is better.
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inactive_anon_is_low() is key component of active/inactive anon
balancing on reclaim. However current inactive_anon_is_low() function
only consider global reclaim.
Therefore, we need following ugly scan_global_lru() condition.
if (lru == LRU_ACTIVE_ANON &&
(!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
it cause that memcg reclaim always deactivate pages when shrink_list() is
called. To make mem_cgroup_inactive_anon_is_low() improve active/inactive
anon balancing of memcgroup.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
happen at memory reclaim.
But in recent discussion, it's NACKed because it sounds ugly.
This patch is for reverting it and add some clean up to gfp_mask of
callers of charge. No behavior change but need review before generating
HUNK in deep queue.
This patch also adds explanation to meaning of gfp_mask passed to charge
functions in memcontrol.h.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current mmtom has new oom function as pagefault_out_of_memory(). It's
added for select bad process rathar than killing current.
When memcg hit limit and calls OOM at page_fault, this handler called and
system-wide-oom handling happens. (means kernel panics if panic_on_oom is
true....)
To avoid overkill, check memcg's recent behavior before starting
system-wide-oom.
And this patch also fixes to guarantee "don't accnout against process with
TIF_MEMDIE". This is necessary for smooth OOM.
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for building hierarchies in resource counters. Cgroups allows
us to build a deep hierarchy, but we currently don't link the resource
counters belonging to the memory controller control groups, in the same
fashion as the corresponding cgroup entries in the cgroup hierarchy. This
patch provides the infrastructure for resource counters that have the same
hiearchy as their cgroup counter parts.
These set of patches are based on the resource counter hiearchy patches
posted by Pavel Emelianov.
NOTE: Building hiearchies is expensive, deeper hierarchies imply charging
the all the way up to the root. It is known that hiearchies are
expensive, so the user needs to be careful and aware of the trade-offs
before creating very deep ones.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We check mem_cgroup is disabled or not by checking
mem_cgroup_subsys.disabled. I think it has more references than expected,
now.
replacing
if (mem_cgroup_subsys.disabled)
with
if (mem_cgroup_disabled())
give us good look, I think.
[kamezawa.hiroyu@jp.fujitsu.com: fix typo]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A big patch for changing memcg's LRU semantics.
Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).
- LRU of page_cgroup is not synchronous with global LRU.
- page and page_cgroup is one-to-one and statically allocated.
- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
- SwapCache is handled.
And, when we handle LRU list of page_cgroup, we do following.
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);
But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.
This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.
Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements per cgroup limit for usage of memory+swap. However
there are SwapCache, double counting of swap-cache and swap-entry is
avoided.
Mem+Swap controller works as following.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.
This has following benefits.
- A user can limit total resource usage of mem+swap.
Without this, because memory resource controller doesn't take care of
usage of swap, a process can exhaust all the swap (by memory leak.)
We can avoid this case.
And Swap is shared resource but it cannot be reclaimed (goes back to memory)
until it's used. This characteristic can be trouble when the memory
is divided into some parts by cpuset or memcg.
Assume group A and group B.
After some application executes, the system can be..
Group A -- very large free memory space but occupy 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.
Ability to set appropriate swap limit for each group is required.
Maybe someone wonder "why not swap but mem+swap ?"
- The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
mem+swap.
In other words, when we want to limit the usage of swap without affecting
global LRU, mem+swap limit is better than just limiting swap.
Accounting target information is stored in swap_cgroup which is
per swap entry record.
Charge is done as following.
map
- charge page and memsw.
unmap
- uncharge page/memsw if not SwapCache.
swap-out (__delete_from_swap_cache)
- uncharge page
- record mem_cgroup information to swap_cgroup.
swap-in (do_swap_page)
- charged as page and memsw.
record in swap_cgroup is cleared.
memsw accounting is decremented.
swap-free (swap_free())
- if swap entry is freed, memsw is uncharged by PAGE_SIZE.
There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead. This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)
TODO:
- maybe more optimization can be don in swap-in path. (but not very safe.)
But we just do simple accounting at this stage.
[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For accounting swap, we need a record per swap entry, at least.
This patch adds following function.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.
- swap_cgroup_record() .... record information of swap entry.
- swap_cgroup_lookup() .... lookup information of swap entry.
This patch just implements "how to record information". No actual method
for limit the usage of swap. These routine uses flat table to record and
lookup. "wise" lookup system like radix-tree requires requires memory
allocation at new records but swap-out is usually called under memory
shortage (or memcg hits limit.) So, I used static allocation. (maybe
dynamic allocation is not very hard but it adds additional memory
allocation in memory shortage path.)
Note1: In this, we use pointer to record information and this means
8bytes per swap entry. I think we can reduce this when we
create "id of cgroup" in the range of 0-65535 or 0-255.
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Hugh Dickins <hugh@veritas.com>
Reported-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Config and control variable for mem+swap controller.
This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)
For accounting swap, it's obvious that we have to use additional memory to
remember "who uses swap". This adds more overhead. So, it's better to
offer "choice" to users. This patch adds 2 choices.
This patch adds 2 parameters to enable swap extension or not.
- CONFIG
- boot option
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>