mm->pinned_vm counts pages of mm's address space that were permanently
pinned in memory by increasing their reference counter. The counter was
introduced by commit bc3e53f682 ("mm: distinguish between mlocked and
pinned pages"), while before it locked_vm had been used for such pages.
Obviously, we should reset the counter on fork if !CLONE_VM, just like
we do with locked_vm, but currently we don't. Let's fix it.
This patch will fix the contents of /proc/pid/status:VmPin.
ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
RLIMIT_MEMLOCK. It's left from the times when pinned pages were accounted
under locked_vm, but today it looks wrong. It isn't clear how we should
deal with it.
We still have some drivers accounting pinned pages under mm->locked_vm -
this is what commit bc3e53f682 was fighting against. It's
infiniband/usnic and vfio.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.
We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:
- on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
are zeroed in dup_mmap();
- on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
calling mm_init();
- ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
called before mm_init() on both fork and exec;
- ->context is initialized by init_new_context(), which is called after
mm_init() on both fork and exec;
Let's consolidate all the initializations in mm_init() to make the code
look cleaner.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.
text data bss dec hex filename
6817 404 1984 9205 23f5 kernel/user_namespace.o-before
6913 308 1984 9205 23f5 kernel/user_namespace.o-after
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__sprint_symbol() should restore original address when kallsyms_lookup()
failed to find a symbol. It's reported when dumpstack shows an address in
a dynamically allocated trampoline for ftrace.
[ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
[ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
[ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
You can see a difference in the hex address - c4 and c3. Fix it.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone. Then attach the pin to original
vfsmount. Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out. Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.
mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new field to fs_pin - kill(pin). That's what umount and r/o remount
will be calling for all pins attached to vfsmount and superblock resp.
Called after bumping the refcount, so it won't go away under us. Dropping
the refcount is responsibility of the instance. All generic stuff moved to
fs/fs_pin.c; the next step will rip all the knowledge of kernel/acct.c from
fs/super.c and fs/namespace.c. After that - death to mnt_pin(); it was
intended to be usable as generic mechanism for code that wants to attach
objects to vfsmount, so that they would not make the sucker busy and
would get killed on umount. Never got it right; it remained acct.c-specific
all along. Now it's very close to being killable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pull generic parts into struct fs_pin. Eventually we want those
to replace mnt_pin()/mnt_unpin() mess; that stuff will move to
fs/*.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make acct->count atomic and acct freeing - rcu-delayed.
* instead of grabbing acct_lock around the places where we take a reference,
do that under rcu_read_lock() with atomic_long_inc_not_zero().
* have the new acct locked before making ns->bacct point to it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) file can't be NULL
b) file can't be changed under us
c) all writes are serialized by acct->lock; no need to mess with
spinlock there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Do not reuse bsd_acct_struct after closing the damn thing.
Structure lifetime is controlled by refcount now. We also
have a mutex in there, held over closing and writing (the
file is O_APPEND, so we are not losing any concurrency).
As the result, we do not need to bother with get_file()/fput()
on log write anymore. Moreover, do_acct_process() only needs
acct itself; file and pidns are picked from it.
Killed instances are distinguished by having NULL ->ns.
Refcount is protected by acct_lock; anybody taking the
mutex needs to grab a reference first.
The things will get a lot simpler in the next commits - this
is just the minimal chunk switching to the new lifetime rules.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>