Use the 'allow_attach' handler for the 'mem' cgroup to allow
non-root processes to add arbitrary processes to a 'mem' cgroup
if it has the CAP_SYS_NICE capability set.
Bug: 18260435
Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
Signed-off-by: Rom Lemarchand <romlem@android.com>
too many places open-code it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
mm/util.c
security/apparmor/include/apparmor.h
Change-Id: Ie8602e0199282dc462921cb7217158d1998853b0
commit 118b230225 upstream.
dynamic_dname() is both too much and too little for those - the
output may be well in excess of 64 bytes dynamic_dname() assumes
to be enough (thanks to ashmem feeding really long names to
shmem_file_setup()) and vsnprintf() is an overkill for those
guys.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I70970c6125f377048664eb5bde08f3fae29aa348
Fix two bugs caused by merging anon vma_naming, a typo in
mempolicy.c and a bad merge in sys.c.
Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90
Signed-off-by: Colin Cross <ccross@android.com>
Swap subsystem does lazy swap slot free with expecting the page would be
swapped out again so we can avoid unnecessary write.
But the problem in in-memory swap(ex, zram) is that it consumes memory
space until vm_swap_full(ie, used half of all of swap device) condition
meet. It could be bad if we use multiple swap device, small in-memory
swap and big storage swap or in-memory swap alone.
This patch makes swap subsystem free swap slot as soon as swap-read is
completed and make the swapcache page dirty so the page should be
written out the swap device to reclaim it. It means we never lose it.
I tested this patch with kernel compile workload.
1. before
compile time : 9882.42
zram max wasted space by fragmentation: 13471881 byte
memory space consumed by zram: 174227456 byte
the number of slot free notify: 206684
2. after
compile time : 9653.90
zram max wasted space by fragmentation: 11805932 byte
memory space consumed by zram: 154001408 byte
the number of slot free notify: 426972
[akpm@linux-foundation.org: tweak comment text]
[artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()]
[akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.
This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period. In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.
It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.
Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
This patch adds a debugfs file called "shrinker" when read this calls
all the shrinkers in the system with nr_to_scan set to zero and prints
the result. These results are the number of objects the shrinkers have
available and can thus be used an indication of the total memory
that would be availble to the system if a shrink occurred.
Change-Id: Ied0ee7caff3d2fc1cb4bb839aaafee81b5b0b143
Signed-off-by: Rebecca Schultz Zavin <rebecca@android.com>
By default the kernel tries to keep half as much memory free at each
order as it does for one order below. This can be too agressive when
running without swap.
Change-Id: I5efc1a0b50f41ff3ac71e92d2efd175dedd54ead
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Pull SLAB fix from Pekka Enberg:
"A slab regression fix by Sasha Levin"
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
slab: prevent warnings when allocating with __GFP_NOWARN
Sasha Levin noticed that the warning introduced by commit 6286ae9
("slab: Return NULL for oversized allocations) is being triggered:
WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
can: request_module (can-proto-4) failed.
mpoa: proc_mpc_write: could not parse ''
Modules linked in:
CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
Call Trace:
[<ffffffff83ff4041>] dump_stack+0x4e/0x82
[<ffffffff8111fe12>] warn_slowpath_common+0x82/0xb0
[<ffffffff8111fe55>] warn_slowpath_null+0x15/0x20
[<ffffffff81243dcf>] kmalloc_slab+0x2f/0xb0
[<ffffffff81278d54>] __kmalloc+0x24/0x4b0
[<ffffffff8196ffe3>] ? security_capable+0x13/0x20
[<ffffffff812a26b7>] ? pipe_fcntl+0x107/0x210
[<ffffffff812a26b7>] pipe_fcntl+0x107/0x210
[<ffffffff812b7ea0>] ? fget_raw_light+0x130/0x3f0
[<ffffffff812aa5fb>] SyS_fcntl+0x60b/0x6a0
[<ffffffff8403ca98>] tracesys+0xe1/0xe6
Andrew Morton writes:
__GFP_NOWARN is frequently used by kernel code to probe for "how big
an allocation can I get". That's a bit lame, but it's used on slow
paths and is pretty simple.
However, SLAB would still spew a warning when a big allocation happens
if the __GFP_NOWARN flag is _not_ set to expose kernel bugs.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[ penberg@kernel.org: improve changelog ]
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.
The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.
The write side sets the position pointer first, then updates the
sequence count to "publish" the new position. Likewise, the read side
must read the sequence count first, then the position. If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:
writer: reader:
iter->position = position if iter->sequence == expected:
smp_wmb() smp_rmb()
iter->sequence = sequence position = iter->position
However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory. Fix this.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG. And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.
This fixes incorrect zeroing and allocation size for frontswap_map. The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing. But the wrongly calculated allocation size
may cause the problem.
For 32bit systems, the allocation size of frontswap_map is about twice
as large as required size. For 64bit systems, the allocation size is
smaller than requeired if the number of bits is not a multiple of
BITS_PER_LONG.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes. As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.
This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage. This patch introduces
migration_entry_wait_huge() to solve this.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [2.6.35+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The watermark check consists of two sub-checks. The first one is:
if (free_pages <= min + lowmem_reserve)
return false;
The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
This prevents the zone from being drained from pages available for
non-movable allocations.
The second check prevents the zone from getting too fragmented.
for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}
The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.
This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.
Laura said:
We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.
For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Tested-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.
This transient swap_map state is expected to be transitory, but the
actual placement of discard at scan_swap_map() inserts a wait for I/O
completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page() is
scheduled away at scan_swap_map(). This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the introduction of preemptible mmu_gather TLB fast mode has been
broken. TLB fast mode relies on there being absolutely no concurrency;
it frees pages first and invalidates TLBs later.
However now we can get concurrency and stuff goes *bang*.
This patch removes all tlb_fast_mode() code; it was found the better
option vs trying to patch the hole by entangling tlb invalidation with
the scheduler.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Reported-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
application has a VM_PFNMAP range. It happened in-house when a
benchmarker was trying to decipher the memory layout of his program.
/proc/<pid>/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.
Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFN's are backed with page structures. And
this is not usually true for VM_PFNMAP areas. This can result in panics
on kernel page faults when attempting to address those page structures.
There are a half dozen callers of walk_page_range() that walk through a
task's entire page table (as N. Horiguchi pointed out). So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.
The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.
VM_PFNMAP areas are used by:
- graphics memory manager gpu/drm/drm_gem.c
- global reference unit sgi-gru/grufile.c
- sgi special memory char/mspec.c
- and probably several out-of-tree modules
[akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix printk format warnings in mm/memory_hotplug.c by using "%pa":
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'resource_size_t' [-Wformat]
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'resource_size_t' [-Wformat]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not use set_pmd_at to update pmd_t with pgtable_t pointer.
set_pmd_at is used to set pmd with huge pte entries and architectures
like ppc64, clear few flags from the pte when saving a new entry.
Without this change we observe bad pte errors like below on ppc64 with
THP enabled.
BUG: Bad page map in process ld mm=0xc000001ee39f4780 pte:7fc3f37848000001 pmd:c000001ec0000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page 'new' during MIGRATION can't be flushed with flush_cache_page().
Using flush_cache_page(vma, addr, pfn) is justified only if the page is
already placed in process page table, and that is done right after
flush_cache_page(). But without it the arch function has no knowledge
of process PTE and does nothing.
Besides that, flush_cache_page() flushes an application cache page, but
the kernel has a different page virtual address and dirtied it.
Replace it with flush_dcache_page(new) which is the proper usage.
The old page is flushed in try_to_unmap_one() before migration.
This bug takes place in Sead3 board with M14Kc MIPS CPU without cache
aliasing (but Harvard arch - separate I and D cache) in tight memory
environment (128MB) each 1-3days on SOAK test. It fails in cc1 during
kernel build (SIGILL, SIGBUS, SIGSEG) if CONFIG_COMPACTION is switched
ON.
Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Cc: Leonid Yegoshin <yegoshin@mips.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0c59b89c81 ("mm: memcg: push down PageSwapCache check into
uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
uncharge path after checking that page flag once, assuming that the
state is stable in all paths, but this is not the case and the condition
triggers in user environments. An uncharge after the last page table
reference to the page goes away can race with reclaim adding the page to
swap cache.
Swap cache pages are usually uncharged when they are freed after
swapout, from a path that also handles swap usage accounting and memcg
lifetime management. However, since the last page table reference is
gone and thus no references to the swap slot left, the swap slot will be
freed shortly when reclaim attempts to write the page to disk. The
whole swap accounting is not even necessary.
So while the race condition for which this VM_BUG_ON was added is real
and actually existed all along, there are no negative effects. Remove
the VM_BUG_ON again.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Lingzhu Xiang <lxiang@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 751efd8610 ("mmu_notifier_unregister NULL Pointer deref and
multiple ->release()") breaks the fix 3ad3d901bb ("mm: mmu_notifier:
fix freed page still mapped in secondary MMU").
Since hlist_for_each_entry_rcu() is changed now, we can not revert that
patch directly, so this patch reverts the commit and simply fix the bug
spotted by that patch
This bug spotted by commit 751efd8610 is:
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().
Assume two tasks, one calling mmu_notifier_unregister() as a result
of a filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).
A B
t1 srcu_read_lock()
t2 if (!hlist_unhashed())
t3 srcu_read_unlock()
t4 srcu_read_lock()
t5 hlist_del_init_rcu()
t6 synchronize_srcu()
t7 srcu_read_unlock()
t8 hlist_del_rcu() <--- NULL pointer deref.
This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.
The another issue spotted in the commit is "multiple ->release()
callouts", we needn't care it too much because it is really rare (e.g,
can not happen on kvm since mmu-notify is unregistered after
exit_mmap()) and the later call of multiple ->release should be fast
since all the pages have already been released by the first call.
Anyway, this issue should be fixed in a separate patch.
-stable suggestions: Any version that has commit 751efd8610 need to be
backported. I find the oldest version has this commit is 3.0-stable.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>