mirror of
https://github.com/armbian/linux-cix.git
synced 2026-01-06 12:30:45 -08:00
da89fb165e5e51a2ec1ff8a0ff6bc052d1068184
60 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
978ea78b65 |
rmap: remove __anon_vma_link() declaration
This declaration is not used anymore, remove it. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
72835c86ca |
mm: unify remaining mem_cont, mem, etc. variable names to memcg
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
948f017b09 |
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
2b575eb64f |
mm: convert anon_vma->lock to a mutex
Straightforward conversion of anon_vma->lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
25aeeb046e |
mm: revert page_lock_anon_vma() lock annotation
Its beyond ugly and gets in the way. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Namhyung Kim <namhyung@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
01d8b20dec |
mm: simplify anon_vma refcounts
This patch changes the anon_vma refcount to be 0 when the object is free. It does this by adding 1 ref to being in use in the anon_vma structure (iow. the anon_vma->head list is not empty). This allows a simpler release scheme without having to check both the refcount and the list as well as avoids taking a ref for each entry on the list. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
83813267c6 |
mm: move anon_vma ref out from under CONFIG_foo
We need the anon_vma refcount unconditionally to simplify the anon_vma lifetime rules. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9e60109f12 |
mm: rename drop_anon_vma() to put_anon_vma()
The normal code pattern used in the kernel is: get/put. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71e3aac072 |
thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...). Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive possible way. We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).
The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails! This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an operation
that can't fail. This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM. Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*. The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be an "harmless" addition later if this initial
part is agreed upon. It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later. collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault
time. This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view. In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...). And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.
If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet). But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.
NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores. I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add an huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks. One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault). Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only. If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot. If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).
This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevents mixing different page size in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
e9a81a821d |
rmap: wrap page_check_address() using __cond_lock()
The page_check_address() conditionally grabs *@ptlp in case of returning non-NULL. Rename and wrap it using __cond_lock() removes following warnings from sparse: mm/rmap.c:472:9: warning: context imbalance in 'page_mapped_in_vma' - unexpected unlock mm/rmap.c:524:9: warning: context imbalance in 'page_referenced_one' - unexpected unlock mm/rmap.c:706:9: warning: context imbalance in 'page_mkclean_one' - unexpected unlock mm/rmap.c:1066:9: warning: context imbalance in 'try_to_unmap_one' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ea4525b600 |
rmap: annotate lock context change on page_[un]lock_anon_vma()
The page_lock_anon_vma() conditionally grabs RCU and anon_vma lock but page_unlock_anon_vma() releases them unconditionally. This leads sparse to complain about context imbalance. Annotate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
182fea8f48 |
mm: remove alignment padding from anon_vma on (some) 64 bit builds
Reorder structure anon_vma to remove alignment padding on 64 builds when (CONFIG_KSM || CONFIG_MIGRATION). This will shrink the size of the anon_vma structure from 40 to 32 bytes & allow more objects per slab in its kmem_cache. Under slub the objects in the anon_vma kmem_cache will then be 40 bytes with 102 objects per slab. (On v2.6.36 without this patch,the size is 48 bytes and 85 objects/slab.) Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1021a64534 |
Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: hugetlb: add missing unlock in avoidcopy path in hugetlb_cow() hwpoison: rename CONFIG HWPOISON, hugetlb: support hwpoison injection for hugepage HWPOISON, hugetlb: detect hwpoison in hugetlb code HWPOISON, hugetlb: isolate corrupted hugepage HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage HWPOISON, hugetlb: enable error handling path for hugepage hugetlb, rmap: add reverse mapping for hugepage hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h Fix up trivial conflicts in mm/memory-failure.c |
||
|
|
0fe6e20b9c |
hugetlb, rmap: add reverse mapping for hugepage
This patch adds reverse mapping feature for hugepage by introducing
mapcount for shared/private-mapped hugepage and anon_vma for
private-mapped hugepage.
While hugepage is not currently swappable, reverse mapping can be useful
for memory error handler.
Without this patch, memory error handler cannot identify processes
using the bad hugepage nor unmap it from them. That is:
- for shared hugepage:
we can collect processes using a hugepage through pagecache,
but can not unmap the hugepage because of the lack of mapcount.
- for privately mapped hugepage:
we can neither collect processes nor unmap the hugepage.
This patch solves these problems.
This patch include the bug fix given by commit
|
||
|
|
ad8c2ee801 |
rmap: add exclusive page to private anon_vma on swapin
On swapin it is fairly common for a page to be owned exclusively by one process. In that case we want to add the page to the anon_vma of that process's VMA, instead of to the root anon_vma. This will reduce the amount of rmap searching that the swapout code needs to do. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
76545066c8 |
mm: extend KSM refcounts to the anon_vma root
KSM reference counts can cause an anon_vma to exist after the processe it belongs to have already exited. Because the anon_vma lock now lives in the root anon_vma, we need to ensure that the root anon_vma stays around until after all the "child" anon_vmas have been freed. The obvious way to do this is to have a "child" anon_vma take a reference to the root in anon_vma_fork. When the anon_vma is freed at munmap or process exit, we drop the refcount in anon_vma_unlink and possibly free the root anon_vma. The KSM anon_vma reference count function also needs to be modified to deal with the possibility of freeing 2 levels of anon_vma. The easiest way to do this is to break out the KSM magic and make it generic. When compiling without CONFIG_KSM, this code is compiled out. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
012f18004d |
mm: always lock the root (oldest) anon_vma
Always (and only) lock the root (oldest) anon_vma whenever we do something in an anon_vma. The recently introduced anon_vma scalability is due to the rmap code scanning only the VMAs that need to be scanned. Many common operations still took the anon_vma lock on the root anon_vma, so always taking that lock is not expected to introduce any scalability issues. However, always taking the same lock does mean we only need to take one lock, which means rmap_walk on pages from any anon_vma in the vma is excluded from occurring during an munmap, expand_stack or other operation that needs to exclude rmap_walk and similar functions. Also add the proper locking to vma_adjust. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5c341ee1df |
mm: track the root (oldest) anon_vma
Track the root (oldest) anon_vma in each anon_vma tree. Because we only take the lock on the root anon_vma, we cannot use the lock on higher-up anon_vmas to lock anything. This makes it impossible to do an indirect lookup of the root anon_vma, since the data structures could go away from under us. However, a direct pointer is safe because the root anon_vma is always the last one that gets freed on munmap or exit, by virtue of the same_vma list order and unlink_anon_vmas walking the list forward. [akpm@linux-foundation.org: fix typo] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
cba48b98f2 |
mm: change direct call of spin_lock(anon_vma->lock) to inline function
Subsitute a direct call of spin_lock(anon_vma->lock) with an inline function doing exactly the same. This makes it easier to do the substitution to the root anon_vma lock in a following patch. We will deal with the handful of special locks (nested, dec_and_lock, etc) separately. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb4a340e07 |
mm: rename anon_vma_lock to vma_lock_anon_vma
Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style used in page_lock_anon_vma and will come in really handy further down in this patch series. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7f60c214fd |
mm: migration: share the anon_vma ref counts between KSM and page migration
For clarity of review, KSM and page migration have separate refcounts on the anon_vma. While clear, this is a waste of memory. This patch gets KSM and page migration to share their toys in a spirit of harmony. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3f6c82728f |
mm: migration: take a reference to the anon_vma before migrating
This patchset is a memory compaction mechanism that reduces external
fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
pageblocks. The term "compaction" was chosen as there are is a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.
In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trgger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.
Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.
Ths first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel a X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
X86 vanilla compaction
Final page count 913 916 (attempted 1002)
pages reclaimed 68296 9791
X86-64 vanilla compaction
Final page count: 901 902 (attempted 1002)
Total pages reclaimed: 112599 53234
PPC64 vanilla compaction
Final page count: 93 94 (attempted 110)
Total pages reclaimed: 103216 61838
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.
vanilla compaction
Percentage of request allocated X86 98 99
Percentage of request allocated X86-64 95 98
Percentage of request allocated PPC64 55 70
This patch:
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
6457474624 |
vmscan: detect mapped file pages used only once
The VM currently assumes that an inactive, mapped and referenced file page is in use and promotes it to the active list. However, every mapped file page starts out like this and thus a problem arises when workloads create a stream of such pages that are used only for a short time. By flooding the active list with those pages, the VM quickly gets into trouble finding eligible reclaim canditates. The result is long allocation latencies and eviction of the wrong pages. This patch reuses the PG_referenced page flag (used for unmapped file pages) to implement a usage detection that scales with the speed of LRU list cycling (i.e. memory pressure). If the scanner encounters those pages, the flag is set and the page cycled again on the inactive list. Only if it returns with another page table reference it is activated. Otherwise it is reclaimed as 'not recently used cache'. This effectively changes the minimum lifetime of a used-once mapped file page from a full memory cycle to an inactive list cycle, which allows it to occur in linear streams without affecting the stable working set of the system. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c44b674323 |
rmap: move exclusively owned pages to own anon_vma in do_wp_page()
When the parent process breaks the COW on a page, both the original which is mapped at child and the new page which is mapped parent end up in that same anon_vma. Generally this won't be a problem, but for some workloads it could preserve the O(N) rmap scanning complexity. A simple fix is to ensure that, when a page which is mapped child gets reused in do_wp_page, because we already are the exclusive owner, the page gets moved to our own exclusive child's anon_vma. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5beb493052 |
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |