Stub out vm_get_page_prot() if there's no MMU.
This was added by commit 804af2cf6e ("[AGPGART] remove private page
protection map") and is used in commit c07fbfd17e ("fbmem: VM_IO set,
but not propagated") in the fbmem video driver, but the function doesn't
exist on NOMMU, resulting in an undefined symbol at link time.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pa-risc and ia64 have stacks that grow upwards. Check that
they do not run into other mappings. By making VM_GROWSUP
0x0 on architectures that do not ever use it, we can avoid
some unpleasant #ifdefs in check_stack_guard_page().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok. Also clean up and document inode_change_ok
to make this obvious.
As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error. This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere. Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.
Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.
Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.
[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up. For example
kernel BUG at include/linux/swapops.h:105!
invalid opcode: 0000 [#1] PREEMPT SMP
...
Call Trace:
[<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[<ffffffff813099b5>] page_fault+0x25/0x30
[<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[<ffffffff8111329b>] search_binary_handler+0x173/0x313
[<ffffffff81114896>] do_execve+0x219/0x30a
[<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[<ffffffff8100320a>] stub_execve+0x6a/0xc0
RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
There is a race between shift_arg_pages and migration that triggers this
bug. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed
before migration completes, the migration PTE may not be found leading to
a BUG when the stack is faulted.
This patch causes pages within the temporary stack during exec to be
skipped by migration. It does this by marking the VMA covering the
temporary stack with an otherwise impossible combination of VMA flags.
These flags are cleared when the temporary stack is moved to its final
location.
[kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we look into pagemap using page-types with option -p, the value of
pfn for hugepages looks wrong (see below.) This is because pte was
evaluated only once for one vma although it should be updated for each
hugepage. This patch fixes it.
$ page-types -p 3277 -Nl -b huge
voffset offset len flags
7f21e8a00 11e400 1 ___U___________H_G________________
7f21e8a01 11e401 1ff ________________TG________________
^^^
7f21e8c00 11e400 1 ___U___________H_G________________
7f21e8c01 11e401 1ff ________________TG________________
^^^
One hugepage contains 1 head page and 511 tail pages in x86_64 and each
two lines represent each hugepage. Voffset and offset mean virtual
address and physical address in the page unit, respectively. The
different hugepages should not have the same offset value.
With this patch applied:
$ page-types -p 3386 -Nl -b huge
voffset offset len flags
7fec7a600 112c00 1 ___UD__________H_G________________
7fec7a601 112c01 1ff ________________TG________________
^^^
7fec7a800 113200 1 ___UD__________H_G________________
7fec7a801 113201 1ff ________________TG________________
^^^
OK
More info:
- This patch modifies walk_page_range()'s hugepage walker. But the
change only affects pagemap_read(), which is the only caller of hugepage
callback.
- Without this patch, hugetlb_entry() callback is called per vma, that
doesn't match the natural expectation from its name.
- With this patch, hugetlb_entry() is called per hugepte entry and the
callback can become much simpler.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support for the PMU's BTS features has been upstreamed in
v2.6.32, but we still have the old and disabled ptrace-BTS,
as Linus noticed it not so long ago.
It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
regard for other uses (perf) and doesn't provide the flexibility
needed for perf either.
Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
was never used and ptrace-block-step can be implemented using a
much simpler approach.
So axe all 3000 lines of it. That includes the *locked_memory*()
APIs in mm/mlock.c as well.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100325135413.938004390@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 34e55232e5 ("mm: avoid false sharing
of mm_counter") added sync_mm_rss() for syncing loosely accounted rss
counters. It's for CONFIG_MMU but sync_mm_rss is called even in NOMMU
enviroment (kerne/exit.c, fs/exec.c). Above commit doesn't handle it
well.
This patch changes
SPLIT_RSS_COUNTING depends on SPLIT_PTLOCKS && CONFIG_MMU
And for avoid unnecessary function calls, sync_mm_rss changed to be inlined
noop function in header file.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a VMA is in an inconsistent state during setup or teardown, the worst
that can happen is that the rmap code will not be able to find the page.
The mapping is in the process of being torn down (PTEs just got
invalidated by munmap), or set up (no PTEs have been instantiated yet).
It is also impossible for the rmap code to follow a pointer to an already
freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
teardown code needs to take before the VMA is removed from the anon_vma
chain.
Hence, we should not need the VM_LOCK_RMAP locking at all.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.
This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.
Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.
rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
early_res: Need to save the allocation name in drop_range_partial()
sparsemem: Fix compilation on PowerPC
early_res: Add free_early_partial()
x86: Fix non-bootmem compilation on PowerPC
core: Move early_res from arch/x86 to kernel/
x86: Add find_fw_memmap_area
Move round_up/down to kernel.h
x86: Make 32bit support NO_BOOTMEM
early_res: Enhance check_and_double_early_res
x86: Move back find_e820_area to e820.c
x86: Add find_early_area_size
x86: Separate early_res related code from e820.c
x86: Move bios page reserve early to head32/64.c
sparsemem: Put mem map for one node together.
sparsemem: Put usemap for one node together
x86: Make 64 bit use early_res instead of bootmem before slab
x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
x86: Make early_node_mem get mem > 4 GB if possible
x86: Dynamically increase early_res array size
x86: Introduce max_early_res and early_res_count
...
x86/mm is on 32-rc4 and missing the spinlock namespace changes which
are needed for further commits into this topic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add vmemmap_alloc_block_buf for mem map only.
It will fallback to the old way if it cannot get a block that big.
Before this patch, when a node have 128g ram installed, memmap are
split into two parts or more.
[ 0.000000] [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
[ 0.000000] [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
[ 0.000000] [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
[ 0.000000] [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
[ 0.000000] [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
[ 0.000000] [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
[ 0.000000] [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
[ 0.000000] [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
[ 0.000000] [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
[ 0.000000] [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
[ 0.000000] [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
[ 0.000000] [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
[ 0.000000] [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
[ 0.000000] [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
[ 0.000000] [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
[ 0.000000] [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
[ 0.000000] [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
[ 0.000000] [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
[ 0.000000] [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
[ 0.000000] [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
after patch will get
[ 0.000000] [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
[ 0.000000] [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
[ 0.000000] [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
[ 0.000000] [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
[ 0.000000] [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
[ 0.000000] [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
[ 0.000000] [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
[ 0.000000] [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
-v2: change buf to vmemmap_buf instead according to Ingo
also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
-v3: according to Andrew, use sizeof(name) instead of hard coded 15
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Finally we can use early_res to replace bootmem for x86_64 now.
Still can use CONFIG_NO_BOOTMEM to enable it or not.
-v2: fix 32bit compiling about MAX_DMA32_PFN
-v3: folded bug fix from LKML message below
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B747239.4070907@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>