Commit Graph

36 Commits

Author SHA1 Message Date
Al Viro 9206de95b1 Take arch_mmap_check() into get_unmapped_area()
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:44:58 -05:00
Al Viro 935874141d fix pgoff in "have to relocate" case of mremap()
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:23 -05:00
Al Viro 097eed1038 fix the arch checks in MREMAP_FIXED case
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:23 -05:00
Al Viro f106af4e90 fix checks for expand-in-place mremap
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:23 -05:00
Al Viro 1a0ef85f84 do_mremap() untangling, part 3
Take the check for being able to expand vma in place into a separate
helper.

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:22 -05:00
Al Viro ecc1a89937 do_mremap() untangling, part 2
Take the MREMAP_FIXED into a separate helper, simplify the living
hell out of conditions in both cases.

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:22 -05:00
Al Viro 54f5de7099 untangling do_mremap(), part 1
Take locating vma and checks on it to a separate helper (it will be
shared between MREMAP_FIXED/non-MREMAP_FIXED cases when we split
them in the next patch)

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-11 06:30:22 -05:00
npiggin@suse.de 25d9e2d152 truncate: new helpers
Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 08:41:47 -04:00
Hugh Dickins 7103ad323b ksm: mremap use err from ksm_madvise
mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
-ENOMEM (letting madvise convert that to -EAGAIN), and can also say
-ERESTARTSYS when signalled: so pass the error from ksm_madvise.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:33 -07:00
Hugh Dickins 1ff8299573 ksm: prevent mremap move poisoning
KSM's scan allows for user pages to be COWed or unmapped at any time,
without requiring any notification.  But its stable tree does assume that
when it finds a KSM page where it placed a KSM page, then it is the same
KSM page that it placed there.

mremap move could break that assumption: if an area containing a KSM page
was unmapped, then an area containing a different KSM page was moved with
mremap into the place of the original, before KSM's scan came around to
notice.  That could then poison a node of the stable tree, so that memcmps
would "lie" and upset the ordering of the tree.

Probably noone will ever need mremap move on a VM_MERGEABLE area; except
that prohibiting it would make trouble for schemes in which we try making
everything VM_MERGEABLE e.g.  for testing: an mremap which normally works
would then fail mysteriously.

There's no need to go to any trouble, such as re-sorting KSM's list of
rmap_items to match the new layout: simply unmerge the area to COW all its
KSM pages before moving, but leave VM_MERGEABLE on so that they're
remerged later.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:31 -07:00
Heiko Carstens 6a6160a7b5 [CVE-2009-0029] System call wrappers part 13
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:23 +01:00
Heiko Carstens 2ed7c03ec1 [CVE-2009-0029] Convert all system calls to return a long
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:14 +01:00
Alan Cox 046c68842b mm: update my address
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 17:44:42 -08:00
Rik van Riel ba470de431 mmap: handle mlocked pages during map, remap, unmap
Originally by Nick Piggin <npiggin@suse.de>

Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed.  Remove
PageMlocked() status when page truncated from file.

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Andrea Arcangeli cddb8a5c14 mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
 There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte".  In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present).  The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.

Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set.  Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space.  Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page.  Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0.  This is just an example.

This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably.  And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) mm_take_all_locks() to register the mmu notifier when the whole VM
   isn't doing anything with "mm".  This allows mmu notifier users to keep
   track if the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter incraese in range_begin and
   decreased in range_end.  No secondary MMU page fault is allowed to map
   any spte or secondary tlb reference, while the VM is in the middle of
   range_begin/end as any page returned by get_user_pages in that critical
   section could later immediately be freed without any further
   ->invalidate_page notification (invalidate_range_begin/end works on
   ranges and ->invalidate_page isn't called immediately before freeing
   the page).  To stop all page freeing and pagetable overwrites the
   mmap_sem must be taken in write mode and all other anon_vma/i_mmap
   locks must be taken too.

2) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
   CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
   mmu notifiers, but this already allows to compile a KVM external module
   against a kernel with mmu notifiers enabled and from the next pull from
   kvm.git we'll start using them.  And GRU/XPMEM will also be able to
   continue the development by enabling KVM=m in their config, until they
   submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
   also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
   This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
   are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled.  Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.

 struct  kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;

        if (!kvm)
                return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 16:30:21 -07:00
Stephen Hemminger c80544dc0b sparse pointer use of zero as null
Get rid of sparse related warnings from places that use integer as NULL
pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:31 -07:00
Ollie Wild b6a2fea393 mm: variable length argument support
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.

We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space.  Once the binfmt code runs and figures out
where the stack should be, we move it downwards.

It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.

[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: Ollie Wild <aaw@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Eric Paris ed03218951 security: Protection for exploiting null dereference using mmap
Add a new security check on mmap operations to see if the user is attempting
to mmap to low area of the address space.  The amount of space protected is
indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
0, preserving existing behavior.

This patch uses a new SELinux security class "memprotect."  Policy already
contains a number of allow rules like a_t self:process * (unconfined_t being
one of them) which mean that putting this check in the process class (its
best current fit) would make it useless as all user processes, which we also
want to protect against, would be allowed. By taking the memprotect name of
the new class it will also make it possible for us to move some of the other
memory protect permissions out of 'process' and into the new class next time
we bump the policy version number (which I also think is a good future idea)

Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2007-07-11 22:52:29 -04:00
Hugh Dickins 701dfbc1cb [PATCH] mm: mremap correct rmap accounting
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

Instead of complicating that move_pte, just forget the minor optimization
when mremapping, and change the one thing which needed it for correctness
- filemap_xip use ZERO_PAGE(0) throughout instead of according to address.

[ "There is no block device driver one could use for XIP on mips
   platforms" - Carsten Otte ]

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 08:33:32 -08:00
Zachary Amsden 6606c3e0da [PATCH] paravirt: lazy mmu mode hooks.patch
Implement lazy MMU update hooks which are SMP safe for both direct and shadow
page tables.  The idea is that PTE updates and page invalidations while in
lazy mode can be batched into a single hypercall.  We use this in VMI for
shadow page table synchronization, and it is a win.  It also can be used by
PPC and for direct page tables on Xen.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified.  This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of.  Doing this under the protection of the locks guarantees the
synchronization is correct, and also means that spurious faults which are
generated during this window by remote CPUs are properly handled, as the page
fault handler must re-check the PTE under protection of the same lock.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Ingo Molnar f20dc5f7c1 [PATCH] lockdep: annotate mm
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Randy.Dunlap c59ede7b78 [PATCH] move capable() to capability.h
- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
	(in include/, block/, ipc/, kernel/, a few drivers/,
	mm/, security/, & sound/;
	many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:13 -08:00
Linus Torvalds 4d7672b462 Make sure we copy pages inserted with "vm_insert_page()" on fork
The logic that decides that a fork() might be able to avoid copying a VM
area when it can be re-created by page faults didn't know about the new
vm_insert_page() case.

Also make some things a bit more anal wrt VM_PFNMAP.

Pointed out by Hugh Dickins <hugh@veritas.com>

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-16 10:21:23 -08:00
Hugh Dickins 4c21e2f244 [PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins c74df32c72 [PATCH] mm: ptd_alloc take ptlock
Second step in pushing down the page_table_lock.  Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code.  But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate.  It appears that on
all arches we can safely descend without page_table_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00