This patch fixes a crash in the hugepage code. unmap_hugepage_area() was
assuming that (due to prefault) PTEs must exist for all the area in
question. However, this may not be the case, if mmap() encounters an error
before the prefault and calls unmap_region() to clean up any partial
mapping.
Depending on the hugepage configuration, this crash can be triggered by an
unpriveleged user.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have found what seems to be a small bug in __vm_enough_memory() when
sysctl_overcommit_memory is set to OVERCOMMIT_NEVER.
When this bug occurs the systems fails to boot, with /sbin/init whining
about fork() returning ENOMEM.
We hunted down the problem to this:
The deferred update mecanism used in vm_acct_memory(), on a SMP system,
allows the vm_committed_space counter to have a negative value.
This should not be a problem since this counter is known to be inaccurate.
But in __vm_enough_memory() this counter is compared to the `allowed'
variable, which is an unsigned long. This comparison is broken since it
will consider the negative values of vm_committed_space to be huge positive
values, resulting in a memory allocation failure.
Signed-off-by: <Jean-Marc.Saffroy@ext.bull.net>
Signed-off-by: <Simon.Derr@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mremap's move_vma is applying __vm_stat_account to the old vma which may
have already been freed: move it to just before the do_munmap.
mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status
VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by
this patch - worth including in 2.6.13, though not yet confirmed that it
fixes that specific report from Frank van Maarseveen.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and
has to be treated as such by get_user_pages().
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Checking pte_dirty instead of pte_write in __follow_page is problematic
for s390, and for copy_one_pte which leaves dirty when clearing write.
So revert __follow_page to check pte_write as before, and make
do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
done its full job: once get_user_pages receives this value, it no longer
requires pte_write in __follow_page.
But most callers of handle_mm_fault, in the various architectures, have
switch statements which do not expect this new case. To avoid changing
them all in a hurry, make an inline wrapper function (using the old
name) that masks off the new bit, and use the extended interface with
double underscores.
Yes, we do have a call to do_wp_page from do_swap_page, but no need to
change that: in rare case it's needed, another do_wp_page will follow.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
[ Cleanups by Nick Piggin ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A kernel BUG() is triggered by a call to set_mempolicy() with a negative
first argument. This is because the mode is declared as an int, and the
validity check doesnt check < 0 values. Alternatively, mode could be
declared as unsigned int or unsigned long.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 has a large sparse gate area between VSYSCALL_START and
VSYSCALL_END, not all of it presently backed by pmds. Alexander Nyberg has
found that in some circumstances gdb may try to ptrace here, and hit
get_user_pages BUG_ON. It seems odd that gdb should be accessing here, but
it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT. Fixes
kernel bugzilla #4801.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There's no real guarantee that handle_mm_fault() will always be able to
break a COW situation - if an update from another thread ends up
modifying the page table some way, handle_mm_fault() may end up
requiring us to re-try the operation.
That's normally fine, but get_user_pages() ended up re-trying it as a
read, and thus a write access could in theory end up losing the dirty
bit or be done on a page that had not been properly COW'ed.
This makes get_user_pages() always retry write accesses as write
accesses by making "follow_page()" require that a writable follow has
the dirty bit set. That simplifies the code and solves the race: if the
COW break fails for some reason, we'll just loop around and try again.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are iterating over all nodes in nr_free_zone_pages(). Because the
fallback zonelists contain all nodes in the system, and we walk all the
zonelists, we're counting memory multiple times (once for each node). This
caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes
all the dirty ratio calculations, etc incorrect.
There's still a further bug to fix from e820 holes causing overestimation
as well, but this fix is separate, and good as is, and fixes one class of
problems. Problem found by Badari, and tested by Ram Pai - thanks!
Signed-off-by: Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by: Matt Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Originally __free_pages_bulk used the relative page number within a zone to
define its buddies. This meant that to maintain the "maximally aligned"
requirements (that an allocation of size N will be aligned at least to N
physically) zones had to also be aligned to 1<<MAX_ORDER pages. When
__free_pages_bulk was updated to use the relative page frame numbers of the
free'd pages to pair buddies this released the alignment constraint on the
'left' edge of the zone. This allows _either_ edge of the zone to contain
partial MAX_ORDER sized buddies. These simply never will have matching
buddies and thus will never make it to the 'top' of the pyramid.
The patch below removes a now redundant check ensuring that the mem_map was
aligned to MAX_ORDER.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The madvise() system call returns -EBADF for areas which does not map to
files, only for *behaviour* request MADV_WILLNEED.
According to man pages, madvise returns :
EBADF - the map exists, but the area maps something that isn't a file.
Fixes bug 2995.
Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix bug identifued by Richard Purdie <rpurdie@rpsys.net>.
oprofile calls check_user_page_readable() from interrupt context, so we
deadlock over various VFS locks.
But check_user_page_readable() doesn't imply either a read or a write of the
page's contents. Change __follow_page() so that check_user_page_readable()
can tell __follow_page() that we're not accessing the page's contents, and use
that info to avoid the troublesome lock-takings.
Also, make follow_page() inline for the single callsite in memory.c to save a
bit of stack space.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch includes feedback from Andrew and Christoph. Thanks for
taking time to review.
Use of empty_zero_page was eliminated to fix compilation for architectures
that don't have it.
This patch removes setting pages up-to-date in ext2_get_xip_page and all
bug checks to verify that the page is indeed up to date. Setting the page
state on mapping to userland is bogus. None of the code patchs involved
with these pages in mm cares about the page state.
still on my ToDo list: identify a place outside second extended where
__inode_direct_access should reside
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mm/filemap_xip.c: In function `__xip_unmap':
mm/filemap_xip.c:194: request for member `pte' in something not a structure or union
Apparently pte_pfn() takes a pte_t, not a pointer to a pte_t. From looking
at asm/page.h, it seems to be the same on ia32 or ppc (iff
STRICT_MM_TYPECHECKS is enabled, which is disabled by default on ppc).
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We now print statistics when invoking the OOM killer, however this
information is not rate limited and you can get into situations where the
console is continually spammed.
For example, when a task is exiting the OOM killer will simply return
(waiting for that task to exit and clear up memory). If the VM continually
calls back into the OOM killer we get thousands of copies of show_mem() on
the console.
Use printk_ratelimit() to quieten it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove completly bogus comment from did_some_progress != 0 handling (that
same comment is a few lines below on did_some_progress = 0 case, where it
belongs).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch used to be in Andrew's tree before the NUMA slab allocator went
in. Either this patch or the NUMA slab allocator is needed in order for
kmalloc_node to work correctly.
pcibus_to_node may be used to generate the node information passed to
kmalloc_node. pcibus_to_node returns -1 if it was not able to determine
on which node a pcibus is located. For that case kmalloc_node must
work like kmalloc.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I spotted this issue while in memmap_init last week. I can't say the
change has any test coverage by me. start_pfn was formerly used in main
"for" loop. The fix is replace start_pfn with pfn.
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of ALIGN() to remove duplicate round-up code.
Signed-off-by: Nick Wilson <njw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>