This reverts commit 3811dbf671.
The masking was not at all useless, and it was sensible. We handle
GFP_ZERO in the caller, and passing it down to any page allocator logic
is buggy and wrong.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5
and 2.6.25-rc5-mm1:
BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000
TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
GPR12: 0000000028000442 c000000000770080
NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
Call Trace:
[c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
[c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
[c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
[c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
[c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
[c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
[c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
[c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
[c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
[c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
[c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
[c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
[c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
[c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
Instruction dump:
ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010
This was tracked down to a potential livelock in
return_unused_surplus_hugepages(). In the case where we have surplus
pages on some node, but no free pages on the same node, we may never
break out of the loop. To avoid this livelock, terminate the search if
we iterate a number of times equal to the number of online nodes without
freeing a page.
Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
the patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we show the surplus hugetlb pool state in /proc/meminfo, but
not in the per-node meminfo files, even though we track the information
on a per-node basis. Printing it there can help track down dynamic pool
bugs including the one in the follow-on patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 556a169dab ("slab: fix bootstrap on
memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
patch fixes list_add() corruption in mm/slab.c seen on the ES7000.
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO
is defined. This patch will be reversed when slab defrag is merged since slab
defrag requires count_partial() to determine the fragmentation status of
slab caches.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get stack footprint of pathname resolution back to relative sanity
[PATCH] double iput() on failure exit in hugetlb
[PATCH] double dput() on failure exit in tiny-shmem
[PATCH] fix up new filp allocators
[PATCH] check for null vfsmount in dentry_open()
[PATCH] reiserfs: eliminate private use of struct file in xattr
[PATCH] sanitize hppfs
hppfs pass vfsmount to dentry_open()
[PATCH] restore export of do_kern_mount()
Revert commit f1a9ee758d:
Author: Rik van Riel <riel@redhat.com>
Date: Thu Feb 7 00:14:08 2008 -0800
kswapd should only wait on IO if there is IO
The current kswapd (and try_to_free_pages) code has an oddity where the
code will wait on IO, even if there is no IO in flight. This problem is
notable especially when the system scans through many unfreeable pages,
causing unnecessary stalls in the VM.
Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
will sleep if a significant number of pages are encountered that should be
written out. This gives kswapd a chance to write out those pages, while
the direct reclaim task sleeps.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because of large latencies and interactivity problems reported by Carlos,
here: http://lkml.org/lkml/2008/3/22/211
Cc: Rik van Riel <riel@redhat.com>
Cc: "Carlos R. Mafra" <crmafra2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With numa enabled, some callers could have a range of memory on one node
but try to free that on other node. This can cause some pages to be
freed wrongly.
For example: when we try to allocate 128g boot ram early for
gart/swiotlb, and free that range later so gart/swiotlb can get some
range afterwards.
With this patch, we don't need to care which node holds the range, just
loop to call free_bootmem_node for all online nodes.
This patch makes free_bootmem_core() more robust by trimming the sidx
and eidx according the ram range that the node has.
And make the free_bootmem_core handle this out of range case. We could
use bdata_list to make sure the range can be freed for sure. So next
time, we don't need to loop online nodes and could use free_bootmem
directly.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc notation in mm/readahead.c.
Change ":" to ";" so that it doesn't get treated as a doc section heading.
Move the comment block ending "*/" to a line by itself so that the text on
that last line is not lost (dropped).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check t->pid == t->pid is not the blessed way to check whether a task is a
group leader.
This is not about the code beautifulness only, but about pid namespaces fixes
- both the tgid and the pid fields on the task_struct are (slowly :( )
becoming deprecated.
Besides, the thread_group_leader() macro makes only one dereference :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fallback path needs to enable interrupts like done for
the other page allocator calls. This was not necessary with
the alternate fast path since we handled irq enable/disable in
the slow path. The regular fastpath handles irq enable/disable
around calls to the slow path so we need to restore the proper
status before calling the page allocator from the slowpath.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array. Fix this by checking against
i->count before we skip a zero-length iov.
The bug was reproduced with a test program that continually randomly creates
iovs to writev. The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: "Kevin Coffman" <kwc@citi.umich.edu>
Cc: "Alexey Dobriyan" <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free pages in the hugetlb pool are free and as such have a reference count of
zero. Regular allocations into the pool from the buddy are "freed" into the
pool which results in their page_count dropping to zero. However, surplus
pages can be directly utilized by the caller without first being freed to the
pool. Therefore, a call to put_page_testzero() is in order so that such a
page will be handed to the caller with a correct count.
This has not affected end users because the bad page count is reset before the
page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when
the page count is validated.
Thanks go to Mel for first spotting this issue and providing an initial fix.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address 3 known bugs in the current memory policy reference counting method.
I have a series of patches to rework the reference counting to reduce overhead
in the allocation path. However, that series will require testing in -mm once
I repost it.
1) alloc_page_vma() does not release the extra reference taken for
vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in
leaking mempolicy structures. This is probably occurring, but not being
noticed.
Fix: add the conditional release of the reference.
2) hugezonelist unconditionally releases a reference on the mempolicy when
mode == MPOL_INTERLEAVE. This can result in decrementing the reference
count for system default policy [should have no ill effect] or premature
freeing of task policy. If this occurred, the next allocation using task
mempolicy would use the freed structure and probably BUG out.
Fix: add the necessary check to the release.
3) The current reference counting method assumes that vma 'get_policy()'
methods automatically add an extra reference a non-NULL returned mempolicy.
This is true for shmem_get_policy() used by tmpfs mappings, including
regular page shm segments. However, SHM_HUGETLB shm's, backed by
hugetlbfs, just use the vma policy without the extra reference. This
results in freeing of the vma policy on the first allocation, with reuse of
the freed mempolicy structure on subsequent allocations.
Fix: Rather than add another condition to the conditional reference
release, which occur in the allocation path, just add a reference when
returning the vma policy in shm_get_policy() to match the assumptions.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: <eric.whitney@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NUMA slab allocator cpu migration bugfix
The NUMA slab allocator (specifically, cache_alloc_refill)
is not refreshing its local copies of what cpu and what
numa node it is on, when it drops and reacquires the irq
block that it inherited from its caller. As a result
those values become invalid if an attempt to migrate the
process to another numa node occured while the irq block
had been dropped.
The solution is to make cache_alloc_refill reload these
variables whenever it drops and reacquires the irq block.
The error is very difficult to hit. When it does occur,
one gets the following oops + stack traceback bits in
check_spinlock_acquired:
kernel BUG at mm/slab.c:2417
cache_alloc_refill+0xe6
kmem_cache_alloc+0xd0
...
This patch was developed against 2.6.23, ported to and
compiled-tested only against 2.6.25-rc4.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
SLUB should pack even small objects nicely into cachelines if that is what
has been asked for. Use the same algorithm as SLAB for this.
The effect of this patch for a system with a cacheline size of 64
bytes is that the 24 byte sized slab caches will now put exactly
2 objects into a cacheline instead of 3 with some overlap into
the next cacheline. This reduces the object density in a 4k slab
from 170 to 128 objects (same as SLAB).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>