Commit Graph

2180 Commits

Author SHA1 Message Date
Christoph Lameter
86c562a9d6 [PATCH] mm: optimize numa policy handling in slab allocator
Move the interrupt check from slab_node into ___cache_alloc and adds an
"unlikely()" to avoid pipeline stalls on some architectures.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Christoph Lameter
dc85da15d4 [PATCH] NUMA policies in the slab allocator V2
This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
imbalance in memory allocation during bootup.

The slab allocator in 2.6.13 is not numa aware and simply calls
alloc_pages().  This means that memory policies may control the behavior of
alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
resulting in the spreading out of allocations during bootup over all
available nodes.  The slab allocator in 2.6.13 has only a single list of
slab pages.  As a result the per cpu slab cache and the spinlock controlled
page lists may contain slab entries from off node memory.  The slab
allocator in 2.6.13 makes no effort to discern the locality of an entry on
its lists.

The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
slab entries by having lists of available slab pages for each node.  The
per cpu slab cache can only contain slab entries associated with the node
local to the processor.  This guarantees that the default allocation mode
of the slab allocator always assigns local memory if available.

Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
anymore.  In 2.6.14 all node unspecific slab allocations are performed on
the boot processor.  This means that most of key data structures are
allocated on one node.  Most processors will have to refer to these
structures making the boot node a potential bottleneck.  This may reduce
performance and cause unnecessary memory pressure on the boot node.

This patch implements NUMA policies in the slab layer.  There is the need
of explicit application of NUMA memory policies by the slab allcator itself
since the NUMA slab allocator does no longer let the page_allocator control
locality.

The check for policies is made directly at the beginning of __cache_alloc
using current->mempolicy.  The memory policy is already frequently checked
by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
kmalloc() will spread out each request to one node after another so that an
equal distribution of allocations can be obtained during bootup.

It is not possible to push the policy check to lower layers of the NUMA
slab allocator since the per cpu caches are now only containing slab
entries from the current node.  If the policy says that the local node is
not to be preferred or forbidden then there is no point in checking the
slab cache or local list of slab pages.  The allocation better be directed
immediately to the lists containing slab entries for the allowed set of
nodes.

This way of applying policy also fixes another strange behavior in 2.6.13.
alloc_pages() is controlled by the memory allocation policy of the current
process.  It could therefore be that one process is running with
MPOL_INTERLEAVE and would f.e.  obtain a new page following that policy
since no slab entries are in the lists anymore.  A page can typically be
used for multiple slab entries but lets say that the current process is
only using one.  The other entries are then added to the slab lists.  These
are now non local entries in the slab lists despite of the possible
availability of local pages that would provide faster access and increase
the performance of the application.

Another process without MPOL_INTERLEAVE may now run and expect a local slab
entry from kmalloc().  However, there are still these free slab entries
from the off node page obtained from the other process via MPOL_INTERLEAVE
in the cache.  The process will then get an off node slab entry although
other slab entries may be available that are local to that process.  This
means that the policy if one process may contaminate the locality of the
slab caches for other processes.

This patch in effect insures that a per process policy is followed for the
allocation of slab entries and that there cannot be a memory policy
influence from one process to another.  A process with default policy will
always get a local slab entry if one is available.  And the process using
memory policies will get its memory arranged as requested.  Off-node slab
allocation will require the use of spinlocks and will make the use of per
cpu caches not possible.  A process using memory policies to redirect
allocations offnode will have to cope with additional lock overhead in
addition to the latency added by the need to access a remote slab entry.

Changes V1->V2
- Remove #ifdef CONFIG_NUMA by moving forward declaration into
  prior #ifdef CONFIG_NUMA section.

- Give the function determining the node number to use a saner
  name.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Ingo Molnar
fc0abb1451 [PATCH] sem2mutex: mm/slab.c
Convert mm/swapfile.c's swapon_sem to swapon_mutex.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
Christoph Lameter
9eeff2395e [PATCH] Zone reclaim: Reclaim logic
Some bits for zone reclaim exists in 2.6.15 but they are not usable.  This
patch fixes them up, removes unused code and makes zone reclaim usable.

Zone reclaim allows the reclaiming of pages from a zone if the number of
free pages falls below the watermarks even if other zones still have enough
pages available.  Zone reclaim is of particular importance for NUMA
machines.  It can be more beneficial to reclaim a page than taking the
performance penalties that come with allocating a page on a remote zone.

Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch.  By default
RECLAIM_DISTANCE is 20.  20 is the distance to another node in the same
component (enclosure or motherboard) on IA64.  The meaning of the NUMA
distance information seems to vary by arch.

If zone reclaim is not successful then no further reclaim attempts will
occur for a certain time period (ZONE_RECLAIM_INTERVAL).

This patch was discussed before. See

http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Christoph Lameter
f1fd1067ec [PATCH] Zone reclaim: resurrect may_swap
Zone reclaim has a huge impact on NUMA performance (f.e.  our maximum
throughput with XFS is raised from 4GB to 6GB/sec / page cache contamination
of numa nodes destroys locality if one just does a large copy operation which
results in performance dropping for good until reboot).

This patch:

Resurrect may_swap in struct scan_control

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Christoph Lameter
fc30128963 [PATCH] Simplify migrate_page_add
Simplify migrate_page_add after feedback from Hugh.  This also allows us to
drop one parameter from migrate_page_add.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Nick Piggin
053837fce7 [PATCH] mm: migration page refcounting fix
Migration code currently does not take a reference to target page
properly, so between unlocking the pte and trying to take a new
reference to the page with isolate_lru_page, anything could happen to
it.

Fix this by holding the pte lock until we get a chance to elevate the
refcount.

Other small cleanups while we're here.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Andrew Morton
e236a166b2 [PATCH] mm: dirty_exceeded speedup
Ravikiran reports that this variable is bouncing all around nodes on NUMA
machines, causing measurable performance problems.  Fix that up by only
writing to it when it actually changed.

And put it in a new cacheline to prevent it sharing with other things (this
happened).

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Matt Tolentino
c09b42404d [PATCH] x86_64: add __meminit for memory hotplug
Add __meminit to the __init lineup to ensure functions default
to __init when memory hotplug is not enabled.  Replace __devinit
with __meminit on functions that were changed when the memory
hotplug code was introduced.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 23:18:35 -08:00
Paul Jackson
505970b96e [PATCH] cpuset oom lock fix
The problem, reported in:

  http://bugzilla.kernel.org/show_bug.cgi?id=5859

and by various other email messages and lkml posts is that the cpuset hook
in the oom (out of memory) code can try to take a cpuset semaphore while
holding the tasklist_lock (a spinlock).

One must not sleep while holding a spinlock.

The fix seems easy enough - move the cpuset semaphore region outside the
tasklist_lock region.

This required a few lines of mechanism to implement.  The oom code where
the locking needs to be changed does not have access to the cpuset locks,
which are internal to kernel/cpuset.c only.  So I provided a couple more
cpuset interface routines, available to the rest of the kernel, which
simple take and drop the lock needed here (cpusets callback_sem).

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:10 -08:00
Robin Holt
7339ff8302 [PATCH] Add tmpfs options for memory placement policies
Anything that writes into a tmpfs filesystem is liable to disproportionately
decrease the available memory on a particular node.  Since there's no telling
what sort of application (e.g.  dd/cp/cat) might be dropping large files
there, this lets the admin choose the appropriate default behavior for their
site's situation.

Introduce a tmpfs mount option which allows specifying a memory policy and
a second option to specify the nodelist for that policy.  With the default
policy, tmpfs will behave as it does today.  This patch adds support for
preferred, bind, and interleave policies.

The default policy will cause pages to be added to tmpfs files on the node
which is doing the writing.  Some jobs expect a single process to create
and manage the tmpfs files.  This results in a node which has a
significantly reduced number of free pages.

With this patch, the administrator can specify the policy and nodes for
that policy where they would prefer allocations.

This patch was originally written by Brent Casavant and Hugh Dickins.  I
added support for the bind and preferred policies and the mpol_nodelist
mount option.

Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:07 -08:00
Linus Torvalds
9f5974c873 Merge git://oss.sgi.com:8090/oss/git/xfs-2.6 2006-01-12 09:10:34 -08:00
Greg Ungerer
cbe8dd4af2 [PATCH] memmap_init_zone(): remove uneccesary page++
Remove unecessary page++ from memmap_init_zone loop.

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Catalin Marinas
2a7e2f7dcb [PATCH] do_truncate() call fix in tiny-shmem.c
Adapt tiny-shmem.c to the new do_truncate() prototype.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Christoph Lameter
f4598c8b36 [PATCH] migration: make sure there is no attempt to migrate reserved pages.
This ensures that reserved pages are not migrated.  Reserved pages
currently cause the WARN_ON to trigger in migrate_page_add()

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:48 -08:00
Randy.Dunlap
c59ede7b78 [PATCH] move capable() to capability.h
- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
	(in include/, block/, ipc/, kernel/, a few drivers/,
	mm/, security/, & sound/;
	many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:13 -08:00
Paul Jackson
4eac915d02 [PATCH] mm: gfp_atomic comments
Clarify in comments that GFP_ATOMIC means both "don't sleep" and "use
emergency pools", hence both ALLOC_HARDER and ALLOC_HIGH.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:09 -08:00
Hugh Dickins
7365f3d169 [PATCH] Restore KERN_EMERG to each line printed by bad_page
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:08 -08:00
Nathan Scott
ddae9c2ea7 Merge HEAD from oss.sgi.com:/oss/git/linux-2.6.git 2006-01-12 13:34:47 +11:00
David Woodhouse
a4fc7ab1d0 [PATCH] fix/simplify mutex debugging code
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
arguments instead, since all its callers were just calculating the 'to'
address for themselves anyway... (and sometimes doing so badly).

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 08:14:16 -08:00
Christoph Hellwig
78539fdfa4 [XFS] Export pagevec_lookup for use on the XFS page writeout path,
for dealing with delayed allocate and unwritten extents (as well).

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
2006-01-11 20:47:41 +11:00
Jesper Juhl
e97a31117c add missing printk loglevel in mm/swapfile.c
in mm/swapfile.c a printk() is missing a loglevel. I believe the proper
loglevel for this situation is KERN_ERR, so that's what the patch below
sets -if you agree, please apply.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-11 01:50:28 +01:00
Christoph Hellwig
870f481793 [PATCH] replace inode_update_time with file_update_time
To allow various options to work per-mount instead of per-sb we need a
struct vfsmount when updating ctime and mtime.  This preparation patch
replaces the inode_update_time routine with a file_update_atime routine so
we can easily get at the vfsmount.  (and the file makes more sense in this
context anyway).  Also get rid of the unused second argument - we always
want to update the ctime when calling this routine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:30 -08:00
Jes Sorensen
1b1dcc1b57 [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.

Modified-by: Ingo Molnar <mingo@elte.hu>

(finished the conversion)

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09 15:59:24 -08:00
Ingo Molnar
de5097c2e7 [PATCH] mutex subsystem, more debugging code
more mutex debugging: check for held locks during memory freeing,
task exit, enable sysrq printouts, etc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:21 -08:00