Commit Graph

67 Commits

Author SHA1 Message Date
Christoph Lameter 8fce4d8e3b [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages.
The cache reaper currently tries to free all alien caches and all remote
per cpu pages in each pass of cache_reap.  For a machines with large number
of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
Interrupts are disabled while reclaiming creating unacceptable delays.

This patch changes that behavior by adding a per cpu reap_node variable.
Instead of attempting to free all caches, we free only one alien cache and
the per cpu pages from one remote node.  That reduces the time spend in
cache_reap.  However, doing so will lengthen the time it takes to
completely drain all remote per cpu pagesets and all alien caches.  The
time needed will grow with the number of nodes in the system.  All caches
are drained when they overflow their respective capacity.  So the drawback
here is only that a bit of memory may be wasted for awhile longer.

Details:

1. Rename drain_remote_pages to drain_node_pages to allow the specification
   of the node to drain of pcp pages.

2. Add additional functions init_reap_node, next_reap_node for NUMA
   that manage a per cpu reap_node counter.

3. Add a reap_alien function that reaps only from the current reap_node.

For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
cause some HPC benchmarks to slow down significantly.  F.e.  NAS parallel
slows down dramatically.  NAS parallel has a 12-16 seconds runtime w/o rotor
compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs with
the additional interrupt holdoff reductions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-09 19:47:38 -08:00
Jack Steiner 07ed76b2a0 [PATCH] slab: allocate larger cache_cache if order 0 fails
kmem_cache_init() incorrectly assumes that the cache_cache object will fit
in an order 0 allocation.  On very large systems, this is not true.  Change
the code to try larger order allocations if order 0 fails.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:15:04 -08:00
Linus Torvalds f78bb8ad48 slab: fix calculate_slab_order() for SLAB_RECLAIM_ACCOUNT
Instead of having a hard-to-read and confusing conditional in the
caller, just make the slab order calculation handle this special case,
since it's simple and obvious there.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 10:33:05 -08:00
Linus Torvalds 9888e6fa7b slab: clarify and fix calculate_slab_order()
If we triggered the 'offslab_limit' test, we would return with
cachep->gfporder incremented once too many times.

This clarifies the logic somewhat, and fixes that bug.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 17:44:43 -08:00
Linus Torvalds 264132bc62 Fix "check_slabp" printout size calculation
We want to use the "struct slab" size, not the size of the pointer to
same.  As it is, we'd not print out the last <n> entry pointers in the
slab (where <n> is ~10, depending on whether it's a 32-bit or 64-bit
kernel).

Gaah, that slab code was written by somebody who likes unreadable crud.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 12:10:07 -08:00
Ravikiran G Thirumalai f0188f4748 [PATCH] slab: Avoid deadlock at kmem_cache_create/kmem_cache_destroy
Prevents deadlock situation between
kmem_cache_create()/kmem_cache_destory(), and kmem_cache_create() /cpu
hotplug.  The locking order probably got moved over time.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10 08:13:12 -08:00
Linus Torvalds 7a21ef6fe9 mm/slab.c (non-NUMA): Fix compile warning and clean up code
The non-NUMA case would do an unmatched "free_alien_cache()" on an alien
pointer that had never been allocated.

It might not matter from a code generation standpoint (since in the
non-NUMA case, the code doesn't actually _do_ anything), but it not only
results in a compiler warning, it's really really ugly too.

Fix the compiler warning by just having a matching dummy allocation.
That also avoids an unnecessary #ifdef in the code.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:26:38 -08:00
Ravikiran G Thirumalai 4484ebf12b [PATCH] NUMA slab locking fixes: fix cpu down and up locking
This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab
allocator.  Sonny Rao <sonny@burdell.org> reported problems sometime back on
POWER5 boxes, when the last cpu on the nodes were being offlined.  We could
not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down.  Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go
down, the array_caches (shared, alien) and the kmem_list3 of the node were
being freed (kfree) with the kmem_list3 lock held.  If the l3 or the
array_caches were to come from the same cache being cleared, we hit on
badness.

This patch cleans up the locking in cpu_up and cpu_down path.  We cannot
really free l3 on cpu down because, there is no node offlining yet and even
though a cpu is not yet up, node local memory can be allocated for it.  So l3s
are usually allocated at keme_cache_create and destroyed at
kmem_cache_destroy.  Hence, we don't need cachep->spinlock protection to get
to the cachep->nodelist[nodeid] either.

Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 4
dbench process running all the time.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai ca3b9b9173 [PATCH] NUMA slab locking fixes: irq disabling from cahep->spinlock to l3 lock
Earlier, we had to disable on chip interrupts while taking the
cachep->spinlock because, at cache_grow, on every addition of a slab to a slab
cache, we incremented colour_next which was protected by the cachep->spinlock,
and cache_grow could occur at interrupt context.  Since, now we protect the
per-node colour_next with the node's list_lock, we do not need to disable on
chip interrupts while taking the per-cache spinlock, but we just need to
disable interrupts when taking the per-node kmem_list3 list_lock.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Ravikiran G Thirumalai 2e1217cf96 [PATCH] NUMA slab locking fixes: move color_next to l3
colour_next is used as an index to add a colouring offset to a new slab in the
cache (colour_off * colour_next).  Now with the NUMA aware slab allocator, it
makes sense to colour slabs added on the same node sequentially with
colour_next.

This patch moves the colouring index "colour_next" per-node by placing it on
kmem_list3 rather than kmem_cache.

This also helps simplify locking for CPU up and down paths.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:53 -08:00
Randy Dunlap ee13d785ea [PATCH] slab: fix sparse warning
mm/slab.c:1522:13: error: incompatible types for operation (&)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Randy.Dunlap a70773ddb9 [PATCH] mm/slab: add kernel-doc for one function
Fix kernel-doc for calculate_slab_order().

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg 7fd6b14130 [PATCH] slab: fix kzalloc and kstrdup caller report for CONFIG_DEBUG_SLAB
Fix kzalloc() and kstrdup() caller report for CONFIG_DEBUG_SLAB.  We must
pass the caller to __cache_alloc() instead of directly doing
__builtin_return_address(0) there; otherwise kzalloc() and kstrdup() are
reported as the allocation site instead of the real one.

Thanks to Valdis Kletnieks for reporting the problem and Steven Rostedt for
the original idea.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg 343e0d7a93 [PATCH] slab: replace kmem_cache_t with struct kmem_cache
Replace uses of kmem_cache_t with proper struct kmem_cache in mm/slab.c.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg 9a2dba4b49 [PATCH] slab: rename ac_data to cpu_cache_get
Rename the ac_data() function to more descriptive cpu_cache_get().

Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg 6ed5eb2211 [PATCH] slab: extract virt_to_{cache|slab}
Introduce virt_to_cache() and virt_to_slab() functions to reduce duplicate
code and introduce a proper abstraction should we want to support other kind
of mapping for address to slab and cache (eg.  for vmalloc() or I/O memory).

Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:18 -08:00
Pekka Enberg 5295a74cc0 [PATCH] slab: reduce inlining
From: Manfred Spraul <manfred@colorfullife.com>

Reduce the amount of inline functions in slab to the functions that
are used in the hot path:

  - no inline for debug functions
  - no __always_inline, inline is already __always_inline
  - remove inline from a few numa support functions.

Before:

   text    data     bss     dec     hex filename
  13588     752      48   14388    3834 mm/slab.o (defconfig)
  16671    2492      48   19211    4b0b mm/slab.o (numa)

After:

   text    data     bss     dec     hex filename
  13366     752      48   14166    3756 mm/slab.o (defconfig)
  16230    2492      48   18770    4952 mm/slab.o (numa)

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Matthew Dobson 78d382d77c [PATCH] slab: extract slab_{put|get}_obj
Create two helper functions slab_get_obj() and slab_put_obj() to replace
duplicated code in mm/slab.c

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Matthew Dobson 12dd36faec [PATCH] slab: extract slab_destroy_objs()
Create a helper function, slab_destroy_objs() which called from
slab_destroy().  This makes slab_destroy() smaller and more readable, and
moves ifdefs outside the function body.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Steven Rostedt fbaccacff1 [PATCH] slab: cache_estimate cleanup
Clean up cache_estimate() in mm/slab.c and improves the algorithm from O(n) to
O(1).  We first calculate the maximum number of objects a slab can hold after
struct slab and kmem_bufctl_t for each object has been given enough space.
After that, to respect alignment rules, we decrease the number of objects if
necessary.  As required padding is at most align-1 and memory of obj_size is
at least align, it is always enough to decrease number of objects by one.

The optimization was originally made by Balbir Singh with more improvements
from Steven Rostedt.  Manfred Spraul provider further modifications: no loop
at all for the off-slab case and added comments to explain the background.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Steven Rostedt 5ec8a847bb [PATCH] slab: have index_of bug at compile time
I noticed the code for index_of is a creative way of finding the cache
index using the compiler to optimize to a single hard coded number.  But
I couldn't help noticing that it uses two methods to let you know that
someone used it wrong.  One is at compile time (the correct way), and
the other is at run time (not good).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Christoph Lameter 18f820f655 [PATCH] slab: minor cleanup to kmem_cache_alloc_node
Clean up kmem_cache_alloc_node a bit.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Manfred Spraul 3dafccf227 [PATCH] slab: distinguish between object and buffer size
An object cache has two different object lengths:

  - the amount of memory available for the user (object size)
  - the amount of memory allocated internally (buffer size)

This patch does some renames to make the code reflect that better.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:17 -08:00
Benjamin LaHaise 9884fd8df1 [PATCH] Use 32 bit division in slab_put_obj()
Improve the performance of slab_put_obj().  Without the cast, gcc considers
ptrdiff_t a 64 bit signed integer and ends up emitting code to use a full
signed 128 bit divide on EM64T, which is substantially slower than a 32 bit
unsigned divide.

I noticed this when looking at the profile of a case where the slab balance
is just on edge and thrashes back and forth freeing a block.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Christoph Lameter 86c562a9d6 [PATCH] mm: optimize numa policy handling in slab allocator
Move the interrupt check from slab_node into ___cache_alloc and adds an
"unlikely()" to avoid pipeline stalls on some architectures.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00