Commit Graph

2180 Commits

Author SHA1 Message Date
Christoph Lameter
1480a540c9 [PATCH] SwapMig: add_to_swap() avoid atomic allocations
Add gfp_mask to add_to_swap

add_to_swap does allocations with GFP_ATOMIC in order not to interfere with
swapping.  During migration we may have use add_to_swap extensively which may
lead to out of memory errors.

This patch makes add_to_swap take a parameter that specifies the gfp mask.
The page migration code can then make add_to_swap use GFP_KERNEL.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Christoph Lameter
8419c31810 [PATCH] SwapMig: CONFIG_MIGRATION fixes
Move move_to_lru, putback_lru_pages and isolate_lru in section surrounded by
CONFIG_MIGRATION saving some codesize for single processor kernels.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Christoph Lameter
39743889aa [PATCH] Swap Migration V5: sys_migrate_pages interface
sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

The intent of sys_migrate is to migrate memory of a process.  A process may
have migrated to another node.  Memory was allocated optimally for the prior
context.  sys_migrate_pages allows to shift the memory to the new node.

sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory.  Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed.  However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends.  The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
implementation.

The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset).  When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets.  The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.

Patch supports ia64, i386 and x86_64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Christoph Lameter
dc9aa5b9d6 [PATCH] Swap Migration V5: MPOL_MF_MOVE interface
Add page migration support via swap to the NUMA policy layer

This patch adds page migration support to the NUMA policy layer.  An
additional flag MPOL_MF_MOVE is introduced for mbind.  If MPOL_MF_MOVE is
specified then pages that do not conform to the memory policy will be evicted
from memory.  When they get pages back in new pages will be allocated
following the numa policy.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:41 -08:00
Christoph Lameter
7cbe34cf86 [PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration support
Include page migration if the system is NUMA or having a memory model that
allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM).

And:
- Only include lru_add_drain_per_cpu if building for an SMP system.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:41 -08:00
Christoph Lameter
49d2e9cc45 [PATCH] Swap Migration V5: migrate_pages() function
This adds the basic page migration function with a minimal implementation that
only allows the eviction of pages to swap space.

Page eviction and migration may be useful to migrate pages, to suspend
programs or for remapping single pages (useful for faulty pages or pages with
soft ECC failures)

The process is as follows:

The function wanting to migrate pages must first build a list of pages to be
migrated or evicted and take them off the lru lists via isolate_lru_page().
isolate_lru_page determines that a page is freeable based on the LRU bit set.

Then the actual migration or swapout can happen by calling migrate_pages().

migrate_pages does its best to migrate or swapout the pages and does multiple
passes over the list.  Some pages may only be swappable if they are not dirty.
 migrate_pages may start writing out dirty pages in the initial passes over
the pages.  However, migrate_pages may not be able to migrate or evict all
pages for a variety of reasons.

The remaining pages may be returned to the LRU lists using putback_lru_pages().

Changelog V4->V5:
- Use the lru caches to return pages to the LRU

Changelog V3->V4:
- Restructure code so that applying patches to support full migration does
  require minimal changes. Rename swapout_pages() to migrate_pages().

Changelog V2->V3:
- Extract common code from shrink_list() and swapout_pages()

Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Michael Kerrisk" <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:41 -08:00
Christoph Lameter
930d915252 [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap
Add PF_SWAPWRITE to control a processes permission to write to swap.

- Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
  and pdflush

- Set PF_SWAPWRITE flag for kswapd and pdflush

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:41 -08:00
Christoph Lameter
21eac81f25 [PATCH] Swap Migration V5: LRU operations
This is the start of the `swap migration' patch series.

Swap migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running.  This means that the
virtual addresses that the process sees do not change.  However, the system
rearranges the physical location of those pages.

The main intent of page migration patches here is to reduce the latency of
memory access by moving pages near to the processor where the process
accessing that memory is running.

The patchset allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while
setting a new memory policy.

The pages of process can also be relocated from another process using the
sys_migrate_pages() function call.  Requires CAP_SYS_ADMIN.  The migrate_pages
function call takes two sets of nodes and moves pages of a process that are
located on the from nodes to the destination nodes.

Manual migration is very useful if for example the scheduler has relocated a
process to a processor on a distant node.  A batch scheduler or an
administrator can detect the situation and move the pages of the process
nearer to the new processor.

sys_migrate_pages() could be used on non-numa machines as well, to force all
of a particualr process's pages out to swap, if someone thinks that's useful.

Larger installations usually partition the system using cpusets into sections
of nodes.  Paul has equipped cpusets with the ability to move pages when a
task is moved to another cpuset.  This allows automatic control over locality
of a process.  If a task is moved to a new cpuset then also all its pages are
moved with it so that the performance of the process does not sink
dramatically (as is the case today).

Swap migration works by simply evicting the page.  The pages must be faulted
back in.  The pages are then typically reallocated by the system near the node
where the process is executing.

For swap migration the destination of the move is controlled by the allocation
policy.  Cpusets set the allocation policy before calling sys_migrate_pages()
in order to move the pages as intended.

No allocation policy changes are performed for sys_migrate_pages().  This
means that the pages may not faulted in to the specified nodes if no
allocation policy was set by other means.  The pages will just end up near the
node where the fault occurred.

There's another patch series in the pipeline which implements "direct
migration".

The direct migration patchset extends the migration functionality to avoid
going through swap.  The destination node of the relation is controllable
during the actual moving of pages.  The crutch of using the allocation policy
to relocate is not necessary and the pages are moved directly to the target.
Its also faster since swap is not used.

And sys_migrate_pages() can then move pages directly to the specified node.
Implement functions to isolate pages from the LRU and put them back later.

This patch:

An earlier implementation was provided by Hirokazu Takahashi
<taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the
memory hotplug project.

From: Magnus

This breaks out isolate_lru_page() and putpack_lru_page().  Needed for swap
migration.

Signed-off-by: Magnus Damm <magnus.damm@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:41 -08:00
Nick Piggin
48db57f8ff [PATCH] mm: free_pages opt
Try to streamline free_pages_bulk by ensuring callers don't pass in a
'count' that exceeds the list size.

Some cleanups:
Rename __free_pages_bulk to __free_one_page.
Put the page list manipulation from __free_pages_ok into free_one_page.
Make __free_pages_ok static.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Nick Piggin
23316bc86f [PATCH] mm: cleanup zone_pcp
Use zone_pcp everywhere even though NUMA code "knows" the internal details
of the zone.  Stop other people trying to copy, and it looks nicer.

Also, only print the pagesets of online cpus in zoneinfo.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Rohit Seth
8ad4b1fb82 [PATCH] Make high and batch sizes of per_cpu_pagelists configurable
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists.  This patch makes these two
variables configurable through /proc interface.

A new tunable /proc/sys/vm/percpu_pagelist_fraction is added.  This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list.  The min value for this is 8.  It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.

The batch value of each per cpu pagelist is also updated as a result.  It
is set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Andrew Morton
9d0243bca3 [PATCH] drop-pagecache
Add /proc/sys/vm/drop_caches.  When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can.  THis
operation requires root permissions.

It won't drop dirty data, so the user should run `sync' first.

Caveats:

a) Holds inode_lock for exorbitant amounts of time.

b) Needs to be taught about NUMA nodes: propagate these all the way through
   so the discarding can be controlled on a per-node basis.

This is a debugging feature: useful for getting consistent results between
filesystem benchmarks.  We could possibly put it under a config option, but
it's less than 300 bytes.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Christoph Lameter
bec6b0c89b [PATCH] slab: remove nested #ifdef CONFIG_NUMA
For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
CONFIG_NUMA in the page allocator.  Remove innermost #ifdef CONFIG_NUMA

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Pekka Enberg
b28a02de8c [PATCH] slab: fix code formatting
The slab allocator code is inconsistent in coding style and messy.  For this
patch, I ran Lindent for mm/slab.c and fixed up goofs by hand.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:39 -08:00
Pekka Enberg
4d268eba11 [PATCH] slab: extract slab order calculation to separate function
This patch moves the ugly loop that determines the 'optimal' size (page order)
of cache slabs from kmem_cache_create() to a separate function and cleans it
up a bit.

Thanks to Matthew Wilcox for the help with this patch.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:39 -08:00
Pekka Enberg
85289f98dd [PATCH] slab: extract slabinfo header printing to separate function
This patch extracts slabinfo header printing to a separate function
print_slabinfo_header() to make s_start() more readable.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:39 -08:00
Pekka Enberg
f9f7500521 [PATCH] slab: remove unused align parameter from alloc_percpu
__alloc_percpu and alloc_percpu both take an 'align' argument which is
completely ignored.  snmp6_mib_init() in net/ipv6/af_inet6.c attempts to use
it, but it will be ignored.  Therefore, remove the 'align' argument and fixup
the lone caller.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:39 -08:00
Andrew Morton
84c2008af0 [PATCH] revert "mm: page_state fixes"
Hugh says:

page_alloc_cpu_notify() specifically contains code to

 		/* Add dead cpu's page_states to our own. */

which handles this more efficiently.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:38 -08:00
Arnd Bergmann
67207b9664 [PATCH] spufs: The SPU file system, base
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.

This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.

The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.

See Documentation/filesystems/spufs.txt on how to use
spufs.

Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-09 14:49:12 +11:00
Andrew Morton
22905f775d identify multipage ->writepages() calls
NFS needs to be able to distinguish between single-page ->writepage() calls and
 multipage ->writepages() calls.

 For the single-page writepage calls NFS can kick off the I/O within the
 context of ->writepage().

 For multipage ->writepages calls, nfs_writepage() will leave the I/O pending
 and nfs_writepages() will kick off the I/O when it all has been queued up
 within NFS.

 Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
 Signed-off-by: Andrew Morton <akpm@osdl.org>
 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-01-06 14:58:38 -05:00
Rafael J. Wysocki
3a291a20bd [PATCH] mm: add a new function (needed for swap suspend)
This adds the function get_swap_page_of_type() allowing us to specify an index
in swap_info[] and select a swap_info_struct structure to be used for
allocating a swap page.

This function (or another one of similar functionality) will be necessary for
implementing the image-writing part of swsusp in the user space.   It can also
be used for simplifying the current in-kernel implementation of the
image-writing part of swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:43 -08:00
Anton Blanchard
c898ec16e8 [PATCH] allow flatmem to be disabled when only sparsemem is implemented
On architectures that implement sparsemem but not discontigmem we want to
be able to hide the flatmem option in some cases.  On ppc64 for example,
when we select NUMA we must not select flatmem.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:37 -08:00
David Howells
b0e15190ea [PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU
The attached patch makes the SYSV IPC shared memory facilities use the new
ramfs facilities on a no-MMU kernel.

The following changes are made:

 (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to
     allow the IPC SHM facilities to commune with the tiny-shmem and shmem
     code.

 (2) ramfs files now need resizing using do_truncate() rather than by modifying
     the inode size directly (see shmem_file_setup()). This causes ramfs to
     attempt to bind a block of pages of sufficient size to the inode.

 (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:32 -08:00
Nick Piggin
a74609fafa [PATCH] mm: page_state opt
Optimise page_state manipulations by introducing interrupt unsafe accessors
to page_state fields.  Callers must provide their own locking (either
disable interrupts or not update from interrupt context).

Switch over the hot callsites that can easily be moved under interrupts off
sections.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:29 -08:00
Christoph Lameter
070f80326a [PATCH] build_zonelists_node(): rename args
Give j and r meaningful names.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:28 -08:00