Page migration currently does not check if the target of the move contains
nodes that that are invalid (if root attempts to migrate pages)
and may try to allocate from invalid nodes if these are specified
leading to oopses.
Return -EINVAL if an offline node is specified.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not BUG() if we cannot register a slab with sysfs. Just print an error.
The only consequence of not registering is that the slab cache is not
visible via /sys/slab. A BUG() may not be visible that early during boot
and we have had multiple issues here already.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In migration fallback path, write_page() or lock_page() will be called.
This causes sleep with holding rcu_read_lock().
For avoding that, just do rcu_lock if the page is Anon.(this is enough.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NUMA layer only supports NUMA policies for the highest zone. When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE. The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.
This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.
The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64. No abnormal results were seen in
kernbench, tbench, dbench or hackbench. It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.
akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on. That solution
would only use a single zonelist per node and filter on the fly. That may
help performance and also help to make memory policies work better."
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print a big fat warning and do what is necessary to continue if a node is
marked as up (meaning either node is online (upstream) or node has memory
(Andrew's tree)) but allocations from the node do not succeed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems a simple mistake was made when converting follow_hugetlb_page()
over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
83c54070ee).
By using the wrong bitmask, hugetlb_fault() failures are not being
recognized. This results in an infinite loop whenever follow_hugetlb_page
is involved in a failed fault.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Skip calling cache_free_alien() when the platform is not numa capable.
This will avoid cache misses that happen while accessing slabp (which is
per page memory reference) to get nodeid. Instead use a global variable to
skip the call, which is mostly likely to be present in the cache.
This gives a 0.8% performance boost with the database oltp workload on a
quad-core SMP platform and by any means the number is not small :)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new exec code inserts an accounted vma into an mm struct which is not
current->mm. The existing memory check code has a hard coded assumption
that this does not happen as does the security code.
As the correct mm is known we pass the mm to the security method and the
helper function. A new security test is added for the case where we need
to pass the mm and the existing one is modified to pass current->mm to
avoid the need to change large amounts of code.
(Thanks to Tobias for fixing rejects and testing)
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Cc: James Morris <jmorris@redhat.com>
Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lumpy reclaim works by selecting a lead page from the LRU list and then
selecting pages for reclaim from the order-aligned area of pages. In the
situation were all pages in that region are inactive and not referenced by any
process over time, it works well.
In the situation where there is even light load on the system, the pages may
not free quickly. Out of a area of 1024 pages, maybe only 950 of them are
freed when the allocation attempt occurs because lumpy reclaim returned early.
This patch alters the behaviour of direct reclaim for large contiguous
blocks.
The first attempt to call shrink_page_list() is asynchronous but if it fails,
the pages are submitted a second time and the calling process waits for the IO
to complete. This may stall allocators waiting for contiguous memory but that
should be expected behaviour for high-order users. It is preferable behaviour
to potentially queueing unnecessary areas for IO. Note that kswapd will not
stall in this fashion.
[apw@shadowen.org: update to version 2]
[apw@shadowen.org: update to version 3]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As pointed out by Mel when reclaim is applied at higher orders a significant
amount of IO may be started. As this takes finite time to drain reclaim will
consider more areas than ultimatly needed to satisfy the request. This leads
to more reclaim than strictly required and reduced success rates.
I was able to confirm Mel's test results on systems locally. These show that
even under light load the success rates drop off far more than expected.
Testing with a modified version of his patch (which follows) I was able to
allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test
passes sequentially following system boot (the system has 29 hugepages in
ZONE_MOVABLE):
2.6.23-rc1 11 8 6 7 7
sync_lumpy 28 28 29 29 26
These show that although hugely better than the near 0% success normally
expected we can only allocate about a 1/4 of the zone. Using synchronous
reclaim for these allocations we get close to 100% as expected.
I have also run our standard high order tests and these show no regressions in
allocation success rates at rest, and some significant improvements under
load.
This patch:
We are transitioning pages from active to inactive in clear_active_flags,
those need counting as PGDEACTIVATE vm events.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c:
Initializing HighMem for node 0 (00038000:00100000)
Initializing HighMem for node 1 (00100000:001ffe00)
------------[ cut here ]------------
kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456!
[...]
This occurs because the section to node id mapping is not being
setup correctly during init under SPARSEMEM_STATIC, leading to an
attempt to free pages from all nodes into the zones on node 0.
When the zone_table[] was removed in the following commit, a new
section to node mapping table was introduced:
commit 89689ae7f9
[PATCH] Get rid of zone_table[]
That conversion inadvertantly only initialised the node mapping in
SPARSEMEM_EXTREME. Ensure we initialise the node mapping in
SPARSEMEM_STATIC.
[akpm@linux-foundation.org: make the stubs static inline]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n
sysace: HDIO_GETGEO has it's own method for ages
drivers/block/cpqarray.c: better error handling and kmalloc + memset conversion to k[cz]alloc
drivers/block/cciss.c: kmalloc + memset conversion to kzalloc
Clean up duplicate includes in drivers/block/
Fix remap handling by blktrace
[PATCH] remove mm/filemap.c:file_send_actor()
The dynamic dma kmalloc creation can run into trouble if a
GFP_ATOMIC allocation is the first one performed for a certain size
of dma kmalloc slab.
- Move the adding of the slab to sysfs into a workqueue
(sysfs does GFP_KERNEL allocations)
- Do not call kmem_cache_destroy() (uses slub_lock)
- Only acquire the slub_lock once and--if we cannot wait--do a trylock.
This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC)
for a range of sizes failing due to another process holding the slub_lock.
However, we only need to acquire the spinlock once in order to establish
each power of two DMA kmalloc cache. The possible conflict is with the
slub_lock taken during slab management actions (create / remove slab cache).
It is rather typical that a driver will first fill its buffers using
GFP_KERNEL allocations which will wait until the slub_lock can be acquired.
Drivers will also create its slab caches first outside of an atomic
context before starting to use atomic kmalloc from an interrupt context.
If there are any failures then they will occur early after boot or when
loading of multiple drivers concurrently. Drivers can already accomodate
failures of GFP_ATOMIC for other reasons. Retries will then create the slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
The MAX_PARTIAL checks were supposed to be an optimization. However, slab
shrinking is a manually triggered process either through running slabinfo
or by the kernel calling kmem_cache_shrink.
If one really wants to shrink a slab then all operations should be done
regardless of the size of the partial list. This also fixes an issue that
could surface if the number of partial slabs was initially above MAX_PARTIAL
in kmem_cache_shrink and later drops below MAX_PARTIAL through the
elimination of empty slabs on the partial list (rare). In that case a few
slabs may be left off the partial list (and only be put back when they
are empty).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
out_of_memory() may be called when an allocation is failing and the direct
reclaim is not making any progress. This does not take into account the
requested order of the allocation. If the request if for an order larger
than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation
because the kernel makes no guarantees about those allocations succeeding.
This false OOM situation can occur if a user is trying to grow the hugepage
pool in a script like;
#!/bin/bash
REQUIRED=$1
echo 1 > /proc/sys/vm/hugepages_treat_as_movable
echo $REQUIRED > /proc/sys/vm/nr_hugepages
ACTUAL=`cat /proc/sys/vm/nr_hugepages`
while [ $REQUIRED -ne $ACTUAL ]; do
echo Huge page pool at $ACTUAL growing to $REQUIRED
echo $REQUIRED > /proc/sys/vm/nr_hugepages
ACTUAL=`cat /proc/sys/vm/nr_hugepages`
sleep 1
done
This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM
easily on 2.6.23-rc1. This patch will fail an allocation for an order above
PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We ClearSlabDebug() before the last SlabDebug() check. Clear it later.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Introduce CONFIG_SUSPEND representing the ability to enter system sleep
states, such as the ACPI S3 state, and allow the user to choose SUSPEND
and HIBERNATION independently of each other.
Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
been chosen and the kernel is intended for SMP systems.
Also, introduce CONFIG_PM_SLEEP which is automatically selected if
CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
code needed for both suspend and hibernation.
The top-level power management headers and the ACPI code related to
suspend and hibernation are modified to use the new definitions (the
changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
the number of ifdefs).
There are many other files in which CONFIG_PM can be replaced with
CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
the future.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>