The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node. If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).
However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node
then we will currently do some reclaim before returning NULL. The caller
may want memory from other nodes before reclaim should be triggered. (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).
There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.
So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
waking up kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These patches introduced new switch statements which are indented contrary
to the concensus in mm/*.c. Fix them up to match that concensus.
[PATCH] node local per-cpu-pages
[PATCH] ZVC: Scale thresholds depending on the size of the system
commit e7c8d5c995
commit df9ecaba3f
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long. This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.
In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:
1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
uses bit_set which in generic code takes an int.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
drain_node_pages() currently drains the complete pageset of all pages. If
there are a large number of pages in the queues then we may hold off
interrupts for too long.
Duplicate the method used in free_hot_cold_page. Only drain pcp->batch
pages at one time.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OOM can panic due to the processes stuck in __alloc_pages() doing infinite
rebalance loop while no memory can be reclaimed. OOM killer tries to kill
some processes, but unfortunetaly, rebalance label was moved by someone
below the TIF_MEMDIE check, so buddy allocator doesn't see that process is
OOM-killed and it can simply fail the allocation :/
Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
memory allocation tricks triggered OOM panic.
Signed-off-by: Denis Lunev <den@sw.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an arch_alloc_page to match arch_free_page.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.
Remembers the zones in a zonelist that were short of free memory in the
last second. And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)
Recent changes:
This differs in a significant way from a similar patch that I
posted a week ago. Now, instead of having a nodemask_t of
recently full nodes, I have a bitmask of recently full zones.
This solves a problem that last weeks patch had, which on
systems with multiple zones per node (such as DMA zone) would
take seeing any of these zones full as meaning that all zones
on that node were full.
Also I changed names - from "zonelist faster" to "zonelist cache",
as that seemed to better convey what we're doing here - caching
some of the key zonelist state (for faster access.)
See below for some performance benchmark results. After all that
discussion with David on why I didn't need them, I went and got
some ;). I wanted to verify that I had not hurt the normal case
of memory allocation noticeably. At least for my one little
microbenchmark, I found (1) the normal case wasn't affected, and
(2) workloads that forced scanning across multiple nodes for
memory improved up to 10% fewer System CPU cycles and lower
elapsed clock time ('sys' and 'real'). Good. See details, below.
I didn't have the logic in get_page_from_freelist() for various
full nodes and zone reclaim failures correct. That should be
fixed up now - notice the new goto labels zonelist_scan,
this_zone_full, and try_next_zone, in get_page_from_freelist().
There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:
1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
have seen real customer loads where the cost to scan the zonelist
was a problem, due to many nodes being full of memory before
we got to a node we could use. Or at least, I think we have.
This was related to me by another engineer, based on experiences
from some time past. So this is not guaranteed. Most likely, though.
The following approach should help such real numa systems just as
much as it helps fake numa systems, or any combination thereof.
2) The effort to distinguish fake from real numa, using node_distance,
so that we could cache a fake numa node and optimize choosing
it over equivalent distance fake nodes, while continuing to
properly scan all real nodes in distance order, was going to
require a nasty blob of zonelist and node distance munging.
The following approach has no new dependency on node distances or
zone sorting.
See comment in the patch below for a description of what it actually does.
Technical details of note (or controversy):
- See the use of "zlc_active" and "did_zlc_setup" below, to delay
adding any work for this new mechanism until we've looked at the
first zone in zonelist. I figured the odds of the first zone
having the memory we needed were high enough that we should just
look there, first, then get fancy only if we need to keep looking.
- Some odd hackery was needed to add items to struct zonelist, while
not tripping up the custom zonelists built by the mm/mempolicy.c
code for MPOL_BIND. My usual wordy comments below explain this.
Search for "MPOL_BIND".
- Some per-node data in the struct zonelist is now modified frequently,
with no locking. Multiple CPU cores on a node could hit and mangle
this data. The theory is that this is just performance hint data,
and the memory allocator will work just fine despite any such mangling.
The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
(a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
all be self correcting after at most a one second delay.
- This still does a linear scan of the same lengths as before. All
I've optimized is making the scan faster, not algorithmically
shorter. It is now able to scan a compact array of 'unsigned
short' in the case of many full nodes, so one cache line should
cover quite a few nodes, rather than each node hitting another
one or two new and distinct cache lines.
- If both Andi and Nick don't find this too complicated, I will be
(pleasantly) flabbergasted.
- I removed the comment claiming we only use one cachline's worth of
zonelist. We seem, at least in the fake numa case, to have put the
lie to that claim.
- I pay no attention to the various watermarks and such in this performance
hint. A node could be marked full for one watermark, and then skipped
over when searching for a page using a different watermark. I think
that's actually quite ok, as it will tend to slightly increase the
spreading of memory over other nodes, away from a memory stressed node.
===============
Performance - some benchmark results and analysis:
This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.
Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.) System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.
Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added. The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.
number 2.6.18-mm3 zonelist-cache delta (< 0 good) percent
GBs N ------------ -------------- ---------------- systime
mem threads sys user real sys user real sys user real better
12 1 153 24 177 151 24 176 -2 0 -1 1%
12 19 99 22 8 99 22 8 0 0 0 0%
12 37 111 25 6 112 25 6 1 0 0 -0%
12 55 115 25 5 110 23 5 -5 -2 0 4%
38 1 502 74 576 497 73 570 -5 -1 -6 0%
38 19 426 78 48 373 76 39 -53 -2 -9 12%
38 37 544 83 36 547 82 36 3 -1 0 -0%
38 55 501 77 23 511 80 24 10 3 1 -1%
64 1 917 125 1042 890 124 1014 -27 -1 -28 2%
64 19 1118 138 119 965 141 103 -153 3 -16 13%
64 37 1202 151 94 1136 150 81 -66 -1 -13 5%
64 55 1118 141 61 1072 140 58 -46 -1 -3 4%
90 1 1342 177 1519 1275 174 1450 -67 -3 -69 4%
90 19 2392 199 192 2116 189 176 -276 -10 -16 11%
90 37 3313 238 175 2972 225 145 -341 -13 -30 10%
90 55 1948 210 104 1843 213 100 -105 3 -4 5%
Notes:
1) This test ran a memory hog program that started a specified number N of
threads, and had each thread allocate and touch 1/N'th of
the total memory to be used in the test run in a single loop,
writing a constant word to memory, one store every 4096 bytes.
Watching this test during some earlier trial runs, I would see
each of these threads sit down on one CPU and stay there, for
the remainder of the pass, a different CPU for each thread.
2) The 'real' column is not comparable to the 'sys' or 'user' columns.
The 'real' column is seconds wall clock time elapsed, from beginning
to end of that test pass. The 'sys' and 'user' columns are total
CPU seconds spent on that test pass. For a 19 thread test run,
for example, the sum of 'sys' and 'user' could be up to 19 times the
number of 'real' elapsed wall clock seconds.
3) Tests were run on a fresh, single-user boot, to minimize the amount
of memory already in use at the start of the test, and to minimize
the amount of background activity that might interfere.
4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
5) Notice that the 'real' time gets large for the single thread runs, even
though the measured 'sys' and 'user' times are modest. I'm not sure what
that means - probably something to do with it being slow for one thread to
be accessing memory along ways away. Perhaps the fake numa system, running
ostensibly the same workload, would not show this substantial degradation
of 'real' time for one thread on many nodes -- lets hope not.
6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
ran quite efficiently, as one might expect. Each pair of threads needed
to allocate and touch the memory on the node the two threads shared, a
pleasantly parallizable workload.
7) The intermediate thread count passes, when asking for alot of memory forcing
them to go to a few neighboring nodes, improved the most with this zonelist
caching patch.
Conclusions:
* This zonelist cache patch probably makes little difference one way or the
other for most workloads on real numa hardware, if those workloads avoid
heavy off node allocations.
* For memory intensive workloads requiring substantial off-node allocations
on real numa hardware, this patch improves both kernel and elapsed timings
up to ten per-cent.
* For fake numa systems, I'm optimistic, but will have to leave that up to
Rohit Seth to actually test (once I get him a 2.6.18 backport.)
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zone table is mostly not needed. If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.
In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field. In all of
the above cases there will be no need at all for the zone table.
The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags. In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.
For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table. The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- s/freeliest/freelist/ spelling fix
- Check for NULL *z zone seems useless - even if it could happen, so
what? Perhaps we should have a check later on if we are faced with an
allocation request that is not allowed to fail - shouldn't that be a
serious kernel error, passing an empty zonelist with a mandate to not
fail?
- Initializing 'z' to zonelist->zones can wait until after the first
get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
loop, so let's initialize 'z' there, in a 'for' loop. Seems clearer.
- Remove superfluous braces around a break
- Fix a couple errant spaces
- Adjust indentation on the cpuset_zone_allowed() check, to match the
lines just before it -- seems easier to read in this case.
- Add another set of braces to the zone_watermark_ok logic
From: Paul Jackson <pj@sgi.com>
Backout one item from a previous "memory page_alloc minor cleanups" patch.
Until and unless we are certain that no one can ever pass an empty zonelist
to __alloc_pages(), this check for an empty zonelist (or some BUG
equivalent) is essential. The code in get_page_from_freelist() blow ups if
passed an empty zonelist.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_min_pfn_for_node() and find_min_pfn_with_active_regions() both
depend on a sorted early_node_map[]. However, sort_node_map() is being
called after fin_min_pfn_with_active_regions() in
free_area_init_nodes().
In most cases, this is ok, but on at least one x86_64, the SRAT table
caused the E820 ranges to be registered out of order. This gave the
wrong values for the min PFN range resulting in some pages not being
initialised.
This patch sorts the early_node_map in find_min_pfn_for_node(). It has
been boot tested on x86, x86_64, ppc64 and ia64.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Un-needed add-store operation wastes a few bytes.
8 bytes wasted with -O2, on a ppc.
Signed-off-by: nkalmala <nkalmala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
absent_pages_in_range() made the assumption that users of the
arch-independent zone-sizing API would not care about holes beyound the end
of physical memory. This was not the case and was "fixed" in a patch
called "Account for holes that are outside the range of physical memory".
However, when given a range that started before a hole in "real" memory and
ended beyond the end of memory, it would get the result wrong. The bug is
in mainline but a patch is below.
It has been tested successfully on a number of machines and architectures.
Additional credit to Keith Mannthey for discovering the problem, helping
identify the correct fix and confirming it Worked For Him.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.
The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently. In balance_pgdat, we keep a separate
priority record per zone in a local array. In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.
Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low. They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.
From: Andrew Morton <akpm@osdl.org>
__zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list. Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority ir large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.
Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.
This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
possible to remove that now, and to just start out at DEF_PRIORITY?
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Qooting Adrian:
- net/sunrpc/svc.c uses highest_possible_node_id()
- include/linux/nodemask.h says highest_possible_node_id() is
out-of-line #if MAX_NUMNODES > 1
- the out-of-line highest_possible_node_id() is in lib/cpumask.c
- lib/Makefile: lib-$(CONFIG_SMP) += cpumask.o
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y, CONFIG_SMP=n, CONFIG_SUNRPC=y
-> highest_possible_node_id() is used in net/sunrpc/svc.c
CONFIG_NODES_SHIFT defined and > 0
-> include/linux/numa.h: MAX_NUMNODES > 1
-> compile error
The bug is not present on architectures where ARCH_DISCONTIGMEM_ENABLE
depends on NUMA (but m32r isn't the only affected architecture).
So move the function into page_alloc.c
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.
The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.
This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.
Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the lock debug checks below the page reserved checks. Also, having
debug_check_no_locks_freed in kernel_map_pages is wrong.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
After the PG_reserved check was added, arch_free_page was being called in the
wrong place (it could be called for a page we don't actually want to free).
Fix that.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
memmap_zone_idx() is not used anymore. It was required by an earlier
version of
account-for-memmap-and-optionally-the-kernel-image-as-holes.patch but not
any more.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix kernel-doc and function declaration (missing "void") in
mm/page_alloc.c.
Add mm/page_alloc.c to kernel-api.tmpl in DocBook.
mm/page_alloc.c:2589:38: warning: non-ANSI function declaration of function 'remove_all_active_ranges'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Having min be a signed quantity means gcc can't turn high latency divides
into shifts. There happen to be two such divides for GFP_ATOMIC (ie.
networking, ie. important) allocations, one of which depends on the other.
Fixing this makes code smaller as a bonus.
Shame on somebody (probably me).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use NULL instead of 0 for pointer value, eliminate sparse warnings.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We do not need to allocate pagesets for unpopulated zones.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the node in order to optimize zone_to_nid.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NUMA_BUILD constant is always available and will be set to 1 on
NUMA_BUILDs. That way checks valid only under CONFIG_NUMA can easily be done
without #ifdef CONFIG_NUMA
F.e.
if (NUMA_BUILD && <numa_condition>) {
...
}
[akpm: not a thing we'd normally do, but CONFIG_NUMA is special: it is
causing ifdef explosion in core kernel, so let's see if this is a comfortable
way in whcih to control that]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>