This makes nr_dirty a per zone counter. Looping over all processors is
avoided during writeback state determination.
The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones. Someone more familiar with
NFS should probably review what I have done.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_page_table_pages to a per zone counter
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Allows reclaim to access counter without looping over processor counts.
- Allows accurate statistics on how many pages are used in a zone by
the slab. This may become useful to balance slab allocations over
various zones.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone. Therefore we had to scan in
intervals to figure out if any pages were unmapped.
With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone. So we can simply skip the
reclaim if there is an insufficient number of unmapped pages. We use
SWAP_CLUSTER_MAX as the boundary.
Drop all support for /proc/sys/vm/zone_reclaim_interval.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
calculation as the number of mapped pagecache pages. However, that is not
true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
separates those and therefore allows an accurate tracking of the anonymous
pages per zone.
It then becomes possible to determine the number of unmapped pages per zone
and we can avoid scanning for unmapped pages if there are none.
Also it may now be possible to determine the mapped/unmapped ratio in
get_dirty_limit. Isnt the number of anonymous pages irrelevant in that
calculation?
Note that this will change the meaning of the number of mapped pages reported
in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
user space tools that monitor these counters! NR_FILE_MAPPED works like
NR_FILE_DIRTY. It is only valid for pagecache pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently a single atomic variable is used to establish the size of the page
cache in the whole machine. The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.
Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.
Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.
We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).
We replace the use of nr_mapped in various kernel locations. This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Per zone counter infrastructure
The counters that we currently have for the VM are split per processor. The
processor however has not much to do with the zone these pages belong to. We
cannot tell f.e. how many ZONE_DMA pages are dirty.
So we are blind to potentially inbalances in the usage of memory in various
zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
particular node. If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system. For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.
Another example is zone reclaim. We do not know how many unmapped pages exist
per zone. So we just have to try to reclaim. If it is not working then we
pause and try again later. It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone. This patchset allows the determination
of the number of unmapped pages per zone. We can remove the zone reclaim
interval with the counters introduced here.
Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.
The counter framework here implements differential counters for each processor
in struct zone. The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.
Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array. VM functions can
access the counts by simply indexing a global or zone specific array.
The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.
Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state. The second group of
functions can be called if it is known that interrupts are disabled.
Special optimized increment and decrement functions are provided. These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.
We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state(). In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When add_zone() is called against empty zone (not populated zone), we have to
initialize the zone which didn't initialize at boot time. But,
init_currently_empty_zone() may fail due to allocation of wait table. So,
this patch is to catch its error code.
Changes against wait_table is in the next patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is just to rename from wait_table_size() to wait_table_hash_nr_entries().
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.infradead.org/hdrcleanup-2.6: (63 commits)
[S390] __FD_foo definitions.
Switch to __s32 types in joystick.h instead of C99 types for consistency.
Add <sys/types.h> to headers included for userspace in <linux/input.h>
Move inclusion of <linux/compat.h> out of user scope in asm-x86_64/mtrr.h
Remove struct fddi_statistics from user view in <linux/if_fddi.h>
Move user-visible parts of drivers/s390/crypto/z90crypt.h to include/asm-s390
Revert include/media changes: Mauro says those ioctls are only used in-kernel(!)
Include <linux/types.h> and use __uXX types in <linux/cramfs_fs.h>
Use __uXX types in <linux/i2o_dev.h>, include <linux/ioctl.h> too
Remove private struct dx_hash_info from public view in <linux/ext3_fs.h>
Include <linux/types.h> and use __uXX types in <linux/affs_hardblocks.h>
Use __uXX types in <linux/divert.h> for struct divert_blk et al.
Use __u32 for elf_addr_t in <asm-powerpc/elf.h>, not u32. It's user-visible.
Remove PPP_FCS from user view in <linux/ppp_defs.h>, remove __P mess entirely
Use __uXX types in user-visible structures in <linux/nbd.h>
Don't use 'u32' in user-visible struct ip_conntrack_old_tuple.
Use __uXX types for S390 DASD volume label definitions which are user-visible
S390 BIODASDREADCMB ioctl should use __u64 not u64 type.
Remove unneeded inclusion of <linux/time.h> from <linux/ufs_fs.h>
Fix private integer types used in V4L2 ioctls.
...
Manually resolve conflict in include/linux/mtd/physmap.h
From: Ralf Baechle <ralf@linux-mips.org>
<linux/mmzone.h> uses PAGE_SIZE, PAGE_SHIFT from <asm/page.h> without
including that header itself. For some sparsemem configurations this may
result in build errors like:
CC init/initramfs.o
In file included from include/linux/gfp.h:4,
from include/linux/slab.h:15,
from include/linux/percpu.h:4,
from include/linux/rcupdate.h:41,
from include/linux/dcache.h:10,
from include/linux/fs.h:226,
from init/initramfs.c:2:
include/linux/mmzone.h:498:22: warning: "PAGE_SHIFT" is not defined
In file included from include/linux/gfp.h:4,
from include/linux/slab.h:15,
from include/linux/percpu.h:4,
from include/linux/rcupdate.h:41,
from include/linux/dcache.h:10,
from include/linux/fs.h:226,
from init/initramfs.c:2:
include/linux/mmzone.h:526: error: `PAGE_SIZE' undeclared here (not in a function)
include/linux/mmzone.h: In function `__pfn_to_section':
include/linux/mmzone.h:573: error: `PAGE_SHIFT' undeclared (first use in this function)
include/linux/mmzone.h:573: error: (Each undeclared identifier is reported only once
include/linux/mmzone.h:573: error: for each function it appears in.)
include/linux/mmzone.h: In function `pfn_valid':
include/linux/mmzone.h:578: error: `PAGE_SHIFT' undeclared (first use in this function)
make[1]: *** [init/initramfs.o] Error 1
make: *** [init] Error 2
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Seems-reasonable-to: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andy added code to buddy allocator which does not require the zone's
endpoints to be aligned to MAX_ORDER. An issue is that the buddy allocator
requires the node_mem_map's endpoints to be MAX_ORDER aligned. Otherwise
__page_find_buddy could compute a buddy not in node_mem_map for partial
MAX_ORDER regions at zone's endpoints. page_is_buddy will detect that
these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
allocator and not part of zone). Of course the negative here is we could
waste a little memory but the positive is eliminating all the old checks
for zone boundary conditions.
SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
when SPARSEMEM is configured. ia64 VIRTUAL_MEM_MAP doesn't need the logic
either because the holes and endpoints are handled differently. This
leaves checking alloc_remap and other arches which privately allocate for
node_mem_map.
Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Helper functions for for_each_online_pgdat/for_each_zone look too big to be
inlined. Speed of these helper macro itself is not very important. (inner
loops are tend to do more work than this)
This patch make helper function to be out-of-lined.
inline out-of-line
.text 005c0680 005bf6a0
005c0680 - 005bf6a0 = FE0 = 4Kbytes.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By using for_each_online_pgdat(), pgdat_list is not necessary now. This patch
removes it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch defines for_each_online_pgdat() as a replacement of
for_each_pgdat()
Now, online nodes are managed by node_online_map. But for_each_pgdat()
uses pgdat_link to iterate over all nodes(pgdat). This means management
structure for online pgdat is duplicated.
I think using node_online_map for for_each_pgdat() is simple and sane
rather ather than pgdat_link. New macro is named as
for_each_online_pgdat(). Following patch will fix callers of
for_each_pgdat().
The bootmem allocater uses for_each_pgdat() before pgdat initialization. I
don't think it's sane. Following patch will fix it.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes zone_mem_map.
pfn_to_page uses pgdat, page_to_pfn uses zone. page_to_pfn can use pgdat
instead of zone, which is only one user of zone_mem_map. By modifing it,
we can remove zone_mem_map.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
GFP_ZONETYPES calculate from GFP_ZONEMASK
GFP_ZONETYPES's value is directly related to the value of GFP_ZONEMASK. It
takes one of two forms depending whether the top bit of GFP_ZONEMASK is a
'loner'. Supply both forms, enabling the loner.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
GFP_ZONETYPES define using GFP_ZONEMASK and add commentry
Add commentry explaining the optimisation that we can apply to GFP_ZONETYPES
when the leftmost bit is a 'loaner', it can only be set in isolation.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some bits for zone reclaim exists in 2.6.15 but they are not usable. This
patch fixes them up, removes unused code and makes zone reclaim usable.
Zone reclaim allows the reclaiming of pages from a zone if the number of
free pages falls below the watermarks even if other zones still have enough
pages available. Zone reclaim is of particular importance for NUMA
machines. It can be more beneficial to reclaim a page than taking the
performance penalties that come with allocating a page on a remote zone.
Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch. By default
RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same
component (enclosure or motherboard) on IA64. The meaning of the NUMA
distance information seems to vary by arch.
If zone reclaim is not successful then no further reclaim attempts will
occur for a certain time period (ZONE_RECLAIM_INTERVAL).
This patch was discussed before. See
http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>