commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.
Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.
[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
early_res: Need to save the allocation name in drop_range_partial()
sparsemem: Fix compilation on PowerPC
early_res: Add free_early_partial()
x86: Fix non-bootmem compilation on PowerPC
core: Move early_res from arch/x86 to kernel/
x86: Add find_fw_memmap_area
Move round_up/down to kernel.h
x86: Make 32bit support NO_BOOTMEM
early_res: Enhance check_and_double_early_res
x86: Move back find_e820_area to e820.c
x86: Add find_early_area_size
x86: Separate early_res related code from e820.c
x86: Move bios page reserve early to head32/64.c
sparsemem: Put mem map for one node together.
sparsemem: Put usemap for one node together
x86: Make 64 bit use early_res instead of bootmem before slab
x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
x86: Make early_node_mem get mem > 4 GB if possible
x86: Dynamically increase early_res array size
x86: Introduce max_early_res and early_res_count
...
Finally we can use early_res to replace bootmem for x86_64 now.
Still can use CONFIG_NO_BOOTMEM to enable it or not.
-v2: fix 32bit compiling about MAX_DMA32_PFN
-v3: folded bug fix from LKML message below
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B747239.4070907@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Some comments misspell "invocation"; this fixes them. No code
changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.
Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.
Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.
Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
V7-V8:
- Explain chicken egg dilemmna with percpu allocator.
V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.
tj: Build failure in pageset_cpuup_callback() due to missing ret
variable fixed.
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The following two patches remove searching in the page allocator fast-path
by maintaining multiple free-lists in the per-cpu structure. At the time
the search was introduced, increasing the per-cpu structures would waste a
lot of memory as per-cpu structures were statically allocated at
compile-time. This is no longer the case.
The patches are as follows. They are based on mmotm-2009-08-27.
Patch 1 adds multiple lists to struct per_cpu_pages, one per
migratetype that can be stored on the PCP lists.
Patch 2 notes that the pcpu drain path check empty lists multiple times. The
patch reduces the number of checks by maintaining a count of free
lists encountered. Lists containing pages will then free multiple
pages in batch
The patches were tested with kernbench, netperf udp/tcp, hackbench and
sysbench. The netperf tests were not bound to any CPU in particular and
were run such that the results should be 99% confidence that the reported
results are within 1% of the estimated mean. sysbench was run with a
postgres background and read-only tests. Similar to netperf, it was run
multiple times so that it's 99% confidence results are within 1%. The
patches were tested on x86, x86-64 and ppc64 as
x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
kernbench - No significant difference, variance well within noise
netperf-udp - 1.34% to 2.28% gain
netperf-tcp - 0.45% to 1.22% gain
hackbench - Small variances, very close to noise
sysbench - Very small gains
x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
kernbench - No significant difference, variance well within noise
netperf-udp - 1.83% to 10.42% gains
netperf-tcp - No conclusive until buffer >= PAGE_SIZE
4096 +15.83%
8192 + 0.34% (not significant)
16384 + 1%
hackbench - Small gains, very close to noise
sysbench - 0.79% to 1.6% gain
ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
kernbench - No significant difference, variance well within noise
netperf-udp - 2-3% gain for almost all buffer sizes tested
netperf-tcp - losses on small buffers, gains on larger buffers
possibly indicates some bad caching effect.
hackbench - No significant difference
sysbench - 2-4% gain
This patch:
Currently the per-cpu page allocator searches the PCP list for pages of
the correct migrate-type to reduce the possibility of pages being
inappropriate placed from a fragmentation perspective. This search is
potentially expensive in a fast-path and undesirable. Splitting the
per-cpu list into multiple lists increases the size of a per-cpu structure
and this was potentially a major problem at the time the search was
introduced. These problem has been mitigated as now only the necessary
number of structures is allocated for the running system.
This patch replaces a list search in the per-cpu allocator with one list
per migrate type. The potential snag with this approach is when bulk
freeing pages. We round-robin free pages based on migrate type which has
little bearing on the cache hotness of the page and potentially checks
empty lists repeatedly in the event the majority of PCP pages are of one
type.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
rate by up to 32 times.
For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive list,
the active list will be scanned 4 pages, while the inactive list will be
(over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
the balance between the two lists.
It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned
So the end result may be
- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)
The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the system is running a heavy load of processes then concurrent reclaim
can isolate a large number of pages from the LRU. /proc/vmstat and the
output generated for an OOM do not show how many pages were isolated.
This has been observed during process fork bomb testing (mstctl11 in LTP).
This patch shows the information about isolated pages.
Reproduced via:
-----------------------
% ./hackbench 140 process 1000
=> OOM occur
active_anon:146 inactive_anon:0 isolated_anon:49245
active_file:79 inactive_file:18 isolated_file:113
unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
free:370 slab_reclaimable:309 slab_unreclaimable:5492
mapped:53 shmem:15 pagetables:28140 bounce:0
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.
We often use the following calculation to determine the amount of shmem
pages:
shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES
however the expression does not consider isolated and mlocked pages.
This patch adds explicit accounting for pages used by shmem and tmpfs.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine
which watermark to use.
This patch uses the flags as an array index into a watermark array that is
indexed with WMARK_* defines accessed via helpers. All call sites that
use zone->pages_* are updated to use the helpers for accessing the values
and the array offsets for setting.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pfn_valid() is meant to be able to tell if a given PFN has valid memmap
associated with it or not. In FLATMEM, it is expected that holes always
have valid memmap as long as there is valid PFNs either side of the hole.
In SPARSEMEM, it is assumed that a valid section has a memmap for the
entire section.
However, ARM and maybe other embedded architectures in the future free
memmap backing holes to save memory on the assumption the memmap is never
used. The page_zone linkages are then broken even though pfn_valid()
returns true. A walker of the full memmap must then do this additional
check to ensure the memmap they are looking at is sane by making sure the
zone and PFN linkages are still valid. This is expensive, but walkers of
the full memmap are extremely rare.
This was caught before for FLATMEM and hacked around but it hits again for
SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
are totally screwed. This looks like a hatchet job but the reality is that
any clean solution would end up consumning all the memory saved by punching
these unexpected holes in the memmap. For example, we tried marking the
memmap within the section invalid but the section size exceeds the size of
the hole in most cases so pfn_valid() starts returning false where valid
memmap exists. Shrinking the size of the section would increase memory
consumption offsetting the gains.
This patch identifies when an architecture is punching unexpected holes
in the memmap that the memory model cannot automatically detect and sets
ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
which is the model sub-architecture this has been reported on but may expand
later. When set, walkers of the full memmap must call memmap_valid_within()
for each PFN and passing in what it expects the page and zone to be for
that PFN. If it finds the linkages to be broken, it assumes the memmap is
invalid for that PFN.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
cpumask: remove cpumask allocation from idle_balance, fix
numa, cpumask: move numa_node_id default implementation to topology.h, fix
cpumask: remove cpumask allocation from idle_balance
x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
x86: microcode: cleanup
x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
numa, cpumask: move numa_node_id default implementation to topology.h
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
cpumask: remove x86 cpumask_t uses.
cpumask: use cpumask_var_t in uv_flush_tlb_others.
cpumask: remove cpumask_t assignment from vector_allocation_domain()
cpumask: make Xen use the new operators.
cpumask: clean up summit's send_IPI functions
cpumask: use new cpumask functions throughout x86
x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
x86: unify 32 and 64-bit node_to_cpumask_map
...
Impact: cleanup
In almost cases, for_each_zone() is used with populated_zone(). It's
because almost function doesn't need memoryless node information.
Therefore, for_each_populated_zone() can help to make code simplify.
This patch has no functional change.
[akpm@linux-foundation.org: small cleanup]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup, potential bugfix
Not sure what changed to expose this, but clearly that numa_node_id()
doesn't belong in mmzone.h (the inline in gfp.h is probably overkill, too).
In file included from include/linux/topology.h:34,
from arch/x86/mm/numa.c:2:
/home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
In file included from include/linux/topology.h:32,
from arch/x86/mm/numa.c:2:
include/linux/mmzone.h:770:1: warning: this is the location of the previous definition
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <200903132343.37661.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allocate all page_cgroup at boot and remove page_cgroup poitner from
struct page. This patch adds an interface as
struct page_cgroup *lookup_page_cgroup(struct page*)
All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.
I added printk message as
"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"
maybe enough informative for users.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>