mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
4fdfbe8506fa236f3c4cc915b24e70f84a6f768a
169 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
79134171df |
thp: transparent hugepage vmstat
Add hugepage stat information to /proc/vmstat and /proc/meminfo. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9950474883 |
mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
We're seeing cases on a number of servers where cache never fully
grows to use all available memory. Sometimes we see servers with 4 GB
of memory that never seem to have less than 1.5 GB free, even with a
constantly-active VM. In some cases, these servers also swap out while
this happens, even though they are constantly reading the working set
into memory. We have been seeing this happening for a long time; I
don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here. On the target machine, there is a
small Normal zone in comparison to DMA32. As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers. The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not. This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons. One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem. In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases. Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints. There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
traceonly kanyzone
Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 16.58 17.4
Total Elapsed Time (seconds) 218.48 222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims 0 4
Direct reclaim pages scanned 0 203
Direct reclaim pages reclaimed 0 184
Kswapd pages scanned 326631 322018
Kswapd pages reclaimed 312632 309784
Kswapd low wmark quickly 1 4
Kswapd high wmark quickly 122 475
Kswapd skip congestion_wait 1 0
Pages activated 700040 705317
Pages deactivated 212113 203922
Pages written 9875 6363
Total pages scanned 326631 322221
Total pages reclaimed 312632 309968
%age total pages scanned/reclaimed 95.71% 96.20%
%age total pages scanned/written 3.02% 1.97%
proc vmstat: Faults
Major Faults 300 254
Minor Faults 645183 660284
Page ins 493588 486704
Page outs 4960088 4986704
Swap ins 1230 661
Swap outs 9869 6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed. Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive. As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
traceonly kanyzone
User/Sys Time Running Test (seconds) 29.29 28.87
Total Elapsed Time (seconds) 492.18 488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims 2128 1460
Direct reclaim pages scanned 2284822 1496067
Direct reclaim pages reclaimed 148919 110937
Kswapd pages scanned 15450014 16202876
Kswapd pages reclaimed 8503697 8537897
Kswapd low wmark quickly 3100 3397
Kswapd high wmark quickly 1860 7243
Kswapd skip congestion_wait 708 801
Pages activated 9635 9573
Pages deactivated 1432 1271
Pages written 223 1130
Total pages scanned 17734836 17698943
Total pages reclaimed 8652616 8648834
%age total pages scanned/reclaimed 48.79% 48.87%
%age total pages scanned/written 0.00% 0.01%
proc vmstat: Faults
Major Faults 165 221
Minor Faults 9655785 9656506
Page ins 3880 7228
Page outs 37692940 37480076
Swap ins 0 69
Swap outs 19 15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster. Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures. Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects. If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages. The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones. kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
88f5acf88a |
mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit |
||
|
|
0e093d9976 |
writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
If congestion_wait() is called with no BDI congested, the caller will sleep for the full timeout and this may be an unnecessary sleep. This patch adds a wait_iff_congested() that checks congestion and only sleeps if a BDI is congested else, it calls cond_resched() to ensure the caller is not hogging the CPU longer than its quota but otherwise will not sleep. This is aimed at reducing some of the major desktop stalls reported during IO. For example, while kswapd is operating, it calls congestion_wait() but it could just have been reclaiming clean page cache pages with no congestion. Without this patch, it would sleep for a full timeout but after this patch, it'll just call schedule() if it has been on the CPU too long. Similar logic applies to direct reclaimers that are not making enough progress. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ea941f0e2a |
writeback: add nr_dirtied and nr_written to /proc/vmstat
To help developers and applications gain visibility into writeback behaviour adding two entries to vm_stat_items and /proc/vmstat. This will allow us to track the "written" and "dirtied" counts. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
aa45484031 |
mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than number of real free page in buddy, the VM can allocate pages below min watermark, at worst reducing the real number of pages to zero. Even if the OOM killer kills some victim for freeing memory, it may not free memory if the exit path requires a new page resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
25edde0332 |
vmscan: kill prev_priority completely
Since 2.6.28 zone->prev_priority is unused. Then it can be removed safely. It reduce stack usage slightly. Now I have to say that I'm sorry. 2 years ago, I thought prev_priority can be integrate again, it's useful. but four (or more) times trying haven't got good performance number. Thus I give up such approach. The rest of this changelog is notes on prev_priority and why it existed in the first place and why it might be not necessary any more. This information is based heavily on discussions between Andrew Morton, Rik van Riel and Kosaki Motohiro who is heavily quotes from. Historically prev_priority was important because it determined when the VM would start unmapping PTE pages. i.e. there are no balances of note within the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there is a potential risk of unnecessarily increasing minor faults as a large amount of read activity of use-once pages could push mapped pages to the end of the LRU and get unmapped. There is no proof this is still a problem but currently it is not considered to be. Active files are not deactivated if the active file list is smaller than the inactive list reducing the liklihood that file-mapped pages are being pushed off the LRU and referenced executable pages are kept on the active list to avoid them getting pushed out by read activity. Even if it is a problem, prev_priority prev_priority wouldn't works nowadays. First of all, current vmscan still a lot of UP centric code. it expose some weakness on some dozens CPUs machine. I think we need more and more improvement. The problem is, current vmscan mix up per-system-pressure, per-zone-pressure and per-task-pressure a bit. example, prev_priority try to boost priority to other concurrent priority. but if the another task have mempolicy restriction, it is unnecessary, but also makes wrong big latency and exceeding reclaim. per-task based priority + prev_priority adjustment make the emulation of per-system pressure. but it have two issue 1) too rough and brutal emulation 2) we need per-zone pressure, not per-system. Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about 2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer. but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the system have higher memory pressure than priority==0 (1/4096*10,000 > 2). prev_priority can't solve such multithreads workload issue. In other word, prev_priority concept assume the sysmtem don't have lots threads." Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b645bd1286 |
mmzone.h: remove dead prototype
get_zone_counts() was dropped from kernel tree, see: http://www.mail-archive.com/mm-commits@vger.kernel.org/msg07313.html but its prototype remains. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7aac789885 |
numa: introduce numa_mem_id()- effective local memory node id
Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes. Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them. Archs can override definitions of: numa_mem_id() - returns node number of "local memory" node set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem' cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line. [akpm@linux-foundation.org: fix build] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4eaf3f6439 |
mem-hotplug: fix potential race while building zonelist for new populated zone
Add global mutex zonelists_mutex to fix the possible race:
CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();
In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).
Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).
[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
1f522509c7 |
mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset
For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:
1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().
2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()
Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
[<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
[<ffffffff81128407>] __page_cache_alloc+0x67/0x70
[<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
[<ffffffff81132751>] ra_submit+0x21/0x30
[<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
[<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
[<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
[<ffffffff81266cfa>] nfs_file_read+0xca/0x130
[<ffffffff8117b20a>] do_sync_read+0xfa/0x140
[<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
[<ffffffff8117c151>] sys_read+0x51/0x80
[<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
RSP <ffff88000d1e78a8>
---[ end trace 4bda28328b9990db ]
[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
0faa56389c |
mm: fix NR_SECTION_ROOTS == 0 when using using sparsemem extreme.
Got this while compiling for ARM/SA1100: mm/sparse.c: In function '__section_nr': mm/sparse.c:135: warning: 'root' is used uninitialized in this function This patch follows Russell King's suggestion for a new calculation for NR_SECTION_ROOTS. Thanks also to Sergei Shtylyov for pointing out the existence of the macro DIV_ROUND_UP. Atsushi Nemoto observed: : This fix doesn't just silence the warning - it fixes a real problem. : : Without this fix, mem_section[] might have 0 size so mem_section[0] : will share other variable area. For example, I got: : : c030c700 b __warned.16478 : c030c700 B mem_section : c030c701 b __warned.16483 : : This might cause very strange behavior. Your patch actually fixes it. Signed-off-by: Marcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br> Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4f92e2586b |
mm: compaction: defer compaction using an exponential backoff when compaction fails
The fragmentation index may indicate that a failure is due to external fragmentation but after a compaction run completes, it is still possible for an allocation to fail. There are two obvious reasons as to why o Page migration cannot move all pages so fragmentation remains o A suitable page may exist but watermarks are not met In the event of compaction followed by an allocation failure, this patch defers further compaction in the zone (1 << compact_defer_shift) times. If the next compaction attempt also fails, compact_defer_shift is increased up to a maximum of 6. If compaction succeeds, the defer counters are reset again. The zone that is deferred is the first zone in the zonelist - i.e. the preferred zone. To defer compaction in the other zones, the information would need to be stored in the zonelist or implemented similar to the zonelist_cache. This would impact the fast-paths and is not justified at this time. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
318ae2edc3 |
Merge branch 'for-next' into for-linus
Conflicts: Documentation/filesystems/proc.txt arch/arm/mach-u300/include/mach/debug-macro.S drivers/net/qlge/qlge_ethtool.c drivers/net/qlge/qlge_main.c drivers/net/typhoon.c |
||
|
|
93e4a89a8c |
mm: restore zone->all_unreclaimable to independence word
commit
|
||
|
|
a626b46e17 |
Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) early_res: Need to save the allocation name in drop_range_partial() sparsemem: Fix compilation on PowerPC early_res: Add free_early_partial() x86: Fix non-bootmem compilation on PowerPC core: Move early_res from arch/x86 to kernel/ x86: Add find_fw_memmap_area Move round_up/down to kernel.h x86: Make 32bit support NO_BOOTMEM early_res: Enhance check_and_double_early_res x86: Move back find_e820_area to e820.c x86: Add find_early_area_size x86: Separate early_res related code from e820.c x86: Move bios page reserve early to head32/64.c sparsemem: Put mem map for one node together. sparsemem: Put usemap for one node together x86: Make 64 bit use early_res instead of bootmem before slab x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA x86: Make early_node_mem get mem > 4 GB if possible x86: Dynamically increase early_res array size x86: Introduce max_early_res and early_res_count ... |
||
|
|
43cf38eb5c |
percpu: add __percpu sparse annotations to core kernel subsystems
Add __percpu sparse annotations to core subsystems. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-mm@kvack.org Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Biederman <ebiederm@xmission.com> |
||
|
|
08677214e3 |
x86: Make 64 bit use early_res instead of bootmem before slab
Finally we can use early_res to replace bootmem for x86_64 now. Still can use CONFIG_NO_BOOTMEM to enable it or not. -v2: fix 32bit compiling about MAX_DMA32_PFN -v3: folded bug fix from LKML message below Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B747239.4070907@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> |
||
|
|
2a61aa4016 |
Fix misspellings of "invocation" in comments.
Some comments misspell "invocation"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> |
||
|
|
99dcc3e5a9 |
this_cpu: Page allocator conversion
Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.
Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.
Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.
Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
V7-V8:
- Explain chicken egg dilemmna with percpu allocator.
V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.
tj: Build failure in pageset_cpuup_callback() due to missing ret
variable fixed.
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
01fc0ac198 |
kbuild: move bounds.h to include/generated
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michal Marek <mmarek@suse.cz> |
||
|
|
8d65af789f |
sysctl: remove "struct file *" argument of ->proc_handler
It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5f8dcc2121 |
page-allocator: split per-cpu list into one-list-per-migrate-type
The following two patches remove searching in the page allocator fast-path by maintaining multiple free-lists in the per-cpu structure. At the time the search was introduced, increasing the per-cpu structures would waste a lot of memory as per-cpu structures were statically allocated at compile-time. This is no longer the case. The patches are as follows. They are based on mmotm-2009-08-27. Patch 1 adds multiple lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. Patch 2 notes that the pcpu drain path check empty lists multiple times. The patch reduces the number of checks by maintaining a count of free lists encountered. Lists containing pages will then free multiple pages in batch The patches were tested with kernbench, netperf udp/tcp, hackbench and sysbench. The netperf tests were not bound to any CPU in particular and were run such that the results should be 99% confidence that the reported results are within 1% of the estimated mean. sysbench was run with a postgres background and read-only tests. Similar to netperf, it was run multiple times so that it's 99% confidence results are within 1%. The patches were tested on x86, x86-64 and ppc64 as x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.34% to 2.28% gain netperf-tcp - 0.45% to 1.22% gain hackbench - Small variances, very close to noise sysbench - Very small gains x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.83% to 10.42% gains netperf-tcp - No conclusive until buffer >= PAGE_SIZE 4096 +15.83% 8192 + 0.34% (not significant) 16384 + 1% hackbench - Small gains, very close to noise sysbench - 0.79% to 1.6% gain ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation) kernbench - No significant difference, variance well within noise netperf-udp - 2-3% gain for almost all buffer sizes tested netperf-tcp - losses on small buffers, gains on larger buffers possibly indicates some bad caching effect. hackbench - No significant difference sysbench - 2-4% gain This patch: Currently the per-cpu page allocator searches the PCP list for pages of the correct migrate-type to reduce the possibility of pages being inappropriate placed from a fragmentation perspective. This search is potentially expensive in a fast-path and undesirable. Splitting the per-cpu list into multiple lists increases the size of a per-cpu structure and this was potentially a major problem at the time the search was introduced. These problem has been mitigated as now only the necessary number of structures is allocated for the running system. This patch replaces a list search in the per-cpu allocator with one list per migrate type. The potential snag with this approach is when bulk freeing pages. We round-robin free pages based on migrate type which has little bearing on the cache hotness of the page and potentially checks empty lists repeatedly in the event the majority of PCP pages are of one type. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f862963174 |
mm: do batched scans for mem_cgroup
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
rate by up to 32 times.
For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive list,
the active list will be scanned 4 pages, while the inactive list will be
(over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
the balance between the two lists.
It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned
So the end result may be
- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)
The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
a731286de6 |
mm: vmstat: add isolate pages
If the system is running a heavy load of processes then concurrent reclaim can isolate a large number of pages from the LRU. /proc/vmstat and the output generated for an OOM do not show how many pages were isolated. This has been observed during process fork bomb testing (mstctl11 in LTP). This patch shows the information about isolated pages. Reproduced via: ----------------------- % ./hackbench 140 process 1000 => OOM occur active_anon:146 inactive_anon:0 isolated_anon:49245 active_file:79 inactive_file:18 isolated_file:113 unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39 free:370 slab_reclaimable:309 slab_unreclaimable:5492 mapped:53 shmem:15 pagetables:28140 bounce:0 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |