Allow free of mlock()ed pages. This shouldn't happen, but during
developement, it occasionally did.
This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).
Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.
[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Report unevictable pages per zone and system wide.
Kosaki Motohiro added support for memory controller unevictable
statistics.
[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix to unevictable-lru-page-statistics.patch
Add unevictable lru infrastructure vm events to the statistics patch.
Rename the "NORECL_" and "noreclaim_" symbols and text strings to
"UNEVICTABLE_" and "unevictable_", respectively.
Currently, both the infrastructure and the mlocked pages event are
added by a single patch later in the series. This makes it difficult
to add or rework the incremental patches. The events actually "belong"
with the stats, so pull them up to here.
Also, restore the event counting to putback_lru_page(). This was removed
from previous patch in series where it was "misplaced". The actual events
weren't defined that early.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages. At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.
We can reduce the maximum number of pages that need to be scanned by not
taking the referenced state into account when deactivating an anonymous
page. After all, every anonymous page starts out referenced, so why
check?
If an anonymous page gets referenced again before it reaches the end of
the inactive list, we move it back to the active list.
To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).
Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.
[kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
[kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.
The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.
This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.
[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we are defining explicit variables for the inactive and active
list. An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.
We are saving a few bytes in terms of code size:
Before:
text data bss dec hex filename
4097753 573120 4092484 8763357 85b7dd vmlinux
After:
text data bss dec hex filename
4097729 573120 4092484 8763333 85b7c5 vmlinux
Having an easy way to add new lru lists may ease future work on the
reclaim code.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ordinarily, memory holes in flatmem still have a valid memmap and is safe
to use. However, an architecture (ARM) frees up the memmap backing memory
holes on the assumption it is never used. /proc/pagetypeinfo reads the
whole range of pages in a zone believing that the memmap is valid and that
pfn_valid will return false if it is not. On ARM, freeing the memmap breaks
the page->zone linkages even though pfn_valid() returns true and the kernel
can oops shortly afterwards due to accessing a bogus struct zone *.
This patch lets architectures say when FLATMEM can have holes in the
memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo
will confirm that the page linkages are still valid by checking page->zone
is still the expected zone. The lookup of page_zone is safe as there is a
limited range of memory that is accessed when calling page_zone. Even if
page_zone happens to return the correct zone, the impact is that the counters
in /proc/pagetypeinfo are slightly off but fragmentation monitoring is
unlikely to be relevant on an embedded system.
Reported-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fuse will use temporary buffers to write back dirty data from memory mappings
(normal writes are done synchronously). This is needed, because there cannot
be any guarantee about the time in which a write will complete.
By using temporary buffers, from the MM's point if view the page is written
back immediately. If the writeout was due to memory pressure, this
effectively migrates data from a full zone to a less full zone.
This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
as temporary buffers.
[Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've found that it can take quite a bit of time (100's of usec) to get
through the zone loop in refresh_cpu_vm_stats().
Adding a cond_resched() to allow other threads to run in the non-preemptive
case.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocating huge pages directly from the buddy allocator is not guaranteed to
succeed. Success depends on several factors (such as the amount of physical
memory available and the level of fragmentation). With the addition of
dynamic hugetlb pool resizing, allocations can occur much more frequently.
For these reasons it is desirable to keep track of huge page allocation
successes and failures.
Add two new vmstat entries to track huge page allocations that succeed and
fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
being enabled.
[akpm@linux-foundation.org: reduced ifdeffery]
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign. It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.
This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it. This simplifies the reading of later patches in
this set.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have repeatedly discussed if the cold pages still have a point. There is
one way to join the two lists: Use a single list and put the cold pages at the
end and the hot pages at the beginning. That way a single list can serve for
both types of allocations.
The discussion of the RFC for this and Mel's measurements indicate that
there may not be too much of a point left to having separate lists for
hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Martin Bligh <mbligh@mbligh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Add comments explaining how the function can be called.
2. Collect global diffs in a local array and only spill
them once into the global counters when the zone scan
is finished. This means that we only touch each global
counter once instead of each time we fold cpu counters
into zone counters.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark start_cpu_timer() as __cpuinit instead of __devinit.
Fixes this section warning:
WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show')
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>