When online_pages() is called to add new memory to an empty zone, it
rebuilds all zone lists by calling build_all_zonelists(). But there's a
bug which prevents the new zone to be added to other nodes' zone lists.
online_pages() {
build_all_zonelists()
.....
node_set_state(zone_to_nid(zone), N_HIGH_MEMORY)
}
Here the node of the zone is put into N_HIGH_MEMORY state after calling
build_all_zonelists(), but build_all_zonelists() only adds zones from
nodes in N_HIGH_MEMORY state to the fallback zone lists.
build_all_zonelists()
->__build_all_zonelists()
->build_zonelists()
->find_next_best_node()
->for_each_node_state(n, N_HIGH_MEMORY)
So memory in the new zone will never be used by other nodes, and it may
cause strange behavor when system is under memory pressure. So put node
into N_HIGH_MEMORY state before calling build_all_zonelists().
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat);
build_zonelist_cache(pgdat);
}
Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print physical address info in a style consistent with the %pR style used
elsewhere in the kernel. For example:
-Zone PFN ranges:
+Zone ranges:
- DMA32 0x00000010 -> 0x00100000
+ DMA32 [mem 0x00010000-0xffffffff]
- Normal 0x00100000 -> 0x01080000
+ Normal [mem 0x100000000-0x107fffffff]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The files changed within are only using the EXPORT_SYMBOL
macro variants. They are not using core modular infrastructure
and hence don't need module.h but only the export.h header.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This patch contains online_page_callback and apropriate functions for
registering/unregistering online page callbacks. It allows to do some
machine specific tasks during online page stage which is required to
implement memory hotplug in virtual machines. Currently this patch is
required by latest memory hotplug support for Xen balloon driver patch
which will be posted soon.
Additionally, originial online_page() function was splited into
following functions doing "atomic" operations:
- __online_page_set_limits() - set new limits for memory management code,
- __online_page_increment_counters() - increment totalram_pages and totalhigh_pages,
- __online_page_free() - free page to allocator.
It was done to:
- not duplicate existing code,
- ease hotplug code devolpment by usage of well defined interface,
- avoid stupid bugs which are unavoidable when the same code
(by design) is developed in many places.
[akpm@linux-foundation.org: use explicit indirect-call syntax]
Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 959ecc48fc ("mm/memory_hotplug.c: fix building of node hotplug
zonelist") does not protect the build_all_zonelists() call with
zonelists_mutex as needed. This can lead to races in constructing
zonelist ordering if a concurrent build is underway. Protecting this
with lock_memory_hotplug() is insufficient since zonelists can be
rebuild though sysfs as well.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error handling in mem_online_node() is incorrect: hotadd_new_pgdat()
returns NULL if the new pgdat could not have been allocated and a pointer
to it otherwise.
mem_online_node() should fail if hotadd_new_pgdat() fails, not the
inverse. This fixes an issue when memoryless nodes are not onlined and
their sysfs interface is not registered when their first cpu is brought
up.
The bug was introduced by commit cf23422b9d ("cpu/mem hotplug: enable
CPUs online before local memory online") iow v2.6.35.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During memory hotplug we refresh zonelists when we online a page in a new
zone. It means that the node's zonelist is not initialized until pages
are onlined. So for example, "nid" passed by MEM_GOING_ONLINE notifier
will point to NODE_DATA(nid) which has no zone fallback list. Moreover,
if we hot-add cpu-only nodes, alloc_pages() will do no fallback.
This patch makes a zonelist when a new pgdata is available.
Note: in production, at fujitsu, memory should be onlined before cpu
and our server didn't have any memory-less nodes and had no problems.
But recent changes in MEM_GOING_ONLINE+page_cgroup
will access not initialized zonelist of node.
Anyway, there are memory-less node and we need some care.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migration synchronously waits for writeback if the initial passes fails.
Callers of memory compaction do not necessarily want this behaviour if the
caller is latency sensitive or expects that synchronous migration is not
going to have a significantly better success rate.
This patch adds a sync parameter to migrate_pages() allowing the caller to
indicate if wait_on_page_writeback() is allowed within migration or not.
For reclaim/compaction, try_to_compact_pages() is first called
asynchronously, direct reclaim runs and then try_to_compact_pages() is
called synchronously as there is a greater expectation that it'll succeed.
[akpm@linux-foundation.org: build/merge fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, memory_hotplug_(un)lock() is used for add/remove/offline pages
for avoiding races with hibernation. But this should be held in
online_pages(), too. It seems asymmetric.
There are cases where one has to avoid a race with memory hotplug
notifier and his own local code, and hotplug v.s. hotplug.
This will add a generic solution for avoiding races. In other view,
having lock here has no big impacts. online pages is tend to be
done by udev script at el against each memory section one by one.
Then, it's better to have lock here, too.
Cc: <stable@kernel.org> # 2.6.37
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>