- Add comments explaing how drain_pages() works.
- Eliminate useless functions
- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.
- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.
- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c
- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
i386 and x86-64 registers System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.
But ia64 registers it as IORESOURCE_MEM only.
In addition, memory hotplug code registers new memory as IORESOURCE_MEM too.
This difference causes a failure of memory unplug of x86-64. This patch
fixes it.
This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by PCI
device.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Luck, Tony" <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should unset migrate type "ISOLATE" when we successfully removed memory.
But current code has BUG and cannot works well.
This patch also includes bugfix? to change get_pageblock_flags to
get_pageblock_migratetype().
Thanks to Badari Pulavarty for finding this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current memory notifier has some defects yet. (Fortunately, nothing uses
it.) This patch is to fix and rearrange for them.
- Add information of start_pfn, nr_pages, and node id if node status is
changes from/to memoryless node for callback functions.
Callbacks can't do anything without those information.
- Add notification going-online status.
It is necessary for creating per node structure before the node's
pages are available.
- Move GOING_OFFLINE status notification after page isolation.
It is good place for return memory like cache for callback,
because returned page is not used again.
- Make CANCEL events for rollingback when error occurs.
- Delete MEM_MAPPING_INVALID notification. It will be not used.
- Fix compile error of (un)register_memory_notifier().
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, arch dependent code around CONFIG_MEMORY_HOTREMOVE is a mess.
This patch cleans up them. This is against 2.6.23-rc6-mm1.
- fix compile failure on ia64/ CONFIG_MEMORY_HOTPLUG && !CONFIG_MEMORY_HOTREMOVE case.
- For !CONFIG_MEMORY_HOTREMOVE, add generic no-op remove_memory(),
which returns -EINVAL.
- removed remove_pages() only used in powerpc.
- removed no-op remove_memory() in i386, sh, sparc64, x86_64.
- only powerpc returns -ENOSYS at memory hot remove(no-op). changes it
to return -EINVAL.
Note:
Currently, only ia64 supports CONFIG_MEMORY_HOTREMOVE. I welcome other
archs if there are requirements and testers.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Logic.
- set all pages in [start,end) as isolated migration-type.
by this, all free pages in the range will be not-for-use.
- Migrate all LRU pages in the range.
- Test all pages in the range's refcnt is zero or not.
Todo:
- allocate migration destination page from better area.
- confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
(I don't like this kind of page but..
- Find out pages which cannot be migrated.
- more running tests.
- Use reclaim for unplugging other memory type area.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A clean up patch for "scanning memory resource [start, end)" operation.
Now, find_next_system_ram() function is used in memory hotplug, but this
interface is not easy to use and codes are complicated.
This patch adds walk_memory_resouce(start,len,arg,func) function.
The function 'func' is called per valid memory resouce range in [start,pfn).
[pbadari@us.ibm.com: Error handling in walk_memory_resource()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is necessary to know if nodes have memory since we have recently begun to
add support for memoryless nodes. For that purpose we introduce a two new
node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.
A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
type of mmemory. If a node has memory then it has at least one zone defined
in its pgdat structure that is located in the pgdat itself.
A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
ZONE_HIGHMEM. This means it is possible to allocate memory that is not
subject to kmap.
N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to insure
that we do the right thing when we encounter a memoryless node.
[akpm@linux-foundation.org: build fix]
[Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
[y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zone->present_pages is updated in online_pages(). But, __add_zone() can be
called twice or more before calling online_pages(). So,
init_currenty_empty_zone() can be called unnecessary times. It is cause of
memory leak of zone's wait_table.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime. It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.
It has been compile tested on ppc64, ia64, s390, i386 and x86_64.
Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zone table is mostly not needed. If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.
In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field. In all of
the above cases there will be no need at all for the zone table.
The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags. In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.
For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table. The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The api for hot-add memory already has a construct for finding nodes based on
an address, memory_add_physaddr_to_nid. This patch allows the fucntion to do
something besides return 0. It uses the nodes_add infomation to lookup to
node info for a hot add event.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit())
every time a CPU is hot-added/removed. But this value is not recalculated
when new pages are hot-added.
This patch fixes that problem by calling set_ratelimit() when new pages
are hot-added.
[akpm@osdl.org: cleanups]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code. Make this top cpus file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset. This is a surprising regression
over earlier kernels that didn't have cpusets enabled.
One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.
This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets. The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed. With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes. Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.
[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is for collision check enhancement for memory hot add.
It's better to do resouce collision check before doing memory hot add,
which will touch memory management structures.
And add_section() should check section exists or not before calling
sparse_add_one_section(). (sparse_add_one_section() will do another
check anyway. but checking in memory_hotplug.c will be easy to understand.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: keith mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
find_next_system_ram() is used to find available memory resource at onlining
newly added memory. This patch fixes following problem.
find_next_system_ram() cannot catch this case.
Resource: (start)-------------(end)
Section : (start)-------------(end)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ioresouce handling code in memory hotplug allows not-aligned memory hot add.
But when memmap and other memory structures are initialized, parameters should
be aligned. (if not aligned, initialization of mem_map will do wrong, it
assumes parameters are aligned.) This patch fix it.
And this patch allows ioresource collision check to handle -EEXIST.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When new node becomes enable by hot-add, new sysfs file must be created for
new node. So, if new node is enabled by add_memory(), register_one_node() is
called to create it. In addition, I386's arch_register_node() and a part of
register_nodes() of powerpc are consolidated to register_one_node() as a
generic_code().
This is tested by Tiger4(IPF) with node hot-plug emulation.
Signed-off-by: Keiichiro Tokunaga <tokuanga.keiich@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch allows hot-add memory which is not aligned to section.
Now, hot-added memory has to be aligned to section size. Considering big
section sized archs, this is not useful.
When hot-added memory is registerd as iomem resoruce by iomem resource
patch, we can make use of that information to detect valid memory range.
Note: With this, not-aligned memory can be registerd. To allow hot-add
memory with holes, we have to do more work around add_memory().
(It doesn't allows add memory to already existing mem section.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Register hot-added memory to iomem_resource. With this, /proc/iomem can
show hot-added memory.
Note: kdump uses /proc/iomem to catch memory range when it is installed.
So, kdump should be re-installed after /proc/iomem change.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>