KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
is coming from page migration caller.
WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
Call Trace:
[<a000000100015220>] show_stack+0x80/0xa0
[<a000000100015270>] dump_stack+0x30/0x60
[<a000000100089ed0>] warn_on_slowpath+0x90/0xe0
[<a0000001001f8b10>] __set_page_dirty+0x330/0x360
[<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280
[<a00000010012fec0>] set_page_dirty+0xc0/0x260
[<a000000100195670>] migrate_page_copy+0x5d0/0x5e0
[<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0
[<a000000100195eb0>] migrate_pages+0x770/0xe00
What was happening is that migrate_page_copy wants to transfer the PG_dirty
bit from old page to new page, so what it would do is set_page_dirty(newpage).
However set_page_dirty() is used to set the entire page dirty, wheras in
this case, only part of the page was dirty, and it also was not uptodate.
Marking the whole page dirty with set_page_dirty would lead to corruption or
unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
buffers.
Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
however in the interests of keeping the change minimal...
Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
found a swap pte, before it had determined it to be a migration entry. That
left a surplus reference count on the page_cgroup, so it was still attached
when the page was later freed.
Move that mem_cgroup_charge down to where we're sure it's a migration entry.
We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
already inappropriate: change that to GFP_ATOMIC.
It's essential that remove_migration_pte removes all the migration entries,
other crashes follow if not. So proceed even when the charge fails: normally
it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While using memory control cgroup, page-migration under it works as following.
==
1. uncharge all refs at try to unmap.
2. charge regs again remove_migration_ptes()
==
This is simple but has following problems.
==
The page is uncharged and charged back again if *mapped*.
- This means that cgroup before migration can be different from one after
migration
- If page is not mapped but charged as page cache, charge is just ignored
(because not mapped, it will not be uncharged before migration)
This is memory leak.
==
This patch tries to keep memory cgroup at page migration by increasing
one refcnt during it. 3 functions are added.
mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
new page.
During migration
- old page is under PG_locked.
- new page is under PG_locked, too.
- both old page and new page is not on LRU.
These 3 facts guarantee that page_cgroup() migration has no race.
Tested and worked well in x86_64/fake-NUMA box.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin pointed out that swap cache and page cache addition routines
could be called from non GFP_KERNEL contexts. This patch makes the
charging routine aware of the gfp context. Charging might fail if the
cgroup is over it's limit, in which case a suitable error is returned.
This patch was tested on a Powerpc box. I am still looking at being able
to test the path, through which allocations happen in non GFP_KERNEL
contexts.
[kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the accounting hooks. The accounting is carried out for RSS and Page
Cache (unmapped) pages. There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time. Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache(). Swap cache is also accounted for.
Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier. A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.
Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup. Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.
[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: <Valdis.Kletnieks@vt.edu>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orphaned page might have fs-private metadata, the page is truncated. As
the page hasn't mapping, page migration refuse to migrate the page. It
appears the page is only freed in page reclaim and if zone watermark is
low, the page is never freed, as a result migration always fail. I thought
we could free the metadata so such page can be freed in migration and make
migration more reliable.
[akpm@linux-foundation.org: go direct to try_to_free_buffers()]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.
It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.
This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
add modfied set_pte() for flushing if necessary.
This patch flush icache of a page when
new pte has exec bit.
&& new pte has present bit
&& new pte is user's page.
&& (old *ptep is not present
|| new pte's pfn is not same to old *ptep's ptn)
&& new pte's page has no Pg_arch_1 bit.
Pg_arch_1 is set when a page is cache consistent.
I think this condition checks are much easier to understand than considering
"Where sync_icache_dcache() should be inserted ?".
pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
clean-up. So, I added it again.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Online nodes now may have no memory. The checks and initialization must
therefore be changed to no longer use the online functions.
This will correctly initialize the interleave on bootup to only target nodes
with memory and will make sys_move_pages return an error when a page is to be
moved to a memoryless node. Similarly we will get an error if MPOL_BIND and
MPOL_INTERLEAVE is used on a memoryless node.
These are somewhat new semantics. So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange like with MPOL_INTERLEAVE). If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In migration fallback path, write_page() or lock_page() will be called.
This causes sleep with holding rcu_read_lock().
For avoding that, just do rcu_lock if the page is Anon.(this is enough.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
release_pages() in mm/swap.c changes page_count() to be 0 without removing
PageLRU flag...
This means isolate_lru_page() can see a page, PageLRU() &&
page_count(page)==0.. This is BUG. (get_page() will be called against
count=0 page.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In usual, migrate_pages(page,,) is called with holding mm->sem by system call.
(mm here is a mm_struct which maps the migration target page.)
This semaphore helps avoiding some race conditions.
But, if we want to migrate a page by some kernel codes, we have to avoid
some races. This patch adds check code for following race condition.
1. A page which page->mapping==NULL can be target of migration. Then, we have
to check page->mapping before calling try_to_unmap().
2. anon_vma can be freed while page is unmapped, but page->mapping remains as
it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping.
So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from
being freed during migration.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is often known at allocation time whether a page may be migrated or not.
This patch adds a flag called __GFP_MOVABLE and a new mask called
GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated
using the page migration mechanism or reclaimed by syncing with backing
storage and discarding.
An API function very similar to alloc_zeroed_user_highpage() is added for
__GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
flags used by alloc_zeroed_user_highpage() are not changed because it would
change the semantics of an existing API. After this patch is applied there
are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
be marked deprecated if this patch is merged.
Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
shmem.c to keep all flag modifications to inode->mapping in the
shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
Hugh Dickens.
Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
concept. Credit to Hugh Dickens for catching issues with shmem swap vector
and ramfs allocations.
[akpm@linux-foundation.org: build fix]
[hugh@veritas.com: __GFP_ZERO cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we do not check for vma flags if sys_move_pages is called to move
individual pages. If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.
Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that is can be used from both locations.
Problem was spotted by Lee Schermerhorn
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make radix tree lookups safe to be performed without locks. Readers are
protected against nodes being deleted by using RCU based freeing. Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.
Each radix tree node keeps a record of their height (above leaf nodes).
This height does not change after insertion -- when the radix tree is
extended, higher nodes are only inserted in the top. So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.
"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.
When a reader wants to traverse the next branch, they will take a copy of
the pointer. This pointer will be either NULL (and the branch is empty) or
non-NULL (and will point to a valid node).
[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_move_pages() uses vmalloc() to allocate an array of structures that is
fills with information passed from user mode and then passes to
do_stat_pages() (in the case the node list is NULL). do_stat_pages()
depends on a marker in the node field of the structure to decide how large
the array is and this marker is correctly inserted into the last element of
the array. However, vmalloc() doesn't zero the memory it allocates and if
the user passes NULL for the node list, then the node fields are not filled
in (except for the end marker). If the memory the vmalloc() returned
happend to have a word with the marker value in it in just the right place,
do_pages_stat will fail to fill the status field of part of the array and
we will return (random) kernel data to user mode.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stop fallback_migrate_page() from using page_has_buffers() since that might not
be available. Use PagePrivate() instead since that's more general.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In many places we will need to use the same combination of flags. Specify
a single GFP_THISNODE definition for ease of use in gfp.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>