As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup
the (few) files that fail to build because they were relying on blkdev.h
pulling in extra includes for them.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
All the current page_mkwrite() implementations also set the page dirty. Which
results in the set_page_dirty_balance() call to _not_ call balance, because the
page is already found dirty.
This allows us to dirty a _lot_ of pages without ever hitting
balance_dirty_pages(). Not good (tm).
Force a balance call if ->page_mkwrite() was successful.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When pinning and unpinning pagetables, we must protect them against
being used by other CPUs, lest they see the pagetable in an
intermediate read-only-but-not-pinned state.
When using split pte locks, doing this properly would require taking
all the pte locks for the pagetable while pinning, but this may overflow
the PREEMPT_BITS part of the preempt counter if the process has mapped
more than about 512M of memory.
However, failing to take the pte locks causes write-protect faults when
the pageout code is trying to clear the Access bit on a pte which is part
of a freshy created and still being pinned process after fork.
This is a short-term fix until the problem is solved properly.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Keir Fraser <keir@xensource.com>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below
sys_remap_file_pages, while running Oracle database test on x86 in 6GB
RAM: kunmap thinks we're in_interrupt because the preempt count has
wrapped.
That's because __do_fault expected to unmap page_table, but one of its
two callers do_nonlinear_fault already unmapped it: let do_linear_fault
unmap it first too, and then there's no need to pass the page_table arg
down.
Why have we been so slow to notice this? Probably through forgetting
that the mapping_cap_account_dirty test means that sys_remap_file_pages
nowadays only goes the full nonlinear vma route on a few memory-backed
filesystems like ramfs, tmpfs and hugetlbfs.
[ It also depends on CONFIG_HIGHPTE, so it becomes even harder to
trigger in practice. Many who have need of large memory have probably
migrated to x86-64..
Problem introduced by commit d0217ac04c
("mm: fault feedback #1") -- Linus ]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: gurudas pai <gurudas.pai@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The virtual address space argument of clear_user_highpage is supposed to be
the virtual address where the page being cleared will eventually be mapped.
This allows architectures with virtually indexed caches a few clever
tricks. That sort of trick falls over in painful ways if the virtual
address argument is wrong.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch proposes fixes to the reference counting of memory policy in the
page allocation paths and in show_numa_map(). Extracted from my "Memory
Policy Cleanups and Enhancements" series as stand-alone.
Shared policy lookup [shmem] has always added a reference to the policy,
but this was never unrefed after page allocation or after formatting the
numa map data.
Default system policy should not require additional ref counting, nor
should the current task's task policy. However, show_numa_map() calls
get_vma_policy() to examine what may be [likely is] another task's policy.
The latter case needs protection against freeing of the policy.
This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another task's
mempolicy. Again, shared policy is already reference counted on lookup. A
matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
shared and vma policies, and in show_numa_map() for shared and another
task's mempolicy. We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL policy.
Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a shared or vma
'BIND policy. In this case, we should hold the reference until after the
huge page allocation in dequeue_hugepage(). The patch modifies
huge_zonelist() to return a pointer to the mempolicy if it needs to be
unref'd after allocation.
Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
w/o patch w/ refcount patch
Avg Std Devn Avg Std Devn
Real: 100.59 0.38 100.63 0.43
User: 1209.60 0.37 1209.91 0.31
System: 81.52 0.42 81.64 0.34
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was posted on Aug 28 and fixes an issue that could cause troubles
when slab caches >=128k are created.
http://marc.info/?l=linux-mm&m=118798149918424&w=2
Currently we simply add the debug flags unconditional when checking for a
matching slab. This creates issues for sysfs processing when slabs exist
that are exempt from debugging due to their huge size or because only a
subset of slabs was selected for debugging.
We need to only add the flags if kmem_cache_open() would also add them.
Create a function to calculate the flags that would be set
if the cache would be opened and use that function to determine
the flags before looking for a compatible slab.
[akpm@linux-foundation.org: fixlets]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page migration currently does not check if the target of the move contains
nodes that that are invalid (if root attempts to migrate pages)
and may try to allocate from invalid nodes if these are specified
leading to oopses.
Return -EINVAL if an offline node is specified.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not BUG() if we cannot register a slab with sysfs. Just print an error.
The only consequence of not registering is that the slab cache is not
visible via /sys/slab. A BUG() may not be visible that early during boot
and we have had multiple issues here already.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In migration fallback path, write_page() or lock_page() will be called.
This causes sleep with holding rcu_read_lock().
For avoding that, just do rcu_lock if the page is Anon.(this is enough.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NUMA layer only supports NUMA policies for the highest zone. When
ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
ZONE_MOVABLE. The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.
This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.
The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64. No abnormal results were seen in
kernbench, tbench, dbench or hackbench. It passes regression tests from
the numactl package with and without kernelcore= once numactl tests are
patched to wait for vmstat counters to update.
akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
the mobility or get the other solution that Mel is working on. That solution
would only use a single zonelist per node and filter on the fly. That may
help performance and also help to make memory policies work better."
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print a big fat warning and do what is necessary to continue if a node is
marked as up (meaning either node is online (upstream) or node has memory
(Andrew's tree)) but allocations from the node do not succeed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems a simple mistake was made when converting follow_hugetlb_page()
over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
83c54070ee).
By using the wrong bitmask, hugetlb_fault() failures are not being
recognized. This results in an infinite loop whenever follow_hugetlb_page
is involved in a failed fault.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Skip calling cache_free_alien() when the platform is not numa capable.
This will avoid cache misses that happen while accessing slabp (which is
per page memory reference) to get nodeid. Instead use a global variable to
skip the call, which is mostly likely to be present in the cache.
This gives a 0.8% performance boost with the database oltp workload on a
quad-core SMP platform and by any means the number is not small :)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new exec code inserts an accounted vma into an mm struct which is not
current->mm. The existing memory check code has a hard coded assumption
that this does not happen as does the security code.
As the correct mm is known we pass the mm to the security method and the
helper function. A new security test is added for the case where we need
to pass the mm and the existing one is modified to pass current->mm to
avoid the need to change large amounts of code.
(Thanks to Tobias for fixing rejects and testing)
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Cc: James Morris <jmorris@redhat.com>
Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lumpy reclaim works by selecting a lead page from the LRU list and then
selecting pages for reclaim from the order-aligned area of pages. In the
situation were all pages in that region are inactive and not referenced by any
process over time, it works well.
In the situation where there is even light load on the system, the pages may
not free quickly. Out of a area of 1024 pages, maybe only 950 of them are
freed when the allocation attempt occurs because lumpy reclaim returned early.
This patch alters the behaviour of direct reclaim for large contiguous
blocks.
The first attempt to call shrink_page_list() is asynchronous but if it fails,
the pages are submitted a second time and the calling process waits for the IO
to complete. This may stall allocators waiting for contiguous memory but that
should be expected behaviour for high-order users. It is preferable behaviour
to potentially queueing unnecessary areas for IO. Note that kswapd will not
stall in this fashion.
[apw@shadowen.org: update to version 2]
[apw@shadowen.org: update to version 3]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As pointed out by Mel when reclaim is applied at higher orders a significant
amount of IO may be started. As this takes finite time to drain reclaim will
consider more areas than ultimatly needed to satisfy the request. This leads
to more reclaim than strictly required and reduced success rates.
I was able to confirm Mel's test results on systems locally. These show that
even under light load the success rates drop off far more than expected.
Testing with a modified version of his patch (which follows) I was able to
allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test
passes sequentially following system boot (the system has 29 hugepages in
ZONE_MOVABLE):
2.6.23-rc1 11 8 6 7 7
sync_lumpy 28 28 29 29 26
These show that although hugely better than the near 0% success normally
expected we can only allocate about a 1/4 of the zone. Using synchronous
reclaim for these allocations we get close to 100% as expected.
I have also run our standard high order tests and these show no regressions in
allocation success rates at rest, and some significant improvements under
load.
This patch:
We are transitioning pages from active to inactive in clear_active_flags,
those need counting as PGDEACTIVATE vm events.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c:
Initializing HighMem for node 0 (00038000:00100000)
Initializing HighMem for node 1 (00100000:001ffe00)
------------[ cut here ]------------
kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456!
[...]
This occurs because the section to node id mapping is not being
setup correctly during init under SPARSEMEM_STATIC, leading to an
attempt to free pages from all nodes into the zones on node 0.
When the zone_table[] was removed in the following commit, a new
section to node mapping table was introduced:
commit 89689ae7f9
[PATCH] Get rid of zone_table[]
That conversion inadvertantly only initialised the node mapping in
SPARSEMEM_EXTREME. Ensure we initialise the node mapping in
SPARSEMEM_STATIC.
[akpm@linux-foundation.org: make the stubs static inline]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n
sysace: HDIO_GETGEO has it's own method for ages
drivers/block/cpqarray.c: better error handling and kmalloc + memset conversion to k[cz]alloc
drivers/block/cciss.c: kmalloc + memset conversion to kzalloc
Clean up duplicate includes in drivers/block/
Fix remap handling by blktrace
[PATCH] remove mm/filemap.c:file_send_actor()