Unlike earlier iterations of the delay accounting patches, now delays are only
collected for the actual I/O waits rather than try and cover the delays seen
in I/O submission paths.
Account separately for block I/O delays incurred as a result of swapin page
faults whose frequency can be affected by the task/process' rss limit. Hence
swapin delays can act as feedback for rss limit changes independent of I/O
priority changes.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a race condition that showed up in a threaded JIT environment.
The situation is that a process with a JIT code page forks, so the page is
marked read-only, then some threads are created in the child. One of the
threads attempts to add a new code block to the JIT page, so a
copy-on-write fault is taken, and the kernel allocates a new page, copies
the data, installs the new pte, and then calls lazy_mmu_prot_update() to
flush caches to make sure that the icache and dcache are in sync.
Unfortunately, the other thread runs right after the new pte is installed,
but before the caches have been flushed. It tries to execute some old JIT
code that was already in this page, but it sees some garbage in the i-cache
from the previous users of the new physical page.
Fix: we must make the caches consistent before installing the pte. This is
an ia64 only fix because lazy_mmu_prot_update() is a no-op on all other
architectures.
Signed-off-by: Anil Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch marks an unused export as EXPORT_UNUSED_SYMBOL.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (recursive) locking code to the lock validator. Has no effect
on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat. They have no
essential function for the VM.
We use a simple increment of per cpu variables. In order to avoid the most
severe races we disable preempt. Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter. However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.
In the non preempt case this results in a simple increment for each
counter. For many architectures this will be reduced by the compiler to a
single instruction. This single instruction is atomic for i386 and x86_64.
And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.
The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.
The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).
Benefits:
- VM event counter operations usually reduce to a single inline instruction
on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.
References:
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_page_table_pages to a per zone counter
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new VMA operation to notify a filesystem or other driver about the
MMU generating a fault because userspace attempted to write to a page
mapped through a read-only PTE.
This facility permits the filesystem or driver to:
(*) Implement storage allocation/reservation on attempted write, and so to
deal with problems such as ENOSPC more gracefully (perhaps by generating
SIGBUS).
(*) Delay making the page writable until the contents have been written to a
backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
It permits the filesystem to have some guarantee about the state of the
cache.
(*) Account and limit number of dirty pages. This is one piece of the puzzle
needed to make shared writable mapping work safely in FUSE.
Needed by cachefs (Or is it cachefiles? Or fscache? <head spins>).
At least four other groups have stated an interest in it or a desire to use
the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
EXT3 really ought to use it to deal with the case of shared-writable mmap
encountering ENOSPC before we permit the page to be dirtied.
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
get_user_pages(.write=1, .force=1) can generate COW hits on read-only
shared mappings, this patch traps those as mkpage_write candidates and fails
to handle them the old way.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is better to redo the complete fault if do_swap_page() finds that the
page is not in PageSwapCache() because the page migration code may have
replaced the swap pte already with a pte pointing to valid memory.
do_swap_page() may interpret an invalid swap entry without this patch
because we do not reload the pte if we are looping back. The page
migration code may already have reused the swap entry referenced by our
local swp_entry.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).
And __setup() is used in obsolete_checksetup().
start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()
If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.
If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].
Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.
This patch fixes a wrong usage of it, however fixes obvious one only.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
Kconfig help: MTD_JEDECPROBE already supports Intel
Remove ugly debugging stuff
do_mounts.c: Minor ROOT_DEV comment cleanup
BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
BUG_ON() Conversion in mm/mempool.c
BUG_ON() Conversion in mm/memory.c
BUG_ON() Conversion in kernel/fork.c
BUG_ON() Conversion in ipc/sem.c
BUG_ON() Conversion in fs/ext2/
BUG_ON() Conversion in fs/hfs/
BUG_ON() Conversion in fs/dcache.c
BUG_ON() Conversion in fs/buffer.c
BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
BUG_ON() Conversion in md/dm-table.c
BUG_ON() Conversion in md/dm-path-selector.c
BUG_ON() Conversion in drivers/isdn
BUG_ON() Conversion in drivers/char
BUG_ON() Conversion in drivers/mtd/
Currently, get_user_pages() returns fully coherent pages to the kernel for
anything other than anonymous pages. This is a problem for things like
fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA
to anonymous pages passed in by users.
The fix is to add a new memory management API: flush_anon_page() which
is used in get_user_pages() to make anonymous pages coherent.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Hugh is rightly concerned that the CONFIG_DEBUG_VM coverage has gone too
far in vm_normal_page, considering that we expect production kernels to be
shipped with the option turned off, and that the code has been under some
large changes recently.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Turns out the hugepage logic in free_pgtables() was doubly broken. The
loop coalescing multiple normal page VMAs into one call to free_pgd_range()
had an off by one error, which could mean it would coalesce one hugepage
VMA into the same bundle (checking 'vma' not 'next' in the loop). I
transferred this bug into the new is_vm_hugetlb_page() based version.
Here's the fix.
This one didn't bite on powerpc previously for the same reason the
is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
is identical to free_pgd_range(). It didn't bite on ia64 because the
hugepage region is distant enough from any other region that the separated
PMD_SIZE distance test would always prevent coalescing the two together.
No libhugetlbfs testsuite regressions (ppc64, POWER5).
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
of the normal free_pgd_range() on hugepage VMAs. However, the test it uses
to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
range at the start of the vma. is_hugepage_only_range() will return true
if the given range has any intersection with a hugepage address region, and
in this case the given region need not be hugepage aligned. So, for
example, this test can return true if called on, say, a 4k VMA immediately
preceding a (nicely aligned) hugepage VMA.
At present we get away with this because the powerpc version of
hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the
only other arch with a non-trivial is_hugepage_only_range()) we get away
with it for a different reason; the hugepage area is not contiguous with
the rest of the user address space, and VMAs are not permitted in between,
so the test can't return a false positive there.
Nonetheless this should be fixed. We do that in the patch below by
replacing the is_hugepage_only_range() test with an explicit test of the
VMA using is_vm_hugetlb_page().
This in turn changes behaviour for platforms where is_hugepage_only_range()
returns false always (everything except powerpc and ia64). We address this
by ensuring that hugetlb_free_pgd_range() is defined to be identical to
free_pgd_range() (instead of a no-op) on everything except ia64. Even so,
it will prevent some otherwise possible coalescing of calls down to
free_pgd_range(). Since this only happens for hugepage VMAs, removing this
small optimization seems unlikely to cause any trouble.
This patch causes no regressions on the libhugetlbfs testsuite - ppc64
POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Put a few more checks under CONFIG_DEBUG_VM
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Lee Revell reported 28ms latency when process with lots of swapped memory
exits.
2.6.15 introduced a latency regression when unmapping: in accounting the
zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but a
swap entry counted nothing at all. We think of pages present as the slow
case, but Lee's trace shows that free_swap_and_cache's radix tree lookup
can make a lot of work - and we could have been doing it many thousands of
times without a latency break.
Move the zap_work update up to account swap entries like pages present.
This does account non-linear pte_file entries, and unmap_mapping_range
skipping over swap entries, by the same amount even though they're quick:
but neither of those cases deserves complicating the code (and they're
treated no worse than they were in 2.6.14).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
installation it's easiest if it's a boot time option.
Also I moved the variable to a more appropiate place and make
it independent from sysctl
And marked __read_mostly which it is.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Check for PageSwapCache after looking up and locking a swap page.
The page migration code may change a swap pte to point to a different page
under lock_page().
If that happens then the vm must retry the lookup operation in the swap space
to find the correct page number. There are a couple of locations in the VM
where a lock_page() is done on a swap page. In these locations we need to
check afterwards if the page was migrated. If the page was migrated then the
old page that was looked up before was freed and no longer has the
PageSwapCache bit set.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This atomic operation is superfluous: the pte will be added with the
referenced bit set, and the page will be referenced through this mapping after
the page fault handler returns anyway.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Optimise rmap functions by minimising atomic operations when we know there
will be no concurrent modifications.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>