Commit Graph

126181 Commits

Author SHA1 Message Date
Nick Piggin ee53a891f4 mm: do_sync_mapping_range integrity fix
Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout.  Unfortunately, it is advertised as being usable for
data integrity operations.

This is a data integrity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:00 -08:00
Andrew Morton 82fd1a9a8c mm: write_cache_pages more terminate quickly
Now that we have the early-termination logic in place, it makes sense to
bail out early in all other cases where done is set to 1.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:00 -08:00
Nick Piggin d5482cdf8a mm: write_cache_pages terminate quickly
Terminate the write_cache_pages loop upon encountering the first page past
end, without locking the page.  Pages cannot have their index change when
we have a reference on them (truncate, eg truncate_inode_pages_range
performs the same check without the page lock).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:59:00 -08:00
Nick Piggin 515f4a037f mm: write_cache_pages optimise page cleaning
In write_cache_pages, if we get stuck behind another process that is
cleaning pages, we will be forced to wait for them to finish, then perform
our own writeout (if it was redirtied during the long wait), then wait for
that.

If a page under writeout is still clean, we can skip waiting for it (if
we're part of a data integrity sync, we'll be waiting for all writeout
pages afterwards, so we'll still be waiting for the other guy's write
that's cleaned the page).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Nick Piggin 5a3d5c9813 mm: write_cache_pages cleanups
Get rid of some complex expressions from flow control statements, add a
comment, remove some duplicate code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Nick Piggin 05fe478dd0 mm: write_cache_pages integrity fix
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
so the function will return success after writing out nr_to_write pages,
even if that was not sufficient to guarantee data integrity.

The callers tend to set it to values that could break data interity
semantics easily in practice.  For example, nr_to_write can be set to
mapping->nr_pages * 2, however if a file has a single, dirty page, then
fsync is called, subsequent pages might be concurrently added and dirtied,
then write_cache_pages might writeout two of these newly dirty pages,
while not writing out the old page that should have been written out.

Fix this by ignoring nr_to_write if it is a data integrity sync.

This is a data integrity bug.

The reason this has been done in the past is to avoid stalling sync
operations behind page dirtiers.

 "If a file has one dirty page at offset 1000000000000000 then someone
  does an fsync() and someone else gets in first and starts madly writing
  pages at offset 0, we want to write that page at 1000000000000000.
  Somehow."

What we do today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for.  Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.

This change does indeed make the possibility of long stalls lager, and
that's not a good thing, but lying about data integrity is even worse.  We
have to either perform the sync, or return -ELINUXISLAME so at least the
caller knows what has happened.

There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Nick Piggin 00266770b8 mm: write_cache_pages writepage error fix
In write_cache_pages, if ret signals a real error, but we still have some
pages left in the pagevec, done would be set to 1, but the remaining pages
would continue to be processed and ret will be overwritten in the process.

It could easily be overwritten with success, and thus success will be
returned even if there is an error.  Thus the caller is told all writes
succeeded, wheras in reality some did not.

Fix this by bailing immediately if there is an error, and retaining the
first error code.

This is a data integrity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Nick Piggin bd19e012f6 mm: write_cache_pages early loop termination
We'd like to break out of the loop early in many situations, however the
existing code has been setting mapping->writeback_index past the final
page in the pagevec lookup for cyclic writeback.  This is a problem if we
don't process all pages up to the final page.

Currently the code mostly keeps writeback_index reasonable and hacked
around this by not breaking out of the loop or writing pages outside the
range in these cases.  Keep track of a real "done index" that enables us
to terminate the loop in a much more flexible manner.

Needed by the subsequent patch to preserve writepage errors, and then
further patches to break out of the loop early for other reasons.  However
there are no functional changes with this patch alone.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Nick Piggin 31a12666d8 mm: write_cache_pages cyclic fix
In write_cache_pages, scanned == 1 is supposed to mean that cyclic
writeback has circled through zero, thus we should not circle again.
However it gets set to 1 after the first successful pagevec lookup.  This
leads to cases where not enough data gets written.

Counterexample: file with first 10 pages dirty, writeback_index == 5,
nr_to_write == 10.  Then the 5 last pages will be found, and scanned will
be set to 1, after writing those out, we will not cycle back to get the
first 5.

Rework this logic, now we'll always cycle unless we started off from index
0.  When cycling, only write out as far as 1 page before the start page
from the first cycle (so we don't write parts of the file twice).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
Miquel van Smoorenburg 38c8e61809 do_mpage_readpage(): don't submit lots of small bios on boundary
While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
identified a minor issue in fs/mpage.c

As the comment above mpage_readpages() says, a fs's get_block function
will set BH_Boundary when it maps a block just before a block for which
extra I/O is required.

Since get_block() can map a range of pages, for all these pages the
BH_Boundary flag will be set.  But we only need to push what I/O we have
accumulated at the last block of this range.

This makes do_mpage_readpage() send out the largest possible bio instead
of a bunch of page-sized ones in the BH_Boundary case.

Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
David Rientjes 75aa199410 oom: print triggering task's cpuset and mems allowed
When cpusets are enabled, it's necessary to print the triggering task's
set of allowable nodes so the subsequently printed meminfo can be
interpreted correctly.

We also print the task's cpuset name for informational purposes.

[rientjes@google.com: task lock current before dereferencing cpuset]
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:59 -08:00
David Rientjes c7d4caeb1d oom: fix zone_scan_mutex name
zone_scan_mutex is actually a spinlock, so name it appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Nick Piggin 1c0fe6e3bd mm: invoke oom-killer from page fault
Rather than have the pagefault handler kill a process directly if it gets
a VM_FAULT_OOM, have it call into the OOM killer.

With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
oom killing throttling, oom priority adjustment or selective disabling,
panic on oom, etc), it's silly to unconditionally kill the faulting
process at page fault time.  Create a hook for pagefault oom path to call
into instead.

Only converted x86 and uml so far.

[akpm@linux-foundation.org: make __out_of_memory() static]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Brice Goglin 5bd1455c23 mm: move_pages: no need to set pp->page to ZERO_PAGE(0) by default
pp->page is never used when not set to the right page, so there is no need
to set it to ZERO_PAGE(0) by default.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Brice Goglin 3140a22730 mm: rework do_pages_move() to work on page_sized chunks
Rework do_pages_move() to work by page-sized chunks of struct page_to_node
that are passed to do_move_page_to_node_array().  We now only have to
allocate a single page instead a possibly very large vmalloc area to store
all page_to_node entries.

As a result, new_page_node() will now have a very small lookup, hidding
much of the overall sys_move_pages() overhead.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Nathalie Furmento <Nathalie.Furmento@labri.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Hugh Dickins 390722baa7 mm: don't mark_page_accessed in shmem_fault
Following "mm: don't mark_page_accessed in fault path", which now
places a mark_page_accessed() in zap_pte_range(), we should remove
the mark_page_accessed() from shmem_fault().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Nick Piggin bf3f3bc5e7 mm: don't mark_page_accessed in fault path
Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
unmap-time if the pte is young has a number of problems.

mark_page_accessed is supposed to be roughly the equivalent of a young pte
for unmapped references. Unfortunately it doesn't come with any context:
after being called, reclaim doesn't know who or why the page was touched.

So calling mark_page_accessed not only adds extra lru or PG_referenced
manipulations for pages that are already going to have pte_young ptes anyway,
but it also adds these references which are difficult to work with from the
context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
wish to contribute to the page being referenced).

Then, simply doing SetPageReferenced when zapping a pte and finding it is
young, is not a really good solution either. SetPageReferenced does not
correctly promote the page to the active list for example. So after removing
mark_page_accessed from the fault path, several mmap()+touch+munmap() would
have a very different result from several read(2) calls for example, which
is not really desirable.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Mel Gorman 3340289ddf mm: report the MMU pagesize in /proc/pid/smaps
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
kernel to back a VMA.  This matches the size used by the MMU in the
majority of cases.  However, one counter-example occurs on PPC64 kernels
whereby a kernel using 64K as a base pagesize may still use 4K pages for
the MMU on older processor.  To distinguish, this patch reports
MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Mel Gorman 08fba69986 mm: report the pagesize backing a VMA in /proc/pid/smaps
It is useful to verify a hugepage-aware application is using the expected
pagesizes for its memory regions. This patch creates an entry called
KernelPageSize in /proc/pid/smaps that is the size of page used by the
kernel to back a VMA. The entry is not called PageSize as it is possible
the MMU uses a different size. This extension should not break any sensible
parser that skips lines containing unrecognised information.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 15:58:58 -08:00
Linus Torvalds 238c6d5483 Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm snapshot: extend exception store functions
  dm snapshot: split out exception store implementations
  dm snapshot: rename struct exception_store
  dm snapshot: separate out exception store interface
  dm mpath: move trigger_event to system workqueue
  dm: add name and uuid to sysfs
  dm table: rework reference counting
  dm: support barriers on simple devices
  dm request: extend target interface
  dm request: add caches
  dm ioctl: allow dm_copy_name_and_uuid to return only one field
  dm log: ensure log bitmap fits on log device
  dm log: move region_size validation
  dm log: avoid reinitialising io_req on every operation
  dm: consolidate target deregistration error handling
  dm raid1: fix error count
  dm log: fix dm_io_client leak on error paths
  dm snapshot: change yield to msleep
  dm table: drop reference at unbind
2009-01-05 19:20:59 -08:00
Jonathan Brassow a159c1ac5f dm snapshot: extend exception store functions
Supply dm_add_exception as a callback to the read_metadata function.
Add a status function ready for a later patch and name the functions
consistently.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:19 +00:00
Alasdair G Kergon 4db6bfe02b dm snapshot: split out exception store implementations
Move the existing snapshot exception store implementations out into
separate files.  Later patches will place these behind a new
interface in preparation for alternative implementations.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:17 +00:00
Jonathan Brassow 1ae25f9c93 dm snapshot: rename struct exception_store
Rename struct exception_store to dm_exception_store.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:16 +00:00
Jonathan Brassow aea53d92f7 dm snapshot: separate out exception store interface
Pull structures that bridge the gap between snapshot and
exception store out of dm-snap.h and put them in a new
.h file - dm-exception-store.h.  This file will define the
API for new exception stores.

Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
should be using it.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:15 +00:00
Alasdair G Kergon fe9cf30eb8 dm mpath: move trigger_event to system workqueue
The same workqueue is used both for sending uevents and processing queued I/O.
Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting
for the queued I/O to be processed.  Use scheduled_work() for the asynchronous
uevents instead.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:13 +00:00