Commit Graph

129 Commits

Author SHA1 Message Date
Wu Fengguang
0d99519efe writeback: remove unused nonblocking and congestion checks
- no one is calling wb_writeback and write_cache_pages with
  wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
  congestion wait

So remove the congestion checks as suggested by Chris.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Alex Elder <aelder@sgi.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-03 13:54:25 +01:00
Wu Fengguang
d25105e891 writeback: account IO throttling wait as iowait
It makes sense to do IOWAIT when someone is blocked
due to IO throttle, as suggested by Kame and Peter.

There is an old comment for not doing IOWAIT on throttle,
however it has been mismatching the code for a long time.

If we stop accounting IOWAIT for 2.6.32, it could be an
undesirable behavior change. So restore the io_schedule.

CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-09 12:40:42 +02:00
Jens Axboe
a72bfd4dea writeback: pass in super_block to bdi_start_writeback()
Sometimes we only want to write pages from a specific super_block,
so allow that to be passed in.

This fixes a problem with commit 56a131dcf7
causing writeback on all super_blocks on a bdi, where we only really
want to sync a specific sb from writeback_inodes_sb().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-26 00:10:40 +02:00
Linus Torvalds
6d7f18f6ea Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
  writeback: writeback_inodes_sb() should use bdi_start_writeback()
  writeback: don't delay inodes redirtied by a fast dirtier
  writeback: make the super_block pinning more efficient
  writeback: don't resort for a single super_block in move_expired_inodes()
  writeback: move inodes from one super_block together
  writeback: get rid to incorrect references to pdflush in comments
  writeback: improve readability of the wb_writeback() continue/break logic
  writeback: cleanup writeback_single_inode()
  writeback: kupdate writeback shall not stop when more io is possible
  writeback: stop background writeback when below background threshold
  writeback: balance_dirty_pages() shall write more than dirtied pages
  fs: Fix busyloop in wb_writeback()
2009-09-25 09:27:30 -07:00
Jens Axboe
5b0830cb90 writeback: get rid to incorrect references to pdflush in comments
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:25 +02:00
Wu Fengguang
d3ddec7635 writeback: stop background writeback when below background threshold
Treat bdi_start_writeback(0) as a special request to do background write,
and stop such work when we are below the background dirty threshold.

Also simplify the (nr_pages <= 0) checks. Since we already pass in
nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
need to worry about it being decreased to zero.

Reported-by: Richard Kennedy <richard@rsk.demon.co.uk>
CC: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:24 +02:00
Wu Fengguang
3a2e9a5a2a writeback: balance_dirty_pages() shall write more than dirtied pages
Some filesystem may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine number to write based on real number of dirtied pages.

Otherwise it is possible that
  loop {
    btrfs_file_write():     dirty 1024 pages
    balance_dirty_pages():  write up to 48 pages (= ratelimit_pages * 1.5)
  }
in which the writeback rate cannot keep up with dirty rate, and the
dirty pages go all the way beyond dirty_thresh.

The increased write_chunk may make the dirtier more bumpy.
So filesystems shall be take care not to dirty too much at
a time (eg. > 4MB) without checking the ratelimit.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-25 18:08:24 +02:00
Linus Torvalds
db16826367 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
  HWPOISON: Enable error_remove_page on btrfs
  HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  HWPOISON: Add madvise() based injector for hardware poisoned pages v4
  HWPOISON: Enable error_remove_page for NFS
  HWPOISON: Enable .remove_error_page for migration aware file systems
  HWPOISON: The high level memory error handler in the VM v7
  HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
  HWPOISON: shmem: call set_page_dirty() with locked page
  HWPOISON: Define a new error_remove_page address space op for async truncation
  HWPOISON: Add invalidate_inode_page
  HWPOISON: Refactor truncate to allow direct truncating of page v2
  HWPOISON: check and isolate corrupted free pages v2
  HWPOISON: Handle hardware poisoned pages in try_to_unmap
  HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  HWPOISON: Add poison check to page fault handling
  HWPOISON: Add basic support for poisoned pages in fault handler v3
  HWPOISON: Add new SIGBUS error codes for hardware poison signals
  HWPOISON: Add support for poison swap entries v2
  HWPOISON: Export some rmap vma locking to outside world
  ...
2009-09-24 07:53:22 -07:00
Alexey Dobriyan
8d65af789f sysctl: remove "struct file *" argument of ->proc_handler
It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:21:04 -07:00
Wu Fengguang
adea02a1be mm: count only reclaimable lru pages
global_lru_pages() / zone_lru_pages() can be used in two ways:
- to estimate max reclaimable pages in determine_dirtyable_memory()
- to calculate the slab scan ratio

When swap is full or not present, the anon lru lists are not reclaimable
and also won't be scanned.  So the anon pages shall not be counted in both
usage scenarios.  Also rename to _reclaimable_pages: now they are counting
the possibly reclaimable lru pages.

It can greatly (and correctly) increase the slab scan rate under high
memory pressure (when most file pages have been reclaimed and swap is
full/absent), thus reduce false OOM kills.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: David Howells <dhowells@redhat.com>
Cc: "Li, Ming Chun" <macli@brc.ubc.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:30 -07:00
Jens Axboe
87c6a9b253 writeback: make balance_dirty_pages() gradually back more off
Currently it just sleeps for a very short time, just 1 jiffy. If
we keep looping in there, continually delay for a little longer
of up to 100msec in total. That was the old limit for congestion
wait.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-21 15:40:33 +02:00
Jens Axboe
3542a5c0de writeback: don't use schedule_timeout() without setting runstate
Just use schedule_timeout_interruptible(), saves a call to
set_current_state().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-21 15:40:33 +02:00
Jens Axboe
b6e51316da writeback: separate starting of sync vs opportunistic writeback
bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.

Push down the writeback_control allocation and only accept the
parameters that make sense for each function. This cleans up
the API considerably.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:18:52 +02:00
Jens Axboe
cfc4ba5365 writeback: use RCU to protect bdi_list
Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:18:51 +02:00
Jens Axboe
1fe06ad892 writeback: get rid of wbc->for_writepages
It's only set, it's never checked. Kill it.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-16 15:16:18 +02:00
Wu Fengguang
6746aff74d HWPOISON: shmem: call set_page_dirty() with locked page
The dirtying of page and set_page_dirty() can be moved into the page lock.

- In shmem_write_end(), the page was dirtied while the page lock was held,
  but it's being marked dirty just after dropping the page lock.
- In shmem_symlink(), both dirtying and marking can be moved into page lock.

It's valuable for the hwpoison code to know whether one bad page can be dropped
without losing data. It mainly judges by testing the PG_dirty bit after taking
the page lock. So it becomes important that the dirtying of page and the
marking of dirtiness are both done inside the page lock. Which is a common
practice, but sadly not a rule.

The noticeable exceptions are
- mapped pages
- pages with buffer_heads
The above pages could go dirty at any time. Fortunately the hwpoison will
unmap the page and release the buffer_heads beforehand anyway.

Many other types of pages (eg. metadata pages) can also be dirtied at will by
their owners, the hwpoison code cannot do meaningful things to them anyway.
Only the dirtiness of pagecache pages owned by regular files are interested.

v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:14 +02:00
Linus Torvalds
ada3fa1505 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
  powerpc64: convert to dynamic percpu allocator
  sparc64: use embedding percpu first chunk allocator
  percpu: kill lpage first chunk allocator
  x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
  percpu: update embedding first chunk allocator to handle sparse units
  percpu: use group information to allocate vmap areas sparsely
  vmalloc: implement pcpu_get_vm_areas()
  vmalloc: separate out insert_vmalloc_vm()
  percpu: add chunk->base_addr
  percpu: add pcpu_unit_offsets[]
  percpu: introduce pcpu_alloc_info and pcpu_group_info
  percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
  percpu: add @align to pcpu_fc_alloc_fn_t
  percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
  percpu: drop @static_size from first chunk allocators
  percpu: generalize first chunk allocator selection
  percpu: build first chunk allocators selectively
  percpu: rename 4k first chunk allocator to page
  percpu: improve boot messages
  percpu: fix pcpu_reclaim() locking
  ...

Fix trivial conflict as by Tejun Heo in kernel/sched.c
2009-09-15 09:39:44 -07:00
Jens Axboe
03ba3782e8 writeback: switch to per-bdi threads for flushing data
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
 0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
 1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
 0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
 0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
 0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
 0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
 0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
 0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
 0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
 1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
 0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
 0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
 1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
 0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
 0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
 1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
 0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
 1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
 1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
 0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Jens Axboe
66f3b8e2e1 writeback: move dirty inodes from super_block to backing_dev_info
This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 09:20:25 +02:00
Tejun Heo
384be2b18a Merge branch 'percpu-for-linus' into percpu-for-next
Conflicts:
	arch/sparc/kernel/smp_64.c
	arch/x86/kernel/cpu/perf_counter.c
	arch/x86/kernel/setup_percpu.c
	drivers/cpufreq/cpufreq_ondemand.c
	mm/percpu.c

Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids.  As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14 14:45:31 +09:00
Jens Axboe
8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Tejun Heo
c43768cbb7 Merge branch 'master' into for-next
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes.  As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.

Conflicts:
	arch/alpha/include/asm/percpu.h
	arch/mn10300/kernel/vmlinux.lds.S
	include/linux/percpu-defs.h
2009-07-04 07:13:18 +09:00
Richard Kennedy
d7831a0bdf mm: prevent balance_dirty_pages() from doing too much work
balance_dirty_pages can overreact and move all of the dirty pages to
writeback unnecessarily.

balance_dirty_pages makes its decision to throttle based on the number of
dirty plus writeback pages that are over the calculated limit,so it will
continue to move pages even when there are plenty of pages in writeback
and less than the threshold still dirty.

This allows it to overshoot its limits and move all the dirty pages to
writeback while waiting for the drives to catch up and empty the writeback
list.

A simple fio test easily demonstrates this problem.

fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10

This is the simplest fix I could find, but I'm not entirely sure that it
alone will be enough for all cases.  But it certainly is an improvement on
my desktop machine writing to 2 disks.

Do we need something more for machines with large arrays where
bdi_threshold * number_of_drives is greater than the dirty_ratio ?

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:01 -07:00
Tejun Heo
245b2e70ea percpu: clean up percpu variable definitions
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique.  Update percpu
variable definitions accordingly.

* as,cfq: rename ioc_count uniquely

* cpufreq: rename cpu_dbs_info uniquely

* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it

* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
  rename it

* ipv4,6: rename cookie_scratch uniquely

* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
  pmc_irq_entry and nmi_entry to pmc_nmi_entry

* perf_counter: rename disable_count to perf_disable_count

* ftrace: rename test_event_disable to ftrace_test_event_disable

* kmemleak: rename test_pointer to kmemleak_test_pointer

* mce: rename next_interval to mce_next_interval

[ Impact: percpu usage cleanups, no duplicate static percpu var names ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
2009-06-24 15:13:48 +09:00
H Hartley Sweeten
dcf975d585 mm/page-writeback.c: dirty limit type should be unsigned long
get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
with variable pbdi_dirty as one of the arguments.  This variable is an
unsigned long * but both functions expect it to be a long *.  This causes
the following sparse warnings:

  warning: incorrect type in argument 3 (different signedness)
     expected long *pbdi_dirty
     got unsigned long *pbdi_dirty
  warning: incorrect type in argument 2 (different signedness)
     expected long *pdirty
     got unsigned long *pbdi_dirty

Fix the warnings by changing the long * to unsigned long * in both
functions.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:31 -07:00