Commit Graph

199 Commits

Author SHA1 Message Date
Linus Torvalds
001a541ea9 Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
  writeback: balanced_rate cannot exceed write bandwidth
  writeback: do strict bdi dirty_exceeded
  writeback: avoid tiny dirty poll intervals
  writeback: max, min and target dirty pause time
  writeback: dirty ratelimit - think time compensation
  btrfs: fix dirtied pages accounting on sub-page writes
  writeback: fix dirtied pages accounting on redirty
  writeback: fix dirtied pages accounting on sub-page writes
  writeback: charge leaked page dirties to active tasks
  writeback: Include all dirty inodes in background writeback
2012-01-10 16:59:59 -08:00
Linus Torvalds
eb59c505f8 Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
  PM / Hibernate: Implement compat_ioctl for /dev/snapshot
  PM / Freezer: fix return value of freezable_schedule_timeout_killable()
  PM / shmobile: Allow the A4R domain to be turned off at run time
  PM / input / touchscreen: Make st1232 use device PM QoS constraints
  PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
  PM / shmobile: Remove the stay_on flag from SH7372's PM domains
  PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
  PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  ARM: S3C64XX: Implement basic power domain support
  PM / shmobile: Use common always on power domain governor
  ...

Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
2012-01-08 13:10:57 -08:00
Wu Fengguang
bc31b86a59 writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
Fix compile error

 fs/fs-writeback.c:515:33: error: ‘PAGE_CACHE_SHIFT’ undeclared (first use in this function)

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2012-01-08 10:35:19 +08:00
Al Viro
ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
Rafael J. Wysocki
b7ba68c4a0 Merge branch 'pm-sleep' into pm-for-linus
* pm-sleep: (51 commits)
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
  PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
  PM / Sleep: Make [un]lock_system_sleep() generic
  PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
  PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
  PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
  Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
  PM / Sleep: Unify diagnostic messages from device suspend/resume
  ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
  PM / Hibernate: Remove deprecated hibernation test modes
  PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
  ...

Conflicts:
	kernel/kmod.c
2011-12-25 23:42:20 +01:00
Rafael J. Wysocki
b00f4dc5ff Merge branch 'master' into pm-sleep
* master: (848 commits)
  SELinux: Fix RCU deref check warning in sel_netport_insert()
  binary_sysctl(): fix memory leak
  mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
  ipmi_watchdog: restore settings when BMC reset
  oom: fix integer overflow of points in oom_badness
  memcg: keep root group unchanged if creation fails
  nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
  nilfs2: unbreak compat ioctl
  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  evm: prevent racing during tfm allocation
  evm: key must be set once during initialization
  mmc: vub300: fix type of firmware_rom_wait_states module parameter
  Revert "mmc: enable runtime PM by default"
  mmc: sdhci: remove "state" argument from sdhci_suspend_host
  x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
  IB/qib: Correct sense on freectxts increment and decrement
  RDMA/cma: Verify private data length
  cgroups: fix a css_set not found bug in cgroup_attach_proc
  oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
  Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
  ...

Conflicts:
	kernel/cgroup_freezer.c
2011-12-21 21:59:45 +01:00
Jan Kara
1bc36b6426 writeback: Include all dirty inodes in background writeback
Current livelock avoidance code makes background work to include only inodes
that were dirtied before background writeback has started. However background
writeback can be running for a long time and thus excluding newly dirtied
inodes can eventually exclude significant portion of dirty inodes making
background writeback inefficient. Since background writeback avoids livelocking
the flusher thread by yielding to any other work, there is no real reason why
background work should not include all dirty inodes so change the logic in
wb_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18 14:20:18 +08:00
Wu Fengguang
b3bba872dd writeback: show writeback reason with __print_symbolic
This makes the binary trace understandable by trace-cmd.

CC: Dave Chinner <david@fromorbit.com>
CC: Curt Wohlgemuth <curtw@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18 14:20:17 +08:00
Marcos Paulo de Souza
786228ab30 writeback: Fix issue on make htmldocs
Document the @reason parameter to make "make htmldocs" happy.

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-11-29 15:50:28 +08:00
Tejun Heo
8a32c441c1 freezer: implement and use kthread_freezable_should_stop()
Writeback and thinkpad_acpi have been using thaw_process() to prevent
deadlock between the freezer and kthread_stop(); unfortunately, this
is inherently racy - nothing prevents freezing from happening between
thaw_process() and kthread_stop().

This patch implements kthread_freezable_should_stop() which enters
refrigerator if necessary but is guaranteed to return if
kthread_stop() is invoked.  Both thaw_process() users are converted to
use the new function.

Note that this deadlock condition exists for many of freezable
kthreads.  They need to be converted to use the new should_stop or
freezable workqueue.

Tested with synthetic test case.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
2011-11-21 12:32:23 -08:00
Curt Wohlgemuth
0e175a1835 writeback: Add a 'reason' to wb_writeback_work
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity.  A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:33:36 +08:00
Curt Wohlgemuth
ad4e38dd6a writeback: send work item to queue_io, move_expired_inodes
Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure.  There are other fields of a work item that are
useful in these routines and in tracepoints.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:33:27 +08:00
Wu Fengguang
b00949aa2d writeback: per-bdi background threshold
One thing puzzled me is that in JBOD case, the per-disk writeout
performance is smaller than the corresponding single-disk case even
when they have comparable bdi_thresh. Tracing shows find that in single
disk case, bdi_writeback is always kept high while in JBOD case, it
could drop low from time to time and correspondingly bdi_reclaimable
could sometimes rush high.

The fix is to watch bdi_reclaimable and kick background writeback as
soon as it goes high. This resembles the global background threshold
but in per-bdi manner. The trick is, as long as bdi_reclaimable does
not go high, bdi_writeback naturally won't go low because
bdi_reclaimable+bdi_writeback ~= bdi_thresh.

With less fluctuated writeback pages, JBOD performance is observed to
increase noticeably in various cases.

vmstat:nr_written values before/after patch:

  3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
------------------------  ------------------------  
               125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
               110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
               127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
               206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 21:08:58 +08:00
Wu Fengguang
af6a311384 writeback: add bg_threshold parameter to __bdi_update_bandwidth()
No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 21:08:56 +08:00
Wu Fengguang
0e995816f4 don't busy retry the inode on failed grab_super_passive()
This fixes a soft lockup on conditions

a) the flusher is working on a work by __bdi_start_writeback(), while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grab sb->s_umount and enqueue a new work for the flusher to execute

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().

bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

Reported-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-31 22:52:08 +08:00
Linus Torvalds
f01ef569cd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
  mm: properly reflect task dirty limits in dirty_exceeded logic
  writeback: don't busy retry writeback on new/freeing inodes
  writeback: scale IO chunk size up to half device bandwidth
  writeback: trace global_dirty_state
  writeback: introduce max-pause and pass-good dirty limits
  writeback: introduce smoothed global dirty limit
  writeback: consolidate variable names in balance_dirty_pages()
  writeback: show bdi write bandwidth in debugfs
  writeback: bdi write bandwidth estimation
  writeback: account per-bdi accumulated written pages
  writeback: make writeback_control.nr_to_write straight
  writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
  writeback: trace event writeback_queue_io
  writeback: trace event writeback_single_inode
  writeback: remove .nonblocking and .encountered_congestion
  writeback: remove writeback_control.more_io
  writeback: skip balance_dirty_pages() for in-memory fs
  writeback: add bdi_dirty_limit() kernel-doc
  writeback: avoid extra sync work at enqueue time
  writeback: elevate queue_io() into wb_writeback()
  ...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c
2011-07-26 10:39:54 -07:00
Wu Fengguang
fcc5c22218 writeback: don't busy retry writeback on new/freeing inodes
Fix a system hang bug introduced by commit b7a2441f99 ("writeback:
remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
queue_io() into wb_writeback()") easily reproducible with high memory
pressure and lots of file creation/deletions, for example, a kernel
build in limited memory.

It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE 
state, the flusher will get stuck busy retrying that inode, never
releasing wb->list_lock. The lock in turn blocks all kinds of other
tasks when they are trying to grab it.

As put by Jan, it's a safe change regarding data integrity. I_FREEING or
I_WILL_FREE inodes are written back by iput_final() and it is reclaim
code that is responsible for eventually removing them. So writeback code
can safely ignore them. I_NEW inodes should move out of this state when
they are fully set up and in the writeback round following that, we will
consider them for writeback. So the change makes sense.                                                         

CC: Jan Kara <jack@suse.cz> 
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-24 10:46:51 +08:00
Dave Chinner
12ad3ab661 superblock: move pin_sb_for_writeback() to fs/super.c
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:38 -04:00
Wu Fengguang
1a12d8bd7b writeback: scale IO chunk size up to half device bandwidth
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages
will adapt to as large as the storage device can write within 500ms.

XFS is observed to do IO completions in a batch, and the batch size is
equal to the write chunk size. To avoid dirty pages to suddenly drop
out of balance_dirty_pages()'s dirty control scope and create large
fluctuations, the chunk size is also limited to half the control scope.

The balance_dirty_pages() control scrope is

	[(background_thresh + dirty_thresh) / 2, dirty_thresh]

which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.

The adpative write chunk size will be rounded to the nearest 4MB
boundary.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:03 -07:00
Wu Fengguang
c42843f2f0 writeback: introduce smoothed global dirty limit
The start of a heavy weight application (ie. KVM) may instantly knock
down determine_dirtyable_memory() if the swap is not enabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of sudden drop of
dirtyable memory. It will be used in the next patch for two new type of
dirty limits. Note that the new dirty limits are not going to avoid
throttling the light dirtiers, but could limit their sleep time to 200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:02 -07:00
Wu Fengguang
e98be2d599 writeback: bdi write bandwidth estimation
The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth will be reflecting how fast the device can
writeout when _fully utilized_, and won't drop to 0 when it goes idle.
The value will remain constant at disk idle time. At busy write time, if
not considering fluctuations, it will also remain high unless be knocked
down by possible concurrent reads that compete for the disk time and
bandwidth with async writes.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however may be a little biased in long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real bandwidth for the instance at all, due to large fluctuations.

The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half second worth of data if we are going
to increase the write chunk to half second worth of data. In ext4,
fluctuations with time period of around 5 seconds is observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then go further to do
another level of smoothing in avg_write_bandwidth.

CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Wu Fengguang
d46db3d582 writeback: make writeback_control.nr_to_write straight
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Wu Fengguang
e84d0a4f8e writeback: trace event writeback_queue_io
Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
251d6a471c writeback: trace event writeback_single_inode
It is valuable to know how the dirty inodes are iterated and their IO size.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
b7a2441f99 writeback: remove writeback_control.more_io
When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00