Commit Graph

317 Commits

Author SHA1 Message Date
Randy Dunlap dbce03b9e3 fs/writeback.c: fix kernel-doc warnings
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace.  Also #undef this macro at the end of
the function.

  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
  ../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09 15:11:24 -08:00
Junichi Nomura aa750fd71c mm/filemap.c: make global sync not clear error status of individual inodes
filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during
writeback.

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

   Application                    System admin
   ----------------------------------------------------------
   write data on page cache
                                  Run sync command
                                  writeback completes with error
                                  filemap_fdatawait() clears error
   fsync returns success
   (but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo b33e18f61b fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-28 13:17:30 +01:00
Tejun Heo b817525a4a writeback: bdi_writeback iteration must not skip dying ones
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:12 -06:00
Tejun Heo 6fdf860f15 writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
unfortunately, it was always waking up bdi->wb instead of the wb being
walked.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 001fe6f617 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:11 -06:00
Chris Mason 590dca3a71 fs-writeback: unplug before cond_resched in writeback_sb_inodes
Commit 505a666ee3 ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
which increases the merge rate when relatively contiguous small files
are written by the filesystem.  It helps both on flash and spindles.

For an fs_mark workload creating 4K files in parallel across 8 drives,
this commit improves performance ~9% more by unplugging before calling
cond_resched().  cond_resched() doesn't trigger an implicit unplug, so
explicitly getting the IO down to the device before scheduling reduces
latencies for anyone waiting on clean pages.

It also cuts down on how often we use kblockd to unplug, which means
less work bouncing from one workqueue to another.

Many more details about how we got here:

  https://lkml.org/lkml/2015/9/11/570

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-19 18:50:19 -07:00
Linus Torvalds 505a666ee3 writeback: plug writeback in wb_writeback() and writeback_inodes_wb()
We had to revert the pluggin in writeback_sb_inodes() because the
wb->list_lock is held, but we could easily plug at a higher level before
taking that lock, and unplug after releasing it.  This does that.

Chris will run performance numbers, just to verify that this approach is
comparable to the alternative (we could just drop and re-take the lock
around the blk_finish_plug() rather than these two commits.

I'd have preferred waiting for actual performance numbers before picking
one approach over the other, but I don't want to release rc1 with the
known "sleeping function called from invalid context" issue, so I'll
pick this cleanup version for now.  But if the numbers show that we
really want to plug just at the writeback_sb_inodes() level, and we
should just play ugly games with the spinlock, we'll switch to that.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-12 11:13:07 -07:00
Linus Torvalds 0ba13fd19d Revert "writeback: plug writeback at a high level"
This reverts commit d353d7587d.

Doing the block layer plug/unplug inside writeback_sb_inodes() is
broken, because that function is actually called with a spinlock held:
wb->list_lock, as pointed out by Chris Mason.

Chris suggested just dropping and re-taking the spinlock around the
blk_finish_plug() call (the plgging itself can happen under the
spinlock), and that would technically work, but is just disgusting.

We do something fairly similar - but not quite as disgusting because we
at least have a better reason for it - in writeback_single_inode(), so
it's not like the caller can depend on the lock being held over the
call, but in this case there just isn't any good reason for that
"release and re-take the lock" pattern.

[ In general, we should really strive to avoid the "release and retake"
  pattern for locks, because in the general case it can easily cause
  subtle bugs when the caller caches any state around the call that
  might be invalidated by dropping the lock even just temporarily. ]

But in this case, the plugging should be easy to just move up to the
callers before the spinlock is taken, which should even improve the
effectiveness of the plug.  So there is really no good reason to play
games with locking here.

I'll send off a test-patch so that Dave Chinner can verify that that
plug movement works.  In the meantime this just reverts the problematic
commit and adds a comment to the function so that we hopefully don't
make this mistake again.

Reported-by: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 13:26:39 -07:00
Linus Torvalds b0a1ea51bd Merge branch 'for-4.3/blkcg' of git://git.kernel.dk/linux-block
Pull blk-cg updates from Jens Axboe:
 "A bit later in the cycle, but this has been in the block tree for a a
  while.  This is basically four patchsets from Tejun, that improve our
  buffered cgroup writeback.  It was dependent on the other cgroup
  changes, but they went in earlier in this cycle.

  Series 1 is set of 5 patches that has cgroup writeback updates:

   - bdi_writeback iteration fix which could lead to some wb's being
     skipped or repeated during e.g. sync under memory pressure.

   - Simplification of wb work wait mechanism.

   - Writeback tracepoints updated to report cgroup.

  Series 2 is is a set of updates for the CFQ cgroup writeback handling:

     cfq has always charged all async IOs to the root cgroup.  It didn't
     have much choice as writeback didn't know about cgroups and there
     was no way to tell who to blame for a given writeback IO.
     writeback finally grew support for cgroups and now tags each
     writeback IO with the appropriate cgroup to charge it against.

     This patchset updates cfq so that it follows the blkcg each bio is
     tagged with.  Async cfq_queues are now shared across cfq_group,
     which is per-cgroup, instead of per-request_queue cfq_data.  This
     makes all IOs follow the weight based IO resource distribution
     implemented by cfq.

     - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

     - Other misc review points addressed, acks added and rebased.

  Series 3 is the blkcg policy cleanup patches:

     This patchset contains assorted cleanups for blkcg_policy methods
     and blk[c]g_policy_data handling.

     - alloc/free added for blkg_policy_data.  exit dropped.

     - alloc/free added for blkcg_policy_data.

     - blk-throttle's async percpu allocation is replaced with direct
       allocation.

     - all methods now take blk[c]g_policy_data instead of blkcg_gq or
       blkcg.

  And finally, series 4 is a set of patches cleaning up the blkcg stats
  handling:

    blkcg's stats have always been somwhat of a mess.  This patchset
    tries to improve the situation a bit.

     - The following patches added to consolidate blkcg entry point and
       blkg creation.  This is in itself is an improvement and helps
       colllecting common stats on bio issue.

     - per-blkg stats now accounted on bio issue rather than request
       completion so that bio based and request based drivers can behave
       the same way.  The issue was spotted by Vivek.

     - cfq-iosched implements custom recursive stats and blk-throttle
       implements custom per-cpu stats.  This patchset make blkcg core
       support both by default.

     - cfq-iosched and blk-throttle keep track of the same stats
       multiple times.  Unify them"

* 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
  blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
  blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
  blkcg: implement interface for the unified hierarchy
  blkcg: misc preparations for unified hierarchy interface
  blkcg: separate out tg_conf_updated() from tg_set_conf()
  blkcg: move body parsing from blkg_conf_prep() to its callers
  blkcg: mark existing cftypes as legacy
  blkcg: rename subsystem name from blkio to io
  blkcg: refine error codes returned during blkcg configuration
  blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
  blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
  blkcg: remove cfqg_stats->sectors
  blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
  blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
  blkcg: make blkcg_[rw]stat per-cpu
  blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
  blkcg: consolidate blkg creation in blkcg_bio_issue_check()
  blk-throttle: improve queue bypass handling
  blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
  blkcg: inline [__]blkg_lookup()
  ...
2015-09-10 18:56:14 -07:00
Linus Torvalds 7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Tejun Heo 006a0973ed writeback: sync_inodes_sb() must write out I_DIRTY_TIME inodes and always call wait_sb_inodes()
e79729123f ("writeback: don't issue wb_writeback_work if clean")
updated writeback path to avoid kicking writeback work items if there
are no inodes to be written out; unfortunately, the avoidance logic
was too aggressive and broke sync_inodes_sb().

* sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME
  inodes dont't contribute to bdi/wb_has_dirty_io() tests and were
  being skipped over.

* inodes are taken off wb->b_dirty/io/more_io lists after writeback
  starts on them.  sync_inodes_sb() skipping wait_sb_inodes() when
  bdi_has_dirty_io() breaks it by making it return while writebacks
  are in-flight.

This patch fixes the breakages by

* Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs().
  The callers are already testing the condition.

* Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that
  it always calls into bdi_split_work_to_wbs() and wait_sb_inodes().

* Making bdi_split_work_to_wbs() consider the b_dirty_time list for
  WB_SYNC_ALL writebacks.

Kudos to Eryu, Dave and Jan for tracking down the issue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e79729123f ("writeback: don't issue wb_writeback_work if clean")
Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com
Reported-and-bisected-by: Eryu Guan <eguan@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.com>
Cc: Ted Ts'o <tytso@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-25 14:35:09 -06:00
Tejun Heo 5634cc2aa9 writeback: update writeback tracepoints to report cgroup
The following tracepoints are updated to report the cgroup used during
cgroup writeback.

* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]

Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it.  Tracepoints which take
bdi are updated to take bdi_writeback instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18 15:49:15 -07:00
Tejun Heo 60292bcc1b writeback: explain why @inode is allowed to be NULL for inode_congested()
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18 15:49:15 -07:00
Tejun Heo 8a1270cda7 writeback: remove wb_writeback_work->single_wait/done
wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this.  bdi_split_work_to_wbs() can simply use on-stack
fallback wb_work item and separate wb_completion to wait for it.

This patch removes wb_work->single_wait/done and the related code and
make bdi_split_work_to_wbs() use on-stack fallback wb_work and
wb_completion instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18 15:49:15 -07:00
Tejun Heo 1ed8d48c57 writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg
wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
earlier implementation, wb's were keyed by blkcg ID.
bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
allows iterations to start from an arbitrary ID which is used to
interrupt and resume iterations.

Unfortunately, while changing wb to be keyed by memcg ID instead of
blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
are keyed by blkcg ID.  This doesn't affect iterations which don't get
interrupted but bdi_split_work_to_wbs() makes use of iteration
resuming on allocation failures and thus may incorrectly skip or
repeat wb's.

Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
blkcg IDs and updating bdi_split_work_to_wbs() accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18 15:49:15 -07:00
Dave Chinner c7f5408493 inode: rename i_wb_list to i_io_list
There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 23:38:10 -04:00
Dave Chinner e97fedb9ef sync: serialise per-superblock sync operations
When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 18:39:47 -04:00
Dave Chinner 74278da9f7 inode: convert inode_sb_list_lock to per-sb
The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 18:39:46 -04:00
Dave Chinner d353d7587d writeback: plug writeback at a high level
Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	------------	------	---------
unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%

If I use zero length files, this workload at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2015-08-17 18:39:45 -04:00
Tejun Heo 5aa2a96b34 block: export bio_associate_*() and wbc_account_io()
bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported.  Export them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-23 13:36:44 -06:00
Tejun Heo dd73e4b7df writeback: do foreign inode detection iff cgroup writeback is enabled
Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
foreign inode detection and migration logic still ends up activating
cgroup writeback which is unexpected.  This patch ensures that the
foreign inode detection logic stays disabled when inode_cgwb_enabled()
is false by not associating writeback_control's with bdi_writeback's.

This also avoids unnecessary operations in wbc_init_bio(),
wbc_account_io() and wbc_detach_inode() for filesystems which don't
support cgroup writeback.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-17 12:47:37 -06:00
Tejun Heo e8a7abf5a5 writeback: disassociate inodes from dying bdi_writebacks
For the purpose of foreign inode detection, wb's (bdi_writeback's) are
identified by the associated memcg ID.  As we create a separate wb for
each memcg, this is enough to identify the active wb's; however, when
blkcg is enabled or disabled higher up in the hierarchy, the mapping
between memcg and blkcg changes which in turn creates a new wb to
service the new mapping.  The old wb is unlinked from index and
released after all references are drained.  The foreign inode
detection logic can't detect this condition because both the old and
new wb's point to the same memcg and thus never decides to move inodes
attached to the old wb to the new one.

This patch adds logic to initiate switching immediately in
wbc_attach_and_unlock_inode() if the associated wb is dying.  We can
make the usual foreign detection logic to distinguish the different
wb's mapped to the memcg but the dying wb is never gonna be in active
service again and there's no point in tracking the usage history and
reaching the switch verdict after enough data points are collected.
It's already known that the wb has to be switched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo d10c809552 writeback: implement foreign cgroup inode bdi_writeback switching
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode.  While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and transfers the ownership to it.  The previous patches
implemented the foreign condition detection mechanism and laid the
groundwork.  This patch implements the actual switching.

With the previously implemented [unlocked_]inode_to_wb_and_list_lock()
and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
mapping->tree_lock gives us full exclusion against all wb operations
on the target inode.  inode_switch_wb_work_fn() grabs all the locks
and transfers the inode atomically along with its RECLAIMABLE and
WRITEBACK stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo aaa2cacf81 writeback: add lockdep annotation to inode_to_wb()
With the previous three patches, all operations which acquire wb from
inode are either under one of inode->i_lock, mapping->tree_lock or
wb->list_lock or protected by unlocked_inode_to_wb transaction.  This
will be depended upon by foreign inode wb switching.

This patch adds lockdep assertion to inode_to_wb() so that usages
outside the above list locks can be caught easily.  There are three
exceptions.

* locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
  wb may not be the inode's.  Ensuring that is the function's role
  after all.  Updated to deref inode->i_wb directly.

* inode_wb_stat_unlocked_begin() is usually protected by combination
  of !I_WB_SWITCH and rcu_read_lock().  Updated to deref inode->i_wb
  directly.

* inode_congested() wants to test whether inode->i_wb is set before
  starting the transaction.  Added inode_to_wb_is_valid() which tests
  inode->i_wb directly.

v5: might_lock() removed.  It annotates that the lock is grabbed w/
    irq enabled which isn't the case and triggering lockdep warning
    spuriously.

v4: might_lock() added to unlocked_inode_to_wb_begin().

v3: inode_congested() conversion added.

v2: locked_inode_to_wb_and_lock_list() was missing in the first
    version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo 5cb8b8241e writeback: use unlocked_inode_to_wb transaction in inode_congested()
Similar to wb stat updates, inode_congested() accesses the associated
wb of an inode locklessly, which will break with foreign inode wb
switching.  This path updates inode_congested() to use unlocked inode
wb access transaction introduced by the previous patch.

Combined with the previous two patches, this makes all wb list and
access operations to be protected by either of inode->i_lock,
wb->list_lock, or mapping->tree_lock while wb switching is in
progress.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00