Commit Graph

1837 Commits

Author SHA1 Message Date
Ming Lei 25e823c8c3 block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices
Apply the introduced pm_runtime_set_memalloc_noio on block device so
that PM core will teach mm to not allocate memory with GFP_IOFS when
calling the runtime_resume and runtime_suspend callback for block
devices and its ancestors.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:16 -08:00
Darrick J. Wong ffecfd1a72 block: optionally snapshot page contents to provide stable pages during write
This provides a band-aid to provide stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.

For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude.  If we're
going to migrate iscsi and raid to use stable page writes, the
complaints about high latency will likely return.  We might as well
centralize their page snapshotting thing to one place.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:20 -08:00
Darrick J. Wong 7d311cdab6 bdi: allow block devices to say that they require stable page writes
This patchset ("stable page writes, part 2") makes some key
modifications to the original 'stable page writes' patchset.  First, it
provides creators (devices and filesystems) of a backing_dev_info a flag
that declares whether or not it is necessary to ensure that page
contents cannot change during writeout.  It is no longer assumed that
this is true of all devices (which was never true anyway).  Second, the
flag is used to relaxed the wait_on_page_writeback calls so that wait
only occurs if the device needs it.  Third, it fixes up the remaining
disk-backed filesystems to use this improved conditional-wait logic to
provide stable page writes on those filesystems.

It is hoped that (for people not using checksumming devices, anyway)
this patchset will give back unnecessary performance decreases since the
original stable page write patchset went into 3.0.  Sorry about not
fixing it sooner.

Complaints were registered by several people about the long write
latencies introduced by the original stable page write patchset.
Generally speaking, the kernel ought to allocate as little extra memory
as possible to facilitate writeout, but for people who simply cannot
wait, a second page stability strategy is (re)introduced: snapshotting
page contents.  The waiting behavior is still the default strategy; to
enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
set.  This flag is used to bandaid^Henable stable page writeback on
ext3[1], and is not used anywhere else.

Given that there are already a few storage devices and network FSes that
have rolled their own page stability wait/page snapshot code, it would
be nice to move towards consolidating all of these.  It seems possible
that iscsi and raid5 may wish to use the new stable page write support
to enable zero-copy writeout.

Thank you to Jan Kara for helping fix a couple more filesystems.

Per Andrew Morton's request, here are the result of using dbench to measure
latencies on ext2:

3.8.0-rc3:
   Operation      Count    AvgLat    MaxLat
   ----------------------------------------
   WriteX        109347     0.028    59.817
   ReadX         347180     0.004     3.391
   Flush          15514    29.828   287.283

  Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
   WriteX        105556     0.029     4.273
   ReadX         335004     0.005     4.112
   Flush          14982    30.540   298.634

  Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, for ext2 the maximum write latency decreases from ~60ms
on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
increase, though I suspect that being able to dirty pages faster gives
the flusher more work to do.

On ext4, the average write latency decreases as well as all the maximum
latencies:

3.8.0-rc3:
   WriteX         85624     0.152    33.078
   ReadX         272090     0.010    61.210
   Flush          12129    36.219   168.260

  Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms

3.8.0-rc3 + patches:
   WriteX         86082     0.141    30.928
   ReadX         273358     0.010    36.124
   Flush          12214    34.800   165.689

  Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms

XFS seems to exhibit similar latency improvements as ext2:

3.8.0-rc3:
   WriteX        125739     0.028   104.343
   ReadX         399070     0.005     4.115
   Flush          17851    25.004   131.390

  Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms

3.8.0-rc3 + patches:
   WriteX        123529     0.028     6.299
   ReadX         392434     0.005     4.287
   Flush          17549    25.120   188.687

  Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms

...and btrfs, just to round things out, also shows some latency
decreases:

3.8.0-rc3:
   WriteX         67122     0.083    82.355
   ReadX         212719     0.005     2.828
   Flush           9547    47.561   147.418

  Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms

3.8.0-rc3 + patches:
   WriteX         64898     0.101    71.631
   ReadX         206673     0.005     7.123
   Flush           9190    47.963   219.034

  Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own wait code, or they don't block at all.  The blocking
behavior is back to what it was before 3.0 if you don't have a disk
requiring stable page writes.

This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
results as -rc3.

[1] The alternative fixes to ext3 include fixing the locking order and
page bit handling like we did for ext4 (but then why not just use
ext4?), or setting PG_writeback so early that ext3 becomes extremely
slow.  I tried that, but the number of write()s I could initiate dropped
by nearly an order of magnitude.  That was a bit much even for the
author of the stable page series! :)

This patch:

Creates a per-backing-device flag that tracks whether or not pages must
be held immutable during writeout.  Eventually it will be used to waive
wait_for_page_writeback() if nothing requires stable pages.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:19 -08:00
Linus Torvalds ece8e0b2f9 Merge branch 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull async changes from Tejun Heo:
 "These are followups for the earlier deadlock issue involving async
  ending up waiting for itself through block requesting module[1].  The
  following changes are made by these commits.

   - Instead of requesting default elevator on each request_queue init,
     block now requests it once early during boot.

   - Kmod triggers warning if invoked from an async worker.

   - Async synchronization implementation has been reimplemented.  It's
     a lot simpler now."

* 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  async: initialise list heads to fix crash
  async: replace list of active domains with global list of pending items
  async: keep pending tasks on async_domain and remove async_pending
  async: use ULLONG_MAX for infinity cookie value
  async: bring sanity to the use of words domain and running
  async, kmod: warn on synchronous request_module() from async workers
  block: don't request module during elevator init
  init, block: try to load default elevator module early during boot
2013-02-19 22:10:26 -08:00
Linus Torvalds d652e1eb8e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Main changes:

   - scheduler side full-dynticks (user-space execution is undisturbed
     and receives no timer IRQs) preparation changes that convert the
     cputime accounting code to be full-dynticks ready, from Frederic
     Weisbecker.

   - Initial sched.h split-up changes, by Clark Williams

   - select_idle_sibling() performance improvement by Mike Galbraith:

        " 1 tbench pair (worst case) in a 10 core + SMT package:

          pre   15.22 MB/sec 1 procs
          post 252.01 MB/sec 1 procs "

  - sched_rr_get_interval() ABI fix/change.  We think this detail is not
    used by apps (so it's not an ABI in practice), but lets keep it
    under observation.

  - misc RT scheduling cleanups, optimizations"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
  cputime: Remove irqsave from seqlock readers
  sched, powerpc: Fix sched.h split-up build failure
  cputime: Restore CPU_ACCOUNTING config defaults for PPC64
  sched/rt: Move rt specific bits into new header file
  sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
  sched: Move sched.h sysctl bits into separate header
  sched: Fix signedness bug in yield_to()
  sched: Fix select_idle_sibling() bouncing cow syndrome
  sched/rt: Further simplify pick_rt_task()
  sched/rt: Do not account zero delta_exec in update_curr_rt()
  cputime: Safely read cputime of full dynticks CPUs
  kvm: Prepare to add generic guest entry/exit callbacks
  cputime: Use accessors to read task cputime stats
  cputime: Allow dynamic switch between tick/virtual based cputime accounting
  cputime: Generic on-demand virtual cputime accounting
  cputime: Move default nsecs_to_cputime() to jiffies based cputime file
  cputime: Librarize per nsecs resolution cputime definitions
  cputime: Avoid multiplication overflow on utime scaling
  context_tracking: Export context state for generic vtime
  ...

Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-19 18:19:48 -08:00
Clark Williams cf4aebc292 sched: Move sched.h sysctl bits into separate header
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:50:54 +01:00
Tejun Heo 21c3c5d280 block: don't request module during elevator init
Block layer allows selecting an elevator which is built as a module to
be selected as system default via kernel param "elevator=".  This is
achieved by automatically invoking request_module() whenever a new
block device is initialized and the elevator is not available.

This led to an interesting deadlock problem involving async and module
init.  Block device probing running off an async job invokes
request_module().  While the module is being loaded, it performs
async_synchronize_full() which ends up waiting for the async job which
is already waiting for request_module() to finish, leading to
deadlock.

Invoking request_module() from deep in block device init path is
already nasty in itself.  It seems best to avoid these situations from
the beginning by moving on-demand module loading out of block init
path.

The previous patch made sure that the default elevator module is
loaded early during boot if available.  This patch removes on-demand
loading of the default elevator from elevator init path.  As the
module would have been loaded during boot, userland-visible behavior
difference should be minimal.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

v2: The bool parameter was named @request_module which conflicted with
    request_module().  This built okay w/ CONFIG_MODULES because
    request_module() was defined as a macro.  W/o CONFIG_MODULES, it
    causes build breakage.  Rename the parameter to @try_loading.
    Reported by Fengguang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
2013-01-22 16:48:03 -08:00
Tejun Heo bb813f4c93 init, block: try to load default elevator module early during boot
This patch adds default module loading and uses it to load the default
block elevator.  During boot, it's called right after initramfs or
initrd is made available and right before control is passed to
userland.  This ensures that as long as the modules are available in
the usual places in initramfs, initrd or the root filesystem, the
default modules are loaded as soon as possible.

This will replace the on-demand elevator module loading from elevator
init path.

v2: Fixed build breakage when !CONFIG_BLOCK.  Reported by kbuild test
    robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Riesen <raa.lkml@gmail.com>
Cc: Fengguang We <fengguang.wu@intel.com>
2013-01-18 14:05:56 -08:00
Derek Basehore 12c2bdb232 block: prevent race/cleanup
Remove a race condition which causes a warning in disk_clear_events.  This
is a race between disk_clear_events() and disk_flush_events().
ev->clearing will be altered by disk_flush_events() even though we are
blocking event checking through disk_flush_events().  If this happens
after ev->clearing was cleared for disk_clear_events(), this can cause the
WARN_ON_ONCE() in that function to be triggered.

This change also has disk_clear_events() not go through a workqueue.
Since we have to wait for the work to complete, we should just call the
function directly.  Also, since this work cannot be put on a freezable
workqueue, it will have to contend with increased demand, so calling the
function directly avoids this.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-19 20:36:10 +01:00
Derek Basehore aea24a8bbc block: remove deadlock in disk_clear_events
In disk_clear_events, do not put work on system_nrt_freezable_wq.
Instead, put it on system_nrt_wq.

There is a race between probing a usb and suspending the device.  Since
probing a usb calls disk_clear_events, which puts work on a frozen
workqueue, probing cannot finish after the workqueue is frozen.  However,
suspending cannot finish until the usb probe is finished, so we get a
deadlock, causing the system to reboot.

The way to reproduce this bug is to wake up from suspend with a usb
storage device plugged in, or plugging in a usb storage device right
before suspend.  The window of time is on the order of time it takes to
probe the usb device.  As long as the workqueues are frozen before the
call to add_disk within sd_probe_async finishes, there will be a deadlock
(which calls blkdev_get, sd_open, check_disk_change, then
disk_clear_events).  This is not difficult to reproduce after figuring out
the timings.

[akpm@linux-foundation.org: fix up comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-19 20:36:05 +01:00
Linus Torvalds 848b81415c Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:
 "Incoming:

   - lots of misc stuff

   - backlight tree updates

   - lib/ updates

   - Oleg's percpu-rwsem changes

   - checkpatch

   - rtc

   - aoe

   - more checkpoint/restart support

  I still have a pile of MM stuff pending - Pekka should be merging
  later today after which that is good to go.  A number of other things
  are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
  scatterlist: don't BUG when we can trivially return a proper error.
  docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
  fs, fanotify: add @mflags field to fanotify output
  docs: add documentation about /proc/<pid>/fdinfo/<fd> output
  fs, notify: add procfs fdinfo helper
  fs, exportfs: add exportfs_encode_inode_fh() helper
  fs, exportfs: escape nil dereference if no s_export_op present
  fs, epoll: add procfs fdinfo helper
  fs, eventfd: add procfs fdinfo helper
  procfs: add ability to plug in auxiliary fdinfo providers
  tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
  breakpoint selftests: print failure status instead of cause make error
  kcmp selftests: print fail status instead of cause make error
  kcmp selftests: make run_tests fix
  mem-hotplug selftests: print failure status instead of cause make error
  cpu-hotplug selftests: print failure status instead of cause make error
  mqueue selftests: print failure status instead of cause make error
  vm selftests: print failure status instead of cause make error
  ubifs: use prandom_bytes
  mtd: nandsim: use prandom_bytes
  ...
2012-12-17 20:58:12 -08:00
Oleg Nesterov 22b361d1df percpu_rw_semaphore: introduce CONFIG_PERCPU_RWSEM
Currently only block_dev and uprobes use percpu_rw_semaphore,
add the config option selected by BLOCK || UPROBES.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:18 -08:00
Linus Torvalds 9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Linus Torvalds 60da5bf47d Merge branch 'for-3.8/core' of git://git.kernel.dk/linux-block
Pull block layer core updates from Jens Axboe:
 "Here are the core block IO bits for 3.8.  The branch contains:

   - The final version of the surprise device removal fixups from Bart.

   - Don't hide EFI partitions under advanced partition types.  It's
     fairly wide spread these days.  This is especially dangerous for
     systems that have both msdos and efi partition tables, where you
     want to keep them in sync.

   - Cleanup of using -1 instead of the proper NUMA_NO_NODE

   - Export control of bdi flusher thread CPU mask and default to using
     the home node (if known) from Jeff.

   - Export unplug tracepoint for MD.

   - Core improvements from Shaohua.  Reinstate the recursive merge, as
     the original bug has been fixed.  Add plugging for discard and also
     fix a problem handling non pow-of-2 discard limits.

  There's a trivial merge in block/blk-exec.c due to a fix that went
  into 3.7-rc at a later point than -rc4 where this is based."

* 'for-3.8/core' of git://git.kernel.dk/linux-block:
  block: export block_unplug tracepoint
  block: add plug for blkdev_issue_discard
  block: discard granularity might not be power of 2
  deadline: Allow 0ms deadline latency, increase the read speed
  partitions: enable EFI/GPT support by default
  bsg: Remove unused function bsg_goose_queue()
  block: Make blk_cleanup_queue() wait until request_fn finished
  block: Avoid scheduling delayed work on a dead queue
  block: Avoid that request_fn is invoked on a dead queue
  block: Let blk_drain_queue() caller obtain the queue lock
  block: Rename queue dead flag
  bdi: add a user-tunable cpu_list for the bdi flusher threads
  block: use NUMA_NO_NODE instead of -1
  block: recursive merge requests
  block CFQ: avoid moving request to different queue
2012-12-17 08:27:23 -08:00
NeilBrown cbae8d45d6 block: export block_unplug tracepoint
This allows stacked devices (like md/raid5) to provide blktrace
tracing, including unplug events.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:49:27 +01:00
Shaohua Li 0cfbcafcae block: add plug for blkdev_issue_discard
Last post of this patch appears lost, so I resend this.

Now discard merge works, add plug for blkdev_issue_discard. This will help
discard request merge especially for raid0 case. In raid0, a big discard
request is split to small requests, and if correct plug is added, such small
requests can be merged in low layer.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:46:04 +01:00
Shaohua Li 8dd2cb7e88 block: discard granularity might not be power of 2
In MD raid case, discard granularity might not be power of 2, for example, a
4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for
such cases.

Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-14 20:46:04 +01:00
Linus Torvalds d206e09036 Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "A lot of activities on cgroup side.  The big changes are focused on
  making cgroup hierarchy handling saner.

   - cgroup_rmdir() had peculiar semantics - it allowed cgroup
     destruction to be vetoed by individual controllers and tried to
     drain refcnt synchronously.  The vetoing never worked properly and
     caused good deal of contortions in cgroup.  memcg was the last
     reamining user.  Michal Hocko removed the usage and cgroup_rmdir()
     path has been simplified significantly.  This was done in a
     separate branch so that the memcg people can base further memcg
     changes on top.

   - The above allowed cleaning up cgroup lifecycle management and
     implementation of generic cgroup iterators which are used to
     improve hierarchy support.

   - cgroup_freezer updated to allow migration in and out of a frozen
     cgroup and handle hierarchy.  If a cgroup is frozen, all descendant
     cgroups are frozen.

   - netcls_cgroup and netprio_cgroup updated to handle hierarchy
     properly.

   - Various fixes and cleanups.

   - Two merge commits.  One to pull in memcg and rmdir cleanups (needed
     to build iterators).  The other pulled in cgroup/for-3.7-fixes for
     device_cgroup fixes so that further device_cgroup patches can be
     stacked on top."

Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)

* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
  cgroup: update Documentation/cgroups/00-INDEX
  cgroup_rm_file: don't delete the uncreated files
  cgroup: remove subsystem files when remounting cgroup
  cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
  cgroup: warn about broken hierarchies only after css_online
  cgroup: list_del_init() on removed events
  cgroup: fix lockdep warning for event_control
  cgroup: move list add after list head initilization
  netprio_cgroup: allow nesting and inherit config on cgroup creation
  netprio_cgroup: implement netprio[_set]_prio() helpers
  netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
  netprio_cgroup: reimplement priomap expansion
  netprio_cgroup: shorten variable names in extend_netdev_table()
  netprio_cgroup: simplify write_priomap()
  netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
  cgroup: remove obsolete guarantee from cgroup_task_migrate.
  cgroup: add cgroup->id
  cgroup, cpuset: remove cgroup_subsys->post_clone()
  cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
  cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
  ...
2012-12-12 08:18:24 -08:00
xiaobing tu 75274551c8 deadline: Allow 0ms deadline latency, increase the read speed
Change a timer compare from after to after-equals, thus allowing
0 timeout and making deadline schedule FIFO.

Signed-off-by: xiaobing tu <xiaobing.tu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-09 19:19:23 +01:00
Diego Calleja 5f6f38dbb0 partitions: enable EFI/GPT support by default
The Kconfig currently enables MSDOS partitions by default because they
are assumed to be essential, but it's necessary to enable "advanced
partition selection" in order to get GPT support. IMO GPT partitions
are becoming common enought to deserve the same treatment MSDOS
partitions get.

(Side note: I got bit by a disk that had MSDOS and GPT partition
tables, but for some reason the MSDOS table was different from the
GPT one. I was stupid enought to disable "advanced partition
selection" in my .config, which disabled GPT partitioning and made
my btrfs pool unbootable because it couldn't find the partitions)

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:34:58 +01:00
Bart Van Assche 80729beb33 bsg: Remove unused function bsg_goose_queue()
The function bsg_goose_queue() does not have any in-tree callers,
so let's remove it.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:33:02 +01:00
Bart Van Assche 24faf6f604 block: Make blk_cleanup_queue() wait until request_fn finished
Some request_fn implementations, e.g. scsi_request_fn(), unlock
the queue lock internally. This may result in multiple threads
executing request_fn for the same queue simultaneously. Keep
track of the number of active request_fn calls and make sure that
blk_cleanup_queue() waits until all active request_fn invocations
have finished. A block driver may start cleaning up resources
needed by its request_fn as soon as blk_cleanup_queue() finished,
so blk_cleanup_queue() must wait for all outstanding request_fn
invocations to finish.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Chanho Min <chanho.min@lge.com>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:33:00 +01:00
Bart Van Assche 704605711e block: Avoid scheduling delayed work on a dead queue
Running a queue must continue after it has been marked dying until
it has been marked dead. So the function blk_run_queue_async() must
not schedule delayed work after blk_cleanup_queue() has marked a queue
dead. Hence add a test for that queue state in blk_run_queue_async()
and make sure that queue_unplugged() invokes that function with the
queue lock held. This avoids that the queue state can change after
it has been tested and before mod_delayed_work() is invoked. Drop
the queue dying test in queue_unplugged() since it is now
superfluous: __blk_run_queue() already tests whether or not the
queue is dead.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:30 +01:00
Bart Van Assche c246e80d86 block: Avoid that request_fn is invoked on a dead queue
A block driver may start cleaning up resources needed by its
request_fn as soon as blk_cleanup_queue() finished, so request_fn
must not be invoked after draining finished. This is important
when blk_run_queue() is invoked without any requests in progress.
As an example, if blk_drain_queue() and scsi_run_queue() run in
parallel, blk_drain_queue() may have finished all requests after
scsi_run_queue() has taken a SCSI device off the starved list but
before that last function has had a chance to run the queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Chanho Min <chanho.min@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:32:01 +01:00
Bart Van Assche 807592a4fa block: Let blk_drain_queue() caller obtain the queue lock
Let the caller of blk_drain_queue() obtain the queue lock to improve
readability of the patch called "Avoid that request_fn is invoked on
a dead queue".

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-06 14:30:59 +01:00