Commit Graph

1768 Commits

Author SHA1 Message Date
Linus Torvalds
eff0d13f38 Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:

 - Making the plugging support for drivers a bit more sane from Neil.
   This supersedes the plugging change from Shaohua as well.

 - The usual round of drbd updates.

 - Using a tail add instead of a head add in the request completion for
   ndb, making us find the most completed request more quickly.

 - A few floppy changes, getting rid of a duplicated flag and also
   running the floppy init async (since it takes forever in boot terms)
   from Andi.

* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
  floppy: remove duplicated flag FD_RAW_NEED_DISK
  blk: pass from_schedule to non-request unplug functions.
  block: stack unplug
  blk: centralize non-request unplug handling.
  md: remove plug_cnt feature of plugging.
  block/nbd: micro-optimization in nbd request completion
  drbd: announce FLUSH/FUA capability to upper layers
  drbd: fix max_bio_size to be unsigned
  drbd: flush drbd work queue before invalidate/invalidate remote
  drbd: fix potential access after free
  drbd: call local-io-error handler early
  drbd: do not reset rs_pending_cnt too early
  drbd: reset congestion information before reporting it in /proc/drbd
  drbd: report congestion if we are waiting for some userland callback
  drbd: differentiate between normal and forced detach
  drbd: cleanup, remove two unused global flags
  floppy: Run floppy initialization asynchronous
2012-08-01 09:06:47 -07:00
Linus Torvalds
8cf1a3fce0 Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-block
Pull core block IO bits from Jens Axboe:
 "The most complicated part if this is the request allocation rework by
  Tejun, which has been queued up for a long time and has been in
  for-next ditto as well.

  There are a few commits from yesterday and today, mostly trivial and
  obvious fixes.  So I'm pretty confident that it is sound.  It's also
  smaller than usual."

* 'for-3.6/core' of git://git.kernel.dk/linux-block:
  block: remove dead func declaration
  block: add partition resize function to blkpg ioctl
  block: uninitialized ioc->nr_tasks triggers WARN_ON
  block: do not artificially constrain max_sectors for stacking drivers
  blkcg: implement per-blkg request allocation
  block: prepare for multiple request_lists
  block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
  blkcg: inline bio_blkcg() and friends
  block: allocate io_context upfront
  block: refactor get_request[_wait]()
  block: drop custom queue draining used by scsi_transport_{iscsi|fc}
  mempool: add @gfp_mask to mempool_create_node()
  blkcg: make root blkcg allocation use %GFP_KERNEL
  blkcg: __blkg_lookup_create() doesn't need radix preload
2012-08-01 09:02:41 -07:00
Yuanhan Liu
80799fbb7d block: remove dead func declaration
__generic_unplug_device() function is removed with commit
7eaceaccab, which forgot to
remove the declaration at meantime. Here remove it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:25:54 +02:00
Vivek Goyal
c83f6bf98d block: add partition resize function to blkpg ioctl
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.

This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.

Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().

v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:24:18 +02:00
Olof Johansson
4638a83e86 block: uninitialized ioc->nr_tasks triggers WARN_ON
Hi,

I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):

[   10.886893] ------------[ cut here ]------------
[   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[   10.886905] Hardware name: Bochs
[   10.886906] Modules linked in:
[   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[   10.886908] Call Trace:
[   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
[   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
[   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
[   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
[   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[   10.886928] ---[ end trace 32a14af7ee6a590b ]---

Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:

dump -o -f /tmp/foo /

You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.

I bisected it down to the following commit:

commit f6e8d01bee
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks.  This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task.  This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.

Tejun, is the right thing here to add back the init, or should something else
be done?

The below patch removes the warning, but I haven't done any more extensive
testing on it.

Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 12:17:27 +02:00
Mike Snitzer
fe86cdcef7 block: do not artificially constrain max_sectors for stacking drivers
blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.

It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).

Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01 10:44:28 +02:00
NeilBrown
74018dc306 blk: pass from_schedule to non-request unplug functions.
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
Shaohua Li
2a7d5559b3 block: stack unplug
MD raid1 prepares to dispatch request in unplug callback. If make_request in
low level queue also uses unplug callback to dispatch request, the low level
queue's unplug callback will not be called. Recheck the callback list helps
this case.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
NeilBrown
9cbb175088 blk: centralize non-request unplug handling.
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:14 +02:00
Muthukumar Ratty
e81ca6fe85 [SCSI] block: Fix blk_execute_rq_nowait() dead queue handling
If the queue is dead blk_execute_rq_nowait() doesn't invoke the done()
callback function. That will result in blk_execute_rq() being stuck
in wait_for_completion(). Avoid this by initializing rq->end_io to the
done() callback before we check the queue state. Also, make sure the
queue lock is held around the invocation of the done() callback. Found
this through source code review.

Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-07-20 08:58:39 +01:00
Tejun Heo
a051661ca6 blkcg: implement per-blkg request allocation
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued.  When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.

This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup.  As soon as the
request pool is used up, the sequential IO bandwidth crashes.

This patch implements per-blkg request_list.  Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.

* Root blkcg uses the request_list embedded in each request_queue,
  which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
  handling a bit harier, this enables avoiding most overhead for root
  blkcg.

* Queue fullness is properly per request_list but bdi isn't blkcg
  aware yet, so congestion state currently just follows the root
  blkcg.  As writeback isn't aware of blkcg yet, this works okay for
  async congestion but readahead may get the wrong signals.  It's
  better than blkcg completely collapsing with shared request_list but
  needs to be improved with future changes.

* After this change, each block cgroup gets a full request pool making
  resource consumption of each cgroup higher.  This makes allowing
  non-root users to create cgroups less desirable; however, note that
  allowing non-root users to directly manage cgroups is already
  severely broken regardless of this patch - each block cgroup
  consumes kernel memory and skews IO weight (IO weights are not
  hierarchical).

v2: queue-sysfs.txt updated and patch description udpated as suggested
    by Vivek.

v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure.  Fix it
    by falling back to root_rl on blkg lookup failures.  This problem
    was spotted by Rakesh Iyer <rni@google.com>.

v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue".  blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 18:42:49 -04:00
Tejun Heo
5b788ce3e2 block: prepare for multiple request_lists
Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.

* Make queue full state per request_list.  blk_*queue_full() functions
  are renamed to blk_*rl_full() and takes @rl instead of @q.

* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
  instead of @q.  Also add @gfp_mask parameter.

* Add blk_exit_rl() instead of destroying rl directly from
  blk_release_queue().

* Add request_list->q and make request alloc/free functions -
  blk_free_request(), [__]freed_request(), __get_request() - take @rl
  instead of @q.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:52 +02:00
Tejun Heo
8a5ecdd428 block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
to use q->nr_rqs[] instead of q->rq.count[].

These counters separates queue-wide request statistics from the
request list and allow implementation of per-queue request allocation.

While at it, properly indent fields of struct request_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:51 +02:00
Tejun Heo
b1208b56f3 blkcg: inline bio_blkcg() and friends
Make bio_blkcg() and friends inline.  They all are very simple and
used only in few places.

This patch is to prepare for further updates to request allocation
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:50 +02:00
Tejun Heo
7f4b35d155 block: allocate io_context upfront
Block layer very lazy allocation of ioc.  It waits until the moment
ioc is absolutely necessary; unfortunately, that time could be inside
queue lock and __get_request() performs unlock - try alloc - retry
dancing.

Just allocate it up-front on entry to block layer.  We're not saving
the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.

This patch is to prepare for further updates to request allocation
path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:50 +02:00
Tejun Heo
a06e05e6af block: refactor get_request[_wait]()
Currently, there are two request allocation functions - get_request()
and get_request_wait().  The former tries to allocate a request once
and the latter keeps retrying until it succeeds.  The latter wraps the
former and keeps retrying until allocation succeeds.

The combination of two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation.  However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful and in fact nobody uses it.

This patch simplifies the interface as follows.

* get_request() is renamed to __get_request() and is only used by the
  wrapper function.

* get_request_wait() is renamed to get_request().  It now takes
  @gfp_mask and retries iff it contains %__GFP_WAIT.

This patch doesn't introduce any functional change and is to prepare
for further updates to request allocation path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:49 +02:00
Tejun Heo
86072d8112 block: drop custom queue draining used by scsi_transport_{iscsi|fc}
iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining.  fc_bsg_remove() open-codes mostly identical logic.

The draining logic isn't correct in that blk_stop_queue() doesn't
prevent new requests from being queued - it just stops processing, so
nothing prevents new requests to be queued after the logic determines
that the queue is drained.

blk_cleanup_queue() now implements proper queue draining and these
custom draining logics aren't necessary.  Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@emulex.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:48 +02:00
Tejun Heo
a91a5ac685 mempool: add @gfp_mask to mempool_create_node()
mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags - add @gfp_mask argument to the function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:47 +02:00
Tejun Heo
159749937a blkcg: make root blkcg allocation use %GFP_KERNEL
Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation.  This could make
policy fail unnecessarily.

Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
preload radix tree and preallocate blkg with %GFP_KERNEL before trying
to create the root blkg.

v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
   instead of ERR_PTR() value.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:46 +02:00
Tejun Heo
13589864be blkcg: __blkg_lookup_create() doesn't need radix preload
There's no point in calling radix_tree_preload() if preloading doesn't
use more permissible GFP mask.  Drop preloading from
__blkg_lookup_create().

While at it, drop sparse locking annotation which no longer applies.

v2: Vivek pointed out the odd preload usage.  Instead of updating,
    just drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-25 11:53:45 +02:00
Jan Kara
6d93592807 scsi: Silence unnecessary warnings about ioctl to partition
Sometimes, warnings about ioctls to partition happen often enough that they
form majority of the warnings in the kernel log and users complain. In some
cases warnings are about ioctls such as SG_IO so it's not good to get rid of
the warnings completely as they can ease debugging of userspace problems
when ioctl is refused.

Since I have seen warnings from lots of commands, including some proprietary
userspace applications, I don't think disallowing the ioctls for processes
with CAP_SYS_RAWIO will happen in the near future if ever. So lets just
stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: James Bottomley <JBottomley@parallels.com>
CC: linux-scsi@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 12:52:46 +02:00
Asias He
76aaa5101f block: Drop dead function blk_abort_queue()
This function was only used by btrfs code in btrfs_abort_devices()
(seems in a wrong way).

It was removed in commit d07eb91170,
So, Let's remove the dead code to avoid any confusion.

Changes in v2: update commit log, btrfs_abort_devices() was removed
already.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:46:23 +02:00
Asias He
5e5cfac0c6 block: Mitigate lock unbalance caused by lock switching
Commit 777eb1bf15 disconnects externally
supplied queue_lock before blk_drain_queue(). Switching the lock would
introduce lock unbalance because theads which have taken the external
lock might unlock the internal lock in the during the queue drain. This
patch mitigate this by disconnecting the lock after the queue draining
since queue draining makes a lot of request_queue users go away.

However, please note, this patch only makes the problem less likely to
happen. Anyone who still holds a ref might try to issue a new request on
a dead queue after the blk_cleanup_queue() finishes draining, the lock
unbalance might still happen in this case.

 =====================================
 [ BUG: bad unlock balance detected! ]
 3.4.0+ #288 Not tainted
 -------------------------------------
 fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
 [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
 but there are no more locks to release!

 other info that might help us debug this:
 1 lock held by fio/17706:
  #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
 get_request_wait+0x19a/0x250

 stack backtrace:
 Pid: 17706, comm: fio Not tainted 3.4.0+ #288
 Call Trace:
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
  [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
  [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
  [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
  [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
  [<ffffffff810e0079>] lock_release+0xd9/0x250
  [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
  [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
  [<ffffffff81328faa>] generic_make_request+0xca/0x100
  [<ffffffff81329056>] submit_bio+0x76/0xf0
  [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
  [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
  [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
  [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
  [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
  [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
  [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
  [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
  [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
  [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
  [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
  [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff811eae50>] sys_io_submit+0x10/0x20
  [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b

Changes since v2: Update commit log to explain how the code is still
                  broken even if we delay the lock switching after the drain.
Changes since v1: Update commit log as Tejun suggested.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:46:22 +02:00
Asias He
458f27a982 block: Avoid missed wakeup in request waitqueue
After hot-unplug a stressed disk, I found that rl->wait[] is not empty
while rl->count[] is empty and there are theads still sleeping on
get_request after the queue cleanup. With simple debug code, I found
there are exactly nr_sleep - nr_wakeup of theads in D state. So there
are missed wakeup.

  $ dmesg | grep nr_sleep
  [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
  $ vmstat 1
  1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0

To quote Tejun:

  Ah, okay, freed_request() wakes up single waiter with the assumption
  that after the wakeup there will at least be one successful allocation
  which in turn will continue the wakeup chain until the wait list is
  empty - ie. waiter wakeup is dependent on successful request
  allocation happening after each wakeup.  With queue marked dead, any
  woken up waiter fails the allocation path, so the wakeup chaining is
  lost and we're left with hung waiters. What we need is wake_up_all()
  after drain completion.

This patch fixes the missed wakeup by waking up all the theads which
are sleeping on wait queue after queue drain.

Changes in v2: Drop waitqueue_active() optimization

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>

Fixed a bug by me, where stacked devices would oops on calling
blk_drain_queue() since ->rq.wait[] do not get initialized unless
it's a full queue setup.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-15 08:45:25 +02:00
Tejun Heo
27e1f9d1cc blkcg: drop local variable @q from blkg_destroy()
blkg_destroy() caches @blkg->q in local variable @q.  While there are
two places which needs @blkg->q, only lockdep_assert_held() used the
local variable leading to unused local variable warning if lockdep is
configured out.  Drop the local variable and just use @blkg->q
directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rakesh Iyer <rni@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-06 08:35:31 +02:00