The core branch is behind driver commits that we want to build
on for 3.5, hence I'm pulling in a later -rc.
Linux 3.4-rc5
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Request allocation is mempool backed to guarantee forward progress
under memory pressure; unfortunately, this property got broken while
adding elvpriv data. Failures during elvpriv allocation, including
ioc and icq creation failures, currently make get_request() fail as
whole. There's no forward progress guarantee for these allocations -
they may fail indefinitely under memory pressure stalling IO and
deadlocking the system.
This patch updates get_request() such that elvpriv allocation failure
doesn't make the whole function fail. If elvpriv allocation fails,
the allocation is degraded into !ELVPRIV. This will force the request
to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
failures should be rare (nothing is per-request) and anything is
better than deadlocking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocation failure handling in get_request() is about to be updated.
To ease the update, collapse blk_alloc_request() into get_request().
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the previous change to guarantee bypass visiblity for RCU read
lock regions, entering bypass mode involves non-trivial overhead and
future changes are scheduled to make use of bypass mode during init
path. Combined it may end up adding noticeable delay during boot.
This patch makes request_queue start its life in bypass mode, which is
ended on queue init completion at the end of
blk_init_allocated_queue(), and updates blk_queue_bypass_start() such
that draining and RCU synchronization are performed only when the
queue actually enters bypass mode.
This avoids unnecessarily switching in and out of bypass mode during
init avoiding the overhead and any nasty surprises which may step from
leaving bypass mode on half-initialized queues.
The boot time overhead was pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg_lookup() doesn't check @q bypass state. This patch
updates blk_queue_bypass_start() to do synchronize_rcu() before
returning and updates blkg_lookup() to check blk_queue_bypass() and
return %NULL if bypassing. This ensures blkg_lookup() returns %NULL
if @q is bypassing.
This is to guarantee that nobody is accessing policy data while @q is
bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in
place on policy [de]activation.
v2: Added more comments explaining bypass guarantees as suggested by
Vivek.
v3: Added more comments explaining why there's no synchronize_rcu() in
blk_cleanup_queue() as suggested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do auto block plug flush to reduce latency, the threshold is 16
requests. This works well if the task is accessing one or two drives.
The problem is if the task is accessing a raid 0 device and the raid
disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have
heavy lock contention.
This patch makes the threshold per-disk based. The latency should be
still ok accessing one or two drives. The setup with application
accessing a lot of drives in the meantime uaually is big machine,
avoiding lock contention is more important, because any contention
will actually increase latency.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should use the GFP flags that the caller specified instead of picking
our own. All the callers specify GFP_KERNEL so this doesn't make a
difference to how the kernel runs, it's just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current. Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.
For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup. This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.
Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs). As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.
This patch implements bio_associate_current() which associates the
specified bio with %current. The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio. bio
release puts the associated ioc and blkcg.
It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.
elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.
This doesn't update block cgroup policies yet. Further patches will
implement the support.
-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
rq_ioc() to fix build breakage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the following interface updates to prepare for future ioc related
changes.
* create_io_context() returning ioc only works for %current because it
doesn't increment ref on the ioc. Drop @task parameter from it and
always assume %current.
* Make create_io_context_slowpath() return 0 or -errno and rename it
to create_task_io_context().
* Make ioc_create_icq() take @ioc as parameter instead of assuming
that of %current. The caller, get_request(), is updated to create
ioc explicitly and then pass it into ioc_create_icq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
get_request() is structured a bit unusually in that failure path is
inlined in the usual flow with goto labels atop and inside it.
Relocate the error path to the end of the function.
This is to prepare for icq handling changes in get_request() and
doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkg is per cgroup-queue-policy combination. This is
unnatural and leads to various convolutions in partially used
duplicate fields in blkg, config / stat access, and general management
of blkgs.
This patch make blkg's per cgroup-queue and let them serve all
policies. blkgs are now created and destroyed by blkcg core proper.
This will allow further consolidation of common management logic into
blkcg core and API with better defined semantics and layering.
As a transitional step to untangle blkg management, elvswitch and
policy [de]registration, all blkgs except the root blkg are being shot
down during elvswitch and bypass. This patch adds blkg_root_update()
to update root blkg in place on policy change. This is hacky and racy
but should be good enough as interim step until we get locking
simplified and switch over to proper in-place update for all blkgs.
-v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
comment wasn't updated according to the function change. Fixed.
Both pointed out by Vivek.
-v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
all policies. This freed root pd during elvswitch before the
last queue finished exiting and led to oops. Directly invoke
update_root_blkg_pd() only on BLKIO_POLICY_PROP from
cfq_exit_queue(). This also is closer to what will be done with
proper in-place blkg update. Reported by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, specific policy implementations are responsible for
maintaining list and number of blkgs. This duplicates code
unnecessarily, and hinders factoring common code and providing blkcg
API with better defined semantics.
After this patch, request_queue hosts list heads and counters and blkg
has list nodes for both policies. This patch only relocates the
necessary fields and the next patch will actually move management code
into blkcg core.
Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
have 2 elements. This is to avoid include dependency and will be
removed by the next patch.
This patch doesn't introduce any behavior change.
-v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
as pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently block core calls directly into blk-throttle for init, drain
and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions. This is to give more control and
visiblity to blkcg core layer for proper layering. Further patches
will add logic common to blkcg policies to the functions.
While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For root blkg, blk_throtl_init() was using throtl_alloc_tg()
explicitly and cfq_init_queue() was manually initializing embedded
cfqd->root_group, adding unnecessarily different code paths to blkg
handling.
Make both use the usual blkio_group get functions - throtl_get_tg()
and cfq_get_cfqg() - for the root blkio_group too. Note that
blk_throtl_init() callsite is pushed downwards in
blk_alloc_queue_node() so that @q is sufficiently initialized for
throtl_get_tg().
This simplifies root blkg handling noticeably for cfq and will allow
further modularization of blkcg API.
-v2: Vivek pointed out that using cfq_get_cfqg() won't work if
CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out
initialization of base part of cfqg into cfq_init_cfqg_base() and
alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extend queue bypassing such that dying queue is always bypassing and
blk-throttle is drained on bypass. With blkcg policies updated to
test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
no bio or request is held by or going through blkcg policies on a
bypassing queue.
This will be used to implement blkg cleanup on elevator switches and
policy changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.
This will be further extended and used for blkio_group management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
q->elevator doesn't exist; however, bio based drivers don't have
elevator initialized but can still use blk-throttle. This patch moves
q->elevator test inside blk_drain_queue() such that only
elv_drain_elevator() is skipped if !q->elevator.
-v2: loop can have registered queue which has NULL request_fn. Make
sure we don't call into __blk_run_queue() in such cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Fold in bug fix from Vivek.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion. Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.
elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.
This is to prepare for plug merge update and doesn't introduce any
behavior change.
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
While probing, fd sets up queue, probes hardware and tears down the
queue if probing fails. In the process, blk_drain_queue() kicks the
queue which failed to finish initialization and fd is unhappy about
that.
floppy0: no floppy controllers found
------------[ cut here ]------------
WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
Hardware name: To Be Filled By O.E.M.
VFS: do_fd_request called on non-open device
Modules linked in:
Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
Call Trace:
[<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
[<ffffffff813d657f>] do_fd_request+0xbf/0xd0
[<ffffffff81322b95>] blk_drain_queue+0x65/0x80
[<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
[<ffffffff818a809d>] floppy_init+0xdeb/0xe28
[<ffffffff818a72b2>] ? daring+0x6b/0x6b
[<ffffffff810002af>] do_one_initcall+0x3f/0x170
[<ffffffff81884b34>] kernel_init+0x9d/0x11e
[<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
[<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
[<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
[<ffffffff815dbb10>] ? gs_change+0xb/0xb
Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
something on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.
* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().
* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect of queue kicking
achieved was possibily accelerating queue processing, this change
shouldn't be noticeable.
There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icq is unbound and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.
This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.
However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.
* Comment added to explain how icq's are managed and synchronized.
This completes cleanup of io_context interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.
* Move cfqd->icq_list to request_queue->icq_list
* Make request explicitly point to icq instead of through elevator
private data. ->elevator_private[3] is replaced with sub struct elv
which contains icq pointer and priv[2]. cfq is updated accordingly.
* Meaningless clearing of ->elevator_private[0] removed from
elv_set_request(). At that point in code, the field was guaranteed
to be %NULL anyway.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>