cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.
* Dynamic addition and removal of cftypes to make config/stat file
handling modular for policies.
* cgroup removal update to not wait for css references to drain to fix
blkcg removal hang caused by cfq caching cfqgs.
Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
following conflicts in block/blk-cgroup.c.
* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
removal. Resolved by removing @subsys from all subsys methods.
* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
controllers" conflicts with ->pre_destroy() and ->attach() updates
and removal of modular config. Resolved by dropping forward
declarations of the methods and applying updates to the relocated
blkio_subsys.
* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
cftype interface" builds upon the previous item. Resolved by adding
->base_cftypes to the relocated blkio_subsys.
Signed-off-by: Tejun Heo <tj@kernel.org>
Make the following interface updates to prepare for future ioc related
changes.
* create_io_context() returning ioc only works for %current because it
doesn't increment ref on the ioc. Drop @task parameter from it and
always assume %current.
* Make create_io_context_slowpath() return 0 or -errno and rename it
to create_task_io_context().
* Make ioc_create_icq() take @ioc as parameter instead of assuming
that of %current. The caller, get_request(), is updated to create
ioc explicitly and then pass it into ioc_create_icq().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently block core calls directly into blk-throttle for init, drain
and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions. This is to give more control and
visiblity to blkcg core layer for proper layering. Further patches
will add logic common to blkcg policies to the functions.
While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.
This will be further extended and used for blkio_group management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.
elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.
This is to prepare for plug merge update and doesn't introduce any
behavior change.
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer has some code trying to determine if two CPUs share a
cache, the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layers usage of CONFIG_SCHED* and topology bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins
Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.
* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().
* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect of queue kicking
achieved was possibily accelerating queue processing, this change
shouldn't be noticeable.
There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icq is unbound and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.
This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.
However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.
* Comment added to explain how icq's are managed and synchronized.
This completes cleanup of io_context interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.
Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.
Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to. The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.
Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.
This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path. This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.
Given the restriction, accessor + creator rolled into one doesn't work
too well. Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().
Future ioc updates will further consolidate ioc access and the create
interface will be unexported.
While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.
This patch does not introduce functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.
* Make sure the caller checks for the return value.
* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio(). The former is
always called from local task while the latter can be called from
different task. The synchornization between them are peculiar and
dubious.
* current_io_context() doesn't grab task_lock() and assumes that if it
saw %NULL ->io_context, it would stay that way until allocation and
assignment is complete. It has smp_wmb() between alloc/init and
assignment.
* set_task_ioprio() grabs task_lock() for assignment and does
smp_read_barrier_depends() between "ioc = task->io_context" and "if
(ioc)". Unfortunately, this doesn't achieve anything - the latter
is not a dependent load of the former. ie, if ioc itself were being
dereferenced "ioc->xxx", it would mean something (not sure what tho)
but as the code currently stands, the dependent read barrier is
noop.
As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.
Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.
ioc get/put doesn't have any reason to be complex. The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive. All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.
This patch updates ioc get/put so that it becomes more conventional.
* alloc_io_context() is replaced with get_task_io_context(). This is
the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which
should be put using put_io_context() afterwards.
* The functionality of current_io_context() remains the same but when
creating a new ioc, it shares the code path with
get_task_io_context() and always goes through task_lock().
* get_io_context() now means incrementing ref on an ioc which the
caller already has access to (be that an explicit refcnt or implicit
%current one).
* PF_EXITING inhibits creation of new io_context and once
exit_io_context() is finished, it's guaranteed that both ioc
acquisition functions return %NULL.
* All users are updated. Most are trivial but
smp_read_barrier_depends() removal from cfq_get_io_context() needs a
bit of explanation. I suppose the original intention was to ensure
ioc->ioprio is visible when set_task_ioprio() allocates new
io_context and installs it; however, this wouldn't have worked
because set_task_ioprio() doesn't have wmb between init and install.
There are other problems with this which will be fixed in another
patch.
* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
specification.
-v2: Vivek spotted contamination from debug patch. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context. Move it to q->id and allocate on queue init and
free on queue release. This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
macro and use it.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
* throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
and drops queue_lock in the latter case. Different locking context
depending on return value is error-prone and DEAD state is scheduled
to be protected by queue_lock anyway. Move DEAD check inside
queue_lock and return valid tg or NULL.
* blk_throtl_bio() indicates return status both with its return value
and in/out param **@bio. The former is used to indicate whether
queue is found to be dead during throtl processing. The latter
whether the bio is throttled.
There's no point in returning DEAD check result from
blk_throtl_bio(). The queue can die after blk_throtl_bio() is
finished but before make_request_fn() grabs queue lock.
Make it take *@bio instead and return boolean result indicating
whether the request is throttled or not.
This patch doesn't cause any visible functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reorganize queue draining related code in preparation of queue exit
changes.
* Factor out actual draining from elv_quiesce_start() to
blk_drain_queue().
* Make elv_quiesce_start/end() responsible for their own locking.
* Replace open-coded ELVSWITCH clearing in elevator_switch() with
elv_quiesce_end().
This patch doesn't cause any visible functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_throtl interface is block internal and there's no reason to have
them in linux/blkdev.h. Move them to block/blk.h.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit ae1b153962, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.
Also fixes up conflicts in the below files.
Conflicts:
drivers/block/paride/pcd.c
drivers/cdrom/viocd.c
drivers/ide/ide-cd.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing. __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.
This was reported as causing an oops attributable to scsi.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In some drives, flush requests are non-queueable. When flush request is
running, normal read/write requests can't run. If block layer dispatches
such request, driver can't handle it and requeue it. Tejun suggested we
can hold the queue when flush is running. This can avoid unnecessary
requeue. Also this can improve performance. For example, we have
request flush1, write1, flush 2. flush1 is dispatched, then queue is
hold, write1 isn't inserted to queue. After flush1 is finished, flush2
will be dispatched. Since disk cache is already clean, flush2 will be
finished very soon, so looks like flush2 is folded to flush1.
In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0d:
block: make the flush insertion use the tail of the dispatch list
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
which causes about 20% regression running a sysbench fileio
workload.
Stable: 2.6.39 only
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>