Original patch from Nikanth Karthikesan <knikanth@suse.de>
When a queue exits the queue lock is taken and cfq_exit_queue() would free all
the cic's associated with the queue.
But when a task exits, cfq_exit_io_context() gets cic one by one and then
locks the associated queue to call __cfq_exit_single_io_context. It looks like
between getting a cic from the ioc and locking the queue, the queue might have
exited on another cpu.
Fix this by rechecking the cfq_io_context queue key inside the queue lock
again, and not calling into __cfq_exit_single_io_context() if somebody
beat us to it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In loop_unplug() function is expected that mapping is set
and lo->lo_backing_file is not NULL.
Unfortunately loop_set_fd() set the request queue unplug function,
but loop_clr_fd() doesn't clear that.
Loop device allows open of non-configured loop in some situations.
If the unplug on request queue is called, loop module oopses because
of missing lo_backing_file.
Simple reproducer:
losetup /dev/loop0 /xxx
losetup -d /dev/loop0
dmsetup create x --table "0 1 linear /dev/loop0 0"
EIP is at loop_unplug+0x1d/0x3b
...
Call Trace:
blk_unplug+0x57/0x5e
dm_table_unplug_all+0x34/0x77 [dm_mod]
destroy_inode+0x27/0x38
generic_delete_inode+0xd5/0xd9
iput+0x4b/0x4e
dm_resume+0xca/0xfe [dm_mod]
dev_suspend+0x143/0x165 [dm_mod]
dm_ctl_ioctl+0x18e/0x1cf [dm_mod]
dev_suspend+0x0/0x165 [dm_mod]
dm_ctl_ioctl+0x0/0x1cf [dm_mod]
vfs_ioctl+0x22/0x69
do_vfs_ioctl+0x39d/0x3c7
trace_hardirqs_on+0xb/0xd
remove_vma+0x50/0x56
do_munmap+0x21c/0x237
sys_ioctl+0x2c/0x45
sysenter_do_call+0x12/0x31
Several reports here
http://www.kerneloops.org/search.php?search=loop_unplug
Fix it by simply clear unplug function together with
removing of backing file.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When there are still queued bios and reference count
drops to zero, loop device must flush all queued bios.
Otherwise it can lead to situation that caller
closes the device, but some bios are still running
and endio() function call later OOpses when uses
unallocated mempool.
This happens for example when running dm-crypt over loop,
here is typical oops backtrace:
Oops: 0000 [#1] PREEMPT SMP
EIP is at mempool_free+0x12/0x6b
...
crypt_dec_pending+0x50/0x54 [dm_crypt]
crypt_endio+0x9f/0xa7 [dm_crypt]
crypt_endio+0x0/0xa7 [dm_crypt]
bio_endio+0x2b/0x2e
loop_thread+0x37a/0x3b1
do_lo_send_aops+0x0/0x165
autoremove_wake_function+0x0/0x33
loop_thread+0x0/0x3b1
kthread+0x3b/0x61
kthread+0x0/0x61
kernel_thread_helper+0x7/0x10
(But crash is reproducible with different dm targets
running over loop device too.)
Patch fixes it by flushing the bios in release call,
reusing the flush mechanism for switching backing store.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We have two seperate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
zero is invalid for max_phys_segments, max_hw_segments, and
max_segment_size. It's better to use use min_not_zero instead of
min. min() works though (because the commit 0e435ac makes sure that
these values are set to the default values, non zero, if a queue is
initialized properly).
With this patch, blk_queue_stack_limits does the almost same thing
that dm's combine_restrictions_low() does. I think that it's easy to
remove dm's combine_restrictions_low.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
disk_map_sector_rcu() returns a partition from a sector offset,
which we use for IO statistics on a per-partition basis. The
lookup itself is an O(N) list lookup, where N is the number of
partitions. This actually hurts performance quite a bit, even
on the lower end partitions. On higher numbered partitions,
it can get pretty bad.
Solve this by adding a one-hit cache for partition lookup.
This makes the lookup O(1) for the case where we do most IO to
one partition. Even for mixed partition workloads, amortized cost
is pretty close to O(1) since the natural IO batching makes the
one-hit cache last for lots of IOs.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This basically limits the hardware queue depth to 4*quantum at any
point in time, which is 16 with the default settings. As CFQ uses
other means to shrink the hardware queue when necessary in the first
place, there's really no need for this extra heuristic. Additionally,
it ends up hurting performance in some cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.
There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When we go and allocate a bio for IO, we actually do two allocations.
One for the bio itself, and one for the bi_io_vec that holds the
actual pages we are interested in.
This feature inlines a definable amount of io vecs inside the bio
itself, so we eliminate the bio_vec array allocation for IO's up
to a certain size. It defaults to 4 vecs, which is typically 16k
of IO.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Instead of having a global bio slab cache, add a reference to one
in each bio_set that is created. This allows for personalized slabs
in each bio_set, so that they can have bios of different sizes.
This means we can personalize the bios we return. File systems may
want to embed the bio inside another structure, to avoid allocation
more items (and stuffing them in ->bi_private) after the get a bio.
Or we may want to embed a number of bio_vecs directly at the end
of a bio, to avoid doing two allocations to return a bio. This is now
possible.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only very rarely need the mempool backing, so it makes sense to
get rid of all but one of the mempool in a bio_set. So keep the
largest bio_vec count mempool so we can always honor the largest
allocation, and "upgrade" callers that fail.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We just want to hand the first bits of IO to the device as fast
as possible. Gains a few percent on the IOPS rate.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Empty barrier on write-through (or no cache) w/ ordered tag has no
command to execute and without any command to execute ordered tag is
never issued to the device and the ordering is never achieved. Force
draining for such cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.
With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().
Remove blk_empty_barrier() test which now has no user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Barrier completion had the following assumptions.
* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.
* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.
Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.
* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().
* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.
* While at it, convert 0/1 return from blk_do_ordered() to false/true.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In all barrier sequences, the barrier write itself was always assumed
to be issued and thus didn't have corresponding control flag. This
patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in
start_ordered() such that any barrier action can be skipped.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Because barrier mode can be changed dynamically, whether barrier is
supported or not can be determined only when actually issuing the
barrier and there is no point in checking it earlier. Drop barrier
support check in generic_make_request() and __make_request(), and
update comment around the support check in blk_do_ordered().
* There is no reason to check discard support in both
generic_make_request() and __make_request(). Drop the check in
__make_request(). While at it, move error action block to the end
of the function and add unlikely() to q existence test.
* Barrier request, be it empty or not, is never passed to low level
driver and thus it's meaningless to try to copy back req->sector to
bio->bi_sector on error. In addition, the notion of failed sector
doesn't make any sense for empty barrier to begin with. Drop the
code block from __end_that_request_first().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Separate out ordering type (drain,) and action masks (preflush,
postflush, fua) from visible ordering mode selectors
(QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
while action masks are named QUEUE_ORDERED_DO_*.
This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
optional to improve empty barrier implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove 8 bytes of padding from struct bio which also removes 16 bytes from
struct bio_pair to make it 248 bytes. bio_pair then fits into one fewer
cache lines & into a smaller slab.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After many improvements on kblockd_flush_work, it is now identical to
cancel_work_sync, so a direct call to cancel_work_sync is suggested.
The only difference is that cancel_work_sync is a GPL symbol,
so no non-GPL modules anymore.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>