Effectively, make group_isolation=1 the default and remove the tunable.
The setting group_isolation=0 was because by default we idle on
sync-noidle tree and on fast devices, this can be very harmful for
throughput.
However, this problem can also be addressed by tuning slice_idle and
possibly group_idle on faster storage devices.
This change simplifies the CFQ code by removing the feature entirely.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Flush requests are never put on the IO scheduler. Convert request
structure's elevator_private* into an array and have the flush fields
share a union with it.
Reclaim the space lost in 'struct request' by moving 'completion_data'
back in the union with 'rb_node'.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Skip elevator initialization for flush requests by passing priv=0 to
blk_alloc_request() in get_request(). As such elv_set_request() is
never called for flush requests.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cdrom: support devices that have check_events but not media_changed
cfq-iosched: Don't wait if queue already has requests.
blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
cfq: rename a function to give it more appropriate name
cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y
loop: queue_lock NULL pointer derefence in blk_throtl_exit
drivers/block/Makefile: replace the use of <module>-objs with <module>-y
blktrace: Don't output messages if NOTIFY isn't set.
Commit 7667aa0630 added logic to wait for
the last queue of the group to become busy (have at least one request),
so that the group does not lose out for not being continuously
backlogged. The commit did not check for the condition that the last
queue already has some requests. As a result, if the queue already has
requests, wait_busy is set. Later on, cfq_select_queue() checks the
flag, and decides that since the queue has a request now and wait_busy
is set, the queue is expired. This results in early expiration of the
queue.
This patch fixes the problem by adding a check to see if queue already
has requests. If it does, wait_busy is not set. As a result, time slices
do not expire early.
The queues with more than one request are usually buffered writers.
Testing shows improvement in isolation between buffered writers.
Cc: stable@kernel.org
Signed-off-by: Justin TerAvest <teravest@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining. As such, sequencing is done
queue-wide one flush request after another. However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.
This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually. The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue. Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.
This allows arbitrary merging of different type of flushes. How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.
This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.
* As flush requests are never put on the IO scheduler, request fields
used for flush share space with rq->rb_node. rq->completion_data is
moved out of the union. This increases the request size by one
pointer.
As rq->elevator_private* are used only by the iosched too, it is
possible to reduce the request size further. However, to do that,
we need to modify request allocation path such that iosched data is
not allocated for flush requests.
* FLUSH/FUA processing happens on insertion now instead of dispatch.
- Comments updated as per Vivek and Mike.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bio's for flush are completed twice - once during the data phase and
one more time after the whole sequence is complete. The first
completion shouldn't notify completion to the issuer.
This was achieved by skipping all bio completion steps in
req_bio_endio() for the first completion; however, this has two
drawbacks.
* Error is not recorded in bio and must be tracked somewhere else.
* Partial completion is not supported.
Both don't cause problems for the current users; however, they make
further improvements difficult. Change req_bio_endio() such that it
only skips the actual notification part for the first completion. bio
completion is implemented with partial completions on mind anyway so
this is as simple as moving the REQ_FLUSH_SEQ conditional such that
only calling of bio_endio() is skipped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request. This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.
This patch doesn't cause any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.
This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel. A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).
Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Jeff Moyer was doing some testing on a RAM backed disk and
blkiocg_lookup_group() showed up high overhead after memcpy(). Similarly
somebody else reported that blkiocg_lookup_group() is eating 6% extra
cpu. Though looking at the code I can't think why the overhead of
this function is so high. One thing is that it is called with very high
frequency (once for every IO).
o For lot of folks blkio controller will be compiled in but they might
not have actually created cgroups. Hence optimize the case of root
cgroup where we can avoid calling blkiocg_lookup_group() if IO is happening
in root group (common case).
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Rename a function to give it more approprate name. We are calculating
cfq queue slice and function name gives the impression as if cfq group
slice length is being calculated.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If a queue is preempted before it gets slice assigned, the queue doesn't get
compensation, which looks unfair. For such queue, we compensate it for a whole
slice.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I got this:
fio-874 [007] 2157.724514: 8,32 m N cfq874 preempt
fio-874 [007] 2157.724519: 8,32 m N cfq830 slice expired t=1
fio-874 [007] 2157.724520: 8,32 m N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
fio-874 [007] 2157.724521: 8,32 m N cfq830 set_active wl_prio:0 wl_type:0
fio-874 [007] 2157.724522: 8,32 m N cfq830 Not idling. st->count:1
cfq830 is an async queue, and preempted by a sync queue cfq874. But since we
have cfqg->saved_workload_slice mechanism, the preempt is a nop.
Looks currently our preempt is totally broken if the two queues are not from
the same workload type.
Below patch fixes it. This will might make async queue starvation, but it's
what our old code does before cgroup is added.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
cfq_group->ref is used with queue_lock hold, the only exception is
cfq_set_request, which looks like a bug to me, so ref doesn't need
to be an atomic and atomic operation is slower.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
cfq_queue->ref is used with queue_lock hold, so ref doesn't need to be an atomic
and atomic operation is slower.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We can't use krefs since it's apparently restricted to very basic
reference counting.
This reverts commit e4a683c8.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
kblockd is used for unplugging and may affect IO latency and
throughput and the max number of concurrent work items are bound by
the number of block devices. Make it HIGHPRI workqueue w/ default max
concurrency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
This patch fixes a spelling error in a source code comment and removes
superfluous braces in the function exit_io_context().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>