We often lookup the same queue many times in succession, so cache
the last looked up queue to avoid browsing the rbtree.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfq hash is no more necessary. We always can get cfqq from io context.
cfq_get_io_context_noalloc() function is introduced, because we don't
want to allocate cic on merging and checking may_queue. In order to
identify sync queue we've used hash key = CFQ_KEY_ASYNC. Since hash is
eliminated we need to use other criterion: sync flag for queue is added.
In all places where we dig in rb_tree we're in current context, so no
additional locking is required.
Advantages of this patch: no additional memory for hash, no seeking in
hash, code is cleaner. But it is necessary now to seek cic in per-ioc
rbtree, but it is faster:
- most processes work only with few devices
- most systems have only few block devices
- it is a rb-tree
Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Changes by me:
- Merge into CFQ devel branch
- Get rid of cfq_get_io_context_noalloc()
- Fix various bugs with dereferencing cic->cfqq[] with offset other
than 0 or 1.
- Fix bug in cfqq setup, is_sync condition was reversed.
- Fix bug where only bio_sync() is used, we need to check for a READ too
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For tagged devices, allow overlap of requests if the idle window
isn't enabled on the current active queue.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's only used for preemption now that the IDLE and RT queues also
use the rbtree. If we pass an 'add_front' variable to
cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
at the front of the tree.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently CFQ does a linked insert into the current list for RT
queues. We can just factor the class into the rb insertion,
and then we don't have to treat RT queues in a special way. It's
faster, too.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For cases where the rbtree is mainly used for sorting and min retrieval,
a nice speedup of the rbtree code is to maintain a cache of the leftmost
node in the tree.
Also spotted in the CFS CPU scheduler code.
Improved by Alan D. Brunelle <Alan.Brunelle@hp.com> by updating the
leftmost hint in cfq_rb_first() if it isn't set, instead of only
updating it on insert.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Drawing on some inspiration from the CFS CPU scheduler design, overhaul
the pending cfq_queue concept list management. Currently CFQ uses a
doubly linked list per priority level for sorting and service uses.
Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
to service them.
This unfortunately means that the ionice levels aren't as strong
anymore, will work on improving those later. We only scale the slice
time now, not the number of times we service. This means that latency
is better (for all priority levels), but that the distinction between
the highest and lower levels aren't as big.
The diffstat speaks for itself.
cfq-iosched.c | 363 +++++++++++++++++---------------------------------
1 file changed, 125 insertions(+), 238 deletions(-)
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- Move the queue_new flag clear to when the queue is selected
- Only select the non-first queue in cfq_get_best_queue(), if there's
a substantial difference between the best and first.
- Get rid of ->busy_rr
- Only select a close cooperator, if the current queue is known to take
a while to "think".
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- Implement logic for detecting cooperating processes, so we
choose the best available queue whenever possible.
- Improve residual slice time accounting.
- Remove dead code: we no longer see async requests coming in on
sync queues. That part was removed a long time ago. That means
that we can also remove the difference between cfq_cfqq_sync()
and cfq_cfqq_class_sync(), they are now indentical. And we can
kill the on_dispatch array, just make it a counter.
- Allow a process to go into the current list, if it hasn't been
serviced in this scheduler tick yet.
Possible future improvements including caching the cfqq lookup
in cfq_close_cooperator(), so we don't have to look it up twice.
cfq_get_best_queue() should just use that last decision instead
of doing it again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When testing the syslet async io approach, I discovered that CFQ
sometimes didn't perform as well as expected. cfq_should_preempt()
needs to better check for cooperating tasks, so fix that by allowing
preemption of an equal priority queue if the recently queued request
is as good a candidate for IO as the one we are currently waiting for.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's a really rare and obscure bug in CFQ, that causes a crash in
cfq_dispatch_insert() due to rq == NULL. One example of the resulting
oops is seen here:
http://lkml.org/lkml/2007/4/15/41
Neil correctly diagnosed the situation for how this can happen: if two
concurrent requests with the exact same sector number (due to direct IO
or aliasing between MD and the raw device access), the alias handling
will add the request to the sortlist, but next_rq remains NULL.
Read the more complete analysis at:
http://lkml.org/lkml/2007/4/25/57
This looks like it requires md to trigger, even though it should
potentially be possible to due with O_DIRECT (at least if you edit the
kernel and doctor some of the unplug calls).
The fix is to move the ->next_rq update to when we add a request to the
rbtree. Then we remove the possibility for a request to exist in the
rbtree code, but not have ->next_rq correctly updated.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a 10-15% performance regression for sequential writes on TCQ/NCQ
enabled drives in 2.6.21-rcX after the CFQ update went in. It has been
reported by Valerie Clement <valerie.clement@bull.net> and the Intel
testing folks. The regression is because of CFQ's now more aggressive
queue control, limiting the depth available to the device.
This patches fixes that regression by allowing a greater depth when only
one queue is busy. It has been tested to not impact sync-vs-async
workloads too much - we still do a lot better than 2.6.20.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (as857) modifies the SG_GET_RESERVED_SIZE and
SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
the device's request_queue's max_sectors value. This will permit
cdrecord to obtain a legal value for the maximum transfer length,
fixing Bugzilla #7026.
The patch also caps the initial reserved_size value. There's no
reason to have a reserved buffer larger than max_sectors, since it
would be impossible to use the extra space.
The corresponding ioctls in the block layer are modified similarly,
and the initial value for the reserved_size is set as large as
possible. This will effectively make it default to max_sectors.
Note that the actual value is meaningless anyway, since block devices
don't have a reserved buffer.
Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
uniform way for users to determine the actual max_sectors value for
any raw SCSI transport.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Douglas Gilbert <dougg@torque.net>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Revert all this. It can cause device-mapper to receive a different major from
earlier kernels and it turns out that the Amanda backup program (via GNU tar,
apparently) checks major numbers on files when performing incremental backups.
Which is a bit broken of Amanda (or tar), but this feature isn't important
enough to justify the churn.
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
machines in the bootlog:
io scheduler noop registered<6>Time: jiffies clocksource has been installed.
io scheduler deadline registered (default)
Looking at block/elevator.c, it appears that elv_register() uses two
consecutive printks in a non-atomic way, leading to the above glitch. The
attached trivial patch fixes this issue, by using a single printk.
Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>