Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.
The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):
ffffffffb90c57b1 finish_task_switch
ffffffffb97dffb5 schedule
ffffffffb97e310c schedule_timeout
ffffffffb97e1f12 __down
ffffffffb90ea821 down
ffffffffc046a9dc xfs_buf_lock
ffffffffc046abfb _xfs_buf_find
ffffffffc046ae4a xfs_buf_get_map
ffffffffc046babd xfs_buf_read_map
ffffffffc0499931 xfs_trans_read_buf_map
ffffffffc044a561 xfs_da_read_buf
ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
ffffffffc0452b90 xfs_dir2_leaf_lookup_int
ffffffffc0452e0f xfs_dir2_leaf_lookup
ffffffffc044d9d3 xfs_dir_lookup
ffffffffc047d1d9 xfs_lookup
ffffffffc0479e53 xfs_vn_lookup
ffffffffb925347a path_openat
ffffffffb9254a71 do_filp_open
ffffffffb9242a94 do_sys_open
ffffffffb9242b9e sys_open
ffffffffb97e42b2 entry_SYSCALL_64_fastpath
00007fb0698162ed [unknown]
Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.
Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:
8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240
8,0 11 153 81.092472889 804 Q WM 141698288 + 8 [xfsaild/sda1]
8,0 11 154 81.092473207 804 G WM 141698288 + 8 [xfsaild/sda1]
8,0 11 206 81.092496118 804 I WM 141698288 + 8 ( 22911) [xfsaild/sda1]
<==== 'I' means Inserted (into the IO scheduler) ===================================>
8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]
<==== Only 15s later the CFQ scheduler dispatches the request ======================>
As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):
8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
8,0 1 0 81.117914755 0 m N cfq767A / resid=40
8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0
where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.
The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.
Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.
Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.
When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).
This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)
Care is taken during preempt to still allow rt requests to preempt us
regardless.
Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:
Latency histogram of xfs_buf_lock acquisition (microseconds):
value |-------------------------------------------------- count
0 | 11
1 |@@@@ 161
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
4 |@ 54
8 | 36
16 | 7
32 | 0
64 | 0
~
1024 | 0
2048 | 0
4096 | 1
8192 | 1
16384 | 2
32768 | 0
65536 | 0
131072 | 1
262144 | 0
524288 | 0
Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.
In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weigt cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.
Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.
cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e
IO_ROOT=/mnt-cgroup/io
mkdir -p $IO_ROOT
if ! mount | grep -qw $IO_ROOT; then
mount -t cgroup none -oblkio $IO_ROOT
fi
cd $IO_ROOT
for i in g1 g2 g3; do
if [ -d $i ]; then
rmdir $i
fi
done
mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight
RUNTIME=10
(echo $BASHPID > g1/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=0k &> /dev/null)&
(echo $BASHPID > g2/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=64k &> /dev/null)&
(echo $BASHPID > g3/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=256k &> /dev/null)&
sleep $((RUNTIME+1))
for i in g1 g2 g3; do
echo ---- $i ----
cat $i/blkio.time
done
EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686
After applying the patch:
# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 9a7f38c42c (cfq-iosched: Convert from jiffies to nanoseconds)
could result in charging just 1 ns to a cgroup submitting IO instead of 1
jiffie we always charged before. It is arguable what is the right amount
to change but for now lets retain the old behavior of always charging at
least one jiffie.
Fixes: 9a7f38c42c
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 9a7f38c42c (cfq-iosched: Convert from jiffies to nanoseconds)
broke the condition for detecting starved sync IO in
cfq_completed_request() because rq->start_time remained in jiffies but
we compared it with nanosecond values. This manifested as a regression
in bonnie++ rewrite performance because we always ended up considering
sync IO starved and thus never increased async IO queue depth.
Since rq->start_time is used in a lot of places, converting it to ns
values would be non-trivial. So just revert the condition in CFQ to use
comparison with jiffies. This will lead to suboptimal results if
cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are
relatively far from that with the storage used with CFQ (the default
value is 128 ms).
Fixes: 9a7f38c42c
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
slice_resid can be both positive and negative. Commit 9a7f38c42c
(cfq-iosched: Convert from jiffies to nanoseconds) converted it from
long to u64. Although this did not introduce any functional regression
(the operations just overflow and the result was fine), it is certainly
wrong and could cause issues in future. So convert the type to more
appropriate s64.
Fixes: 9a7f38c42c
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Expose interfaces to tune time slices of CFQ IO scheduler in
microseconds.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Convert all time-keeping in CFQ IO scheduler from jiffies to nanoseconds
so that we can later make the intervals more fine-grained than jiffies.
One jiffie is several miliseconds and even for today's rotating disks
that is a noticeable amount of time and thus we leave disk unnecessarily
idle.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the is_sync helpers to use separate variables
for the operation and flags.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The bio and request operation and flags are going to be separate
definitions, so we cannot pass them in as a bitmap. This patch
converts the blkg_rwstat code and its caller, cfq, to pass in the
values separately.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the elevator code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't allow sync workload of one cgroup to preempt sync
workload of any other cgroup. This is because we want to achieve service
separation between cgroups. However in cases where cgroup preempting is
ancestor of the current cgroup, there is no need of separation and
idling introduces unnecessary overhead. This hurts for example the case
when workload is isolated within a cgroup but journalling threads are in
root cgroup. Simple way to demostrate the issue is using:
dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make
difference more visible). When all processes are in the root cgroup,
reported throughput is 153.132 MB/sec. When dbench process gets its own
blkio cgroup, reported throughput drops to 26.1006 MB/sec.
Fix the problem by making check in cfq_should_preempt() more benevolent
and allow preemption by ancestor cgroup. This improves the throughput
reported by dbench4 to 48.9106 MB/sec.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The original idea with preemption of sync noidle queues (introduced in
commit 718eee0579 "cfq-iosched: fairness for sync no-idle queues") was
that we service all sync noidle queues together, we don't idle on any of
the queues individually and we idle only if there is no sync noidle
queue to be served. This intention also matches the original test:
if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
&& new_cfqq->service_tree == cfqq->service_tree)
return true;
However since at that time cfqq->service_tree was not set for idling
queues, this test was unreliable and was replaced in commit e4a229196a
"cfq-iosched: fix no-idle preemption logic" by:
if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
new_cfqq->service_tree->count == 1)
return true;
That was a reliable test but was actually doing something different -
now we preempt sync noidle queue only if the new queue is the only one
busy in the service tree.
These days cfq queue is kept in service tree even if it is idling and
thus the original check would be safe again. But since we actually check
that cfq queues are in the same cgroup, of the same priority class and
workload type (sync noidle), we know that new_cfqq is fine to preempt
cfqq. So just remove the service tree check.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Move check for preemption by rt class up. There is no functional change
but it makes arguing about conditions simpler since we can be sure both
cfq queues are from the same ioprio class.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no point in idling on a cfq group if the only cfq queue that is
there has too big thinktime.
Signed-off-by: Jan Kara <jack@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup_on_dfl() tests whether the cgroup's root is the default
hierarchy; however, an individual controller is only interested in
whether the controller is attached to the default hierarchy and never
tests a cgroup which doesn't belong to the hierarchy that the
controller is attached to.
This patch replaces cgroup_on_dfl() tests in controllers with faster
static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
the only user of cgroup_on_dfl() and the function is moved from the
header file to cgroup.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
cgroup is trying to make interface consistent across different
controllers. For weight based resource control, the knob should have
the range [1, 10000] and default to 100. This patch updates
cfq-iosched so that the weight range conforms. The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.
* blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
when blkcg is attached to a hierarchy.
* cfq_cpd_init() is updated to use the new default value on the
unified hierarchy.
* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
apply the default config matching the hierarchy type.
* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
now responsible for initializing the initial weights when blkcg is
enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg is gonna switch to cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too. The interface files are
inconsistent with a number of cloes duplicates. Some files have
recursive variants while others don't. There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface. The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface consisted of 27 files is replaced with the
following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>