Commit Graph

662396 Commits

Author SHA1 Message Date
Shaohua Li 7394e31fa4 blk-throttle: make bandwidth change smooth
When cgroups all reach low limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and even some
cgroups could dispatch less IO than their low limit. For example, cg1
low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is
120M/s for the workload. Their bps could something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach low limit, so they can dispatch more IO later.
Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At
T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO
than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to
LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2
will have bandwidth below its low limit at most time.

The big problem here is we don't know the maximum bandwidth of the
workload, so we can't make smart decision to avoid the situation. This
patch makes cgroup bandwidth change smooth. After disk upgrades from
LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto
their max limit immediately. Their bandwidth limit will be increased
gradually to avoid above situation. So above example will became
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way cgroups bandwidth will be above their limit in majority
time, this still doesn't fully utilize disk bandwidth, but that's
something we pay for sharing.

Scale up is linear. The limit scales up 1/2 .low limit every
throtl_slice after upgrade. The scale up will stop if the adjusted limit
hits .max limit. Scale down is exponential. We cut the scale value half
if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then
fully downgrade the queue to LIMIT_LOW state.

Note this doesn't completely avoid cgroup running under its low limit.
The best way to guarantee cgroup doesn't run under its limit is to set
max limit. For example, if we set cg1 max limit to 40, cg2 will never
run under its low limit.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li aec242468c blk-throttle: detect completed idle cgroup
cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to higher level and all cgroups will be
throttled to their lower limit, so we waste bandwidth. Detecting idle
cgroup is hard. This patch handles a simple case, a cgroup doesn't
dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
the bandwidth.

Please note this will be replaced with a more sophisticated algorithm
later, but this demonstrates the idea how we handle idle cgroups, so I
leave it here.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li d61fcfa4bb blk-throttle: choose a small throtl_slice for SSD
The throtl_slice is 100ms by default. This is a long time for SSD, a lot
of IO can run. To make cgroups have smoother throughput, we choose a
small value (20ms) for SSD.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 297e3d8547 blk-throttle: make throtl_slice tunable
throtl_slice is important for blk-throttling. It's called slice
internally but it really is a time window blk-throttling samples data.
blk-throttling will make decision based on the samplings. An example is
bandwidth measurement. A cgroup's bandwidth is measured in the time
interval of throtl_slice.

A small throtl_slice meanse cgroups have smoother throughput but burn
more CPUs. It has 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.

Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 06cceedcca blk-throttle: make sure expire time isn't too big
cgroup could be throttled to a limit but when all cgroups cross high
limit, queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttle and other cgroups don't dispatch IO any more. In this case,
nobody can trigger current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wakeup the cgroup if other cgroups are
idle or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Making cgroup sleep time not too big wouldn't change cgroup
bps/iops, but could make it wakeup more frequently, which isn't a big
issue because throtl_slice * 8 is already quite big.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 3f0abd8066 blk-throttle: add downgrade logic
When queue state machine is in LIMIT_MAX state, but a cgroup is below
its low limit for some time, the queue should be downgraded to lower
state as one cgroup's low limit isn't met.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li c79892c557 blk-throttle: add upgrade logic for LIMIT_LOW state
When queue is in LIMIT_LOW state and all cgroups with low limit cross
the bps/iops limitation, we will upgrade queue's state to
LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if
the cgroup has pending request. Since cgroup is throttled according to
the limit, pending request means the cgroup reaches the limit.

If a cgroup has limit set for both read and write, we consider the
combination of them for upgrade. The reason is read IO and write IO can
interfere with each other. If we do the upgrade based in one direction
IO, the other direction IO could be severly harmed.

For a cgroup hierarchy, there are two cases. Children has lower low
limit than parent. Parent's low limit is meaningless. If children's
bps/iops cross low limit, we can upgrade queue state. The other case is
children has higher low limit than parent. Children's low limit is
meaningless. As long as parent's bps/iops (which is a sum of childrens
bps/iops) cross low limit, we can upgrade queue state.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li b22c417c88 blk-throttle: configure bps/iops limit for cgroup in low limit
each queue will have a state machine. Initially queue is in LIMIT_LOW
state, which means all cgroups will be throttled according to their low
limit. After all cgroups with low limit cross the limit, the queue state
gets upgraded to LIMIT_MAX state.
For max limit, cgroup will use the limit configured by user.
For low limit, cgroup will use the minimal value between low limit and
max limit configured by user. If the minimal value is 0, which means the
cgroup doesn't configure low limit, we will use max limit to throttle
the cgroup and the cgroup is ready to upgrade to LIMIT_MAX

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li cd5ab1b0fc blk-throttle: add .low interface
Add low limit for cgroup and corresponding cgroup interface. To be
consistent with memcg, we allow users configure .low limit higher than
.max limit. But the internal logic always assumes .low limit is lower
than .max limit. So we add extra bps/iops_conf fields in throtl_grp for
userspace configuration. Old bps/iops fields in throtl_grp will be the
actual limit we use for throttling.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 327ffb9b37 blk-throttle: add configure option for new .low interface
As discussed in LSF, add configure option for the interface and mark it
as experimental, so people can try/test.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 9f626e372a blk-throttle: prepare support multiple limits
We are going to support low/max limit, each cgroup will have 2 limits
after that. This patch prepares for the multiple limits change.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Shaohua Li 2ab5492de5 blk-throttle: use U64_MAX/UINT_MAX to replace -1
clean up the code to avoid using -1

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Eric Biggers 869ab90f0a block: constify struct blk_integrity_profile
blk_integrity_profile's are never modified, so mark them 'const' so that
they are placed in .rodata and benefit from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 20:34:39 -06:00
Eric Biggers 1be7d2073e block: remove outdated part of blkdev_issue_flush() comment
blkdev_issue_flush() is now always synchronous, and it no longer has a
flags argument.  So remove the part of the comment about the WAIT flag.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 15:41:30 -06:00
Eric Biggers e554911c23 block: correct documentation for blkdev_issue_discard() flags
BLKDEV_IFL_* flags no longer exist; blkdev_issue_discard() now actually
takes BLKDEV_DISCARD_* flags.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 15:41:28 -06:00
Geliang Tang 0a3ef9db64 mg_disk: use setup_timer
Use setup_timer() instead of init_timer() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 08:22:09 -06:00
Geliang Tang 68219bdf5f block: floppy: use setup_timer
Use setup_timer() instead of init_timer() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 08:22:07 -06:00
Dan Carpenter 7a88fa1919 block: make nr_iovecs unsigned in bio_alloc_bioset()
There isn't a bug here, but Smatch is not smart enough to know that
"nr_iovecs" can't be negative so it complains about underflows.
Really, it's slightly cleaner to make this parameter unsigned.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-23 08:16:11 -06:00
Christoph Hellwig a4d907b6a3 blk-mq: streamline blk_mq_make_request
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:17:03 -06:00
Christoph Hellwig 2299722c4b blk-mq: split the plug and sync cases in blk_mq_make_request
Now that we have a nice direct issue heper this helps simplifying
the code a bit, and also gets rid of the old_rq variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:17:01 -06:00
Christoph Hellwig 5eb6126e1c blk-mq: improve blk_mq_try_issue_directly
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:59 -06:00
Christoph Hellwig 254d259da0 blk-mq: merge mq and sq make_request instances
They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:58 -06:00
Christoph Hellwig 7642747d67 blk-mq: remove BLK_MQ_F_DEFER_ISSUE
This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:56 -06:00
Jan Kara d01b2dcb44 block: Fix oops scsi_disk_get()
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:11:37 -06:00
Jan Kara c70c176ff8 kobject: Export kobject_get_unless_zero()
Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:11:35 -06:00