blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:
pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad
In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.
A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.
Do struct blkcg_gq allocations outside the queue spinlock to allow
blocking during memory allocations.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
I inadvertently applied the v5 version of this patch, whereas
the agreed upon version was v5. Revert this one so we can apply
the right one.
This reverts commit 7fc6b87a9f.
blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:
pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad
In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.
A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.
Update blkg_create() function to temporarily drop the rcu and queue
locks when it is allowed by gfp mask.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
If blkg_create fails, new_blkg passed as an argument will
be freed by blkg_create, so there is no need to free it again.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
There's no potential harm in quiescing the queue, but it also doesn't
buy us anything. And we can't run the queue async for policy
deactivate, since we could be in the path of tearing the queue down.
If we schedule an async run of the queue at that time, we're racing
with queue teardown AFTER having we've already torn most of it down.
Reported-by: Omar Sandoval <osandov@fb.com>
Fixes: 4d199c6f1c ("blk-cgroup: ensure that we clear the stop bit on quiesced queues")
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we call blk_mq_quiesce_queue() on a queue, we must remember to
pair that with something that clears the stopped by on the
queues later on.
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.
We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.
We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
when that fails falls back to operations which aren't specific to the
cgroup. Occassional failures are expected under pressure and falling
back to non-cgroup operation is the right thing to do.
Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
these expected failures end up creating a lot of noise. Add
__GFP_NOWARN.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marc MERLIN <marc@merlins.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
avoids that smatch reports the following error:
block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
Fixes: 06b285bd11 ("blkcg: fix blkcg_policy_data allocation bug")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
get_disk(),get_gendisk() calls have non explicit side effect: they
increase the reference on the disk owner module.
The following is the correct sequence how to get a disk reference and
to put it:
disk = get_gendisk(...);
/* use disk */
owner = disk->fops->owner;
put_disk(disk);
module_put(owner);
fs/block_dev.c is aware of this required module_put() call, but f.e.
blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
a module reference. To see a leakage in action cgroups throttle config
can be used. In the following script I'm removing throttle for /dev/ram0
(actually this is NOP, because throttle was never set for this device):
# lsmod | grep brd
brd 5175 0
# i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
/sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
done
# lsmod | grep brd
brd 5175 100
Now brd module has 100 references.
The issue is fixed by calling module_put() just right away put_disk().
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Pull cgroup updates from Tejun Heo:
"The cgroup core saw several significant updates this cycle:
- percpu_rwsem for threadgroup locking is reinstated. This was
temporarily dropped due to down_write latency issues. Oleg's
rework of percpu_rwsem which is scheduled to be merged in this
merge window resolves the issue.
- On the v2 hierarchy, when controllers are enabled and disabled, all
operations are atomic and can fail and revert cleanly. This allows
->can_attach() failure which is necessary for cpu RT slices.
- Tasks now stay associated with the original cgroups after exit
until released. This allows tracking resources held by zombies
(e.g. pids) and makes it easy to find out where zombies came from
on the v2 hierarchy. The pids controller was broken before these
changes as zombies escaped the limits; unfortunately, updating this
behavior required too many invasive changes and I don't think it's
a good idea to backport them, so the pids controller on 4.3, the
first version which included the pids controller, will stay broken
at least until I'm sure about the cgroup core changes.
- Optimization of a couple common tests using static_key"
* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
cgroup: fix race condition around termination check in css_task_iter_next()
blkcg: don't create "io.stat" on the root cgroup
cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
cgroup: replace error handling in cgroup_init() with WARN_ON()s
cgroup: add cgroup_subsys->free() method and use it to fix pids controller
cgroup: keep zombies associated with their original cgroups
cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
cgroup: don't hold css_set_rwsem across css task iteration
cgroup: reorganize css_task_iter functions
cgroup: factor out css_set_move_task()
cgroup: keep css_set and task lists in chronological order
cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
cgroup: make css_sets pin the associated cgroups
cgroup: relocate cgroup_[try]get/put()
cgroup: move check_for_release() invocation
cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
cgroup: make cgroup->nr_populated count the number of populated css_sets
cgroup: remove an unused parameter from cgroup_task_migrate()
cgroup: fix too early usage of static_branch_disable()
cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
...
The stat files on the root cgroup shows stats for the whole system and
usually don't contain any information which isn't available through
the usual system monitoring mechanisms. Some controllers skip
collecting these duplicate stats to optimize cases where cgroup isn't
used and later try to emulate the result on demand.
This leads to complexities and subtle differences in the information
shown through different channels. This is entirely unnecessary and
cgroup v2 is dropping stat files which are duplicate from all
controllers. This patch removes "io.stat" from the root hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Pull block updates from Jens Axboe:
"This is a bit bigger than it should be, but I could (did) not want to
send it off last week due to both wanting extra testing, and expecting
a fix for the bounce regression as well. In any case, this contains:
- Fix for the blk-merge.c compilation warning on gcc 5.x from me.
- A set of back/front SG gap merge fixes, from me and from Sagi.
This ensures that we honor SG gapping for integrity payloads as
well.
- Two small fixes for null_blk from Matias, fixing a leak and a
capacity propagation issue.
- A blkcg fix from Tejun, fixing a NULL dereference.
- A fast clone optimization from Ming, fixing a performance
regression since the arbitrarily sized bio's were introduced.
- Also from Ming, a regression fix for bouncing IOs"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix bounce_end_io
block: blk-merge: fast-clone bio when splitting rw bios
block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
block: Copy a user iovec if it includes gaps
block: Refuse adding appending a gapped integrity page to a bio
block: Refuse request/bio merges with gaps in the integrity payload
block: Check for gaps on front and back merges
null_blk: fix wrong capacity when bs is not 512 bytes
null_blk: fix memory leak on cleanup
block: fix bogus compiler warnings in blk-merge.c
While making the root blkg unconditional, ec13b1d6f0 ("blkcg: always
create the blkcg_gq for the root blkcg") removed the part which clears
q->root_blkg and ->root_rl.blkg during q exit. This leaves the two
pointers dangling after blkg_destroy_all(). blk-throttle exit path
performs blkg traversals and dereferences ->root_blkg and can lead to
the following oops.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
RIP: 0010:[<ffffffff81389746>] [<ffffffff81389746>] __blkg_lookup+0x26/0x70
...
Call Trace:
[<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
[<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
[<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
[<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
[<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
[<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
[<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
[<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
...
While the bug is a straigh-forward use-after-free bug, it is tricky to
reproduce because blkg release is RCU protected and the rest of exit
path usually finishes before RCU grace period.
This patch fixes the bug by updating blkg_destro_all() to clear
q->root_blkg and ->root_rl.blkg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Richard W.M. Jones" <rjones@redhat.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
Fixes: ec13b1d6f0 ("blkcg: always create the blkcg_gq for the root blkcg")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup is trying to make interface consistent across different
controllers. For weight based resource control, the knob should have
the range [1, 10000] and default to 100. This patch updates
cfq-iosched so that the weight range conforms. The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.
* blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
when blkcg is attached to a hierarchy.
* cfq_cpd_init() is updated to use the new default value on the
unified hierarchy.
* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
apply the default config matching the hierarchy type.
* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
now responsible for initializing the initial weights when blkcg is
enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too. The interface files are
inconsistent with a number of cloes duplicates. Some files have
recursive variants while others don't. There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface. The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface consisted of 27 files is replaced with the
following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blkg_conf_prep() expects input to be of the following form
MAJ:MIN NUM
and reads the NUM part into blkg_conf_ctx->v. This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy. This patch updates blkg_conf_prep() so that it
expects
MAJ:MIN BODY_STR
where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.
To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
non-const pointer and to accommodate that const is dropped from @input
too.
This doesn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>