Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.
It's reasonable to reserve 0 for special purposes. This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed form memcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks. This patch updates those
to use bitwise and/or operations instead and converts them form
unsigned long to unsigned int.
This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.
* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.
* There is single monitoring point at the root. There's no way to
delegate management of a subtree.
* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.
* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.
This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.
v2: Patch description updated as per Serge.
v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Lennart Poettering <lennart@poettering.net>
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
To implement the unified hierarchy behavior, we'll need to be able to
determine the associated cgroup on the default hierarchy from css_set.
Let's add css_set->dfl_cgrp so that it can be accessed conveniently
and efficiently.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css_task_iter iterates tasks associated with a css by
visiting each css_set associated with the owning cgroup and walking
tasks of each of them. This works fine for !unified hierarchies as
each cgroup has its own css for each associated subsystem on the
hierarchy; however, on the planned unified hierarchy, a cgroup may not
have csses associated and its tasks would be considered associated
with the matching css of the nearest ancestor which has the subsystem
enabled.
This means that on the default unified hierarchy, just walking all
tasks associated with a cgroup isn't enough to walk all tasks which
are associated with the specified css. If any of its children doesn't
have the matching css enabled, task iteration should also include all
tasks from the subtree. We already added cgroup->e_csets[] to list
all css_sets effectively associated with a given css and walk css_sets
on that list instead to achieve such iteration.
This patch updates css_task_iter iteration such that it walks css_sets
on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
requested on an non-dummy css. Thanks to the previous iteration
update, this change can be achieved with the addition of
css_task_iter->ss and minimal updates to css_advance_task_iter() and
css_task_iter_start().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
This patch reorganizes css_task_iter so that adding effective css
support is easier.
* s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency
* ->origin_css is used to determine whether the iteration reached the
last css_set. Replace it with explicit ->cset_head so that
css_advance_task_iter() doesn't have to know the termination
condition directly.
* css_task_iter_next() currently assumes that it's walking list of
cgrp_cset_link and reaches into the current cset through the current
link to determine the termination conditions for task walking. As
this won't always be true for effective css walking, add
->tasks_head and ->mg_tasks_head and use them to control task
walking so that css_task_iter_next() doesn't have to know how
css_sets are being walked.
This patch doesn't make any behavior changes. The iteration logic
stays unchanged after the patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
On the default unified hierarchy, a cgroup may be associated with
csses of its ancestors, which means that a css of a given cgroup may
be associated with css_sets of descendant cgroups. This means that we
can't walk all tasks associated with a css by iterating the css_sets
associated with the cgroup as there are css_sets which are pointing to
the css but linked on the descendants.
This patch adds per-subsystem list heads cgroup->e_csets[]. Any
css_set which is pointing to a css is linked to
css->cgroup->e_csets[$SUBSYS_ID] through
css_set->e_cset_node[$SUBSYS_ID]. The lists are protected by
css_set_rwsem and will allow us to walk all css_sets associated with a
given css so that we can find out all associated tasks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
944196278d ("cgroup: move ->subsys_mask from cgroupfs_root to
cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for
the unified hierarhcy; however, it turns out that carrying the
subsys_mask of the children in the parent, instead of itself, is a lot
more natural. This patch restores cgroup_root->subsys_mask and morphs
cgroup->subsys_mask into cgroup->child_subsys_mask.
* Uses of root->cgrp.subsys_mask are restored to root->subsys_mask.
* Remove automatic setting and clearing of cgrp->subsys_mask and
instead just inherit ->child_subsys_mask from the parent during
cgroup creation. Note that this doesn't affect any current
behaviors.
* Undo __kill_css() separation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
This cftype flag makes the file only appear on the default hierarchy.
This will later be used for cgroup.controllers file.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgrp_dfl_root will be used as the default unified hierarchy. This
patch makes cgrp_dfl_root mountable by making the following changes.
* cgroup_init_early() now initializes cgrp_dfl_root w/
CGRP_ROOT_SANE_BEHAVIOR. The default hierarchy is always sane.
* parse_cgroupfs_options() and cgroup_mount() are updated such that
cgrp_dfl_root is mounted if sane_behavior is specified w/o any
subsystems.
* rebind_subsystems() now populates the root directory of
cgrp_dfl_root. Note that the function still guarantees success of
rebinding subsystems to cgrp_dfl_root. If populating fails while
rebinding to cgrp_dfl_root, it whines but ignores the error.
* For backward compatibility, the default hierarchy shows up in
/proc/$PID/cgroup only after it's explicitly mounted so that
userland which doesn't make use of it doesn't see any change.
* "current_css_set_cg_links" file of debug cgroup now treats the
default hierarchy the same as other hierarchies. This is visible to
userland. Given that it's for debug controller, this should be
fine.
* While at it, implement cgroup_on_dfl() which tests whether a give
cgroup is on the default hierarchy or not.
The above changes make cgrp_dfl_root mostly equivalent to other
controllers but the actual unified hierarchy behaviors are not
implemented yet. Let's plug child cgroup creation in cgrp_dfl_root
from create_cgroup() for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The dummy root will be repurposed to serve as the default unified
hierarchy. Let's rename things in preparation.
* s/cgroup_dummy_root/cgrp_dfl_root/
* s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore
* s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity
This is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroupfs_root->subsys_mask represents the controllers attached to the
hierarchy. This patch moves the field to cgroup. Subsystem
initialization and rebinding updates the top cgroup's subsys_mask.
For !root cgroups, the subsys_mask bits are set from create_css() and
cleared from kill_css(), which effectively means that all cgroups will
have the same subsys_mask as the top cgroup.
While this doesn't make any difference now, this will help
implementation of the default unified hierarchy where !root cgroups
may have subsets of the top_cgroup's subsys_mask.
While at it, __kill_css() is split out of kill_css(). The former
doesn't care about the subsys_mask while the latter becomes noop if
the controller is already killed and clears the matching bit if not
before proceeding to killing the css. This will be used later by the
default unified hierarchy implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The dummy hierarchy is now a fully functional one and dummy_top has a
kernfs_node associated with it. Drop the NULL checks in
[pr_cont_]cont_{name|path}() which are no longer necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
For optimization, task_lock() is additionally used to protect
task->cgroups. The optimization is pretty dubious as either
css_set_rwsem is grabbed anyway or PF_EXITING already protects
task->cgroups. It adds only overhead and confusion at this point.
Let's drop task_[un]lock() and update comments accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too. Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set. Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks. Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
My kernel fails to boot, because blkcg calls cgroup_path() while
cgroupfs is not mounted.
Fix both cgroup_name() and cgroup_path().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The two functions don't have any users left. Remove them along with
cgroup_taskset->cur_cgrp.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
css. The intention of the interface is to make it easy to skip css's
(cgroup_subsys_states) which already match the migration target;
however, this is entirely unnecessary as migration taskset doesn't
include tasks which are already in the target cgroup. Drop @skip_css
from cgroup_taskset_for_each().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
cgroup_task_count() read-locks css_set_lock and walks all tasks to
count them and then returns the result. The only thing all the users
want is determining whether the cgroup is empty or not. This patch
implements cgroup_has_tasks() which tests whether cgroup->cset_links
is empty, replaces all cgroup_task_count() usages and unexports it.
Note that the test isn't synchronized. This is the same as before.
The test has always been racy.
This will help planned css_set locking update.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>