Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item. The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent. The separation is to allow using percpu refcnt for css.
Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name. As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine. A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.
As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both. This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes. INIT_WORK() is now performed right before queueing the work
item.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list. This will be used to make css use
percpu refcnt.
While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.
While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup->count tracks the number of css_sets associated with the cgroup
and used only to verify that no css_set is associated when the cgroup
is being destroyed. It's superflous as the destruction path can
simply check whether cgroup->cset_links is empty instead.
Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced. Given that both
sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
is unnecessary. Just making cgroup_destroy_locked() grab css_set_lock
while checking @cgrp->count is enough.
Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count. This will also allow removing @cgrp->count.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We will add another flag indicating that the cgroup is in the process
of being killed. REMOVING / REMOVED is more difficult to distinguish
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also,
later percpu_ref usage will involve "kill"ing the refcnt.
s/CGRP_REMOVED/CGRP_DEAD/
s/cgroup_is_removed()/cgroup_is_dead()
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
* __css_get() isn't used by anyone. Fold it into css_get().
* Add proper function comments to all css reference functions.
This patch is purely cosmetic.
v2: Typo fix as per Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
There's no point in using kmalloc() instead of the clearing variant
for trivial stuff. We can live dangerously elsewhere. Use kzalloc()
instead and drop 0 inits.
While at it, do trivial code reorganization in cgroup_file_open().
This patch doesn't introduce any functional changes.
v2: I was caught in the very distant past where list_del() didn't
poison and the initial version converted list_del()s to
list_del_init()s too. Li and Kent took me out of the stasis
chamber.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides. The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.
>From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list. From css_set side, it starts off ->cg_links and
runs through ->cg_link_list. This is rather reversed as
cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets. This patch cleans it up a bit.
* s/cgroup->css_sets/cgroup->cset_links/
s/css_set->cg_links/css_set->cgrp_links/
s/cgroup_iter->cg_link/cgroup_iter->cset_link/
* s/cg_cgroup_link/cgrp_cset_link/
* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
* s/init_css_set_link/init_cgrp_cset_link/
s/free_cg_links/free_cgrp_cset_links/
s/allocate_cg_links/allocate_cgrp_cset_links/
* s/cgl[12]/link[12]/ in compare_css_sets()
* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
adustments.
* Comment and whiteline adjustments.
After the changes, we have
list_for_each_entry(link, &cont->cset_links, cset_link) {
struct css_set *cset = link->cset;
instead of
list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
struct css_set *cset = link->cg;
This patch is purely cosmetic.
v2: Fix broken sentences in the patch description.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.
It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough. Let's give it some sanity by
uniformly using @cset for all struct css_set variables.
* s/cg/cset/ for all css_set variables.
* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones
prefixed with "new".
* s/cg/cgrp/ for cgroup variables in compare_css_sets().
* s/css/cset/ for the cgroup variable in task_cgroup_from_root().
* Whiteline adjustments.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
* Rename it from files[] (really?) to cgroup_base_files[].
* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
used inconsistently. Just use "cgroup." directly.
* Collect insane files at the end. Note that only the insane ones are
missing "cgroup." prefix.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated. Forking and execing userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism. Let's make it clear that this
abomination is going away.
Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg do explicitly find the group leader and use it. This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.
Mark "tasks" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
During a config change, propagate_exception() needs to traverse the
subtree to update config on the subtree. Because such config updates
need to allocate memory, it couldn't directly use
cgroup_for_each_descendant_pre() which required the whole iteration to
be contained in a single RCU read critical section. To work around
the limitation, propagate_exception() built a linked list of
descendant cgroups while read-locking RCU and then walked the list
afterwards, which is safe as the whole iteration is protected by
devcgroup_mutex. This works but is cumbersome.
With the recent updates, cgroup iterators now allow dropping RCU read
lock while iteration is in progress making this workaround no longer
necessary. This patch replaces dev_cgroup->propagate_pending list and
get_online_devcg() with direct cgroup_for_each_descendant_pre() walk.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.
The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped. Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle. Comments are updated
accordingly.
This makes the iterators easier to use and will simplify controllers.
Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state. Make it static.
While at it, make it return bool.
Signed-off-by: Tejun Heo <tj@kernel.org>
Merging to receive 7805d000db ("cgroup: fix a subtle bug in descendant
pre-order walk") so that further iterator updates can build upon it.
Signed-off-by: Tejun Heo <tj@kernel.org>
When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty. This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.
There's no reason to have the early exit path. Remove it along with
the later assumption that the subtree isn't empty. This simplifies
the code a bit and fixes the subtle bug.
While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
cgroup_lock() and cgroup_unlock() are now no longer exported, so fix
cgroup.h to not declare them if CONFIG_CGROUPS is not enabled.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kdbus folks want a sane way to determine the cgroup path that a given
task belongs to on a given hierarchy, which is a reasonble thing to
expect from cgroup core.
Implement task_cgroup_path_from_hierarchy().
v2: Dropped unnecessary NULL check on the return value of
task_cgroup_from_root() as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <greg@kroah.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <daniel@zonque.org>
We want to be able to lookup a hierarchy from its id and cyclic
allocation is a whole lot simpler with idr. Convert to idr and use
idr_alloc_cyclc().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that hierarchy_id alloc / free are protected by the cgroup
mutexes, there's no need for this separate lock. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We're planning to converting hierarchy_ida to an idr and use it to
look up hierarchy from its id. As we want the mapping to happen
atomically with cgroupfs_root registration, this patch refactors
hierarchy_id init / exit so that ida operations happen inside
cgroup_[root_]mutex.
* s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or
-errno like a normal function.
* Move hierarchy_id initialization from cgroup_root_from_opts() into
cgroup_mount() block where the root is confirmed to be used and
being registered while holding both mutexes.
* Split cgroup_drop_id() into cgroup_exit_root_id() and
cgroup_free_root(), so that ID release can happen before dropping
the mutexes in cgroup_kill_sb(). The latter expects hierarchy_id to
be exited before being invoked.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_create_file() calls d_instantiate(), which may decide to look
at the xattrs on the file. Smack always does this and SELinux can be
configured to do so.
But cgroup_add_file() didn't initialize xattrs before calling
cgroup_create_file(), which finally leads to dereferencing NULL
dentry->d_fsdata.
This bug has been there since cgroup xattr was introduced.
Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>