linux

mirror of https://github.com/armbian/linux.git synced 2026-01-06 10:13:00 -08:00

Author	SHA1	Message	Date
Amit Pundir	703920c14a	cgroup: refactor allow_attach handler for 4.4 Refactor *allow_attach() handler to align it with the changes from mainline commit `1f7dd3e5a6` "cgroup: fix handling of multi-destination migration from subtree_control enabling". Signed-off-by: Amit Pundir <amit.pundir@linaro.org>	2016-02-16 13:53:46 -08:00
Dmitry Shmidt	69db8fca42	cgroup: fix cgroup_taskset_for_each call in allow_attach() for 4.1 Change-Id: I05013f6e76c30b0ece3671f9f2b4bbdc626cd35c Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>	2016-02-16 13:53:46 -08:00
Christian Poetzsch	f4adb71017	Fix generic cgroup subsystem permission checks In 53b5e2f generic cgroup subsystem permission checks have been added. When this is been done within procs_write an empty taskset is added to the tasks css set. When a task later on migrates to a new group we see a dmesg warning cause the mg_node isn't empty (cgroup.c:2086). Cause this happens all the time this spams dmesg. I am not really familiar with this code, but it looks to me like adding the taskset is just a temporary action in this context. Therefore this taskset should be removed after the actual check. This is what this fix does. This problem was seen and the fix tested on x86 using l-mr1 and master. Change-Id: I9894d39e8b5692ef65149002b07e65a84a33ffea Signed-off-by: Christian Poetzsch <christian.potzsch@imgtec.com>	2016-02-16 13:53:45 -08:00
Colin Cross	1811046286	cgroup: Add generic cgroup subsystem permission checks Rather than using explicit euid == 0 checks when trying to move tasks into a cgroup via CFS, move permission checks into each specific cgroup subsystem. If a subsystem does not specify a 'allow_attach' handler, then we fall back to doing our checks the old way. Use the 'allow_attach' handler for the 'cpu' cgroup to allow non-root processes to add arbitrary processes to a 'cpu' cgroup if it has the CAP_SYS_NICE capability set. This version of the patch adds a 'allow_attach' handler instead of reusing the 'can_attach' handler. If the 'can_attach' handler is reused, a new cgroup that implements 'can_attach' but not the permission checks could end up with no permission checks at all. Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c Original-Author: San Mehat <san@google.com> Signed-off-by: Colin Cross <ccross@android.com>	2016-02-16 13:53:43 -08:00
Rom Lemarchand	6809864a2c	cgroup: refactor allow_attach function into common code move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach. This allows any process with CAP_SYS_NICE to move tasks across cgroups if they use this function as their allow_attach handler. Bug: 18260435 Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f Signed-off-by: Rom Lemarchand <romlem@android.com>	2016-02-16 13:53:42 -08:00
Tejun Heo	1f7dd3e5a6	cgroup: fix handling of multi-destination migration from subtree_control enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... 
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. 
It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>	2015-12-03 10:18:21 -05:00
Tejun Heo	53254f900b	cgroup: make css_set pin its css's to avoid use-afer-free A css_set represents the relationship between a set of tasks and css's. css_set never pinned the associated css's. This was okay because tasks used to always disassociate immediately (in RCU sense) - either a task is moved to a different css_set or exits and never accesses css_set again. Unfortunately, `afcf6c8b75` ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") and patches leading up to it made a zombie hold onto its css_set and deref the associated css's on its release. Nothing pins the css's after exit and it might have already been freed leading to use-after-free. general protection fault: 0000 [#1] PREEMPT SMP task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000 RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 ... Call Trace: <IRQ> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0 [<ffffffff810f8893>] cgroup_free+0x53/0xe0 [<ffffffff8104ed62>] __put_task_struct+0x42/0x130 [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130 [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820 [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820 [<ffffffff81056e54>] __do_softirq+0xd4/0x460 [<ffffffff81057369>] irq_exit+0x89/0xa0 [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90 <EOI> ... Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02 RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 RSP <ffff88001fc03e20> ---[ end trace 89a4a4b916b90c49 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by making css_set pin the associate css's until its release. 
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de Fixes: `afcf6c8b75` ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")	2015-11-30 09:46:21 -05:00
Tejun Heo	34c06254ff	cgroup: fix cftype->file_offset handling `6f60eade24` ("cgroup: generalize obtaining the handles of and notifying cgroup files") introduced cftype->file_offset so that the handles for per-css file instances can be recorded. These handles then can be used, for example, to generate file modified notifications. Unfortunately, it made the wrong assumption that files are created once for a given css and removed on its destruction. Due to the dependencies among subsystems, a css may be hidden from userland and then later shown again. This is implemented by removing and re-creating the affected files, so the associated kernfs_node for a given cgroup file may change over time. This incorrect assumption led to the corruption of css->files lists. Reimplement cftype->file_offset handling so that cgroup_file->kn is protected by a lock and updated as files are created and destroyed. This also makes keeping them on per-cgroup list unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: James Sedgwick <jsedgwick@fb.com> Fixes: `6f60eade24` ("cgroup: generalize obtaining the handles of and notifying cgroup files") Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zefan Li <lizefan@huawei.com>	2015-11-16 10:58:26 -05:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access to one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimistic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truly atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. 
o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Tejun Heo	d574567537	cgroup: fix race condition around termination check in css_task_iter_next() css_task_iter_next() checked @it->cur_task before grabbing css_set_lock and assumed that the result won't change afterwards; however, tasks could leave the cgroup being iterated terminating the iterator before css_task_lock is acquired. If this happens, css_task_iter_next() tries to calculate the current task from NULL cg_list pointer leading to the following oops. BUG: unable to handle kernel paging request at fffffffffffff7d0 IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80 ... CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1 Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014 task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000 RIP: 0010:[<ffffffff810d5f22>] [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80 RSP: 0018:ffff88083404fd28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0 RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278 RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0 R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400 R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58 FS: 00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0 Stack: 0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248 ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee 0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000 Call Trace: [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550 [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20 [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0 [<ffffffff811cab76>] seq_read+0x166/0x380 [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180 [<ffffffff811a7398>] __vfs_read+0x18/0x50 [<ffffffff811a745d>] 
vfs_read+0x8d/0x150 [<ffffffff811a756f>] SyS_read+0x4f/0xb0 [<ffffffff818d4772>] system_call_fastpath+0x12/0x17 Fix it by moving the termination condition check inside css_set_lock. @it->cur_task is now cleared after being put and @it->task_pos is tested for termination instead of @it->cset_pos as they indicate the same condition and @it->task_pos is what's being dereferenced. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Calvin Owens <calvinowens@fb.com> Fixes: `ed27b9f7a1` ("cgroup: don't hold css_set_rwsem across css task iteration") Acked-by: Zefan Li <lizefan@huawei.com>	2015-10-29 11:43:05 +09:00
Tejun Heo	e4b7037c86	cgroup: drop cgroup__DEVEL__legacy_files_on_dfl Now that interfaces for the major three controllers - cpu, memory, io - are shaping up, there's no reason to have an option to force legacy files to show up on the unified hierarchy for testing. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>	2015-10-15 17:00:43 -04:00
Tejun Heo	035f4f5105	cgroup: replace error handling in cgroup_init() with WARN_ON()s The init sequence shouldn't fail short of bugs and even when it does it's better to continue with the rest of initialization and we were silently ignoring /proc/cgroups creation failure. Drop the explicit error handling and wrap sysfs_create_mount_point(), register_filesystem() and proc_create() with WARN_ON()s. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>	2015-10-15 17:00:43 -04:00
Tejun Heo	afcf6c8b75	cgroup: add cgroup_subsys->free() method and use it to fix pids controller pids controller is completely broken in that it uncharges when a task exits allowing zombies to escape resource control. With the recent updates, cgroup core now maintains cgroup association till task free and pids controller can be fixed by uncharging on free instead of exit. This patch adds cgroup_subsys->free() method and update pids controller to use it instead of ->exit() for uncharging. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com>	2015-10-15 16:41:53 -04:00
Tejun Heo	2e91fa7f6d	cgroup: keep zombies associated with their original cgroups cgroup_exit() is called when a task exits and disassociates the exiting task from its cgroups and half-attach it to the root cgroup. This is unnecessary and undesirable. No controller actually needs an exiting task to be disassociated with non-root cgroups. Both cpu and perf_event controllers update the association to the root cgroup from their exit callbacks just to keep consistent with the cgroup core behavior. Also, this disassociation makes it difficult to track resources held by zombies or determine where the zombies came from. Currently, pids controller is completely broken as it uncharges on exit and zombies always escape the resource restriction. With cgroup association being reset on exit, fixing it is pretty painful. There's no reason to reset cgroup membership on exit. The zombie can be removed from its css_set so that it doesn't show up on "cgroup.procs" and thus can't be migrated or interfere with cgroup removal. It can still pin and point to the css_set so that its cgroup membership is maintained. This patch makes cgroup core keep zombies associated with their cgroups at the time of exit. * Previous patches decoupled populated_cnt tracking from css_set lifetime, so a dying task can be simply unlinked from its css_set while pinning and pointing to the css_set. This keeps css_set association from task side alive while hiding it from "cgroup.procs" and populated_cnt tracking. The css_set reference is dropped when the task_struct is freed. * ->exit() callback no longer needs the css arguments as the associated css never changes once PF_EXITING is set. Removed. * cpu and perf_events controllers no longer need ->exit() callbacks. There's no reason to explicitly switch away on exit. The final schedule out is enough. The callbacks are removed. * On traditional hierarchies, nothing changes. "/proc/PID/cgroup" still reports "/" for all zombies. 
On the default hierarchy, "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged to at the time of exit. If the cgroup gets removed before the task is reaped, " (deleted)" is appended. v2: Build breakage due to missing dummy cgroup_free() when !CONFIG_CGROUP fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>	2015-10-15 16:41:53 -04:00
Tejun Heo	f0d9a5f175	cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock css_set_rwsem is the inner lock protecting css_sets and is accessed from hot paths such as fork and exit. Internally, it has no reason to be a rwsem or even mutex. There are no internal blocking operations while holding it. This was rwsem because css task iteration used to expose it to external iterator users. As the previous patch updated css task iteration such that the locking is not leaked to its users, there's no reason to keep it a rwsem. This patch converts css_set_rwsem to a spinlock and rename it to css_set_lock. It uses bh-safe operations as a planned usage needs to access it from RCU callback context. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:53 -04:00
Tejun Heo	ed27b9f7a1	cgroup: don't hold css_set_rwsem across css task iteration css_sets are synchronized through css_set_rwsem but the locking scheme is kinda bizarre. The hot paths - fork and exit - have to write lock the rwsem making the rw part pointless; furthermore, many readers already hold cgroup_mutex. One of the readers is css task iteration. It read locks the rwsem over the entire duration of iteration. This leads to silly locking behavior. When cpuset tries to migrate processes of a cgroup to a different NUMA node, css_set_rwsem is held across the entire migration attempt which can take a long time locking out forking, exiting and other cgroup operations. This patch updates css task iteration so that it locks css_set_rwsem only while the iterator is being advanced. css task iteration involves two levels - css_set and task iteration. As css_sets in use are practically immutable, simply pinning the current one is enough for resuming iteration afterwards. Task iteration is tricky as tasks may leave their css_set while iteration is in progress. This is solved by keeping track of active iterators and advancing them if their next task leaves its css_set. v2: put_task_struct() in css_task_iter_next() moved outside css_set_rwsem. A later patch will add cgroup operations to task_struct free path which may grab the same lock and this avoids deadlock possibilities. css_set_move_task() updated to use list_for_each_entry_safe() when walking task_iters and advancing them. This is necessary as advancing an iter may remove it from the list. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:52 -04:00
Tejun Heo	ecb9d535df	cgroup: reorganize css_task_iter functions * Rename css_advance_task_iter() to css_task_iter_advance_css_set() and make it clear it->task_pos too at the end of the iteration. * Factor out css_task_iter_advance() from css_task_iter_next(). The new function whines if called on a terminated iterator. Except for the termination check, this is pure reorganization and doesn't introduce any behavior changes. This will help the planned locking update for css_task_iter. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:52 -04:00
Tejun Heo	f6d7d049c1	cgroup: factor out css_set_move_task() A task is associated and disassociated with its css_set in three places - during migration, after a new task is created and when a task exits. The first is handled by cgroup_task_migrate() and the latter two are open-coded. These are similar operations and spreading them over multiple places makes it harder to follow and update. This patch collects all task css_set [dis]association operations into css_set_move_task(). While css_set_move_task() may check whether populated state needs to be updated when not strictly necessary, the behavior is essentially equivalent before and after this patch. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:52 -04:00
Tejun Heo	389b9c1bc9	cgroup: keep css_set and task lists in chronological order css task iteration will be updated to not leak cgroup internal locking to iterator users. In preparation, update css_set and task lists to be in chronological order. For tasks, as migration path is already using list_splice_tail_init(), only cgroup_enable_task_cg_lists() and cgroup_post_fork() need updating. For css_sets, link_css_set() is the only place which needs to be updated. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:51 -04:00
Tejun Heo	91486f61f4	cgroup: make cgroup_destroy_locked() test cgroup_is_populated() cgroup_destroy_locked() currently tests whether any css_sets are associated to reject removal if the cgroup contains tasks. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch updates cgroup_destroy_locked() so that it tests cgroup_is_populated(), which counts the number of populated css_sets, instead of whether cgrp->cset_links is empty to determine whether the cgroup is populated or not. This ensures that rmdirs won't be incorrectly rejected for cgroups which only contain zombie tasks. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:51 -04:00
Tejun Heo	2ceb231b0a	cgroup: make css_sets pin the associated cgroups Currently, css_sets don't pin the associated cgroups. This is okay as a cgroup with css_sets associated are not allowed to be removed; however, to help resource tracking for zombie tasks, this is scheduled to change such that a cgroup can be removed even when it has css_sets associated as long as none of them are populated. To ensure that a cgroup doesn't go away while css_sets are still associated with it, make each associated css_set hold a reference on the cgroup if non-root. v2: Root cgroups are special and shouldn't be ref'd by css_sets. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:51 -04:00
Tejun Heo	052c3f3a0b	cgroup: relocate cgroup_[try]get/put() Relocate cgroup_get(), cgroup_tryget() and cgroup_put() upwards. This is pure code reorganization to prepare for future changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:50 -04:00
Tejun Heo	ad2ed2b35b	cgroup: move check_for_release() invocation To trigger release agent when the last task leaves the cgroup, check_for_release() is called from put_css_set_locked(); however, css_set being unlinked is being decoupled from task leaving the cgroup and the correct condition to test is cgroup->nr_populated dropping to zero which check_for_release() is already updated to test. This patch moves check_for_release() invocation from put_css_set_locked() to cgroup_update_populated(). Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:50 -04:00
Tejun Heo	27bd4dbb8d	cgroup: replace cgroup_has_tasks() with cgroup_is_populated() Currently, cgroup_has_tasks() tests whether the target cgroup has any css_set linked to it. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch replaces cgroup_has_tasks() with cgroup_is_populated() which tests cgroup->nr_populated instead which locally counts the number of populated css_sets. Unlike cgroup_has_tasks(), cgroup_is_populated() is recursive - if any of the descendants is populated, the cgroup is populated too. While this changes the meaning of the test, all the existing users are okay with the change. While at it, replace the open-coded ->populated_cnt test in cgroup_events_show() with cgroup_is_populated(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>	2015-10-15 16:41:50 -04:00
Tejun Heo	0de0942db2	cgroup: make cgroup->nr_populated count the number of populated css_sets Currently, cgroup->nr_populated counts whether the cgroup has any css_sets linked to it and the number of children which has non-zero ->nr_populated. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch updates cgroup->nr_populated so that for the cgroup itself it counts the number of css_sets which have tasks associated with them so that empty css_sets don't skew the populated test. Signed-off-by: Tejun Heo <tj@kernel.org>	2015-10-15 16:41:49 -04:00
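
The populated-count commits above (`0de0942db2`, `27bd4dbb8d`) describe a counter that tracks populated css_sets locally and populated children recursively. The following is a minimal userspace sketch of that idea, not the kernel code: `struct model_cgroup`, `model_update_populated()` and `model_is_populated()` are hypothetical names invented for illustration, and locking, release-agent checks and event notification are left out. It assumes the counter for a cgroup is the number of its populated css_sets plus the number of its populated children, and that a change only propagates to the parent when the cgroup itself flips between empty and populated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; this is NOT the kernel's struct cgroup. */
struct model_cgroup {
    struct model_cgroup *parent; /* NULL for the root */
    int populated_cnt;           /* populated css_sets here + populated children */
};

/*
 * Record that one css_set linked to @cgrp became populated (true) or
 * lost its last live task (false). The change walks up the ancestors
 * only while each level flips between "empty" and "populated".
 */
static void model_update_populated(struct model_cgroup *cgrp, bool populated)
{
    while (cgrp) {
        bool state_changed;

        if (populated) {
            /* 0 -> 1 means this level just became populated */
            state_changed = (cgrp->populated_cnt++ == 0);
        } else {
            assert(cgrp->populated_cnt > 0);
            /* 1 -> 0 means this level just became empty */
            state_changed = (--cgrp->populated_cnt == 0);
        }

        if (!state_changed)
            break; /* ancestors already reflect the right state */

        cgrp = cgrp->parent;
    }
}

/* Recursive by construction: true if this cgroup or any descendant is populated. */
static bool model_is_populated(const struct model_cgroup *cgrp)
{
    return cgrp->populated_cnt > 0;
}
```

In this model, a css_set that only holds zombies stays linked but never reports itself as populated, so the counter can reach zero and a removal check based on `model_is_populated()` would not reject the cgroup.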


812 Commits