Pull cpuset changes from Tejun Heo:
"cpuset has always been rather odd about its configurations - a cgroup
right after creation didn't allow any task executions before
configuration, changing configuration in the parent modifies the
descendants irreversibly and so on. These behaviors are inherently
nasty and almost hostile against sharing the hierarchy with other
controllers making it very difficult to use in unified hierarchy.
Li is currently in the process of updating the behaviors for
__DEVEL__sane_behavior which is the bulk of changes in this pull
request. It isn't complete yet and the behaviors will change further
but all changes are gated behind sane_behavior. In the process, the
rather hairy work-item punting which was used to work around the
limitations of cgroup descendant iterator was simplified."
* 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: rename @cont to @cgrp
cpuset: fix to migrate mm correctly in a corner case
cpuset: allow to move tasks to empty cpusets
cpuset: allow to keep tasks in empty cpusets
cpuset: introduce effective_{cpumask|nodemask}_cpuset()
cpuset: record old_mems_allowed in struct cpuset
cpuset: remove async hotplug propagation work
cpuset: let hotplug propagation work wait for task attaching
cpuset: re-structure update_cpumask() a bit
cpuset: remove cpuset_test_cpumask()
cpuset: remove unnecessary variable in cpuset_attach()
cpuset: cleanup guarantee_online_{cpus|mems}()
cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c long time
back and the comments/Documentation never got updated.
I figured it out when I was going through sched-domains.txt and so thought of
fixing it globally.
I haven't crossed check if the stuff that is referenced in sched/core.c by all
these files is still present and hasn't changed as that wasn't the motive behind
this patch.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cont is short for container. control group was named process container
at first, but then people found container already has a meaning in
linux kernel.
Clean up the leftover variable name @cont.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Before moving tasks out of empty cpusets, update_tasks_nodemask()
is called, which calls do_migrate_pages(xx, from, to). Then those
tasks are moved to an ancestor, and do_migrate_pages() is called
again.
The first time: from = node_to_be_offlined, to = empty.
The second time: from = empty, to = ancestor's nodemask.
so looks like no pages will be migrated.
Fix this by:
- Don't call update_tasks_nodemask() on empty cpusets.
- Pass cs->old_mems_allowed to do_migrate_pages().
v4: added comment in cpuset_hotplug_update_tasks() and rephased comment
in cpuset_attach().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.
Now with this patchset if cpuset is mounted with sane_behavior option,
it behaves differently:
- Tasks will be kept in empty cpusets when hotplug happens and take
masks of ancestors with non-empty cpus/mems, instead of being moved to
an ancestor.
- A task can be moved into an empty cpuset, and again it takes masks of
ancestors, so the user can drop a task into a newly created cgroup without
having to do anything for it.
As tasks can reside in empy cpusets, here're some rules:
- They can be moved to another cpuset, regardless it's empty or not.
- Though it takes masks from ancestors, it takes other configs from the
empty cpuset.
- If the ancestors' masks are changed, those tasks will also be updated
to take new masks.
v2: add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To achieve this:
- We call update_tasks_cpumask/nodemask() for empty cpusets when
hotplug happens, instead of moving tasks out of them.
- When a cpuset's masks are changed by writing cpuset.cpus/mems,
we also update tasks in child cpusets which are empty.
v3:
- do propagation work in one place for both hotplug and unplug
v2:
- drop rcu_read_lock before calling update_task_nodemask() and
update_task_cpumask(), instead of using workqueue.
- add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
effective_cpumask_cpuset() returns an ancestor cpuset which has
non-empty cpumask.
If a cpuset is empty and the tasks in it need to update their
cpus_allowed, they take on the ancestor cpuset's cpumask.
This currently won't change any behavior, but it will later allow us
to keep tasks in empty cpusets.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we update a cpuset's mems_allowed and thus update tasks'
mems_allowed, it's required to pass the old mems_allowed and new
mems_allowed to cpuset_migrate_mm().
Currently we save old mems_allowed in a temp local variable before
changing cpuset->mems_allowed. This patch changes it by saving
old mems_allowed in cpuset->old_mems_allowed.
This currently won't change any behavior, but it will later allow
us to keep tasks in empty cpusets.
v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed"
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As we can drop rcu read lock while iterating cgroup hierarchy,
we don't have to do propagation asynchronously via workqueue.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of triggering propagation work in cpuset_attach(), we make
hotplug propagation work wait until there's no task attaching in
progress.
IMO this is more robust. We won't see empty masks in cpuset_attach().
Also it's a preparation for removing propagation work. Without asynchronous
propagation we can't call move_tasks_in_empty_cpuset() in cpuset_attach(),
because otherwise we'll deadlock on cgroup_mutex.
tj: typo fixes.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Check if cpus_allowed is to be changed before calling validate_change().
This won't change any behavior, but later it will allow us to do this:
# mkdir /cpuset/child
# echo $$ > /cpuset/child/tasks /* empty cpuset */
# echo > /cpuset/child/cpuset.cpus /* do nothing, won't fail */
Without this patch, the last operation will fail.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- We never pass a NULL @cs to these functions.
- The top cpuset always has some online cpus/mems.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
Pull workqueue updates from Tejun Heo:
"A lot of activities on workqueue side this time. The changes achieve
the followings.
- WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
updated to be able to interface with multiple backend worker pools.
This involved a lot of churning but the end result seems actually
neater as unbound workqueues are now a lot closer to per-cpu ones.
- The ability to interface with multiple backend worker pools are
used to implement unbound workqueues with custom attributes.
Currently the supported attributes are the nice level and CPU
affinity. It may be expanded to include cgroup association in
future. The attributes can be specified either by calling
apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
the workqueue in question is exported through sysfs.
The backend worker pools are keyed by the actual attributes and
shared by any workqueues which share the same attributes. When
attributes of a workqueue are changed, the workqueue binds to the
worker pool with the specified attributes while leaving the work
items which are already executing in its previous worker pools
alone.
This allows converting custom worker pool implementations which
want worker attribute tuning to use workqueues. The writeback pool
is already converted in block tree and there are a couple others
are likely to follow including btrfs io workers.
- WQ_UNBOUND's ability to bind to multiple worker pools is also used
to make it NUMA-aware. Because there's no association between work
item issuer and the specific worker assigned to execute it, before
this change, using unbound workqueue led to unnecessary cross-node
bouncing and it couldn't be helped by autonuma as it requires tasks
to have implicit node affinity and workers are assigned randomly.
After these changes, an unbound workqueue now binds to multiple
NUMA-affine worker pools so that queued work items are executed in
the same node. This is turned on by default but can be disabled
system-wide or for individual workqueues.
Crypto was requesting NUMA affinity as encrypting data across
different nodes can contribute noticeable overhead and doing it
per-cpu was too limiting for certain cases and IO throughput could
be bottlenecked by one CPU being fully occupied while others have
idle cycles.
While the new features required a lot of changes including
restructuring locking, it didn't complicate the execution paths much.
The unbound workqueue handling is now closer to per-cpu ones and the
new features are implemented by simply associating a workqueue with
different sets of backend worker pools without changing queue,
execution or flush paths.
As such, even though the amount of change is very high, I feel
relatively safe in that it isn't likely to cause subtle issues with
basic correctness of work item execution and handling. If something
is wrong, it's likely to show up as being associated with worker pools
with the wrong attributes or OOPS while workqueue attributes are being
changed or during CPU hotplug.
While this creates more backend worker pools, it doesn't add too many
more workers unless, of course, there are many workqueues with unique
combinations of attributes. Assuming everything else is the same,
NUMA awareness costs an extra worker pool per NUMA node with online
CPUs.
There are also a couple things which are being routed outside the
workqueue tree.
- block tree pulled in workqueue for-3.10 so that writeback worker
pool can be converted to unbound workqueue with sysfs control
exposed. This simplifies the code, makes writeback workers
NUMA-aware and allows tuning nice level and CPU affinity via sysfs.
- The conversion to workqueue means that there's no 1:1 association
between a specific worker, which makes writeback folks unhappy as
they want to be able to tell which filesystem caused a problem from
backtrace on systems with many filesystems mounted. This is
resolved by allowing work items to set debug info string which is
printed when the task is dumped. As this change involves unifying
implementations of dump_stack() and friends in arch codes, it's
being routed through Andrew's -mm tree."
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
workqueue: use kmem_cache_free() instead of kfree()
workqueue: avoid false negative WARN_ON() in destroy_workqueue()
workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
workqueue: implement NUMA affinity for unbound workqueues
workqueue: introduce put_pwq_unlocked()
workqueue: introduce numa_pwq_tbl_install()
workqueue: use NUMA-aware allocation for pool_workqueues
workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
workqueue: map an unbound workqueues to multiple per-node pool_workqueues
workqueue: move hot fields of workqueue_struct to the end
workqueue: make workqueue->name[] fixed len
workqueue: add workqueue->unbound_attrs
workqueue: determine NUMA node of workers accourding to the allowed cpumask
workqueue: drop 'H' from kworker names of unbound worker pools
workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
workqueue: fix memory leak in apply_workqueue_attrs()
workqueue: fix unbound workqueue attrs hashing / comparison
workqueue: fix race condition in unbound workqueue free path
workqueue: remove pwq_lock which is no longer used
...
Use the new interface, remove one ifdef. No code size changes.
We could/should have been using __meminit/__meminitdata here but there's
now no point in doing that because all this code is elided at compile time.
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported by Fengguang's kbuild test robot:
kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used
Introduced by commit e0e80a02e5
("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
which removed generate_sched_domains() from cpuset_hotplug_workfn().
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()).
This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code looks a little simpler.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a cpuset becomes empty (no CPU or memory), its tasks are
transferred with the nearest ancestor with execution resources. This
is implemented using cgroup_scan_tasks() with a callback which grabs
cgroup_mutex and invokes cgroup_attach_task() on each task.
Both cgroup_mutex and cgroup_attach_task() are scheduled to be
unexported. Implement cgroup_transfer_tasks() in cgroup proper which
is essentially the same as move_member_tasks_to_cpuset() except that
it takes cgroups instead of cpusets and @to comes before @from like
normal functions with those arguments, and replace
move_member_tasks_to_cpuset() with it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>