Because accounting resources for the root cgroup sometimes incurs
measureable overhead for workloads which don't care about cgroup and
often ends up calculating a number which is available elsewhere in a
slightly different form, cgroup is not in the business of providing
system-wide statistics. The pids controller which was introduced
recently was exposing "pids.current" at the root. This patch disable
accounting for root cgroup and removes the file from the root
directory.
While this is a userland visible behavior change, pids has been
available only in one version and was badly broken there, so I don't
think this will be noticeable. If it turns out to be a problem, we
can reinstate it for v1 hierarchies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
If one or more tasks get moved into a frozen css, the frozen state is
cleared up from the destination css so that it can be reasserted once
the migrated tasks are frozen. freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.
This patch merges the two steps. Propagation now takes place inside
the task loop. This simplifies the code and prepares it for the fix
of multi-destination migration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that we know that the forking task can't migrate amd the child is always
moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can
change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
And since we no longer need to pin this css, we can remove pid_fork().
Note: the patch uses task_css_check(true), perhaps it makes sense to add a
helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
account.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If the new child migrates to another cgroup before cgroup_post_fork() calls
subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
pids_uncharge(old_pids) + pids_charge(pids) sequence twice.
Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
unconditionally. percpu_down_read() is cheap and this allows other cleanups,
see the next changes.
Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A css_set represents the relationship between a set of tasks and
css's. css_set never pinned the associated css's. This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.
Unfortunately, afcf6c8b75 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release. Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.
general protection fault: 0000 [#1] PREEMPT SMP
task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
...
Call Trace:
<IRQ>
[<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
[<ffffffff810f8893>] cgroup_free+0x53/0xe0
[<ffffffff8104ed62>] __put_task_struct+0x42/0x130
[<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
[<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
[<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
[<ffffffff81056e54>] __do_softirq+0xd4/0x460
[<ffffffff81057369>] irq_exit+0x89/0xa0
[<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
[<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
<EOI>
...
Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
RSP <ffff88001fc03e20>
---[ end trace 89a4a4b916b90c49 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
Fix it by making css_set pin the associate css's until its release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b75 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
The following patch replaces all instances of time_t with time64_t i.e.
change the type used for representing time from 32-bit to 64-bit. All
32-bit kernels to date use a signed 32-bit time_t type, which can only
represent time until January 2038. Since embedded systems running 32-bit
Linux are going to survive beyond that date, we have to change all
current uses, in a backwards compatible way.
The patch also changes the function get_seconds() that returns a 32-bit
integer to ktime_get_seconds() that returns seconds as 64-bit integer.
The patch changes the type of ticks from time_t to u32. We keep ticks as
32-bits as the function uses 32-bit arithmetic which would prove less
expensive than 64-bit arithmetic and the function is expected to be
called atleast once every 32 seconds.
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that cgroup v2 is almost out of the door, replace the development
documentation unified-hierarchy.txt with Documentation/cgroup.txt
which is a superset of unified-hierarchy.txt and authoritatively
describes all userland-visible aspects of cgroup.
v2: Updated to include all information from blkio-controller.txt and
list filesystems which support cgroup writeback as suggested by
Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
In preparation for adding cgroup2 documentation, rename
Documentation/cgroups/ to Documentation/cgroup-legacy/.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild. Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.
mount -t cgroup2 none $MOUNT_POINT
The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").
v2: Assign a different magic number to cgroup2 fs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
6f60eade24 ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded. These handles
then can be used, for example, to generate file modified
notifications.
Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction. Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again. This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time. This incorrect assumption led
to the corruption of css->files lists.
Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: James Sedgwick <jsedgwick@fb.com>
Fixes: 6f60eade24 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Pull perf updates from Thomas Gleixner:
"Mostly updates to the perf tool plus two fixes to the kernel core code:
- Handle tracepoint filters correctly for inherited events (Peter
Zijlstra)
- Prevent a deadlock in perf_lock_task_context (Paul McKenney)
- Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
Melo)
- Print full source file paths when using 'perf annotate --print-line
--full-paths' (Michael Petlan)
- Fix 'perf probe -d' when just one out of uprobes and kprobes is
enabled (Wang Nan)
- Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)
- Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
the 'perf test' LLVM entries, when running it in-tree, to
.gitignore (Yunlong Song)
- libbpf error reporting improvements, using a strerror interface to
more precisely tell the user about problems with the provided
scriptlet, be it in C or as a ready made object file (Wang Nan)
- Do not be case sensitive when searching for matching 'perf test'
entries (Arnaldo Carvalho de Melo)
- Inform the user about objdump failures in 'perf annotate' (Andi
Kleen)
- Improve the LLVM 'perf test' entry, introduce a new ones for BPF
and kbuild tests to check the environment used by clang to compile
.c scriptlets (Wang Nan)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
tools include: Add compiler.h to list.h
perf probe: Verify parameters in two functions
perf session: Add missing newlines to some pr_err() calls
perf annotate: Support full source file paths for srcline fix
perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
perf: Fix inherited events vs. tracepoint filters
perf: Disable IRQs across RCU RS CS that acquires scheduler lock
perf test: Do not be case sensitive when searching for matching tests
perf test: Add 'perf test BPF'
perf test: Enhance the LLVM tests: add kbuild test
perf test: Enhance the LLVM test: update basic BPF test program
perf bpf: Improve BPF related error messages
perf tools: Make fetch_kernel_version() publicly available
bpf tools: Add new API bpf_object__get_kversion()
bpf tools: Improve libbpf error reporting
perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
perf annotate: Inform the user about objdump failures in --stdio
perf stat: Make stat options global
perf sched latency: Fix thread pid reuse issue
...
Pull scheduler fix from Thomas Gleixner:
"A single fix to prevent math underflow in the numa balancing code"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Fix math underflow in task_tick_numa()
Pull liblockdep fixes from Thomas Gleixner:
"Three small patches to synchronize liblockdep with the latest core
changes"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/liblockdep: explicitly declare lockdep API we call from liblockdep
tools/liblockdep: add userspace versions of WRITE_ONCE and RCU_INIT_POINTER
tools/liblockdep: remove task argument from debug_check_no_locks_held
Pull x86 fixes from Thomas Gleixner:
"A couple of fixes and updates related to x86:
- Fix the W+X check regression on XEN
- The real fix for the low identity map trainwreck
- Probe legacy PIC early instead of unconditionally allocating legacy
irqs
- Add cpu verification to long mode entry
- Adjust the cache topology to AMD Fam17H systems
- Let Merrifield use the TSC across S3"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Call verify_cpu() after having entered long mode too
x86/setup: Fix low identity map for >= 2GB kernel range
x86/mm: Skip the hypervisor range when walking PGD
x86/AMD: Fix last level cache topology for AMD Fam17h systems
x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
Pull irq and timer fixes from Thomas Gleixner:
- An irq regression fix to restore the wakeup behaviour of chained
interrupts.
- A timer fix for a long standing race versus timers scheduled on a
target cpu which got exposed by recent changes in the workqueue
implementation.
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/PM: Restore system wake up from chained interrupts
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Use proper base migration in add_timer_on()
Pull MIPS updates from Ralf Baechle:
"These are the highlists of the main MIPS pull request for 4.4:
- Add latencytop support
- Support appended DTBs
- VDSO support and initially use it for gettimeofday.
- Drop the .MIPS.abiflags and ELF NOTE sections from vmlinux
- Support for the 5KE, an internal test core.
- Switch all MIPS platfroms to libata drivers.
- Improved support, cleanups for ralink and Lantiq platforms.
- Support for the new xilfpga platform.
- A number of DTB improvments for BMIPS.
- Improved support for CM and CPS.
- Minor JZ4740 and BCM47xx enhancements"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (120 commits)
MIPS: idle: add case for CPU_5KE
MIPS: Octeon: Support APPENDED_DTB
MIPS: vmlinux: create a section for appended DTB
MIPS: Clean up compat_siginfo_t
MIPS: Fix PAGE_MASK definition
MIPS: BMIPS: Enable GZIP ramdisk and timed printks
MIPS: Add xilfpga defconfig
MIPS: xilfpga: Add mipsfpga platform code
MIPS: xilfpga: Add xilfpga device tree files.
dt-bindings: MIPS: Document xilfpga bindings and boot style
MIPS: Make MIPS_CMDLINE_DTB default
MIPS: Make the kernel arguments from dtb available
MIPS: Use USE_OF as the guard for appended dtb
MIPS: BCM63XX: Use pr_* instead of printk
MIPS: Loongson: Cleanup CONFIG_LOONGSON_SUSPEND.
MIPS: lantiq: Disable xbar fpi burst mode
MIPS: lantiq: Force the crossbar to big endian
MIPS: lantiq: Initialize the USB core on boot
MIPS: lantiq: Return correct value for fpi clock on ar9
MIPS: ralink: Add missing clock on rt305x
...
Pull sound fixes from Takashi Iwai:
"Here are a collection of small fixes tha have been gathered for
4.4-rc1. The only significant changes are those in PCI drivers
Kconfig, to use "depends on" instead of "select" for CONFIG_ZONE_DMA.
A reverse select is often more user-friendly, but in this case, it
makes hard to manage with the conflict with ZONE_DEVICE, so changed in
such a way for now.
Others are all small fixes and quirks: an error check in soundcore
reigster_chrdev(), HD-audio HDMI/DP phantom jack fix, Intel Broxton DP
quirk, USB-audio DSD device quirk, some constifications, etc"
* tag 'sound-fix-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pci: depend on ZONE_DMA
ALSA: hda - Simplify phantom jack handling for HDMI/DP
ALSA: hda/hdmi - apply Skylake fix-ups to Broxton display codec
ALSA: ctxfi: constify rsc ops structures
ALSA: usb: Add native DSD support for Aune X1S
ALSA: oxfw: add an comment to Kconfig for TASCAM FireOne
sound: fix check for error condition of register_chrdev()
Pull ARC fixes from Vineet Gupta:
"Found a couple of brown paper bag bugs with the prev pull request
(including a SMP build breakage report from Guenter). Since these are
urgent I also decided to send over a bunch of other pending fixes
which could have otherwise waited an rc or two.
Summary:
- A bunch of brown paper bag bugs (MAINTAINERS list email, SMP build
failure)
- cpu_relax() now compiler barrier for UP as well
- handling of userspace Bus Errors for ARCompact builds"
* tag 'arc-4.4-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: Fix silly typo in MAINTAINERS file
ARC: cpu_relax() to be compiler barrier even for UP
ARC: use ASL assembler mnemonic
ARC: [arcompact] Handle bus error from userspace as Interrupt not exception
ARC: remove extraneous header include
ARCv2: lib: memcpy: use local symbols
ARCompact and ARCv2 only have ASL, while binutils used to support LSL as
a alias mnemonic.
Newer binutils (upstream) don't want to do that so replace it.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Bus errors from userspace on ARCompact based cores are handled by core
as a high priority L2 interrupt but current code treated it as interrupt
Handling an interrupt like exception is certainly not going to go unnoticed.
(and it worked so far as we never saw a Bus error from userspace until
IPPK guys tested a DDR controller with ECC error detection etc hence
needed to explicitly trigger/handle such errors)
- So move mem_service exception handler from common code into ARCv2 code.
- In ARCompact code, define mem_service as L2 interrupt handler which
just drops down to pure kernel mode and goes of to enqueue SIGBUS
Reported-by: Nelson Pereira <npereira@synopsys.com>
Tested-by: Ana Martins <amartins@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>