Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants. This patch adds the following
three macros.
* cgroup_for_each_child() - walk immediate children of a cgroup.
* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
in pre-order tree traversal.
* cgroup_for_each_descendant_post() - visit all descendants of a
cgroup in post-order tree traversal.
All three only require the user to hold RCU read lock during
traversal. Verifying that each iterated cgroup is online is the
responsibility of the user. When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in reliable way. See comments for details.
v2: s/config/state/ in commit message and comments per Michal. More
documentation on synchronization rules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Use RCU safe list operations for cgroup->children. This will be used
to implement cgroup children / descendant walking which can be used by
controllers.
Note that cgroup_create() now puts a new cgroup at the end of the
->children list instead of head. This isn't strictly necessary but is
done so that the iteration order is more conventional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocatinos successfully and is
considered "live" by cgroup.
This becomes a problem later when we add generic descendants walking
to cgroup which can be used by controllers as controllers don't have a
synchronization point where it can synchronize against new cgroups
appearing in such walks.
This patch adds ->post_create(). It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().
When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance. It will be explained with the descendant iterators.
v2: Added a paragraph about its future use w/ descendant iterators per
Michal.
v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top. This pull created a trivial conflict between the
following two commits.
8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")
The former added a field to cgroup_subsys and the latter removed one
from it. They happen to be colocated causing the conflict. Keeping
what's added and removing what's removed resolves the conflict.
Signed-off-by: Tejun Heo <tj@kernel.org>
CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working. cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.
Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").
Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary. Remove it and all the mechanisms supporting it. Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal. All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.
This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).
Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism. css_tryget() simply can look at the
reference count and fail if it's deactivated - it's never getting
re-activated.
This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated. As deactivation and removal are a single
step now, they no longer need to be protected against css_tryget()
happening from irq context. Remove local_irq_disable/enable() from
cgroup_rmdir().
Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c. We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.
v2: Comment updated and explanation on local_irq_disable/enable()
added per Michal Hocko.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs. This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.
* Conditionals dependent on __DEPRECATED_clear_css_refs removed.
* cgroup_clear_css_refs() can no longer fail. All that needs to be
done are deactivating refcnts, setting CSS_REMOVED and putting the
base reference on each css. Remove cgroup_clear_css_refs() and the
failure path, and open-code the loops into cgroup_rmdir().
This patch keeps the two for_each_subsys() loops separate while open
coding them. They can be merged now but there are scheduled changes
which need them to be separate, so keep them separate to reduce the
amount of churn.
local_irq_save/restore() from cgroup_clear_css_refs() are replaced
with local_irq_disable/enable() for simplicity. This is safe as
cgroup_rmdir() is always called with IRQ enabled. Note that this IRQ
switching is necessary to ensure that css_tryget() isn't called from
IRQ context on the same CPU while lower context is between CSS
deactivation and setting CSS_REMOVED as css_tryget() would hang
forever in such cases waiting for CSS to be re-activated or
CSS_REMOVED set. This will go away soon.
v2: cgroup_call_pre_destroy() removal dropped per Michal. Commit
message updated to explain local_irq_disable/enable() conversion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
This hashtable implementation is using hlist buckets to provide a simple
hashtable to prevent it from getting reimplemented all over the kernel.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
[ Merging this now, so that subsystems can start applying Sasha's
patches that use this - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit b3356bf0db (KVM: emulator: optimize "rep ins" handling),
the pieces of io data can be collected and write them to the guest memory
or MMIO together
Unfortunately, kvm splits the mmio access into 8 bytes and store them to
vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it
will cause vcpu->mmio_fragments overflow
The bug can be exposed by isapc (-M isapc):
[23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ ......]
[23154.858083] Call Trace:
[23154.859874] [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677] [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604] [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]
Actually, we can use one mmio_fragment to store a large mmio access then
split it when we pass the mmio-exit-info to userspace. After that, we only
need two entries to store mmio info for the cross-mmio pages access
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull md fixes from NeilBrown:
"Some fixes for md in 3.7
- one recently introduced crash for dm-raid10 with discard
- one bug in new functionality that has been around for a few
releases.
- minor bug in md's 'faulty' personality
and UAPI disintegration for md."
* tag 'md-3.7-fixes' of git://neil.brown.name/md:
MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c
md/raid1: Fix assembling of arrays containing Replacements.
md faulty: use disk_stack_limits()
UAPI: (Scripted) Disintegrate include/linux/raid
Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
This is an optimization. The RCU-protected region is very small, so
there will be no latency problems if we disable preempt in this region.
So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
to preempt_disable / preempt_disable. It is smaller (and supposedly
faster) than preemptible rcu_read_lock / rcu_read_unlock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces new barrier pair light_mb() and heavy_mb() for
percpu rw semaphores.
This patch fixes a bug in percpu-rw-semaphores where a barrier was
missing in percpu_up_write.
This patch improves performance on the read path of
percpu-rw-semaphores: on non-x86 cpus, there was a smp_mb() in
percpu_up_read. This patch changes it to a compiler barrier and removes
the "#if defined(X86) ..." condition.
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"This is what we usually expect at this stage of the game, lots of
little things, mostly in drivers. With the occasional 'oops didn't
mean to do that' kind of regressions in the core code."
1) Uninitialized data in __ip_vs_get_timeouts(), from Arnd Bergmann
2) Reject invalid ACK sequences in Fast Open sockets, from Jerry Chu.
3) Lost error code on return from _rtl_usb_receive(), from Christian
Lamparter.
4) Fix reset resume on USB rt2x00, from Stanislaw Gruszka.
5) Release resources on error in pch_gbe driver, from Veaceslav Falico.
6) Default hop limit not set correctly in ip6_template_metrics[], fix
from Li RongQing.
7) Gianfar PTP code requests wrong kind of resource during probe, fix
from Wei Yang.
8) Fix VHOST net driver on big-endian, from Michael S Tsirkin.
9) Mallenox driver bug fixes from Jack Morgenstein, Or Gerlitz, Moni
Shoua, Dotan Barak, and Uri Habusha.
10) usbnet leaks memory on TX path, fix from Hemant Kumar.
11) Use socket state test, rather than presence of FIN bit packet, to
determine FIONREAD/SIOCINQ value. Fix from Eric Dumazet.
12) Fix cxgb4 build failure, from Vipul Pandya.
13) Provide a SYN_DATA_ACKED state to complement SYN_FASTOPEN in socket
info dumps. From Yuchung Cheng.
14) Fix leak of security path in kfree_skb_partial(). Fix from Eric
Dumazet.
15) Handle RX FIFO overflows more resiliently in pch_gbe driver, from
Veaceslav Falico.
16) Fix MAINTAINERS file pattern for networking drivers, from Jean
Delvare.
17) Add iPhone5 IDs to IPHETH driver, from Jay Purohit.
18) VLAN device type change restriction is too strict, and should not
trigger for the automatically generated vlan0 device. Fix from Jiri
Pirko.
19) Make PMTU/redirect flushing work properly again in ipv4, from
Steffen Klassert.
20) Fix memory corruptions by using kfree_rcu() in netlink_release().
From Eric Dumazet.
21) More qmi_wwan device IDs, from Bjørn Mork.
22) Fix unintentional change of SNAT/DNAT hooks in generic NAT
infrastructure, from Elison Niven.
23) Fix 3.6.x regression in xt_TEE netfilter module, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
tilegx: fix some issues in the SW TSO support
qmi_wwan/cdc_ether: move Novatel 551 and E362 to qmi_wwan
net: usb: Fix memory leak on Tx data path
net/mlx4_core: Unmap UAR also in the case of error flow
net/mlx4_en: Don't use vlan tag value as an indication for vlan presence
net/mlx4_en: Fix double-release-range in tx-rings
bas_gigaset: fix pre_reset handling
vhost: fix mergeable bufs on BE hosts
gianfar_ptp: use iomem, not ioports resource tree in probe
ipv6: Set default hoplimit as zero.
NET_VENDOR_TI: make available for am33xx as well
pch_gbe: fix error handling in pch_gbe_up()
b43: Fix oops on unload when firmware not found
mwifiex: clean up scan state on error
mwifiex: return -EBUSY if specific scan request cannot be honored
brcmfmac: fix potential NULL dereference
Revert "ath9k_hw: Updated AR9003 tx gain table for 5GHz"
ath9k_htc: Add PID/VID for a Ubiquiti WiFiStation
rt2x00: usb: fix reset resume
rtlwifi: pass rx setup error code to caller
...
try_to_freeze_tasks() and cgroup_freezer rely on scheduler locks
to ensure that a task doing STOPPED/TRACED -> RUNNING transition
can't escape freezing. This mostly works, but ptrace_stop() does
not necessarily call schedule(), it can change task->state back to
RUNNING and check freezing() without any lock/barrier in between.
We could add the necessary barrier, but this patch changes
ptrace_stop() and do_signal_stop() to use freezable_schedule().
This fixes the race, freezer_count() and freezer_should_skip()
carefully avoid the race.
And this simplifies the code, try_to_freeze_tasks/update_if_frozen
no longer need to use task_is_stopped_or_traced() checks with the
non trivial assumptions. We can rely on the mechanism which was
specially designed to mark the sleeping task as "frozen enough".
v2: As Tejun pointed out, we can also change get_signal_to_deliver()
and move try_to_freeze() up before 'relock' label.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull staging driver fixes from Greg Kroah-Hartman:
"Here are some staging driver fixes for your 3.7-rc tree.
Nothing major here, a number of iio driver fixups that were causing
problems, some comedi driver bugfixes, and a bunch of tidspbridge
warning squashing and other regressions fixed from the 3.6 release.
All have been in the linux-next releases for a bit.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'staging-3.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (32 commits)
staging: tidspbridge: delete unused mmu functions
staging: tidspbridge: ioremap physical address of the stack segment in shm
staging: tidspbridge: ioremap dsp sync addr
staging: tidspbridge: change type to __iomem for per and core addresses
staging: tidspbridge: drop const from custom mmu implementation
staging: tidspbridge: request the right irq for mmu
staging: ipack: add missing include (implicit declaration of function 'kfree')
staging: ramster: depends on NET
staging: omapdrm: fix allocation size for page addresses array
staging: zram: Fix handling of incompressible pages
Staging: android: binder: Allow using highmem for binder buffers
Staging: android: binder: Fix memory leak on thread/process exit
staging: comedi: ni_labpc: fix possible NULL deref during detach
staging: comedi: das08: fix possible NULL deref during detach
staging: comedi: amplc_pc263: fix possible NULL deref during detach
staging: comedi: amplc_pc236: fix possible NULL deref during detach
staging: comedi: amplc_pc236: fix invalid register access during detach
staging: comedi: amplc_dio200: fix possible NULL deref during detach
staging: comedi: 8255_pci: fix possible NULL deref during detach
staging: comedi: ni_daq_700: fix dio subdevice regression
...
Pull driver core fixes from Greg Kroah-Hartman:
"Here are a number of firmware core fixes for 3.7, and some other minor
fixes. And some documentation updates thrown in for good measure.
All have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Documentation:Chinese translation of Documentation/arm64/memory.txt
Documentation:Chinese translation of Documentation/arm64/booting.txt
Documentation:Chinese translation of Documentation/IRQ.txt
firmware loader: document kernel direct loading
sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
dynamic_debug: Remove unnecessary __used
firmware loader: sync firmware cache by async_synchronize_full_domain
firmware loader: let direct loading back on 'firmware_buf'
firmware loader: fix one reqeust_firmware race
firmware loader: cancel uncache work before caching firmware
Pull char/misc driver fixes from Greg Kroah-Hartman:
"Here are some driver fixes for 3.7. They include extcon driver fixes,
a hyper-v bugfix, and two other minor driver fixes.
All of these have been in the linux-next releases for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'char-misc-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
sonypi: suspend/resume callbacks should be conditionally compiled on CONFIG_PM_SLEEP
Drivers: hv: Cleanup error handling in vmbus_open()
extcon : register for cable interest by cable name
extcon: trivial: kfree missed from remove path
extcon: driver model release call not needed
extcon: MAX77693: Add platform data for MUIC device to initialize registers
extcon: max77693: Use max77693_update_reg for rmw operations
extcon: Fix kerneldoc for extcon_set_cable_state and extcon_set_cable_state_
extcon: adc-jack: Add missing MODULE_LICENSE
extcon: adc-jack: Fix checking return value of request_any_context_irq
extcon: Fix return value in extcon_register_interest()
extcon: unregister compat link on cleanup
extcon: Unregister compat class at module unload to fix oops
extcon: optimising the check_mutually_exclusive function
extcon: standard cable names definition and declaration changed
extcon-max8997: remove usage of ret in max8997_muic_handle_charger_type_detach
extcon: Remove duplicate inclusion of extcon.h header file
Pull x86 fixes from Ingo Molnar:
"This fixes a couple of nasty page table initialization bugs which were
causing kdump regressions. A clean rearchitecturing of the code is in
the works - meanwhile these are reverts that restore the
best-known-working state of the kernel.
There's also EFI fixes and other small fixes."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, mm: Undo incorrect revert in arch/x86/mm/init.c
x86: efi: Turn off efi_enabled after setup on mixed fw/kernel
x86, mm: Find_early_table_space based on ranges that are actually being mapped
x86, mm: Use memblock memory loop instead of e820_RAM
x86, mm: Trim memory in memblock to be page aligned
x86/irq/ioapic: Check for valid irq_cfg pointer in smp_irq_move_cleanup_interrupt
x86/efi: Fix oops caused by incorrect set_memory_uc() usage
x86-64: Fix page table accounting
Revert "x86/mm: Fix the size calculation of mapping tables"
MAINTAINERS: Add EFI git repository location
Pull perf fixes from Ingo Molnar:
"Most of the kernel diffstat relates to a group of Intel P6 and KNC
(Xeon-Phi Knights Corner) PMU driver fixes, neither of which is in
heavy use, so we took the fixes.
The rest is diverse smallish fixes to the tooling and kernel side."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Remove unused variable in nhmex_rbox_alter_er()
perf/x86: Enable overflow on Intel KNC with a custom knc_pmu_handle_irq()
perf/x86: Remove cpuc->enable check on Intl KNC event enable/disable
perf/x86: Make Intel KNC use full 40-bit width of counters
perf/x86/uncore: Handle pci_read_config_dword() errors
perf/x86: Remove P6 cpuc->enabled check
perf/x86: Update/fix generic events on P6 PMU
perf/x86: Fix P6 FP_ASSIST event constraint
perf, cpu hotplug: Use cached value of smp_processor_id()
perf, cpu hotplug: Run CPU_STARTING notifiers with irqs disabled
x86/perf: Fix virtualization sanity check
perf test: Fix exclude_guest parse events tests
perf tools: do not flush maps on COMM for perf report
perf help: Fix --help for builtins
perf trace: Check if sample raw_data field is set
perf trace: Validate syscall id before growing syscall table
rb_erase_augmented() is a static function annotated with
__always_inline. This causes a compile failure when attempting to use
the rbtree implementation as a library (e.g. kvm tool):
rbtree_augmented.h:125:24: error: expected `=', `,', `;', `asm' or `__attribute__' before `void'
Include linux/compiler.h in rbtree_augmented.h so that the __always_inline
macro is resolved correctly.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull spi fixes from Mark Brown:
"A bunch of fixes here, mostly minor except for the pl022 which has
just been a bit of a shambles all round, the recent runtime PM changes
have as far as I can tell never worked so they're just getting thrown
out."
* tag 'spi-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc:
spi/pl022: Revert recent runtime PM changes
spi: tsc2005: delete soon-obsolete e-mail address
spi: spi-rspi: fix build error for the latest shdma driver