mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
64a43323e429d2b183fc02451b3201cce873807d
18328 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
40f6123737 |
Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children() |
||
|
|
a805cbf4c4 |
Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo: "Two workqueue fixes. Both are one liners. One fixes missing uevent for workqueue files on sysfs. The other one fixes missing zeroing of NUMA cpu masks which can lead to oopses among other things" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: zero cpumask of wq_numa_possible_cpumask on init workqueue: fix dev_set_uevent_suppress() imbalance |
||
|
|
5a6024f160 |
workqueue: zero cpumask of wq_numa_possible_cpumask on init
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.
When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:
3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }
But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
549f11c9f0 |
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
which I introduced in the last round of cleanups :("
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix memory leak when calling irq_free_hwirqs()
irqchip: spear_shirq: Fix interrupt offset
irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
irqchip: armada-370-xp: Mask all interrupts during initialization.
|
||
|
|
8844aad89e |
genirq: Fix memory leak when calling irq_free_hwirqs()
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.
Fixes:
|
||
|
|
ef34c6ce49 |
Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
where running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and
clean ups that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when
ftrace was the only user, but now that it's used by perf and kprobes,
this is a bug where reading trace can disable kprobes and perf. A
very unexpected side effect and should be fixed"
* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove ftrace_stop/start() from reading the trace file
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
tracing/uprobes: Revert "Support mix of ftrace and perf"
|
||
|
|
d18bbc215f |
kernel/printk/printk.c: revert "printk: enable interrupts before calling console_trylock_for_printk()"
Revert commit |
||
|
|
76bb5ab8f6 |
cpuset: break kernfs active protection in cpuset_write_resmask()
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
migrating tasks off a cpuset after new resources are added to it.
As cpuset_hotplug_work calls into cgroup core via
cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
core locking from cpuset_write_resmak(). This used to be okay because
cgroup interface files were protected by a different mutex; however,
|
||
|
|
099ed15167 |
tracing: Remove ftrace_stop/start() from reading the trace file
Disabling reading and writing to the trace file should not be able to disable all function tracing callbacks. There's other users today (like kprobes and perf). Reading a trace file should not stop those from happening. Cc: stable@vger.kernel.org # 3.0+ Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> |
||
|
|
fb6bab6a5a |
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
The usage of uprobe_buffer_enable() added by |
||
|
|
f786106e80 |
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
I do not know why
|
||
|
|
06d0713904 |
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure
that nobody tries to play with the dead uprobe/consumer. This helps
to catch the bugs like the one fixed by the previous patch.
In the longer term we should fix this poorly designed interface.
uprobe_register() should return "struct uprobe *" which should be
passed to apply/unregister. Plus other semantic changes, see the
changelog in commit
|
||
|
|
4821254206 |
tracing/uprobes: Revert "Support mix of ftrace and perf"
This reverts commit
|
||
|
|
3a32bd72d7 |
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.
This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.
tj: Updated comments. Renamed @sb to @pinned_sb.
Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
970317aa48 |
cgroup: fix mount failure in a corner case
# cat test.sh #! /bin/bash mount -t cgroup -o cpu xxx /cgroup umount /cgroup mount -t cgroup -o cpu,cpuacct xxx /cgroup umount /cgroup # ./test.sh mount: xxx already mounted or /cgroup busy mount: according to mtab, xxx is already mounted on /cgroup It's because the cgroupfs_root of the first mount was under destruction asynchronously. Fix this by delaying and then retrying mount for this case. v3: - put the refcnt immediately after getting it. (Tejun) v2: - use percpu_ref_tryget_live() rather that introducing percpu_ref_alive(). (Tejun) - adjust comment. tj: Updated the comment a bit. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
391acf970d |
cpuset,mempolicy: fix sleeping function called from invalid context
When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@vger.kernel.org> |
||
|
|
b8e46d22dc |
Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and fixes from Steven Rostedt: "This includes three patches from Oleg Nesterov. The first is a fix to a race condition that happens between enabling/disabling syscall tracepoints and new process creations (the check to go into the ptrace path for a process can be set when it shouldn't, or not set when it should). Not a major bug but one that should be fixed and even applied to stable. The other two patches are cleanup/fixes that are not that critical, but for an -rc1 release would be nice to have. They both deal with syscall tracepoints. It also includes a patch to introduce a new macro for the TRACE_EVENT() format called __field_struct(). Originally, __field() was used to record any variable into a trace event, but with the addition of setting the "is signed" attribute, the check causes anything but a primitive variable to fail to compile. That is, structs and unions can't be used as they once were. When the "is signed" check was introduce there were only primitive variables being recorded. But that will change soon and it was reported that __field() causes build failures. To solve the __field() issue, __field_struct() is introduced to allow trace_events to be able to record complex types too" * tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add __field_struct macro for TRACE_EVENT() tracing: syscall_regfunc() should not skip kernel threads tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread() tracing: Fix syscall_*regfunc() vs copy_process() race |
||
|
|
ed235875e2 |
kernel/watchdog.c: print traces for all cpus on lockup detection
A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than a predefined period to time, without giving other tasks a chance to run. Currently, upon detection of this condition by the per-cpu watchdog task, debug information (including a stack trace) is sent to the system log. On some occasions, we have observed that the "victim" rather than the actual "culprit" (i.e. the owner/holder of the contended resource) is reported to the user. Often this information has proven to be insufficient to assist debugging efforts. To avoid loss of useful debug information, for architectures which support NMI, this patch makes it possible to improve soft lockup reporting. This is accomplished by issuing an NMI to each cpu to obtain a stack trace. If NMI is not supported we just revert back to the old method. A sysctl and boot-time parameter is available to toggle this feature. [dzickus@redhat.com: add CONFIG_SMP in certain areas] [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations] [mq@suse.cz: fix warning] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7cd2b0a34a |
mm, pcp: allow restoring percpu_pagelist_fraction default
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
bde92cf455 |
kernel/watchdog.c: remove preemption restrictions when restarting lockup detector
Peter Wu noticed the following splat on his machine when updating /proc/sys/kernel/watchdog_thresh: BUG: sleeping function called from invalid context at mm/slub.c:965 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180 #1: (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110 #2: (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80 Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110 CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x4e/0x7a __might_sleep+0x11d/0x190 kmem_cache_alloc_trace+0x4e/0x1e0 perf_event_alloc+0x55/0x440 perf_event_create_kernel_counter+0x26/0xe0 watchdog_nmi_enable+0x75/0x140 update_timers_all_cpus+0x53/0xa0 proc_dowatchdog+0xe4/0x110 proc_sys_call_handler+0xb3/0xc0 proc_sys_write+0x14/0x20 vfs_write+0xad/0x180 SyS_write+0x49/0xb0 system_call_fastpath+0x16/0x1b NMI watchdog: disabled (cpu0): hardware events not enabled What happened is after updating the watchdog_thresh, the lockup detector is restarted to utilize the new value. Part of this process involved disabling preemption. Once preemption was disabled, perf tried to allocate a new event (as part of the restart). This caused the above BUG_ON as you can't sleep with preemption disabled. The preemption restriction seemed agressive as we are not doing anything on that particular cpu, but with all the online cpus (which are protected by the get_online_cpus lock). Remove the restriction and the BUG_ON goes away. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reported-by: Peter Wu <peter@lekensteyn.nl> Tested-by: Peter Wu <peter@lekensteyn.nl> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b3acc56bfe |
kexec: save PG_head_mask in VMCOREINFO
To allow filtering of huge pages, makedumpfile must be able to identify them in the dump. This can be done by checking the appropriate page flag, so communicate its value to makedumpfile through the VMCOREINFO interface. There's only one small catch. Depending on how many page flags are available on a given architecture, this bit can be called PG_head or PG_compound. I sent a similar patch back in 2012, but Eric Biederman did not like using an #ifdef. So, this time I'm adding a common symbol (PG_head_mask) instead. See https://lkml.org/lkml/2012/11/28/91 for the previous version. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Shaohua Li <shli@kernel.org> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8d056c48e4 |
CPU hotplug, smp: flush any pending IPI callbacks before CPU offline
There is a race between the CPU offline code (within stop-machine) and
the smp-call-function code, which can lead to getting IPIs on the
outgoing CPU, *after* it has gone offline.
Specifically, this can happen when using
smp_call_function_single_async() to send the IPI, since this API allows
sending asynchronous IPIs from IRQ disabled contexts. The exact race
condition is described below.
During CPU offline, in stop-machine, we don't enforce any rule in the
_DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
the other CPUs disable their local interrupts. Due to this, we can
encounter a situation in which an IPI is sent by one of the other CPUs
to the outgoing CPU (while it is *still* online), but the outgoing CPU
ends up noticing it only *after* it has gone offline.
CPU 1 CPU 2
(Online CPU) (CPU going offline)
Enter _PREPARE stage Enter _PREPARE stage
Enter _DISABLE_IRQ stage
=
Got a device interrupt, and | Didn't notice the IPI
the interrupt handler sent an | since interrupts were
IPI to CPU 2 using | disabled on this CPU.
smp_call_function_single_async() |
=
Enter _DISABLE_IRQ stage
Enter _RUN stage Enter _RUN stage
=
Busy loop with interrupts | Invoke take_cpu_down()
disabled. | and take CPU 2 offline
=
Enter _EXIT stage Enter _EXIT stage
Re-enable interrupts Re-enable interrupts
The pending IPI is noted
immediately, but alas,
the CPU is offline at
this point.
This of course, makes the smp-call-function IPI handler code running on
CPU 2 unhappy and it complains about "receiving an IPI on an offline
CPU".
One real example of the scenario on CPU 1 is the block layer's
complete-request call-path:
__blk_complete_request() [interrupt-handler]
raise_blk_irq()
smp_call_function_single_async()
However, if we look closely, the block layer does check that the target
CPU is online before firing the IPI. So in this case, it is actually
the unfortunate ordering/timing of events in the stop-machine phase that
leads to receiving IPIs after the target CPU has gone offline.
In reality, getting a late IPI on an offline CPU is not too bad by
itself (this can happen even due to hardware latencies in IPI
send-receive). It is a bug only if the target CPU really went offline
without executing all the callbacks queued on its list. (Note that a
CPU is free to execute its pending smp-call-function callbacks in a
batch, without waiting for the corresponding IPIs to arrive for each one
of those callbacks).
So, fixing this issue can be broken up into two parts:
1. Ensure that a CPU goes offline only after executing all the
callbacks queued on it.
2. Modify the warning condition in the smp-call-function IPI handler
code such that it warns only if an offline CPU got an IPI *and* that
CPU had gone offline with callbacks still pending in its queue.
Achieving part 1 is straight-forward - just flush (execute) all the
queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
including those callbacks for which the source CPU's IPIs might not have
been received on the outgoing CPU yet. Once we do this, an IPI that
arrives late on the CPU going offline (either due to the race mentioned
above, or due to hardware latencies) will be completely harmless, since
the outgoing CPU would have executed all the queued callbacks before
going offline.
Overall, this fix (parts 1 and 2 put together) additionally guarantees
that we will see a warning only when the *IPI-sender code* is buggy -
that is, if it queues the callback _after_ the target CPU has gone
offline.
[1]. The CPU_DYING part needs a little more explanation: by the time we
execute the CPU_DYING notifier callbacks, the CPU would have already
been marked offline. But we want to flush out the pending callbacks at
this stage, ignoring the fact that the CPU is offline. So restructure
the IPI handler code so that we can by-pass the "is-cpu-offline?" check
in this particular case. (Of course, the right solution here is to fix
CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
notifiers, but this requires a lot of audit to ensure that this change
doesn't break any existing code; hence lets go with the solution
proposed above until that is done).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
bddbceb688 |
workqueue: fix dev_set_uevent_suppress() imbalance
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
401c58fcbb |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar: "This is larger than usual: the main reason are the ARM symbol lookup speedups that came in late and were hard to resist. There's also a kprobes fix and various tooling fixes, plus the minimal re-enablement of the mmap2 support interface" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/kprobes: Fix build errors and blacklist context_track_user perf tests: Add test for closing dso objects on EMFILE error perf tests: Add test for caching dso file descriptors perf tests: Allow reuse of test_file function perf tests: Spawn child for each test perf tools: Add dso__data_* interface descriptons perf tools: Allow to close dso fd in case of open failure perf tools: Add file size check and factor dso__data_read_offset perf tools: Cache dso data file descriptor perf tools: Add global count of opened dso objects perf tools: Add global list of opened dso objects perf tools: Add data_fd into dso object perf tools: Separate dso data related variables perf tools: Cache register accesses for unwind processing perf record: Fix to honor user freq/interval properly perf timechart: Reflow documentation perf probe: Improve error messages in --line option perf probe: Improve an error message of perf probe --vars mode perf probe: Show error code and description in verbose mode perf probe: Improve error message for unknown member of data structure ... |
||
|
|
7b08d618a2 |
Merge branch 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rtmutex fixes from Thomas Gleixner: "Another three patches to make the rtmutex code more robust. That's the last urgent fallout from the big futex/rtmutex investigation" * 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Plug slow unlock race rtmutex: Detect changes in the pi lock chain rtmutex: Handle deadlock detection smarter |