Pull perf fixes from Ingo Molnar:
"Various fixlets:
On the kernel side:
- fix a race
- fix a bug in the handling of the perf ring-buffer data page
On the tooling side:
- fix the handling of certain corrupted perf.data files
- fix a bug in 'perf probe'
- fix a bug in 'perf record + perf sched'
- fix a bug in 'make install'
- fix a bug in libaudit feature-detection on certain distros"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf session: Fix infinite loop on invalid perf.data file
perf tools: Fix installation of libexec components
perf probe: Fix to find line information for probe list
perf tools: Fix libaudit test
perf stat: Set child_pid after perf_evlist__prepare_workload()
perf tools: Add default handler for mmap2 events
perf/x86: Clean up cap_user_time* setting
perf: Fix perf_pmu_migrate_context
Pull ACPI and power management fixes from Rafael Wysocki:
- The resume part of user space driven hibernation (s2disk) is now
broken after the change that moved the creation of memory bitmaps to
after the freezing of tasks, because I forgot that the resume utility
loaded the image before freezing tasks and needed the bitmaps for
that. The fix adds special handling for that case.
- One of recent commits changed the export of acpi_bus_get_device() to
EXPORT_SYMBOL_GPL(), which was technically correct but broke existing
binary modules using that function including one in particularly
widespread use. Change it back to EXPORT_SYMBOL().
- The intel_pstate driver sometimes fails to disable turbo if its
no_turbo sysfs attribute is set. Fix from Srinivas Pandruvada.
- One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0
which still (incorrectly) treats non-NULL as non-error. Fix from
Philipp Zabel.
- The SPEAr cpufreq driver uses a wrong variable type in one place
preventing it from catching errors returned by one of the functions
called by it. Fix from Sachin Kamat.
* tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: Use EXPORT_SYMBOL() for acpi_bus_get_device()
intel_pstate: fix no_turbo
cpufreq: cpufreq-cpu0: NULL is a valid regulator, part 2
cpufreq: SPEAr: Fix incorrect variable type
PM / hibernate: Fix user space driven resume regression
While auditing the list_entry usage due to a trinity bug I found that
perf_pmu_migrate_context violates the rules for
perf_event::event_entry.
The problem is that perf_event::event_entry is a RCU list element, and
hence we must wait for a full RCU grace period before re-using the
element after deletion.
Therefore the usage in perf_pmu_migrate_context() which re-uses the
entry immediately is broken. For now introduce another list_head into
perf_event for this specific usage.
This doesn't actually fix the trinity report because that never goes
through this code.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The commit facd8b80c6
("irq: Sanitize invoke_softirq") converted irq exit
calls of do_softirq() to __do_softirq() on all architectures,
assuming it was only used there for its irq disablement
properties.
But as a side effect, the softirqs processed in the end
of the hardirq are always called on the inline current
stack that is used by irq_exit() instead of the softirq
stack provided by the archs that override do_softirq().
The result is mostly safe if the architecture runs irq_exit()
on a separate irq stack because then softirqs are processed
on that same stack that is near empty at this stage (assuming
hardirq aren't nesting).
Otherwise irq_exit() runs in the task stack and so does the softirq
too. The interrupted call stack can be randomly deep already and
the softirq can dig through it even further. To add insult to the
injury, this softirq can be interrupted by a new hardirq, maximizing
the chances for a stack overrun as reported in powerpc for example:
do_IRQ: stack overflow: 1920
CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
Call Trace:
[c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
[c0000000050a8810] .dump_stack+0x28/0x3c
[c0000000050a8880] .do_IRQ+0x2b8/0x2c0
[c0000000050a8930] hardware_interrupt_common+0x154/0x180
--- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
LR = .cp_start_xmit+0x390/0x820 [8139cp]
[c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
[c0000000050a8e00] .sch_direct_xmit+0x110/0x260
[c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
[c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
[c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
[c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
[c0000000050a9130] .dev_queue_xmit+0x428/0x630
[c0000000050a91d0] .ip_finish_output+0x2a4/0x550
[c0000000050a9290] .ip_local_out+0x50/0x70
[c0000000050a9310] .ip_queue_xmit+0x148/0x420
[c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
[c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
[c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
[c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
[c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
[c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
[c0000000050a9840] .ip_rcv_finish+0x148/0x400
[c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
[c0000000050a99d0] .netif_receive_skb+0x44/0x110
[c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
[c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
[c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
[c0000000050a9c70] .nf_iterate+0x114/0x130
[c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
[c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
[c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
[c0000000050a9fa0] .netif_receive_skb+0x44/0x110
[c0000000050aa040] .napi_gro_receive+0xe8/0x120
[c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
[c0000000050aa1d0] .net_rx_action+0x1dc/0x310
[c0000000050aa2b0] .__do_softirq+0x158/0x330
[c0000000050aa3b0] .irq_exit+0xc8/0x110
[c0000000050aa430] .do_IRQ+0xdc/0x2c0
[c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
--- Exception: 501 at .bad_range+0x1c/0x110
LR = .get_page_from_freelist+0x908/0xbb0
[c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
[c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
[c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
[c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
[c0000000050aac60] .handle_pte_fault+0x814/0xb70
[c0000000050aad50] .__get_user_pages+0x1a4/0x640
[c0000000050aae60] .get_user_pages_fast+0xec/0x160
[c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
[c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
[c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
[c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
[c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
[c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm]
[c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
[c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
[c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
[c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
[c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
[c0000000050abd80] .SyS_ioctl+0xd4/0xf0
[c0000000050abe30] syscall_exit+0x0/0x98
Since this is a regression, this patch proposes a minimalistic
and low-risk solution by blindly forcing the hardirq exit processing of
softirqs on the softirq stack. This way we should reduce significantly
the opportunities for task stack overflow dug by softirqs.
Longer term solutions may involve extending the hardirq stack coverage to
irq_exit(), etc...
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: #3.9.. <stable@vger.kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
"case 0" in free_pid() assumes that disable_pid_allocation() should
clear PIDNS_HASH_ADDING before the last pid goes away.
However this doesn't happen if the first fork() fails to create the
child reaper which should call disable_pid_allocation().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
dereference happens upon core dump because argv_split("") returns
argv[0] == NULL.
This bug was once fixed by commit 264b83c07a ("usermodehelper: check
subprocess_info->path != NULL") but was by error reintroduced by commit
7f57cfa4e2 ("usermodehelper: kill the sub_info->path[0] check").
This bug seems to exist since 2.6.19 (the version which core dump to
pipe was added). Depending on kernel version and config, some side
effect might happen immediately after this oops (e.g. kernel panic with
2.6.32-358.18.1.el6).
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recent commit 8fd37a4 (PM / hibernate: Create memory bitmaps after
freezing user space) broke the resume part of the user space driven
hibernation (s2disk), because I forgot that the resume utility
loaded the image into memory without freezing user space (it still
freezes tasks after loading the image). This means that during user
space driven resume we need to create the memory bitmaps at the
"device open" time rather than at the "freeze tasks" time, so make
that happen (that's a special case anyway, so it needs to be treated
in a special way).
Reported-and-tested-by: Ronald <ronald645@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull scheduler, timer and x86 fixes from Ingo Molnar:
- A context tracking ARM build and functional fix
- A handful of ARM clocksource/clockevent driver fixes
- An AMD microcode patch level sysfs reporting fixlet
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm: Fix build error with context tracking calls
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
clocksource: of: Respect device tree node status
clocksource: exynos_mct: Set IRQ affinity when the CPU goes online
arm: clocksource: mvebu: Use the main timer as clock source from DT
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Fix patch level reporting for family 15h
Pull scheduler fixes from Ingo Molnar:
"Three small fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/balancing: Fix cfs_rq->task_h_load calculation
sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
Pull perf fixes from Ingo Molnar:
"Assorted standalone fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add model number for Avoton Silvermont
perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()
perf: Update ABI comment
tools lib lk: Uninclude linux/magic.h in debugfs.c
perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms()
perf probe: Fix finder to find lines of given function
perf session: Check for SIGINT in more loops
perf tools: Fix compile with libelf without get_phdrnum
perf tools: Fix buildid cache handling of kallsyms with kcore
perf annotate: Fix objdump line parsing offset validation
perf tools: Fill in new definitions for madvise()/mmap() flags
perf tools: Sharpen the libaudit dependencies test
Commit 1b3a5d02ee ("reboot: move arch/x86 reboot= handling to generic
kernel") did some cleanup for reboot= command line, but it made the
reboot_default inoperative.
The default value of variable reboot_default should be 1, and if command
line reboot= is not set, system will use the default reboot mode.
[akpm@linux-foundation.org: fix comment layout]
Signed-off-by: Li Fei <fei.li@intel.com>
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Robin Holt <robinmholt@linux.com>
Cc: <stable@vger.kernel.org> [3.11.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
watchdog_tresh controls how often nmi perf event counter checks per-cpu
hrtimer_interrupts counter and blows up if the counter hasn't changed
since the last check. The counter is updated by per-cpu
watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh
period which guarantees that hrtimer is scheduled 2 times per the main
period. Both hrtimer and perf event are started together when the
watchdog is enabled.
So far so good. But...
But what happens when watchdog_thresh is updated from sysctl handler?
proc_dowatchdog will set a new sampling period and hrtimer callback
(watchdog_timer_fn) will use the new value in the next round. The
problem, however, is that nobody tells the perf event that the sampling
period has changed so it is ticking with the period configured when it
has been set up.
This might result in an ear ripping dissonance between perf and hrtimer
parts if the watchdog_thresh is increased. And even worse it might lead
to KABOOM if the watchdog is configured to panic on such a spurious
lockup.
This patch fixes the issue by updating both nmi perf even counter and
hrtimers if the threshold value has changed.
The nmi one is disabled and then reinitialized from scratch. This has
an unpleasant side effect that the allocation of the new event might
fail theoretically so the hard lockup detector would be disabled for
such cpus. On the other hand such a memory allocation failure is very
unlikely because the original event is deallocated right before.
It would be much nicer if we just changed perf event period but there
doesn't seem to be any API to do that right now. It is also unfortunate
that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
cannot use on_each_cpu() and do the same thing from the per-cpu context.
The update from the current CPU should be safe because
perf_event_disable removes the event atomically before it clears the
per-cpu watchdog_ev so it cannot change anything under running handler
feet.
The hrtimer is simply restarted (thanks to Don Zickus who has pointed
this out) if it is queued because we cannot rely it will fire&adopt to
the new sampling period before a new nmi event triggers (when the
treshold is decreased).
[akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_dowatchdog doesn't synchronize multiple callers which might lead to
confusion when two parallel callers might confuse watchdog_enable_all_cpus
resp watchdog_disable_all_cpus (eg watchdog gets enabled even if
watchdog_thresh was set to 0 already).
This patch adds a local mutex which synchronizes callers to the sysctl
handler.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In busiest->group_imb case we can come to fix_small_imbalance() with
local->avg_load > busiest->avg_load. This can result in wrong imbalance
fix-up, because there is the following check there where all the
members are unsigned:
if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
(scaled_busy_load_per_task * imbn)) {
env->imbalance = busiest->load_per_task;
return;
}
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix it by substituting the subtraction with an equivalent addition in
the check.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows
env->imbalance = min(
max_pull * busiest->group_power,
(sds->avg_load - local->avg_load) * local->group_power) / SCHED_POWER_SCALE;
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Solve the problems around the broken definition of perf_event_mmap_page::
cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
fixed by:
860f085b74 ("perf: Fix broken union in 'struct perf_event_mmap_page'")
The problem with the fix (merged in v3.12-rc1 and not yet released
officially), noticed by Vince Weaver is that the new behavior is
not detectable by new user-space, and that due to the reuse of the
field names it's easy to mis-compile a binary if old headers are used
on a new kernel or new headers are used on an old kernel.
To solve all that make this change explicit, detectable and self-contained,
by iterating the ABI the following way:
- Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
confuse old user-space binaries. RDPMC will be marked as unavailable
to old binaries but that's within the ABI, this is a capability bit.
- Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
libraries can reliably detect that bit 0 is deprecated and perma-zero
without having to check the kernel version.
- Use bits 2, 3, 4 for the newly defined, correct functionality:
cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
cap_user_time : 1, /* The time_* fields are used */
cap_user_time_zero : 1, /* The time_zero field is used */
- Rename all the bitfield names in perf_event.h to be different from the
old names, to make sure it's not possible to mis-compile it
accidentally with old assumptions.
The 'size' field can then be used in the future to add new fields and it
will act as a natural ABI version indicator as well.
Also adjust tools/perf/ userspace for the new definitions, noticed by
Adrian Hunter.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Also-Fixed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer fix from Ingo Molnar:
"An NTP related lockup fix"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix HRTICK related deadlock from ntp lock changes
Pull scheduler fixes from Ingo Molnar:
"Misc fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix comment for sched_info_depart
sched/Documentation: Update sched-design-CFS.txt documentation
sched/debug: Take PID namespace into account
sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones