Pull perf updates from Ingo Molnar:
"This update has mostly fixes, but also other bits:
- perf tooling fixes
- PMU driver fixes
- Intel Broadwell PMU driver HW-enablement for LBR callstacks
- a late coming 'perf kmem' tool update that enables it to also
analyze page allocation data. Note, this comes with MM tracepoint
changes that we believe to not break anything: because it changes
the formerly opaque 'struct page *' field that uniquely identifies
pages to 'pfn' which identifies pages uniquely too, but isn't as
opaque and can be used for other purposes as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()
perf/x86/intel: Add Broadwell support for the LBR callstack
perf/x86/intel/rapl: Fix energy counter measurements but supporing per domain energy units
perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events
perf/x86: Fix hw_perf_event::flags collision
perf probe: Fix segfault when probe with lazy_line to file
perf probe: Find compilation directory path for lazy matching
perf probe: Set retprobe flag when probe in address-based alternative mode
perf kmem: Analyze page allocator events also
tracing, mm: Record pfn instead of pointer to struct page
Pull perf changes from Ingo Molnar:
"Core kernel changes:
- One of the more interesting features in this cycle is the ability
to attach eBPF programs (user-defined, sandboxed bytecode executed
by the kernel) to kprobes.
This allows user-defined instrumentation on a live kernel image
that can never crash, hang or interfere with the kernel negatively.
(Right now it's limited to root-only, but in the future we might
allow unprivileged use as well.)
(Alexei Starovoitov)
- Another non-trivial feature is per event clockid support: this
allows, amongst other things, the selection of different clock
sources for event timestamps traced via perf.
This feature is sought by people who'd like to merge perf generated
events with external events that were measured with different
clocks:
- cluster wide profiling
- for system wide tracing with user-space events,
- JIT profiling events
etc. Matching perf tooling support is added as well, available via
the -k, --clockid <clockid> parameter to perf record et al.
(Peter Zijlstra)
Hardware enablement kernel changes:
- x86 Intel Processor Trace (PT) support: which is a hardware tracer
on steroids, available on Broadwell CPUs.
The hardware trace stream is directly output into the user-space
ring-buffer, using the 'AUX' data format extension that was added
to the perf core to support hardware constraints such as the
necessity to have the tracing buffer physically contiguous.
This patch-set was developed for two years and this is the result.
A simple way to make use of this is to use BTS tracing, the PT
driver emulates BTS output - available via the 'intel_bts' PMU.
More explicit PT specific tooling support is in the works as well -
will probably be ready by 4.2.
(Alexander Shishkin, Peter Zijlstra)
- x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
feature of Intel Xeon CPUs that allows the measurement and
allocation/partitioning of caches to individual workloads.
These kernel changes expose the measurement side as a new PMU
driver, which exposes various QoS related PMU events. (The
partitioning change is work in progress and is planned to be merged
as a cgroup extension.)
(Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
Waskiewicz Jr)
- x86 Intel Haswell LBR call stack support: this is a new Haswell
feature that allows the hardware recording of call chains, plus
tooling support. To activate this feature you have to enable it
via the new 'lbr' call-graph recording option:
perf record --call-graph lbr
perf report
or:
perf top --call-graph lbr
This hardware feature is a lot faster than stack walk or dwarf
based unwinding, but has some limitations:
- It reuses the current LBR facility, so LBR call stack and
branch record can not be enabled at the same time.
- It is only available for user-space callchains.
(Yan, Zheng)
- x86 Intel Broadwell CPU support and various event constraints and
event table fixes for earlier models.
(Andi Kleen)
- x86 Intel HT CPUs event scheduling workarounds. This is a complex
CPU bug affecting the SNB,IVB,HSW families that results in counter
value corruption. The mitigation code is automatically enabled and
is transparent.
(Maria Dimakopoulou, Stephane Eranian)
The perf tooling side had a ton of changes in this cycle as well, so
I'm only able to list the user visible changes here, in addition to
the tooling changes outlined above:
User visible changes affecting all tools:
- Improve support of compressed kernel modules (Jiri Olsa)
- Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
- Bash completion for subcommands (Yunlong Song)
- Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
- Support missing -f to override perf.data file ownership. (Yunlong Song)
- Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)
User visible changes in individual tools:
'perf data':
New tool for converting perf.data to other formats, initially
for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
Sebastian Siewior)
'perf diff':
Add --kallsyms option (David Ahern)
'perf list':
Allow listing events with 'tracepoint' prefix (Yunlong Song)
Sort the output of the command (Yunlong Song)
'perf kmem':
Respect -i option (Jiri Olsa)
Print big numbers using thousands' group (Namhyung Kim)
Allow -v option (Namhyung Kim)
Fix alignment of slab result table (Namhyung Kim)
'perf probe':
Support multiple probes on different binaries on the same command line (Masami Hiramatsu)
Support unnamed union/structure members data collection. (Masami Hiramatsu)
Check kprobes blacklist when adding new events. (Masami Hiramatsu)
'perf record':
Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)
Support recording running/enabled time (Andi Kleen)
'perf sched':
Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)
'perf report' and 'perf top':
Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)
Indicate which callchain entries are annotated in the
TUI hists browser (Arnaldo Carvalho de Melo)
Add pid/tid filtering to 'report' and 'script' commands (David Ahern)
Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
events (Arnaldo Carvalho de Melo)
'perf stat':
Report unsupported events properly (Suzuki K. Poulose)
Output running time and run/enabled ratio in CSV mode (Andi Kleen)
'perf trace':
Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)
Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)
Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)
Dump stack on segfaults (Arnaldo Carvalho de Melo)
No need to explicitely enable evsels for workload started from perf, let it
be enabled via perf_event_attr.enable_on_exec, removing some events that take
place in the 'perf trace' before a workload is really started by it.
(Arnaldo Carvalho de Melo)
Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)
There's also been a ton of infrastructure work done, such as the
split-out of perf's build system into tools/build/ and other changes -
see the shortlog and changelog for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
perf evlist: Fix type for references to data_head/tail
perf probe: Check the orphaned -x option
perf probe: Support multiple probes on different binaries
perf buildid-list: Fix segfault when show DSOs with hits
perf tools: Fix cross-endian analysis
perf tools: Fix error path to do closedir() when synthesizing threads
perf tools: Fix synthesizing fork_event.ppid for non-main thread
perf tools: Add 'I' event modifier for exclude_idle bit
perf report: Don't call map__kmap if map is NULL.
perf tests: Fix attr tests
perf probe: Fix ARM 32 building error
perf tools: Merge all perf_event_attr print functions
perf record: Add clockid parameter
perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
perf sched replay: Support using -f to override perf.data file ownership
perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
...
Pull trivial tree from Jiri Kosina:
"Usual trivial tree updates. Nothing outstanding -- mostly printk()
and comment fixes and unused identifier removals"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
goldfish: goldfish_tty_probe() is not using 'i' any more
powerpc: Fix comment in smu.h
qla2xxx: Fix printks in ql_log message
lib: correct link to the original source for div64_u64
si2168, tda10071, m88ds3103: Fix firmware wording
usb: storage: Fix printk in isd200_log_config()
qla2xxx: Fix printk in qla25xx_setup_mode
init/main: fix reset_device comment
ipwireless: missing assignment
goldfish: remove unreachable line of code
coredump: Fix do_coredump() comment
stacktrace.h: remove duplicate declaration task_struct
smpboot.h: Remove unused function prototype
treewide: Fix typo in printk messages
treewide: Fix typo in printk messages
mod_devicetable: fix comment for match_flags
The first argument passed to find_probe_point_lazy() should be CU die,
which will be passed to die_walk_lines() when lazy_line matches.
Currently, when we probe with lazy_line pattern to file without function
name, NULL pointer is passed and causes a segment fault.
Can be reproduced as following:
$ perf probe -k vmlinux --add='fs/super.c;s->s_count=1;'
[ 1958.984658] perf[1020]: segfault at 10 ip 00007fc6e10d8c71 sp
00007ffcbfaaf900 error 4 in libdw-0.161.so[7fc6e10ce000+34000]
Segmentation fault
After this patch:
$ perf probe -k vmlinux --add='fs/super.c;s->s_count=1;'
Added new event:
probe:_stext (on @fs/super.c)
You can now use it in all perf tools, such as:
perf record -e probe:_stext -aR sleep 1
Signed-off-by: He Kuang <hekuang@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1428925290-5623-3-git-send-email-hekuang@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When perf probe searched in a debuginfo file and failed, it tried with
an alternative, in function get_alternative_probe_event():
memcpy(tmp, &pev->point, sizeof(*tmp));
memset(&pev->point, 0, sizeof(pev->point));
In this case, it drops the retprobe flag and forgets to set it back in
find_alternative_probe_point(), so the problem occurs.
Can be reproduced as following:
$ perf probe -v -k vmlinux --add='sys_write%return'
...
Added new event:
Writing event: p:probe/sys_write _stext+1584952
probe:sys_write (on sys_write%return)
$ cat /sys/kernel/debug/tracing/kprobe_events
p:probe/sys_write _stext+1584952
After this patch:
$ perf probe -v -k vmlinux --add='sys_write%return'
Added new event:
Writing event: r:probe/sys_write SyS_write+0
probe:sys_write (on sys_write%return)
$ cat /sys/kernel/debug/tracing/kprobe_events
r:probe/sys_write SyS_write
Signed-off-by: He Kuang <hekuang@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1428925290-5623-1-git-send-email-hekuang@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To avoid probing in unintended binary, the orphaned -x option must be
checked and warned.
Without this patch, following command sets up the probe in the kernel.
-----
# perf probe -a strcpy -x ./perf
Added new event:
probe:strcpy (on strcpy)
You can now use it in all perf tools, such as:
perf record -e probe:strcpy -aR sleep 1
-----
But in this case, it seems that the user may want to probe in the perf
binary. With this patch, perf-probe correctly handles the orphaned -x.
-----
# perf probe -a strcpy -x ./perf
Error: -x/-m must follow the probe definitions.
...
-----
Reported-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150401102541.17137.75477.stgit@localhost.localdomain
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Support multiple probes on different binaries with just
one command.
In the result, this example sets up the probes on icmp_rcv in
kernel, on main and set_target in perf, and on pcspkr_event
in pcspker.ko driver.
-----
# perf probe -a icmp_rcv -x ./perf -a main -a set_target \
-m /lib/modules/4.0.0-rc5+/kernel/drivers/input/misc/pcspkr.ko \
-a pcspkr_event
Added new event:
probe:icmp_rcv (on icmp_rcv)
You can now use it in all perf tools, such as:
perf record -e probe:icmp_rcv -aR sleep 1
Added new event:
probe_perf:main (on main in /home/mhiramat/ksrc/linux-3/tools/perf/perf)
You can now use it in all perf tools, such as:
perf record -e probe_perf:main -aR sleep 1
Added new event:
probe_perf:set_target (on set_target in /home/mhiramat/ksrc/linux-3/tools/perf/perf)
You can now use it in all perf tools, such as:
perf record -e probe_perf:set_target -aR sleep 1
Added new event:
probe:pcspkr_event (on pcspkr_event in pcspkr)
You can now use it in all perf tools, such as:
perf record -e probe:pcspkr_event -aR sleep 1
-----
Reported-by: Arnaldo Carvalho de Melo <acme@infradead.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150401102539.17137.46454.stgit@localhost.localdomain
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
commit: f3b623b849 ("perf tools: Reference count struct thread")
appends every thread->node to dead_threads in machine__remove_thread()
and list_del_init() this node in thread__put().
perf_event__exit_del_thread() releases thread wihout using
machine__remove_thread(), and causes a NULL pointer crash when
list_del_init(&thread->node) is called. Fix this by using
machine_remove_thread() instead of using thread__put() directly.
This problem can be reproduced as following:
$ perf record ls
$ perf buildid-list --with-hits
[ 3874.195070] perf[1018]: segfault at 0 ip 00000000004b0b15 sp
00007ffc35b44780 error 6 in perf[400000+166000]
Segmentation fault
After this patch:
$ perf record ls
$ perf buildid-list --with-hits
bc23e7c3281e542650ba4324421d6acf78f4c23e /proc/kcore
643324cb0e969f30c56d660f167f84a150845511 [vdso]
0000000000000000000000000000000000000000 /bin/busybox
...
Signed-off-by: He Kuang <hekuang@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1428658500-6483-1-git-send-email-hekuang@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The soft maximum number of open files for a calling process is 1024,
which is defined as INR_OPEN_CUR in include/uapi/linux/fs.h, and the
hard maximum number of open files for a calling process is 4096, which
is defined as INR_OPEN_MAX in include/uapi/linux/fs.h.
Both INR_OPEN_CUR and INR_OPEN_MAX are used to limit the value of
RLIMIT_NOFILE in include/asm-generic/resource.h.
And the soft maximum number finally decides the limitation of the
maximum files which are allowed to be opened.
That is to say a process can use at most 1024 file descriptors for its
o pened files, or an EMFILE error will happen.
This error can be fixed by increasing the soft maximum number, under the
constraint that the soft maximum number can not exceed the hard maximum
number, or both soft and hard maximum number should be increased
simultaneously with privilege.
For perf sched replay, it uses sys_perf_event_open to create the file
descriptor for each of the tasks in order to handle information of perf
events.
That is to say each task needs a unique file descriptor. In x86_64,
there may be over 1024 or 4096 tasks correspoinding to the record in
perf.data, which causes that no enough file descriptors can be used.
As a result, EMFILE error happens and stops the replay process. To solve
this problem, we adaptively increase the soft and hard maximum number of
open files with a '-f' option.
Example:
Test environment: x86_64 with 160 cores
$ cat /proc/sys/kernel/pid_max
163840
$ cat /proc/sys/fs/file-max
6815744
$ ulimit -Sn
1024
$ ulimit -Hn
4096
Before this patch:
$ perf sched replay
...
task 1549 ( :163132: 163132), nr_events: 1
task 1550 ( :163540: 163540), nr_events: 1
task 1551 ( <unknown>: 0), nr_events: 10
Error: sys_perf_event_open() syscall returned with -1 (Too many open
files)
After this patch:
$ perf sched replay
...
task 1549 ( :163132: 163132), nr_events: 1
task 1550 ( :163540: 163540), nr_events: 1
task 1551 ( <unknown>: 0), nr_events: 10
Error: sys_perf_event_open() syscall returned with -1 (Too many open
files)
Have a try with -f option
$ perf sched replay -f
...
task 1549 ( :163132: 163132), nr_events: 1
task 1550 ( :163540: 163540), nr_events: 1
task 1551 ( <unknown>: 0), nr_events: 10
------------------------------------------------------------
#1 : 54.401, ravg: 54.40, cpu: 3285.21 / 3285.21
#2 : 199.548, ravg: 68.92, cpu: 4999.65 / 3456.66
#3 : 170.483, ravg: 79.07, cpu: 1349.94 / 3245.99
#4 : 192.034, ravg: 90.37, cpu: 1322.88 / 3053.67
#5 : 182.929, ravg: 99.62, cpu: 1406.51 / 2888.96
#6 : 152.974, ravg: 104.96, cpu: 1167.54 / 2716.82
#7 : 155.579, ravg: 110.02, cpu: 2992.53 / 2744.39
#8 : 130.557, ravg: 112.08, cpu: 1126.43 / 2582.59
#9 : 138.520, ravg: 114.72, cpu: 1253.22 / 2449.65
#10 : 134.328, ravg: 116.68, cpu: 1587.95 / 2363.48
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1427809596-29559-8-git-send-email-yunlong.song@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since there is sem_wait for each task in the wait_for_tasks(), e.g.
sem_wait(&task->work_done_sem).
The sem_wait can continue only when work_done_sem is greater than 0, or
it will be blocked.
For perf sched replay, one task may sem_post the work_done_sem of
another task, which causes the work_done_sem of that task processed in a
reasonable sequence, e.g. sem_post, sem_wait, sem_wait, sem_post...
This sequence simulates the sched process of the running tasks at the
time when perf sched record runs.
As a result, all the tasks are required and their threads must be
successfully created.
If any one (task A) of the tasks fails to create its thread, then
another task (task B), whose work_done_sem needs sem_post from that
failed task A, may likely block itself due to seg_wait.
And this is a dead halt, since task B's thread_func cannot continue at
all.
To solve this problem, perf sched replay should exit once any task fails
to create its thread.
Example:
Test environment: x86_64 with 160 cores
Before this patch:
$ perf sched replay
...
Error: sys_perf_event_open() syscall returned with -1 (Too many open
files)
------------------------------------------------------------ <- dead halt
After this patch:
$ perf sched replay
...
task 1551 ( <unknown>: 0), nr_events: 10
Error: sys_perf_event_open() syscall returned with -1 (Too many open
files)
$
As shown above, perf sched replay finishes the process after printing an
error message and does not block itself.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1427809596-29559-7-git-send-email-yunlong.song@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The pr_err in self_open_counters() prints error message to stderr.
Unlike stdout, stderr uses memory buffer on the stack of each calling
process.
The pr_err in self_open_counters() works in a thread called thread_func
created in function create_tasks, which concurrently creates
sched->nr_tasks threads.
If the error happens and pr_err prints the error message in each of
these threads, the stack size of the perf process (default is 8192
kbytes) will quickly run out and the segmentation fault will happen
then.
To solve this problem, pr_err with self_open_counters() should be moved
from newly created threads to the old main thread of the perf process.
Then the pr_err can work in a stable situation without the strange
segmentation fault problem.
Example:
Test environment: x86_64 with 160 cores
Before this patch:
$ perf sched replay
...
task 1549 ( :163132: 163132), nr_events: 1
task 1550 ( :163540: 163540), nr_events: 1
task 1551 ( <unknown>: 0), nr_events: 10
Segmentation fault
After this patch:
$ perf sched replay
...
task 1549 ( :163132: 163132), nr_events: 1
task 1550 ( :163540: 163540), nr_events: 1
task 1551 ( <unknown>: 0), nr_events: 10
...
As shown above, the result continues without any segmentation fault.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1427809596-29559-6-git-send-email-yunlong.song@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>