Peter Zijlstra
4411ec1d19
perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
...
> kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages'
Userspace controls @pgoff through the fault address. Sanitize the
array index before doing the array dereference.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com >
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Cc: <stable@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Vince Weaver <vincent.weaver@maine.edu >
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-05-05 08:37:27 +02:00
Linus Torvalds
f4ef6a438c
Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
...
Pull tracing fixes from Steven Rostedt:
"Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
- Use special field for function pointer in trace event
- Fix igrab issues in uprobes
- Fixes to the new histogram triggers"
* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Do not warn on ENOMEM
tracing: Add field modifier parsing hist error for hist triggers
tracing: Add field parsing hist error for hist triggers
tracing: Restore proper field flag printing when displaying triggers
tracing: initcall: Ordered comparison of function pointers
tracing: Remove igrab() iput() call from uprobes.c
tracing: Fix bad use of igrab in trace_uprobe.c
2018-05-02 17:38:37 -10:00
Song Liu
61f94203c9
tracing: Remove igrab() iput() call from uprobes.c
...
Caller of uprobe_register is required to keep the inode and containing
mount point referenced.
There was misuse of igrab() in uprobes.c and trace_uprobe.c. This is
because igrab() will not prevent umount of the containing mount point.
To fix this, we added path to struct trace_uprobe, which keeps the inode
and containing mount reference.
For uprobes.c, it is not necessary to call igrab() in uprobe_register(),
as the caller is required to keep the inode reference. The igrab() is
removed and comments on this requirement is added to uprobe_register().
Link: http://lkml.kernel.org/r/CAELBmZB2XX=qEOLAdvGG4cPx4GEntcSnWQquJLUK1ongRj35cA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180423172135.4050588-2-songliubraving@fb.com
Cc: Ingo Molnar <mingo@redhat.com >
Cc: Howard McLauchlan <hmclauchlan@fb.com >
Cc: Josef Bacik <jbacik@fb.com >
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com >
Acked-by: Miklos Szeredi <mszeredi@redhat.com >
Signed-off-by: Song Liu <songliubraving@fb.com >
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org >
2018-04-26 14:50:56 -04:00
Jiri Olsa
bfb3d7b8b9
perf: Remove superfluous allocation error check
...
If the get_callchain_buffers fails to allocate the buffer it will
decrease the nr_callchain_events right away.
There's no point of checking the allocation error for
nr_callchain_events > 1. Removing that check.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Andi Kleen <andi@firstfloor.org >
Cc: H. Peter Anvin <hpa@zytor.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20180415092352.12403-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com >
2018-04-17 09:47:40 -03:00
Jiri Olsa
5af44ca53d
perf: Fix sample_max_stack maximum check
...
The syzbot hit KASAN bug in perf_callchain_store having the entry stored
behind the allocated bounds [1].
We miss the sample_max_stack check for the initial event that allocates
callchain buffers. This missing check allows to create an event with
sample_max_stack value bigger than the global sysctl maximum:
# sysctl -a | grep perf_event_max_stack
kernel.perf_event_max_stack = 127
# perf record -vv -C 1 -e cycles/max-stack=256/ kill
...
perf_event_attr:
size 112
...
sample_max_stack 256
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 = 4
Note the '-C 1', which forces perf record to create just single event.
Otherwise it opens event for every cpu, then the sample_max_stack check
fails on the second event and all's fine.
The fix is to run the sample_max_stack check also for the first event
with callchains.
[1] https://marc.info/?l=linux-kernel&m=152352732920874&w=2
Reported-by: syzbot+7c449856228b63ac951e@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Andi Kleen <andi@firstfloor.org >
Cc: H. Peter Anvin <hpa@zytor.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Fixes: 97c79a38cd ("perf core: Per event callchain limit")
Link: http://lkml.kernel.org/r/20180415092352.12403-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com >
2018-04-17 09:47:40 -03:00
Jiri Olsa
78b562fbfa
perf: Return proper values for user stack errors
...
Return immediately when we find issue in the user stack checks. The
error value could get overwritten by following check for
PERF_SAMPLE_REGS_INTR.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Andi Kleen <andi@firstfloor.org >
Cc: H. Peter Anvin <hpa@zytor.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: syzkaller-bugs@googlegroups.com
Cc: x86@kernel.org
Fixes: 60e2364e60 ("perf: Add ability to sample machine state on interrupt")
Link: http://lkml.kernel.org/r/20180415092352.12403-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com >
2018-04-17 09:47:39 -03:00
Alexey Budankov
101592b490
perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
...
Store preempting context switch out event into Perf trace as a part of
PERF_RECORD_SWITCH[_CPU_WIDE] record.
Percentage of preempting and non-preempting context switches help
understanding the nature of workloads (CPU or IO bound) that are running
on a machine;
The event is treated as preemption one when task->state value of the
thread being switched out is TASK_RUNNING. Event type encoding is
implemented using PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit;
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com >
Acked-by: Peter Zijlstra <peterz@infradead.org >
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Andi Kleen <ak@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com >
2018-04-17 09:47:39 -03:00
Song Liu
32e6e967fb
perf/core: Need CAP_SYS_ADMIN to create k/uprobe with perf_event_open()
...
Non-root user cannot create kprobe or uprobe through the text-based
interface (kprobe_events, uprobe_events),so they should not be able
to create probes via perf_event_open() either.
Reported-by: Vince Weaver <vincent.weaver@maine.edu >
Signed-off-by: Song Liu <songliubraving@fb.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Thomas Gleixner <tglx@linutronix.de >
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Link: http://lkml.kernel.org/r/C0B2EFB5-C403-4BDB-9046-C14B3EE66999@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-04-12 09:55:50 +02:00
Prashant Bhole
621b6d2ea2
perf/core: Fix use-after-free in uprobe_perf_close()
...
A use-after-free bug was caught by KASAN while running usdt related
code (BCC project. bcc/tests/python/test_usdt2.py):
==================================================================
BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870
CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G W 4.16.0-next-20180409 #215
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0xc7/0x15b
? show_regs_print_info+0x5/0x5
? printk+0x9c/0xc3
? kmsg_dump_rewind_nolock+0x6e/0x6e
? uprobe_perf_close+0x222/0x3b0
print_address_description+0x83/0x3a0
? uprobe_perf_close+0x222/0x3b0
kasan_report+0x1dd/0x460
? uprobe_perf_close+0x222/0x3b0
uprobe_perf_close+0x222/0x3b0
? probes_open+0x180/0x180
? free_filters_list+0x290/0x290
trace_uprobe_register+0x1bb/0x500
? perf_event_attach_bpf_prog+0x310/0x310
? probe_event_disable+0x4e0/0x4e0
perf_uprobe_destroy+0x63/0xd0
_free_event+0x2bc/0xbd0
? lockdep_rcu_suspicious+0x100/0x100
? ring_buffer_attach+0x550/0x550
? kvm_sched_clock_read+0x1a/0x30
? perf_event_release_kernel+0x3e4/0xc00
? __mutex_unlock_slowpath+0x12e/0x540
? wait_for_completion+0x430/0x430
? lock_downgrade+0x3c0/0x3c0
? lock_release+0x980/0x980
? do_raw_spin_trylock+0x118/0x150
? do_raw_spin_unlock+0x121/0x210
? do_raw_spin_trylock+0x150/0x150
perf_event_release_kernel+0x5d4/0xc00
? put_event+0x30/0x30
? fsnotify+0xd2d/0xea0
? sched_clock_cpu+0x18/0x1a0
? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
? pvclock_clocksource_read+0x152/0x2b0
? pvclock_read_flags+0x80/0x80
? kvm_sched_clock_read+0x1a/0x30
? sched_clock_cpu+0x18/0x1a0
? pvclock_clocksource_read+0x152/0x2b0
? locks_remove_file+0xec/0x470
? pvclock_read_flags+0x80/0x80
? fcntl_setlk+0x880/0x880
? ima_file_free+0x8d/0x390
? lockdep_rcu_suspicious+0x100/0x100
? ima_file_check+0x110/0x110
? fsnotify+0xea0/0xea0
? kvm_sched_clock_read+0x1a/0x30
? rcu_note_context_switch+0x600/0x600
perf_release+0x21/0x40
__fput+0x264/0x620
? fput+0xf0/0xf0
? do_raw_spin_unlock+0x121/0x210
? do_raw_spin_trylock+0x150/0x150
? SyS_fchdir+0x100/0x100
? fsnotify+0xea0/0xea0
task_work_run+0x14b/0x1e0
? task_work_cancel+0x1c0/0x1c0
? copy_fd_bitmaps+0x150/0x150
? vfs_read+0xe5/0x260
exit_to_usermode_loop+0x17b/0x1b0
? trace_event_raw_event_sys_exit+0x1a0/0x1a0
do_syscall_64+0x3f6/0x490
? syscall_return_slowpath+0x2c0/0x2c0
? lockdep_sys_exit+0x1f/0xaa
? syscall_return_slowpath+0x1a3/0x2c0
? lockdep_sys_exit+0x1f/0xaa
? prepare_exit_to_usermode+0x11c/0x1e0
? enter_from_user_mode+0x30/0x30
random: crng init done
? __put_user_4+0x1c/0x30
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f41d95f9340
RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290
Allocated by task 870:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_node+0x11a/0x430
copy_process.part.19+0x11a0/0x41c0
_do_fork+0x1be/0xa20
do_syscall_64+0x198/0x490
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Freed by task 0:
__kasan_slab_free+0x12e/0x180
kmem_cache_free+0x102/0x4d0
free_task+0xfe/0x160
__put_task_struct+0x189/0x290
delayed_put_task_struct+0x119/0x250
rcu_process_callbacks+0xa6c/0x1b60
__do_softirq+0x238/0x7ae
The buggy address belongs to the object at ffff880384f9b480
which belongs to the cache task_struct of size 12928
It occurs because task_struct is freed before perf_event which refers
to the task and task flags are checked while teardown of the event.
perf_event_alloc() assigns task_struct to hw.target of perf_event,
but there is no reference counting for it.
As a fix we get_task_struct() in perf_event_alloc() at above mentioned
assignment and put_task_struct() in _free_event().
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp >
Reviewed-by: Oleg Nesterov <oleg@redhat.com >
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Cc: <stable@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@kernel.org >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Thomas Gleixner <tglx@linutronix.de >
Fixes: 63b6da39bb ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-04-09 18:15:58 +02:00
Alexander Shishkin
6ed70cf342
perf/x86/pt, coresight: Clean up address filter structure
...
This is a cosmetic patch that deals with the address filter structure's
ambiguous fields 'filter' and 'range'. The former stands to mean that the
filter's *action* should be to filter the traces to its address range if
it's set or stop tracing if it's unset. This is confusing and hard on the
eyes, so this patch replaces it with 'action' enum. The 'range' field is
completely redundant (meaning that the filter is an address range as
opposed to a single address trigger), as we can use zero size to mean the
same thing.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org >
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Vince Weaver <vincent.weaver@maine.edu >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180329120648.11902-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-29 16:07:22 +02:00
Ingo Molnar
2d074918fb
Merge branch 'perf/urgent' into perf/core
...
Conflicts:
kernel/events/hw_breakpoint.c
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-29 16:03:48 +02:00
Linus Torvalds
f67b15037a
perf/hwbp: Simplify the perf-hwbp code, fix documentation
...
Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
modification of a breakpoint - simplify it and remove the pointless
local variables.
Also update the stale Docbook while at it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org >
Acked-by: Thomas Gleixner <tglx@linutronix.de >
Cc: <stable@vger.kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Andy Lutomirski <luto@kernel.org >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Vince Weaver <vincent.weaver@maine.edu >
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-28 17:41:50 +02:00
Ingo Molnar
7054e4e0b1
Merge branch 'perf/urgent' into perf/core, to pick up fixes
...
With the cherry-picked perf/urgent commit merged separately we can now
merge all the fixes without conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-24 09:21:47 +01:00
Song Liu
c917e0f259
perf/cgroup: Fix child event counting bug
...
When a perf_event is attached to parent cgroup, it should count events
for all children cgroups:
parent_group <---- perf_event
\
- child_group <---- process(es)
However, in our tests, we found this perf_event cannot report reliable
results. Here is an example case:
# create cgroups
mkdir -p /sys/fs/cgroup/p/c
# start perf for parent group
perf stat -e instructions -G "p"
# on another console, run test process in child cgroup:
stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs
# after the test process is done, stop perf in the first console shows
<not counted> instructions p
The instruction should not be "not counted" as the process runs in the
child cgroup.
We found this is because perf_event->cgrp and cpuctx->cgrp are not
identical, thus perf_event->cgrp are not updated properly.
This patch fixes this by updating perf_cgroup properly for ancestor
cgroup(s).
Reported-by: Ephraim Park <ephiepark@fb.com >
Signed-off-by: Song Liu <songliubraving@fb.com >
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Cc: <jolsa@redhat.com >
Cc: <kernel-team@fb.com >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Vince Weaver <vincent.weaver@maine.edu >
Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-20 08:58:47 +01:00
Ingo Molnar
134933e557
Merge tag 'v4.16-rc6' into perf/core, to pick up fixes
...
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-19 20:37:35 +01:00
Mark Rutland
24868367cd
perf/core: Clear sibling list of detached events
...
When perf_group_dettach() is called on a group leader, it updates each
sibling's group_leader field to point to that sibling, effectively
upgrading each siblnig to a group leader. After perf_group_detach has
completed, the caller may free the leader event.
We only remove siblings from the group leader's sibling_list when the
leader has a non-empty group_node. This was fine prior to commit:
8343aae661 ("perf/core: Remove perf_event::group_entry")
... as the sibling's sibling_list would be empty. However, now that we
use the sibling_list field as both the list head and the list entry,
this leaves each sibling with a non-empty sibling list, including the
stale leader event.
If perf_group_detach() is subsequently called on a sibling, it will
appear to be a group leader, and we'll walk the sibling_list,
potentially dereferencing these stale events. In 0day testing, this has
been observed to result in kernel panics.
Let's avoid this by always removing siblings from the sibling list when
we promote them to leaders.
Fixes: 8343aae661 ("perf/core: Remove perf_event::group_entry")
Signed-off-by: Mark Rutland <mark.rutland@arm.com >
Signed-off-by: Thomas Gleixner <tglx@linutronix.de >
Cc: vincent.weaver@maine.edu
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: torvalds@linux-foundation.org
Cc: Alexey Budankov <alexey.budankov@linux.intel.com >
Cc: valery.cherepennikov@intel.com
Cc: linux-tip-commits@vger.kernel.org
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: alexander.shishkin@linux.intel.com
Cc: davidcc@google.com
Cc: kan.liang@intel.com
Cc: Dmitry.Prohorov@intel.com
Cc: Jiri Olsa <jolsa@redhat.com >
Link: https://lkml.kernel.org/r/20180316131741.3svgr64yibc6vsid@lakrids.cambridge.arm.com
2018-03-16 20:44:32 +01:00
Peter Zijlstra
edb39592a5
perf: Fix sibling iteration
...
Mark noticed that the change to sibling_list changed some iteration
semantics; because previously we used group_list as list entry,
sibling events would always have an empty sibling_list.
But because we now use sibling_list for both list head and list entry,
siblings will report as having siblings.
Fix this with a custom for_each_sibling_event() iterator.
Fixes: 8343aae661 ("perf/core: Remove perf_event::group_entry")
Reported-by: Mark Rutland <mark.rutland@arm.com >
Suggested-by: Mark Rutland <mark.rutland@arm.com >
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Signed-off-by: Thomas Gleixner <tglx@linutronix.de >
Cc: vincent.weaver@maine.edu
Cc: alexander.shishkin@linux.intel.com
Cc: torvalds@linux-foundation.org
Cc: alexey.budankov@linux.intel.com
Cc: valery.cherepennikov@intel.com
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: linux-tip-commits@vger.kernel.org
Cc: davidcc@google.com
Cc: kan.liang@intel.com
Cc: Dmitry.Prohorov@intel.com
Cc: jolsa@redhat.com
Link: https://lkml.kernel.org/r/20180315170129.GX4043@hirez.programming.kicks-ass.net
2018-03-16 20:44:12 +01:00
Milind Chabbi
32ff77e8cc
perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTES
...
Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT)
is created, there is no flexibility to change the breakpoint type
(bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The
only option is to close the perf event and configure a new breakpoint
event. This inflexibility has a significant performance overhead. For
example, sampling-based, lightweight performance profilers (and also
concurrency bug detection tools), monitor different addresses for a short
duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
another address or change the kind of breakpoint (bp_type) from "write" to
a "read" or vice-versa or change the length (bp_len) of the address being
monitored. The cost of these modifications is prohibitive since it involves
unmapping the circular buffer associated with the perf event, closing the
perf event, opening another perf event and mmaping another circular buffer.
Solution: The new ioctl flag for perf events,
PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer
to a struct perf_event_attr as an argument to update an old breakpoint
event with new address, type, and size. This facility allows retaining a
previous mmaped perf events ring buffer and avoids having to close and
reopen another perf event.
This patch supports only changing PERF_TYPE_BREAKPOINT event type; future
implementations can extend this feature. The patch replicates some of its
functionality of modify_user_hw_breakpoint() in
kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint cannot be called
directly since perf_event_ctx_lock() is already held in _perf_ioctl().
Evidence: Experiments show that the baseline (not able to modify an already
created breakpoint) costs an order of magnitude (~10x) more than the
suggested optimization (having the ability to dynamically modifying a
configured breakpoint via ioctl). When the breakpoints typically do not
trap, the speedup due to the suggested optimization is ~10x; even when the
breakpoints always trap, the speedup is ~4x due to the suggested
optimization.
Testing: tests posted at
https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
performance significance of this patch. Tests also check the functional
correctness of the patch.
Signed-off-by: Milind Chabbi <chabbi.milind@gmail.com >
[ Using modify_user_hw_breakpoint_check function. ]
[ Reformated PERF_EVENT_IOC_*, so the values are all in one column. ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 15:24:02 +01:00
Jiri Olsa
5f970521d3
perf/core: Move perf_event_attr::sample_max_stack into perf_copy_attr()
...
Move the sample_max_stack check and setup into perf_copy_attr(),
so we have all perf_event_attr initial setup in one place
and can easily compare attrs in the new ioctl introduced
in following change.
Suggested-by: Peter Zijlstra <peterz@infradead.org >
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-7-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:08 +01:00
Jiri Olsa
705feaf321
hw_breakpoint: Add perf_event_attr fields check in __modify_user_hw_breakpoint()
...
And rename it to modify_user_hw_breakpoint_check().
We are about to use modify_user_hw_breakpoint_check() for user space
breakpoints modification, we must be very strict to check only the
fields we can change have changed. As Peter explained:
"Suppose someone does:
attr = malloc(sizeof(*attr)); // uninitialized memory
attr->type = BP;
attr->bp_addr = new_addr;
attr->bp_type = bp_type;
attr->bp_len = bp_len;
ioctl(fd, PERF_IOC_MOD_ATTR, &attr);
And feeds absolute shite for the rest of the fields.
Then we later want to extend IOC_MOD_ATTR to allow changing
attr::sample_type but we can't, because that would break the
above application."
I'm making this check optional because we already export
modify_user_hw_breakpoint() and with this check we could
break existing users.
Suggested-by: Peter Zijlstra <peterz@infradead.org >
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-6-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:08 +01:00
Jiri Olsa
18ff57b220
hw_breakpoint: Factor out __modify_user_hw_breakpoint() function
...
Moving out all the functionality without the events
disabling/enabling calls, because we want to call another
disabling/enabling functions in following change.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-5-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:08 +01:00
Jiri Olsa
ea6a9d530c
hw_breakpoint: Add modify_bp_slot() function
...
Add the modify_bp_slot() function to keep slot numbers
correct when changing the breakpoint type.
Using existing __release_bp_slot()/__reserve_bp_slot()
call sequence to update the slot counts.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-4-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:07 +01:00
Jiri Olsa
1ad9ff7dea
hw_breakpoint: Pass bp_type argument to __reserve_bp_slot|__release_bp_slot()
...
Passing bp_type argument to __reserve_bp_slot() and __release_bp_slot()
functions, so we can pass another bp_type than the one defined in
bp->attr.bp_type. This will be handy in following change that fixes
breakpoint slot counts during its modification.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:07 +01:00
Jiri Olsa
cbd9d9f114
hw_breakpoint: Pass bp_type directly as find_slot_idx() argument
...
Pass bp_type directly as a find_slot_idx() argument,
so we don't need to have whole event to get the
breakpoint slot type. It will be used in following
changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: David Ahern <dsahern@gmail.com >
Cc: Frederic Weisbecker <fweisbec@gmail.com >
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com >
Cc: Jin Yao <yao.jin@linux.intel.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Kan Liang <kan.liang@intel.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Michael Ellerman <mpe@ellerman.id.au >
Cc: Milind Chabbi <chabbi.milind@gmail.com >
Cc: Namhyung Kim <namhyung@kernel.org >
Cc: Oleg Nesterov <onestero@redhat.com >
Cc: Peter Zijlstra <peterz@infradead.org >
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Will Deacon <will.deacon@arm.com >
Link: http://lkml.kernel.org/r/20180312134548.31532-2-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-13 06:56:07 +01:00
leilei.lin
33801b9474
perf/core: Fix installing cgroup events on CPU
...
There's two problems when installing cgroup events on CPUs: firstly
list_update_cgroup_event() only tries to set cpuctx->cgrp for the
first event, if that mismatches on @cgrp we'll not try again for later
additions.
Secondly, when we install a cgroup event into an active context, only
issue an event reprogram when the event matches the current cgroup
context. This avoids a pointless event reprogramming.
Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com >
[ Improved the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org >
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com >
Cc: Arnaldo Carvalho de Melo <acme@redhat.com >
Cc: Jiri Olsa <jolsa@redhat.com >
Cc: Linus Torvalds <torvalds@linux-foundation.org >
Cc: Stephane Eranian <eranian@google.com >
Cc: Thomas Gleixner <tglx@linutronix.de >
Cc: Vince Weaver <vincent.weaver@maine.edu >
Cc: brendan.d.gregg@gmail.com
Cc: eranian@gmail.com
Cc: linux-kernel@vger.kernel.org
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org >
2018-03-12 15:28:51 +01:00