Pull perf fixes from Ingo Molnar:
"Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
driver fixes, plus miscellanous tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Reject non sampling events with precise_ip
perf/x86/intel: Account interrupts for PEBS errors
perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
perf/core: Fix sys_perf_event_open() vs. hotplug
perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
perf/x86: Set pmu->module in Intel PMU modules
perf probe: Fix to probe on gcc generated symbols for offline kernel
perf probe: Fix --funcs to show correct symbols for offline module
perf symbols: Robustify reading of build-id from sysfs
perf tools: Install tools/lib/traceevent plugins with install-bin
tools lib traceevent: Fix prev/next_prio for deadline tasks
perf record: Fix --switch-output documentation and comment
perf record: Make __record_options static
tools lib subcmd: Add OPT_STRING_OPTARG_SET option
perf probe: Fix to get correct modname from elf header
samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
samples/bpf sock_example: Avoid getting ethhdr from two includes
perf sched timehist: Show total scheduling time
The filter added by sock_setfilter is intended to only permit
packets matching the pattern set up by create_payload(), but
we only check the ip_len, and a single test-character in
the IP packet to ensure this condition.
Harden the filter by adding additional constraints so that we only
permit UDP/IPv4 packets that meet the ip_len and test-character
requirements. Include the bpf_asm src as a comment, in case this
needs to be enhanced in the future
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When structs are used to store temporary state in cb[] buffer that is
used with programs and among tail calls, then the generated code will
not always access the buffer in bpf_w chunks. We can ease programming
of it and let this act more natural by allowing for aligned b/h/w/dw
sized access for cb[] ctx member. Various test cases are attached as
well for the selftest suite. Potentially, this can also be reused for
other program types to pass data around.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge fixes from Andrew Morton:
"27 fixes.
There are three patches that aren't actually fixes. They're simple
function renamings which are nice-to-have in mainline as ongoing net
development depends on them."
* akpm: (27 commits)
timerfd: export defines to userspace
mm/hugetlb.c: fix reservation race when freeing surplus pages
mm/slab.c: fix SLAB freelist randomization duplicate entries
zram: support BDI_CAP_STABLE_WRITES
zram: revalidate disk under init_lock
mm: support anonymous stable page
mm: add documentation for page fragment APIs
mm: rename __page_frag functions to __page_frag_cache, drop order from drain
mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
mm: don't dereference struct page fields of invalid pages
mailmap: add codeaurora.org names for nameless email commits
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
mm: pmd dirty emulation in page fault handler
ipc/sem.c: fix incorrect sem_lock pairing
lib/Kconfig.debug: fix frv build failure
mm: get rid of __GFP_OTHER_NODE
mm: fix remote numa hits statistics
mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
ocfs2: fix crash caused by stale lvb with fsdlm plugin
...
Currently, helpers that read and write from/to the stack can do so using
a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
that the verifier can safely check the memory access. However, requiring
the argument to be a constant can be limiting in some circumstances.
Since the current logic keeps track of the minimum and maximum value of
a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
be changed to also accept an UNKNOWN_VALUE register in case its
boundaries have been set and the range doesn't cause invalid memory
accesses.
One common situation when this is useful:
int len;
char buf[BUFSIZE]; /* BUFSIZE is 128 */
if (some_condition)
len = 42;
else
len = 84;
some_helper(..., buf, len & (BUFSIZE - 1));
The compiler can often decide to assign the constant values 42 or 48
into a variable on the stack, instead of keeping it in a register. When
the variable is then read back from stack into the register in order to
be passed to the helper, the verifier will not be able to recognize the
register as constant (the verifier is not currently tracking all
constant writes into memory), and the program won't be valid.
However, by allowing the helper to accept an UNKNOWN_VALUE register,
this program will work because the bitwise AND operation will set the
range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
so the verifier can guarantee the helper call will be safe (assuming the
argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
check against 0 would be needed). Custom ranges can be set not only with
ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
register with constants.
Another very common example happens when intercepting system call
arguments and accessing user-provided data of variable size using
bpf_probe_read(). One can load at runtime the user-provided length in an
UNKNOWN_VALUE register, and then read that exact amount of data up to a
compile-time determined limit in order to fit into the proper local
storage allocated on the stack, without having to guess a suboptimal
access size at compile time.
Also, in case the helpers accepting the UNKNOWN_VALUE register operate
in raw mode, disable the raw mode so that the program is required to
initialize all memory, since there is no guarantee the helper will fill
it completely, leaving possibilities for data leak (just relevant when
the memory used by the helper is the stack, not when using a pointer to
map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
be treated as ARG_PTR_TO_STACK.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 484611357c ("bpf: allow access into map value arrays")
introduces the ability to do pointer math inside a map element value via
the PTR_TO_MAP_VALUE_ADJ register type.
The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
is spilled into the stack, limiting several use cases, especially when
generating bpf code from a compiler.
Handle this case by explicitly enabling the register type
PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
max_value are reset just for BPF_LDX operations that don't result in a
restore of a spilled register from stack.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable helpers to directly access a map element value by passing a
register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
This enables several use cases. For example, a typical tracing program
might want to capture pathnames passed to sys_open() with:
struct trace_data {
char pathname[PATHLEN];
};
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
struct trace_data data;
bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
/* consume data.pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
Such a program could easily hit the stack limit in case PATHLEN needs to
be large or more local variables need to exist, both of which are quite
common scenarios. Allowing direct helper access to map element values,
one could do:
struct bpf_map_def SEC("maps") scratch_map = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(struct trace_data),
.max_entries = 1,
};
SEC("kprobe/sys_open")
int bpf_sys_open(struct pt_regs *ctx)
{
int id = 0;
struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
if (!p)
return;
bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
/* consume p->pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
And wouldn't risk exhausting the stack.
Code changes are loosely modeled after commit 6841de8b0d ("bpf: allow
helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
trivial, but since there is not currently a use case for that, it's
reasonable to limit the set of changes.
Also, add new tests to make sure accesses to map element values from
helpers never go out of boundary, even when adjusted.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Packets from any/all interfaces may be queued up on the PF_PACKET socket
before it is bound to the loopback interface by psock_tpacket, and
when these are passed up by the kernel, they could interfere
with the Rx tests.
Avoid interference from spurious packet by blocking Rx until the
socket filter has been set up, and the packet has been bound to the
desired (lo) interface. The effective sequence is
socket(PF_PACKET, SOCK_RAW, 0);
set up ring
Invoke SO_ATTACH_FILTER
bind to sll_protocol set to ETH_P_ALL, sll_ifindex for lo
After this sequence, the only packets that will be passed up are
those received on loopback that pass the attached filter.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf/urgent fixes and one improvement from Arnaldo Carvalho de Melo:
Fixes:
- Fix prev/next_prio formatting for deadline tasks in libtraceevent (Daniel Bristot de Oliveira)
- Robustify reading of build-ids from /sys/kernel/note (Arnaldo Carvalho de Melo)
- Fix building some sample/bpf in Alpine Linux 3.4 (Arnaldo Carvalho de Melo)
- Fix 'make install-bin' to install libtraceevent plugins (Arnaldo Carvalho de Melo)
- Fix 'perf record --switch-output' documentation and comment (Jiri Olsa)
- Fix 'perf probe' for cross arch probing (Masami Hiramatsu)
Improvement:
- Show total scheduling time in 'perf sched timehist' (Namhyumg Kim)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fix perf-probe to show probe definition on gcc generated symbols for
offline kernel (including cross-arch kernel image).
gcc sometimes optimizes functions and generate new symbols with suffixes
such as ".constprop.N" or ".isra.N" etc. Since those symbol names are
not recorded in DWARF, we have to find correct generated symbols from
offline ELF binary to probe on it (kallsyms doesn't correct it). For
online kernel or uprobes we don't need it because those are rebased on
_text, or a section relative address.
E.g. Without this:
$ perf probe -k build-arm/vmlinux -F __slab_alloc*
__slab_alloc.constprop.9
$ perf probe -k build-arm/vmlinux -D __slab_alloc
p:probe/__slab_alloc __slab_alloc+0
If you put above definition on target machine, it should fail
because there is no __slab_alloc in kallsyms.
With this fix, perf probe shows correct probe definition on
__slab_alloc.constprop.9:
$ perf probe -k build-arm/vmlinux -D __slab_alloc
p:probe/__slab_alloc __slab_alloc.constprop.9+0
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148350060434.19001.11864836288580083501.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fix --funcs (-F) option to show correct symbols for offline module.
Since previous perf-probe uses machine__findnew_module_map() for offline
module, even if user passes a module file (with full path) which is for
other architecture, perf-probe always tries to load symbol map for
current kernel module.
This fix uses dso__new_map() to load the map from given binary as same
as a map for user applications.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148350053478.19001.15435255244512631545.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently, the sched:sched_switch tracepoint reports deadline tasks with
priority -1. But when reading the trace via perf script I've got the
following output:
# ./d & # (d is a deadline task, see [1])
# perf record -e sched:sched_switch -a sleep 1
# perf script
...
swapper 0 [000] 2146.962441: sched:sched_switch: swapper/0:0 [120] R ==> d:2593 [4294967295]
d 2593 [000] 2146.972472: sched:sched_switch: d:2593 [4294967295] R ==> g:2590 [4294967295]
The task d reports the wrong priority [4294967295]. This happens because
the "int prio" is stored in an unsigned long long val. Although it is
set as a %lld, as int is shorter than unsigned long long,
trace_seq_printf prints it as a positive number.
The fix is just to cast the val as an int, and print it as a %d,
as in the sched:sched_switch tracepoint's "format".
The output with the fix is:
# ./d &
# perf record -e sched:sched_switch -a sleep 1
# perf script
...
swapper 0 [000] 4306.374037: sched:sched_switch: swapper/0:0 [120] R ==> d:10941 [-1]
d 10941 [000] 4306.383823: sched:sched_switch: d:10941 [-1] R ==> swapper/0:0 [120]
[1] d.c
---
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/sched.h>
struct sched_attr {
__u32 size, sched_policy;
__u64 sched_flags;
__s32 sched_nice;
__u32 sched_priority;
__u64 sched_runtime, sched_deadline, sched_period;
};
int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags)
{
return syscall(__NR_sched_setattr, pid, attr, flags);
}
int main(void)
{
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_DEADLINE, /* This creates a 10ms/30ms reservation */
.sched_runtime = 10 * 1000 * 1000,
.sched_period = attr.sched_deadline = 30 * 1000 * 1000,
};
if (sched_setattr(0, &attr, 0) < 0) {
perror("sched_setattr");
return -1;
}
for(;;);
}
---
Committer notes:
Got the program from the provided URL, http://bristot.me/lkml/d.c,
trimmed it and included in the cset log above, so that we have
everything needed to test it in one place.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/866ef75bcebf670ae91c6a96daa63597ba981f0d.1483443552.git.bristot@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>