Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.
We also don't need __rcu annotations on cgroup_bpf.inactive since
it's not read/updated concurrently.
v4:
* drop cgroup_rcu_xyz wrappers and use rcu APIs directly; presumably
should be more clear to understand which mutex/refcount protects
each particular place
v3:
* amend cgroup_rcu_dereference to include percpu_ref_is_dying;
cgroup_bpf is now reference counted and we don't hold cgroup_mutex
anymore in cgroup_bpf_release
v2:
* replace xchg with rcu_swap_protected
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Drop __rcu annotations and rcu read sections from bpf_prog_array
helper functions. They are not needed since all existing callers
call those helpers from the rcu update side while holding a mutex.
This guarantees that use-after-free could not happen.
In the next patches I'll fix the callers with missing
rcu_dereference_protected to make sparse/lockdep happy, the proper
way to use these helpers is:
struct bpf_prog_array __rcu *progs = ...;
struct bpf_prog_array *p;
mutex_lock(&mtx);
p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
bpf_prog_array_length(p);
bpf_prog_array_copy_to_user(p, ...);
bpf_prog_array_delete_safe(p, ...);
bpf_prog_array_copy_info(p, ...);
bpf_prog_array_copy(p, ...);
bpf_prog_array_free(p);
mutex_unlock(&mtx);
No functional changes! rcu_dereference_protected with lockdep_is_held
should catch any cases where we update prog array without a mutex
(I've looked at existing call sites and I think we hold a mutex
everywhere).
Motivation is to fix sparse warnings:
kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
kernel/bpf/core.c:1803:9: expected struct callback_head *head
kernel/bpf/core.c:1803:9: got struct callback_head [noderef] <asn:4> *
kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
kernel/bpf/core.c:1877:44: expected struct bpf_prog_array_item *item
kernel/bpf/core.c:1877:44: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1901:26: expected struct bpf_prog_array_item *existing
kernel/bpf/core.c:1901:26: got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1935:26: expected struct bpf_prog_array_item *[assigned] existing
kernel/bpf/core.c:1935:26: got struct bpf_prog_array_item [noderef] <asn:4> *
v2:
* remove comment about potential race; that can't happen
because all callers are in rcu-update section
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull perf/urgent fixes:
BPF:
Jiri Olsa:
- Fixup determination of end of kernel map, to avoid having BPF programs,
that are after the kernel headers and just before module texts mixed up in
the kernel map.
tools UAPI header copies:
Arnaldo Carvalho de Melo:
- Update copy of files related to new fspick, fsmount, fsconfig, fsopen,
move_mount and open_tree syscalls.
- Sync cpufeatures.h, sched.h, fs.h, drm.h, i915_drm.h and kvm.h headers.
Namespaces:
Namhyung Kim:
- Add missing byte swap ops for namespace events when processing records from
perf.data files that could have been recorded in a arch with a different
endianness.
- Fix access to the thread namespaces list by using the namespaces_lock.
perf data:
Shawn Landden:
- Fix 'strncat may truncate' build failure with recent gcc.
s/390
Thomas Richter:
- Fix s390 missing module symbol and warning for non-root users in 'perf record'.
arm64:
Vitaly Chikunov:
- Fix mksyscalltbl when system kernel headers are ahead of the kernel.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to user.
This patch did an early check in the helper itself to notify
user invalid signal, as suggested by Daniel.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
On systems with ACPI platform firmware the last stage of hibernation
is analogous to system suspend to S3 (suspend-to-RAM), so it should
be handled analogously. In particular, pm_suspend_via_firmware()
should return 'true' in that stage to let the callers of it know that
control will be passed to the platform firmware going forward, so
pm_set_suspend_via_firmware() needs to be called then in analogy with
acpi_suspend_begin().
However, the platform hibernation ->begin() callback is invoked
during the "freeze" transition (before creating a snapshot image of
system memory) as well as during the "hibernate" transition which is
the last stage of it and pm_set_suspend_via_firmware() should be
invoked by that callback in the latter stage only.
In order to implement that redefine the hibernation ->begin()
callback to take a pm_message_t argument to indicate which stage
of hibernation is taking place and rework acpi_hibernation_begin()
and acpi_hibernation_begin_old() to take it into account as needed.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull tracing warning fix from Steven Rostedt:
"Make the GCC 9 warning for sub struct memset go away.
GCC 9 now warns about calling memset() on partial structures when it
goes across multiple fields. This adds a helper for the place in
tracing that does this type of clearing of a structure"
* tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Silence GCC 9 array bounds warning
Starting with GCC 9, -Warray-bounds detects cases when memset is called
starting on a member of a struct but the size to be cleared ends up
writing over further members.
Such a call happens in the trace code to clear, at once, all members
after and including `seq` on struct trace_iterator:
In function 'memset',
inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
./include/linux/string.h:344:9: warning: '__builtin_memset' offset
[8505, 8560] from the object at 'iter' is out of the bounds of
referenced subobject 'seq' with type 'struct trace_seq' at offset
4368 [-Warray-bounds]
344 | return __builtin_memset(p, c, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to avoid GCC complaining about it, we compute the address
ourselves by adding the offsetof distance instead of referring
directly to the member.
Since there are two places doing this clear (trace.c and trace_kdb.c),
take the chance to move the workaround into a single place in
the internal header.
Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[ Removed unnecessary parenthesis around "iter" ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull tracing fixes from Steven Rostedt:
"Tom Zanussi sent me some small fixes and cleanups to the histogram
code and I forgot to incorporate them.
I also added a small clean up patch that was sent to me a while ago
and I just noticed it"
* tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kernel/trace/trace.h: Remove duplicate header of trace_seq.h
tracing: Add a check_val() check before updating cond_snapshot() track_val
tracing: Check keys for variable references in expressions too
tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
This patch randomizes high 32-bit of a definition when BPF_F_TEST_RND_HI32
is set.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
x86_64 and AArch64 perhaps are two arches that running bpf testsuite
frequently, however the zero extension insertion pass is not enabled for
them because of their hardware support.
It is critical to guarantee the pass correction as it is supposed to be
enabled at default for a couple of other arches, for example PowerPC,
SPARC, arm, NFP etc. Therefore, it would be very useful if there is a way
to test this pass on for example x86_64.
The test methodology employed by this set is "poisoning" useless bits. High
32-bit of a definition is randomized if it is identified as not used by any
later insn. Such randomization is only enabled under testing mode which is
gated by the new bpf prog load flags "BPF_F_TEST_RND_HI32".
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After previous patches, verifier will mark a insn if it really needs zero
extension on dst_reg.
It is then for back-ends to decide how to use such information to eliminate
unnecessary zero extension code-gen during JIT compilation.
One approach is verifier insert explicit zero extension for those insns
that need zero extension in a generic way, JIT back-ends then do not
generate zero extension for sub-register write at default.
However, only those back-ends which do not have hardware zero extension
want this optimization. Back-ends like x86_64 and AArch64 have hardware
zero extension support that the insertion should be disabled.
This patch introduces new target hook "bpf_jit_needs_zext" which returns
false at default, meaning verifier zero extension insertion is disabled at
default. A back-end could override this hook to return true if it doesn't
have hardware support and want verifier insert zero extension explicitly.
Offload targets do not use this native target hook, instead, they could
get the optimization results using bpf_prog_offload_ops.finalize.
NOTE: arches could have diversified features, it is possible for one arch
to have hardware zero extension support for some sub-register write insns
but not for all. For example, PowerPC, SPARC have zero extended loads, but
not for alu32. So when verifier zero extension insertion enabled, these JIT
back-ends need to peephole insns to remove those zero extension inserted
for insn that actually has hardware zero extension support. The peephole
could be as simple as looking the next insn, if it is a special zero
extension insn then it is safe to eliminate it if the current insn has
hardware zero extension support.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Patched insns do not go through generic verification, therefore doesn't has
zero extension information collected during insn walking.
We don't bother analyze them at the moment, for any sub-register def comes
from them, just conservatively mark it as needing zero extension.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and
AArch64 ISA has the same semantics, so the corresponding JIT back-end
doesn't need to do extra work.
However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
(PowerPC, SPARC etc) need to do explicit zero extension to meet this
requirement, otherwise code like the following will fail.
u64_value = (u64) u32_value
... other uses of u64_value
This is because compiler could exploit the semantic described above and
save those zero extensions for extending u32_value to u64_value, these JIT
back-ends are expected to guarantee this through inserting extra zero
extensions which however could be a significant increase on the code size.
Some benchmarks show there could be ~40% sub-register writes out of total
insns, meaning at least ~40% extra code-gen.
One observation is these extra zero extensions are not always necessary.
Take above code snippet for example, it is possible u32_value will never be
casted into a u64, the value of high 32-bit of u32_value then could be
ignored and extra zero extension could be eliminated.
This patch implements this idea, insns defining sub-registers will be
marked when the high 32-bit of the defined sub-register matters. For
those unmarked insns, it is safe to eliminate high 32-bit clearnace for
them.
Algo:
- Split read flags into READ32 and READ64.
- Record index of insn that does sub-register write. Keep the index inside
reg state and update it during verifier insn walking.
- A full register read on a sub-register marks its definition insn as
needing zero extension on dst register.
A new sub-register write overrides the old one.
- When propagating read64 during path pruning, also mark any insn defining
a sub-register that is read in the pruned path as full-register.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pule more SPDX updates from Greg KH:
"Here is another set of reviewed patches that adds SPDX tags to
different kernel files, based on a set of rules that are being used to
parse the comments to try to determine that the license of the file is
"GPL-2.0-or-later".
Only the "obvious" versions of these matches are included here, a
number of "non-obvious" variants of text have been found but those
have been postponed for later review and analysis.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers"
* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
...
This patch tries to solve the following specific use case.
Currently, bpf program can already collect stack traces
through kernel function get_perf_callchain()
when certain events happens (e.g., cache miss counter or
cpu clock counter overflows). But such stack traces are
not enough for jitted programs, e.g., hhvm (jited php).
To get real stack trace, jit engine internal data structures
need to be traversed in order to get the real user functions.
bpf program itself may not be the best place to traverse
the jit engine as the traversing logic could be complex and
it is not a stable interface either.
Instead, hhvm implements a signal handler,
e.g. for SIGALARM, and a set of program locations which
it can dump stack traces. When it receives a signal, it will
dump the stack in next such program location.
Such a mechanism can be implemented in the following way:
. a perf ring buffer is created between bpf program
and tracing app.
. once a particular event happens, bpf program writes
to the ring buffer and the tracing app gets notified.
. the tracing app sends a signal SIGALARM to the hhvm.
But this method could have large delays and causing profiling
results skewed.
This patch implements bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in intended stack traces.
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The kernel test robot has reported that the use of __this_cpu_add()
causes bug messages like:
BUG: using __this_cpu_add() in preemptible [00000000] code: ...
Given the imprecise nature of the count and the possibility of resetting
the count and doing the measurement again, this is not really a big
problem to use the unprotected __this_cpu_*() functions.
To make the preemption checking code happy, the this_cpu_*() functions
will be used if CONFIG_DEBUG_PREEMPT is defined.
The imprecise nature of the locking counts are also documented with
the suggestion that we should run the measurement a few times with the
counts reset in between to get a better picture of what is going on
under the hood.
Fixes: a8654596f0 ("locking/rwsem: Enable lock event counting")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public licence as published by
the free software foundation either version 2 of the licence or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 114 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170857.552531963@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In perf_output_put_handle(), an IRQ/NMI can happen in below location and
write records to the same ring buffer:
...
local_dec_and_test(&rb->nest)
... <-- an IRQ/NMI can happen here
rb->user_page->data_head = head;
...
In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ. And A > B. As a result,
data_head is temporarily decreased from A to B. And a reader may see
data_head < data_tail if it read the buffer frequently enough, which
creates unexpected behaviors.
This can be fixed by moving dec(&rb->nest) to after updating data_head,
which prevents the IRQ/NMI above from updating data_head.
[ Split up by peterz. ]
Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Fixes: ef60777c9a ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>