Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
used by all host JIT targets to get jited image and it's length. While for
offload, targets are likely to have different offload mechanisms that these
info are kept in device private data fields.
Therefore, BPF_OBJ_GET_INFO_BY_FD syscall needs an unified way to get JIT
length and contents info for offload targets.
One way is to introduce new callback to parse device private data then fill
those fields in bpf_prog_info. This might be a little heavy, the other way
is to add generic fields which will be initialized by all offload targets.
This patch follow the second approach to introduce two new fields in
struct bpf_dev_offload and teach bpf_prog_get_info_by_fd about them to fill
correct jited_prog_len and jited_prog_insns in bpf_prog_info.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF map offload follow similar path to program offload. At creation
time users may specify ifindex of the device on which they want to
create the map. Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation. Map will have an empty set of operations
associated with it (save for alloc and free callbacks). The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures. Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.
Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held. Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns). Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.
All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
.map_alloc callbacks contain a number of checks validating user-
-provided map attributes against constraints of a particular map
type. For offloaded maps we will need to check map attributes
without actually allocating any memory on the host. Add a new
callback for validating attributes before any memory is allocated.
This callback can be selectively implemented by map types for
sharing code with offloads, or simply to separate the logical
steps of validation and allocation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF alignment tests got a conflict because the registers
are output as Rn_w instead of just Rn in net-next, and
in net a fixup for a testcase prohibits logical operations
on pointers before using them.
Also, we should attempt to patch BPF call args if JIT always on is
enabled. Instead, if we fail to JIT the subprogs we should pass
an error back up and fail immediately.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-01-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Prevent out-of-bounds speculation in BPF maps by masking the
index after bounds checks in order to fix spectre v1, and
add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
removing the BPF interpreter from the kernel in favor of
JIT-only mode to make spectre v2 harder, from Alexei.
2) Remove false sharing of map refcount with max_entries which
was used in spectre v1, from Daniel.
3) Add a missing NULL psock check in sockmap in order to fix
a race, from John.
4) Fix test_align BPF selftest case since a recent change in
verifier rejects the bit-wise arithmetic on pointers
earlier but test_align update was missing, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to commit b2157399cc ("bpf: prevent out-of-bounds
speculation") also change the layout of struct bpf_map such that
false sharing of fast-path members like max_entries is avoided
when the maps reference counter is altered. Therefore enforce
them to be placed into separate cachelines.
pahole dump after change:
struct bpf_map {
const struct bpf_map_ops * ops; /* 0 8 */
struct bpf_map * inner_map_meta; /* 8 8 */
void * security; /* 16 8 */
enum bpf_map_type map_type; /* 24 4 */
u32 key_size; /* 28 4 */
u32 value_size; /* 32 4 */
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
u32 pages; /* 44 4 */
u32 id; /* 48 4 */
int numa_node; /* 52 4 */
bool unpriv_array; /* 56 1 */
/* XXX 7 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct user_struct * user; /* 64 8 */
atomic_t refcnt; /* 72 4 */
atomic_t usercnt; /* 76 4 */
struct work_struct work; /* 80 32 */
char name[16]; /* 112 16 */
/* --- cacheline 2 boundary (128 bytes) --- */
/* size: 128, cachelines: 2, members: 17 */
/* sum members: 121, holes: 1, sum holes: 7 */
};
Now all entries in the first cacheline are read only throughout
the life time of the map, set up once during map creation. Overall
struct size and number of cachelines doesn't change from the
reordering. struct bpf_map is usually first member and embedded
in map structs in specific map implementations, so also avoid those
members to sit at the end where it could potentially share the
cacheline with first map values e.g. in the array since remote
CPUs could trigger map updates just as well for those (easily
dirtying members like max_entries intentionally as well) while
having subsequent values in cache.
Quoting from Google's Project Zero blog [1]:
Additionally, at least on the Intel machine on which this was
tested, bouncing modified cache lines between cores is slow,
apparently because the MESI protocol is used for cache coherence
[8]. Changing the reference counter of an eBPF array on one
physical CPU core causes the cache line containing the reference
counter to be bounced over to that CPU core, making reads of the
reference counter on all other CPU cores slow until the changed
reference counter has been written back to memory. Because the
length and the reference counter of an eBPF array are stored in
the same cache line, this also means that changing the reference
counter on one physical CPU core causes reads of the eBPF array's
length to be slow on other physical CPU cores (intentional false
sharing).
While this doesn't 'control' the out-of-bounds speculation through
masking the index as in commit b2157399cc, triggering a manipulation
of the map's reference counter is really trivial, so lets not allow
to easily affect max_entries from it.
Splitting to separate cachelines also generally makes sense from
a performance perspective anyway in that fast-path won't have a
cache miss if the map gets pinned, reused in other progs, etc out
of control path, thus also avoids unintentional false sharing.
[1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.
To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.
Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.
If prog_array map is created by unpriv user replace
bpf_tail_call(ctx, map, index);
with
if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
}
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.
Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.
That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.
v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.
Cc: stable@vger.kernel.org # v4.14
Fixes: 98589a0998 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The sockmap infrastructure is only aware of TCP sockets at the
moment. In the future we plan to add UDP. In both cases CONFIG_NET
should be built-in.
So lets only build sockmap if CONFIG_INET is enabled.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Report to the user ifindex and namespace information of offloaded
programs. If device has disappeared return -ENODEV. Specify the
namespace using dev/inode combination.
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.
Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish. This design decision was made when rtnl lock was providing
all the locking. Use the read/write lock instead and remove the
workqueue.
Verifier will now call into the offload code, so dev_ops are moved
to offload structure. Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We currently use aux->offload to indicate that program is bound
to a specific device. This forces us to keep the offload structure
around even after the device is gone. Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Typical JIT does several passes over bpf instructions to
compute total size and relative offsets of jumps and calls.
With multitple bpf functions calling each other all relative calls
will have invalid offsets intially therefore we need to additional
last pass over the program to emit calls with correct offsets.
For example in case of three bpf functions:
main:
call foo
call bpf_map_lookup
exit
foo:
call bar
exit
bar:
exit
We will call bpf_int_jit_compile() indepedently for main(), foo() and bar()
x64 JIT typically does 4-5 passes to converge.
After these initial passes the image for these 3 functions
will be good except call targets, since start addresses of
foo() and bar() are unknown when we were JITing main()
(note that call bpf_map_lookup will be resolved properly
during initial passes).
Once start addresses of 3 functions are known we patch
call_insn->imm to point to right functions and call
bpf_int_jit_compile() again which needs only one pass.
Additional safety checks are done to make sure this
last pass doesn't produce image that is larger or smaller
than previous pass.
When constant blinding is on it's applied to all functions
at the first pass, since doing it once again at the last
pass can change size of the JITed code.
Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
x64 jits bpf-to-bpf calls correctly while arm64 falls back to interpreter.
All other JITs that support normal BPF_CALL will behave the same way
since bpf-to-bpf call is equivalent to bpf-to-kernel call from
JITs point of view.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
though bpf_call is still the same call instruction and
calling convention 'bpf to bpf' and 'bpf to helper' is the same
the interpreter has to oparate on 'struct bpf_insn *'.
To distinguish these two cases add a kernel internal opcode and
mark call insns with it.
This opcode is seen by interpreter only. JITs will never see it.
Also add tiny bit of debug code to aid interpreter debugging.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit f371b304f1 ("bpf/tracing: allow user space to
query prog array on the same tp") introduced a perf
ioctl command to query prog array attached to the
same perf tracepoint. The commit introduced a
compilation error under certain config conditions, e.g.,
(1). CONFIG_BPF_SYSCALL is not defined, or
(2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
nor CONFIG_KPROBE_EVENTS is defined.
Error message:
kernel/events/core.o: In function `perf_ioctl':
core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
This patch fixed this error by guarding the real definition under
CONFIG_BPF_EVENTS and provided static inline dummy function
if CONFIG_BPF_EVENTS was not defined.
It renamed the function from bpf_event_query_prog_array to
perf_event_query_prog_array and moved the definition from linux/bpf.h
to linux/trace_events.h so the definition is in proximity to
other prog_array related functions.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Using BPF we can override kprob'ed functions and return arbitrary
values. Obviously this can be a bit unsafe, so make this feature opt-in
for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit e87c6bc385 ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32 ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
struct perf_event_query_bpf *query =
malloc(sizeof(*query) + sizeof(u32) * ids_len);
query.ids_len = ids_len;
err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
if (err == 0) {
/* query.prog_cnt is the number of available progs,
* number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt
*/
} else if (errno == ENOSPC) {
/* query.ids_len number of progs copied,
* query.prog_cnt is the number of available progs
*/
} else {
/* other errors */
}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With the current ARG_PTR_TO_MEM/ARG_PTR_TO_UNINIT_MEM semantics, an helper
argument can be NULL when the next argument type is ARG_CONST_SIZE_OR_ZERO
and the verifier can prove the value of this next argument is 0. However,
most helpers are just interested in handling <!NULL, 0>, so forcing them to
deal with <NULL, 0> makes the implementation of those helpers more
complicated for no apparent benefits, requiring them to explicitly handle
those corner cases with checks that bpf programs could start relying upon,
preventing the possibility of removing them later.
Solve this by making ARG_PTR_TO_MEM/ARG_PTR_TO_UNINIT_MEM never accept NULL
even when ARG_CONST_SIZE_OR_ZERO is set, and introduce a new argument type
ARG_PTR_TO_MEM_OR_NULL to explicitly deal with the NULL case.
Currently, the only helper that needs this is bpf_csum_diff_proto(), so
change arg1 and arg3 to this new type as well.
Also add a new battery of tests that explicitly test the
!ARG_PTR_TO_MEM_OR_NULL combination: all the current ones testing the
various <NULL, 0> variations are focused on bpf_csum_diff, so cover also
other helpers.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reverts commit bd601b6ada ("bpf: report offload info to user
space"). The ifindex by itself is not sufficient, we should provide
information on which network namespace this ifindex belongs to.
After considering some options we concluded that it's best to just
remove this API for now, and rework it in -next.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>