mirror of
https://github.com/ukui/kernel.git
synced 2026-03-09 10:07:04 -07:00
e795688eeedfab9755727bf42e8b64c31a833ea0
84 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3f50f132d8 |
bpf: Verifier, do explicit ALU32 bounds tracking
It is not possible for the current verifier to track ALU32 and JMP ops
correctly. This can result in the verifier aborting with errors even though
the program should be verifiable. BPF codes that hit this can work around
it by changin int variables to 64-bit types, marking variables volatile,
etc. But this is all very ugly so it would be better to avoid these tricks.
But, the main reason to address this now is do_refine_retval_range() was
assuming return values could not be negative. Once we fixed this code that
was previously working will no longer work. See do_refine_retval_range()
patch for details. And we don't want to suddenly cause programs that used
to work to fail.
The simplest example code snippet that illustrates the problem is likely
this,
53: w8 = w0 // r8 <- [0, S32_MAX],
// w8 <- [-S32_MIN, X]
54: w8 <s 0 // r8 <- [0, U32_MAX]
// w8 <- [0, X]
The expected 64-bit and 32-bit bounds after each line are shown on the
right. The current issue is without the w* bounds we are forced to use
the worst case bound of [0, U32_MAX]. To resolve this type of case,
jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit
32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
from branch_taken logic creating new bounds we can track 32-bit bounds
explicitly.
The next case we observed is ALU ops after the jmp32,
53: w8 = w0 // r8 <- [0, S32_MAX],
// w8 <- [-S32_MIN, X]
54: w8 <s 0 // r8 <- [0, U32_MAX]
// w8 <- [0, X]
55: w8 += 1 // r8 <- [0, U32_MAX+1]
// w8 <- [0, X+1]
In order to keep the bounds accurate at this point we also need to track
ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
ops, mov, add, sub, etc.
Finally there is a question of how and when to merge bounds. The cases
enumerate here,
1. MOV ALU32 - zext 32-bit -> 64-bit
2. MOV ALU64 - copy 64-bit -> 32-bit
3. op ALU32 - zext 32-bit -> 64-bit
4. op ALU64 - n/a
5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off)
6. jmp ALU64 - 32-bit: (>> (<< var64_off))
Details for each case,
For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
bounds correctly. See zext_32_to_64.
For "MOV ALU64" copy all bounds including 32-bit into new register. If
the src register had 32-bit bounds the dst register will as well.
For "op ALU32" zero extend 32-bit into 64-bit the same as move,
see zext_32_to_64.
For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
is done here. Except we have a special case. When RSH or ARSH is
done we can't simply ignore shifting bits from 64-bit reg into the
32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
This will be correct in the sense that they will represent a valid
state of the register. However we could lose some accuracy if an
ARSH is following a jmp32 operation. We can handle this special
case in a follow up series.
For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
special case if 64-bit bounds has zero'd upper 32bits at which point
we can simply copy 32-bit bounds into 64-bit register. This catches
a common compiler trick where upper 32-bits are zeroed and then
32-bit ops are used followed by a 64-bit compare or 64-bit op on
a pointer. See __reg_combine_64_into_32().
For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
counterpart. For example s32_min_value = (s32)reg->smin_value. For
tnum use only the lower 32bits via, (>>(<<var_off)). See
__reg_combine_64_into_32().
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
|
||
|
|
51c39bb1d5 |
bpf: Introduce function-by-function verification
New llvm and old llvm with libbpf help produce BTF that distinguish global and
static functions. Unlike arguments of static function the arguments of global
functions cannot be removed or optimized away by llvm. The compiler has to use
exactly the arguments specified in a function prototype. The argument type
information allows the verifier validate each global function independently.
For now only supported argument types are pointer to context and scalars. In
the future pointers to structures, sizes, pointer to packet data can be
supported as well. Consider the following example:
static int f1(int ...)
{
...
}
int f3(int b);
int f2(int a)
{
f1(a) + f3(a);
}
int f3(int b)
{
...
}
int main(...)
{
f1(...) + f2(...) + f3(...);
}
The verifier will start its safety checks from the first global function f2().
It will recursively descend into f1() because it's static. Then it will check
that arguments match for the f3() invocation inside f2(). It will not descend
into f3(). It will finish f2() that has to be successfully verified for all
possible values of 'a'. Then it will proceed with f3(). That function also has
to be safe for all possible values of 'b'. Then it will start subprog 0 (which
is main() function). It will recursively descend into f1() and will skip full
check of f2() and f3(), since they are global. The order of processing global
functions doesn't affect safety, since all global functions must be proven safe
based on their arguments only.
Such function by function verification can drastically improve speed of the
verification and reduce complexity.
Note that the stack limit of 512 still applies to the call chain regardless whether
functions were static or global. The nested level of 8 also still applies. The
same recursion prevention checks are in place as well.
The type information and static/global kind is preserved after the verification
hence in the above example global function f2() and f3() can be replaced later
by equivalent functions with the same types that are loaded and verified later
without affecting safety of this main() program. Such replacement (re-linking)
of global functions is a subject of future patches.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
|
||
|
|
d2e4c1e6c2 |
bpf: Constant map key tracking for prog array pokes
Add tracking of constant keys into tail call maps. The signature of bpf_tail_call_proto is that arg1 is ctx, arg2 map pointer and arg3 is a index key. The direct call approach for tail calls can be enabled if the verifier asserted that for all branches leading to the tail call helper invocation, the map pointer and index key were both constant and the same. Tracking of map pointers we already do from prior work via |
||
|
|
8c1b6e69dc |
bpf: Compare BTF types of functions arguments with actual types
Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org |
||
|
|
9e15db6613 |
bpf: Implement accurate raw_tp context access via BTF
libbpf analyzes bpf C program, searches in-kernel BTF for given type name and stores it into expected_attach_type. The kernel verifier expects this btf_id to point to something like: typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc); which represents signature of raw_tracepoint "kfree_skb". Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb' and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint. In first case it passes btf_id of 'struct sk_buff *' back to the verifier core and 'void *' in second case. Then the verifier tracks PTR_TO_BTF_ID as any other pointer type. Like PTR_TO_SOCKET points to 'struct bpf_sock', PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on. PTR_TO_BTF_ID points to in-kernel structs. If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF then PTR_TO_BTF_ID#1234 points to one of in kernel skbs. When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)r1 + 32) the btf_struct_access() checks which field of 'struct sk_buff' is at offset 32. Checks that size of access matches type definition of the field and continues to track the dereferenced type. If that field was a pointer to 'struct net_device' the r2's type will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device' in vmlinux's BTF. Such verifier analysis prevents "cheating" in BPF C program. The program cannot cast arbitrary pointer to 'struct sk_buff *' and access it. C compiler would allow type cast, of course, but the verifier will notice type mismatch based on BPF assembly and in-kernel BTF. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org |
||
|
|
8580ac9404 |
bpf: Process in-kernel BTF
If in-kernel BTF exists parse it and prepare 'struct btf *btf_vmlinux' for further use by the verifier. In-kernel BTF is trusted just like kallsyms and other build artifacts embedded into vmlinux. Yet run this BTF image through BTF verifier to make sure that it is valid and it wasn't mangled during the build. If in-kernel BTF is incorrect it means either gcc or pahole or kernel are buggy. In such case disallow loading BPF programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org |
||
|
|
10d274e880 |
bpf: introduce verifier internal test flag
Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain and state pruning. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
dca73a65a6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-06-19 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin. 2) BTF based map definition, from Andrii. 3) support bpf_map_lookup_elem for xskmap, from Jonathan. 4) bounded loops and scalar precision logic in the verifier, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
b5dc0163d8 |
bpf: precise scalar_value tracking
Introduce precision tracking logic that
helps cilium programs the most:
old clang old clang new clang new clang
with all patches with all patches
bpf_lb-DLB_L3.o 1838 2283 1923 1863
bpf_lb-DLB_L4.o 3218 2657 3077 2468
bpf_lb-DUNKNOWN.o 1064 545 1062 544
bpf_lxc-DDROP_ALL.o 26935 23045 166729 22629
bpf_lxc-DUNKNOWN.o 34439 35240 174607 28805
bpf_netdev.o 9721 8753 8407 6801
bpf_overlay.o 6184 7901 5420 4754
bpf_lxc_jit.o 39389 50925 39389 50925
Consider code:
654: (85) call bpf_get_hash_recalc#34
655: (bf) r7 = r0
656: (15) if r8 == 0x0 goto pc+29
657: (bf) r2 = r10
658: (07) r2 += -48
659: (18) r1 = 0xffff8881e41e1b00
661: (85) call bpf_map_lookup_elem#1
662: (15) if r0 == 0x0 goto pc+23
663: (69) r1 = *(u16 *)(r0 +0)
664: (15) if r1 == 0x0 goto pc+21
665: (bf) r8 = r7
666: (57) r8 &= 65535
667: (bf) r2 = r8
668: (3f) r2 /= r1
669: (2f) r2 *= r1
670: (bf) r1 = r8
671: (1f) r1 -= r2
672: (57) r1 &= 255
673: (25) if r1 > 0x1e goto pc+12
R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
674: (67) r1 <<= 1
675: (0f) r0 += r1
At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
R1 has to be precise for later operations on R0 to be validated properly.
The verifier will backtrack the above code in the following way:
last_idx 675 first_idx 664
regs=2 stack=0 before 675: (0f) r0 += r1 // started backtracking R1 regs=2 is a bitmask
regs=2 stack=0 before 674: (67) r1 <<= 1
regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
regs=2 stack=0 before 672: (57) r1 &= 255
regs=2 stack=0 before 671: (1f) r1 -= r2 // now both R1 and R2 has to be precise -> regs=6 mask
regs=6 stack=0 before 670: (bf) r1 = r8 // after this insn R8 and R2 has to be precise
regs=104 stack=0 before 669: (2f) r2 *= r1 // after this one R8, R2, and R1
regs=106 stack=0 before 668: (3f) r2 /= r1
regs=106 stack=0 before 667: (bf) r2 = r8
regs=102 stack=0 before 666: (57) r8 &= 65535
regs=102 stack=0 before 665: (bf) r8 = r7
regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
// this is the end of verifier state. The following regs will be marked precised:
R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
parent didn't have regs=82 stack=0 marks // so backtracking continues into parent state
last_idx 663 first_idx 655
regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0) // R1 was assigned no need to track it further
regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23 // keep tracking R7
regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1 // keep tracking R7
regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
regs=80 stack=0 before 658: (07) r2 += -48
regs=80 stack=0 before 657: (bf) r2 = r10
regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
regs=80 stack=0 before 655: (bf) r7 = r0 // here the assignment into R7
// mark R0 to be precise:
R0_rw=invP(id=0)
parent didn't have regs=1 stack=0 marks // regs=1 -> tracking R0
last_idx 654 first_idx 644
regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
// nothing further to backtrack
Two scalar registers not marked precise are equivalent from state pruning point of view.
More details in the patch comments.
It doesn't support bpf2bpf calls yet and enabled for root only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
||
|
|
2589726d12 |
bpf: introduce bounded loops
Allow the verifier to validate the loops by simulating their execution. Exisiting programs have used '#pragma unroll' to unroll the loops by the compiler. Instead let the verifier simulate all iterations of the loop. In order to do that introduce parentage chain of bpf_verifier_state and 'branches' counter for the number of branches left to explore. See more detailed algorithm description in bpf_verifier.h This algorithm borrows the key idea from Edward Cree approach: https://patchwork.ozlabs.org/patch/877222/ Additional state pruning heuristics make such brute force loop walk practical even for large loops. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
a6cdeeb16b |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
25763b3c86 |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 107 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171438.615055994@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
5327ed3d44 |
bpf: verifier: mark verified-insn with sub-register zext flag
eBPF ISA specification requires high 32-bit cleared when low 32-bit sub-register is written. This applies to destination register of ALU32 etc. JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and AArch64 ISA has the same semantics, so the corresponding JIT back-end doesn't need to do extra work. However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches (PowerPC, SPARC etc) need to do explicit zero extension to meet this requirement, otherwise code like the following will fail. u64_value = (u64) u32_value ... other uses of u64_value This is because compiler could exploit the semantic described above and save those zero extensions for extending u32_value to u64_value, these JIT back-ends are expected to guarantee this through inserting extra zero extensions which however could be a significant increase on the code size. Some benchmarks show there could be ~40% sub-register writes out of total insns, meaning at least ~40% extra code-gen. One observation is these extra zero extensions are not always necessary. Take above code snippet for example, it is possible u32_value will never be casted into a u64, the value of high 32-bit of u32_value then could be ignored and extra zero extension could be eliminated. This patch implements this idea, insns defining sub-registers will be marked when the high 32-bit of the defined sub-register matters. For those unmarked insns, it is safe to eliminate high 32-bit clearnace for them. Algo: - Split read flags into READ32 and READ64. - Record index of insn that does sub-register write. Keep the index inside reg state and update it during verifier insn walking. - A full register read on a sub-register marks its definition insn as needing zero extension on dst register. A new sub-register write overrides the old one. - When propagating read64 during path pruning, also mark any insn defining a sub-register that is read in the pruned path as full-register. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
dc2a4ebc0b |
bpf: convert explored_states to hash table
All prune points inside a callee bpf function most likely will have different callsites. For example, if function foo() is called from two callsites the half of explored states in all prune points in foo() will be useless for subsequent walking of one of those callsites. Fortunately explored_states pruning heuristics keeps the number of states per prune point small, but walking these states is still a waste of cpu time when the callsite of the current state is different from the callsite of the explored state. To improve pruning logic convert explored_states into hash table and use simple insn_idx ^ callsite hash to select hash bucket. This optimization has no effect on programs without bpf2bpf calls and drastically improves programs with calls. In the later case it reduces total memory consumption in 1M scale tests by almost 3 times (peak_states drops from 5752 to 2016). Care should be taken when comparing the states for equivalency. Since the same hash bucket can now contain states with different indices the insn_idx has to be part of verifier_state and compared. Different hash table sizes and different hash functions were explored, but the results were not significantly better vs this patch. They can be improved in the future. Hit/miss heuristic is not counting index miscompare as a miss. Otherwise verifier stats become unstable when experimenting with different hash functions. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
a8f500af0c |
bpf: split explored_states
split explored_states into prune_point boolean mark and link list of explored states. This removes STATE_LIST_MARK hack and allows marks to be separate from states. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
7df737e991 |
bpf: remove global variables
Move three global variables protected by bpf_verifier_lock into 'struct bpf_verifier_env' to allow parallel verification. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
d8eca5bbb2 |
bpf: implement lookup-free direct value access for maps
This generic extension to BPF maps allows for directly loading an address residing inside a BPF map value as a single BPF ldimm64 instruction! The idea is similar to what BPF_PSEUDO_MAP_FD does today, which is a special src_reg flag for ldimm64 instruction that indicates that inside the first part of the double insns's imm field is a file descriptor which the verifier then replaces as a full 64bit address of the map into both imm parts. For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following: the first part of the double insns's imm field is again a file descriptor corresponding to the map, and the second part of the imm field is an offset into the value. The verifier will then replace both imm parts with an address that points into the BPF map value at the given value offset for maps that support this operation. Currently supported is array map with single entry. It is possible to support more than just single map element by reusing both 16bit off fields of the insns as a map index, so full array map lookup could be expressed that way. It hasn't been implemented here due to lack of concrete use case, but could easily be done so in future in a compatible way, since both off fields right now have to be 0 and would correctly denote a map index 0. The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of map pointer versus load of map's value at offset 0, and changing BPF_PSEUDO_MAP_FD's encoding into off by one to differ between regular map pointer and map value pointer would add unnecessary complexity and increases barrier for debugability thus less suitable. Using the second part of the imm field as an offset into the value does /not/ come with limitations since maximum possible value size is in u32 universe anyway. This optimization allows for efficiently retrieving an address to a map value memory area without having to issue a helper call which needs to prepare registers according to calling convention, etc, without needing the extra NULL test, and without having to add the offset in an additional instruction to the value base pointer. The verifier then treats the destination register as PTR_TO_MAP_VALUE with constant reg->off from the user passed offset from the second imm field, and guarantees that this is within bounds of the map value. Any subsequent operations are normally treated as typical map value handling without anything extra needed from verification side. The two map operations for direct value access have been added to array map for now. In future other types could be supported as well depending on the use case. The main use case for this commit is to allow for BPF loader support for global variables that reside in .data/.rodata/.bss sections such that we can directly load the address of them with minimal additional infrastructure required. Loader support has been added in subsequent commits for libbpf library. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
9f4686c41b |
bpf: improve verification speed by droping states
Branch instructions, branch targets and calls in a bpf program are
the places where the verifier remembers states that led to successful
verification of the program.
These states are used to prune brute force program analysis.
For unprivileged programs there is a limit of 64 states per such
'branching' instructions (maximum length is tracked by max_states_per_insn
counter introduced in the previous patch).
Simply reducing this threshold to 32 or lower increases insn_processed
metric to the point that small valid programs get rejected.
For root programs there is no limit and cilium programs can have
max_states_per_insn to be 100 or higher.
Walking 100+ states multiplied by number of 'branching' insns during
verification consumes significant amount of cpu time.
Turned out simple LRU-like mechanism can be used to remove states
that unlikely will be helpful in future search pruning.
This patch introduces hit_cnt and miss_cnt counters:
hit_cnt - this many times this state successfully pruned the search
miss_cnt - this many times this state was not equivalent to other states
(and that other states were added to state list)
The heuristic introduced in this patch is:
if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
/* drop this state from future considerations */
Higher numbers increase max_states_per_insn (allow more states to be
considered for pruning) and slow verification speed, but do not meaningfully
reduce insn_processed metric.
Lower numbers drop too many states and insn_processed increases too much.
Many different formulas were considered.
This one is simple and works well enough in practice.
(the analysis was done on selftests/progs/* and on cilium programs)
The end result is this heuristic improves verification speed by 10 times.
Large synthetic programs that used to take a second more now take
1/10 of a second.
In cases where max_states_per_insn used to be 100 or more, now it's ~10.
There is a slight increase in insn_processed for cilium progs:
before after
bpf_lb-DLB_L3.o 1831 1838
bpf_lb-DLB_L4.o 3029 3218
bpf_lb-DUNKNOWN.o 1064 1064
bpf_lxc-DDROP_ALL.o 26309 26935
bpf_lxc-DUNKNOWN.o 33517 34439
bpf_netdev.o 9713 9721
bpf_overlay.o 6184 6184
bpf_lcx_jit.o 37335 39389
And 2-3 times improvement in the verification speed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
||
|
|
06ee7115b0 |
bpf: add verifier stats and log_level bit 2
In order to understand the verifier bottlenecks add various stats and extend log_level: log_level 1 and 2 are kept as-is: bit 0 - level=1 - print every insn and verifier state at branch points bit 1 - level=2 - print every insn and verifier state at every insn bit 2 - level=4 - print verifier error and stats at the end of verification When verifier rejects the program the libbpf is trying to load the program twice. Once with log_level=0 (no messages, only error code is reported to user space) and second time with log_level=1 to tell the user why the verifier rejected it. With introduction of bit 2 - level=4 the libbpf can choose to always use that level and load programs once, since the verification speed is not affected and in case of error the verbose message will be available. Note that the verifier stats are not part of uapi just like all other verbose messages. They're expected to change in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
1b98658968 |
bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
can still be accessed after bpf_sk_release(sk).
Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
This patch addresses them together.
A simple reproducer looks like this:
sk = bpf_sk_lookup_tcp();
/* if (!sk) ... */
tp = bpf_tcp_sock(sk);
/* if (!tp) ... */
bpf_sk_release(sk);
snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
The problem is the verifier did not scrub the register's states of
the tcp_sock ptr (tp) after bpf_sk_release(sk).
[ Note that when calling bpf_tcp_sock(sk), the sk is not always
refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
fine for this case. ]
Currently, the verifier does not track if a helper's return ptr (in REG_0)
is "carry"-ing one of its argument's refcount status. To carry this info,
the reg1->id needs to be stored in reg0.
One approach was tried, like "reg0->id = reg1->id", when calling
"bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id"
for the same reg. However, overlapping the NULL marking and ref
tracking purpose in one "id" does not work well:
ref_sk = bpf_sk_lookup_tcp();
fullsock = bpf_sk_fullsock(ref_sk);
tp = bpf_tcp_sock(ref_sk);
if (!fullsock) {
bpf_sk_release(ref_sk);
return 0;
}
/* fullsock_reg->id is marked for NOT-NULL.
* Same for tp_reg->id because they have the same id.
*/
/* oops. verifier did not complain about the missing !tp check */
snd_cwnd = tp->snd_cwnd;
Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
scrub all reg states which has a ref_obj_id match. It is done with the
changes in release_reg_references() in this patch.
While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
bpf_sk_fullsock() to avoid these helpers from returning
another ptr. It will make bpf_sk_release(tp) possible:
sk = bpf_sk_lookup_tcp();
/* if (!sk) ... */
tp = bpf_tcp_sock(sk);
/* if (!tp) ... */
bpf_sk_release(tp);
A separate helper "bpf_get_listener_sock()" will be added in a later
patch to do sk_to_full_sk().
Misc change notes:
- To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET
is removed from bpf.h since no helper is using it.
- arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
take ARG_PTR_TO_SOCK_COMMON.
- check_refcount_ok() ensures is_acquire_function() cannot take
arg_type_may_be_refcounted() as its argument.
- The check_func_arg() can only allow one refcount-ed arg. It is
guaranteed by check_refcount_ok() which ensures at most one arg can be
refcounted. Hence, it is a verifier internal error if >1 refcount arg
found in check_func_arg().
- In release_reference(), release_reference_state() is called
first to ensure a match on "reg->ref_obj_id" can be found before
scrubbing the reg states with release_reg_references().
- reg_is_refcounted() is no longer needed.
1. In mark_ptr_or_null_regs(), its usage is replaced by
"ref_obj_id && ref_obj_id == id" because,
when is_null == true, release_reference_state() should only be
called on the ref_obj_id obtained by a acquire helper (i.e.
is_acquire_function() == true). Otherwise, the following
would happen:
sk = bpf_sk_lookup_tcp();
/* if (!sk) { ... } */
fullsock = bpf_sk_fullsock(sk);
if (!fullsock) {
/*
* release_reference_state(fullsock_reg->ref_obj_id)
* where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
*
* Hence, the following bpf_sk_release(sk) will fail
* because the ref state has already been released in the
* earlier release_reference_state(fullsock_reg->ref_obj_id).
*/
bpf_sk_release(sk);
}
2. In release_reg_references(), the current reg_is_refcounted() call
is unnecessary because the id check is enough.
- The type_is_refcounted() and type_is_refcounted_or_null()
are no longer needed also because reg_is_refcounted() is removed.
Fixes:
|
||
|
|
d83525ca62 |
bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.
Example:
struct hash_elem {
int cnt;
struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
bpf_spin_lock(&val->lock);
val->cnt++;
bpf_spin_unlock(&val->lock);
}
Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
It drastically simplifies implementation yet allows bpf program to use
any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
insufficient preemption checks
Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
Other solutions to avoid nested bpf_spin_lock are possible.
Like making sure that all networking progs run with softirq disabled.
spin_lock_irqsave is the simplest and doesn't add overhead to the
programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
extra flag during map_create, but specifying it via BTF is cleaner.
It provides introspection for map key/value and reduces user mistakes.
Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
to request kernel to grab bpf_spin_lock before rewriting the value.
That will serialize access to map elements.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
||
|
|
08ca90afba |
bpf: notify offload JITs about optimizations
Let offload JITs know when instructions are replaced and optimized out, so they can update their state appropriately. The optimizations are best effort, if JIT returns an error from any callback verifier will stop notifying it as state may now be out of sync, but the verifier continues making progress. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
9e4c24e7ee |
bpf: verifier: record original instruction index
The communication between the verifier and advanced JITs is based on instruction indexes. We have to keep them stable throughout the optimizations otherwise referring to a particular instruction gets messy quickly. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
d3bd7413e0 |
bpf: fix sanitation of alu op with pointer / scalar type from different paths
While |
||
|
|
979d63d50c |
bpf: prevent out of bounds speculation on pointer arithmetic
Jann reported that the original commit back in |