In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
without checking the return value. bpf_prog_add() can fail since 92117d8443
("bpf: fix refcnt overflow"), so we really must check it. Take the reference
right when we assign it to the rq from priv->xdp_prog, and just drop the
reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.
Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc-6.2.1 gives the following warning:
kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]
The "next" is currently initialized in the while() loop which must have >=1
iterations.
This patch initializes next to get rid of the compiler warning.
Fixes: 3a08c2fd76 ("bpf: LRU List")
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a LRU version of the existing BPF_MAP_TYPE_HASH.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the codes that populate the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute. The map attribute will be added in the later
patch.
While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).
Each percpu LRU is isolated from each other. The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.
Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
1 cpus: 2934082 updates
4 cpus: 7391434 updates
8 cpus: 6500576 updates
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done
1 cpus: 2896553 updates
4 cpus: 9766395 updates
8 cpus: 17460553 updates
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.
* General Thoughts:
1. Target use case. Read is more often than update.
(i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
If bpf_prog does a bpf_lookup_elem() first and then an in-place
update, it still counts as a read operation to the LRU list concern.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
3.1 No lock in read case
3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wise LRU
property. A completely isolated percpu LRU list has the best
performance but the memory utilization is not ideal considering
the work load may be imbalance.
5. Hence, this patch starts the LRU implementation with a global LRU
list with batched operations before accessing the global LRU list.
As a LRU cache, #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
'struct bpf_lru_locallist'. This local list is not used to sort
the LRU property. Instead, the local list is to batch enough
operations before acquiring the lock of the global LRU list. More
details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
map-attribute for scalability reason and for use cases that need to
prepare for the worst (and pathological) case like DoS attack.
The percpu LRU list is completely isolated from each other and the
LRU nodes (including free nodes) cannot be moved across the list. The
following description is for the global LRU list but mostly applicable
to the percpu LRU list also.
* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
global LRU list) at the beginning. The pre-allocation reasoning
is similar to the existing BPF_MAP_TYPE_HASH. However,
opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
the LRU map.
* Active/Inactive List (of the global LRU list):
1. The active list, as its name says it, maintains the active set of
the nodes. We can think of it as the working set or more frequently
accessed nodes. The access frequency is approximated by a ref-bit.
The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says it, maintains a less
active set of nodes. They are the candidates to be removed
from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
The tail of the inactive list is the older nodes and
should be released first if the bpf_htab needs free element.
* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
the global list.
2. The active list is only rotated when the inactive list is running
low. This idea is similar to the current page cache.
Inactive running low is currently defined as
"# of inactive < # of active".
3. The active list rotation always starts from the tail. It moves
node without ref-bit set to the head of the inactive list.
It moves node with ref-bit set back to the head of the active
list and then clears its ref-bit.
4. The inactive rotation is pretty simply.
It walks the inactive list and moves the nodes back to the head of
active list if its ref-bit is set. The ref-bit is cleared after moving
to the active list.
If the node does not have ref-bit set, it just leave it as it is
because it is already in the inactive list.
* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail,
delete the nodes without ref-bit set from bpf_htab.
4. If no free node found after step (3), it will forcefully get
one node from the tail of inactive or active list. Forcefully is
in the sense that it ignores the ref-bit.
* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
batch enough operations before acquiring the lock of the
global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get from the free-list
of (the current CPU local list).
4. If the local free-list is empty, it will acquire from the
global LRU list. The global LRU list can either satisfy it
by its global free-list or by shrinking the global inactive
list. Since we have acquired the global LRU list lock,
it will try to get at most LOCAL_FREE_TARGET elements
to the local free list.
5. When a new element is added to the bpf_htab, it will
first sit at the pending-list (of the local list) first.
The pending-list will be flushed to the global LRU list
when it needs to acquire free nodes from the global list
next time.
* Lock Consideration:
The LRU list has a lock (lru_lock). Each bucket of htab has a
lock (buck_lock). If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.
In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().
We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()
Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':
kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]
There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.
They were introduced in commit 484611357c ("bpf: allow access into map
value arrays") but seem to have never been used.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.
Fixes: a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While commit bb35a6ef7d ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The verifier currently prints raw function ids when printing CALL
instructions or when complaining:
5: (85) call 23
unknown func 23
print a meaningful function name instead:
5: (85) call bpf_redirect#23
unknown func bpf_redirect#23
Moves the function documentation to a single comment and renames all
helpers names in the list to conform to the bpf_ prefix notation so
they can be greped in the kernel source.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A BPF program is required to check the return register of a
map_elem_lookup() call before accessing memory. The verifier keeps
track of this by converting the type of the result register from
PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
jump ensures safety. This check is currently exclusively performed
for the result register 0.
In the event the compiler reorders instructions, BPF_MOV64_REG
instructions may be moved before the conditional jump which causes
them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
verifier objects when the register is accessed:
0: (b7) r1 = 10
1: (7b) *(u64 *)(r10 -8) = r1
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x59c00000
6: (85) call 1
7: (bf) r4 = r0
8: (15) if r0 == 0x0 goto pc+1
R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
9: (7a) *(u64 *)(r4 +0) = 0
R4 invalid mem access 'map_value_or_null'
This commit extends the verifier to keep track of all identical
PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
assigning them an ID and then marking them all when the conditional
jump is observed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
Suppose you have a map array value that is something like this
struct foo {
unsigned iter;
int array[SOME_CONSTANT];
};
You can easily insert this into an array, but you cannot modify the contents of
foo->array[] after the fact. This is because we have no way to verify we won't
go off the end of the array at verification time. This patch provides a start
for this work. We accomplish this by keeping track of a minimum and maximum
value a register could be while we're checking the code. Then at the time we
try to do an access into a MAP_VALUE we verify that the maximum offset into that
region is a valid access into that memory region. So in practice, code such as
this
unsigned index = 0;
if (foo->iter >= SOME_CONSTANT)
foo->iter = index;
else
index = foo->iter++;
foo->array[index] = bar;
would be allowed, as we can verify that index will always be between 0 and
SOME_CONSTANT-1. If you wish to use signed values you'll have to have an extra
check to make sure the index isn't less than 0, or do something like index %=
SOME_CONSTANT.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
put_cpu_var takes the percpu data, not the data returned from
get_cpu_var.
This doesn't change the behavior.
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This prevent future potential pointer leaks when an unprivileged eBPF
program will read a pointer value from its context. Even if
is_valid_access() returns a pointer type, the eBPF verifier replace it
with UNKNOWN_VALUE. The register value that contains a kernel address is
then allowed to leak. Moreover, this fix allows unprivileged eBPF
programs to use functions with (legitimate) pointer arguments.
Not an issue currently since reg_type is only set for PTR_TO_PACKET or
PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
privileged. For now, the only unprivileged eBPF program allowed is for
socket filtering and all the types from its context are UNKNOWN_VALUE.
However, this fix is important for future unprivileged eBPF programs
which could use pointers in their context.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When running as parser interpret BPF_LD | BPF_IMM | BPF_DW
instructions as loading CONST_IMM with the value stored
in imm. The verifier will continue not recognizing those
due to concerns about search space/program complexity
increase.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.
Add ability for external users to invoke the verifier
and provide callbacks to be invoked for every intruction
checked. For now only add most basic callback for
per-instruction pre-interpretation checks is added. More
advanced users may also like to have per-instruction post
callback and state comparison callback.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move verifier's internal structures to a header file and
prefix their names with bpf_ to avoid potential namespace
conflicts. Those structures will soon be used by external
analyzers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>