Merge branch 'bpf-ring-buffer'

Andrii Nakryiko says:

====================
Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]).
It presents an alternative to perf buffer, following its semantics closely,
but allowing sharing same instance of ring buffer across multiple CPUs
efficiently.

Most patches have extensive commentary explaining various aspects, so I'll
keep cover letter short. Overall structure of the patch set:
- patch #1 adds BPF ring buffer implementation to kernel and necessary
  verifier support;
- patch #2 adds libbpf consumer implementation for BPF ringbuf;
- patch #3 adds selftest, both for single BPF ring buf use case, as well as
  using it with array/hash of maps;
- patch #4 adds extensive benchmarks and provides some analysis in commit
  message, it builds upon selftests/bpf's bench runner;
- patch #5 adds most of patch #1 commit message as a doc under
  Documentation/bpf/ringbuf.rst.

Litmus tests, validating consumer/producer protocols and memory orderings,
were moved out as discussed in [1] and are going to be posted against -rcu
tree and put under Documentation/litmus-tests/bpf-rb.

  [0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w
  [1] https://lkml.org/lkml/2020/5/22/1011

v3->v4:
- fix ringbuf freeing (vunmap, __free_page); verified with a trivial loop
  creating and closing ringbuf map endlessly (Daniel);

v2->v3:
- dropped unnecessary smp_wmb() (Paul);
- verifier reference type enhancement patch was dropped (Alexei);
- better verifier message for various memory access checks (Alexei);
- clarified a bit roundup_len() bit shifting (Alexei);
- converted doc to .rst (Alexei);
- fixed warning on 32-bit arches regarding tautological ring area size check.

v1->v2:
- commit()/discard()/output() accept flags (NO_WAKEUP/FORCE_WAKEUP) (Stanislav);
- bpf_ringbuf_query() added, returning available data size, ringbuf size,
  consumer/producer positions, needed to implement smarter notification policy
  (Stanislav);
- added ringbuf UAPI constants to include/uapi/linux/bpf.h (Jonathan);
- fixed sample size check, added proper ringbuf size check (Jonathan, Alexei);
- wake_up_all() is done through irq_work (Alexei);
- consistent use of smp_load_acquire/smp_store_release, no
  READ_ONCE/WRITE_ONCE (Alexei);
- added Documentation/bpf/ringbuf.txt (Stanislav);
- updated litmus test with smp_load_acquire/smp_store_release changes;
- added ring_buffer__consume() API to libbpf for busy-polling;
- ring_buffer__poll() on success returns number of records consumed;
- fixed EPOLL notifications, don't assume available data, done similarly to
  perfbuf's implementation;
- both ringbuf and perfbuf now have --rb-sampled mode, instead of
  pb-raw/pb-custom mode, updated benchmark results;
- extended ringbuf selftests to validate epoll logic/manual notification
  logic, as well as bpf_ringbuf_query().
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann
2020-05-29 17:11:09 +02:00
committed by Alexei Starovoitov
35 changed files with 2630 additions and 72 deletions
+209
@@ -0,0 +1,209 @@
===============
BPF ring buffer
===============
This document describes BPF ring buffer design, API, and implementation details.
.. contents::
    :local:
    :depth: 2
Motivation
----------
Two distinct needs motivated this work. Neither is satisfied by the existing
perf buffer, which prompted the creation of a new ring buffer implementation:
- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across
multiple CPUs (e.g., fork/exec/exit events for a task).
These two problems are independent, but perf buffer solves neither of them.
Both are a result of the choice to have a per-CPU perf ring buffer, and both
can be solved by a multi-producer, single-consumer (MPSC) ring buffer
implementation. The ordering problem could technically be solved for perf
buffer with some in-kernel counting, but given that the first one requires an
MPSC buffer, the same solution solves the second problem automatically.
Semantics and APIs
------------------
A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.
One way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but not
enforce the "same CPU only" rule. This would be a more familiar interface,
compatible with existing perf buffer use in BPF, but would fail if an
application needed more advanced logic to look up a ring buffer by arbitrary
key. ``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with the current approach.
Additionally, given the performance of BPF ringbuf, many use cases would just
opt into a simple single ring buffer shared among all CPUs, for which current
approach would be an overkill.
Another approach could introduce a new concept, alongside BPF map, to represent
generic "container" object, which doesn't necessarily have key/value interface
with lookup/update/delete operations. This approach would add a lot of extra
infrastructure that has to be built for observability and verifier support. It
would also add another concept that BPF developers would have to familiarize
themselves with, new syntax in libbpf, etc. Yet it would provide no
additional benefits over the approach of using a map. ``BPF_MAP_TYPE_RINGBUF``
doesn't support lookup/update/delete operations, but neither do a few other map
types (e.g., queue and stack; array doesn't support delete, etc).
The approach chosen has an advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
familiar concept (no need to teach users a new type of object in BPF program),
and utilizing existing tooling (bpftool). For the common scenario of using
a single ring buffer for all CPUs, it's as simple and straightforward as it
would be with a dedicated "container" object. On the other hand, by being
a map, it can be
combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to implement
a wide variety of topologies, from one ring buffer for each CPU (e.g., as
a replacement for perf buffer use cases), to a complicated application
hashing/sharding of ring buffers (e.g., having a small pool of ring buffers
with hashed task's tgid being a look up key to preserve order, but reduce
contention).
Key and value sizes are enforced to be zero. ``max_entries`` is used to specify
the size of ring buffer and has to be a power of 2 value.
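The sizing rule can be illustrated with a small userspace sketch. This is not the kernel's validation code; ``ringbuf_size_ok()`` and ``ringbuf_mask()`` are hypothetical names chosen for illustration. It only shows why a power-of-2 size is convenient: a logical position reduces to a data offset with a single mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if size is a non-zero power of 2. */
static bool ringbuf_size_ok(uint32_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/* Hypothetical helper: mask turning a logical position into a data offset.
 * Only meaningful when ringbuf_size_ok(size) holds.
 */
static uint32_t ringbuf_mask(uint32_t size)
{
	return size - 1;
}
```

With a size of 4096, position 5000 maps to offset ``5000 & ringbuf_mask(4096)``, i.e. 904.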
There are a bunch of similarities between perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and new BPF ring buffer semantics:
- variable-length records;
- if there is no more space left in ring buffer, reservation fails, no
blocking;
- memory-mappable data area for user-space applications for ease of
consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data to achieve the
lowest latency, if necessary.
BPF ringbuf provides two sets of APIs to BPF programs:
- ``bpf_ringbuf_output()`` allows *copying* data from one place to a ring
buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()``
APIs split the whole process into two steps. First, a fixed amount of space
is reserved. If successful, a pointer to the data inside ring buffer data
area is returned, which BPF programs can use similarly to the data inside
array/hash maps. Once ready, this piece of memory is either committed (via
``bpf_ringbuf_submit()``) or discarded. Discard is similar to commit, but
makes consumer ignore the record.
``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared in some other place first. But it
allows submitting records of a length that's not known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, so it will
simplify migration significantly.
``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing a memory
pointer directly to ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have to use an extra per-CPU
array as a temporary heap for preparing a sample. ``bpf_ringbuf_reserve()``
avoids this need completely. But in exchange, it only allows a known constant
size of memory to be reserved, such that the verifier can verify that the BPF
program can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.
The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within single BPF program invocation.
Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.
``bpf_ringbuf_query()`` helper allows querying various properties of ring
buffer. Currently 4 are supported:
- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
of consumer/producer, respectively.
Returned values are momentary snapshots of ring buffer state and could be
off by the time helper returns, so this should be used only for
debugging/reporting reasons or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.
One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in ring buffer. Together with
``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for output/commit/discard
helpers, it allows BPF program a high degree of control and, e.g., more
efficient batched notifications. Default self-balancing strategy, though,
should be adequate for most applications and will work reliably and efficiently
already.
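One possible batching heuristic is sketched below as plain userspace C so that it can be compiled and tested on its own. ``pick_wakeup_flags()`` and ``batch_bytes`` are hypothetical names; in a real BPF program ``avail_data`` would come from ``bpf_ringbuf_query(ringbuf, BPF_RB_AVAIL_DATA)`` and the result would be passed as *flags* to the output/submit/discard helpers. The flag values are redefined locally and mirror the UAPI constants added by this series.

```c
#include <stdint.h>

/* Local copies of the UAPI flag values introduced by this series. */
#define BPF_RB_NO_WAKEUP	(1ULL << 0)
#define BPF_RB_FORCE_WAKEUP	(1ULL << 1)

/* Hypothetical policy: suppress notifications until at least batch_bytes of
 * data are pending, then force one. A batch_bytes of 0 selects the default
 * self-pacing behaviour (no flags).
 */
static uint64_t pick_wakeup_flags(uint64_t avail_data, uint64_t batch_bytes)
{
	if (batch_bytes == 0)
		return 0;
	return avail_data >= batch_bytes ? BPF_RB_FORCE_WAKEUP
					 : BPF_RB_NO_WAKEUP;
}
```

Because the queried value is only a snapshot, a policy like this can delay a notification slightly, so a consumer relying on it may also want a poll timeout.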
Design and Implementation
-------------------------
This reserve/commit scheme allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
reservation, in NMI context, ``bpf_ringbuf_reserve()`` might fail to get
a lock, in which case reservation will fail even if ring buffer is not full.
The ring buffer itself internally is implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, that's not a problem):
- consumer counter shows up to which logical position consumer consumed the
data;
- producer counter denotes amount of data reserved by all producers.
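The claim that counter wrap-around is harmless follows from unsigned modular arithmetic, which the sketch below demonstrates. It is an illustration only: ``rb_avail_data()`` and ``rb_offset()`` are hypothetical names, and 32-bit counters are used deliberately to mimic a 32-bit architecture.

```c
#include <stdint.h>

/* Unconsumed bytes. Unsigned subtraction is modular, so the result stays
 * correct even after prod_pos has wrapped past zero and cons_pos has not.
 */
static uint32_t rb_avail_data(uint32_t prod_pos, uint32_t cons_pos)
{
	return prod_pos - cons_pos;
}

/* Offset inside the data area for a logical position.
 * size is assumed to be a power of 2.
 */
static uint32_t rb_offset(uint32_t pos, uint32_t size)
{
	return pos & (size - 1);
}
```

For example, with the consumer at 0xfffffff0 and the producer wrapped around to 16, ``rb_avail_data()`` still yields 32 bytes.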
Each time a record is reserved, producer that "owns" the record will
successfully advance producer counter. At that point, data is still not yet
ready to be consumed, though. Each record has an 8-byte header, which contains the
length of reserved record, as well as two extra bits: busy bit to denote that
record is still being worked on, and discard bit, which might be set at commit
time if record is discarded. In the latter case, consumer is supposed to skip
the record and move on to the next one. Record header also encodes record's
relative offset from the beginning of ring buffer data area (in pages). This
allows ``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()`` to accept only the
pointer to the record itself, without requiring also the pointer to ring buffer
itself. Ring buffer memory location will be restored from record metadata
header. This significantly simplifies verifier, as well as improving API
usability.
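The header description above can be made concrete with a small decoder. The bit values and header size match the UAPI constants added by this series, but they are redefined locally as macros here. The struct layout and the ``rb_hdr_*()``/``rb_record_span()`` names are assumptions for illustration, since the ringbuf.c diff is not shown on this page. Rounding the record span up to 8 bytes is likewise an assumption based on the ``roundup_len()`` mention in the cover letter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the UAPI values introduced by this series. */
#define BPF_RINGBUF_BUSY_BIT	(1U << 31)
#define BPF_RINGBUF_DISCARD_BIT	(1U << 30)
#define BPF_RINGBUF_HDR_SZ	8

/* Assumed layout: length word carrying the two flag bits, then the record's
 * offset from the start of the ring buffer area, in pages.
 */
struct rb_hdr {
	uint32_t len;
	uint32_t pg_off;
};

static bool rb_hdr_busy(const struct rb_hdr *h)
{
	return (h->len & BPF_RINGBUF_BUSY_BIT) != 0;
}

static bool rb_hdr_discarded(const struct rb_hdr *h)
{
	return (h->len & BPF_RINGBUF_DISCARD_BIT) != 0;
}

/* Payload length with both flag bits masked off. */
static uint32_t rb_hdr_payload_len(const struct rb_hdr *h)
{
	return h->len & ~(BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT);
}

/* Bytes a record occupies in the data area: header plus payload, rounded up
 * to a multiple of 8 (assumed alignment).
 */
static uint32_t rb_record_span(uint32_t payload_len)
{
	return (payload_len + BPF_RINGBUF_HDR_SZ + 7u) & ~7u;
}
```

A consumer would test the busy bit first, then the discard bit, and advance by ``rb_record_span()`` in either case.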
Producer counter increments are serialized under spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to consumer
in the order of reservations, but only after all previous records were
already committed. It is thus possible for slow producers to temporarily hold
off submitted records that were reserved later.
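The ordering rule can be seen in a toy, single-threaded model of the protocol. Everything here is a simplified sketch and not the kernel implementation: the ``toy_*`` names are hypothetical, there is no locking or memory ordering, payload bytes are never written, and only the length word of each header is stored.

```c
#include <stdint.h>
#include <string.h>

#define RB_SIZE		64u		/* toy data area size, power of 2 */
#define RB_MASK		(RB_SIZE - 1)
#define RB_HDR_SZ	8u
#define RB_BUSY		(1u << 31)
#define RB_DISCARD	(1u << 30)
#define RB_SPAN(len)	(((len) + RB_HDR_SZ + 7u) & ~7u)

struct toy_rb {
	uint32_t cons_pos;
	uint32_t prod_pos;
	unsigned char data[RB_SIZE];
};

static uint32_t load_len(const struct toy_rb *rb, uint32_t pos)
{
	uint32_t len;

	memcpy(&len, &rb->data[pos & RB_MASK], sizeof(len));
	return len;
}

static void store_len(struct toy_rb *rb, uint32_t pos, uint32_t len)
{
	memcpy(&rb->data[pos & RB_MASK], &len, sizeof(len));
}

/* Reserve: mark the record busy and advance the producer position.
 * Returns the record's logical position, or -1 if it does not fit.
 */
static int64_t toy_reserve(struct toy_rb *rb, uint32_t size)
{
	uint32_t span = RB_SPAN(size);
	uint32_t pos = rb->prod_pos;

	if (size >= RB_DISCARD || pos + span - rb->cons_pos > RB_MASK)
		return -1;
	store_len(rb, pos, size | RB_BUSY);
	rb->prod_pos = pos + span;
	return pos;
}

/* Commit: clear the busy bit; a discard additionally sets the discard bit. */
static void toy_commit(struct toy_rb *rb, uint32_t pos, int discard)
{
	uint32_t len = load_len(rb, pos) & ~RB_BUSY;

	store_len(rb, pos, discard ? (len | RB_DISCARD) : len);
}

/* Consume in reservation order, stopping at the first busy record.
 * Returns the number of non-discarded records delivered.
 */
static int toy_consume(struct toy_rb *rb)
{
	int delivered = 0;

	while (rb->cons_pos != rb->prod_pos) {
		uint32_t len = load_len(rb, rb->cons_pos);

		if (len & RB_BUSY)
			break;
		if (!(len & RB_DISCARD))
			delivered++;
		rb->cons_pos += RB_SPAN(len & ~(RB_BUSY | RB_DISCARD));
	}
	return delivered;
}
```

If record A is reserved before record B and B is committed first, ``toy_consume()`` delivers nothing until A is committed or discarded, which is the "slow producer" effect described above.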
Reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus_tests/bpf-rb/_.
One interesting implementation bit, that significantly simplifies (and thus
speeds up as well) implementation of both producers and consumers is how data
area is mapped twice contiguously back-to-back in the virtual memory. This
allows to not take any special measures for samples that have to wrap around
at the end of the circular buffer data area, because the next page after the
last data page would be first data page again, and thus the sample will still
appear completely contiguous in virtual memory. See comment and a simple ASCII
diagram showing this visually in ``bpf_ringbuf_area_alloc()``.
Another feature that distinguishes BPF ringbuf from perf ring buffer is
self-pacing notifications of new data availability.
``bpf_ringbuf_submit()`` implementation will send a notification of new record
being available after commit only if consumer has already caught up right up to
the record being committed. If not, consumer still has to catch up and thus
will see new data anyways without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c_) show that
this allows achieving a very high throughput without having to resort to
tricks like "notify only every Nth sample", which are necessary with perf
buffer. For extreme cases, when BPF program wants more manual control of
notifications, commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and
``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of
data availability, but require extra caution and diligence in using this API.
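The notification decision described in this section can be summarised as a pure function. This is a sketch of the rule as stated in the text, not the kernel code: ``should_notify()`` is a hypothetical name, positions are assumed to be comparable offsets of the committed record and of the consumer, and giving ``BPF_RB_FORCE_WAKEUP`` precedence when both flags are set is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the UAPI flag values introduced by this series. */
#define BPF_RB_NO_WAKEUP	(1ULL << 0)
#define BPF_RB_FORCE_WAKEUP	(1ULL << 1)

/* Decide whether committing the record at rec_pos should wake the consumer. */
static bool should_notify(uint32_t rec_pos, uint32_t cons_pos, uint64_t flags)
{
	if (flags & BPF_RB_FORCE_WAKEUP)
		return true;		/* unconditional notification */
	if (flags & BPF_RB_NO_WAKEUP)
		return false;		/* caller suppressed notification */
	/* Default self-pacing: notify only if the consumer has caught up to
	 * this very record; otherwise it will find the data on its own.
	 */
	return cons_pos == rec_pos;
}
```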
+13
@@ -90,6 +90,8 @@ struct bpf_map_ops {
int (*map_direct_value_meta)(const struct bpf_map *map,
u64 imm, u32 *off);
int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
struct poll_table_struct *pts);
};
struct bpf_map_memory {
@@ -244,6 +246,9 @@ enum bpf_arg_type {
ARG_PTR_TO_LONG, /* pointer to long */
ARG_PTR_TO_SOCKET, /* pointer to bpf_sock (fullsock) */
ARG_PTR_TO_BTF_ID, /* pointer to in-kernel struct */
ARG_PTR_TO_ALLOC_MEM, /* pointer to dynamically allocated memory */
ARG_PTR_TO_ALLOC_MEM_OR_NULL, /* pointer to dynamically allocated memory or NULL */
ARG_CONST_ALLOC_SIZE_OR_ZERO, /* number of allocated bytes requested */
};
/* type of values returned from helper functions */
@@ -255,6 +260,7 @@ enum bpf_return_type {
RET_PTR_TO_SOCKET_OR_NULL, /* returns a pointer to a socket or NULL */
RET_PTR_TO_TCP_SOCK_OR_NULL, /* returns a pointer to a tcp_sock or NULL */
RET_PTR_TO_SOCK_COMMON_OR_NULL, /* returns a pointer to a sock_common or NULL */
RET_PTR_TO_ALLOC_MEM_OR_NULL, /* returns a pointer to dynamically allocated memory or NULL */
};
/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -322,6 +328,8 @@ enum bpf_reg_type {
PTR_TO_XDP_SOCK, /* reg points to struct xdp_sock */
PTR_TO_BTF_ID, /* reg points to kernel struct */
PTR_TO_BTF_ID_OR_NULL, /* reg points to kernel struct or NULL */
PTR_TO_MEM, /* reg points to valid memory region */
PTR_TO_MEM_OR_NULL, /* reg points to valid memory region or NULL */
};
/* The information passed from prog-specific *_is_valid_access
@@ -1611,6 +1619,11 @@ extern const struct bpf_func_proto bpf_tcp_sock_proto;
extern const struct bpf_func_proto bpf_jiffies64_proto;
extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto;
extern const struct bpf_func_proto bpf_event_output_data_proto;
extern const struct bpf_func_proto bpf_ringbuf_output_proto;
extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
extern const struct bpf_func_proto bpf_ringbuf_discard_proto;
extern const struct bpf_func_proto bpf_ringbuf_query_proto;
const struct bpf_func_proto *bpf_tracing_func_proto(
enum bpf_func_id func_id, const struct bpf_prog *prog);
+1
@@ -118,6 +118,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
#if defined(CONFIG_BPF_JIT)
BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
+4
@@ -54,6 +54,8 @@ struct bpf_reg_state {
u32 btf_id; /* for PTR_TO_BTF_ID */
u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
/* Max size from any of the above. */
unsigned long raw;
};
@@ -63,6 +65,8 @@ struct bpf_reg_state {
* offset, so they can share range knowledge.
* For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
* came from, when one is tested for != NULL.
* For PTR_TO_MEM_OR_NULL this is used to identify memory allocation
* for the purpose of tracking that it's freed.
* For PTR_TO_SOCKET this is used to share which pointers retain the
* same reference to the socket, to determine proper reference freeing.
*/
+83 -1
@@ -147,6 +147,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_SK_STORAGE,
BPF_MAP_TYPE_DEVMAP_HASH,
BPF_MAP_TYPE_STRUCT_OPS,
BPF_MAP_TYPE_RINGBUF,
};
/* Note that tracing related programs such as
@@ -3157,6 +3158,59 @@ union bpf_attr {
* **bpf_sk_cgroup_id**\ ().
* Return
* The id is returned or 0 in case the id could not be retrieved.
*
* int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
* Description
* Copy *size* bytes from *data* into a ring buffer *ringbuf*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* 0, on success;
* < 0, on error.
*
* void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
* Description
* Reserve *size* bytes of payload in a ring buffer *ringbuf*.
* Return
* Valid pointer with *size* bytes of memory available; NULL,
* otherwise.
*
* void bpf_ringbuf_submit(void *data, u64 flags)
* Description
* Submit reserved ring buffer sample, pointed to by *data*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* Nothing. Always succeeds.
*
* void bpf_ringbuf_discard(void *data, u64 flags)
* Description
* Discard reserved ring buffer sample, pointed to by *data*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* Nothing. Always succeeds.
*
* u64 bpf_ringbuf_query(void *ringbuf, u64 flags)
* Description
* Query various characteristics of provided ring buffer. What
* exactly is queried is determined by *flags*:
* - BPF_RB_AVAIL_DATA - amount of data not yet consumed;
* - BPF_RB_RING_SIZE - the size of ring buffer;
* - BPF_RB_CONS_POS - consumer position (can wrap around);
* - BPF_RB_PROD_POS - producer(s) position (can wrap around);
* Data returned is just a momentary snapshot of actual values
* and could be inaccurate, so this facility should be used to
* power heuristics and for reporting, not to make 100% correct
* calculation.
* Return
* Requested value, or 0, if flags are not recognized.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3288,7 +3342,12 @@ union bpf_attr {
FN(seq_printf), \
FN(seq_write), \
FN(sk_cgroup_id), \
FN(sk_ancestor_cgroup_id),
FN(sk_ancestor_cgroup_id), \
FN(ringbuf_output), \
FN(ringbuf_reserve), \
FN(ringbuf_submit), \
FN(ringbuf_discard), \
FN(ringbuf_query),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3398,6 +3457,29 @@ enum {
BPF_F_GET_BRANCH_RECORDS_SIZE = (1ULL << 0),
};
/* BPF_FUNC_ringbuf_submit, BPF_FUNC_ringbuf_discard, and
* BPF_FUNC_ringbuf_output flags.
*/
enum {
BPF_RB_NO_WAKEUP = (1ULL << 0),
BPF_RB_FORCE_WAKEUP = (1ULL << 1),
};
/* BPF_FUNC_ringbuf_query flags */
enum {
BPF_RB_AVAIL_DATA = 0,
BPF_RB_RING_SIZE = 1,
BPF_RB_CONS_POS = 2,
BPF_RB_PROD_POS = 3,
};
/* BPF ring buffer constants */
enum {
BPF_RINGBUF_BUSY_BIT = (1U << 31),
BPF_RINGBUF_DISCARD_BIT = (1U << 30),
BPF_RINGBUF_HDR_SZ = 8,
};
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
+1 -1
@@ -4,7 +4,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init)
obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o
obj-$(CONFIG_BPF_JIT) += trampoline.o
obj-$(CONFIG_BPF_SYSCALL) += btf.o
+10
@@ -635,6 +635,16 @@ bpf_base_func_proto(enum bpf_func_id func_id)
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_ktime_get_boot_ns:
return &bpf_ktime_get_boot_ns_proto;
case BPF_FUNC_ringbuf_output:
return &bpf_ringbuf_output_proto;
case BPF_FUNC_ringbuf_reserve:
return &bpf_ringbuf_reserve_proto;
case BPF_FUNC_ringbuf_submit:
return &bpf_ringbuf_submit_proto;
case BPF_FUNC_ringbuf_discard:
return &bpf_ringbuf_discard_proto;
case BPF_FUNC_ringbuf_query:
return &bpf_ringbuf_query_proto;
default:
break;
}
File diff suppressed because it is too large
+12
@@ -26,6 +26,7 @@
#include <linux/audit.h>
#include <uapi/linux/btf.h>
#include <linux/bpf_lsm.h>
#include <linux/poll.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -662,6 +663,16 @@ out:
return err;
}
static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct *pts)
{
struct bpf_map *map = filp->private_data;
if (map->ops->map_poll)
return map->ops->map_poll(map, filp, pts);
return EPOLLERR;
}
const struct file_operations bpf_map_fops = {
#ifdef CONFIG_PROC_FS
.show_fdinfo = bpf_map_show_fdinfo,
@@ -670,6 +681,7 @@ const struct file_operations bpf_map_fops = {
.read = bpf_dummy_read,
.write = bpf_dummy_write,
.mmap = bpf_map_mmap,
.poll = bpf_map_poll,
};
int bpf_map_new_fd(struct bpf_map *map, int flags)
+146 -49
@@ -233,6 +233,7 @@ struct bpf_call_arg_meta {
bool pkt_access;
int regno;
int access_size;
int mem_size;
u64 msize_max_value;
int ref_obj_id;
int func_id;
@@ -408,7 +409,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
type == PTR_TO_SOCKET_OR_NULL ||
type == PTR_TO_SOCK_COMMON_OR_NULL ||
type == PTR_TO_TCP_SOCK_OR_NULL ||
type == PTR_TO_BTF_ID_OR_NULL;
type == PTR_TO_BTF_ID_OR_NULL ||
type == PTR_TO_MEM_OR_NULL;
}
static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
@@ -422,7 +424,9 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
return type == PTR_TO_SOCKET ||
type == PTR_TO_SOCKET_OR_NULL ||
type == PTR_TO_TCP_SOCK ||
type == PTR_TO_TCP_SOCK_OR_NULL;
type == PTR_TO_TCP_SOCK_OR_NULL ||
type == PTR_TO_MEM ||
type == PTR_TO_MEM_OR_NULL;
}
static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
@@ -436,7 +440,9 @@ static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
*/
static bool is_release_function(enum bpf_func_id func_id)
{
return func_id == BPF_FUNC_sk_release;
return func_id == BPF_FUNC_sk_release ||
func_id == BPF_FUNC_ringbuf_submit ||
func_id == BPF_FUNC_ringbuf_discard;
}
static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -444,7 +450,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
return func_id == BPF_FUNC_sk_lookup_tcp ||
func_id == BPF_FUNC_sk_lookup_udp ||
func_id == BPF_FUNC_skc_lookup_tcp ||
func_id == BPF_FUNC_map_lookup_elem;
func_id == BPF_FUNC_map_lookup_elem ||
func_id == BPF_FUNC_ringbuf_reserve;
}
static bool is_acquire_function(enum bpf_func_id func_id,
@@ -454,7 +461,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
if (func_id == BPF_FUNC_sk_lookup_tcp ||
func_id == BPF_FUNC_sk_lookup_udp ||
func_id == BPF_FUNC_skc_lookup_tcp)
func_id == BPF_FUNC_skc_lookup_tcp ||
func_id == BPF_FUNC_ringbuf_reserve)
return true;
if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -494,6 +502,8 @@ static const char * const reg_type_str[] = {
[PTR_TO_XDP_SOCK] = "xdp_sock",
[PTR_TO_BTF_ID] = "ptr_",
[PTR_TO_BTF_ID_OR_NULL] = "ptr_or_null_",
[PTR_TO_MEM] = "mem",
[PTR_TO_MEM_OR_NULL] = "mem_or_null",
};
static char slot_type_char[] = {
@@ -2468,32 +2478,49 @@ static int check_map_access_type(struct bpf_verifier_env *env, u32 regno,
return 0;
}
/* check read/write into map element returned by bpf_map_lookup_elem() */
static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
int size, bool zero_size_allowed)
/* check read/write into memory region (e.g., map value, ringbuf sample, etc) */
static int __check_mem_access(struct bpf_verifier_env *env, int regno,
int off, int size, u32 mem_size,
bool zero_size_allowed)
{
struct bpf_reg_state *regs = cur_regs(env);
struct bpf_map *map = regs[regno].map_ptr;
bool size_ok = size > 0 || (size == 0 && zero_size_allowed);
struct bpf_reg_state *reg;
if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
off + size > map->value_size) {
if (off >= 0 && size_ok && (u64)off + size <= mem_size)
return 0;
reg = &cur_regs(env)[regno];
switch (reg->type) {
case PTR_TO_MAP_VALUE:
verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n",
map->value_size, off, size);
return -EACCES;
mem_size, off, size);
break;
case PTR_TO_PACKET:
case PTR_TO_PACKET_META:
case PTR_TO_PACKET_END:
verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n",
off, size, regno, reg->id, off, mem_size);
break;
case PTR_TO_MEM:
default:
verbose(env, "invalid access to memory, mem_size=%u off=%d size=%d\n",
mem_size, off, size);
}
return 0;
return -EACCES;
}
/* check read/write into a map element with possible variable offset */
static int check_map_access(struct bpf_verifier_env *env, u32 regno,
int off, int size, bool zero_size_allowed)
/* check read/write into a memory region with possible variable offset */
static int check_mem_region_access(struct bpf_verifier_env *env, u32 regno,
int off, int size, u32 mem_size,
bool zero_size_allowed)
{
struct bpf_verifier_state *vstate = env->cur_state;
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg = &state->regs[regno];
int err;
/* We may have adjusted the register to this map value, so we
/* We may have adjusted the register pointing to memory region, so we
* need to try adding each of min_value and max_value to off
* to make sure our theoretical access will be safe.
*/
@@ -2514,10 +2541,10 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
regno);
return -EACCES;
}
err = __check_map_access(env, regno, reg->smin_value + off, size,
zero_size_allowed);
err = __check_mem_access(env, regno, reg->smin_value + off, size,
mem_size, zero_size_allowed);
if (err) {
verbose(env, "R%d min value is outside of the array range\n",
verbose(env, "R%d min value is outside of the allowed memory range\n",
regno);
return err;
}
@@ -2527,18 +2554,38 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
* If reg->umax_value + off could overflow, treat that as unbounded too.
*/
if (reg->umax_value >= BPF_MAX_VAR_OFF) {
verbose(env, "R%d unbounded memory access, make sure to bounds check any array access into a map\n",
verbose(env, "R%d unbounded memory access, make sure to bounds check any such access\n",
regno);
return -EACCES;
}
err = __check_map_access(env, regno, reg->umax_value + off, size,
zero_size_allowed);
if (err)
verbose(env, "R%d max value is outside of the array range\n",
err = __check_mem_access(env, regno, reg->umax_value + off, size,
mem_size, zero_size_allowed);
if (err) {
verbose(env, "R%d max value is outside of the allowed memory range\n",
regno);
return err;
}
if (map_value_has_spin_lock(reg->map_ptr)) {
u32 lock = reg->map_ptr->spin_lock_off;
return 0;
}
/* check read/write into a map element with possible variable offset */
static int check_map_access(struct bpf_verifier_env *env, u32 regno,
int off, int size, bool zero_size_allowed)
{
struct bpf_verifier_state *vstate = env->cur_state;
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg = &state->regs[regno];
struct bpf_map *map = reg->map_ptr;
int err;
err = check_mem_region_access(env, regno, off, size, map->value_size,
zero_size_allowed);
if (err)
return err;
if (map_value_has_spin_lock(map)) {
u32 lock = map->spin_lock_off;
/* if any part of struct bpf_spin_lock can be touched by
* load/store reject this program.
@@ -2596,21 +2643,6 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
}
}
static int __check_packet_access(struct bpf_verifier_env *env, u32 regno,
int off, int size, bool zero_size_allowed)
{
struct bpf_reg_state *regs = cur_regs(env);
struct bpf_reg_state *reg = &regs[regno];
if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
(u64)off + size > reg->range) {
verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n",
off, size, regno, reg->id, reg->off, reg->range);
return -EACCES;
}
return 0;
}
static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
int size, bool zero_size_allowed)
{
@@ -2631,16 +2663,17 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
regno);
return -EACCES;
}
err = __check_packet_access(env, regno, off, size, zero_size_allowed);
err = __check_mem_access(env, regno, off, size, reg->range,
zero_size_allowed);
if (err) {
verbose(env, "R%d offset is outside of the packet\n", regno);
return err;
}
/* __check_packet_access has made sure "off + size - 1" is within u16.
/* __check_mem_access has made sure "off + size - 1" is within u16.
* reg->umax_value can't be bigger than MAX_PACKET_OFF which is 0xffff,
* otherwise find_good_pkt_pointers would have refused to set range info
* that __check_packet_access would have rejected this pkt access.
* that __check_mem_access would have rejected this pkt access.
* Therefore, "off + reg->umax_value + size - 1" won't overflow u32.
*/
env->prog->aux->max_pkt_offset =
@@ -3220,6 +3253,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
mark_reg_unknown(env, regs, value_regno);
}
}
} else if (reg->type == PTR_TO_MEM) {
if (t == BPF_WRITE && value_regno >= 0 &&
is_pointer_value(env, value_regno)) {
verbose(env, "R%d leaks addr into mem\n", value_regno);
return -EACCES;
}
err = check_mem_region_access(env, regno, off, size,
reg->mem_size, false);
if (!err && t == BPF_READ && value_regno >= 0)
mark_reg_unknown(env, regs, value_regno);
} else if (reg->type == PTR_TO_CTX) {
enum bpf_reg_type reg_type = SCALAR_VALUE;
u32 btf_id = 0;
@@ -3557,6 +3600,10 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
return -EACCES;
return check_map_access(env, regno, reg->off, access_size,
zero_size_allowed);
case PTR_TO_MEM:
return check_mem_region_access(env, regno, reg->off,
access_size, reg->mem_size,
zero_size_allowed);
default: /* scalar_value|ptr_to_stack or invalid ptr */
return check_stack_boundary(env, regno, access_size,
zero_size_allowed, meta);
@@ -3661,6 +3708,17 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type)
type == ARG_CONST_SIZE_OR_ZERO;
}
static bool arg_type_is_alloc_mem_ptr(enum bpf_arg_type type)
{
return type == ARG_PTR_TO_ALLOC_MEM ||
type == ARG_PTR_TO_ALLOC_MEM_OR_NULL;
}
static bool arg_type_is_alloc_size(enum bpf_arg_type type)
{
return type == ARG_CONST_ALLOC_SIZE_OR_ZERO;
}
static bool arg_type_is_int_ptr(enum bpf_arg_type type)
{
return type == ARG_PTR_TO_INT ||
@@ -3720,7 +3778,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
type != expected_type)
goto err_type;
} else if (arg_type == ARG_CONST_SIZE ||
arg_type == ARG_CONST_SIZE_OR_ZERO) {
arg_type == ARG_CONST_SIZE_OR_ZERO ||
arg_type == ARG_CONST_ALLOC_SIZE_OR_ZERO) {
expected_type = SCALAR_VALUE;
if (type != expected_type)
goto err_type;
@@ -3791,13 +3850,29 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
* happens during stack boundary checking.
*/
if (register_is_null(reg) &&
arg_type == ARG_PTR_TO_MEM_OR_NULL)
(arg_type == ARG_PTR_TO_MEM_OR_NULL ||
arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL))
/* final test in check_stack_boundary() */;
else if (!type_is_pkt_pointer(type) &&
type != PTR_TO_MAP_VALUE &&
type != PTR_TO_MEM &&
type != expected_type)
goto err_type;
meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
} else if (arg_type_is_alloc_mem_ptr(arg_type)) {
expected_type = PTR_TO_MEM;
if (register_is_null(reg) &&
arg_type == ARG_PTR_TO_ALLOC_MEM_OR_NULL)
/* final test in check_stack_boundary() */;
else if (type != expected_type)
goto err_type;
if (meta->ref_obj_id) {
verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
regno, reg->ref_obj_id,
meta->ref_obj_id);
return -EFAULT;
}
meta->ref_obj_id = reg->ref_obj_id;
} else if (arg_type_is_int_ptr(arg_type)) {
expected_type = PTR_TO_STACK;
if (!type_is_pkt_pointer(type) &&
@@ -3893,6 +3968,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
zero_size_allowed, meta);
if (!err)
err = mark_chain_precision(env, regno);
} else if (arg_type_is_alloc_size(arg_type)) {
if (!tnum_is_const(reg->var_off)) {
verbose(env, "R%d unbounded size, use 'var &= const' or 'if (var < const)'\n",
regno);
return -EACCES;
}
meta->mem_size = reg->var_off.value;
} else if (arg_type_is_int_ptr(arg_type)) {
int size = int_ptr_type_to_size(arg_type);
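Note: the `ARG_CONST_ALLOC_SIZE_OR_ZERO` branch above only accepts a size the verifier knows exactly, and records it so the returned pointer can later be bounds-checked against it. A small userspace model of that rule is sketched below. The `tnum_model` type mirrors the verifier's value/mask pair as I understand it (a set mask bit means "this bit is unknown"); all names here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of a tracked number: bits set in mask are unknown,
 * the remaining bits are given by value.
 */
struct tnum_model {
	uint64_t value;
	uint64_t mask;
};

static bool tnum_model_is_const(struct tnum_model t)
{
	return t.mask == 0;
}

/* Model of the allocation-size check: on success stores the exact size
 * and returns 0; returns -1 (standing in for -EACCES) when any bit of
 * the size is unknown.
 */
static int alloc_size_from_reg(struct tnum_model var_off, uint64_t *mem_size)
{
	if (!tnum_model_is_const(var_off))
		return -1;
	*mem_size = var_off.value;
	return 0;
}
```

A size that is merely bounded (for example, known to be below 64 but with unknown low bits) is still rejected in this model, which matches the "unbounded size" message being emitted whenever the value is not a constant.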
@@ -3929,6 +4011,14 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
func_id != BPF_FUNC_xdp_output)
goto error;
break;
case BPF_MAP_TYPE_RINGBUF:
if (func_id != BPF_FUNC_ringbuf_output &&
func_id != BPF_FUNC_ringbuf_reserve &&
func_id != BPF_FUNC_ringbuf_submit &&
func_id != BPF_FUNC_ringbuf_discard &&
func_id != BPF_FUNC_ringbuf_query)
goto error;
break;
case BPF_MAP_TYPE_STACK_TRACE:
if (func_id != BPF_FUNC_get_stackid)
goto error;
@@ -4655,6 +4745,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
regs[BPF_REG_0].id = ++env->id_gen;
} else if (fn->ret_type == RET_PTR_TO_ALLOC_MEM_OR_NULL) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_MEM_OR_NULL;
regs[BPF_REG_0].id = ++env->id_gen;
regs[BPF_REG_0].mem_size = meta.mem_size;
} else {
verbose(env, "unknown return type %d of func %s#%d\n",
fn->ret_type, func_id_name(func_id), func_id);
@@ -6611,6 +6706,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
reg->type = PTR_TO_TCP_SOCK;
} else if (reg->type == PTR_TO_BTF_ID_OR_NULL) {
reg->type = PTR_TO_BTF_ID;
} else if (reg->type == PTR_TO_MEM_OR_NULL) {
reg->type = PTR_TO_MEM;
}
if (is_null) {
/* We don't need id and ref_obj_id from this point
@@ -1088,6 +1088,16 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_perf_event_read_value_proto;
case BPF_FUNC_get_ns_current_pid_tgid:
return &bpf_get_ns_current_pid_tgid_proto;
case BPF_FUNC_ringbuf_output:
return &bpf_ringbuf_output_proto;
case BPF_FUNC_ringbuf_reserve:
return &bpf_ringbuf_reserve_proto;
case BPF_FUNC_ringbuf_submit:
return &bpf_ringbuf_submit_proto;
case BPF_FUNC_ringbuf_discard:
return &bpf_ringbuf_discard_proto;
case BPF_FUNC_ringbuf_query:
return &bpf_ringbuf_query_proto;
default:
return NULL;
}
@@ -147,6 +147,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_SK_STORAGE,
BPF_MAP_TYPE_DEVMAP_HASH,
BPF_MAP_TYPE_STRUCT_OPS,
BPF_MAP_TYPE_RINGBUF,
};
/* Note that tracing related programs such as
@@ -3157,6 +3158,59 @@ union bpf_attr {
* **bpf_sk_cgroup_id**\ ().
* Return
* The id is returned or 0 in case the id could not be retrieved.
*
* int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags)
* Description
* Copy *size* bytes from *data* into a ring buffer *ringbuf*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* 0, on success;
* < 0, on error.
*
* void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags)
* Description
* Reserve *size* bytes of payload in a ring buffer *ringbuf*.
* Return
* Valid pointer with *size* bytes of memory available; NULL,
* otherwise.
*
* void bpf_ringbuf_submit(void *data, u64 flags)
* Description
* Submit reserved ring buffer sample, pointed to by *data*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* Nothing. Always succeeds.
*
* void bpf_ringbuf_discard(void *data, u64 flags)
* Description
* Discard reserved ring buffer sample, pointed to by *data*.
* If BPF_RB_NO_WAKEUP is specified in *flags*, no notification of
* new data availability is sent.
* If BPF_RB_FORCE_WAKEUP is specified in *flags*, notification of
* new data availability is sent unconditionally.
* Return
* Nothing. Always succeeds.
*
* u64 bpf_ringbuf_query(void *ringbuf, u64 flags)
* Description
* Query various characteristics of provided ring buffer. What
* exactly is queried is determined by *flags*:
* - BPF_RB_AVAIL_DATA - amount of data not yet consumed;
* - BPF_RB_RING_SIZE - the size of ring buffer;
* - BPF_RB_CONS_POS - consumer position (can wrap around);
* - BPF_RB_PROD_POS - producer(s) position (can wrap around);
* Data returned is just a momentary snapshot of actual values
* and could be inaccurate, so this facility should be used to
* power heuristics and for reporting, not to make 100% correct
* calculations.
* Return
* Requested value, or 0, if flags are not recognized.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3288,7 +3342,12 @@ union bpf_attr {
FN(seq_printf), \
FN(seq_write), \
FN(sk_cgroup_id), \
FN(sk_ancestor_cgroup_id),
FN(sk_ancestor_cgroup_id), \
FN(ringbuf_output), \
FN(ringbuf_reserve), \
FN(ringbuf_submit), \
FN(ringbuf_discard), \
FN(ringbuf_query),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3398,6 +3457,29 @@ enum {
BPF_F_GET_BRANCH_RECORDS_SIZE = (1ULL << 0),
};
/* BPF_FUNC_ringbuf_submit, BPF_FUNC_ringbuf_discard, and
* BPF_FUNC_ringbuf_output flags.
*/
enum {
BPF_RB_NO_WAKEUP = (1ULL << 0),
BPF_RB_FORCE_WAKEUP = (1ULL << 1),
};
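Note: a sketch of how the two wakeup flags could combine into a notification decision. The flag handling follows the helper descriptions above. The default branch (notify only when the consumer has already caught up to the record being committed) reflects the kernel side of this series, which is not shown in this hunk, so treat it as an assumption. The `DEMO_` names and `should_notify()` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RB_NO_WAKEUP	(1ULL << 0)
#define DEMO_RB_FORCE_WAKEUP	(1ULL << 1)

/* Decide whether committing the record at rec_pos should wake the
 * consumer, given the consumer position observed at commit time.
 */
static bool should_notify(uint64_t flags, unsigned long cons_pos,
			  unsigned long rec_pos)
{
	if (flags & DEMO_RB_FORCE_WAKEUP)
		return true;	/* unconditional notification */
	if (flags & DEMO_RB_NO_WAKEUP)
		return false;	/* caller suppressed notification */
	/* assumed adaptive default: consumer is waiting on this record */
	return cons_pos == rec_pos;
}
```

In this model FORCE_WAKEUP is tested first, so it wins if a caller sets both bits.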
/* BPF_FUNC_bpf_ringbuf_query flags */
enum {
BPF_RB_AVAIL_DATA = 0,
BPF_RB_RING_SIZE = 1,
BPF_RB_CONS_POS = 2,
BPF_RB_PROD_POS = 3,
};
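Note: a userspace model of what each query flag returns, based only on the `bpf_ringbuf_query()` description above. The `rb_model` struct and the `DEMO_` constants are illustrative stand-ins, not the kernel's data structures.

```c
#include <stdint.h>

enum {
	DEMO_RB_AVAIL_DATA = 0,
	DEMO_RB_RING_SIZE = 1,
	DEMO_RB_CONS_POS = 2,
	DEMO_RB_PROD_POS = 3,
};

/* Hypothetical snapshot of a ring: positions grow monotonically and
 * may wrap around; mask is ring size minus one.
 */
struct rb_model {
	unsigned long consumer_pos;
	unsigned long producer_pos;
	unsigned long mask;
};

static uint64_t rb_model_query(const struct rb_model *rb, uint64_t flags)
{
	switch (flags) {
	case DEMO_RB_AVAIL_DATA:
		/* unsigned subtraction stays correct across wrap-around */
		return rb->producer_pos - rb->consumer_pos;
	case DEMO_RB_RING_SIZE:
		return rb->mask + 1;
	case DEMO_RB_CONS_POS:
		return rb->consumer_pos;
	case DEMO_RB_PROD_POS:
		return rb->producer_pos;
	default:
		return 0;	/* unrecognized flags */
	}
}
```

Because the real values can change between the query and their use, a BPF program would typically compare the available-data figure against a threshold rather than rely on it being exact.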
/* BPF ring buffer constants */
enum {
BPF_RINGBUF_BUSY_BIT = (1U << 31),
BPF_RINGBUF_DISCARD_BIT = (1U << 30),
BPF_RINGBUF_HDR_SZ = 8,
};
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
@@ -1,3 +1,3 @@
libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \
netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \
btf_dump.o
btf_dump.o ringbuf.o
@@ -478,6 +478,27 @@ LIBBPF_API int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags);
LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
size_t info_size, __u32 flags);
/* Ring buffer APIs */
struct ring_buffer;
typedef int (*ring_buffer_sample_fn)(void *ctx, void *data, size_t size);
struct ring_buffer_opts {
size_t sz; /* size of this struct, for forward/backward compatibility */
};
#define ring_buffer_opts__last_field sz
LIBBPF_API struct ring_buffer *
ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx,
const struct ring_buffer_opts *opts);
LIBBPF_API void ring_buffer__free(struct ring_buffer *rb);
LIBBPF_API int ring_buffer__add(struct ring_buffer *rb, int map_fd,
ring_buffer_sample_fn sample_cb, void *ctx);
LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms);
LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb);
/* Perf buffer APIs */
struct perf_buffer;
typedef void (*perf_buffer_sample_fn)(void *ctx, int cpu,
@@ -263,4 +263,9 @@ LIBBPF_0.0.9 {
bpf_link_get_next_id;
bpf_program__attach_iter;
perf_buffer__consume;
ring_buffer__add;
ring_buffer__consume;
ring_buffer__free;
ring_buffer__new;
ring_buffer__poll;
} LIBBPF_0.0.8;
@@ -238,6 +238,11 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
if (btf_fd < 0)
return false;
break;
case BPF_MAP_TYPE_RINGBUF:
key_size = 0;
value_size = 0;
max_entries = 4096;
break;
case BPF_MAP_TYPE_UNSPEC:
case BPF_MAP_TYPE_HASH:
case BPF_MAP_TYPE_ARRAY:
@@ -0,0 +1,285 @@
// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
/*
* Ring buffer operations.
*
* Copyright (C) 2020 Facebook, Inc.
*/
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <linux/err.h>
#include <linux/bpf.h>
#include <asm/barrier.h>
#include <sys/mman.h>
#include <sys/epoll.h>
#include <tools/libc_compat.h>
#include "libbpf.h"
#include "libbpf_internal.h"
#include "bpf.h"
/* make sure libbpf doesn't use kernel-only integer typedefs */
#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64
struct ring {
ring_buffer_sample_fn sample_cb;
void *ctx;
void *data;
unsigned long *consumer_pos;
unsigned long *producer_pos;
unsigned long mask;
int map_fd;
};
struct ring_buffer {
struct epoll_event *events;
struct ring *rings;
size_t page_size;
int epoll_fd;
int ring_cnt;
};
static void ringbuf_unmap_ring(struct ring_buffer *rb, struct ring *r)
{
if (r->consumer_pos) {
munmap(r->consumer_pos, rb->page_size);
r->consumer_pos = NULL;
}
if (r->producer_pos) {
munmap(r->producer_pos, rb->page_size + 2 * (r->mask + 1));
r->producer_pos = NULL;
}
}
/* Add extra RINGBUF maps to this ring buffer manager */
int ring_buffer__add(struct ring_buffer *rb, int map_fd,
ring_buffer_sample_fn sample_cb, void *ctx)
{
struct bpf_map_info info;
__u32 len = sizeof(info);
struct epoll_event *e;
struct ring *r;
void *tmp;
int err;
memset(&info, 0, sizeof(info));
err = bpf_obj_get_info_by_fd(map_fd, &info, &len);
if (err) {
err = -errno;
pr_warn("ringbuf: failed to get map info for fd=%d: %d\n",
map_fd, err);
return err;
}
if (info.type != BPF_MAP_TYPE_RINGBUF) {
pr_warn("ringbuf: map fd=%d is not BPF_MAP_TYPE_RINGBUF\n",
map_fd);
return -EINVAL;
}
tmp = reallocarray(rb->rings, rb->ring_cnt + 1, sizeof(*rb->rings));
if (!tmp)
return -ENOMEM;
rb->rings = tmp;
tmp = reallocarray(rb->events, rb->ring_cnt + 1, sizeof(*rb->events));
if (!tmp)
return -ENOMEM;
rb->events = tmp;
r = &rb->rings[rb->ring_cnt];
memset(r, 0, sizeof(*r));
r->map_fd = map_fd;
r->sample_cb = sample_cb;
r->ctx = ctx;
r->mask = info.max_entries - 1;
/* Map writable consumer page */
tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
map_fd, 0);
if (tmp == MAP_FAILED) {
err = -errno;
pr_warn("ringbuf: failed to mmap consumer page for map fd=%d: %d\n",
map_fd, err);
return err;
}
r->consumer_pos = tmp;
/* Map read-only producer page and data pages. We map twice as big
* data size to allow simple reading of samples that wrap around the
* end of a ring buffer. See kernel implementation for details.
*/
tmp = mmap(NULL, rb->page_size + 2 * info.max_entries, PROT_READ,
MAP_SHARED, map_fd, rb->page_size);
if (tmp == MAP_FAILED) {
err = -errno;
ringbuf_unmap_ring(rb, r);
pr_warn("ringbuf: failed to mmap data pages for map fd=%d: %d\n",
map_fd, err);
return err;
}
r->producer_pos = tmp;
r->data = tmp + rb->page_size;
e = &rb->events[rb->ring_cnt];
memset(e, 0, sizeof(*e));
e->events = EPOLLIN;
e->data.fd = rb->ring_cnt;
if (epoll_ctl(rb->epoll_fd, EPOLL_CTL_ADD, map_fd, e) < 0) {
err = -errno;
ringbuf_unmap_ring(rb, r);
pr_warn("ringbuf: failed to epoll add map fd=%d: %d\n",
map_fd, err);
return err;
}
rb->ring_cnt++;
return 0;
}
void ring_buffer__free(struct ring_buffer *rb)
{
int i;
if (!rb)
return;
for (i = 0; i < rb->ring_cnt; ++i)
ringbuf_unmap_ring(rb, &rb->rings[i]);
if (rb->epoll_fd >= 0)
close(rb->epoll_fd);
free(rb->events);
free(rb->rings);
free(rb);
}
struct ring_buffer *
ring_buffer__new(int map_fd, ring_buffer_sample_fn sample_cb, void *ctx,
const struct ring_buffer_opts *opts)
{
struct ring_buffer *rb;
int err;
if (!OPTS_VALID(opts, ring_buffer_opts))
return NULL;
rb = calloc(1, sizeof(*rb));
if (!rb)
return NULL;
rb->page_size = getpagesize();
rb->epoll_fd = epoll_create1(EPOLL_CLOEXEC);
if (rb->epoll_fd < 0) {
err = -errno;
pr_warn("ringbuf: failed to create epoll instance: %d\n", err);
goto err_out;
}
err = ring_buffer__add(rb, map_fd, sample_cb, ctx);
if (err)
goto err_out;
return rb;
err_out:
ring_buffer__free(rb);
return NULL;
}
static inline int roundup_len(__u32 len)
{
/* clear out top 2 bits (discard and busy, if set) */
len <<= 2;
len >>= 2;
/* add length prefix */
len += BPF_RINGBUF_HDR_SZ;
/* round up to 8 byte alignment */
return (len + 7) / 8 * 8;
}
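Note: a standalone version of the arithmetic in `roundup_len()`, for checking record spans by hand. It masks the two status bits instead of shifting left and right by two, which gives the same result for a 32-bit value. The `DEMO_` constants copy the values of the UAPI constants added by this patch; `demo_record_span()` is an illustrative name.

```c
#include <stdint.h>

#define DEMO_RINGBUF_BUSY_BIT		(1U << 31)
#define DEMO_RINGBUF_DISCARD_BIT	(1U << 30)
#define DEMO_RINGBUF_HDR_SZ		8U

/* Number of ring bytes occupied by a record whose header length word is
 * hdr_len: payload length with status bits cleared, plus the 8-byte
 * header, rounded up to a multiple of 8.
 */
static uint32_t demo_record_span(uint32_t hdr_len)
{
	uint32_t len = hdr_len &
		       ~(DEMO_RINGBUF_BUSY_BIT | DEMO_RINGBUF_DISCARD_BIT);

	len += DEMO_RINGBUF_HDR_SZ;
	return (len + 7) / 8 * 8;
}
```

For example, a 9-byte payload occupies 24 bytes (9 + 8 = 17, rounded up to 24), and a discarded record occupies the same span as a submitted one of equal length, so the consumer can always skip over it.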
static int ringbuf_process_ring(struct ring *r)
{
int *len_ptr, len, err, cnt = 0;
unsigned long cons_pos, prod_pos;
bool got_new_data;
void *sample;
cons_pos = smp_load_acquire(r->consumer_pos);
do {
got_new_data = false;
prod_pos = smp_load_acquire(r->producer_pos);
while (cons_pos < prod_pos) {
len_ptr = r->data + (cons_pos & r->mask);
len = smp_load_acquire(len_ptr);
/* sample not committed yet, bail out for now */
if (len & BPF_RINGBUF_BUSY_BIT)
goto done;
got_new_data = true;
cons_pos += roundup_len(len);
if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) {
sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
err = r->sample_cb(r->ctx, sample, len);
if (err) {
/* update consumer pos and bail out */
smp_store_release(r->consumer_pos,
cons_pos);
return err;
}
cnt++;
}
smp_store_release(r->consumer_pos, cons_pos);
}
} while (got_new_data);
done:
return cnt;
}
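Note: a single-threaded model of the consumer loop above, useful for stepping through the busy/discard handling and the wrap-around case. It is a sketch under several assumptions: there is no concurrency, so the acquire/release accesses are replaced by plain loads and stores; the double mapping is emulated by mirroring every write into a second copy of the data area; and the producer does not check for free space. All `demo_` names are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_RING_SZ	64u		/* must be a power of two */
#define DEMO_BUSY	(1u << 31)
#define DEMO_DISCARD	(1u << 30)
#define DEMO_HDR_SZ	8u

struct demo_ring {
	unsigned char data[2 * DEMO_RING_SZ];	/* second half mirrors first */
	unsigned long cons_pos, prod_pos;
};

struct demo_sink {
	size_t bytes, last_len;
	unsigned char last[32];
};

typedef int (*demo_sample_fn)(void *ctx, void *data, size_t size);

/* Example callback: remember the last sample and count payload bytes. */
static int demo_collect(void *ctx, void *data, size_t size)
{
	struct demo_sink *s = ctx;

	if (size > sizeof(s->last))
		return -1;
	memcpy(s->last, data, size);
	s->last_len = size;
	s->bytes += size;
	return 0;
}

/* Store one byte in both halves to emulate the double mapping. */
static void demo_put(struct demo_ring *r, unsigned long pos, unsigned char b)
{
	unsigned long off = pos & (DEMO_RING_SZ - 1);

	r->data[off] = b;
	r->data[off + DEMO_RING_SZ] = b;
}

/* Append a record; hdr_flags may carry DEMO_BUSY or DEMO_DISCARD. */
static void demo_produce(struct demo_ring *r, const void *payload,
			 uint32_t len, uint32_t hdr_flags)
{
	uint32_t hdr[2] = { len | hdr_flags, 0 };
	const unsigned char *p = (const unsigned char *)hdr;
	uint32_t i;

	for (i = 0; i < DEMO_HDR_SZ; i++)
		demo_put(r, r->prod_pos + i, p[i]);
	p = payload;
	for (i = 0; i < len; i++)
		demo_put(r, r->prod_pos + DEMO_HDR_SZ + i, p[i]);
	r->prod_pos += (len + DEMO_HDR_SZ + 7) / 8 * 8;
}

/* Returns number of samples delivered, or the callback's error. */
static int demo_consume(struct demo_ring *r, demo_sample_fn cb, void *ctx)
{
	int cnt = 0;

	while (r->cons_pos < r->prod_pos) {
		unsigned char *rec = r->data + (r->cons_pos & (DEMO_RING_SZ - 1));
		uint32_t hdr, len;
		int err;

		memcpy(&hdr, rec, sizeof(hdr));
		if (hdr & DEMO_BUSY)
			break;		/* not committed yet, stop here */
		len = hdr & ~(DEMO_BUSY | DEMO_DISCARD);
		r->cons_pos += (len + DEMO_HDR_SZ + 7) / 8 * 8;
		if (hdr & DEMO_DISCARD)
			continue;	/* skipped, but still consumed */
		err = cb(ctx, rec + DEMO_HDR_SZ, len);
		if (err)
			return err;
		cnt++;
	}
	return cnt;
}
```

The mirrored second half is what lets the callback receive one contiguous pointer even when a payload starts near the end of the ring and continues at its beginning.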
/* Consume available ring buffer(s) data without event polling.
* Returns number of records consumed across all registered ring buffers, or
* negative number if any of the callbacks return error.
*/
int ring_buffer__consume(struct ring_buffer *rb)
{
int i, err, res = 0;
for (i = 0; i < rb->ring_cnt; i++) {
struct ring *ring = &rb->rings[i];
err = ringbuf_process_ring(ring);
if (err < 0)
return err;
res += err;
}
return res;
}
/* Poll for available data and consume records, if any are available.
* Returns number of records consumed, or negative number, if any of the
* registered callbacks returned error.
*/
int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms)
{
int i, cnt, err, res = 0;
cnt = epoll_wait(rb->epoll_fd, rb->events, rb->ring_cnt, timeout_ms);
for (i = 0; i < cnt; i++) {
__u32 ring_id = rb->events[i].data.fd;
struct ring *ring = &rb->rings[ring_id];
err = ringbuf_process_ring(ring);
if (err < 0)
return err;
res += err;
}
return cnt < 0 ? -errno : res;
}
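Note: `ring_buffer__add()` stores the ring's array index, not the map fd, in `epoll_event.data.fd`, and `ring_buffer__poll()` reads it back to find the ring to process. The sketch below shows that pattern in isolation using only the standard Linux epoll calls; the `demo_` helpers are illustrative and any readable fd (a pipe in the test) stands in for a ringbuf map fd.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for readability and tag it with a caller-chosen slot
 * index. Returns 0 on success, -1 on error (errno set by epoll_ctl).
 */
static int demo_register(int epoll_fd, int fd, int slot)
{
	struct epoll_event e = { 0 };

	e.events = EPOLLIN;
	e.data.fd = slot;	/* an index into the caller's array, not fd */
	return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &e);
}

/* Wait up to timeout_ms and return the slot index of the first ready
 * descriptor, or -1 on timeout or error.
 */
static int demo_first_ready_slot(int epoll_fd, int timeout_ms)
{
	struct epoll_event ev;
	int n = epoll_wait(epoll_fd, &ev, 1, timeout_ms);

	return n == 1 ? ev.data.fd : -1;
}
```

Keeping an index in the event data means the poll loop needs no fd-to-ring lookup table, at the cost of the index having to stay stable for as long as the fd is registered.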
@@ -413,12 +413,15 @@ $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
$(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
$(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
$(OUTPUT)/perfbuf_bench.skel.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \
$(OUTPUT)/bench_count.o \
$(OUTPUT)/bench_rename.o \
$(OUTPUT)/bench_trigger.o
$(OUTPUT)/bench_trigger.o \
$(OUTPUT)/bench_ringbufs.o
$(call msg,BINARY,,$@)
$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
@@ -130,6 +130,13 @@ static const struct argp_option opts[] = {
{},
};
extern struct argp bench_ringbufs_argp;
static const struct argp_child bench_parsers[] = {
{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
{},
};
static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
static int pos_args;
@@ -208,6 +215,7 @@ static void parse_cmdline_args(int argc, char **argv)
.options = opts,
.parser = parse_arg,
.doc = argp_program_doc,
.children = bench_parsers,
};
if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
exit(1);
@@ -310,6 +318,10 @@ extern const struct bench bench_trig_rawtp;
extern const struct bench bench_trig_kprobe;
extern const struct bench bench_trig_fentry;
extern const struct bench bench_trig_fmodret;
extern const struct bench bench_rb_libbpf;
extern const struct bench bench_rb_custom;
extern const struct bench bench_pb_libbpf;
extern const struct bench bench_pb_custom;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -327,6 +339,10 @@ static const struct bench *benchs[] = {
&bench_trig_kprobe,
&bench_trig_fentry,
&bench_trig_fmodret,
&bench_rb_libbpf,
&bench_rb_custom,
&bench_pb_libbpf,
&bench_pb_custom,
};
static void setup_benchmark()