Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, considerably improving the
     performance of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict priority and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvell Octeon CNF95N and CN10KB Ethernet Switches
      - Marvell Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extended ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - Freescale fec: add initial XDP support
         - Marvell mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless Ethernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
This commit is contained in:
Linus Torvalds
2022-12-13 15:47:48 -08:00
2013 changed files with 166136 additions and 34555 deletions


@@ -298,3 +298,48 @@ A: NO.
The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.
Q: What is the compatibility story for special BPF types in map values?
-----------------------------------------------------------------------
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows to use helpers for
such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the
kernel preserve backwards compatibility for these features?
A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
NO, but see below.
For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI.
For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr and __kptr_ref tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr
fields and bpf_kptr_xchg() helper will continue to be supported across kernel
releases for the supported types.
For any other supported struct type, unless explicitly stated in this document
and added to bpf.h UAPI header, such types can and will arbitrarily change their
size, type, and alignment, or any other user visible API or ABI detail across
kernel releases. The users must adapt their BPF programs to the new changes and
update them to make sure their programs continue to work correctly.
NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
order to introduce more special fields in the future. Hence, user programs must
avoid defining types with 'bpf\_' prefix to not be broken in future releases.
In other words, no backwards compatibility is guaranteed if one using a type
in BTF with 'bpf\_' prefix.
Q: What is the compatibility story for special BPF types in allocated objects?
------------------------------------------------------------------------------
Q: Same as above, but for allocated objects (i.e. objects allocated using
bpf_obj_new for user defined types). Will the kernel preserve backwards
compatibility for these features?
A: NO.
Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).


@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**
Submitting patches
==================
Q: How do I run BPF CI on my changes before sending them out for review?
------------------------------------------------------------------------
A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
While GitHub also provides a CLI that can be used to accomplish the same
results, here we focus on the UI based workflow.
The following steps lay out how to start a CI run for your patches:
- Create a fork of the aforementioned repository in your own account (one
  time action)
- Clone the fork locally, check out a new branch tracking either the bpf-next
  or bpf branch, and apply your to-be-tested patches on top of it
- Push the local branch to your fork and create a pull request against
  kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively
Shortly after the pull request has been created, the CI workflow will run. Note
that capacity is shared with patches submitted upstream being checked and so
depending on utilization the run can take a while to finish.
Note furthermore that both base branches (bpf-next_base and bpf_base) will be
updated as patches are pushed to the respective upstream branches they track.
As such, an automatic rebase of your patch set will be attempted as well. This
behavior can result in a CI run being aborted and restarted with the new
baseline.
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the bpf kernel mailing list:


@@ -0,0 +1,485 @@
=============
BPF Iterators
=============
----------
Motivation
----------
There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
sockets in the system. However, their output format tends to be fixed, and if
users want more information about these sockets, they have to patch the
kernel, and such patches often take time to land upstream and reach a
release. The same is true for popular tools like
`ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_, where any
additional information needs a kernel patch.
To solve this problem, the `drgn
<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
dig out the kernel data with no kernel change. However, the main drawback for
drgn is performance, as it cannot do pointer tracing inside the kernel. In
addition, drgn cannot validate a pointer value and may read invalid data if the
pointer becomes invalid inside the kernel.
The BPF iterator solves the above problem by providing flexibility on what data
(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
data object.
----------------------
How BPF Iterators Work
----------------------
A BPF iterator is a type of BPF program that allows users to iterate over
specific types of kernel objects. Unlike traditional BPF tracing programs that
allow users to define callbacks that are invoked at particular points of
execution in the kernel, BPF iterators allow users to define callbacks that
should be executed for every entry in a variety of kernel data structures.
For example, users can define a BPF iterator that iterates over every task on
the system and dumps the total amount of CPU runtime currently used by each of
them. Another BPF task iterator may instead dump the cgroup information for each
task. Such flexibility is the core value of BPF iterators.
A BPF program is always loaded into the kernel at the behest of a user space
process. A user space process loads a BPF program by opening and initializing
the program skeleton as required and then invoking a syscall to have the BPF
program verified and loaded by the kernel.
In traditional tracing programs, a program is activated by having user space
obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
activated, the program callback will be invoked whenever the tracepoint is
triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
program is obtained using ``bpf_link_create()``, and the program callback is
invoked by issuing system calls from user space.
Next, let us see how you can use the iterators to iterate on kernel objects and
read data.
------------------------
How to Use BPF iterators
------------------------
BPF selftests are a great resource to illustrate how to use the iterators. In
this section, we'll walk through a BPF selftest which shows how to load and use
a BPF iterator program. To begin, we'll look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, we'll look at a BPF program that runs in kernel space.
Loading a BPF iterator in the kernel from user space typically involves the
following steps:
* The BPF program is loaded into the kernel through ``libbpf``. Once the
  kernel has verified and loaded the program, it returns a file descriptor
  (fd) to user space.
* Obtain a ``link_fd`` for the BPF program by calling ``bpf_link_create()``
  with the BPF program file descriptor received from the kernel.
* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
  ``bpf_iter_create()`` with the ``link_fd`` received from the previous step.
* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
  available.
* Close the iterator fd using ``close(bpf_iter_fd)``.
* To reread the data, obtain a new ``bpf_iter_fd`` and do the read again.
The following are a few examples of selftest BPF iterator programs:
* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
iterator.
::
  struct bpf_iter__task_file {
          union {
                  struct bpf_iter_meta *meta;
          };
          union {
                  struct task_struct *task;
          };
          u32 fd;
          union {
                  struct file *file;
          };
  };
In the above code, the field 'meta' contains the metadata, which is the same for
all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away when the BPF program runs.
Here is a snippet from the ``bpf_iter_task_file.c`` file:
::
  int count = 0;
  int tgid = 0;
  int last_tgid = 0;
  int unique_tgid_count = 0;

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct task_struct *task = ctx->task;
          struct file *file = ctx->file;
          __u32 fd = ctx->fd;

          if (task == NULL || file == NULL)
                  return 0;

          if (ctx->meta->seq_num == 0) {
                  count = 0;
                  BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
          }

          if (tgid == task->tgid && task->tgid != task->pid)
                  count++;

          if (last_tgid != task->tgid) {
                  last_tgid = task->tgid;
                  unique_tgid_count++;
          }

          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                         (long)file->f_op);

          return 0;
  }
In the above example, the section name ``SEC("iter/task_file")`` indicates
that the program is a BPF iterator program that iterates over all files of
all tasks. The context of the program is the ``bpf_iter__task_file`` struct.
The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF
program can export data to user space using a variety of BPF helper functions.
You can use either ``bpf_seq_printf()`` (and the ``BPF_SEQ_PRINTF`` helper
macro) for formatted output or ``bpf_seq_write()`` for binary data. For
binary-encoded data, user space applications can process the data from
``bpf_seq_write()`` as needed. For formatted data, you can pin the BPF
iterator to the bpffs mount and then use ``cat <path>`` to print the results,
similar to ``cat /proc/net/netlink``. Later, use ``rm -f <path>`` to remove
the pinned iterator.
For example, you can use the following command to create a BPF iterator from the
``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
path:
::
  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
And then print out the results using the following command:
::
  $ cat /sys/fs/bpf/my_route
-------------------------------------------------------
Implement Kernel Support for BPF Iterator Program Types
-------------------------------------------------------
To implement a BPF iterator in the kernel, the developer must make a one-time
change to the following key data structure defined in the `bpf.h
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
file.
::
  struct bpf_iter_reg {
          const char *target;
          bpf_iter_attach_target_t attach_target;
          bpf_iter_detach_target_t detach_target;
          bpf_iter_show_fdinfo_t show_fdinfo;
          bpf_iter_fill_link_info_t fill_link_info;
          bpf_iter_get_func_proto_t get_func_proto;
          u32 ctx_arg_info_size;
          u32 feature;
          struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
          const struct bpf_iter_seq_info *seq_info;
  };
After filling the data structure fields, call ``bpf_iter_reg_target()`` to
register the iterator to the main BPF iterator subsystem.
The following is the breakdown for each field in struct ``bpf_iter_reg``.
.. list-table::
:widths: 25 50
:header-rows: 1
* - Fields
- Description
* - target
- Specifies the name of the BPF iterator. For example: ``bpf_map``,
``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
* - attach_target and detach_target
- Allows for target specific ``link_create`` action since some targets
may need special processing. Called during the user space link_create stage.
* - show_fdinfo and fill_link_info
- Called to fill target specific information when user tries to get link
info associated with the iterator.
* - get_func_proto
- Permits a BPF iterator to access BPF helpers specific to the iterator.
* - ctx_arg_info_size and ctx_arg_info
- Specifies the verifier states for BPF program arguments associated with
the bpf iterator.
* - feature
- Specifies certain action requests in the kernel BPF iterator
infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
that the kernel function cond_resched() is called to avoid other kernel
subsystem (e.g., rcu) misbehaving.
* - seq_info
- Specifies the ``seq_file`` handling for the iterator: the seq operations
used to walk the target objects, together with hooks to initialize and free
the iterator's private data (see ``struct bpf_iter_seq_info``).
`Click here
<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
to see an implementation of the ``task_vma`` BPF iterator in the kernel.
---------------------------------
Parameterizing BPF Task Iterators
---------------------------------
By default, BPF iterators walk through all the objects of the specified types
(processes, cgroups, maps, etc.) across the entire system to read relevant
kernel data. But often, there are cases where we only care about a much smaller
subset of iterable kernel objects, such as only iterating tasks within a
specific process. Therefore, BPF iterator programs support filtering out objects
from iteration by allowing user space to configure the iterator program when it
is attached.
--------------------------
BPF Task Iterator Program
--------------------------
The following code is a BPF iterator program that prints file and task
information through the ``seq_file`` of the iterator. It is a standard BPF
iterator program that visits every file the iterator yields. We will use this
BPF program in our example later.
::
  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct task_struct *task = ctx->task;
          struct file *file = ctx->file;
          __u32 fd = ctx->fd;

          if (task == NULL || file == NULL)
                  return 0;

          if (ctx->meta->seq_num == 0) {
                  BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
          }

          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                         (long)file->f_op);

          return 0;
  }
----------------------------------------
Creating a File Iterator with Parameters
----------------------------------------
Now, let us look at how to create an iterator that includes only files of a
process.
First, fill the ``bpf_iter_attach_opts`` struct as shown below:
::
  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  union bpf_iter_link_info linfo;

  memset(&linfo, 0, sizeof(linfo));
  linfo.task.pid = getpid();
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);
``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. In this example,
``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
separate file descriptor table. In most circumstances, all process threads share
a single file descriptor table.
Now, in the user space program, pass a pointer to this struct to
``bpf_program__attach_iter()``.
::
  link = bpf_program__attach_iter(prog, &opts);
  iter_fd = bpf_iter_create(bpf_link__fd(link));
If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (in the current *pid* namespace, actually). It is the same as passing
NULL as the second argument to ``bpf_program__attach_iter()``.
The whole program looks like the following code:
::
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include "bpf_iter_task_ex.skel.h"

  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
  {
          struct bpf_link *link;
          char buf[16] = {};
          int iter_fd = -1, len;
          int ret = 0;

          link = bpf_program__attach_iter(prog, opts);
          if (!link) {
                  fprintf(stderr, "bpf_program__attach_iter() fails\n");
                  return -1;
          }
          iter_fd = bpf_iter_create(bpf_link__fd(link));
          if (iter_fd < 0) {
                  fprintf(stderr, "bpf_iter_create() fails\n");
                  ret = -1;
                  goto free_link;
          }
          /* do not check contents, but ensure read() ends without error */
          while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
                  buf[len] = 0;
                  printf("%s", buf);
          }
          printf("\n");
  free_link:
          if (iter_fd >= 0)
                  close(iter_fd);
          bpf_link__destroy(link);
          return ret;
  }

  static void test_task_file(void)
  {
          LIBBPF_OPTS(bpf_iter_attach_opts, opts);
          struct bpf_iter_task_ex *skel;
          union bpf_iter_link_info linfo;

          skel = bpf_iter_task_ex__open_and_load();
          if (skel == NULL)
                  return;
          memset(&linfo, 0, sizeof(linfo));
          linfo.task.pid = getpid();
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);
          printf("PID %d\n", getpid());
          do_read_opts(skel->progs.dump_task_file, &opts);
          bpf_iter_task_ex__destroy(skel);
  }

  int main(int argc, const char * const * argv)
  {
          test_task_file();
          return 0;
  }
The following lines are the output of the program.
::
  PID 1859
      tgid      pid       fd      file
      1859     1859        0 ffffffff82270aa0
      1859     1859        1 ffffffff82270aa0
      1859     1859        2 ffffffff82270aa0
      1859     1859        3 ffffffff82272980
      1859     1859        4 ffffffff8225e120
      1859     1859        5 ffffffff82255120
      1859     1859        6 ffffffff82254f00
      1859     1859        7 ffffffff82254d80
      1859     1859        8 ffffffff8225abe0
------------------
Without Parameters
------------------
Let us look at how a BPF iterator without parameters skips files of other
processes in the system. In this case, the BPF program has to check the pid or
the tid of tasks, or it will receive every opened file in the system (in the
current *pid* namespace, actually). So, we usually add a global variable in the
BPF program to pass a *pid* to the BPF program.
The BPF program would look like the following block.
::
  ......
  int target_pid = 0;

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          ......
          if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
                  return 0;
          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                         (long)file->f_op);
          return 0;
  }
The user space program would look like the following block:
::
  ......
  static void test_task_file(void)
  {
          ......
          skel = bpf_iter_task_ex__open_and_load();
          if (skel == NULL)
                  return;
          skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
          memset(&linfo, 0, sizeof(linfo));
          linfo.task.pid = getpid();
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);
          ......
  }
``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID so that the BPF program skips
opened files of other processes. When you parametrize a BPF iterator, the
iterator calls the BPF program fewer times, which can save significant
resources.
---------------------------
Parametrizing VMA Iterators
---------------------------
By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread cannot have a separate address space (since Linux
2.6.0-test6). Here, using *tid* makes no difference from using *pid*.
----------------------------
Parametrizing Task Iterators
----------------------------
A BPF task iterator with *pid* includes all tasks (threads) of a process. The
BPF program receives these tasks one after another. You can specify a BPF task
iterator with *tid* parameter to include only the tasks that match the given
*tid*.


@@ -1062,4 +1062,9 @@ format.::
7. Testing
==========
The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
provides an extensive set of BTF-related tests.
.. Links
.. _tools/testing/selftests/bpf/prog_tests/btf.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c


@@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.
maps
bpf_prog_run
classic_vs_extended.rst
bpf_iterators
bpf_licensing
test_debug
clang-notes
linux-notes
other
redirect
.. only:: subproject and html


@@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
``BPF_XOR | BPF_K | BPF_ALU`` means::

  dst_reg = (u32) dst_reg ^ (u32) imm32

``BPF_XOR | BPF_K | BPF_ALU64`` means::

  dst_reg = dst_reg ^ imm32
Byte swap instructions


@@ -72,6 +72,30 @@ argument as its size. By default, without __sz annotation, the size of the type
of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
pointer.
2.2.2 __k Annotation
--------------------
This annotation is only understood for scalar arguments. It indicates that the
verifier must check that the scalar argument is a known constant which is not
a size parameter, and whose value is relevant to the safety of the program.
An example is given below::
  void *bpf_obj_new(u32 local_type_id__k, ...)
  {
  ...
  }
Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and returns a sized pointer to it. Each type ID
has a distinct size, hence it is crucial to treat each such call as distinct
when values don't match during verifier state pruning checks.
Hence, whenever a kfunc accepts a constant scalar argument which is not a size
parameter, and the value of the constant matters for program safety, the __k
suffix should be used.
.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function
@@ -137,22 +161,20 @@ KF_ACQUIRE and KF_RET_NULL flags.
--------------------------
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer).
There are two types of pointers to kernel objects which are considered "valid":
1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.
Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.
2.4.6 KF_SLEEPABLE flag
-----------------------
@@ -169,6 +191,15 @@ rebooting or panicking. Due to this, additional restrictions apply to these
calls. At the moment they only require the CAP_SYS_BOOT capability, but more
can be added later.
2.4.8 KF_RCU flag
-----------------
The KF_RCU flag is used for kfuncs which take an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates that the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0, and the kfunc must take this
into consideration.
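As an illustrative sketch, a kfunc combining these flags would be registered
with a line along these lines (the kfunc name here is hypothetical)::

    BTF_ID_FLAGS(func, bpf_task_acquire_rcu, KF_ACQUIRE | KF_RCU | KF_RET_NULL)

The KF_RET_NULL flag is included because an object with a reference count of 0
may no longer be acquirable, in which case the kfunc would return NULL.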
2.5 Registering the kfuncs
--------------------------
@@ -191,3 +222,201 @@ type. An example is shown below::
return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
}
late_initcall(init_subsystem);
3. Core kfuncs
==============
The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.
3.1 struct task_struct * kfuncs
-------------------------------
There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_task_acquire bpf_task_release
These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:
.. code-block:: c
/**
* A trivial example tracepoint program that shows how to
* acquire and release a struct task_struct * pointer.
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
{
struct task_struct *acquired;
acquired = bpf_task_acquire(task);
/*
* In a typical program you'd do something like store
* the task in a map, and the map will automatically
* release it later. Here, we release it manually.
*/
bpf_task_release(acquired);
return 0;
}
----
A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_task_from_pid
Here is an example of it being used:
.. code-block:: c
SEC("tp_btf/task_newtask")
int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
{
struct task_struct *lookup;
lookup = bpf_task_from_pid(task->pid);
if (!lookup)
/* A task should always be found, as %task is a tracepoint arg. */
return -ENOENT;
if (lookup->pid != task->pid) {
/* bpf_task_from_pid() looks up the task via its
* globally-unique pid from the init_pid_ns. Thus,
* the pid of the lookup task should always be the
* same as the input task.
*/
bpf_task_release(lookup);
return -EINVAL;
}
/* bpf_task_from_pid() returns an acquired reference,
* so it must be dropped before returning from the
* tracepoint handler.
*/
bpf_task_release(lookup);
return 0;
}
3.2 struct cgroup * kfuncs
--------------------------
``struct cgroup *`` objects also have acquire and release functions:
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_acquire bpf_cgroup_release
These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.
----
You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_kptr_get
Here's an example of how it can be used:
.. code-block:: c
/* A struct containing the struct cgroup kptr which is actually stored in the map. */
struct __cgroups_kfunc_map_value {
struct cgroup __kptr_ref *cgroup;
};
/* The map containing struct __cgroups_kfunc_map_value entries. */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, int);
__type(value, struct __cgroups_kfunc_map_value);
__uint(max_entries, 1);
} __cgroups_kfunc_map SEC(".maps");
/* ... */
/**
* A simple example tracepoint program showing how a
* struct cgroup kptr that is stored in a map can
* be acquired using the bpf_cgroup_kptr_get() kfunc.
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
{
struct cgroup *kptr;
struct __cgroups_kfunc_map_value *v;
s32 id = cgrp->self.id;
/* Assume a cgroup kptr was previously stored in the map. */
v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
if (!v)
return -ENOENT;
/* Acquire a reference to the cgroup kptr that's already stored in the map. */
kptr = bpf_cgroup_kptr_get(&v->cgroup);
if (!kptr)
/* If no cgroup was present in the map, it's because
* we're racing with another CPU that removed it with
* bpf_kptr_xchg() between the bpf_map_lookup_elem()
* above, and our call to bpf_cgroup_kptr_get().
* bpf_cgroup_kptr_get() internally safely handles this
* race, and will return NULL if the cgroup is no longer
* present in the map by the time we invoke the kfunc.
*/
return -EBUSY;
/* Free the reference we just took above. Note that the
* original struct cgroup kptr is still in the map. It will
* be freed either at a later time if another context deletes
* it from the map, or automatically by the BPF subsystem if
* it's still present when the map is destroyed.
*/
bpf_cgroup_release(kptr);
return 0;
}
----
Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to look up the ancestor of a cgroup,
which is returned as a cgroup kptr.
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_ancestor
Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:
.. code-block:: c
/**
* Simple tracepoint example that illustrates how a cgroup's
* ancestor can be accessed using bpf_cgroup_ancestor().
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
{
struct cgroup *parent;
/* The parent cgroup resides at the level before the current cgroup's level. */
parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
if (!parent)
return -ENOENT;
bpf_printk("Parent id is %d", parent->self.id);
/* Return the parent cgroup that was acquired above. */
bpf_cgroup_release(parent);
return 0;
}

View File

@@ -1,5 +1,7 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
.. _libbpf:
libbpf
======
@@ -7,6 +9,7 @@ libbpf
:maxdepth: 1
API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
program_types
libbpf_naming_convention
libbpf_build

View File

@@ -0,0 +1,203 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
.. _program_types_and_elf:
Program Types and ELF Sections
==============================
The table below lists the program types, their attach types where relevant and the ELF section
names supported by libbpf for them. The ELF section names follow these rules:
- ``type`` is an exact match, e.g. ``SEC("socket")``
- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")``
with a '``/``' separator between ``type`` and ``extras``.
When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The
format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")``
for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are
described in more detail in the footnotes.
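For illustration, the following section names all satisfy these rules (the
attach targets are examples only):

.. code-block:: c

    SEC("socket")                                /* exact type match, no extras */
    SEC("tracepoint/syscalls/sys_enter_openat")  /* type + auto-attach extras */
    SEC("usdt//usr/bin/prog:provider:name")      /* USDT path:provider:name */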
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| Program Type | Attach Type | ELF Section Name | Sleepable |
+===========================================+========================================+==================================+===========+
| ``BPF_PROG_TYPE_CGROUP_DEVICE`` | ``BPF_CGROUP_DEVICE`` | ``cgroup/dev`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SKB`` | | ``cgroup/skb`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_EGRESS`` | ``cgroup_skb/egress`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_INGRESS`` | ``cgroup_skb/ingress`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` | ``BPF_CGROUP_GETSOCKOPT`` | ``cgroup/getsockopt`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_SETSOCKOPT`` | ``cgroup/setsockopt`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR`` | ``BPF_CGROUP_INET4_BIND`` | ``cgroup/bind4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_CONNECT`` | ``cgroup/connect4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_GETPEERNAME`` | ``cgroup/getpeername4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_GETSOCKNAME`` | ``cgroup/getsockname4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_BIND`` | ``cgroup/bind6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_CONNECT`` | ``cgroup/connect6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_GETPEERNAME`` | ``cgroup/getpeername6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_GETSOCKNAME`` | ``cgroup/getsockname6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP4_RECVMSG`` | ``cgroup/recvmsg4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP4_SENDMSG`` | ``cgroup/sendmsg4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP6_RECVMSG`` | ``cgroup/recvmsg6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP6_SENDMSG`` | ``cgroup/sendmsg6`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK`` | ``BPF_CGROUP_INET4_POST_BIND`` | ``cgroup/post_bind4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_POST_BIND`` | ``cgroup/post_bind6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_SOCK_CREATE`` | ``cgroup/sock_create`` | |
+ + +----------------------------------+-----------+
| | | ``cgroup/sock`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_SOCK_RELEASE`` | ``cgroup/sock_release`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SYSCTL`` | ``BPF_CGROUP_SYSCTL`` | ``cgroup/sysctl`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_EXT`` | | ``freplace+`` [#fentry]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_FLOW_DISSECTOR`` | ``BPF_FLOW_DISSECTOR`` | ``flow_dissector`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_KPROBE`` | | ``kprobe+`` [#kprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``kretprobe+`` [#kprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``ksyscall+`` [#ksyscall]_ | |
+ + +----------------------------------+-----------+
| | | ``kretsyscall+`` [#ksyscall]_ | |
+ + +----------------------------------+-----------+
| | | ``uprobe+`` [#uprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``uprobe.s+`` [#uprobe]_ | Yes |
+ + +----------------------------------+-----------+
| | | ``uretprobe+`` [#uprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``uretprobe.s+`` [#uprobe]_ | Yes |
+ + +----------------------------------+-----------+
| | | ``usdt+`` [#usdt]_ | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_KPROBE_MULTI`` | ``kprobe.multi+`` [#kpmulti]_ | |
+ + +----------------------------------+-----------+
| | | ``kretprobe.multi+`` [#kpmulti]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LIRC_MODE2`` | ``BPF_LIRC_MODE2`` | ``lirc_mode2`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LSM`` | ``BPF_LSM_CGROUP`` | ``lsm_cgroup+`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_LSM_MAC`` | ``lsm+`` [#lsm]_ | |
+ + +----------------------------------+-----------+
| | | ``lsm.s+`` [#lsm]_ | Yes |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_IN`` | | ``lwt_in`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_OUT`` | | ``lwt_out`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` | | ``lwt_seg6local`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_XMIT`` | | ``lwt_xmit`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_PERF_EVENT`` | | ``perf_event`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` | | ``raw_tp.w+`` [#rawtp]_ | |
+ + +----------------------------------+-----------+
| | | ``raw_tracepoint.w+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT`` | | ``raw_tp+`` [#rawtp]_ | |
+ + +----------------------------------+-----------+
| | | ``raw_tracepoint+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_ACT`` | | ``action`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` | |
+ + +----------------------------------+-----------+
| | | ``tc`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_LOOKUP`` | ``BPF_SK_LOOKUP`` | ``sk_lookup`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_MSG`` | ``BPF_SK_MSG_VERDICT`` | ``sk_msg`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_REUSEPORT`` | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_REUSEPORT_SELECT`` | ``sk_reuseport`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_SKB`` | | ``sk_skb`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_SKB_STREAM_PARSER`` | ``sk_skb/stream_parser`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_SKB_STREAM_VERDICT`` | ``sk_skb/stream_verdict`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCKET_FILTER`` | | ``socket`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCK_OPS`` | ``BPF_CGROUP_SOCK_OPS`` | ``sockops`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_STRUCT_OPS`` | | ``struct_ops+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SYSCALL`` | | ``syscall`` | Yes |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACEPOINT`` | | ``tp+`` [#tp]_ | |
+ + +----------------------------------+-----------+
| | | ``tracepoint+`` [#tp]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACING`` | ``BPF_MODIFY_RETURN`` | ``fmod_ret+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fmod_ret.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_FENTRY`` | ``fentry+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fentry.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_FEXIT`` | ``fexit+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fexit.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_ITER`` | ``iter+`` [#iter]_ | |
+ + +----------------------------------+-----------+
| | | ``iter.s+`` [#iter]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | |
+ + +----------------------------------+-----------+
| | | ``xdp/cpumap`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_XDP_DEVMAP`` | ``xdp.frags/devmap`` | |
+ + +----------------------------------+-----------+
| | | ``xdp/devmap`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_XDP`` | ``xdp.frags`` | |
+ + +----------------------------------+-----------+
| | | ``xdp`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
.. rubric:: Footnotes
.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``.
.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid
characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid
non-negative integer.
.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``.
.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``.
.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``.
.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern``
supports ``*`` and ``?`` wildcards. Valid characters for pattern are
``a-zA-Z0-9_.*?``.
.. [#lsm] The ``lsm`` attach format is ``lsm[.s]/<hook>``.
.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``.
.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``.
.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``.

View File

@@ -0,0 +1,262 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
================================================
BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY
================================================
.. note::
- ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19
- ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6
``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array
storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is
of constant size. The size of the array is defined in ``max_entries`` at
creation time. All array elements are pre-allocated and zero initialized when
created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each
CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value
stored can be of any size, however, all array elements are aligned to 8
bytes.
Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by
setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and
starts on the first page. Sufficient page-sized and page-aligned blocks of
memory are allocated to store all array values, starting on the second page,
which in some cases will result in over-allocation of memory. The benefit of
using this is increased performance and ease of use since userspace programs
would not be required to use helper functions to access and mutate data.
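As a sketch (error handling elided, and assuming ``fd`` refers to a
``BPF_F_MMAPABLE`` array of 256 ``long`` values), userspace could then access
the values directly:

.. code-block:: c

    #include <sys/mman.h>

    long *values = mmap(NULL, 256 * sizeof(long), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (values != MAP_FAILED)
        values[42] += 1;    /* direct access, no helper call needed */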
Usage
=====
Kernel BPF
----------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper.
This helper returns a pointer into the array element, so to avoid data races
with userspace reading the value, the user must use primitives like
``__sync_fetch_and_add()`` when updating the value in-place.
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Array elements can be updated using the ``bpf_map_update_elem()`` helper.
``bpf_map_update_elem()`` returns 0 on success, or negative error in case of
failure.
Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported.
To clear an array element, you may use ``bpf_map_update_elem()`` to insert a
zero value at that index.
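For example, assuming an array map ``my_map`` with ``long`` values, a BPF
program could clear index 42 like this:

.. code-block:: c

    __u32 index = 42;
    long zero = 0;

    /* Deletion is not supported; overwrite the slot with zeroes instead. */
    bpf_map_update_elem(&my_map, &index, &zero, BPF_ANY);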
Per CPU Array
-------------
Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs
across different CPUs. To restrict storage to a single CPU, you may use a
``BPF_MAP_TYPE_PERCPU_ARRAY``.
When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and
``bpf_map_lookup_elem()`` helpers automatically access the slot for the current
CPU.
bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the array
value for a specific CPU. It returns the value on success, or ``NULL`` if no
entry was found or ``cpu`` is invalid.
Concurrency
-----------
Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock``
to synchronize access.
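A sketch of protecting an array element with a spin lock is shown below; the
struct and field names are illustrative.

.. code-block:: c

    struct concurrent_element {
            struct bpf_spin_lock lock;
            long counter;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __type(key, __u32);
            __type(value, struct concurrent_element);
            __uint(max_entries, 1);
    } counters SEC(".maps");

    static void increment(__u32 index)
    {
            struct concurrent_element *elem;

            elem = bpf_map_lookup_elem(&counters, &index);
            if (!elem)
                    return;

            bpf_spin_lock(&elem->lock);
            elem->counter++;
            bpf_spin_unlock(&elem->lock);
    }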
Userspace
---------
Access from userspace uses libbpf APIs with the same names as above, with
the map identified by its ``fd``.
Examples
========
Please see the ``tools/testing/selftests/bpf`` directory for functional
examples. The code samples below demonstrate API usage.
Kernel BPF
----------
This snippet shows how to declare an array in a BPF program.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, u32);
__type(value, long);
__uint(max_entries, 256);
} my_map SEC(".maps");
This example BPF program shows how to access an array element.
.. code-block:: c
int bpf_prog(struct __sk_buff *skb)
{
struct iphdr ip;
int index;
long *value;
if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0)
return 0;
index = ip.protocol;
value = bpf_map_lookup_elem(&my_map, &index);
if (value)
__sync_fetch_and_add(value, skb->len);
return 0;
}
Userspace
---------
BPF_MAP_TYPE_ARRAY
~~~~~~~~~~~~~~~~~~
This snippet shows how to create an array, using ``bpf_map_create_opts`` to
set flags.
.. code-block:: c
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
int create_array()
{
int fd;
LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
fd = bpf_map_create(BPF_MAP_TYPE_ARRAY,
"example_array", /* name */
sizeof(__u32), /* key size */
sizeof(long), /* value size */
256, /* max entries */
&opts); /* create opts */
return fd;
}
This snippet shows how to initialize the elements of an array.
.. code-block:: c
int initialize_array(int fd)
{
__u32 i;
long value;
int ret;
for (i = 0; i < 256; i++) {
value = i;
ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY);
if (ret < 0)
return ret;
}
return ret;
}
This snippet shows how to retrieve an element value from an array.
.. code-block:: c
int lookup(int fd)
{
__u32 index = 42;
long value;
int ret;
ret = bpf_map_lookup_elem(fd, &index, &value);
if (ret < 0)
return ret;
/* use value here */
assert(value == 42);
return ret;
}
BPF_MAP_TYPE_PERCPU_ARRAY
~~~~~~~~~~~~~~~~~~~~~~~~~
This snippet shows how to initialize the elements of a per CPU array.
.. code-block:: c
int initialize_array(int fd)
{
int ncpus = libbpf_num_possible_cpus();
long values[ncpus];
__u32 i, j;
int ret;
for (i = 0; i < 256 ; i++) {
for (j = 0; j < ncpus; j++)
values[j] = i;
ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY);
if (ret < 0)
return ret;
}
return ret;
}
This snippet shows how to access the per CPU elements of an array value.
.. code-block:: c
int lookup(int fd)
{
int ncpus = libbpf_num_possible_cpus();
__u32 index = 42, j;
long values[ncpus];
int ret;
ret = bpf_map_lookup_elem(fd, &index, &values);
if (ret < 0)
return ret;
for (j = 0; j < ncpus; j++) {
/* Use per CPU value here */
assert(values[j] == 42);
}
return ret;
}
Semantics
=========
As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY``
in userspace, each value is an array with ``ncpus`` elements.
When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` can not be used
for these maps.

View File

@@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=========================
BPF_MAP_TYPE_BLOOM_FILTER
=========================
.. note::
- ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16
``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom
filters are a space-efficient probabilistic data structure used to
quickly test whether an element exists in a set. In a bloom filter,
false positives are possible whereas false negatives are not.
The bloom filter map does not have keys, only values. When the bloom
filter map is created, it must be created with a ``key_size`` of 0. The
bloom filter map supports two operations:
- push: adding an element to the map
- peek: determining whether an element is present in the map
BPF programs must use ``bpf_map_push_elem`` to add an element to the
bloom filter map and ``bpf_map_peek_elem`` to query the map. These
operations are exposed to userspace applications using the existing
``bpf`` syscall in the following way:
- ``BPF_MAP_UPDATE_ELEM`` -> push
- ``BPF_MAP_LOOKUP_ELEM`` -> peek
The ``max_entries`` size that is specified at map creation time is used
to approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than ``max_entries``, this may lead to a higher
false positive rate.
The number of hashes to use for the bloom filter is configurable using
the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation
time. If no number is specified, the default used will be 5 hash
functions. In general, using more hashes decreases both the false
positive rate and the speed of a lookup.
It is not possible to delete elements from a bloom filter map. A bloom
filter map may be used as an inner map. The user is responsible for
synchronising concurrent updates and lookups to ensure no false negative
lookups occur.
Usage
=====
Kernel BPF
----------
bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
A ``value`` can be added to a bloom filter using the
``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` when adding an entry to the bloom filter. This helper
returns ``0`` on success, or negative error in case of failure.
bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_peek_elem(struct bpf_map *map, void *value)
The ``bpf_map_peek_elem()`` helper is used to determine whether
``value`` is present in the bloom filter map. This helper returns ``0``
if ``value`` is probably present in the map, or ``-ENOENT`` if ``value``
is definitely not present in the map.
Userspace
---------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
A userspace program can add a ``value`` to a bloom filter using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on
success, or negative error in case of failure.
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value)
A userspace program can determine the presence of ``value`` in a bloom
filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key``
parameter must be set to ``NULL``. Returns ``0`` if ``value`` is
probably present in the map, or ``-ENOENT`` if ``value`` is definitely
not present in the map.
Examples
========
Kernel BPF
----------
This snippet shows how to declare a bloom filter in a BPF program:
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
__type(value, __u32);
__uint(max_entries, 1000);
__uint(map_extra, 3);
} bloom_filter SEC(".maps");
This snippet shows how to determine presence of a value in a bloom
filter in a BPF program:
.. code-block:: c
void *lookup(__u32 key)
{
if (bpf_map_peek_elem(&bloom_filter, &key) == 0) {
/* Verify not a false positive and fetch an associated
* value using a secondary lookup, e.g. in a hash table
*/
return bpf_map_lookup_elem(&hash_table, &key);
}
return 0;
}
Userspace
---------
This snippet shows how to use libbpf to create a bloom filter map from
userspace:
.. code-block:: c
int create_bloom()
{
LIBBPF_OPTS(bpf_map_create_opts, opts,
.map_extra = 3); /* number of hashes */
return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
"ipv6_bloom", /* name */
0, /* key size, must be zero */
sizeof(ipv6_addr), /* value size */
10000, /* max entries */
&opts); /* create options */
}
This snippet shows how to add an element to a bloom filter from
userspace:
.. code-block:: c
int add_element(struct bpf_map *bloom_map, __u32 value)
{
int bloom_fd = bpf_map__fd(bloom_map);
return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY);
}
References
==========
https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/


@@ -0,0 +1,109 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
=========================
BPF_MAP_TYPE_CGRP_STORAGE
=========================
The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents fixed-size,
cgroup-local storage. It is only available with ``CONFIG_CGROUPS``, as
are the helpers that access it. The data for a particular cgroup can be
retrieved by looking up the map with that cgroup.
This document describes the usage and semantics of the
``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
Usage
=====
The map key must be ``sizeof(int)`` representing a cgroup fd.
To access the storage in a program, use ``bpf_cgrp_storage_get``::
void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
``flags`` can be ``0`` or ``BPF_LOCAL_STORAGE_GET_F_CREATE``, which indicates
that new local storage will be created if none exists.
The local storage can be removed with ``bpf_cgrp_storage_delete``::
long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
The map is available to all program types.
Examples
========
A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE::
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
struct {
__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, long);
} cgrp_storage SEC(".maps");
SEC("tp_btf/sys_enter")
int BPF_PROG(on_enter, struct pt_regs *regs, long id)
{
struct task_struct *task = bpf_get_current_task_btf();
long *ptr;
ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
BPF_LOCAL_STORAGE_GET_F_CREATE);
if (ptr)
__sync_fetch_and_add(ptr, 1);
return 0;
}
Userspace accessing map declared above::
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
__u32 map_lookup(struct bpf_map *map, int cgrp_fd)
{
	__u32 value;

	/* libbpf's bpf_map_lookup_elem() copies the value out and
	 * returns 0 on success. */
	if (bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd, &value) == 0)
		return value;
	return 0;
}
Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
============================================================================
The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
illustrates the main differences between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc.
(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup
which is attached by a BPF program.
(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
``bpf_get_local_storage()`` always returns non-NULL local storage.
``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
it is possible that ``bpf_cgrp_storage_get()`` returns NULL local storage.
To avoid this, user space can call ``bpf_map_update_elem()`` to pre-allocate
local storage before a BPF program is attached.
(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage from a BPF
program while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage
at program detach time.
So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.


@@ -0,0 +1,177 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
===================
BPF_MAP_TYPE_CPUMAP
===================
.. note::
- ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15
.. kernel-doc:: kernel/bpf/cpumap.c
:doc: cpu map
An example use-case for this map type is software based Receive Side Scaling (RSS).
The CPUMAP represents the CPUs in the system indexed as the map-key, and the
map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated
kernel thread bound to the given CPU to represent the remote CPU execution unit.
Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program
on the remote CPU. This allows an XDP program to split its processing across
multiple CPUs. Consider, for example, a scenario where the initial CPU (which
sees/receives the packets) needs to do only minimal packet processing, while
the remote CPU (to which the packet is directed) can afford to spend more
cycles processing the frame. The
initial CPU is where the XDP redirect program is executed. The remote CPU
receives raw ``xdp_frame`` objects.
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs.
The lower two bits of ``flags`` are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller.
User space
----------
.. note::
CPUMAP entries can only be updated/looked up/deleted from user space and not
from an eBPF program. Trying to call these functions from a kernel eBPF
program will result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
CPU entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_cpumap_val``.
.. code-block:: c
struct bpf_cpumap_val {
__u32 qsize; /* queue size to remote target CPU */
union {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
};
The flags argument can be one of the following:
- BPF_ANY: Create a new element or update an existing element.
- BPF_NOEXIST: Create a new element only if it did not exist.
- BPF_EXIST: Update an existing element.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value);
CPU entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_delete_elem(int fd, const void *key);
CPU entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
Examples
========
Kernel
------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called
``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_CPUMAP);
__type(key, __u32);
__type(value, struct bpf_cpumap_val);
__uint(max_entries, 12);
} cpu_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 12);
} cpus_available SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 1);
} cpus_iterator SEC(".maps");
SEC("xdp")
int xdp_redir_cpu_round_robin(struct xdp_md *ctx)
{
__u32 key = 0;
__u32 cpu_dest = 0;
__u32 *cpu_selected, *cpu_iterator;
__u32 cpu_idx;
cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key);
if (!cpu_iterator)
return XDP_ABORTED;
cpu_idx = *cpu_iterator;
*cpu_iterator += 1;
if (*cpu_iterator == bpf_num_possible_cpus())
*cpu_iterator = 0;
cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
if (!cpu_selected)
return XDP_ABORTED;
cpu_dest = *cpu_selected;
if (cpu_dest >= bpf_num_possible_cpus())
return XDP_ABORTED;
return bpf_redirect_map(&cpu_map, cpu_dest, 0);
}
User space
----------
The following code snippet shows how to dynamically set ``max_entries`` for a
CPUMAP to the maximum number of CPUs available on the system.
.. code-block:: c
int set_max_cpu_entries(struct bpf_map *cpu_map)
{
if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) {
fprintf(stderr, "Failed to set max entries for cpu_map map: %s",
strerror(errno));
return -1;
}
return 0;
}
References
===========
- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap


@@ -0,0 +1,238 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=================================================
BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
=================================================
.. note::
- ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
- ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
the index to look up a reference to a net device, while ``BPF_MAP_TYPE_DEVMAP_HASH``
is backed by a hash table that uses a key to look up a reference to a net device.
The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
pairs to update the maps with new net devices.
.. note::
- The key to a hash map doesn't have to be an ``ifindex``.
- While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices
it comes at the cost of a hash of the key when performing a look up.
The setup and packet enqueue/send code is shared between the two types of
devmap; only the lookup and insertion is different.
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
references to net devices (for forwarding packets through other ports).
The lower two bits of *flags* are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
below.
With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
in the map, with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
from the broadcast.
.. note::
- The key is ignored if BPF_F_BROADCAST is set.
- The broadcast feature can also be used to implement multicast forwarding:
simply create multiple DEVMAPs, each one corresponding to a single multicast group.
This helper will return ``XDP_REDIRECT`` on success, or the value of the two
lower bits of the ``flags`` argument if the map lookup fails.
More information about redirection can be found in :doc:`redirect`.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
User space
----------
.. note::
DEVMAP entries can only be updated/deleted from user space and not
from an eBPF program. Trying to call these functions from a kernel eBPF
program will result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
Net device entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
compatibility.
.. code-block:: c
struct bpf_devmap_val {
__u32 ifindex; /* device index */
union {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
};
The ``flags`` argument can be one of the following:
- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.
DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
access to both Rx device and Tx device. The program associated with the ``fd``
must have type XDP with expected attach type ``xdp_devmap``.
When a program is associated with a device index, the program is run on an
``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
of how to attach/use xdp_devmap progs can be found in the kernel selftests:
- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value);
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_delete_elem(int fd, const void *key);
Net device entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
Examples
========
Kernel BPF
----------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
called ``tx_port``.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_DEVMAP);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 256);
} tx_port SEC(".maps");
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
called ``forward_map``.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
__type(key, __u32);
__type(value, struct bpf_devmap_val);
__uint(max_entries, 32);
} forward_map SEC(".maps");
.. note::
The value type in the DEVMAP above is a ``struct bpf_devmap_val``.
The following code snippet shows a simple xdp_redirect_map program. This program
would work with a user space program that populates the devmap ``forward_map`` based
on ingress ifindexes. The BPF program (below) is redirecting packets using the
ingress ``ifindex`` as the ``key``.
.. code-block:: c
SEC("xdp")
int xdp_redirect_map_func(struct xdp_md *ctx)
{
int index = ctx->ingress_ifindex;
return bpf_redirect_map(&forward_map, index, 0);
}
The following code snippet shows a BPF program that is broadcasting packets to
all the interfaces in the ``tx_port`` devmap.
.. code-block:: c
SEC("xdp")
int xdp_redirect_map_func(struct xdp_md *ctx)
{
return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
}
User space
----------
The following code snippet shows how to update a devmap called ``tx_port``.
.. code-block:: c
int update_devmap(int ifindex, int redirect_ifindex)
{
int ret;
ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
if (ret < 0) {
fprintf(stderr, "Failed to update devmap value: %s\n",
strerror(errno));
}
return ret;
}
The following code snippet shows how to update a hash_devmap called ``forward_map``.
.. code-block:: c
int update_devmap(int ifindex, int redirect_ifindex)
{
struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
int ret;
ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
if (ret < 0) {
fprintf(stderr, "Failed to update devmap value: %s\n",
strerror(errno));
}
return ret;
}
References
===========
- https://lwn.net/Articles/728146/
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106


@@ -34,7 +34,14 @@ the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
Usage
=====
.. c:function::
Kernel BPF
----------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Hash entries can be added or updated using the ``bpf_map_update_elem()``
@@ -49,14 +56,22 @@ parameter can be used to control the update behaviour:
``bpf_map_update_elem()`` returns 0 on success, or negative error in
case of failure.
.. c:function::
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
helper. This helper returns a pointer to the value associated with
``key``, or ``NULL`` if no entry was found.
.. c:function::
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Hash entries can be deleted using the ``bpf_map_delete_elem()``
@@ -70,7 +85,11 @@ For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
automatically access the hash slot for the current CPU.
.. c:function::
bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the
@@ -89,7 +108,11 @@ See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
Userspace
---------
.. c:function::
bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
In userspace, it is possible to iterate through the keys of a hash using


@@ -0,0 +1,197 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=====================
BPF_MAP_TYPE_LPM_TRIE
=====================
.. note::
- ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11
``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
can be used to match IP addresses to a stored set of prefixes.
Internally, data is stored in an unbalanced trie of nodes that uses
``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
network byte order, i.e. big endian, so ``data[0]`` stores the most
significant byte.
LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
``max_prefixlen/8`` bytes.
- For IPv4 addresses the data length is 4 bytes
- For IPv6 addresses the data length is 16 bytes
The value type stored in the LPM trie can be any user defined type.
.. note::
When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
``BPF_F_NO_PREALLOC`` flag.
Usage
=====
Kernel BPF
----------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
The longest prefix entry for a given data value can be found using the
``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
value associated with the longest matching ``key``, or ``NULL`` if no
entry was found.
The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
performing longest prefix lookups. For example, when searching for the
longest prefix match for an IPv4 address, ``prefixlen`` should be set to
``32``.
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Prefix entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically.
``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
case of failure.
.. note::
The flags parameter must be one of BPF_ANY, BPF_NOEXIST or BPF_EXIST,
but the value is ignored, giving BPF_ANY semantics.
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Prefix entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case
of failure.
Userspace
---------
Access from userspace uses libbpf APIs with the same names as above, with
the map identified by ``fd``.
bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
A userspace program can iterate through the entries in an LPM trie using
libbpf's ``bpf_map_get_next_key()`` function. The first key can be
fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
``NULL``. Subsequent calls will fetch the next key that follows the
current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
error in case of failure.
``bpf_map_get_next_key()`` will iterate through the LPM trie elements
from leftmost leaf first. This means that iteration will return more
specific keys before less specific ones.
Examples
========
Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
of LPM trie usage from userspace. The code snippets below demonstrate
API usage.
Kernel BPF
----------
The following BPF code snippet shows how to declare a new LPM trie for IPv4
address prefixes:
.. code-block:: c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
struct ipv4_lpm_key {
__u32 prefixlen;
__u32 data;
};
struct {
__uint(type, BPF_MAP_TYPE_LPM_TRIE);
__type(key, struct ipv4_lpm_key);
__type(value, __u32);
__uint(map_flags, BPF_F_NO_PREALLOC);
__uint(max_entries, 255);
} ipv4_lpm_map SEC(".maps");
The following BPF code snippet shows how to lookup by IPv4 address:
.. code-block:: c
void *lookup(__u32 ipaddr)
{
struct ipv4_lpm_key key = {
.prefixlen = 32,
.data = ipaddr
};
return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
}
Userspace
---------
The following snippet shows how to insert an IPv4 prefix entry into an
LPM trie:
.. code-block:: c
int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
{
struct ipv4_lpm_key ipv4_key = {
.prefixlen = prefixlen,
.data = addr
};
return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
}
The following snippet shows a userspace program walking through the entries
of an LPM trie:
.. code-block:: c
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
void iterate_lpm_trie(int map_fd)
{
struct ipv4_lpm_key *cur_key = NULL;
struct ipv4_lpm_key next_key;
struct value value;
int err;
for (;;) {
err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
if (err)
break;
bpf_map_lookup_elem(map_fd, &next_key, &value);
/* Use key and value here */
cur_key = &next_key;
}
}


@@ -0,0 +1,130 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
========================================================
BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
========================================================
.. note::
- ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were
introduced in kernel version 4.12
``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general
purpose support for map in map storage. One level of nesting is supported, where
an outer map contains instances of a single type of inner map, for example
``array_of_maps->sock_map``.
When creating an outer map, an inner map instance is used to initialize the
metadata that the outer map holds about its inner maps. This inner map has a
separate lifetime from the outer map and can be deleted after the outer map has
been created.
The outer map supports element lookup, update and delete from user space using
the syscall API. A BPF program is only allowed to do element lookup in the outer
map.
.. note::
- Multi-level nesting is not supported.
- Any BPF map type can be used as an inner map, except for
``BPF_MAP_TYPE_PROG_ARRAY``.
- A BPF program cannot update or delete outer map entries.
For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index
into the array. The array is a fixed size with ``max_entries`` elements that are
zero initialized when created.
For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the
map. The kernel is responsible for allocating and freeing key/value pairs, up to
the max_entries limit that you specify. Hash maps use pre-allocation of hash
table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable
pre-allocation when it is too memory expensive.
Usage
=====
Kernel BPF Helper
-----------------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This
helper returns a pointer to the inner map, or ``NULL`` if no entry was found.
Examples
========
Kernel BPF Example
------------------
This snippet shows how to create and initialise an array of devmaps in a BPF
program. Note that the outer array can only be modified from user space using
the syscall API.
.. code-block:: c
struct inner_map {
__uint(type, BPF_MAP_TYPE_DEVMAP);
__uint(max_entries, 10);
__type(key, __u32);
__type(value, __u32);
} inner_map1 SEC(".maps"), inner_map2 SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
__uint(max_entries, 2);
__type(key, __u32);
__array(values, struct inner_map);
} outer_map SEC(".maps") = {
.values = { &inner_map1,
&inner_map2 }
};
See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more
examples of declarative initialisation of outer maps.
User Space
----------
This snippet shows how to create an array based outer map:
.. code-block:: c
int create_outer_array(int inner_fd) {
LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
int fd;
fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS,
"example_array", /* name */
sizeof(__u32), /* key size */
sizeof(__u32), /* value size */
256, /* max entries */
&opts); /* create opts */
return fd;
}
This snippet shows how to add an inner map to an outer map:
.. code-block:: c
int add_devmap(int outer_fd, int index, const char *name) {
int fd;
fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name,
sizeof(__u32), sizeof(__u32), 256, NULL);
if (fd < 0)
return fd;
return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY);
}
References
==========
- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/
- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/


@@ -0,0 +1,146 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=========================================
BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
=========================================
.. note::
- ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced
in kernel version 4.20
``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK``
provides LIFO storage for BPF programs. These maps support peek, pop and
push operations that are exposed to BPF programs through the respective
helpers. These operations are exposed to userspace applications using
the existing ``bpf`` syscall in the following way:
- ``BPF_MAP_LOOKUP_ELEM`` -> peek
- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop
- ``BPF_MAP_UPDATE_ELEM`` -> push
``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support
``BPF_F_NO_PREALLOC``.
Usage
=====
Kernel BPF
----------
bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
An element ``value`` can be added to a queue or stack using the
``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then,
when the queue or stack is full, the oldest element will be removed to
make room for ``value`` to be added. Returns ``0`` on success, or
negative error in case of failure.
bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_peek_elem(struct bpf_map *map, void *value)
This helper fetches an element ``value`` from a queue or stack without
removing it. Returns ``0`` on success, or negative error in case of
failure.
bpf_map_pop_elem()
~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_pop_elem(struct bpf_map *map, void *value)
This helper removes the head element from a queue or stack, copying it
into ``value``. Returns ``0`` on success, or negative error in case of failure.
Userspace
---------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
A userspace program can push ``value`` onto a queue or stack using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the
same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on
success, or negative error in case of failure.
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value)
A userspace program can peek at the ``value`` at the head of a queue or stack
using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be
set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.
bpf_map_lookup_and_delete_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)

A userspace program can pop a ``value`` from the head of a queue or stack using
the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter
must be set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.
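These user space calls can be combined to drain a queue; a minimal sketch,
assuming ``fd`` is a file descriptor for a ``__u32`` queue map and libbpf is
linked in:

.. code-block:: c

    /* Sketch: pop every element currently in the queue. The key is
     * always NULL for queue and stack maps; the loop stops once
     * bpf_map_lookup_and_delete_elem() fails (e.g. -ENOENT when the
     * queue is empty).
     */
    static void drain_queue(int fd)
    {
            __u32 value;

            while (bpf_map_lookup_and_delete_elem(fd, NULL, &value) == 0)
                    printf("popped %u\n", value);
    }
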
Examples
========
Kernel BPF
----------
This snippet shows how to declare a queue in a BPF program:
.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_QUEUE);
            __type(value, __u32);
            __uint(max_entries, 10);
    } queue SEC(".maps");

Userspace
---------
This snippet shows how to use libbpf's low-level API to create a queue from
userspace:
.. code-block:: c

    int create_queue()
    {
            return bpf_map_create(BPF_MAP_TYPE_QUEUE,
                                  "sample_queue", /* name */
                                  0,              /* key size, must be zero */
                                  sizeof(__u32),  /* value size */
                                  10,             /* max entries */
                                  NULL);          /* create options */
    }

References
==========
https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=======================
BPF_MAP_TYPE_SK_STORAGE
=======================
.. note::
- ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2
``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
to be provided and acts as the handle for accessing the socket-local
storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
locally with each socket instead of with the map. The kernel is responsible for
allocating storage for a socket when requested and for freeing the storage when
either the map or the socket is deleted.
.. note::
- The key type must be ``int`` and ``max_entries`` must be set to ``0``.
- The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
socket-local storage.
Usage
=====
Kernel BPF
----------
bpf_sk_storage_get()
~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)

Socket-local storage can be retrieved using the ``bpf_sk_storage_get()``
helper. The helper gets the storage from ``sk`` that is associated with ``map``.
If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then
``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not
already exist. ``value`` can be used together with
``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it
will be zero initialized. Returns a pointer to the storage on success, or
``NULL`` in case of failure.
.. note::
- ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
- ``sk`` is a ``struct bpf_sock`` pointer for other program types.
bpf_sk_storage_delete()
~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    long bpf_sk_storage_delete(struct bpf_map *map, void *sk)

Socket-local storage can be deleted using the ``bpf_sk_storage_delete()``
helper. The helper deletes the storage from ``sk`` that is identified by
``map``. Returns ``0`` on success, or negative error in case of failure.
User space
----------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. The ``flags`` parameter can be used to control the update behaviour:
- ``BPF_ANY`` will create storage for ``fd`` or update existing storage.
- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already
exist, otherwise the call will fail with ``-EEXIST``.
- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists,
otherwise the call will fail with ``-ENOENT``.
Returns ``0`` on success, or negative error in case of failure.
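The different flags can be illustrated with a short snippet; a minimal
sketch, where ``map_fd``, ``sock_fd`` and ``struct my_storage`` are
illustrative stand-ins (the value type must match the map's
``__type(value, ...)``):

.. code-block:: c

    /* Sketch: create storage for sock_fd only if none exists yet.
     * With BPF_NOEXIST the call fails with -EEXIST if the socket
     * already has storage in this map.
     */
    static int init_storage(int map_fd, int sock_fd)
    {
            struct my_storage init = {};

            return bpf_map_update_elem(map_fd, &sock_fd, &init, BPF_NOEXIST);
    }
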
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    int bpf_map_lookup_elem(int map_fd, const void *key, void *value)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. Returns ``0`` on success, or negative error in case of failure.
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c

    int bpf_map_delete_elem(int map_fd, const void *key)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf
function. Returns ``0`` on success, or negative error in case of failure.
Examples
========
Kernel BPF
----------
This snippet shows how to declare socket-local storage in a BPF program:
.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_SK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct my_storage);
    } socket_storage SEC(".maps");

This snippet shows how to retrieve socket-local storage in a BPF program:
.. code-block:: c

    SEC("sockops")
    int _sockops(struct bpf_sock_ops *ctx)
    {
            struct my_storage *storage;
            struct bpf_sock *sk;

            sk = ctx->sk;
            if (!sk)
                    return 1;

            storage = bpf_sk_storage_get(&socket_storage, sk, 0,
                                         BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (!storage)
                    return 1;

            /* Use 'storage' here */

            return 1;
    }

Please see the ``tools/testing/selftests/bpf`` directory for functional
examples.
References
==========
https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
===================
BPF_MAP_TYPE_XSKMAP
===================
.. note::
- ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18
The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for XDP BPF helper
call ``bpf_redirect_map()`` and ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'.
This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of
address family in the kernel that allows redirection of frames from a driver to
user space without having to traverse the full network stack. An AF_XDP socket
binds to a single netdev queue. A mapping of XSKs to queues is shown below:
.. code-block:: none

     +---------------------------------------------------+
     |     xsk A      |     xsk B       |      xsk C     |<---+ User space
     =========================================================|==========
     |    Queue 0     |     Queue 1     |     Queue 2    |    | Kernel
     +---------------------------------------------------+    |
     |                  Netdev eth0                      |    |
     +---------------------------------------------------+    |
     |            +=============+                        |    |
     |            | key |  xsk  |                        |    |
     |  +---------+ +=============+                      |    |
     |  |         | | 0   | xsk A |                      |    |
     |  |         | +-------------+                      |    |
     |  |         | | 1   | xsk B |                      |    |
     |  | BPF     |-- redirect -->+-------------+-------------+
     |  | prog    | | 2   | xsk C |                      |
     |  |         | +-------------+                      |
     |  |         |                                      |
     |  +---------+                                      |
     |                                                   |
     +---------------------------------------------------+

.. note::
An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only*
accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect
from a <netdev/queue_id> other than what the socket is bound to, the frame will
not be received on the socket.
Typically an XSKMAP is created per netdev. This map contains an array of XSK File
Descriptors (FDs). The number of array elements is typically set or adjusted using
the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number
of queues supported by the netdev.
.. note::
Both the map key and map value size must be 4 bytes.
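Creating an XSKMAP sized to the netdev's queue count can be sketched with
libbpf's low-level API; this is a minimal sketch, where ``num_queues`` is an
assumption that would normally be queried from the driver (for example via
``ethtool -l``):

.. code-block:: c

    /* Sketch: create an XSKMAP with one slot per netdev queue.
     * Key and value sizes must both be 4 bytes for this map type.
     */
    static int create_xsks_map(__u32 num_queues)
    {
            return bpf_map_create(BPF_MAP_TYPE_XSKMAP, "xsks_map",
                                  sizeof(__u32), /* key size */
                                  sizeof(__u32), /* value size */
                                  num_queues,    /* max entries */
                                  NULL);         /* create options */
    }
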
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c

    long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)

Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs
for sockets attached to a netdev's queues.
.. note::
If the map is empty at an index, the packet is dropped. This means that it is
necessary to have an XDP program loaded with at least one XSK in the
XSKMAP to be able to get any traffic to user space through the socket.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c

    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

XSK entry references of type ``struct xdp_sock *`` can be retrieved using the
``bpf_map_lookup_elem()`` helper.
User space
----------
.. note::
XSK entries can only be updated/deleted from user space and not from
a BPF program. Trying to call these functions from a kernel BPF program will
result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c

    int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)

XSK entries can be added or updated using the ``bpf_map_update_elem()``
helper. The ``key`` parameter is the queue_id of the queue the XSK is
attached to, and the ``value`` parameter is the FD of that socket.
Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the
associated ``struct xdp_sock`` instance.
The ``flags`` argument can be one of the following:

- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c

    int bpf_map_lookup_elem(int fd, const void *key, void *value)

Returns ``struct xdp_sock *`` or negative error in case of failure.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c

    int bpf_map_delete_elem(int fd, const void *key)

XSK entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
.. note::
When `libxdp`_ deletes an XSK it also removes the associated socket
entry from the XSKMAP.
Examples
========
Kernel
------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called
``xsks_map`` and how to redirect packets to an XSK.
.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_XSKMAP);
            __type(key, __u32);
            __type(value, __u32);
            __uint(max_entries, 64);
    } xsks_map SEC(".maps");


    SEC("xdp")
    int xsk_redir_prog(struct xdp_md *ctx)
    {
            __u32 index = ctx->rx_queue_index;

            if (bpf_map_lookup_elem(&xsks_map, &index))
                    return bpf_redirect_map(&xsks_map, index, 0);
            return XDP_PASS;
    }

User space
----------
The following code snippet shows how to update an XSKMAP with an XSK entry.
.. code-block:: c

    int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd)
    {
            int ret;

            ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0);
            if (ret < 0)
                    fprintf(stderr, "Failed to update xsks_map: %s\n",
                            strerror(errno));

            return ret;
    }

For an example of how to create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
For a detailed explanation of the AF_XDP interface please see:
- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.
.. note::
The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_.
.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html
.. _bpf-examples: https://github.com/xdp-project/bpf-examples
.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets
