Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, considerably improving the
performance of complex queue discipline configurations
- Add inet drop monitor support
- A few GRO performance improvements
- Add infrastructure for atomic dev stats, addressing long standing
data races
- De-duplicate common code between OVS and conntrack offloading
infrastructure
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up the
workload with the number of available CPUs
- Add the helper support for connection-tracking OVS offload
BPF:
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF
- Make cgroup local storage available to non-cgroup attached BPF
programs
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers
- A relevant bunch of BPF verifier fixes and improvements
- Veristat tool improvements to support custom filtering, sorting,
and replay of results
- Add LLVM disassembler as default library for dumping JITed code
- Lots of new BPF documentation for various BPF maps
- Add bpf_rcu_read_{,un}lock() support for sleepable programs
- Add RCU grace period chaining to BPF to wait for the completion of
access from both sleepable and non-sleepable BPF programs
- Add support storing struct task_struct objects as kptrs in maps
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions
Protocols:
- TCP: implement Protective Load Balancing across switch links
- TCP: allow dynamically disabling TCP-MD5 static key, reverting back
to fast[er]-path
- UDP: Introduce optional per-netns hash lookup table
- IPv6: simplify and cleanup sockets disposal
- Netlink: support different type policies for each generic netlink
operation
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support
- MPTCP: add netlink notification support for listener sockets events
- SCTP: add VRF support, allowing sctp sockets binding to VRF devices
- Add bridging MAC Authentication Bypass (MAB) support
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios
- More work for Wi-Fi 7 support, comprising conversion of all the
existing drivers to internal TX queue usage
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading
- IPSec: extended ack support for more descriptive XFRM error
reporting
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, considerably reducing the
required locking
- IEEE 802154: synchronous send frame and extended filtering support,
initial support for scanning available 15.4 networks
- Tun: bump the link speed from 10Mbps to 10Gbps
- Tun/VirtioNet: implement UDP segmentation offload support
Driver API:
- PHY/SFP: improve power level switching between standard level 1 and
the higher power levels
- New API for netdev <-> devlink_port linkage
- PTP: convert existing drivers to new frequency adjustment
implementation
- DSA: add support for rx offloading
- Autoload DSA tagging driver when dynamically changing protocol
- Add new PCP and APPTRUST attributes to Data Center Bridging
- Add configuration support for 800Gbps link speed
- Add devlink port function attribute to enable/disable RoCE and
migratable
- Extend devlink-rate to support strict priority and weighted fair
queuing
- Add devlink support to directly reading from region memory
- New device tree helper to fetch MAC address from nvmem
- New big TCP helper to simplify temporary header stripping
New hardware / drivers:
- Ethernet:
- Marvell Octeon CNF95N and CN10KB Ethernet Switches
- Marvell Prestera AC5X Ethernet Switch
- WangXun 10 Gigabit NIC
- Motorcomm yt8521 Gigabit Ethernet
- Microchip ksz9563 Gigabit Ethernet Switch
- Microsoft Azure Network Adapter
- Linux Automation 10Base-T1L adapter
- PHY:
- Aquantia AQR112 and AQR412
- Motorcomm YT8531S
- PTP:
- Orolia ART-CARD
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets
- Realtek RTL8852BE and RTL8723DS
- Cypress CYW4373A0 WiFi + Bluetooth combo device
Drivers:
- CAN:
- gs_usb: bus error reporting support
- kvaser_usb: listen only and bus error reporting support
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping
- implement devlink-rate support
- support direct read from memory
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate
- Support for enhanced events compression
- extend H/W offload packet manipulation capabilities
- implement IPSec packet offload mode
- nVidia/Mellanox (mlx4):
- better big TCP support
- Netronome Ethernet NICs (nfp):
- IPsec offload support
- add support for multicast filter
- Broadcom:
- RSS and PTP support improvements
- AMD/SolarFlare:
- netlink extended ack improvements
- add basic flower matches to offload, and related stats
- Virtual NICs:
- ibmvnic: introduce affinity hint support
- small / embedded:
- Freescale fec: add initial XDP support
- Marvell mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
- TI am65-cpsw: add suspend/resume support
- Mediatek MT7986: add RX Wireless Ethernet Dispatch support
- Realtek 8169: enable GRO software interrupt coalescing by
default
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support
- add ip6gre support
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support
- enable flow offload support
- Renesas:
- add rswitch R-Car Gen4 gPTP support
- Microchip (lan966x):
- add full XDP support
- add TC H/W offload via VCAP
- enable PTP on bridge interfaces
- Microchip (ksz8):
- add MTU support for KSZ8 series
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support
- add ack signal support
- enable coredump support
- remain_on_channel support
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support
- RealTek WiFi (rtw89):
- new dynamic header firmware format support
- wake-over-WLAN support"
* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
ipvs: fix type warning in do_div() on 32 bit
net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
net: ipa: add IPA v4.7 support
dt-bindings: net: qcom,ipa: Add SM6350 compatible
bnxt: Use generic HBH removal helper in tx path
IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
selftests: forwarding: Add bridge MDB test
selftests: forwarding: Rename bridge_mdb test
bridge: mcast: Support replacement of MDB port group entries
bridge: mcast: Allow user space to specify MDB entry routing protocol
bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
bridge: mcast: Add support for (*, G) with a source list and filter mode
bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
bridge: mcast: Add a flag for user installed source entries
bridge: mcast: Expose __br_multicast_del_group_src()
bridge: mcast: Expose br_multicast_new_group_src()
bridge: mcast: Add a centralized error path
bridge: mcast: Place netlink policy before validation functions
bridge: mcast: Split (*, G) and (S, G) addition into different functions
bridge: mcast: Do not derive entry type from its filter mode
...

@@ -298,3 +298,48 @@ A: NO.

The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.

Q: What is the compatibility story for special BPF types in map values?
-----------------------------------------------------------------------
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows the use of helpers for
such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the
kernel preserve backwards compatibility for these features?

A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
NO, but see below.

For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI.

For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr and __kptr_ref tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr
fields and the bpf_kptr_xchg() helper will continue to be supported across kernel
releases for the supported types.

For any other supported struct type, unless explicitly stated in this document
and added to the bpf.h UAPI header, such types can and will arbitrarily change
their size, type, and alignment, or any other user visible API or ABI detail
across kernel releases. Users must adapt their BPF programs to such changes and
update them to make sure their programs continue to work correctly.

NOTE: The BPF subsystem specifically reserves the 'bpf\_' prefix for type names,
in order to introduce more special fields in the future. Hence, user programs
must avoid defining types with the 'bpf\_' prefix, so as not to be broken in
future releases. In other words, no backwards compatibility is guaranteed if one
uses a type in BTF with the 'bpf\_' prefix.

Q: What is the compatibility story for special BPF types in allocated objects?
------------------------------------------------------------------------------
Q: Same as above, but for allocated objects (i.e. objects allocated using
bpf_obj_new for user defined types). Will the kernel preserve backwards
compatibility for these features?

A: NO.

Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).

@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**

Submitting patches
==================

Q: How do I run BPF CI on my changes before sending them out for review?
------------------------------------------------------------------------
A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
While GitHub also provides a CLI that can be used to accomplish the same
results, here we focus on the UI based workflow.

The following steps lay out how to start a CI run for your patches:

- Create a fork of the aforementioned repository in your own account (one time
  action)

- Clone the fork locally, check out a new branch tracking either the bpf-next
  or bpf branch, and apply your to-be-tested patches on top of it

- Push the local branch to your fork and create a pull request against
  kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively

Shortly after the pull request has been created, the CI workflow will run. Note
that capacity is shared with patches submitted upstream being checked, so
depending on utilization the run can take a while to finish.

Note furthermore that both base branches (bpf-next_base and bpf_base) will be
updated as patches are pushed to the respective upstream branches they track. As
such, an automatic rebase of your patch set will be attempted as well. This
behavior can result in a CI run being aborted and restarted with the new
baseline.

Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the bpf kernel mailing list:

Documentation/bpf/bpf_iterators.rst (new file, 485 lines)
@@ -0,0 +1,485 @@

=============
BPF Iterators
=============


----------
Motivation
----------

There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` filesystem. For example, ``cat /proc/net/tcp6``
dumps all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all
netlink sockets in the system. However, their output format tends to be fixed,
and if users want more information about these sockets, they have to patch the
kernel, which often takes time to publish upstream and release. The same is true
for popular tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_
where any additional information needs a kernel patch.

To solve this problem, the `drgn
<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
dig out the kernel data with no kernel change. However, the main drawback for
drgn is performance, as it cannot do pointer tracing inside the kernel. In
addition, drgn cannot validate a pointer value and may read invalid data if the
pointer becomes invalid inside the kernel.

The BPF iterator solves the above problem by providing flexibility on what data
(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
data object.

----------------------
How BPF Iterators Work
----------------------

A BPF iterator is a type of BPF program that allows users to iterate over
specific types of kernel objects. Unlike traditional BPF tracing programs that
allow users to define callbacks that are invoked at particular points of
execution in the kernel, BPF iterators allow users to define callbacks that
should be executed for every entry in a variety of kernel data structures.

For example, users can define a BPF iterator that iterates over every task on
the system and dumps the total amount of CPU runtime currently used by each of
them. Another BPF task iterator may instead dump the cgroup information for each
task. Such flexibility is the core value of BPF iterators.

A BPF program is always loaded into the kernel at the behest of a user space
process. A user space process loads a BPF program by opening and initializing
the program skeleton as required and then invoking a syscall to have the BPF
program verified and loaded by the kernel.

In traditional tracing programs, a program is activated by having user space
obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
activated, the program callback will be invoked whenever the tracepoint is
triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
program is obtained using ``bpf_link_create()``, and the program callback is
invoked by issuing system calls from user space.

Next, let us see how you can use the iterators to iterate on kernel objects and
read data.

------------------------
How to Use BPF iterators
------------------------

BPF selftests are a great resource to illustrate how to use the iterators. In
this section, we’ll walk through a BPF selftest which shows how to load and use
a BPF iterator program. To begin, we’ll look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, we’ll look at a BPF program that runs in kernel space.

Loading a BPF iterator in the kernel from user space typically involves the
following steps:

* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
  has verified and loaded the program, it returns a file descriptor (fd) to user
  space.
* Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
  with the BPF program file descriptor received from the kernel.
* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
  ``bpf_iter_create()`` with the ``bpf_link`` fd received from the previous
  step.
* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
  available.
* Close the iterator fd using ``close(bpf_iter_fd)``.
* If you need to reread the data, get a new ``bpf_iter_fd`` and do the read
  again.

The following are a few examples of selftest BPF iterator programs:

* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_

Let us look at ``bpf_iter_task_file.c``, which runs in kernel space.

Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
iterator.

::

  struct bpf_iter__task_file {
          union {
                  struct bpf_iter_meta *meta;
          };
          union {
                  struct task_struct *task;
          };
          u32 fd;
          union {
                  struct file *file;
          };
  };

In the above code, the field 'meta' contains the metadata, which is the same for
all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away when the BPF program runs.

Here is a snippet from the ``bpf_iter_task_file.c`` file:

::

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct task_struct *task = ctx->task;
          struct file *file = ctx->file;
          __u32 fd = ctx->fd;

          if (task == NULL || file == NULL)
                  return 0;

          if (ctx->meta->seq_num == 0) {
                  count = 0;
                  BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
          }

          if (tgid == task->tgid && task->tgid != task->pid)
                  count++;

          if (last_tgid != task->tgid) {
                  last_tgid = task->tgid;
                  unique_tgid_count++;
          }

          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                         (long)file->f_op);
          return 0;
  }

(The global variables ``count``, ``tgid``, ``last_tgid`` and
``unique_tgid_count`` are defined elsewhere in the selftest and elided from this
snippet.)

In the above example, the section name ``SEC("iter/task_file")`` indicates that
the program is a BPF iterator program to iterate all files from all tasks. The
context of the program is the ``bpf_iter__task_file`` struct.

The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF
program can export data to user space using a variety of BPF helper functions.
You can use either ``bpf_seq_printf()`` (and the BPF_SEQ_PRINTF helper macro) or
the ``bpf_seq_write()`` function based on whether you need formatted output or
just binary data, respectively. For binary-encoded data, the user space
applications can process the data from ``bpf_seq_write()`` as needed. For the
formatted data, you can use ``cat <path>`` to print the results similar to ``cat
/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
use ``rm -f <path>`` to remove the pinned iterator.

For example, you can use the following command to create a BPF iterator from the
``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
path:

::

  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route

And then print out the results using the following command:

::

  $ cat /sys/fs/bpf/my_route


-------------------------------------------------------
Implement Kernel Support for BPF Iterator Program Types
-------------------------------------------------------

To implement a BPF iterator in the kernel, the developer must make a one-time
change to the following key data structure defined in the `bpf.h
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
file.

::

  struct bpf_iter_reg {
          const char *target;
          bpf_iter_attach_target_t attach_target;
          bpf_iter_detach_target_t detach_target;
          bpf_iter_show_fdinfo_t show_fdinfo;
          bpf_iter_fill_link_info_t fill_link_info;
          bpf_iter_get_func_proto_t get_func_proto;
          u32 ctx_arg_info_size;
          u32 feature;
          struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
          const struct bpf_iter_seq_info *seq_info;
  };

After filling the data structure fields, call ``bpf_iter_reg_target()`` to
register the iterator with the main BPF iterator subsystem.

The following is the breakdown of each field in struct ``bpf_iter_reg``.

.. list-table::
   :widths: 25 50
   :header-rows: 1

   * - Fields
     - Description
   * - target
     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
       ``bpf_map_elem``. The name should be different from other ``bpf_iter``
       target names in the kernel.
   * - attach_target and detach_target
     - Allows for target specific ``link_create`` action since some targets
       may need special processing. Called during the user space link_create
       stage.
   * - show_fdinfo and fill_link_info
     - Called to fill target specific information when the user tries to get
       link info associated with the iterator.
   * - get_func_proto
     - Permits a BPF iterator to access BPF helpers specific to the iterator.
   * - ctx_arg_info_size and ctx_arg_info
     - Specifies the verifier states for BPF program arguments associated with
       the bpf iterator.
   * - feature
     - Specifies certain action requests in the kernel BPF iterator
       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
       that the kernel function cond_resched() is called to avoid other kernel
       subsystem (e.g., rcu) misbehaving.
   * - seq_info
     - Specifies the set of seq operations for the iterator and helper
       functions.


`Click here
<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
to see an implementation of the ``task_vma`` BPF iterator in the kernel.

---------------------------------
Parameterizing BPF Task Iterators
---------------------------------

By default, BPF iterators walk through all the objects of the specified types
(processes, cgroups, maps, etc.) across the entire system to read relevant
kernel data. But often, there are cases where we only care about a much smaller
subset of iterable kernel objects, such as only iterating tasks within a
specific process. Therefore, BPF iterator programs support filtering out objects
from iteration by allowing user space to configure the iterator program when it
is attached.

--------------------------
BPF Task Iterator Program
--------------------------

The following code is a BPF iterator program to print files and task information
through the ``seq_file`` of the iterator. It is a standard BPF iterator program
that visits every file of an iterator. We will use this BPF program in our
example later.

::

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct task_struct *task = ctx->task;
          struct file *file = ctx->file;
          __u32 fd = ctx->fd;

          if (task == NULL || file == NULL)
                  return 0;

          if (ctx->meta->seq_num == 0) {
                  BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
          }

          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                         (long)file->f_op);
          return 0;
  }

----------------------------------------
Creating a File Iterator with Parameters
----------------------------------------

Now, let us look at how to create an iterator that includes only files of a
process.

First, fill the ``bpf_iter_attach_opts`` struct as shown below:

::

  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  union bpf_iter_link_info linfo;
  memset(&linfo, 0, sizeof(linfo));
  linfo.task.pid = getpid();
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);

``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. In this example,
``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
separate file descriptor table. In most circumstances, all process threads share
a single file descriptor table.

Now, in the userspace program, pass the pointer to the struct to
``bpf_program__attach_iter()``:

::

  link = bpf_program__attach_iter(prog, &opts);
  iter_fd = bpf_iter_create(bpf_link__fd(link));

If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (in the namespace, actually). It is the same as passing NULL as the
second argument to ``bpf_program__attach_iter()``.

The whole program looks like the following code:

::

  #include <stdio.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include "bpf_iter_task_ex.skel.h"

  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
  {
          struct bpf_link *link;
          char buf[16] = {};
          int iter_fd = -1, len;
          int ret = 0;

          link = bpf_program__attach_iter(prog, opts);
          if (!link) {
                  fprintf(stderr, "bpf_program__attach_iter() fails\n");
                  return -1;
          }
          iter_fd = bpf_iter_create(bpf_link__fd(link));
          if (iter_fd < 0) {
                  fprintf(stderr, "bpf_iter_create() fails\n");
                  ret = -1;
                  goto free_link;
          }
          /* do not check contents, but ensure read() ends without error */
          while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
                  buf[len] = 0;
                  printf("%s", buf);
          }
          printf("\n");
  free_link:
          if (iter_fd >= 0)
                  close(iter_fd);
          bpf_link__destroy(link);
          return ret;
  }

  static void test_task_file(void)
  {
          LIBBPF_OPTS(bpf_iter_attach_opts, opts);
          struct bpf_iter_task_ex *skel;
          union bpf_iter_link_info linfo;

          skel = bpf_iter_task_ex__open_and_load();
          if (skel == NULL)
                  return;
          memset(&linfo, 0, sizeof(linfo));
          linfo.task.pid = getpid();
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);
          printf("PID %d\n", getpid());
          do_read_opts(skel->progs.dump_task_file, &opts);
          bpf_iter_task_ex__destroy(skel);
  }

  int main(int argc, const char * const * argv)
  {
          test_task_file();
          return 0;
  }

The following lines are the output of the program:

::

  PID 1859

      tgid      pid       fd      file
      1859     1859        0 ffffffff82270aa0
      1859     1859        1 ffffffff82270aa0
      1859     1859        2 ffffffff82270aa0
      1859     1859        3 ffffffff82272980
      1859     1859        4 ffffffff8225e120
      1859     1859        5 ffffffff82255120
      1859     1859        6 ffffffff82254f00
      1859     1859        7 ffffffff82254d80
      1859     1859        8 ffffffff8225abe0

------------------
|
||||
Without Parameters
|
||||
------------------
|
||||
|
||||
Let us look at how a BPF iterator without parameters skips files of other
|
||||
processes in the system. In this case, the BPF program has to check the pid or
|
||||
the tid of tasks, or it will receive every opened file in the system (in the
|
||||
current *pid* namespace, actually). So, we usually add a global variable in the
|
||||
BPF program to pass a *pid* to the BPF program.
|
||||
|
||||
The BPF program would look like the following block.
|
||||
|
||||
::
|
||||
|
||||
......
|
||||
int target_pid = 0;
|
||||
|
||||
SEC("iter/task_file")
|
||||
int dump_task_file(struct bpf_iter__task_file *ctx)
|
||||
{
|
||||
......
|
||||
if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
|
||||
return 0;
|
||||
BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
|
||||
(long)file->f_op);
|
||||
return 0;
|
||||
}
|
||||
|
||||
The user space program would look like the following block:
|
||||
|
||||
::
|
||||
|
||||
  ......
  static void test_task_file(void)
  {
          ......
          skel = bpf_iter_task_ex__open_and_load();
          if (skel == NULL)
                  return;
          skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
          memset(&linfo, 0, sizeof(linfo));
          linfo.task.pid = getpid();
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);
          ......
  }

``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID to skip opened files of other
processes in the BPF program. When you parametrize a BPF iterator, the iterator
calls the BPF program fewer times, which can save significant resources.

---------------------------
Parametrizing VMA Iterators
---------------------------

By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
Here, using *tid* makes no difference from using *pid*.
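
A minimal VMA iterator program might look like the following block. This is a
sketch only: the program name and output format are illustrative, but
``bpf_iter__task_vma`` is the context type VMA iterator programs receive.

::

  SEC("iter/task_vma")
  int dump_task_vma(struct bpf_iter__task_vma *ctx)
  {
          struct task_struct *task = ctx->task;
          struct vm_area_struct *vma = ctx->vma;
          struct seq_file *seq = ctx->meta->seq;

          if (task == NULL || vma == NULL)
                  return 0;
          /* Print one line per VMA: the owning tgid and the VMA's address range. */
          BPF_SEQ_PRINTF(seq, "%8d %lx-%lx\n", task->tgid, vma->vm_start, vma->vm_end);
          return 0;
  }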

----------------------------
Parametrizing Task Iterators
----------------------------

A BPF task iterator with *pid* includes all tasks (threads) of a process. The
BPF program receives these tasks one after another. You can specify a BPF task
iterator with the *tid* parameter to include only the tasks that match the given
*tid*.
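
In user space, selecting a single thread is a small change from the *pid*
example shown earlier. This is a sketch; note that ``gettid()`` may require
``_GNU_SOURCE`` or a ``syscall(SYS_gettid)`` wrapper on older glibc versions.

::

  memset(&linfo, 0, sizeof(linfo));
  linfo.task.tid = gettid();   /* match only this thread, not the whole process */
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);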
@@ -1062,4 +1062,9 @@ format.::

7. Testing
==========

The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
provides an extensive set of BTF-related tests.

.. Links
.. _tools/testing/selftests/bpf/prog_tests/btf.c:
   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c

@@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.

   maps
   bpf_prog_run
   classic_vs_extended.rst
   bpf_iterators
   bpf_licensing
   test_debug
   clang-notes
   linux-notes
   other
   redirect

.. only:: subproject and html

@@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)

``BPF_XOR | BPF_K | BPF_ALU`` means::

  dst_reg = (u32) dst_reg ^ (u32) imm32

``BPF_XOR | BPF_K | BPF_ALU64`` means::

  dst_reg = dst_reg ^ imm32


Byte swap instructions

@@ -72,6 +72,30 @@ argument as its size. By default, without __sz annotation, the size of the type
of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
pointer.

2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.

An example is given below::

        void *bpf_obj_new(u32 local_type_id__k, ...)
        {
        ...
        }

Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and returns a sized pointer to it. Each type ID has
a distinct size, hence it is crucial to treat each such call as distinct when
values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function
@@ -137,22 +161,20 @@ KF_ACQUIRE and KF_RET_NULL flags.
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer).

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

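For illustration, a kfunc is given this flag at registration time. The line
below mirrors how bpf_task_acquire() is registered in kernel/bpf/helpers.c; it
is shown here only as an example of the flag's usage:

.. code-block:: c

        BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
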
2.4.6 KF_SLEEPABLE flag
-----------------------
@@ -169,6 +191,15 @@ rebooting or panicking. Due to this additional restrictions apply to these
calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
added later.

2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs which have an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0 and the kfunc must take this
into consideration.
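
As a sketch of how a program reaches such a kfunc, an RCU-protected pointer is
typically obtained inside a bpf_rcu_read_lock() region. The lock pair is real;
the field walk and the commented-out call are illustrative only:

.. code-block:: c

        bpf_rcu_read_lock();
        /* task->real_parent is RCU-protected; inside the lock region the
         * verifier marks it MEM_RCU, so it may be passed to a KF_RCU kfunc
         * even though its reference count may be zero.
         */
        parent = task->real_parent;
        /* ... call a KF_RCU kfunc with 'parent' here ... */
        bpf_rcu_read_unlock();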

2.5 Registering the kfuncs
--------------------------

@@ -191,3 +222,201 @@ type. An example is shown below::

                return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
        }
        late_initcall(init_subsystem);

3. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

3.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

        /**
         * A trivial example tracepoint program that shows how to
         * acquire and release a struct task_struct * pointer.
         */
        SEC("tp_btf/task_newtask")
        int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
        {
                struct task_struct *acquired;

                acquired = bpf_task_acquire(task);

                /*
                 * In a typical program you'd do something like store
                 * the task in a map, and the map will automatically
                 * release it later. Here, we release it manually.
                 */
                bpf_task_release(acquired);
                return 0;
        }
----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

        SEC("tp_btf/task_newtask")
        int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
        {
                struct task_struct *lookup;

                lookup = bpf_task_from_pid(task->pid);
                if (!lookup)
                        /* A task should always be found, as %task is a tracepoint arg. */
                        return -ENOENT;

                if (lookup->pid != task->pid) {
                        /* bpf_task_from_pid() looks up the task via its
                         * globally-unique pid from the init_pid_ns. Thus,
                         * the pid of the lookup task should always be the
                         * same as the input task.
                         */
                        bpf_task_release(lookup);
                        return -EINVAL;
                }

                /* bpf_task_from_pid() returns an acquired reference,
                 * so it must be dropped before returning from the
                 * tracepoint handler.
                 */
                bpf_task_release(lookup);
                return 0;
        }
3.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c
        /* struct containing the struct cgroup kptr which is actually stored in the map. */
        struct __cgroups_kfunc_map_value {
                struct cgroup __kptr_ref * cgroup;
        };

        /* The map containing struct __cgroups_kfunc_map_value entries. */
        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __type(key, int);
                __type(value, struct __cgroups_kfunc_map_value);
                __uint(max_entries, 1);
        } __cgroups_kfunc_map SEC(".maps");

        /* ... */

        /**
         * A simple example tracepoint program showing how a
         * struct cgroup kptr that is stored in a map can
         * be acquired using the bpf_cgroup_kptr_get() kfunc.
         */
        SEC("tp_btf/cgroup_mkdir")
        int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
        {
                struct cgroup *kptr;
                struct __cgroups_kfunc_map_value *v;
                s32 id = cgrp->self.id;

                /* Assume a cgroup kptr was previously stored in the map. */
                v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
                if (!v)
                        return -ENOENT;

                /* Acquire a reference to the cgroup kptr that's already stored in the map. */
                kptr = bpf_cgroup_kptr_get(&v->cgroup);
                if (!kptr)
                        /* If no cgroup was present in the map, it's because
                         * we're racing with another CPU that removed it with
                         * bpf_kptr_xchg() between the bpf_map_lookup_elem()
                         * above, and our call to bpf_cgroup_kptr_get().
                         * bpf_cgroup_kptr_get() internally safely handles this
                         * race, and will return NULL if the cgroup is no longer
                         * present in the map by the time we invoke the kfunc.
                         */
                        return -EBUSY;

                /* Free the reference we just took above. Note that the
                 * original struct cgroup kptr is still in the map. It will
                 * be freed either at a later time if another context deletes
                 * it from the map, or automatically by the BPF subsystem if
                 * it's still present when the map is destroyed.
                 */
                bpf_cgroup_release(kptr);

                return 0;
        }

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

        /**
         * Simple tracepoint example that illustrates how a cgroup's
         * ancestor can be accessed using bpf_cgroup_ancestor().
         */
        SEC("tp_btf/cgroup_mkdir")
        int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
        {
                struct cgroup *parent;

                /* The parent cgroup resides at the level before the current cgroup's level. */
                parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
                if (!parent)
                        return -ENOENT;

                bpf_printk("Parent id is %d", parent->self.id);

                /* Release the parent cgroup that was acquired above. */
                bpf_cgroup_release(parent);
                return 0;
        }

@@ -1,5 +1,7 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)

.. _libbpf:

libbpf
======

@@ -7,6 +9,7 @@ libbpf
   :maxdepth: 1

   API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
   program_types
   libbpf_naming_convention
   libbpf_build

203
Documentation/bpf/libbpf/program_types.rst
Normal file
@@ -0,0 +1,203 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)

.. _program_types_and_elf:

Program Types and ELF Sections
==============================

The table below lists the program types, their attach types where relevant and the ELF section
names supported by libbpf for them. The ELF section names follow these rules:

- ``type`` is an exact match, e.g. ``SEC("socket")``
- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")``
  with a '``/``' separator between ``type`` and ``extras``.

When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The
format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")``
for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are
described in more detail in the footnotes.

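For example, the two naming rules above correspond to program definitions like
these (the function names are illustrative):

.. code-block:: c

   SEC("socket")                               /* exact match: BPF_PROG_TYPE_SOCKET_FILTER */
   int socket_prog(struct __sk_buff *skb)
   {
           return 0;
   }

   SEC("tracepoint/syscalls/sys_enter_openat") /* type + extras: the auto-attach target */
   int tp_prog(void *ctx)
   {
           return 0;
   }
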
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| Program Type                              | Attach Type                            | ELF Section Name                 | Sleepable |
+===========================================+========================================+==================================+===========+
| ``BPF_PROG_TYPE_CGROUP_DEVICE``           | ``BPF_CGROUP_DEVICE``                  | ``cgroup/dev``                   |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SKB``              |                                        | ``cgroup/skb``                   |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET_EGRESS``             | ``cgroup_skb/egress``            |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET_INGRESS``            | ``cgroup_skb/ingress``           |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCKOPT``          | ``BPF_CGROUP_GETSOCKOPT``              | ``cgroup/getsockopt``            |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_SETSOCKOPT``              | ``cgroup/setsockopt``            |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR``        | ``BPF_CGROUP_INET4_BIND``              | ``cgroup/bind4``                 |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET4_CONNECT``           | ``cgroup/connect4``              |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET4_GETPEERNAME``       | ``cgroup/getpeername4``          |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET4_GETSOCKNAME``       | ``cgroup/getsockname4``          |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET6_BIND``              | ``cgroup/bind6``                 |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET6_CONNECT``           | ``cgroup/connect6``              |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET6_GETPEERNAME``       | ``cgroup/getpeername6``          |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET6_GETSOCKNAME``       | ``cgroup/getsockname6``          |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_UDP4_RECVMSG``            | ``cgroup/recvmsg4``              |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_UDP4_SENDMSG``            | ``cgroup/sendmsg4``              |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_UDP6_RECVMSG``            | ``cgroup/recvmsg6``              |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_UDP6_SENDMSG``            | ``cgroup/sendmsg6``              |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK``             | ``BPF_CGROUP_INET4_POST_BIND``         | ``cgroup/post_bind4``            |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET6_POST_BIND``         | ``cgroup/post_bind6``            |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET_SOCK_CREATE``        | ``cgroup/sock_create``           |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``cgroup/sock``                  |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_CGROUP_INET_SOCK_RELEASE``       | ``cgroup/sock_release``          |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SYSCTL``           | ``BPF_CGROUP_SYSCTL``                  | ``cgroup/sysctl``                |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_EXT``                     |                                        | ``freplace+`` [#fentry]_         |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_FLOW_DISSECTOR``          | ``BPF_FLOW_DISSECTOR``                 | ``flow_dissector``               |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_KPROBE``                  |                                        | ``kprobe+`` [#kprobe]_           |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``kretprobe+`` [#kprobe]_        |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``ksyscall+`` [#ksyscall]_       |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``kretsyscall+`` [#ksyscall]_    |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``uprobe+`` [#uprobe]_           |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``uprobe.s+`` [#uprobe]_         | Yes       |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``uretprobe+`` [#uprobe]_        |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``uretprobe.s+`` [#uprobe]_      | Yes       |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``usdt+`` [#usdt]_               |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_TRACE_KPROBE_MULTI``             | ``kprobe.multi+`` [#kpmulti]_    |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``kretprobe.multi+`` [#kpmulti]_ |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LIRC_MODE2``              | ``BPF_LIRC_MODE2``                     | ``lirc_mode2``                   |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LSM``                     | ``BPF_LSM_CGROUP``                     | ``lsm_cgroup+``                  |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_LSM_MAC``                        | ``lsm+`` [#lsm]_                 |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``lsm.s+`` [#lsm]_               | Yes       |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_IN``                  |                                        | ``lwt_in``                       |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_OUT``                 |                                        | ``lwt_out``                      |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_SEG6LOCAL``           |                                        | ``lwt_seg6local``                |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_XMIT``                |                                        | ``lwt_xmit``                     |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_PERF_EVENT``              |                                        | ``perf_event``                   |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` |                                        | ``raw_tp.w+`` [#rawtp]_          |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``raw_tracepoint.w+``            |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT``          |                                        | ``raw_tp+`` [#rawtp]_            |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``raw_tracepoint+``              |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_ACT``               |                                        | ``action``                       |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_CLS``               |                                        | ``classifier``                   |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``tc``                           |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_LOOKUP``               | ``BPF_SK_LOOKUP``                      | ``sk_lookup``                    |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_MSG``                  | ``BPF_SK_MSG_VERDICT``                 | ``sk_msg``                       |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_REUSEPORT``            | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate``         |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_SK_REUSEPORT_SELECT``            | ``sk_reuseport``                 |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_SKB``                  |                                        | ``sk_skb``                       |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_SK_SKB_STREAM_PARSER``           | ``sk_skb/stream_parser``         |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_SK_SKB_STREAM_VERDICT``          | ``sk_skb/stream_verdict``        |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCKET_FILTER``           |                                        | ``socket``                       |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCK_OPS``                | ``BPF_CGROUP_SOCK_OPS``                | ``sockops``                      |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_STRUCT_OPS``              |                                        | ``struct_ops+``                  |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SYSCALL``                 |                                        | ``syscall``                      | Yes       |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACEPOINT``              |                                        | ``tp+`` [#tp]_                   |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``tracepoint+`` [#tp]_           |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACING``                 | ``BPF_MODIFY_RETURN``                  | ``fmod_ret+`` [#fentry]_         |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``fmod_ret.s+`` [#fentry]_       | Yes       |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_TRACE_FENTRY``                   | ``fentry+`` [#fentry]_           |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``fentry.s+`` [#fentry]_         | Yes       |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_TRACE_FEXIT``                    | ``fexit+`` [#fentry]_            |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``fexit.s+`` [#fentry]_          | Yes       |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_TRACE_ITER``                     | ``iter+`` [#iter]_               |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``iter.s+`` [#iter]_             | Yes       |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_TRACE_RAW_TP``                   | ``tp_btf+`` [#fentry]_           |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_XDP``                     | ``BPF_XDP_CPUMAP``                     | ``xdp.frags/cpumap``             |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``xdp/cpumap``                   |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_XDP_DEVMAP``                     | ``xdp.frags/devmap``             |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``xdp/devmap``                   |           |
+                                           +----------------------------------------+----------------------------------+-----------+
|                                           | ``BPF_XDP``                            | ``xdp.frags``                    |           |
+                                           +                                        +----------------------------------+-----------+
|                                           |                                        | ``xdp``                          |           |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+

.. rubric:: Footnotes

.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``.
.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid
   characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid
   non-negative integer.
.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``.
.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``.
.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``.
.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern``
   supports ``*`` and ``?`` wildcards. Valid characters for pattern are
   ``a-zA-Z0-9_.*?``.
.. [#lsm] The ``lsm`` attachment format is ``lsm[.s]/<hook>``.
.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``.
.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``.
.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``.
262
Documentation/bpf/map_array.rst
Normal file
@@ -0,0 +1,262 @@
.. SPDX-License-Identifier: GPL-2.0-only
|
||||
.. Copyright (C) 2022 Red Hat, Inc.
|
||||
|
||||
================================================
|
||||
BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY
|
||||
================================================
|
||||
|
||||
.. note::
|
||||
- ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19
|
||||
- ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6
|
||||
|
||||
``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array
|
||||
storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is
|
||||
of constant size. The size of the array is defined in ``max_entries`` at
|
||||
creation time. All array elements are pre-allocated and zero initialized when
|
||||
created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each
|
||||
CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value
|
||||
stored can be of any size, however, all array elements are aligned to 8
|
||||
bytes.
|
||||
|
||||
Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by
|
||||
setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and
|
||||
starts on the first page. Sufficient page-sized and page-aligned blocks of
|
||||
memory are allocated to store all array values, starting on the second page,
|
||||
which in some cases will result in over-allocation of memory. The benefit of
|
||||
using this is increased performance and ease of use since userspace programs
|
||||
would not be required to use helper functions to access and mutate data.
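
The page rounding described above can be sketched as plain arithmetic. This is an illustrative calculation only (the function name is made up, and it assumes 4 KiB pages together with the 8-byte element alignment described earlier), not the kernel's exact allocator:

```c
#include <stddef.h>

/* Illustrative: elements are padded to 8 bytes and the data region is
 * rounded up to whole pages, which is where the possible
 * over-allocation comes from. */
static size_t mmapable_data_bytes(size_t value_size, size_t max_entries,
                                  size_t page_size)
{
        size_t elem = (value_size + 7) & ~(size_t)7;       /* 8-byte align */
        size_t total = elem * max_entries;

        return (total + page_size - 1) & ~(page_size - 1); /* page round-up */
}
```

For example, 256 four-byte values pad out to 2048 bytes of data but still consume one full 4 KiB page.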

Usage
=====

Kernel BPF
----------

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper.
This helper returns a pointer to the array element, so to avoid data races
with userspace reading the value, the user must use primitives like
``__sync_fetch_and_add()`` when updating the value in-place.
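
The same GCC/Clang builtin works in ordinary userspace C, which makes its semantics easy to check: it atomically adds to the target and returns the value the target held before the addition. A minimal sketch:

```c
#include <assert.h>

/* __sync_fetch_and_add() atomically adds to *ptr and returns the old
 * value, so concurrent updaters never lose increments. */
static long demo_fetch_and_add(void)
{
        long counter = 40;
        long old = __sync_fetch_and_add(&counter, 2);

        /* old is the pre-update value; counter now holds the sum. */
        assert(old == 40);
        return counter;
}
```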

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)

Array elements can be updated using the ``bpf_map_update_elem()`` helper.

``bpf_map_update_elem()`` returns 0 on success, or a negative error in case of
failure.

Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported.
To clear an array element, you may use ``bpf_map_update_elem()`` to insert a
zero value at that index.

Per CPU Array
-------------

Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs
across different CPUs. To restrict storage to a single CPU, you may use a
``BPF_MAP_TYPE_PERCPU_ARRAY``.

When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and
``bpf_map_lookup_elem()`` helpers automatically access the slot for the current
CPU.

bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)

The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the array
value for a specific CPU. It returns the value on success, or ``NULL`` if no
entry was found or ``cpu`` is invalid.

Concurrency
-----------

Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock``
to synchronize access.

Userspace
---------

Access from userspace uses libbpf APIs with the same names as above, with
the map identified by its ``fd``.

Examples
========

Please see the ``tools/testing/selftests/bpf`` directory for functional
examples. The code samples below demonstrate API usage.

Kernel BPF
----------

This snippet shows how to declare an array in a BPF program.

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __type(key, u32);
            __type(value, long);
            __uint(max_entries, 256);
    } my_map SEC(".maps");

This example BPF program shows how to access an array element.

.. code-block:: c

    int bpf_prog(struct __sk_buff *skb)
    {
            struct iphdr ip;
            int index;
            long *value;

            if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0)
                    return 0;

            index = ip.protocol;
            value = bpf_map_lookup_elem(&my_map, &index);
            if (value)
                    __sync_fetch_and_add(value, skb->len);

            return 0;
    }

Userspace
---------

BPF_MAP_TYPE_ARRAY
~~~~~~~~~~~~~~~~~~

This snippet shows how to create an array, using ``bpf_map_create_opts`` to
set flags.

.. code-block:: c

    #include <bpf/libbpf.h>
    #include <bpf/bpf.h>

    int create_array(void)
    {
            int fd;
            LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);

            fd = bpf_map_create(BPF_MAP_TYPE_ARRAY,
                                "example_array",       /* name */
                                sizeof(__u32),         /* key size */
                                sizeof(long),          /* value size */
                                256,                   /* max entries */
                                &opts);                /* create opts */
            return fd;
    }

This snippet shows how to initialize the elements of an array.

.. code-block:: c

    int initialize_array(int fd)
    {
            __u32 i;
            long value;
            int ret;

            for (i = 0; i < 256; i++) {
                    value = i;
                    ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY);
                    if (ret < 0)
                            return ret;
            }

            return ret;
    }

This snippet shows how to retrieve an element value from an array.

.. code-block:: c

    int lookup(int fd)
    {
            __u32 index = 42;
            long value;
            int ret;

            ret = bpf_map_lookup_elem(fd, &index, &value);
            if (ret < 0)
                    return ret;

            /* use value here */
            assert(value == 42);

            return ret;
    }

BPF_MAP_TYPE_PERCPU_ARRAY
~~~~~~~~~~~~~~~~~~~~~~~~~

This snippet shows how to initialize the elements of a per CPU array.

.. code-block:: c

    int initialize_array(int fd)
    {
            int ncpus = libbpf_num_possible_cpus();
            long values[ncpus];
            __u32 i, j;
            int ret;

            for (i = 0; i < 256; i++) {
                    for (j = 0; j < ncpus; j++)
                            values[j] = i;
                    ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY);
                    if (ret < 0)
                            return ret;
            }

            return ret;
    }

This snippet shows how to access the per CPU elements of an array value.

.. code-block:: c

    int lookup(int fd)
    {
            int ncpus = libbpf_num_possible_cpus();
            __u32 index = 42, j;
            long values[ncpus];
            int ret;

            ret = bpf_map_lookup_elem(fd, &index, &values);
            if (ret < 0)
                    return ret;

            for (j = 0; j < ncpus; j++) {
                    /* Use per CPU value here */
                    assert(values[j] == 42);
            }

            return ret;
    }

Semantics
=========

As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY``
in userspace, each value is an array with ``ncpus`` elements.

When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` cannot be used
for these maps.
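
A common follow-on step after a userspace lookup is to reduce the per-CPU slots to one total; that reduction is plain C (the function name below is illustrative):

```c
#include <stddef.h>

/* Reduce one per-CPU value (an array of ncpus slots, as filled in by a
 * userspace bpf_map_lookup_elem() on a BPF_MAP_TYPE_PERCPU_ARRAY) to a
 * single total. */
static long sum_percpu_value(const long *values, int ncpus)
{
        long total = 0;

        for (int i = 0; i < ncpus; i++)
                total += values[i];
        return total;
}
```
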
Documentation/bpf/map_bloom_filter.rst (new file, 174 lines)
@@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=========================
BPF_MAP_TYPE_BLOOM_FILTER
=========================

.. note::
   - ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16

``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom
filters are a space-efficient probabilistic data structure used to
quickly test whether an element exists in a set. In a bloom filter,
false positives are possible whereas false negatives are not.

The bloom filter map does not have keys, only values. When the bloom
filter map is created, it must be created with a ``key_size`` of 0. The
bloom filter map supports two operations:

- push: adding an element to the map
- peek: determining whether an element is present in the map

BPF programs must use ``bpf_map_push_elem`` to add an element to the
bloom filter map and ``bpf_map_peek_elem`` to query the map. These
operations are exposed to userspace applications using the existing
``bpf`` syscall in the following way:

- ``BPF_MAP_UPDATE_ELEM`` -> push
- ``BPF_MAP_LOOKUP_ELEM`` -> peek

The ``max_entries`` size that is specified at map creation time is used
to approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than ``max_entries``, this may lead to a higher
false positive rate.

The number of hashes to use for the bloom filter is configurable using
the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation
time. If no number is specified, the default used will be 5 hash
functions. In general, using more hashes decreases both the false
positive rate and the speed of a lookup.
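
The push/peek semantics can be sketched in plain userspace C. This is an illustrative model only: the hash mixing, bitmap size, and names below are made up for the example and are not the kernel's implementation. It does demonstrate the one hard guarantee of a bloom filter, namely that an element that was pushed is always found (no false negatives):

```c
#include <stdint.h>

#define NBITS 1024u

static uint8_t bloom_bits[NBITS / 8];

/* i-th hash of value (made-up mixing, not the kernel's hash). */
static uint32_t bloom_hash(uint32_t value, uint32_t i)
{
        uint32_t h = value * 2654435761u + i * 0x9e3779b9u;

        h ^= h >> 16;
        return h % NBITS;
}

static void bloom_push(uint32_t value, int nr_hashes)
{
        for (int i = 0; i < nr_hashes; i++) {
                uint32_t b = bloom_hash(value, (uint32_t)i);

                bloom_bits[b / 8] |= (uint8_t)(1u << (b % 8));
        }
}

/* 0: probably present, -1: definitely absent (cf. -ENOENT). */
static int bloom_peek(uint32_t value, int nr_hashes)
{
        for (int i = 0; i < nr_hashes; i++) {
                uint32_t b = bloom_hash(value, (uint32_t)i);

                if (!(bloom_bits[b / 8] & (1u << (b % 8))))
                        return -1;
        }
        return 0;
}

/* No-false-negatives property: everything pushed is always found. */
static int bloom_demo(void)
{
        for (uint32_t v = 1; v <= 100; v++)
                bloom_push(v, 5);
        for (uint32_t v = 1; v <= 100; v++)
                if (bloom_peek(v, 5) != 0)
                        return -1;
        return 0;
}
```

Note that the model deliberately makes no claim about absent values: a peek on a value that was never pushed may still return 0, which is the false-positive case.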

It is not possible to delete elements from a bloom filter map. A bloom
filter map may be used as an inner map. The user is responsible for
synchronising concurrent updates and lookups to ensure no false negative
lookups occur.

Usage
=====

Kernel BPF
----------

bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)

A ``value`` can be added to a bloom filter using the
``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` when adding an entry to the bloom filter. This helper
returns ``0`` on success, or a negative error in case of failure.

bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   long bpf_map_peek_elem(struct bpf_map *map, void *value)

The ``bpf_map_peek_elem()`` helper is used to determine whether
``value`` is present in the bloom filter map. This helper returns ``0``
if ``value`` is probably present in the map, or ``-ENOENT`` if ``value``
is definitely not present in the map.

Userspace
---------

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)

A userspace program can add a ``value`` to a bloom filter using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on
success, or a negative error in case of failure.

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   int bpf_map_lookup_elem(int fd, const void *key, void *value)

A userspace program can determine the presence of ``value`` in a bloom
filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key``
parameter must be set to ``NULL``. Returns ``0`` if ``value`` is
probably present in the map, or ``-ENOENT`` if ``value`` is definitely
not present in the map.

Examples
========

Kernel BPF
----------

This snippet shows how to declare a bloom filter in a BPF program:

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
            __type(value, __u32);
            __uint(max_entries, 1000);
            __uint(map_extra, 3);
    } bloom_filter SEC(".maps");

This snippet shows how to determine presence of a value in a bloom
filter in a BPF program:

.. code-block:: c

    void *lookup(__u32 key)
    {
            if (bpf_map_peek_elem(&bloom_filter, &key) == 0) {
                    /* Verify not a false positive and fetch an associated
                     * value using a secondary lookup, e.g. in a hash table
                     */
                    return bpf_map_lookup_elem(&hash_table, &key);
            }
            return NULL;
    }

Userspace
---------

This snippet shows how to use libbpf to create a bloom filter map from
userspace:

.. code-block:: c

    int create_bloom(void)
    {
            LIBBPF_OPTS(bpf_map_create_opts, opts,
                        .map_extra = 3);             /* number of hashes */

            return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
                                  "ipv6_bloom",      /* name */
                                  0,                 /* key size, must be zero */
                                  sizeof(ipv6_addr), /* value size */
                                  10000,             /* max entries */
                                  &opts);            /* create options */
    }

This snippet shows how to add an element to a bloom filter from
userspace:

.. code-block:: c

    int add_element(struct bpf_map *bloom_map, __u32 value)
    {
            int bloom_fd = bpf_map__fd(bloom_map);

            return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY);
    }

References
==========

https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/
Documentation/bpf/map_cgrp_storage.rst (new file, 109 lines)
@@ -0,0 +1,109 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.

=========================
BPF_MAP_TYPE_CGRP_STORAGE
=========================

The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents local fixed-size
storage for cgroups. It is only available with ``CONFIG_CGROUPS``, and
programs that use it are gated by the same Kconfig option. The
data for a particular cgroup can be retrieved by looking up the map
with that cgroup.

This document describes the usage and semantics of the
``BPF_MAP_TYPE_CGRP_STORAGE`` map type.

Usage
=====

The map key must be ``sizeof(int)`` representing a cgroup fd.
To access the storage in a program, use ``bpf_cgrp_storage_get``::

    void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)

``flags`` can be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE``, which indicates that
a new local storage will be created if one does not exist.
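
The get-or-create behaviour selected by that flag can be modelled in plain C. Everything below is an illustrative sketch: the flag value, array layout, and function names are made up for the example and only mirror the semantics of ``BPF_LOCAL_STORAGE_GET_F_CREATE``:

```c
#include <stddef.h>

#define GET_F_CREATE 0x1u   /* stands in for BPF_LOCAL_STORAGE_GET_F_CREATE */
#define MAX_CGROUPS  8

static long cgrp_values[MAX_CGROUPS];
static int  cgrp_present[MAX_CGROUPS];

/* Return existing storage for cgrp_id, create it on demand when the
 * create flag is passed, or return NULL. */
static long *storage_get(unsigned int cgrp_id, unsigned int flags)
{
        if (cgrp_id >= MAX_CGROUPS)
                return NULL;
        if (!cgrp_present[cgrp_id]) {
                if (!(flags & GET_F_CREATE))
                        return NULL;           /* no storage yet */
                cgrp_present[cgrp_id] = 1;
                cgrp_values[cgrp_id] = 0;      /* new storage is zeroed */
        }
        return &cgrp_values[cgrp_id];
}

/* Returns 0 when the get-or-create behaviour holds for one cgroup id. */
static int storage_demo(void)
{
        long *p;

        if (storage_get(2, 0) != NULL)         /* absent without create */
                return -1;
        p = storage_get(2, GET_F_CREATE);      /* created on demand */
        if (!p || *p != 0)
                return -1;
        *p += 1;
        if (storage_get(2, 0) != p)            /* later gets find it */
                return -1;
        return 0;
}
```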

The local storage can be removed with ``bpf_cgrp_storage_delete``::

    long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)

The map is available to all program types.

Examples
========

A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE::

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, long);
    } cgrp_storage SEC(".maps");

    SEC("tp_btf/sys_enter")
    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            long *ptr;

            ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (ptr)
                    __sync_fetch_and_add(ptr, 1);

            return 0;
    }

Userspace accessing the map declared above::

    #include <bpf/libbpf.h>
    #include <bpf/bpf.h>

    long map_lookup(struct bpf_map *map, int cgrp_fd)
    {
            long value;

            if (!bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd, &value))
                    return value;
            return 0;
    }

Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
============================================================================

The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
illustrates the main differences between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.

(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
like ``BPF_CGROUP_INET_INGRESS`` or ``BPF_CGROUP_SOCK_OPS``, etc.

(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup
which is attached by a BPF program.

(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
``bpf_get_local_storage()`` always returns non-NULL local storage.
``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
it is possible that ``bpf_cgrp_storage_get()`` may return NULL local storage.
To avoid such a NULL local storage issue, user space can call
``bpf_map_update_elem()`` to pre-allocate local storage before a BPF program
is attached.

(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a BPF program
while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage at
prog detach time.

So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
Documentation/bpf/map_cpumap.rst (new file, 177 lines)
@@ -0,0 +1,177 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

===================
BPF_MAP_TYPE_CPUMAP
===================

.. note::
   - ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15

.. kernel-doc:: kernel/bpf/cpumap.c
   :doc: cpu map

An example use-case for this map type is software-based Receive Side Scaling (RSS).

The CPUMAP represents the CPUs in the system indexed as the map-key, and the
map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated
kernel thread bound to the given CPU to represent the remote CPU execution unit.

Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program
on the remote CPU. This allows an XDP program to split its processing across
multiple CPUs. For example, a scenario where the initial CPU (that sees/receives
the packets) needs to do minimal packet processing and the remote CPU (to which
the packet is directed) can afford to spend more cycles processing the frame. The
initial CPU is where the XDP redirect program is executed. The remote CPU
receives raw ``xdp_frame`` objects.

Usage
=====

Kernel BPF
----------

bpf_redirect_map()
^^^^^^^^^^^^^^^^^^

.. code-block:: c

   long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)

Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs.

The lower two bits of ``flags`` are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller.

User space
----------

.. note::
   CPUMAP entries can only be updated/looked up/deleted from user space and not
   from an eBPF program. Trying to call these functions from a kernel eBPF
   program will result in the program failing to load and a verifier warning.

bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);

CPU entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_cpumap_val``.

.. code-block:: c

    struct bpf_cpumap_val {
            __u32 qsize;      /* queue size to remote target CPU */
            union {
                    int   fd; /* prog fd on map write */
                    __u32 id; /* prog id on map read */
            } bpf_prog;
    };

The ``flags`` argument can be one of the following:

- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.

bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_lookup_elem(int fd, const void *key, void *value);

CPU entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.

bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_delete_elem(int fd, const void *key);

CPU entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or a negative error in case of
failure.

Examples
========

Kernel
------

The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called
``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme.

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_CPUMAP);
            __type(key, __u32);
            __type(value, struct bpf_cpumap_val);
            __uint(max_entries, 12);
    } cpu_map SEC(".maps");

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __type(key, __u32);
            __type(value, __u32);
            __uint(max_entries, 12);
    } cpus_available SEC(".maps");

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __type(key, __u32);
            __type(value, __u32);
            __uint(max_entries, 1);
    } cpus_iterator SEC(".maps");

    SEC("xdp")
    int xdp_redir_cpu_round_robin(struct xdp_md *ctx)
    {
            __u32 key = 0;
            __u32 cpu_dest = 0;
            __u32 *cpu_selected, *cpu_iterator;
            __u32 cpu_idx;

            cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key);
            if (!cpu_iterator)
                    return XDP_ABORTED;
            cpu_idx = *cpu_iterator;

            *cpu_iterator += 1;
            if (*cpu_iterator == bpf_num_possible_cpus())
                    *cpu_iterator = 0;

            cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
            if (!cpu_selected)
                    return XDP_ABORTED;
            cpu_dest = *cpu_selected;

            if (cpu_dest >= bpf_num_possible_cpus())
                    return XDP_ABORTED;

            return bpf_redirect_map(&cpu_map, cpu_dest, 0);
    }

User space
----------

The following code snippet shows how to dynamically set the max_entries for a
CPUMAP to the maximum number of CPUs available on the system.

.. code-block:: c

    int set_max_cpu_entries(struct bpf_map *cpu_map)
    {
            if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) {
                    fprintf(stderr, "Failed to set max entries for cpu_map map: %s",
                            strerror(errno));
                    return -1;
            }
            return 0;
    }

References
==========

- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap
Documentation/bpf/map_devmap.rst (new file, 238 lines)
@@ -0,0 +1,238 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=================================================
BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
=================================================

.. note::
   - ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
   - ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4

``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
the index to look up a reference to a net device, while ``BPF_MAP_TYPE_DEVMAP_HASH``
is backed by a hash table that uses a key to look up a reference to a net device.
The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
pairs to update the maps with new net devices.

.. note::
   - The key to a hash map doesn't have to be an ``ifindex``.
   - While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices,
     it comes at the cost of a hash of the key when performing a lookup.

The setup and packet enqueue/send code is shared between the two types of
devmap; only the lookup and insertion is different.

Usage
=====

Kernel BPF
----------

bpf_redirect_map()
^^^^^^^^^^^^^^^^^^

.. code-block:: c

   long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)

Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
references to net devices (for forwarding packets through other ports).

The lower two bits of ``flags`` are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
below.

With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
in the map; with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
from the broadcast.

.. note::
   - The key is ignored if BPF_F_BROADCAST is set.
   - The broadcast feature can also be used to implement multicast forwarding:
     simply create multiple DEVMAPs, each one corresponding to a single multicast group.

This helper will return ``XDP_REDIRECT`` on success, or the value of the two
lower bits of the ``flags`` argument if the map lookup fails.
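
The fallback encoding can be sketched in plain C. The XDP return-code values are the real UAPI ones (``include/uapi/linux/bpf.h``); the helper function itself is an illustrative model of the failure path, not kernel code:

```c
/* XDP program return codes from the UAPI. */
enum { XDP_ABORTED = 0, XDP_DROP = 1, XDP_PASS = 2, XDP_TX = 3 };

/* Illustrative model: when the map lookup fails, bpf_redirect_map()
 * returns whatever action was encoded in bits 0-1 of flags. */
static unsigned long redirect_fallback(unsigned long flags)
{
        return flags & 0x3;
}
```

Because ``XDP_TX`` is 3, every return code up to ``XDP_TX`` fits in those two bits, leaving the remaining bits free for flags such as ``BPF_F_BROADCAST``.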

More information about redirection can be found in :doc:`redirect`.

bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.

User space
----------

.. note::
   DEVMAP entries can only be updated/deleted from user space and not
   from an eBPF program. Trying to call these functions from a kernel eBPF
   program will result in the program failing to load and a verifier warning.

bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);

Net device entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
compatibility.

.. code-block:: c

    struct bpf_devmap_val {
            __u32 ifindex;    /* device index */
            union {
                    int   fd; /* prog fd on map write */
                    __u32 id; /* prog id on map read */
            } bpf_prog;
    };

The ``flags`` argument can be one of the following:

- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.

DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
access to both Rx device and Tx device. The program associated with the ``fd``
must have type XDP with expected attach type ``xdp_devmap``.
When a program is associated with a device index, the program is run on an
``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
of how to attach/use xdp_devmap progs can be found in the kernel selftests:

- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``

bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_lookup_elem(int fd, const void *key, void *value);

Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.

bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

   int bpf_map_delete_elem(int fd, const void *key);

Net device entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or a negative error in case of
failure.

Examples
========

Kernel BPF
----------

The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
called tx_port.

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_DEVMAP);
            __type(key, __u32);
            __type(value, __u32);
            __uint(max_entries, 256);
    } tx_port SEC(".maps");

The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
called forward_map.

.. code-block:: c

    struct {
            __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
            __type(key, __u32);
            __type(value, struct bpf_devmap_val);
            __uint(max_entries, 32);
    } forward_map SEC(".maps");

.. note::

   The value type in the DEVMAP above is a ``struct bpf_devmap_val``

The following code snippet shows a simple xdp_redirect_map program. This program
would work with a user space program that populates the devmap ``forward_map`` based
on ingress ifindexes. The BPF program (below) is redirecting packets using the
ingress ``ifindex`` as the ``key``.

.. code-block:: c

    SEC("xdp")
    int xdp_redirect_map_func(struct xdp_md *ctx)
    {
            int index = ctx->ingress_ifindex;

            return bpf_redirect_map(&forward_map, index, 0);
    }

The following code snippet shows a BPF program that is broadcasting packets to
all the interfaces in the ``tx_port`` devmap.

.. code-block:: c

    SEC("xdp")
    int xdp_redirect_map_func(struct xdp_md *ctx)
    {
            return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
    }

User space
----------

The following code snippet shows how to update a devmap called ``tx_port``.

.. code-block:: c

    int update_devmap(int ifindex, int redirect_ifindex)
    {
            int ret;

            ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
            if (ret < 0) {
                    fprintf(stderr, "Failed to update devmap value: %s\n",
                            strerror(errno));
            }

            return ret;
    }

The following code snippet shows how to update a hash_devmap called ``forward_map``.

.. code-block:: c

    int update_devmap(int ifindex, int redirect_ifindex)
    {
            struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
            int ret;

            ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
            if (ret < 0) {
                    fprintf(stderr, "Failed to update devmap value: %s\n",
                            strerror(errno));
            }
            return ret;
    }

References
==========

- https://lwn.net/Articles/728146/
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106

@@ -34,7 +34,14 @@ the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.

Usage
=====

Kernel BPF
----------

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)

Hash entries can be added or updated using the ``bpf_map_update_elem()``

@@ -49,14 +56,22 @@ parameter can be used to control the update behaviour:

``bpf_map_update_elem()`` returns 0 on success, or negative error in
case of failure.

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
helper. This helper returns a pointer to the value associated with
``key``, or ``NULL`` if no entry was found.

bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_delete_elem(struct bpf_map *map, const void *key)

Hash entries can be deleted using the ``bpf_map_delete_elem()``

@@ -70,7 +85,11 @@ For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
automatically access the hash slot for the current CPU.

bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)

The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the

@@ -89,7 +108,11 @@ See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.

Userspace
---------

bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)

In userspace, it is possible to iterate through the keys of a hash using
libbpf's ``bpf_map_get_next_key()`` function.

Documentation/bpf/map_lpm_trie.rst (new file)
@@ -0,0 +1,197 @@

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=====================
BPF_MAP_TYPE_LPM_TRIE
=====================

.. note::
   - ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11

``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
can be used to match IP addresses to a stored set of prefixes.
Internally, data is stored in an unbalanced trie of nodes that uses
``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
network byte order, i.e. big endian, so ``data[0]`` stores the most
significant byte.

LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
``max_prefixlen/8`` bytes.

- For IPv4 addresses the data length is 4 bytes
- For IPv6 addresses the data length is 16 bytes

The value type stored in the LPM trie can be any user defined type.

.. note::
   When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
   ``BPF_F_NO_PREALLOC`` flag.

Usage
=====

Kernel BPF
----------

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

The longest prefix entry for a given data value can be found using the
``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
value associated with the longest matching ``key``, or ``NULL`` if no
entry was found.

The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
performing longest prefix lookups. For example, when searching for the
longest prefix match for an IPv4 address, ``prefixlen`` should be set to
``32``.
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)

Prefix entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically.

``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
case of failure.

.. note::
   The ``flags`` parameter must be one of ``BPF_ANY``, ``BPF_NOEXIST`` or
   ``BPF_EXIST``, but the value is ignored, giving ``BPF_ANY`` semantics.

bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_delete_elem(struct bpf_map *map, const void *key)

Prefix entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case
of failure.

Userspace
---------

Access from userspace uses libbpf APIs with the same names as above, with
the map identified by its ``fd``.

bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)

A userspace program can iterate through the entries in an LPM trie using
libbpf's ``bpf_map_get_next_key()`` function. The first key can be
fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
``NULL``. Subsequent calls will fetch the next key that follows the
current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
error in case of failure.

``bpf_map_get_next_key()`` will iterate through the LPM trie elements
from leftmost leaf first. This means that iteration will return more
specific keys before less specific ones.

Examples
========

Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
of LPM trie usage from userspace. The code snippets below demonstrate
API usage.

Kernel BPF
----------

The following BPF code snippet shows how to declare a new LPM trie for IPv4
address prefixes:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct ipv4_lpm_key {
        __u32 prefixlen;
        __u32 data;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __type(key, struct ipv4_lpm_key);
        __type(value, __u32);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __uint(max_entries, 255);
    } ipv4_lpm_map SEC(".maps");

The following BPF code snippet shows how to look up an entry by IPv4 address:

.. code-block:: c

    void *lookup(__u32 ipaddr)
    {
        struct ipv4_lpm_key key = {
            .prefixlen = 32,
            .data = ipaddr
        };

        return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
    }

Userspace
---------

The following snippet shows how to insert an IPv4 prefix entry into an
LPM trie:

.. code-block:: c

    int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
    {
        struct ipv4_lpm_key ipv4_key = {
            .prefixlen = prefixlen,
            .data = addr
        };
        return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
    }

The following snippet shows a userspace program walking through the entries
of an LPM trie:

.. code-block:: c

    #include <bpf/libbpf.h>
    #include <bpf/bpf.h>

    void iterate_lpm_trie(int map_fd)
    {
        struct ipv4_lpm_key *cur_key = NULL;
        struct ipv4_lpm_key next_key;
        struct value value;
        int err;

        for (;;) {
            err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
            if (err)
                break;

            bpf_map_lookup_elem(map_fd, &next_key, &value);

            /* Use key and value here */

            cur_key = &next_key;
        }
    }

Documentation/bpf/map_of_maps.rst (new file)
@@ -0,0 +1,130 @@

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

========================================================
BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
========================================================

.. note::
   - ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were
     introduced in kernel version 4.12

``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general
purpose support for map in map storage. One level of nesting is supported, where
an outer map contains instances of a single type of inner map, for example
``array_of_maps->sock_map``.

When creating an outer map, an inner map instance is used to initialize the
metadata that the outer map holds about its inner maps. This inner map has a
separate lifetime from the outer map and can be deleted after the outer map has
been created.

The outer map supports element lookup, update and delete from user space using
the syscall API. A BPF program is only allowed to do element lookup in the outer
map.

.. note::
   - Multi-level nesting is not supported.
   - Any BPF map type can be used as an inner map, except for
     ``BPF_MAP_TYPE_PROG_ARRAY``.
   - A BPF program cannot update or delete outer map entries.

For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index
into the array. The array is a fixed size with ``max_entries`` elements that are
zero initialized when created.

For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the
map. The kernel is responsible for allocating and freeing key/value pairs, up to
the ``max_entries`` limit that you specify. Hash maps use pre-allocation of hash
table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable
pre-allocation when it is too memory expensive.

Usage
=====

Kernel BPF Helper
-----------------

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This
helper returns a pointer to the inner map, or ``NULL`` if no entry was found.
Examples
========

Kernel BPF Example
------------------

This snippet shows how to create and initialise an array of devmaps in a BPF
program. Note that the outer array can only be modified from user space using
the syscall API.

.. code-block:: c

    struct inner_map {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 10);
        __type(key, __u32);
        __type(value, __u32);
    } inner_map1 SEC(".maps"), inner_map2 SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 2);
        __type(key, __u32);
        __array(values, struct inner_map);
    } outer_map SEC(".maps") = {
        .values = { &inner_map1,
                    &inner_map2 }
    };

See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more
examples of declarative initialisation of outer maps.

User Space
----------

This snippet shows how to create an array based outer map:

.. code-block:: c

    int create_outer_array(int inner_fd)
    {
        LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
        int fd;

        fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS,
                            "example_array", /* name */
                            sizeof(__u32),   /* key size */
                            sizeof(__u32),   /* value size */
                            256,             /* max entries */
                            &opts);          /* create opts */
        return fd;
    }

This snippet shows how to add an inner map to an outer map:

.. code-block:: c

    int add_devmap(int outer_fd, int index, const char *name)
    {
        int fd;

        fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name,
                            sizeof(__u32), sizeof(__u32), 256, NULL);
        if (fd < 0)
            return fd;

        return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY);
    }

References
==========

- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/
- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/

Documentation/bpf/map_queue_stack.rst (new file)
@@ -0,0 +1,146 @@

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=========================================
BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
=========================================

.. note::
   - ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced
     in kernel version 4.20

``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK``
provides LIFO storage for BPF programs. These maps support peek, pop and
push operations that are exposed to BPF programs through the respective
helpers. These operations are exposed to userspace applications using
the existing ``bpf`` syscall in the following way:

- ``BPF_MAP_LOOKUP_ELEM`` -> peek
- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop
- ``BPF_MAP_UPDATE_ELEM`` -> push

``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support
``BPF_F_NO_PREALLOC``.

Usage
=====

Kernel BPF
----------

bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)

An element ``value`` can be added to a queue or stack using the
``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then,
when the queue or stack is full, the oldest element will be removed to
make room for ``value`` to be added. Returns ``0`` on success, or
negative error in case of failure.

bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_peek_elem(struct bpf_map *map, void *value)

This helper fetches an element ``value`` from a queue or stack without
removing it. Returns ``0`` on success, or negative error in case of
failure.

bpf_map_pop_elem()
~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_map_pop_elem(struct bpf_map *map, void *value)

This helper removes an element from a queue or stack, storing it in
``value``. Returns ``0`` on success, or negative error in case of failure.

Userspace
---------

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)

A userspace program can push ``value`` onto a queue or stack using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the
same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on
success, or negative error in case of failure.

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_lookup_elem(int fd, const void *key, void *value)

A userspace program can peek at the ``value`` at the head of a queue or stack
using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be
set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.

bpf_map_lookup_and_delete_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)

A userspace program can pop a ``value`` from the head of a queue or stack using
the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter
must be set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.

Examples
========

Kernel BPF
----------

This snippet shows how to declare a queue in a BPF program:

.. code-block:: c

    struct {
        __uint(type, BPF_MAP_TYPE_QUEUE);
        __type(value, __u32);
        __uint(max_entries, 10);
    } queue SEC(".maps");

Userspace
---------

This snippet shows how to use libbpf's low-level API to create a queue from
userspace:

.. code-block:: c

    int create_queue(void)
    {
        return bpf_map_create(BPF_MAP_TYPE_QUEUE,
                              "sample_queue", /* name */
                              0,              /* key size, must be zero */
                              sizeof(__u32),  /* value size */
                              10,             /* max entries */
                              NULL);          /* create options */
    }

References
==========

- https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/

Documentation/bpf/map_sk_storage.rst (new file)
@@ -0,0 +1,155 @@

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=======================
BPF_MAP_TYPE_SK_STORAGE
=======================

.. note::
   - ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2

``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
to be provided and acts as the handle for accessing the socket-local
storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
locally with each socket instead of with the map. The kernel is responsible for
allocating storage for a socket when requested and for freeing the storage when
either the map or the socket is deleted.

.. note::
   - The key type must be ``int`` and ``max_entries`` must be set to ``0``.
   - The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
     socket-local storage.

Usage
=====

Kernel BPF
----------

bpf_sk_storage_get()
~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)

Socket-local storage can be retrieved using the ``bpf_sk_storage_get()``
helper. The helper gets the storage from ``sk`` that is associated with ``map``.
If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then
``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not
already exist. ``value`` can be used together with
``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it
will be zero initialized. Returns a pointer to the storage on success, or
``NULL`` in case of failure.

.. note::
   - ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
   - ``sk`` is a ``struct bpf_sock`` pointer for other program types.

bpf_sk_storage_delete()
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    long bpf_sk_storage_delete(struct bpf_map *map, void *sk)

Socket-local storage can be deleted using the ``bpf_sk_storage_delete()``
helper. The helper deletes the storage from ``sk`` that is identified by
``map``. Returns ``0`` on success, or negative error in case of failure.

User space
----------

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. The ``flags`` parameter can be used to control the update behaviour:

- ``BPF_ANY`` will create storage for ``fd`` or update existing storage.
- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already
  exist, otherwise the call will fail with ``-EEXIST``.
- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists,
  otherwise the call will fail with ``-ENOENT``.

Returns ``0`` on success, or negative error in case of failure.

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_lookup_elem(int map_fd, const void *key, void *value)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. Returns ``0`` on success, or negative error in case of failure.

bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

    int bpf_map_delete_elem(int map_fd, const void *key)

Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf
function. Returns ``0`` on success, or negative error in case of failure.

Examples
========

Kernel BPF
----------

This snippet shows how to declare socket-local storage in a BPF program:

.. code-block:: c

    struct {
        __uint(type, BPF_MAP_TYPE_SK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct my_storage);
    } socket_storage SEC(".maps");

This snippet shows how to retrieve socket-local storage in a BPF program:

.. code-block:: c

    SEC("sockops")
    int _sockops(struct bpf_sock_ops *ctx)
    {
        struct my_storage *storage;
        struct bpf_sock *sk;

        sk = ctx->sk;
        if (!sk)
            return 1;

        storage = bpf_sk_storage_get(&socket_storage, sk, 0,
                                     BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (!storage)
            return 1;

        /* Use 'storage' here */

        return 1;
    }

Please see the ``tools/testing/selftests/bpf`` directory for functional
examples.

References
==========

- https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/

Documentation/bpf/map_xskmap.rst (new file)
@@ -0,0 +1,192 @@

.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

===================
BPF_MAP_TYPE_XSKMAP
===================

.. note::
   - ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18

The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for the XDP BPF helper
call ``bpf_redirect_map()`` and ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'.
This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of
address family in the kernel that allows redirection of frames from a driver to
user space without having to traverse the full network stack. An AF_XDP socket
binds to a single netdev queue. A mapping of XSKs to queues is shown below:

.. code-block:: none

    +---------------------------------------------------+
    |     xsk A      |     xsk B      |      xsk C      |<---+ User space
    =========================================================|==========
    |    Queue 0     |    Queue 1     |     Queue 2     |    | Kernel
    +---------------------------------------------------+    |
    |                   Netdev eth0                     |    |
    +---------------------------------------------------+    |
    |              +=============+                      |    |
    |              | key |  xsk  |                      |    |
    |  +---------+ +=============+                      |    |
    |  |         | | 0   | xsk A |                      |    |
    |  |         | +-------------+                      |    |
    |  |         | | 1   | xsk B |                      |    |
    |  | BPF     |-- redirect -->+-------------+-------------+
    |  | prog    | | 2   | xsk C |                      |
    |  |         | +-------------+                      |
    |  |         |                                      |
    |  +---------+                                      |
    |                                                   |
    +---------------------------------------------------+

.. note::
   An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only*
   accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect
   from a <netdev/queue_id> other than what the socket is bound to, the frame will
   not be received on the socket.

Typically an XSKMAP is created per netdev. This map contains an array of XSK File
Descriptors (FDs). The number of array elements is typically set or adjusted using
the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number
of queues supported by the netdev.

.. note::
   Both the map key and map value size must be 4 bytes.

Usage
=====

Kernel BPF
----------

bpf_redirect_map()
^^^^^^^^^^^^^^^^^^

.. code-block:: c

    long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)

Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs
for sockets attached to a netdev's queues.

.. note::
   If the map is empty at an index, the packet is dropped. This means that it is
   necessary to have an XDP program loaded with at least one XSK in the
   XSKMAP to be able to get any traffic to user space through the socket.

bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

XSK entry references of type ``struct xdp_sock *`` can be retrieved using the
``bpf_map_lookup_elem()`` helper.

User space
----------

.. note::
   XSK entries can only be updated/deleted from user space and not from
   a BPF program. Trying to call these functions from a kernel BPF program will
   result in the program failing to load and a verifier warning.

bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

    int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)

XSK entries can be added or updated using the ``bpf_map_update_elem()``
helper. The ``key`` parameter is equal to the queue_id of the queue the XSK
is attaching to, and the ``value`` parameter is the FD value of that socket.

Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the
associated ``struct xdp_sock`` instance.

The ``flags`` argument can be one of the following:

- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.

bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

    int bpf_map_lookup_elem(int fd, const void *key, void *value)

Returns ``struct xdp_sock *`` or negative error in case of failure.

bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c

    int bpf_map_delete_elem(int fd, const void *key)

XSK entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.

.. note::
   When `libxdp`_ deletes an XSK it also removes the associated socket
   entry from the XSKMAP.

Examples
========

Kernel
------

The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called
``xsks_map`` and how to redirect packets to an XSK.

.. code-block:: c

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __type(key, __u32);
        __type(value, __u32);
        __uint(max_entries, 64);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int xsk_redir_prog(struct xdp_md *ctx)
    {
        __u32 index = ctx->rx_queue_index;

        if (bpf_map_lookup_elem(&xsks_map, &index))
            return bpf_redirect_map(&xsks_map, index, 0);
        return XDP_PASS;
    }

User space
----------

The following code snippet shows how to update an XSKMAP with an XSK entry.

.. code-block:: c

    int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd)
    {
        int ret;

        ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0);
        if (ret < 0)
            fprintf(stderr, "Failed to update xsks_map: %s\n", strerror(errno));

        return ret;
    }

For an example of how to create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
For a detailed explanation of the AF_XDP interface please see:

- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.

.. note::
   The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_.

.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html
.. _bpf-examples: https://github.com/xdp-project/bpf-examples
.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets
Some files were not shown because too many files have changed in this diff.