Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.
The following patch demostrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support for calling a subset of socket setsockopts from
BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
than making the changes to call the socket setsockopt function because
the changes required would have been larger.
The ops supported are:
SO_RCVBUF
SO_SNDBUF
SO_MAX_PACING_RATE
SO_PRIORITY
SO_RCVLOWAT
SO_MARK
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.
Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.
Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code. For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.
The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.
This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.
This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.
Two new corresponding structs (one for the kernel one for the user/BPF
program):
/* kernel version */
struct bpf_sock_ops_kern {
struct sock *sk;
__u32 op;
union {
__u32 reply;
__u32 replylong[4];
};
};
/* user version
* Some fields are in network byte order reflecting the sock struct
* Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
* convert them to host byte order.
*/
struct bpf_sock_ops {
__u32 op;
union {
__u32 reply;
__u32 replylong[4];
};
__u32 family;
__u32 remote_ip4; /* In network byte order */
__u32 local_ip4; /* In network byte order */
__u32 remote_ip6[4]; /* In network byte order */
__u32 local_ip6[4]; /* In network byte order */
__u32 remote_port; /* In network byte order */
__u32 local_port; /* In host byte horder */
};
Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.
The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow for tc BPF programs to set a skb->hash, apart from clearing
and triggering a recalc that we have right now. It allows for BPF
to implement a custom hashing routine for skb_get_hash().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A single BPF_OBJ_GET_INFO_BY_FD cmd is used to obtain the info
for both bpf_prog and bpf_map. The kernel can figure out the
fd is associated with a bpf_prog or bpf_map.
The suggested struct bpf_prog_info and struct bpf_map_info are
not meant to be a complete list and it is not the goal of this patch.
New fields can be added in the future patch.
The focus of this patch is to create the interface,
BPF_OBJ_GET_INFO_BY_FD cmd for exposing the bpf_prog's and
bpf_map's info.
The obj's info, which will be extended (and get bigger) over time, is
separated from the bpf_attr to avoid bloating the bpf_attr.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add BPF_MAP_GET_FD_BY_ID command to allow user to get a fd
from a bpf_map's ID.
bpf_map_inc_not_zero() is added and is called with map_idr_lock
held.
__bpf_map_put() is also added which has the 'bool do_idr_lock'
param to decide if the map_idr_lock should be acquired when
freeing the map->id.
In the error path of bpf_map_inc_not_zero(), it may have to
call __bpf_map_put(map, false) which does not need
to take the map_idr_lock when freeing the map->id.
It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add BPF_PROG_GET_FD_BY_ID command to allow user to get a fd
from a bpf_prog's ID.
bpf_prog_inc_not_zero() is added and is called with prog_idr_lock
held.
__bpf_prog_put() is also added which has the 'bool do_idr_lock'
param to decide if the prog_idr_lock should be acquired when
freeing the prog->id.
In the error path of bpf_prog_inc_not_zero(), it may have to
call __bpf_prog_put(map, false) which does not need
to take the prog_idr_lock when freeing the prog->id.
It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds BPF_PROG_GET_NEXT_ID and BPF_MAP_GET_NEXT_ID
to allow userspace to iterate all bpf_prog IDs and bpf_map IDs.
The API is trying to be consistent with the existing
BPF_MAP_GET_NEXT_KEY.
It is currently limited to CAP_SYS_ADMIN which we can
consider to lift it in followup patches.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit updates documentation of the bpf_perf_event_output and
bpf_perf_event_read helpers to match their implementation.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new field, "prog_flags", and an initial flag value
BPF_F_STRICT_ALIGNMENT.
When set, the verifier will enforce strict pointer alignment
regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.
The verifier, in this mode, will also use a fixed value of "2" in
place of NET_IP_ALIGN.
This facilitates test cases that will exercise and validate this part
of the verifier even when run on architectures where alignment doesn't
matter.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
The description inside uapi/linux/bpf.h about bpf_get_socket_uid
helper function is no longer valid. It returns overflowuid rather
than 0 when failed.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add napi_id access to __sk_buff for socket filter program types, tc
program types and other bpf_convert_ctx_access() users. Having access
to skb->napi_id is useful for per RX queue listener siloing, f.e.
in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
used, meaning SO_REUSEPORT enabled listeners can then select the
corresponding socket at SYN time already [1]. The skb is marked via
skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).
Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d4339028b
("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
the NAPI ID associated with the queue for steering, which requires a
prior sk_mark_napi_id() after the socket was looked up.
Semantics for the __sk_buff napi_id access are similar, meaning if
skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
then an invalid napi_id of 0 is returned to the program, otherwise a
valid non-zero napi_id.
[1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.
Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.
Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.
Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.
To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket need to be a fullsock otherwise overflowuid is
returned.
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set.If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function coud be useful in monitoring per socket networking traffic
statistics and provide a unique socket identifier per namespace.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds hash of maps support (hashmap->bpf_map).
BPF_MAP_TYPE_HASH_OF_MAPS is added.
A map-in-map contains a pointer to another map and lets call
this pointer 'inner_map_ptr'.
Notes on deleting inner_map_ptr from a hash map:
1. For BPF_F_NO_PREALLOC map-in-map, when deleting
an inner_map_ptr, the htab_elem itself will go through
a rcu grace period and the inner_map_ptr resides
in the htab_elem.
2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
when deleting an inner_map_ptr, the htab_elem may
get reused immediately. This situation is similar
to the existing prealloc-ated use cases.
However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
inner_map->ops->map_free(inner_map) which will go
through a rcu grace period (i.e. all bpf_map's map_free
currently goes through a rcu grace period). Hence,
the inner_map_ptr is still safe for the rcu reader side.
This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
check_map_prealloc() in the verifier. preallocation is a
must for BPF_PROG_TYPE_PERF_EVENT. Hence, even we don't expect
heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
map-in-map first.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map). The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.
Any bpf map type can be acted as an inner_map. The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.
Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).
When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also. ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity reason, a simple '==' test is used for both map_flags
and max_entries. The equality of ops is implied by the equality of
map_type.
During outer_map creation time, an inner_map_fd is needed to create an
outer_map. However, the inner_map_fd's life time does not depend on the
outer_map. The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.
Also, for the outer_map:
* It allows element update and delete from syscall
* It allows element lookup from bpf_prog
The above is similar to the current fd_array pattern.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.
Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X
2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.
3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.
In the future this behavior may be extended with a chain of
non-overridable programs.
Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.
Add several testcases and adjust libbpf.
Fixes: 3007098494 ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When programs need to calculate the csum from scratch for small UDP
packets and use bpf_l4_csum_replace() to feed the result from helpers
like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
that would ignore the case of current csum being 0, and which would
still allow for the helper to set the csum and transform when needed
to CSUM_MANGLED_0.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.
Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.
The code carries more information about the internal implementation.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>