Add fanout mode PACKET_FANOUT_EBPF that accepts an extended BPF
program to select a socket.
Update the internal eBPF program by passing to socket option
SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().
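A minimal user-space sketch of the update path described above. It assumes
the socket has already joined a fanout group in PACKET_FANOUT_EBPF mode and
that prog_fd was returned by bpf() elsewhere. The helper name
fanout_set_ebpf is hypothetical, and the fallback #define values are only
there for older installed headers.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Fallbacks for older headers (assumed values, taken from the uapi header). */
#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif
#ifndef PACKET_FANOUT_DATA
#define PACKET_FANOUT_DATA 22
#endif

/*
 * Hypothetical helper: pass the file descriptor of a loaded eBPF program
 * to the fanout group that sock_fd belongs to.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int fanout_set_ebpf(int sock_fd, int prog_fd)
{
	return setsockopt(sock_fd, SOL_PACKET, PACKET_FANOUT_DATA,
			  &prog_fd, sizeof(prog_fd));
}
```

Calling the helper again with a different prog_fd is how the program would
be swapped at runtime under these assumptions.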
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
to select a socket.
This avoids having to keep adding special case fanout modes. One
example use case is application layer load balancing. The QUIC
protocol, for instance, encodes a connection ID in UDP payload.
Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
only user so far.
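As an illustration only, a classic BPF program that returns one payload
byte could be built and attached as sketched below. The helper names and
the idea of keying on a single byte are assumptions, and which header the
absolute offset is relative to is not stated here and should be checked
against the running kernel. The sketch assumes the socket has already
joined a group in PACKET_FANOUT_CBPF mode and that the kernel maps the
program's return value onto the group size.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <linux/filter.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif
#ifndef PACKET_FANOUT_DATA
#define PACKET_FANOUT_DATA 22
#endif

/*
 * Hypothetical helper: fill code[] with a two-instruction classic BPF
 * program that returns the byte at absolute offset `off`.
 */
static struct sock_fprog fanout_cbpf_byte_at(struct sock_filter code[2],
					     unsigned int off)
{
	const struct sock_filter insns[2] = {
		BPF_STMT(BPF_LD | BPF_B | BPF_ABS, off),	/* A = pkt[off] */
		BPF_STMT(BPF_RET | BPF_A, 0),			/* return A */
	};
	struct sock_fprog prog = { .len = 2, .filter = code };

	code[0] = insns[0];
	code[1] = insns[1];
	return prog;
}

/* Hypothetical helper: install the program as the group's fanout data. */
static int fanout_set_cbpf(int sock_fd, const struct sock_fprog *prog)
{
	return setsockopt(sock_fd, SOL_PACKET, PACKET_FANOUT_DATA,
			  prog, sizeof(*prog));
}
```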
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow e8e85cc5eb ("packet: remove handling of tx_ring") and remove
the tx_ring parameter from prb_shutdown_retire_blk_timer() as it is only
called with tx_ring = 0.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:
tp_len > dev->mtu + dev->hard_header_len
as dev->mtu and dev->hard_header_len are both unsigned.
That may lead to just returning an incorrect EMSGSIZE errno
to the user.
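The conversion problem can be reproduced in plain C. The sketch below is a
user-space model, not the kernel code: the function names are hypothetical
and the cast in the first function spells out the implicit conversion the
compiler applies when an int is compared with an unsigned sum. The second
function shows one possible guard.

```c
#include <errno.h>
#include <stdbool.h>

/* Models the problematic check: a negative tp_len becomes a huge value. */
static bool exceeds_mtu_buggy(int tp_len, unsigned int mtu,
			      unsigned short hard_header_len)
{
	/* The cast is what C does implicitly in "tp_len > mtu + hlen". */
	return (unsigned int)tp_len > mtu + hard_header_len;
}

/* One possible guard: only compare lengths that are not error codes. */
static bool exceeds_mtu_checked(int tp_len, unsigned int mtu,
				unsigned short hard_header_len)
{
	if (tp_len < 0)
		return false;	/* -errno, to be reported as is */
	return (unsigned int)tp_len > mtu + hard_header_len;
}
```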
Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}. However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased. Ultimately, this
caused the deletion of the interface to fail with the following message:
unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.
Fixes: 902fefb82e ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove handling of tx_ring in prb_setup_retire_blk_timer for TPACKET_V3:
init_prb_bdqc is called only with tx_ring == 0, and thus so is
prb_setup_retire_blk_timer.
Also, function init_prb_bdqc makes no use of tx_ring, so remove the
parameter from init_prb_bdqc as well.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Suggested-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
f->num_members. It returns the old value unconditionally, but
f->num_members may have changed since the last store. Ensure
that the return value is always < num.
When modifying the logic, simplify it further by replacing the loop
with an unconditional atomic increment.
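A user-space model of the simplified logic, using C11 atomics in place of
the kernel's atomic_t. The struct and function names are hypothetical. The
counter runs freely and is reduced modulo the member count passed in by the
caller, so the result is always below num even if the group size changed
since the previous call.

```c
#include <stdatomic.h>

struct fanout_lb_model {
	atomic_uint rr_cur;	/* free-running round-robin counter */
};

/* Returns an index in [0, num); the caller guarantees num != 0. */
static unsigned int demux_lb(struct fanout_lb_model *f, unsigned int num)
{
	/* One unconditional atomic increment, no retry loop. */
	return atomic_fetch_add(&f->rr_cur, 1) % num;
}
```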
Fixes: dc99f60069 ("packet: Add fanout support.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Destruction of the po->rollover must be delayed until there are no
more packets in flight that can access it. The field is destroyed in
packet_release, before synchronize_net. Delay using rcu.
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to tell the compiler that it must not read f->num_members multiple
times. Otherwise the test that num is non-zero is unreliable, and we could
attempt an invalid divide by 0 in fanout_demux_cpu().
Note that the bug was present in packet_rcv_fanout_hash() and
packet_rcv_fanout_lb() but final 3.1 had a simple location
after commit 95ec3eb417 ("packet: Add 'cpu' fanout policy.")
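The pattern is to load the member count exactly once into a local and to
use only that local afterwards. A user-space sketch, with a volatile cast
standing in for the kernel's single-load accessor. The macro, struct and
function names are hypothetical.

```c
/* Stand-in for a forced single load of an unsigned int. */
#define READ_ONCE_UINT(x) (*(const volatile unsigned int *)&(x))

struct fanout_cpu_model {
	unsigned int num_members;	/* may change concurrently */
};

/* Returns the selected index, or -1 if the group is empty. */
static int demux_cpu(struct fanout_cpu_model *f, unsigned int cpu)
{
	unsigned int num = READ_ONCE_UINT(f->num_members);

	if (!num)
		return -1;
	/* Uses the same value that was just tested, never a second load. */
	return (int)(cpu % num);
}
```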
Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rollover can be enabled as flag or mode. Allocate state in both cases.
This solves a NULL pointer exception in fanout_demux_rollover on
referencing po->rollover if using mode rollover.
Also make sure that in rollover mode each silo is tried (contrary
to rollover flag, where the main socket is excluded after an initial
try_self).
Tested:
Passes tools/testing/net/psock_fanout.c, which tests both modes and
flag. My previous tests were limited to bench_rollover, which only
stresses the flag. The test now completes safely. It still gives an
error for mode rollover, because it does not expect the new headroom
(ROOM_NORMAL) requirement. I will send a separate patch to the test.
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
I should have run this test and caught this before submission, of
course. Apologies for the oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid two xchg calls whose return values were unused, causing a
warning on some architectures.
The relevant variable is a hint and read without mutual exclusion.
This fix makes all writers hold the receive_queue lock.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.
If no socket with sufficient room is found, rollover fails. Also count
these events.
Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.
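From user space, the counters could be read roughly as follows. This is a
sketch that assumes the PACKET_ROLLOVER_STATS socket option and struct
tpacket_rollover_stats from linux/if_packet.h; the helper name is
hypothetical and the fallback #define values are only for older headers.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif
#ifndef PACKET_ROLLOVER_STATS
#define PACKET_ROLLOVER_STATS 21
#endif

/*
 * Hypothetical helper: fetch the rollover counters of a packet socket.
 * On success st->tp_all, st->tp_huge and st->tp_failed are filled in.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int read_rollover_stats(int fd, struct tpacket_rollover_stats *st)
{
	socklen_t len = sizeof(*st);

	return getsockopt(fd, SOL_PACKET, PACKET_ROLLOVER_STATS, st, &len);
}
```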
Tested:
Read counters in bench_rollover on all other tests in the patchset
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.
Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.
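The detection can be modelled in a few lines of user-space C. This is a
sketch under stated assumptions: HISTORY_LEN of 16 assumes a 64-byte cache
line and 4-byte hashes, the slot to overwrite is chosen round-robin here
purely to keep the model deterministic, and all names are hypothetical. A
zeroed history would also match a hash of 0, which the model ignores.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LEN 16	/* assumption: 64-byte line / 4-byte rxhash */

struct rollover_model {
	uint32_t history[HISTORY_LEN];	/* rxhashes of recent packets */
	unsigned int next;		/* slot to overwrite (assumption) */
};

/* True if more than half of the recorded packets share this rxhash. */
static bool flow_is_huge(struct rollover_model *r, uint32_t rxhash)
{
	unsigned int i, count = 0;

	for (i = 0; i < HISTORY_LEN; i++)
		if (r->history[i] == rxhash)
			count++;

	/* Record the current packet after counting. */
	r->history[r->next] = rxhash;
	r->next = (r->next + 1) % HISTORY_LEN;

	return count > HISTORY_LEN / 2;
}
```

With this model a single flow is flagged once nine of the sixteen slots
already hold its hash, while an unrelated flow arriving afterwards is not.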
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
cpu        rx      rx.k    drop.k  rollover    r.huge  r.failed
0          14        14         0         0         0         0
1          20        20         0         0         0         0
2          16        16         0         0         0         0
3     6168824   6168824         0   4867721   4867721         0
4     4867741   4867741         0         0         0         0
5          12        12         0         0         0         0
6          15        15         0         0         0         0
7          17        17         0         0         0         0
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.
Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.
Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.
Tested:
Ran bench_rollover: a process with 8 sockets in a single fanout
group, each pinned to a single cpu that receives one nic recv
interrupt. RPS and RFS are disabled. The benchmark uses packet
rx_ring, which has to take a lock when determining whether a
socket has room.
Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
uniformly across the packet sockets (and inserted an iptables
rule to drop in PREROUTING to avoid protocol stack processing).
Without this patch, all sockets try to migrate traffic to
neighbors, causing lock contention when searching for a non-
empty neighbor. The lock is the top 9 entries.
perf record -a -g sleep 5
- 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
- 99.00% spin_lock
+ 81.77% packet_rcv_has_room.isra.41
+ 18.23% tpacket_rcv
+ 0.84% packet_rcv_has_room.isra.41
+ 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock
+ 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock
+ 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock
+ 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock
+ 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock
+ 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock
+ 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock
+ 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock
+ 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41
On net-next with this patch, this lock contention is no longer a
top entry. Most time is spent in the actual read function. Next up
are other locks:
+ 15.52% bench_rollover bench_rollover [.] reader
+ 4.68% swapper [kernel.kallsyms] [k] memcpy_erms
+ 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51
+ 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms
+ 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv
+ 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq
Looking closer at the remaining _raw_spin_lock, the cost of probing
in rollover is now comparable to the cost of taking the lock later
in tpacket_rcv.
- 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
+ 33.41% packet_rcv_has_room
+ 28.15% tpacket_rcv
+ 19.54% enqueue_to_backlog
+ 6.45% __free_pages_ok
+ 2.78% packet_rcv_fanout
+ 2.13% fanout_demux_rollover
+ 2.01% netif_receive_skb_internal
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.
The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three need a different implementation to
estimate free space.
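For the regular socket case, the 25% rule can be sketched as below. This
is a user-space model, not the kernel implementation: the parameter names
mimic receive buffer accounting but are assumptions, as are the ROOM_LOW
and ROOM_NONE names for the two non-normal outcomes.

```c
enum room_level { ROOM_NONE, ROOM_LOW, ROOM_NORMAL };

/*
 * rcvbuf:     total receive buffer size in bytes
 * rmem_alloc: bytes currently queued
 * truesize:   bytes the incoming packet would consume
 */
static enum room_level classify_room(long rcvbuf, long rmem_alloc,
				     long truesize)
{
	long avail = rcvbuf - rmem_alloc - truesize;

	if (avail > rcvbuf / 4)		/* more than 25% would stay free */
		return ROOM_NORMAL;
	if (avail > 0)
		return ROOM_LOW;
	return ROOM_NONE;
}
```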
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
bench_rollover has as many sockets as there are NIC receive queues
in the system. Each socket is owned by a process that is pinned to
one of the receive cpus. RFS is disabled. RPS is enabled with an
identity mapping (cpu x -> cpu x), to count drops with softnettop.
lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
Press [Enter] to exit
cpu        rx      rx.k    drop.k  rollover    r.huge  r.failed
0          16        16         0         0         0         0
1          21        21         0         0         0         0
2     5227502   5227502         0         0         0         0
3          18        18         0         0         0         0
4     6083289   6083289         0   5227496         0         0
5          22        22         0         0         0         0
6          21        21         0         0         0         0
7           9         9         0         0         0         0
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.
The main difference between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.
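The difference between the two variants can be modelled with a single
try_self argument. The sketch below is a simplified user-space model with
hypothetical names: room is reduced to a boolean per socket, and the scan
always starts at index 0, which is a simplification and not necessarily
where the real search begins.

```c
#include <stdbool.h>

/*
 * has_room[i]: socket i currently has sufficient headroom.
 * idx:         socket picked by the primary mode (flag variant only).
 * try_self:    true for the flag variant, false for rollover mode.
 */
static unsigned int demux_rollover(const bool *has_room, unsigned int num,
				   unsigned int idx, bool try_self)
{
	unsigned int i;

	/* Flag variant: keep the original choice while it has room. */
	if (try_self && has_room[idx])
		return idx;

	for (i = 0; i < num; i++) {
		if (try_self && i == idx)
			continue;	/* already tested above */
		if (has_room[i])
			return i;
	}
	return idx;	/* no socket has room: fall back to idx */
}
```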
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for changing how struct net is refcounted
on kernel sockets, pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any
error code that occurs (most likely EAGAIN) to the caller. As the fast-path stays
as it is, we keep the unlikely() around skb == NULL.
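From the caller's side, a non-blocking flush of the TX_RING might then be
handled as sketched below. The helper name and its return convention are
hypothetical, and the sketch assumes the ring has already been set up and
filled by the application.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to transmit pending TX_RING frames
 * without blocking.
 * Returns 0 if the call succeeded, 1 if it would block (retry later),
 * -1 on any other error (errno is left set).
 */
static int tx_ring_flush_nonblock(int fd)
{
	if (send(fd, NULL, 0, MSG_DONTWAIT) >= 0)
		return 0;
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return 1;
	return -1;
}
```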
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
af_packet user that at least the transport header checksum
has been already validated.
For now, the flag may be set for incoming packets only.
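A receiver using a TPACKET_V2 ring could test the flag as sketched here.
The helper name is hypothetical, and the fallback definition of
TP_STATUS_CSUM_VALID is only for headers that predate this change.

```c
#include <stdbool.h>
#include <linux/if_packet.h>

#ifndef TP_STATUS_CSUM_VALID
#define TP_STATUS_CSUM_VALID (1 << 7)	/* fallback for older headers */
#endif

/*
 * Hypothetical helper: true if the frame is owned by user space and the
 * kernel reports that at least the transport checksum was validated.
 */
static bool rx_csum_validated(const struct tpacket2_hdr *hdr)
{
	return (hdr->tp_status & TP_STATUS_USER) &&
	       (hdr->tp_status & TP_STATUS_CSUM_VALID);
}
```

When the helper returns false, the application cannot conclude anything
about the checksum and would have to verify it itself if it cares.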
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just an optimization. We don't need the value of the status
variable if the packet is filtered.
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having to say
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
in structures is a little bit wordy and a little bit error prone.
Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
> } possible_net_t;
And then in a header say:
> possible_net_t net;
Which is cleaner and easier to use and easier to test, as the
possible_net_t is always there no matter what the compile options.
Further this allows read_pnet and write_pnet to be functions in all
cases which is better at catching typos.
This change adds possible_net_t, updates the definitions of read_pnet
and write_pnet, changes the optional struct net * variables that
write_pnet is used on to have the type possible_net_t, and finally fixes
up the b0rked users of read_pnet and write_pnet.
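A user-space model of the resulting accessors, to show why they can be
real functions in both configurations. The struct net stand-in and the
init_net fallback are assumptions made for the model. CONFIG_NET_NS is
defined here to model a namespace-enabled build.

```c
#define CONFIG_NET_NS 1		/* model: namespaces enabled */

struct net { int id; };		/* stand-in for the kernel structure */
static struct net init_net = { 0 };

typedef struct {
#ifdef CONFIG_NET_NS
	struct net *net;
#endif
} possible_net_t;

/* Stores the namespace pointer, or does nothing without namespaces. */
static inline void write_pnet(possible_net_t *pnet, struct net *net)
{
#ifdef CONFIG_NET_NS
	pnet->net = net;
#else
	(void)pnet;
	(void)net;
#endif
}

/* Returns the stored namespace, or the initial one without namespaces. */
static inline struct net *read_pnet(const possible_net_t *pnet)
{
#ifdef CONFIG_NET_NS
	return pnet->net;
#else
	(void)pnet;
	return &init_net;
#endif
}
```

Because both accessors take a possible_net_t pointer, passing the wrong
kind of variable is a type error in either configuration.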
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>