We have 100+ syzbot reports about netns being dismantled too soon,
still unresolved as of today.
We think a missing get_net() or an extra put_net() is the root cause.
In order to find the bug(s), and be able to spot future ones,
this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
to precisely pair all put_net() with corresponding get_net().
To use these helpers, each data structure owning a refcount
should also use a "netns_tracker" to pair the get and put.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device has an active watchdog timer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device is in lweventlist.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to track all dev_hold()/dev_put() to ease leak hunting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will help debugging pesky netdev reference leaks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net device are refcounted. Over the years we had numerous bugs
caused by imbalanced dev_hold() and dev_put() calls.
The general idea is to be able to precisely pair each decrement with
a corresponding prior increment. Both share a cookie, basically
a pointer to private data storing stack traces.
This patch adds dev_hold_track() and dev_put_track().
To use these helpers, each data structure owning a refcount
should also use a "netdevice_tracker" to pair the hold and put.
netdevice_tracker dev_tracker;
...
dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
...
dev_put_track(dev, &dev_tracker);
Whenever a leak happens, we will get precise stack traces
of the point dev_hold_track() happened, at device dismantle phase.
We will also get a stack trace if too many dev_put_track() for the same
netdevice_tracker are attempted.
This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)
We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev->dev_addr should only be modified via helpers,
but someone may be casting off the const. Add a runtime
check to catch abuses.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. We converted all users to make modifications via appropriate
helpers, make netdev->dev_addr const.
The update helpers need to upcast from the buffer to
struct netdev_hw_addr.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are not fast path, there is no point in inlining them.
Also provide netif_freeze_queues()/netif_unfreeze_queues()
so that we can use them from dev_watchdog() in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In following patches, dev_watchdog() will no longer stop all queues.
It will read queue->trans_start locklessly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.
As we want to no longer disrupt transmits, we use an
atomic_long_t instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same rationale than prior patch : using the dedicated
section avoid holes and pack all these bool values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.
gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is used once, no need to keep it in fat net/core/skbuff.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS for net-next:
1) Add new run_estimation toggle to IPVS to stop the estimation_timer
logic, from Dust Li.
2) Relax superfluous dynset check on NFT_SET_TIMEOUT.
3) Add egress hook, from Lukas Wunner.
4) Nowadays, almost all hook functions in x_table land just call the hook
evaluation loop. Remove remaining hook wrappers from iptables and IPVS.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Qdisc::running sequence counter has two uses:
1. Reliably reading qdisc's tc statistics while the qdisc is running
(a seqcount read/retry loop at gnet_stats_add_basic()).
2. As a flag, indicating whether the qdisc in question is running
(without any retry loops).
For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.
Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().
For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.
Now that the qdisc->running sequence counter is no longer used, remove
it.
Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
such as local ARP traffic generated for clustering purposes or DHCP
(Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)
The egress hook introduced herein complements the ingress hook added by
commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key"). A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.
Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc). On ingress, packets are classified first by
tc, then by netfilter. On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.
Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred). E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
In this case, netfilter egress classifying is not performed when leaving
the host namespace! That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying. That is only logical since it hasn't
passed through netfilter ingress classifying either.
Packets can alternatively be redirected at the netfilter layer using
nft fwd. Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.
Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress(). This ensures that
netfilter is applied both on the overlay and underlying network.
Interaction between tc and netfilter is possible by setting and querying
skb->mark.
If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:
Before: 1537 1538 1538 1537 1538 1537 Mb/sec
After: 1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec
After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec
When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured. That is caused by checking dev->nf_hooks_egress
against NULL.
Measurements were performed on a Core i7-3615QM. Commands to reproduce:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
Accept all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
Drop all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>