Suppose that we input the following commands at first:
# nfacct add test
# iptables -A INPUT -m nfacct --nfacct-name test
And now "test" acct's refcnt is 2, but later when we try to delete the
"test" nfacct and the related iptables rule at the same time, race maybe
happen:
CPU0 CPU1
nfnl_acct_try_del nfnl_acct_put
atomic_dec_and_test //ref=1,testfail -
- atomic_dec_and_test //ref=0,testok
- kfree_rcu
atomic_inc //ref=1 -
So after the rcu grace period, nf_acct will be freed but it is still linked
in the nfnl_acct_list, and we can access it later, then oops will happen.
Convert atomic_dec_and_test and atomic_inc combinaiton to one atomic
operation atomic_cmpxchg here to fix this problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inet_lookup_listener() and inet6_lookup_listener() no longer
take a reference on the found listener.
This minimal patch adds back the refcounting, but we might do
this differently in net-next later.
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Reported-and-tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should report the over quota message to the right net namespace
instead of the init netns.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, if nfnetlink_log.ko is not loaded, we cannot add rules
to log packets to the userspace when we specify it with arp family,
such as:
# nft add rule arp filter input log group 0
<cmdline>:1:1-37: Error: Could not process rule: No such file or
directory
add rule arp filter input log group 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should skip the conntracks that belong to a different namespace,
otherwise other unrelated netns's conntrack entries will be dumped via
/proc/net/nf_conntrack.
Fixes: 56d52d4892 ("netfilter: conntrack: use a single hashtable for all namespaces")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sean Wang says:
====================
mediatek: Fix warning and issue
This patch set fixes the following warning and issues
v1 -> v2: Fix message typos and add coverletter
v2 -> v3: Split from the previous series for submitting bug fixes
as a series targeting 'net'
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Runtime warning occurs if DMA-API debug feature is enabled that would be
raised by pointers passed to DMA API as arguments to inconsistent struct
device objects, so that the patch makes them usage aligned between DMA
operations such as dma_map_*() and dma_unmap_*() to eliminate the warning.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 08ef55c6f2
("net-next: mediatek: fix gigabit and flow control advertisement")
had supported proper flow control settings for GMAC1. But for GMAC0,
1.GMAC0 shares the common logic with GMAC1 inside mtk_phy_link_adjust()
to adapt various settings for the target phy.
2.GMAC0 uses fixed-phy to connect to a builtin gigabit switch with
fixed link speed as commit 0c72c50f6f
("net-next: mediatek: add fixed-phy support") describes.
3.However, fixed-phy doesn't enable SUPPORTED_Pause & SUPPORTED_Asym_Pause
supported flag on default that would cause mtk_phy_link_adjust() not to
enable flow control setting on GMAC0 properly and cause packet dropped
when high traffic.
Due to these reasons, the patch adds SUPPORTED_Pause & SUPPORTED_Asym_Pause
supported flags on fixed-phy used by the driver to have proper handling on
the both GMAC with the shared common logic.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch fixes up the incorrect setup of reduced MII (RMII) on GMAC
and adds the supplement for the setup of reverse MII (REVMII) on GMAC
, and rearranges the error handling for invalid PHY argument.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitaly Kuznetsov says:
====================
hv_netvsc: fixes for VF removal path
Kernel crash is reported after VF is removed and detached from netvsc
device. Turns out we have multiple different (but related) issues on the
VF removal path which I'm trying to address with PATCHes 2-5 of this
series. PATCH1 is required to support the change.
Changes since v1:
- Re-arrange patches in the series to not introduce new issues [David Miller]
- Add PATCH5 which fixes a new issue I discovered while testing.
- Add Haiyang' A-b tags to PATCH1-4
With regards to Stephen's suggestion: I believe that switching to using RCU
and eliminating vf_use_cnt/vf_inject is the right thing to do long-term, we
can either put this on top of this series or do it later in net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding driver sets IFF_BONDING on both master (the bonding device) and
slave (the real NIC) devices and in netvsc_netdev_event() we want to skip
master devices only. Currently, there is an uncertainty when a slave
interface is removed: if bonding module comes first in netdev_chain it
clears IFF_BONDING flag on the netdev and netvsc_netdev_event() correctly
handles NETDEV_UNREGISTER event, but in case netvsc comes first on the
chain it sees the device with IFF_BONDING still attached and skips it. As
we still hold vf_netdev pointer to the device we crash on the next inject.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're not guaranteed to see NETDEV_REGISTER/NETDEV_UNREGISTER notifications
only once per VF but we increase/decrease module refcount unconditionally.
Check vf_netdev to make sure we don't take/release it twice. We presume
that only one VF per netvsc device may exist.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We reset vf_inject on VF going down (netvsc_vf_down()) but we don't on
VF removal (netvsc_unregister_vf()) so vf_inject stays 'true' while
vf_netdev is already NULL and we're trying to inject packets into NULL
net device in netvsc_recv_callback() causing kernel to crash.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a deadlock scenario:
- netvsc_vf_up() schedules netvsc_notify_peers() work and quits.
- netvsc_vf_down() runs before netvsc_notify_peers() gets executed. As it
is being executed from netdev notifier chain we hold rtnl lock when we
get here.
- we enter while (atomic_read(&net_device_ctx->vf_use_cnt) != 0) loop and
wait till netvsc_notify_peers() drops vf_use_cnt.
- netvsc_notify_peers() starts on some other CPU but netdev_notify_peers()
will hang on rtnl_lock().
- deadlock!
Instead of introducing additional synchronization I suggest we drop
gwrk.dwrk completely and call NETDEV_NOTIFY_PEERS directly. As we're
acting under rtnl lock this is legitimate.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct netvsc_device is not suitable for storing VF information as this
structure is being destroyed on MTU change / set channel operation (see
rndis_filter_device_remove()). Move all VF related stuff to struct
net_device_context which is persistent.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that the inner_protocol is set on transmit so that GSO segmentation,
which relies on that field, works correctly.
This is achieved by setting the inner_protocol in gre_build_header rather
than each caller of that function. It ensures that the inner_protocol is
set when gre_fb_xmit() is used to transmit GRE which was not previously the
case.
I have observed this is not the case when OvS transmits GRE using
lwtunnel metadata (which it always does).
Fixes: 3872035241 ("gre: Use inner_proto to obtain inner header protocol")
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ping_v6_sendmsg does not set flowi6_oif in response to
sin6_scope_id or sk_bound_dev_if, so it is not possible to use
these APIs to ping an IPv6 address on a different interface.
Instead, it sets flowi6_iif, which is incorrect but harmless.
Stop setting flowi6_iif, and support various ways of setting oif
in the same priority order used by udpv6_sendmsg.
Tested: https://android-review.googlesource.com/#/c/254470/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In mlxsw_sp_router_fib4_add_info_destroy(), the fib_entry pointer is used
after it has been freed by mlxsw_sp_fib_entry_destroy(). Use a temporary
variable to fix this.
Fixes: 61c503f976 ("mlxsw: spectrum_router: Implement fib4 add/del switchdev obj ops")
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sander reports following splat after netfilter nat bysrc table got
converted to rhashtable:
swapper/0: page allocation failure: order:3, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc1 [..]
[<ffffffff811633ed>] warn_alloc_failed+0xdd/0x140
[<ffffffff811638b1>] __alloc_pages_nodemask+0x3e1/0xcf0
[<ffffffff811a72ed>] alloc_pages_current+0x8d/0x110
[<ffffffff8117cb7f>] kmalloc_order+0x1f/0x70
[<ffffffff811aec19>] __kmalloc+0x129/0x140
[<ffffffff8146d561>] bucket_table_alloc+0xc1/0x1d0
[<ffffffff8146da1d>] rhashtable_insert_rehash+0x5d/0xe0
[<ffffffff819fcfff>] nf_nat_setup_info+0x2ef/0x400
The failure happens when allocating the spinlock array.
Even with GFP_KERNEL its unlikely for such a large allocation
to succeed.
Thomas Graf pointed me at inet_ehash_locks_alloc(), so in addition
to adding NOWARN for atomic allocations this also makes the bucket-array
sizing more conservative.
In commit 095dc8e0c3 ("tcp: fix/cleanup inet_ehash_locks_alloc()"),
Eric Dumazet says: "Budget 2 cache lines per cpu worth of 'spinlocks'".
IOW, consider size needed by a single spinlock when determining
number of locks per cpu. So with 64 byte per cacheline and 4 byte per
spinlock this gives 32 locks per cpu.
Resulting size of the lock-array (sizeof(spinlock) == 4):
cpus: 1 2 4 8 16 32 64
old: 1k 1k 4k 8k 16k 16k 16k
new: 128 256 512 1k 2k 4k 8k
8k allocation should have decent chance of success even
with GFP_ATOMIC, and should not fail with GFP_KERNEL.
With 72-byte spinlock (LOCKDEP):
cpus : 1 2
old: 9k 18k
new: ~2k ~4k
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea for type_check in dev_get_nest_level() was to count the number
of nested devices of the same type (currently, only macvlan or vlan
devices).
This prevented the false positive lockdep warning on configurations such
as:
eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
However, this doesn't prevent a warning on a configuration such as:
eth0 <--- macvlan0 <--- vlan0
eth1 <--- vlan1 <--- macvlan1
In this case, all the locks end up with a nesting subclass of 1, so
lockdep thinks that there is still a deadlock:
- in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
take (vlan_netdev_xmit_lock_key, 1)
- in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
take (macvlan_netdev_addr_lock_key, 1)
By removing the linktype check in dev_get_nest_level() and always
incrementing the nesting depth, lockdep considers this configuration
valid.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, trying to setup a vlan over a macsec device, or other
combinations of devices, triggers a lockdep warning.
Use netdev_lockdep_set_classes and ndo_get_lock_subclass, similar to
what macvlan does.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If IPv6 is disabled when the option is set to keep IPv6
addresses on link down, userspace is unaware of this as
there is no such indication via netlink. The solution is to
remove the IPv6 addresses in this case, which results in
netlink messages indicating removal of addresses in the
usual manner. This fix also makes the behavior consistent
with the case of having IPv6 disabled first, which stops
IPv6 addresses from being added.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_transport_seq_start() does not currently clear iter->start_fail on
success, but relies on it being zero when it is allocated (by
seq_open_net()).
This can be a problem in the following sequence:
open() // allocates iter (and implicitly sets iter->start_fail = 0)
read()
- iter->start() // fails and sets iter->start_fail = 1
- iter->stop() // doesn't call sctp_transport_walk_stop() (correct)
read() again
- iter->start() // succeeds, but doesn't change iter->start_fail
- iter->stop() // doesn't call sctp_transport_walk_stop() (wrong)
We should initialize sctp_ht_iter::start_fail to zero if ->start()
succeeds, otherwise it's possible that we leave an old value of 1 there,
which will cause ->stop() to not call sctp_transport_walk_stop(), which
causes all sorts of problems like not calling rcu_read_unlock() (and
preempt_enable()), eventually leading to more warnings like this:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 16551, name: trinity-c2
Preemption disabled at:[<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
[<ffffffff81149abb>] preempt_count_add+0x1fb/0x280
[<ffffffff83295892>] _raw_spin_lock+0x12/0x40
[<ffffffff819bceb6>] rhashtable_walk_start+0x46/0x150
[<ffffffff82ec665f>] sctp_transport_walk_start+0x2f/0x60
[<ffffffff82edda1d>] sctp_transport_seq_start+0x4d/0x150
[<ffffffff81439e50>] traverse+0x170/0x850
[<ffffffff8143aeec>] seq_read+0x7cc/0x1180
[<ffffffff814f996c>] proc_reg_read+0xbc/0x180
[<ffffffff813d0384>] do_loop_readv_writev+0x134/0x210
[<ffffffff813d2a95>] do_readv_writev+0x565/0x660
[<ffffffff813d6857>] vfs_readv+0x67/0xa0
[<ffffffff813d6c16>] do_preadv+0x126/0x170
[<ffffffff813d710c>] SyS_preadv+0xc/0x10
[<ffffffff8100334c>] do_syscall_64+0x19c/0x410
[<ffffffff83296225>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Notice that this is a subtly different stacktrace from the one in commit
5fc382d875 ("net/sctp: terminate rhashtable walk correctly").
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>