Commit Graph

8037 Commits

Author SHA1 Message Date
Xin Long
3190b649b4 sctp: add SCTP_PLPMTUD_PROBE_INTERVAL sockopt for sock/asoc/transport
With this socket option, users can change probe_interval for
a transport, asoc or sock after it's created.

Note that if the change is for an asoc, also apply the change
to each transport in this asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:51 -07:00
Ido Schimmel
913d026fbf ethtool: Document correct attribute type
'ETHTOOL_A_MODULE_EEPROM_DATA' is a binary attribute, not a nested one.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 10:40:54 -07:00
Jakub Kicinski
adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
Linus Torvalds
9ed13a17e3 Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
  bluetooth, netfilter and can.

  Current release - regressions:

   - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
     to fix modifying offloaded qdiscs

   - lantiq: net: fix duplicated skb in rx descriptor ring

   - rtnetlink: fix regression in bridge VLAN configuration, empty info
     is not an error, bot-generated "fix" was not needed

   - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
     creation

  Current release - new code bugs:

   - ethtool: fix NULL pointer dereference during module EEPROM dump via
     the new netlink API

   - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
     queue should not be visible to the stack

   - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs

   - mlx5e: verify dev is present in get devlink port ndo, avoid a panic

  Previous releases - regressions:

   - neighbour: allow NUD_NOARP entries to be force GCed

   - further fixes for fallout from reorg of WiFi locking (staging:
     rtl8723bs, mac80211, cfg80211)

   - skbuff: fix incorrect msg_zerocopy copy notifications

   - mac80211: fix NULL ptr deref for injected rate info

   - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs

  Previous releases - always broken:

   - bpf: more speculative execution fixes

   - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local

   - udp: fix race between close() and udp_abort() resulting in a panic

   - fix out of bounds when parsing TCP options before packets are
     validated (in netfilter: synproxy, tc: sch_cake and mptcp)

   - mptcp: improve operation under memory pressure, add missing
     wake-ups

   - mptcp: fix double-lock/soft lookup in subflow_error_report()

   - bridge: fix races (null pointer deref and UAF) in vlan tunnel
     egress

   - ena: fix DMA mapping function issues in XDP

   - rds: fix memory leak in rds_recvmsg

  Misc:

   - vrf: allow larger MTUs

   - icmp: don't send out ICMP messages with a source address of 0.0.0.0

   - cdc_ncm: switch to eth%d interface naming"

* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
  net: ethernet: fix potential use-after-free in ec_bhf_remove
  selftests/net: Add icmp.sh for testing ICMP dummy address responses
  icmp: don't send out ICMP messages with a source address of 0.0.0.0
  net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
  net: ll_temac: Fix TX BD buffer overwrite
  net: ll_temac: Add memory-barriers for TX BD access
  net: ll_temac: Make sure to free skb when it is completely used
  MAINTAINERS: add Guvenc as SMC maintainer
  bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
  bnxt_en: Fix TQM fastpath ring backing store computation
  bnxt_en: Rediscover PHY capabilities after firmware reset
  cxgb4: fix wrong shift.
  mac80211: handle various extensible elements correctly
  mac80211: reset profile_periodicity/ema_ap
  cfg80211: avoid double free of PMSR request
  cfg80211: make certificate generation more robust
  mac80211: minstrel_ht: fix sample time check
  net: qed: Fix memcpy() overflow of qed_dcbx_params()
  net: cdc_eem: fix tx fixup skb leak
  net: hamradio: fix memory leak in mkiss_close
  ...
2021-06-18 18:55:29 -07:00
David S. Miller
103ebe658a Revert "net: add pf_family_names[] for protocol family"
This reverts commit 1f3c98eadd.

Does not build...

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 13:02:45 -07:00
Yejune Deng
1f3c98eadd net: add pf_family_names[] for protocol family
Modify the pr_info content from int to char *, this looks more readable.

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 13:02:21 -07:00
Toke Høiland-Jørgensen
3218274773 icmp: don't send out ICMP messages with a source address of 0.0.0.0
When constructing ICMP response messages, the kernel will try to pick a
suitable source address for the outgoing packet. However, if no IPv4
addresses are configured on the system at all, this will fail and we end up
producing an ICMP message with a source address of 0.0.0.0. This can happen
on a box routing IPv4 traffic via v6 nexthops, for instance.

Since 0.0.0.0 is not generally routable on the internet, there's a good
chance that such ICMP messages will never make it back to the sender of the
original packet that the ICMP message was sent in response to. This, in
turn, can create connectivity and PMTUd problems for senders. Fortunately,
RFC7600 reserves a dummy address to be used as a source for ICMP
messages (192.0.0.8/32), so let's teach the kernel to substitute that
address as a last resort if the regular source address selection procedure
fails.

Below is a quick example reproducing this issue with network namespaces:

ip netns add ns0
ip l add type veth peer netns ns0
ip l set dev veth0 up
ip a add 10.0.0.1/24 dev veth0
ip a add fc00:dead:cafe:42::1/64 dev veth0
ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2
ip -n ns0 l set dev veth0 up
ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0
ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1
ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0
ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1
tcpdump -tpni veth0 -c 2 icmp &
ping -w 1 10.1.0.1 > /dev/null
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64
IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92
2 packets captured
2 packets received by filter
0 packets dropped by kernel

With this patch the above capture changes to:
IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64
IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Juliusz Chroboczek <jch@irif.fr>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 12:13:24 -07:00
Geliang Tang
752e906732 mptcp: add csum_enabled in mptcp_sock
This patch added a new member named csum_enabled in struct mptcp_sock,
used a dummy mptcp_is_checksum_enabled() helper to initialize it.

Also added a new member named mptcpi_csum_enabled in struct mptcp_info
to expose the csum_enabled flag.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 11:40:11 -07:00
Andrea Mayer
8b532109bf seg6: add support for SRv6 End.DT46 Behavior
IETF RFC 8986 [1] includes the definition of SRv6 End.DT4, End.DT6, and
End.DT46 Behaviors.

The current SRv6 code in the Linux kernel only implements End.DT4 and
End.DT6 which can be used respectively to support IPv4-in-IPv6 and
IPv6-in-IPv6 VPNs. With End.DT4 and End.DT6 it is not possible to create a
single SRv6 VPN tunnel to carry both IPv4 and IPv6 traffic.

The proposed End.DT46 implementation is meant to support the decapsulation
of IPv4 and IPv6 traffic coming from a single SRv6 tunnel.
The implementation of the SRv6 End.DT46 Behavior in the Linux kernel
greatly simplifies the setup and operations of SRv6 VPNs.

The SRv6 End.DT46 Behavior leverages the infrastructure of SRv6 End.DT{4,6}
Behaviors implemented so far, because it makes use of a VRF device in
order to force the routing lookup into the associated routing table.

To make the End.DT46 work properly, it must be guaranteed that the routing
table used for routing lookup operations is bound to one and only one VRF
during the tunnel creation. Such constraint has to be enforced by enabling
the VRF strict_mode sysctl parameter, i.e.:

 $ sysctl -wq net.vrf.strict_mode=1

Note that the same approach is used for the SRv6 End.DT4 Behavior and for
the End.DT6 Behavior in VRF mode.

The command used to instantiate an SRv6 End.DT46 Behavior is
straightforward, i.e.:

 $ ip -6 route add 2001:db8::1 encap seg6local action End.DT46 vrftable 100 dev vrf100.

[1] https://www.rfc-editor.org/rfc/rfc8986.html#name-enddt46-decapsulation-and-s

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Performance and impact of SRv6 End.DT46 Behavior on the SRv6 Networking
=======================================================================

This patch aims to add the SRv6 End.DT46 Behavior with minimal impact on
the performance of SRv6 End.DT4 and End.DT6 Behaviors.
In order to verify this, we tested the performance of the newly introduced
SRv6 End.DT46 Behavior and compared it with the performance of SRv6
End.DT{4,6} Behaviors, considering both the patched kernel and the kernel
before applying the End.DT46 patch (referred to as vanilla kernel).

In details, the following decapsulation scenarios were considered:

 1.a) IPv6 traffic in SRv6 End.DT46 Behavior on patched kernel;
 1.b) IPv4 traffic in SRv6 End.DT46 Behavior on patched kernel;
 2.a) SRv6 End.DT6 Behavior (VRF mode) on patched kernel;
 2.b) SRv6 End.DT4 Behavior on patched kernel;
 3.a) SRv6 End.DT6 Behavior (VRF mode) on vanilla kernel (without the
      End.DT46 patch);
 3.b) SRv6 End.DT4 Behavior on vanilla kernel (without the End.DT46 patch).

All tests were performed on a testbed deployed on the CloudLab [2]
facilities. We considered IPv{4,6} traffic handled by a single core (at 2.4
GHz on a Xeon(R) CPU E5-2630 v3) on kernel 5.13-rc1 using packets of size
~ 100 bytes.

Scenario (1.a): average 684.70 kpps; std. dev. 0.7 kpps;
Scenario (1.b): average 711.69 kpps; std. dev. 1.2 kpps;
Scenario (2.a): average 690.70 kpps; std. dev. 1.2 kpps;
Scenario (2.b): average 722.22 kpps; std. dev. 1.7 kpps;
Scenario (3.a): average 690.02 kpps; std. dev. 2.6 kpps;
Scenario (3.b): average 721.91 kpps; std. dev. 1.2 kpps;

Considering the results for the patched kernel (1.a, 1.b, 2.a, 2.b) we
observe that the performance degradation incurred in using End.DT46 rather
than End.DT6 and End.DT4 respectively for IPv6 and IPv4 traffic is minimal,
around 0.9% and 1.5%. Such very minimal performance degradation is the
price to be paid if one prefers to use a single tunnel capable of handling
both types of traffic (IPv4 and IPv6).

Comparing the results for End.DT4 and End.DT6 under the patched and the
vanilla kernel (2.a, 2.b, 3.a, 3.b) we observe that the introduction of the
End.DT46 patch has no impact on the performance of End.DT4 and End.DT6.

[2] https://www.cloudlab.us

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 11:35:47 -07:00
David S. Miller
a52171ae7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-17

The following pull-request contains BPF updates for your *net-next* tree.

We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).

The main changes are:

1) BPF infrastructure to migrate TCP child sockets from a listener to another
   in the same reuseport group/map, from Kuniyuki Iwashima.

2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
   noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.

3) Streamline error reporting changes in libbpf as planned out in the
   'libbpf: the road to v1.0' effort, from Andrii Nakryiko.

4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.

5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
   types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.

6) Support new LLVM relocations in libbpf to make them more linker friendly,
   also add a doc to describe the BPF backend relocations, from Yonghong Song.

7) Silence long standing KUBSAN complaints on register-based shifts in
   interpreter, from Daniel Borkmann and Eric Biggers.

8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
   target arch cannot be determined, from Lorenz Bauer.

9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.

10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.

11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
    programs to bpf_helpers.h header, from Florent Revest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17 11:54:56 -07:00
Jian Shen
2d8ea148e5 net: fix mistake path for netdev_features_strings
Th_strings arrays netdev_features_strings, tunable_strings, and
phy_tunable_strings has been moved to file net/ethtool/common.c.
So fixes the comment.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17 11:32:50 -07:00
Guvenc Gulce
f0dd7bf5e3 net/smc: Add netlink support for SMC fallback statistics
Add support to collect more detailed SMC fallback reason statistics and
provide these statistics to user space on the netlink interface.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Guvenc Gulce
8c40602b4b net/smc: Add netlink support for SMC statistics
Add the netlink function which collects the statistics information and
delivers it to the userspace.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Kuniyuki Iwashima
d5e4ddaeb6 bpf: Support socket migration by eBPF.
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
to check if the attached eBPF program is capable of migrating sockets. When
the eBPF program is attached, we run it for socket migration if the
expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
net.ipv4.tcp_migrate_req is enabled.

Currently, the expected_attach_type is not enforced for the
BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
earlier idea in the commit aac3fc320d ("bpf: Post-hooks for sys_bind") to
fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().

Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to
select a new listener based on the child socket. migrating_sk varies
depending on if it is migrating a request in the accept queue or during
3WHS.

  - accept_queue : sock (ESTABLISHED/SYN_RECV)
  - 3WHS         : request_sock (NEW_SYN_RECV)

In the eBPF program, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fallbacks to the random selection
  - SK_DROP, cancel the migration.

There is a noteworthy point. We select a listening socket in three places,
but we do not have struct skb at closing a listener or retransmitting a
SYN+ACK. On the other hand, some helper functions do not expect skb is NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb
temporarily before running the eBPF program.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-15 18:01:06 +02:00
Kuniyuki Iwashima
e061047684 bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.
We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing to select a new
listener.

We can currently get a unique ID of each listener in the userspace by
calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.

This patch makes the pointer of sk available in sk_reuseport_md so that we
can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
2021-06-15 18:01:06 +02:00
Johannes Berg
88b710532e wwan: add interface creation support
Add support to create (and destroy) interfaces via a new
rtnetlink kind "wwan". The responsible driver has to use
the new wwan_register_ops() to make this possible.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12 13:16:45 -07:00
Johannes Berg
00e77ed8e6 rtnetlink: add IFLA_PARENT_[DEV|DEV_BUS]_NAME
In some cases, for example in the upcoming WWAN framework changes,
there's no natural "parent netdev", so sometimes dummy netdevs are
created or similar. IFLA_PARENT_DEV_NAME is a new attribute intended to
contain a device (sysfs, struct device) name that can be used instead
when creating a new netdev, if the rtnetlink family implements it.

As suggested by Parav Pandit, we also introduce IFLA_PARENT_DEV_BUS_NAME
attribute in order to uniquely identify a device on the system (with
bus/name pair).

ip-link(8) support for the generic parent device attributes will help
us avoid code duplication, so no other link type will require a custom
code to handle the parent name attribute. E.g. the WWAN interface
creation command will looks like this:

$ ip link add wwan0-1 parent-dev wwan0 type wwan channel-id 1

So, some future subsystem (or driver) FOO will have an interface
creation command that looks like this:

$ ip link add foo1-3 parent-dev foo1 type foo bar-id 3 baz-type Y

Below is an example of dumping link info of a random device with these
new attributes:

$ ip --details link show wlp0s20f3
  4: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
     state UP mode DORMANT group default qlen 1000
     ...
     parent_bus pci parent_dev 0000:00:14.3

Co-developed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Co-developed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Suggested-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12 13:16:45 -07:00
Linus Torvalds
b2568eeb96 Merge tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Just an API change for the registration changes that went into this
  release. Better to get it sorted out now than before it's too late"

* tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block:
  io_uring: add feature flag for rsrc tags
  io_uring: change registration/upd/rsrc tagging ABI
2021-06-12 11:53:20 -07:00
Arseny Krasnov
f07b2a5b04 virtio/vsock: defines and constants for SEQPACKET
Add set of defines and constants for SOCK_SEQPACKET support
in vsock.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 13:32:47 -07:00
Pavel Begunkov
9690557e22 io_uring: add feature flag for rsrc tags
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of
new IORING_REGISTER operations, in particular
IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc
tagging, and also indicating implemented dynamic fixed buffer updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
Pavel Begunkov
992da01aa9 io_uring: change registration/upd/rsrc tagging ABI
There are ABI moments about recently added rsrc registration/update and
tagging that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hide different types of resources under it,
so breaks fine control over them by restrictions. It works for now, but
once those are wanted under restrictions it would require a rework.

It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so better
to return to IORING_REGISTER_* top level dispatching.

Second, register/update were considered to accept a type of resource,
however that's not a good idea because there might be several ways of
registration of a single resource type, e.g. we may want to add
non-contig buffers or anything more exquisite as dma mapped memory.
So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them
internally for now to limit changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
Chen Li
d409989b59 netlink: simplify NLMSG_DATA with NLMSG_HDRLEN
The NLMSG_LENGTH(0) may confuse the API users,
NLMSG_HDRLEN is much more clear.

Besides, some code style problems are also fixed.
Signed-off-by: Chen Li <chenli@uniontech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10 12:51:33 -07:00
David S. Miller
7f3579e189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add nfgenmsg field to nfnetlink's struct nfnl_info and use it.

2) Remove nft_ctx_init_from_elemattr() and nft_ctx_init_from_setattr()
   helper functions.

3) Add the nf_ct_pernet() helper function to fetch the conntrack
   pernetns data area.

4) Expose TCP and UDP flowtable offload timeouts through sysctl,
   from Oz Shlomo.

5) Add nfnetlink_hook subsystem to fetch the netfilter hook
   pipeline configuration, from Florian Westphal. This also includes
   a new field to annotate the hook type as metadata.

6) Fix unsafe memory access to non-linear skbuff in the new SCTP
   chunk support for nft_exthdr, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 14:50:35 -07:00
David S. Miller
126285651b Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Bug fixes overlapping feature additions and refactoring, mostly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 13:01:52 -07:00
Florian Westphal
e2cf17d377 netfilter: add new hook nfnl subsystem
This nfnl subsystem allows to dump the list of all active netfiler hooks,
e.g. defrag, conntrack, nf/ip/arp/ip6tables and so on.

This helps to see what kind of features are currently enabled in
the network stack.

Sample output from nft tool using this infra:

 $ nft list hook ip input
 family ip hook input {
   +0000000010 nft_do_chain_inet [nf_tables] # nft table firewalld INPUT
   +0000000100 nf_nat_ipv4_local_in [nf_nat]
   +2147483647 ipv4_confirm [nf_conntrack]
 }

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-07 12:41:10 +02:00