Commit Graph

6933 Commits

Author SHA1 Message Date
Linus Torvalds 96144c58ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix cfg80211 deadlock, from Johannes Berg.

 2) RXRPC fails to send norigications, from David Howells.

 3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

 4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

 5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

 6) Fix hashtable memory leak in dccp, from Wang Hai.

 7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

 8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

 9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: ethernet: ti: ale: fix allmulti for nu type ale
  net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
  net: atm: Remove the error message according to the atomic context
  bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
  libbpf: Support pre-initializing .bss global variables
  tools/bpftool: Fix skeleton codegen
  bpf: Fix memlock accounting for sock_hash
  bpf: sockmap: Don't attach programs to UDP sockets
  bpf: tcp: Recv() should return 0 when the peer socket is closed
  ibmvnic: Flush existing work items before device removal
  genetlink: clean up family attributes allocations
  net: ipa: header pad field only valid for AP->modem endpoint
  net: ipa: program upper nibbles of sequencer type
  net: ipa: fix modem LAN RX endpoint id
  net: ipa: program metadata mask differently
  ionic: add pcie_print_link_status
  rxrpc: Fix race between incoming ACK parser and retransmitter
  net/mlx5: E-Switch, Fix some error pointer dereferences
  net/mlx5: Don't fail driver on failure to create debugfs
  net/mlx5e: CT: Fix ipv6 nat header rewrite actions
  ...
2020-06-13 16:27:13 -07:00
David S. Miller fa7566a0d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-06-12

The following pull-request contains BPF updates for your *net* tree.

We've added 26 non-merge commits during the last 10 day(s) which contain
a total of 27 files changed, 348 insertions(+), 93 deletions(-).

The main changes are:

1) sock_hash accounting fix, from Andrey.

2) libbpf fix and probe_mem sanitizing, from Andrii.

3) sock_hash fixes, from Jakub.

4) devmap_val fix, from Jesper.

5) load_bytes_relative fix, from YiFei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-13 15:28:08 -07:00
Andrey Ignatov 60e5ca8a64 bpf: Fix memlock accounting for sock_hash
Add missed bpf_map_charge_init() in sock_hash_alloc() and
correspondingly bpf_map_charge_finish() on ENOMEM.

It was found accidentally while working on unrelated selftest that
checks "map->memory.pages > 0" is true for all map types.

Before:
	# bpftool m l
	...
	3692: sockhash  name m_sockhash  flags 0x0
		key 4B  value 4B  max_entries 8  memlock 0B

After:
	# bpftool m l
	...
	84: sockmap  name m_sockmap  flags 0x0
		key 4B  value 4B  max_entries 8  memlock 4096B

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com
2020-06-12 15:21:29 -07:00
Lorenz Bauer f6fede8569 bpf: sockmap: Don't attach programs to UDP sockets
The stream parser infrastructure isn't set up to deal with UDP
sockets, so we mustn't try to attach programs to them.

I remember making this change at some point, but I must have lost
it while rebasing or something similar.

Fixes: 7b98cd42b0 ("bpf: sockmap: Add UDP support")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200611172520.327602-1-lmb@cloudflare.com
2020-06-12 15:13:43 -07:00
YiFei Zhu 0f5d82f187 net/filter: Permit reading NET in load_bytes_relative when MAC not set
Added a check in the switch case on start_header that checks for
the existence of the header, and in the case that MAC is not set
and the caller requests for MAC, -EFAULT. If the caller requests
for NET then MAC's existence is completely ignored.

There is no function to check NET header's existence and as far
as cgroup_skb/egress is concerned it should always be set.

Removed for ptr >= the start of header, considering offset is
bounded unsigned and should always be true. len <= end - mac is
redundant to ptr + len <= end.

Fixes: 3eee1f75f2 ("bpf: fix bpf_skb_load_bytes_relative pkt length check")
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com
2020-06-11 16:05:56 +02:00
Linus Torvalds 1c38372662 Merge branch 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysctl fixes from Al Viro:
 "Fixups to regressions in sysctl series"

* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysctl: reject gigantic reads/write to sysctl files
  cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
  trace: fix an incorrect __user annotation on stack_trace_sysctl
  random: fix an incorrect __user annotation on proc_do_entropy
  net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
  net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
2020-06-10 16:05:54 -07:00
Cong Wang 845e0ebb44 net: change addr_list_lock back to static key
The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:

CPU 0:				CPU 1:
(RCU read lock)			(RTNL lock)
dev_mc_seq_show()		netdev_update_lockdep_key()
				  -> lockdep_unregister_key()
 -> netif_addr_lock_bh()

because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.

In commit 1a33e10e4a ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68fc2
("net: core: add generic lockdep keys").

This patch reverts the rest and also part of commit f3b0a18bb6
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().

And hopefully this reduces some syzbot lockdep noises too.

Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-09 12:59:45 -07:00
Jakub Sitnicki 75e68e5bf2 bpf, sockhash: Synchronize delete from bucket list on map free
We can end up modifying the sockhash bucket list from two CPUs when a
sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
that is in the sockhash is unlinking itself from it on another CPU
it (sock_hash_delete_from_link).

This results in accessing a list element that is in an undefined state as
reported by KASAN:

| ==================================================================
| BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
| Write of size 8 at addr dead000000000122 by task kworker/2:1/95
|
| CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x97/0xe0
|  ? sock_hash_free+0x13c/0x280
|  __kasan_report.cold+0x5/0x40
|  ? mark_lock+0xbc1/0xc00
|  ? sock_hash_free+0x13c/0x280
|  kasan_report+0x38/0x50
|  ? sock_hash_free+0x152/0x280
|  sock_hash_free+0x13c/0x280
|  bpf_map_free_deferred+0xb2/0xd0
|  ? bpf_map_charge_finish+0x50/0x50
|  ? rcu_read_lock_sched_held+0x81/0xb0
|  ? rcu_read_lock_bh_held+0x90/0x90
|  process_one_work+0x59a/0xac0
|  ? lock_release+0x3b0/0x3b0
|  ? pwq_dec_nr_in_flight+0x110/0x110
|  ? rwlock_bug.part.0+0x60/0x60
|  worker_thread+0x7a/0x680
|  ? _raw_spin_unlock_irqrestore+0x4c/0x60
|  kthread+0x1cc/0x220
|  ? process_one_work+0xac0/0xac0
|  ? kthread_create_on_node+0xa0/0xa0
|  ret_from_fork+0x24/0x30
| ==================================================================

Fix it by reintroducing spin-lock protected critical section around the
code that removes the elements from the bucket on sockhash free.

To do that we also need to defer processing of removed elements, until out
of atomic context so that we can unlink the socket from the map when
holding the sock lock.

Fixes: 90db6d772f ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com
2020-06-09 10:59:04 -07:00
Jakub Sitnicki 33a7c83156 bpf, sockhash: Fix memory leak when unlinking sockets in sock_hash_free
When sockhash gets destroyed while sockets are still linked to it, we will
walk the bucket lists and delete the links. However, we are not freeing the
list elements after processing them, leaking the memory.

The leak can be triggered by close()'ing a sockhash map when it still
contains sockets, and observed with kmemleak:

  unreferenced object 0xffff888116e86f00 (size 64):
    comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff  ...A.....i/.....
    backtrace:
      [<00000000dd089ebb>] sock_hash_update_common+0x4ca/0x760
      [<00000000b8219bd5>] sock_hash_update_elem+0x1d2/0x200
      [<000000005e2c23de>] __do_sys_bpf+0x2046/0x2990
      [<00000000d0084618>] do_syscall_64+0xad/0x9a0
      [<000000000d96f263>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fix it by freeing the list element when we're done with it.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com
2020-06-09 10:59:04 -07:00
Christoph Hellwig 56965ac725 net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
cpumask_parse_user works on __user pointers, so this is wrong now.

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reported-by: build test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-08 10:13:56 -04:00
Ahmed S. Darwish 11d6011c2c net: core: device_rename: Use rwsem instead of a seqcount
Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

Commit 5dbe7c178d ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.

To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.

Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.

>From all the above, replace the seqcount with a rwsem.

Fixes: 5dbe7c178d (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93 (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <stable@vger.kernel.org>
Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ]
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:50:42 -07:00
Ahmed Abdelsalam bb986a5042 seg6: fix seg6_validate_srh() to avoid slab-out-of-bounds
The seg6_validate_srh() is used to validate SRH for three cases:

case1: SRH of data-plane SRv6 packets to be processed by the Linux kernel.
Case2: SRH of the netlink message received  from user-space (iproute2)
Case3: SRH injected into packets through setsockopt

In case1, the SRH can be encoded in the Reduced way (i.e., first SID is
carried in DA only and not represented as SID in the SRH) and the
seg6_validate_srh() now handles this case correctly.

In case2 and case3, the SRH shouldn’t be encoded in the Reduced way
otherwise we lose the first segment (i.e., the first hop).

The current implementation of the seg6_validate_srh() allow SRH of case2
and case3 to be encoded in the Reduced way. This leads a slab-out-of-bounds
problem.

This patch verifies SRH of case1, case2 and case3. Allowing case1 to be
reduced while preventing SRH of case2 and case3 from being reduced .

Reported-by: syzbot+e8c028b62439eac42073@syzkaller.appspotmail.com
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 0cb7498f23 ("seg6: fix SRH processing to comply with RFC8754")
Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:39:32 -07:00
Matthieu Baerts 49b2357594 bpf: Fix unused-var without NETDEVICES
A recent commit added new variables only used if CONFIG_NETDEVICES is
set. A simple fix would be to only declare these variables if the same
condition is valid but Alexei suggested an even simpler solution:

    since CONFIG_NETDEVICES doesn't change anything in .h I think the
    best is to remove #ifdef CONFIG_NETDEVICES from net/core/filter.c
    and rely on sock_bindtoindex() returning ENOPROTOOPT in the extreme
    case of oddly configured kernels.

Fixes: 70c58997c1 ("bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200603190347.2310320-1-matthieu.baerts@tessares.net
2020-06-04 22:52:10 +02:00
Linus Torvalds cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds e7c93cbfe9 Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread updates from Christian Brauner:
 "We have been discussing using pidfds to attach to namespaces for quite
  a while and the patches have in one form or another already existed
  for about a year. But I wanted to wait to see how the general api
  would be received and adopted.

  This contains the changes to make it possible to use pidfds to attach
  to the namespaces of a process, i.e. they can be passed as the first
  argument to the setns() syscall.

  When only a single namespace type is specified the semantics are
  equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
  equals setns(pidfd, CLONE_NEWNET).

  However, when a pidfd is passed, multiple namespace flags can be
  specified in the second setns() argument and setns() will attach the
  caller to all the specified namespaces all at once or to none of them.

  Specifying 0 is not valid together with a pidfd. Here are just two
  obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

  Allowing to also attach subsets of namespaces supports various
  use-cases where callers setns to a subset of namespaces to retain
  privilege, perform an action and then re-attach another subset of
  namespaces.

  Apart from significantly reducing the number of syscalls needed to
  attach to all currently supported namespaces (eight "open+setns"
  sequences vs just a single "setns()"), this also allows atomic setns
  to a set of namespaces, i.e. either attaching to all namespaces
  succeeds or we fail without having changed anything.

  This is centered around a new internal struct nsset which holds all
  information necessary for a task to switch to a new set of namespaces
  atomically. Fwiw, with this change a pidfd becomes the only token
  needed to interact with a container. I'm expecting this to be
  picked-up by util-linux for nsenter rather soon.

  Associated with this change is a shiny new test-suite dedicated to
  setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
2020-06-03 13:12:57 -07:00
Daniel Borkmann 7cdec54f97 bpf: Add csum_level helper for fixing up csum levels
Add a bpf_csum_level() helper which BPF programs can use in combination
with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
flag to the latter to avoid falling back to CHECKSUM_NONE.

The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels
via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary()
on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's
csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the
current level. Without this helper, there is no way to otherwise adjust the
skb->csum_level. I did not add an extra dummy flags as there is plenty of free
bitspace in level argument itself iff ever needed in future.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net
2020-06-02 11:50:23 -07:00
Daniel Borkmann 836e66c218 bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
Lorenz recently reported:

  In our TC classifier cls_redirect [0], we use the following sequence of
  helper calls to decapsulate a GUE (basically IP + UDP + custom header)
  encapsulated packet:

    bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

  It seems like some checksums of the inner headers are not validated in
  this case. For example, a TCP SYN packet with invalid TCP checksum is
  still accepted by the network stack and elicits a SYN ACK. [...]

  That is, we receive the following packet from the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
  On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
  into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
  turned into a no-op due to CHECKSUM_UNNECESSARY.

The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
from opting out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually e.g. through bpf_csum_level() helper that is added in subsequent
patch.

The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
as well but doesn't change layers, only transitions between v4 to v6 and vice
versa, therefore no adoption is required there.

  [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

Fixes: 2be7e212d5 ("bpf: add bpf_skb_adjust_room helper")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net
2020-06-02 11:50:23 -07:00
Jakub Sitnicki b27f7bb590 flow_dissector: Move out netns_bpf prog callbacks
Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.

The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.

This is a code move with no functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Jakub Sitnicki a3fd7ceee0 net: Introduce netns_bpf for BPF programs attached to netns
In order to:

 (1) attach more than one BPF program type to netns, or
 (2) support attaching BPF programs to netns with bpf_link, or
 (3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Jakub Sitnicki 171526f6fe flow_dissector: Pull locking up from prog attach callback
Split out the part of attach callback that happens with attach/detach lock
acquired. This structures the prog attach callback in a way that opens up
doors for moving the locking out of flow_dissector and into generic
callbacks for attaching/detaching progs to netns in subsequent patches.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Ferenc Fejes 70c58997c1 bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt
Extending the supported sockopts in bpf_setsockopt with
SO_BINDTODEVICE. We call sock_bindtoindex with parameter
lock_sk = false in this context because we already owning
the socket.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu
2020-06-01 14:57:14 -07:00
Ferenc Fejes 8ea204c2b6 net: Make locking in sock_bindtoindex optional
The sock_bindtoindex intended for kernel wide usage however
it will lock the socket regardless of the context. This modification
relax this behavior optionally: locking the socket will be optional
by calling the sock_bindtoindex with lock_sk = true.

The modification applied to all users of the sock_bindtoindex.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu
2020-06-01 14:57:14 -07:00
John Fastabend e91de6afa8 bpf: Fix running sk_skb program types with ktls
KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.

The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.

At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.

We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.

Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.

Discovered while testing OpenSSL 3.0 Alpha2.0 release.

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
John Fastabend ca2f5f21db bpf: Refactor sockmap redirect code so its easy to reuse
We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.

No logic changes are intended.

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern 64b59025c1 xdp: Add xdp_txq_info to xdp_buff
Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00