Commit Graph

829 Commits

Author SHA1 Message Date
Mianhan Liu
a1554c0026 include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h
nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
The advantage of this change is that it can reduce the obsolete
includes.  For example, net/ipv4/tcp.c wouldn't need swap.h any more
since it has already included mm.h.  Similarly, after checking all the
other files, it comes that tcp.c, udp.c meter.c ,...  follow the same
rule, so these files can have swap.h removed too.

Moreover, after preprocessing all the files that use
nr_free_buffer_pages, it turns out that those files have already
included mm.h.Thus, we can move nr_free_buffer_pages from swap.h to mm.h
safely.  This change will not affect the compilation of other files.

Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Cc: Jakub Kicinski <kuba@kernel.org>
CC: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:43 -07:00
Jakub Kicinski
f444fea789 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/ptp/Kconfig:
  55c8fca1da ("ptp_pch: Restore dependency on PCI")
  e5f3155267 ("ethernet: fix PTP_1588_CLOCK dependencies")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-19 18:09:18 -07:00
kaixi.fan
01634047bf ovs: clear skb->tstamp in forwarding path
fq qdisc requires tstamp to be cleared in the forwarding path. Now ovs
doesn't clear skb->tstamp. We encountered a problem with linux
version 5.4.56 and ovs version 2.14.1, and packets failed to
dequeue from qdisc when fq qdisc was attached to ovs port.

Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: kaixi.fan <fankaixi.li@bytedance.com>
Signed-off-by: xiexiaohui <xiexiaohui.xxh@bytedance.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-18 11:31:13 +01:00
Jakub Kicinski
f4083a752a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
  9e26680733 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
  9e518f2580 ("bnxt_en: 1PPS functions to configure TSIO pins")
  099fdeda65 ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
  a2baf4e8bb ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
  c7603cfa04 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
  5957cc557d ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
  2d0b41a376 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
  7b637cd52f ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
  7d901a1e87 ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 06:41:22 -07:00
Randy Dunlap
d6e712aa7e net: openvswitch: fix kernel-doc warnings in flow.c
Repair kernel-doc notation in a few places to make it conform to
the expected format.

Fixes the following kernel-doc warnings:

flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Parse vlan tag from vlan header.
flow.c:296: warning: missing initial short description on line:
 * Parse vlan tag from vlan header.
flow.c:537: warning: No description found for return value of 'key_extract_l3l4'
flow.c:769: warning: No description found for return value of 'key_extract'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: dev@openvswitch.org
Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09 15:37:35 -07:00
Mark Gray
076999e460 openvswitch: fix sparse warning incorrect type
fix incorrect type in argument 1 (different address spaces)

../net/openvswitch/datapath.c:169:17: warning: incorrect type in argument 1 (different address spaces)
../net/openvswitch/datapath.c:169:17:    expected void const *
../net/openvswitch/datapath.c:169:17:    got struct dp_nlsk_pids [noderef] __rcu *upcall_portids

Found at: https://patchwork.kernel.org/project/netdevbpf/patch/20210630095350.817785-1-mark.d.gray@redhat.com/#24285159

Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-27 11:48:43 +01:00
Mark Gray
784dcfa56e openvswitch: fix alignment issues
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-27 11:48:42 +01:00
Mark Gray
b83d23a2a3 openvswitch: Introduce per-cpu upcall dispatch
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is a correspondingly
large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.

The corresponding user space code can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html

Bugzilla: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 11:06:33 -07:00
Baowen Zheng
b18114476a openvswitch: Optimize operation for key comparison
In the current implement when comparing two flow keys, we will return
result after comparing the whole key from start to end.

In our optimization, we will return result in the first none-zero
comparison, then we will improve the flow table looking up efficiency.

Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 11:13:10 -07:00
Aaron Conole
c4ab7b56be openvswitch: add trace points
This makes openvswitch module use the event tracing framework
to log the upcall interface and action execution pipeline.  When
using openvswitch as the packet forwarding engine, some types of
debugging are made possible simply by using the ovs-vswitchd's
ofproto/trace command.  However, such a command has some
limitations:

  1. When trying to trace packets that go through the CT action,
     the state of the packet can't be determined, and probably
     would be potentially wrong.

  2. Deducing problem packets can sometimes be difficult as well
     even if many of the flows are known

  3. It's possible to use the openvswitch module even without
     the ovs-vswitchd (although, not common use).

Introduce the event tracing points here to make it possible for
working through these problems in kernel space.  The style is
copied from the mac80211 driver-trace / trace code for
consistency - this creates some checkpatch splats, but the
official 'guide' for adding tracepoints, as well as the existing
examples all add the same splats so it seems acceptable.

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 10:47:32 -07:00
Jakub Kicinski
5ada57a9a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 09:55:10 -07:00
Tao Liu
e4df1b0c24 openvswitch: meter: fix race when getting now_ms.
We have observed meters working unexpected if traffic is 3+Gbit/s
with multiple connections.

now_ms is not pretected by meter->lock, we may get a negative
long_delta_ms when another cpu updated meter->used, then:
    delta_ms = (u32)long_delta_ms;
which will be a large value.

    band->bucket += delta_ms * band->rate;
then we get a wrong band->bucket.

OpenVswitch userspace datapath has fixed the same issue[1] some
time ago, and we port the implementation to kernel datapath.

[1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/

Fixes: 96fbc13d7e ("openvswitch: Add meter infrastructure")
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13 15:54:59 -07:00
Yejune Deng
d2792e91de net: openvswitch: Remove unnecessary skb_nfct()
There is no need add 'if (skb_nfct(skb))' assignment, the
nf_conntrack_put() would check it.

Signed-off-by: Yejune Deng <yejunedeng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-10 14:18:19 -07:00
Davide Caratti
7c0ea5930c openvswitch: fix stack OOB read while fragmenting IPv4 packets
running openvswitch on kernels built with KASAN, it's possible to see the
following splat while testing fragmentation of IPv4 packets:

 BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60
 Read of size 1 at addr ffff888112fc713c by task handler2/1367

 CPU: 0 PID: 1367 Comm: handler2 Not tainted 5.12.0-rc6+ #418
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 Call Trace:
  dump_stack+0x92/0xc1
  print_address_description.constprop.7+0x1a/0x150
  kasan_report.cold.13+0x7f/0x111
  ip_do_fragment+0x1b03/0x1f60
  ovs_fragment+0x5bf/0x840 [openvswitch]
  do_execute_actions+0x1bd5/0x2400 [openvswitch]
  ovs_execute_actions+0xc8/0x3d0 [openvswitch]
  ovs_packet_cmd_execute+0xa39/0x1150 [openvswitch]
  genl_family_rcv_msg_doit.isra.15+0x227/0x2d0
  genl_rcv_msg+0x287/0x490
  netlink_rcv_skb+0x120/0x380
  genl_rcv+0x24/0x40
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x719/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5ba/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f957079db07
 Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 eb ec ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 24 ed ff ff 48
 RSP: 002b:00007f956ce35a50 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 00007f957079db07
 RDX: 0000000000000000 RSI: 00007f956ce35ae0 RDI: 0000000000000019
 RBP: 00007f956ce35ae0 R08: 0000000000000000 R09: 00007f9558006730
 R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
 R13: 00007f956ce37308 R14: 00007f956ce35f80 R15: 00007f956ce35ae0

 The buggy address belongs to the page:
 page:00000000af2a1d93 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112fc7
 flags: 0x17ffffc0000000()
 raw: 0017ffffc0000000 0000000000000000 dead000000000122 0000000000000000
 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 addr ffff888112fc713c is located in stack of task handler2/1367 at offset 180 in frame:
  ovs_fragment+0x0/0x840 [openvswitch]

 this frame has 2 objects:
  [32, 144) 'ovs_dst'
  [192, 424) 'ovs_rt'

 Memory state around the buggy address:
  ffff888112fc7000: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff888112fc7080: 00 f1 f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00
 >ffff888112fc7100: 00 00 00 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00
                                         ^
  ffff888112fc7180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff888112fc7200: 00 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00

for IPv4 packets, ovs_fragment() uses a temporary struct dst_entry. Then,
in the following call graph:

  ip_do_fragment()
    ip_skb_dst_mtu()
      ip_dst_mtu_maybe_forward()
        ip_mtu_locked()

the pointer to struct dst_entry is used as pointer to struct rtable: this
turns the access to struct members like rt_mtu_locked into an OOB read in
the stack. Fix this changing the temporary variable used for IPv4 packets
in ovs_fragment(), similarly to what is done for IPv6 few lines below.

Fixes: d52e5a7e7c ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmt")
Cc: <stable@vger.kernel.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-29 15:31:53 -07:00
David S. Miller
6876a18d33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-04-26 12:00:00 -07:00
Ilya Maximets
7d742b509d openvswitch: meter: remove rate from the bucket size calculation
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.

The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.

Same change proposed for the userspace implementation.

Fixes: 96fbc13d7e ("openvswitch: Add meter infrastructure")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:08:47 -07:00
Jakub Kicinski
8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Ilya Maximets
4d51419d49 openvswitch: fix send of uninitialized stack memory in ct limit reply
'struct ovs_zone_limit' has more members than initialized in
ovs_ct_limit_get_default_limit().  The rest of the memory is a random
kernel stack content that ends up being sent to userspace.

Fix that by using designated initializer that will clear all
non-specified fields.

Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-05 12:54:42 -07:00
Christophe JAILLET
7d42e84eb9 net: openvswitch: Use 'skb_push_rcsum()' instead of hand coding it
'skb_push()'/'skb_postpush_rcsum()' can be replaced by an equivalent
'skb_push_rcsum()' which is less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-04 01:43:02 -07:00
David S. Miller
efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Bhaskar Chowdhury
f44773058c openvswitch: Fix a typo
s/subsytem/subsystem/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:59:46 -07:00
Flavio Leitner
ebfbc46b35 openvswitch: Warn over-mtu packets only if iface is UP.
It is not unusual to have the bridge port down. Sometimes
it has the old MTU, which is fine since it's not being used.

However, the kernel spams the log with a warning message
when a packet is going to be sent over such port. Fix that
by warning only if the interface is UP.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-16 16:28:30 -07:00
wenxu
d29334c15d net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct
When openvswitch conntrack offload with act_ct action. The first rule
do conntrack in the act_ct in tc subsystem. And miss the next rule in
the tc and fallback to the ovs datapath but miss set post_ct flag
which will lead the ct_state_key with -trk flag.

Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-16 15:22:18 -07:00
Eelco Chaudron
a5317f3b06 net: openvswitch: add log message for error case
As requested by upstream OVS, added some error messages in the
validate_and_copy_dec_ttl function.

Includes a small cleanup, which removes an unnecessary parameter
from the dec_ttl_exception_handler() function.

Reported-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Link: https://lore.kernel.org/r/161054576573.26637.18396634650212670580.stgit@ebuild
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 16:32:14 -08:00
Eelco Chaudron
09d6217254 net: openvswitch: fix TTL decrement exception action execution
Currently, the exception actions are not processed correctly as the wrong
dataset is passed. This change fixes this, including the misleading
comment.

In addition, a check was added to make sure we work on an IPv4 packet,
and not just assume if it's not IPv6 it's IPv4.

This was all tested using OVS with patch,
https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639,
applied and sending packets with a TTL of 1 (and 0), both with IPv4
and IPv6.

Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:18:25 -08:00