Commit Graph

189 Commits

Author SHA1 Message Date
Nikolay Aleksandrov
a258aeacd7 bonding: add support for xstats and export 3ad stats
This patch adds support for extended statistics (xstats) call to the
bonding. The first user would be the 3ad code which counts the following
events:
 - LACPDU Rx/Tx
 - LACPDU unknown type Rx
 - LACPDU illegal Rx
 - Marker Rx/Tx
 - Marker response Rx/Tx
 - Marker unknown type Rx

All of these are exported via netlink as separate attributes to be
easily extensible as we plan to add more in the future.
Similar to how the bridge and other xstats exports, the structure
inside is:
 [ IFLA_STATS_LINK_XSTATS ]
   -> [ LINK_XSTATS_TYPE_BOND ]
        -> [ BOND_XSTATS_3AD ]
             -> [ 3ad stats attributes ]

With this structure it's easy to add more stat types later.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 12:04:14 -08:00
Nikolay Aleksandrov
a428afe82f net: bridge: add support for user-controlled bool options
We have been adding many new bridge options, a big number of which are
boolean but still take up netlink attribute ids and waste space in the skb.
Recently we discussed learning from link-local packets[1] and decided
yet another new boolean option will be needed, thus introducing this API
to save some bridge nl space.
The API supports changing the value of multiple boolean options at once
via the br_boolopt_multi struct which has an optmask (which options to
set, bit per opt) and optval (options' new values). Future boolean
options will only be added to the br_boolopt_id enum and then will have
to be handled in br_boolopt_toggle/get. The API will automatically
add the ability to change and export them via netlink, sysfs can use the
single boolopt function versions to do the same. The behaviour with
failing/succeeding is the same as with normal netlink option changing.

If an option requires mapping to internal kernel flag or needs special
configuration to be enabled then it should be handled in
br_boolopt_toggle. It should also be able to retrieve an option's current
state via br_boolopt_get.

v2: WARN_ON() on unsupported option as that shouldn't be possible and
    also will help catch people who add new options without handling
    them for both set and get. Pass down extack so if an option desires
    it could set it on error and be more user-friendly.

[1] https://www.spinics.net/lists/netdev/msg532698.html

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 15:04:15 -08:00
Stefano Brivio
a025fb5f49 geneve: Allow configuration of DF behaviour
draft-ietf-nvo3-geneve-08 says:

   It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
   [RFC1981]) be used by setting the DF bit in the IP header when Geneve
   packets are transmitted over IPv4 (this is the default with IPv6).

Now that ICMP error handling is working for GENEVE, we can comply with
this recommendation.

Make this configurable, though, to avoid breaking existing setups. By
default, DF won't be set. It can be set or inherited from inner IPv4
packets. If it's configured to be inherited and we are encapsulating IPv6,
it will be set.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, apply DF
  setting only if (!geneve->collect_md) in geneve_xmit_skb()
  (Stephen Hemminger)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Stefano Brivio
b4d3069783 vxlan: Allow configuration of DF behaviour
Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
DF is configured to be inherited, always set it.

For IPv4, inheriting DF from the inner header was probably intended from
the very beginning judging by the comment to vxlan_xmit(), but it wasn't
actually implemented -- also because it would have done more harm than
good, without handling for ICMP Fragmentation Needed messages.

According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
SHOULD set the DF-bit [...]", whatever that means.

Given this background, the only sane option is probably to let the user
decide, and keep the current behaviour as default.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, move DF
  setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Nikolay Aleksandrov
9163a0fc1f net: bridge: add support for per-port vlan stats
This patch adds an option to have per-port vlan stats instead of the
default global stats. The option can be set only when there are no port
vlans in the bridge since we need to allocate the stats if it is set
when vlans are being added to ports (and respectively free them
when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
largest user of options. The current stats design allows us to add
these without any changes to the fast-path, it all comes down to
the per-vlan stats pointer which, if this option is enabled, will
be allocated for each port vlan instead of using the global bridge-wide
one.

CC: bridge@lists.linux-foundation.org
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-12 10:18:58 -07:00
Hangbin Liu
52d0d404d3 geneve: add ttl inherit support
Similar with commit 72f6d71e49 ("vxlan: add ttl inherit support"),
currently ttl == 0 means "use whatever default value" on geneve instead
of inherit inner ttl. To respect compatibility with old behavior, let's
add a new IFLA_GENEVE_TTL_INHERIT for geneve ttl inherit support.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 20:38:22 -07:00
Christian Brauner
19d8f1ad12 if_link: add IFLA_TARGET_NETNSID alias
This adds IFLA_TARGET_NETNSID as an alias for IFLA_IF_NETNSID for
RTM_*LINK requests.
The new name is clearer and also aligns with the newly introduced
IFA_TARGET_NETNSID propert for RTM_*ADDR requests.

Signed-off-by: Christian Brauner <christian@brauner.io>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Stephen Hemminger
3e7a50ceb1 net: report min and max mtu network device settings
Report the minimum and maximum MTU allowed on a device
via netlink so that it can be displayed by tools like
ip link.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29 12:57:26 -07:00
David S. Miller
7a49d3d4ea Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2018-07-27

1) Extend the output_mark to also support the input direction
   and masking the mark values before applying to the skb.

2) Add a new lookup key for the upcomming xfrm interfaces.

3) Extend the xfrm lookups to match xfrm interface IDs.

4) Add virtual xfrm interfaces. The purpose of these interfaces
   is to overcome the design limitations that the existing
   VTI devices have.

  The main limitations that we see with the current VTI are the
  following:

  VTI interfaces are L3 tunnels with configurable endpoints.
  For xfrm, the tunnel endpoint are already determined by the SA.
  So the VTI tunnel endpoints must be either the same as on the
  SA or wildcards. In case VTI tunnel endpoints are same as on
  the SA, we get a one to one correlation between the SA and
  the tunnel. So each SA needs its own tunnel interface.

  On the other hand, we can have only one VTI tunnel with
  wildcard src/dst tunnel endpoints in the system because the
  lookup is based on the tunnel endpoints. The existing tunnel
  lookup won't work with multiple tunnels with wildcard
  tunnel endpoints. Some usecases require more than on
  VTI tunnel of this type, for example if somebody has multiple
  namespaces and every namespace requires such a VTI.

  VTI needs separate interfaces for IPv4 and IPv6 tunnels.
  So when routing to a VTI, we have to know to which address
  family this traffic class is going to be encapsulated.
  This is a lmitation because it makes routing more complex
  and it is not always possible to know what happens behind the
  VTI, e.g. when the VTI is move to some namespace.

  VTI works just with tunnel mode SAs. We need generic interfaces
  that ensures transfomation, regardless of the xfrm mode and
  the encapsulated address family.

  VTI is configured with a combination GRE keys and xfrm marks.
  With this we have to deal with some extra cases in the generic
  tunnel lookup because the GRE keys on the VTI are actually
  not GRE keys, the GRE keys were just reused for something else.
  All extensions to the VTI interfaces would require to add
  even more complexity to the generic tunnel lookup.

  So to overcome this, we developed xfrm interfaces with the
  following design goal:

  It should be possible to tunnel IPv4 and IPv6 through the same
  interface.

  No limitation on xfrm mode (tunnel, transport and beet).

  Should be a generic virtual interface that ensures IPsec
  transformation, no need to know what happens behind the
  interface.

  Interfaces should be configured with a new key that must match a
  new policy/SA lookup key.

  The lookup logic should stay in the xfrm codebase, no need to
  change or extend generic routing and tunnel lookups.

  Should be possible to use IPsec hardware offloads of the underlying
  interface.

5) Remove xfrm pcpu policy cache. This was added after the flowcache
   removal, but it turned out to make things even worse.
   From Florian Westphal.

6) Allow to update the set mark on SA updates.
   From Nathan Harold.

7) Convert some timestamps to time64_t.
   From Arnd Bergmann.

8) Don't check the offload_handle in xfrm code,
   it is an opaque data cookie for the driver.
   From Shannon Nelson.

9) Remove xfrmi interface ID from flowi. After this pach
   no generic code is touched anymore to do xfrm interface
   lookups. From Benedict Wong.

10) Allow to update the xfrm interface ID on SA updates.
    From Nathan Harold.

11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle.
    From YueHaibing.

12) Return more detailed errors on xfrm interface creation.
    From Benedict Wong.

13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR.
    From the kbuild test robot.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-27 09:33:37 -07:00
Nikolay Aleksandrov
2756f68c31 net: bridge: add support for backup port
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
allows to set a backup port to be used for known unicast traffic if the
port has gone carrier down. The backup pointer is rcu protected and set
only under RTNL, a counter is maintained so when deleting a port we know
how many other ports reference it as a backup and we remove it from all.
Also the pointer is in the first cache line which is hot at the time of
the check and thus in the common case we only add one more test.
The backup port will be used only for the non-flooding case since
it's a part of the bridge and the flooded packets will be forwarded to it
anyway. To remove the forwarding just send a 0/non-existing backup port.
This is used to avoid numerous scalability problems when using MLAG most
notably if we have thousands of fdbs one would need to change all of them
on port carrier going down which takes too long and causes a storm of fdb
notifications (and again when the port comes back up). In a Multi-chassis
Link Aggregation setup usually hosts are connected to two different
switches which act as a single logical switch. Those switches usually have
a control and backup link between them called peerlink which might be used
for communication in case a host loses connectivity to one of them.
We need a fast way to failover in case a host port goes down and currently
none of the solutions (like bond) cannot fulfill the requirements because
the participating ports are actually the "master" devices and must have the
same peerlink as their backup interface and at the same time all of them
must participate in the bridge device. As Roopa noted it's normal practice
in routing called fast re-route where a precalculated backup path is used
when the main one is down.
Another use case of this is with EVPN, having a single vxlan device which
is backup of every port. Due to the nature of master devices it's not
currently possible to use one device as a backup for many and still have
all of them participate in the bridge (which is master itself).
More detailed information about MLAG is available at the link below.
https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG

Further explanation and a diagram by Roopa:
Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.

the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.

br0 -- team0---eth0
      |
      -- switch-peerlink

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23 09:32:15 -07:00
Jakub Kicinski
a25717d2b6 xdp: support simultaneous driver and hw XDP attachment
Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time.  Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present).  We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.

Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13 20:26:35 +02:00
Jakub Kicinski
4f91da26c8 xdp: add per mode attributes for attached programs
In preparation for support of simultaneous driver and hardware XDP
support add per-mode attributes.  The catch-all IFLA_XDP_PROG_ID
will still be reported, but user space can now also access the
program ID in a new IFLA_XDP_<mode>_PROG_ID attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13 20:26:35 +02:00
Steffen Klassert
f203b76d78 xfrm: Add virtual xfrm interfaces
This patch adds support for virtual xfrm interfaces.
Packets that are routed through such an interface
are guaranteed to be IPsec transformed or dropped.
It is a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind
the interface. This means that we can tunnel IPv4 and
IPv6 through the same interface and support all xfrm
modes (tunnel, transport and beet) on it.

Co-developed-by: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Benedict Wong <benedictwong@google.com>
Tested-by: Antony Antony <antony@phenome.org>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2018-06-23 16:07:25 +02:00
Nikolay Aleksandrov
7d850abd5f net: bridge: add support for port isolation
This patch adds support for a new port flag - BR_ISOLATED. If it is set
then isolated ports cannot communicate between each other, but they can
still communicate with non-isolated ports. The same can be achieved via
ACLs but they can't scale with large number of ports and also the
complexity of the rules grows. This feature can be used to achieve
isolated vlan functionality (similar to pvlan) as well, though currently
it will be port-wide (for all vlans on the port). The new test in
should_deliver uses data that is already cache hot and the new boolean
is used to avoid an additional source port test in should_deliver.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25 14:37:20 -04:00
Hangbin Liu
72f6d71e49 vxlan: add ttl inherit support
Like tos inherit, ttl inherit should also means inherit the inner protocol's
ttl values, which actually not implemented in vxlan yet.

But we could not treat ttl == 0 as "use the inner TTL", because that would be
used also when the "ttl" option is not specified and that would be a behavior
change, and breaking real use cases.

So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is
specified with ip cmd.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:53:13 -04:00
Subash Abhinov Kasiviswanathan
14452ca3b5 net: qualcomm: rmnet: Export mux_id and flags to netlink
Define new netlink attributes for rmnet mux_id and flags. These
flags / mux_id were earlier using vlan flags / id respectively.
The flag bits are also moved to uapi and are renamed with
prefix RMNET_FLAG_*.

Also add the rmnet policy to handle the new netlink attributes.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:00:44 -04:00
Sabrina Dubroca
1ec010e705 tun: export flags, uid, gid, queue information over netlink
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 16:58:22 -05:00
Nicolas Dichtel
38e01b3056 dev: advertise the new ifindex when the netns iface changes
The goal is to let the user follow an interface that moves to another
netns.

CC: Jiri Benc <jbenc@redhat.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:23:52 -05:00
David Decotigny
b2d3bcfa26 net: core: Expose number of link up/down transitions
Expose the number of times the link has been going UP or DOWN, and
update the "carrier_changes" counter to be the sum of these two events.
While at it, also update the sysfs-class-net documentation to cover:
carrier_changes (3.15), carrier_up_count (4.16) and carrier_down_count
(4.16)

Signed-off-by: David Decotigny <decot@googlers.com>
[Florian:
* rebase
* add documentation
* merge carrier_changes with up/down counters]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 15:42:05 -05:00
Eugenia Emantayev
c5a9f6f0ab net/core: Add drop counters to VF statistics
Modern hardware can decide to drop packets going to/from a VF.
Add receive and transmit drop counters to be displayed at hypervisor
layer in iproute2 per VF statistics.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-09 07:40:48 +02:00
Jiri Benc
79e1ad148c rtnetlink: use netnsid to query interface
Currently, when an application gets netnsid from the kernel (for example as
the result of RTM_GETLINK call on one end of the veth pair), it's not much
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implemetns IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 21:49:17 +09:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Greg Kroah-Hartman
6f52b16c5b License cleanup: add SPDX license identifier to uapi header files with no license
Many user space API headers are missing licensing information, which
makes it hard for compliance tools to determine the correct license.

By default are files without license information under the default
license of the kernel, which is GPLV2.  Marking them GPLV2 would exclude
them from being included in non GPLV2 code, which is obviously not
intended. The user space API headers fall under the syscall exception
which is in the kernels COPYING file:

   NOTE! This copyright does *not* cover user programs that use kernel
   services by normal system calls - this is merely considered normal use
   of the kernel, and does *not* fall under the heading of "derived work".

otherwise syscall usage would not be possible.

Update the files which contain no license information with an SPDX
license identifier.  The chosen identifier is 'GPL-2.0 WITH
Linux-syscall-note' which is the officially assigned identifier for the
Linux syscall exception.  SPDX license identifiers are a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.  See the previous patch in this series for the
methodology of how this patch was researched.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:19:54 +01:00
Mahesh Bandewar
fe89aa6b25 ipvlan: implement VEPA mode
This is very similar to the Macvlan VEPA mode, however, there is some
difference. IPvlan uses the mac-address of the lower device, so the VEPA
mode has implications of ICMP-redirects for packets destined for its
immediate neighbors sharing same master since the packets will have same
source and dest mac. The external switch/router will send redirect msg.

Having said that, this will be useful tool in terms of debugging
since IPvlan will not switch packets within its slaves and rely completely
on the external entity as intended in 802.1Qbg.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 18:39:57 +09:00
Mahesh Bandewar
a190d04db9 ipvlan: introduce 'private' attribute for all existing modes.
IPvlan has always operated in bridge mode. However there are scenarios
where each slave should be able to talk through the master device but
not necessarily across each other. Think of an environment where each
of a namespace is a private and independant customer. In this scenario
the machine which is hosting these namespaces neither want to tell who
their neighbor is nor the individual namespaces care to talk to neighbor
on short-circuited network path.

This patch implements the mode that is very similar to the 'private' mode
in macvlan where individual slaves can send and receive traffic through
the master device, just that they can not talk among slave devices.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 18:39:57 +09:00