Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
This commit is contained in:
Linus Torvalds
2017-11-15 11:56:19 -08:00
1617 changed files with 91499 additions and 27219 deletions


@@ -110,3 +110,51 @@ Description: When new NVM image is written to the non-active NVM
is directly the status value from the DMA configuration
based mailbox before the device is power cycled. Writing
0 here clears the status.
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/key
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: This contains name of the property directory the XDomain
service exposes. This entry describes the protocol in
question. Following directories are already reserved by
the Apple XDomain specification:
network: IP/ethernet over Thunderbolt
targetdm: Target disk mode protocol over Thunderbolt
extdisp: External display mode protocol over Thunderbolt
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/modalias
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: Stores the same MODALIAS value emitted by uevent for
the XDomain service. Format: tbtsvc:kSpNvNrN
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/prtcid
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: This contains XDomain protocol identifier the XDomain
service supports.
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/prtcvers
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: This contains XDomain protocol version the XDomain
service supports.
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/prtcrevs
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: This contains XDomain software version the XDomain
service supports.
What: /sys/bus/thunderbolt/devices/<xdomain>.<service>/prtcstns
Date: Jan 2018
KernelVersion: 4.15
Contact: thunderbolt-software@lists.01.org
Description: This contains XDomain service specific settings as
bitmask. Format: %x
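
The modalias attribute above packs the key, prtcid, prtcvers and prtcrevs values into a single string. The following is a minimal Python sketch of splitting such a string back into its parts. It assumes the layout ``tbtsvc:k<key>p<prtcid>v<prtcvers>r<prtcrevs>`` with each numeric field printed as eight hexadecimal digits; that field width and the helper name ``parse_tbtsvc_modalias`` are assumptions made for illustration, not something this ABI description states.

```python
import re

# Assumed layout: tbtsvc:k<key>p<prtcid>v<prtcvers>r<prtcrevs>, where the three
# numeric fields are 8 hex digits each (an assumption, see the text above).
_MODALIAS_RE = re.compile(
    r"^tbtsvc:k(?P<key>.+)"
    r"p(?P<prtcid>[0-9A-Fa-f]{8})"
    r"v(?P<prtcvers>[0-9A-Fa-f]{8})"
    r"r(?P<prtcrevs>[0-9A-Fa-f]{8})$"
)


def parse_tbtsvc_modalias(modalias):
    """Split an XDomain service modalias string into its fields."""
    match = _MODALIAS_RE.match(modalias.strip())
    if match is None:
        raise ValueError("not a tbtsvc modalias: %r" % modalias)
    return {
        "key": match.group("key"),
        "prtcid": int(match.group("prtcid"), 16),
        "prtcvers": int(match.group("prtcvers"), 16),
        "prtcrevs": int(match.group("prtcrevs"), 16),
    }
```

The string passed in would typically be the content read from the ``modalias`` file of one service directory; the trailing newline is stripped before matching.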


@@ -197,3 +197,27 @@ information is missing.
To recover from this mode, one needs to flash a valid NVM image to the
host controller in the same way it is done in the previous chapter.
Networking over Thunderbolt cable
---------------------------------
Thunderbolt technology allows software communication across two hosts
connected by a Thunderbolt cable.
It is possible to tunnel any kind of traffic over a Thunderbolt link, but
currently only the Apple ThunderboltIP protocol is supported.
If the other host is running Windows or macOS, the only thing you need to
do is connect a Thunderbolt cable between the two hosts; the
``thunderbolt-net`` driver is loaded automatically. If the other host is also
running Linux, you should load ``thunderbolt-net`` manually on one host (it
does not matter which one)::
# modprobe thunderbolt-net
This triggers module load on the other host automatically. If the driver
is built-in to the kernel image, there is no need to do anything.
The driver will create one virtual ethernet interface per Thunderbolt
port, named ``thunderbolt0`` and so on. From this point
you can either use standard userspace tools like ``ifconfig`` to
configure the interface or let your GUI handle it automatically.


@@ -0,0 +1,156 @@
BPF extensibility and applicability to networking, tracing, security
in the linux kernel and several user space implementations of BPF
virtual machine led to a number of misunderstandings about what BPF actually is.
This short QA is an attempt to address that and outline a direction
of where BPF is heading long term.
Q: Is BPF a generic instruction set similar to x64 and arm64?
A: NO.
Q: Is BPF a generic virtual machine ?
A: NO.
BPF is generic instruction set _with_ C calling convention.
Q: Why C calling convention was chosen?
A: Because BPF programs are designed to run in the linux kernel
which is written in C, hence BPF defines instruction set compatible
with two most used architectures x64 and arm64 (and takes into
consideration important quirks of other architectures) and
defines calling convention that is compatible with C calling
convention of the linux kernel on those architectures.
Q: can multiple return values be supported in the future?
A: NO. BPF allows only register R0 to be used as return value.
Q: can more than 5 function arguments be supported in the future?
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike x64 ISA that allows msft, cdecl and other conventions)
Q: can BPF programs access instruction pointer or return address?
A: NO.
Q: can BPF programs access stack pointer ?
A: NO. Only frame pointer (register R10) is accessible.
From compiler point of view it's necessary to have stack pointer.
For example LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.
Q: Does the C calling convention diminish possible use cases?
A: YES. BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.
Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
A: Soft yes. At least for now until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read only sections and all other normal constructs
that C code can produce.
Q: Can loops be supported in a safe way?
A: It's not clear yet. BPF developers are trying to find a way to
support bounded loops where the verifier can guarantee that
the program terminates in less than 4096 instructions.
Q: How come LD_ABS and LD_IND instruction are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?
A: This is artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example why BPF_JNE and other compare and jumps are not cpu-like?
A: This was necessary to avoid introducing flags into ISA which are
impossible to make generic and efficient across CPU architectures.
Q: why BPF_DIV instruction doesn't map to x64 div?
A: Because if we picked one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs div-by-zero runtime check.
Q: why there is no BPF_SDIV for signed divide operation?
A: Because it would be rarely used. LLVM reports an error in such a case and
prints a suggestion to use unsigned divide instead.
Q: Why BPF has implicit prologue and epilogue?
A: Because architectures like sparc have register windows and in general
there are enough subtle differences between architectures, so naive
store return address into stack won't work. Another reason is BPF has
to be safe from division by zero (and legacy exception path
of LD_ABS insn). Those instructions need to invoke epilogue and
return implicitly.
Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
A: Because classic BPF didn't have them and BPF authors felt that compiler
workaround would be acceptable. Turned out that programs lose performance
due to lack of these compare instructions and they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have one-to-one mapping to HW instructions
will not be accepted.
Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
registers which makes BPF inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?
A: NO. The first thing to improve performance on 32-bit archs is to teach
LLVM to generate code that uses 32-bit subregisters. Then second step
is to teach verifier to mark operations where zero-ing upper bits
is unnecessary. Then JITs can take advantage of those markings and
drastically reduce size of generated code and improve performance.
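
The zeroing requirement discussed in the answer above can be illustrated with a small Python model. This is only a sketch of the register semantics (64-bit registers, with 32-bit operations clearing the upper half of the destination); the function names ``alu64_add`` and ``alu32_add`` are hypothetical and do not correspond to any kernel or LLVM interface.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def alu64_add(dst, src):
    """Model of a 64-bit add: the full register participates."""
    return (dst + src) & MASK64


def alu32_add(dst, src):
    """Model of a 32-bit subregister add.

    Only the low 32 bits participate and the result is zero-extended,
    so whatever was in the upper 32 bits of dst is lost.
    """
    return ((dst & MASK32) + (src & MASK32)) & MASK32
```

A JIT for a 32-bit CPU has to emit an extra instruction to clear the high half after each such operation unless something tells it that the high half is never read, which is the marking step the answer describes.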
Q: Does BPF have a stable ABI?
A: YES. BPF instructions, arguments to BPF programs, set of helper
functions and their arguments, recognized return codes are all part
of ABI. However when tracing programs are using bpf_probe_read() helper
to walk kernel internal datastructures and compile with kernel
internal headers these accesses can and will break with newer
kernels. The union bpf_attr -> kern_version is checked at load time
to prevent accidentally loading kprobe-based bpf programs written
for a different kernel. Networking programs don't do kern_version check.
Q: How much stack space a BPF program uses?
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used
and both interpreter and most JITed code consume necessary amount.
Q: Can BPF be offloaded to HW?
A: YES. BPF HW offload is supported by NFP driver.
Q: Does classic BPF interpreter still exist?
A: NO. Classic BPF programs are converted into extended BPF instructions.
Q: Can BPF call arbitrary kernel functions?
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
A: NO. Tracing bpf programs can _read_ arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.
Q: Can BPF overwrite arbitrary user memory?
A: Sort-of. Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
program is loaded the kernel will print warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.
Q: When bpf_trace_printk() helper is used the kernel prints nasty
warning message. Why is that?
A: This is done to nudge program authors into better interfaces when
programs need to pass data to user space. Like bpf_perf_event_output()
can be used to efficiently stream data via perf ring buffer.
BPF maps can be used for asynchronous data sharing between kernel
and user space. bpf_trace_printk() should only be used for debugging.
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
A: NO.


@@ -0,0 +1,5 @@
The following properties are common to the Bluetooth controllers:
- local-bd-address: array of 6 bytes, specifies the BD address that was
uniquely assigned to the Bluetooth device, formatted with least significant
byte first (little-endian).
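
The byte order described above is easy to get backwards, so here is a minimal Python sketch that converts a BD address written in the usual colon notation into the least-significant-byte-first array the property expects. The helper names ``bd_address_to_dt_bytes`` and ``format_local_bd_address`` are hypothetical and exist only for this illustration.

```python
def bd_address_to_dt_bytes(addr):
    """Convert 'AA:BB:CC:DD:EE:FF' into a little-endian list of 6 bytes."""
    octets = [int(part, 16) for part in addr.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError("expected 6 colon-separated bytes: %r" % addr)
    # The human-readable form is most significant byte first; the
    # property wants the least significant byte first.
    return list(reversed(octets))


def format_local_bd_address(addr):
    """Render the device tree property line for a BD address."""
    body = " ".join("%02x" % b for b in bd_address_to_dt_bytes(addr))
    return "local-bd-address = [ %s ];" % body
```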


@@ -52,7 +52,7 @@ I2C managed mode:
port@1 { /* external port 1 */
reg = <1>;
label = "lan1;
label = "lan1";
};
port@2 { /* external port 2 */
@@ -89,7 +89,7 @@ MDIO managed mode:
port@1 { /* external port 1 */
reg = <1>;
label = "lan1;
label = "lan1";
};
port@2 { /* external port 2 */


@@ -34,6 +34,19 @@ Optional properties:
- fsl,err006687-workaround-present: If present indicates that the system has
the hardware workaround for ERR006687 applied and does not need a software
workaround.
- interrupt-names: names of the interrupts listed in interrupts property in
the same order. The defaults if not specified are
__Number of interrupts__   __Default__
1                          "int0"
2                          "int0", "pps"
3                          "int0", "int1", "int2"
4                          "int0", "int1", "int2", "pps"
The order may be changed as long as they correspond to the interrupts
property. Currently, only i.mx7 uses "int1" and "int2". They correspond to
tx/rx queues 1 and 2. "int0" will be used for queue 0 and ENET_MII interrupts.
For imx6sx, "int0" handles all 3 queues and ENET_MII. "pps" is for the pulse
per second interrupt associated with 1588 precision time protocol (PTP).
Optional subnodes:
- mdio : specifies the mdio bus in the FEC, used as a container for phy nodes
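
The interrupt-names defaults listed above amount to a lookup keyed by the number of entries in the interrupts property. A minimal Python sketch of that rule follows; the function ``resolve_interrupt_names`` is a hypothetical helper for illustration and is not how the driver itself is written.

```python
# Defaults copied from the table in the binding text.
DEFAULT_INTERRUPT_NAMES = {
    1: ("int0",),
    2: ("int0", "pps"),
    3: ("int0", "int1", "int2"),
    4: ("int0", "int1", "int2", "pps"),
}


def resolve_interrupt_names(num_interrupts, explicit_names=None):
    """Return the name for each interrupt, in interrupts-property order."""
    if explicit_names is not None:
        # Explicit names may come in any order, but there must be
        # exactly one per listed interrupt.
        if len(explicit_names) != num_interrupts:
            raise ValueError("interrupt-names must match interrupts in length")
        return tuple(explicit_names)
    try:
        return DEFAULT_INTERRUPT_NAMES[num_interrupts]
    except KeyError:
        raise ValueError("no default names for %d interrupts" % num_interrupts)
```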


@@ -17,6 +17,8 @@ Required properties:
- "renesas,etheravb-r8a7795" for the R8A7795 SoC.
- "renesas,etheravb-r8a7796" for the R8A7796 SoC.
- "renesas,etheravb-r8a77970" for the R8A77970 SoC.
- "renesas,etheravb-r8a77995" for the R8A77995 SoC.
- "renesas,etheravb-rcar-gen3" as a fallback for the above
R-Car Gen3 devices.
@@ -40,7 +42,7 @@ Optional properties:
- interrupt-parent: the phandle for the interrupt controller that services
interrupts for this device.
- interrupt-names: A list of interrupt names.
For the R8A779[56] SoCs this property is mandatory;
For the R-Car Gen 3 SoCs this property is mandatory;
it should include one entry per channel, named "ch%u",
where %u is the channel number ranging from 0 to 24.
For other SoCs this property is optional; if present


@@ -4,7 +4,8 @@ This file provides information on what the device node for the SH EtherMAC
interface contains.
Required properties:
- compatible: "renesas,gether-r8a7740" if the device is a part of R8A7740 SoC.
- compatible: Must contain one or more of the following:
"renesas,gether-r8a7740" if the device is a part of R8A7740 SoC.
"renesas,ether-r8a7743" if the device is a part of R8A7743 SoC.
"renesas,ether-r8a7745" if the device is a part of R8A7745 SoC.
"renesas,ether-r8a7778" if the device is a part of R8A7778 SoC.
@@ -14,6 +15,14 @@ Required properties:
"renesas,ether-r8a7793" if the device is a part of R8A7793 SoC.
"renesas,ether-r8a7794" if the device is a part of R8A7794 SoC.
"renesas,ether-r7s72100" if the device is a part of R7S72100 SoC.
"renesas,rcar-gen1-ether" for a generic R-Car Gen1 device.
"renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1
device.
When compatible with the generic version, nodes must list
the SoC-specific version corresponding to the platform
first followed by the generic version.
- reg: offset and length of (1) the E-DMAC/feLic register block (required),
(2) the TSU register block (optional).
- interrupts: interrupt specifier for the sole interrupt.
@@ -36,7 +45,8 @@ Optional properties:
Example (Lager board):
ethernet@ee700000 {
compatible = "renesas,ether-r8a7790";
compatible = "renesas,ether-r8a7790",
"renesas,rcar-gen2-ether";
reg = <0 0xee700000 0 0x400>;
interrupt-parent = <&gic>;
interrupts = <0 162 IRQ_TYPE_LEVEL_HIGH>;


@@ -12,7 +12,7 @@ Required properties:
Valid interrupt names are:
- "macirq" (combined signal for various interrupt events)
- "eth_wake_irq" (the interrupt to manage the remote wake-up packet detection)
- "eth_lpi" (the interrupt that occurs when Tx or Rx enters/exits LPI state)
- "eth_lpi" (the interrupt that occurs when Rx exits the LPI state)
- phy-mode: See ethernet.txt file in the same directory.
- snps,reset-gpio gpio number for phy reset.
- snps,reset-active-low boolean flag to indicate if phy reset is active low.


@@ -37,6 +37,11 @@ The following properties are defined to the bluetooth node:
Definition: must be:
"qcom,wcnss-bt"
- local-bd-address:
Usage: optional
Value type: <u8 array>
Definition: see Documentation/devicetree/bindings/net/bluetooth.txt
== WiFi
The following properties are defined to the WiFi node:
@@ -91,6 +96,9 @@ smd {
bt {
compatible = "qcom,wcnss-bt";
/* BD address 00:11:22:33:44:55 */
local-bd-address = [ 55 44 33 22 11 00 ];
};
wlan {


@@ -299,9 +299,6 @@ Data path helpers
.. kernel-doc:: include/net/cfg80211.h
:functions: ieee80211_data_to_8023
.. kernel-doc:: include/net/cfg80211.h
:functions: ieee80211_data_from_8023
.. kernel-doc:: include/net/cfg80211.h
:functions: ieee80211_amsdu_to_8023s


@@ -0,0 +1,37 @@
LAN9303 Ethernet switch driver
==============================
The LAN9303 is a three port 10/100 Mbps ethernet switch with integrated phys for
the two external ethernet ports. The third port is an RMII/MII interface to a
host master network interface (e.g. fixed link).
Driver details
==============
The driver is implemented as a DSA driver, see
Documentation/networking/dsa/dsa.txt.
See Documentation/devicetree/bindings/net/dsa/lan9303.txt for device tree
binding.
The LAN9303 can be managed both via MDIO and I2C, both supported by this driver.
At startup the driver configures the device to provide two separate network
interfaces (which is the default state of a DSA device). Due to HW limitations,
no HW MAC learning takes place in this mode.
When both user ports are joined to the same bridge, the normal HW MAC learning
is enabled. This means that unicast traffic is forwarded in HW. Broadcast and
multicast is flooded in HW. STP is also supported in this mode. The driver
supports fdb/mdb operations as well, meaning IGMP snooping is supported.
If one of the user ports leaves the bridge, the ports go back to the initial
separated operation.
Driver limitations
==================
- Support for VLAN filtering is not implemented
- The HW does not support VLAN-specific fdb entries


@@ -1,6 +1,7 @@
The Linux kernel GTP tunneling module
======================================================================
Documentation by Harald Welte <laforge@gnumonks.org>
Documentation by Harald Welte <laforge@gnumonks.org> and
Andreas Schultz <aschultz@tpip.net>
In 'drivers/net/gtp.c' you are finding a kernel-level implementation
of a GTP tunnel endpoint.
@@ -91,9 +92,13 @@ http://git.osmocom.org/libgtpnl/
== Protocol Versions ==
There are two different versions of GTP-U: v0 and v1. Both are
implemented in the Kernel GTP module. Version 0 is a legacy version,
and deprecated from recent 3GPP specifications.
There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1
[3GPP TS 29.281]. Both are implemented in the Kernel GTP module.
Version 0 is a legacy version, and deprecated from recent 3GPP
specifications.
GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152
for GTPv1-U and 3386 for GTPv0-U.
There are three versions of GTP-C: v0, v1, and v2. As the kernel
doesn't implement GTP-C, we don't have to worry about this. It's the
@@ -133,3 +138,93 @@ doe to a lack of user interest, it never got merged.
In 2015, Andreas Schultz came to the rescue and fixed lots more bugs,
extended it with new features and finally pushed all of us to get it
mainline, where it was merged in 4.7.0.
== Architectural Details ==
=== Local GTP-U entity and tunnel identification ===
GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152
for GTPv1-U and 3386 for GTPv0-U.
There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
instance) per IP address. Tunnel Endpoint Identifiers (TEIDs) are unique
per GTP-U entity.
A specific tunnel is only defined by the destination entity. Since the
destination port is constant, only the destination IP and TEID define
a tunnel. The source IP and Port have no meaning for the tunnel.
Therefore:
* when sending, the remote entity is defined by the remote IP and
the tunnel endpoint id. The source IP and port have no meaning and
can be changed at any time.
* when receiving, the local entity is defined by the local
destination IP and the tunnel endpoint id. The source IP and port
have no meaning and can change at any time.
[3GPP TS 29.281] Section 4.3.0 defines this so:
> The TEID in the GTP-U header is used to de-multiplex traffic
> incoming from remote tunnel endpoints so that it is delivered to the
> User plane entities in a way that allows multiplexing of different
> users, different packet protocols and different QoS levels.
> Therefore no two remote GTP-U endpoints shall send traffic to a
> GTP-U protocol entity using the same TEID value except
> for data forwarding as part of mobility procedures.
The definition above only defines that two remote GTP-U endpoints
*should not* send to the same TEID, it *does not* forbid or exclude
such a scenario. In fact, the mentioned mobility procedures make it
necessary that the GTP-U entity accepts traffic for TEIDs from
multiple or unknown peers.
Therefore, the receiving side identifies tunnels exclusively based on
TEIDs, not based on the source IP!
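
The receive-side rule above can be sketched in Python as follows. The sketch parses the fixed 8-byte GTPv1-U header and looks the tunnel up by TEID alone, ignoring the source address and port on purpose. The class ``GtpuEntity`` and its methods are hypothetical names for illustration; the kernel module is written in C and structured differently.

```python
import struct

GTP1U_PORT = 2152  # receiving UDP port for GTPv1-U, as stated above
GTP_GPDU = 255     # message type of a G-PDU (encapsulated user packet)


def parse_gtp1u_header(datagram):
    """Return (message type, length, TEID) from a GTPv1-U datagram."""
    if len(datagram) < 8:
        raise ValueError("datagram shorter than the GTPv1-U header")
    flags, msg_type, length, teid = struct.unpack("!BBHI", datagram[:8])
    if flags >> 5 != 1:  # the version lives in the top three bits
        raise ValueError("not a GTPv1 packet")
    return msg_type, length, teid


class GtpuEntity:
    """One GTP-U entity, i.e. one local IP address."""

    def __init__(self):
        self.tunnels = {}  # TEID -> tunnel context

    def add_tunnel(self, teid, context):
        self.tunnels[teid] = context

    def receive(self, src_ip, src_port, datagram):
        # src_ip and src_port are deliberately unused: they carry no
        # meaning for tunnel identification.
        msg_type, _length, teid = parse_gtp1u_header(datagram)
        if msg_type != GTP_GPDU:
            return None
        return self.tunnels.get(teid)
```

Two datagrams carrying the same TEID from different peers resolve to the same context, which is what the mobility procedures mentioned above require.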
== APN vs. Network Device ==
The GTP-U driver creates a Linux network device for each Gi/SGi
interface.
[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This
may lead to the impression that the GGSN/P-GW can have only one such
interface.
In fact, the Gi/SGi reference point defines the interworking
between the 3GPP packet domain (PDN) based on GTP-U tunnel and IP
based networks.
There is no provision in any of the 3GPP documents that limits the
number of Gi/SGi interfaces implemented by a GGSN/P-GW.
[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a
specific Gi/SGi interface is made through the Access Point Name
(APN):
> 2. each private network manages its own addressing. In general this
> will result in different private networks having overlapping
> address ranges. A logically separate connection (e.g. an IP in IP
> tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW
> and each private network.
>
> In this case the IP address alone is not necessarily unique. The
> pair of values, Access Point Name (APN) and IPv4 address and/or
> IPv6 prefixes, is unique.
In order to support the overlapping address range use case, each APN
is mapped to a separate Gi/SGi interface (network device).
NOTE: The Access Point Name is purely a control plane (GTP-C) concept.
At the GTP-U level, only Tunnel Endpoint Identifiers are present in
GTP-U packets and network devices are known.
Therefore for a given UE the mapping in IP to PDN network is:
* network device + MS IP -> Peer IP + Peer TEID,
and from PDN to IP network:
* local GTP-U IP + TEID -> network device
Furthermore, before a received T-PDU is injected into the network
device the MS IP is checked against the IP recorded in PDP context.
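
The two mappings and the MS IP check above can be sketched as a pair of lookup tables in Python. Everything here is illustrative: ``PdpContext`` and ``PdpTable`` are hypothetical names, and comparing the inner source address of a received T-PDU against the MS IP is an assumption about which address is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdpContext:
    netdev: str      # network device standing for one Gi/SGi interface
    ms_ip: str       # address assigned to the mobile station
    peer_ip: str     # remote GTP-U entity
    peer_teid: int   # TEID to use when sending to the peer
    local_teid: int  # TEID the peer uses when sending to us


class PdpTable:
    def __init__(self):
        self._tx = {}  # (netdev, MS IP) -> context
        self._rx = {}  # (local GTP-U IP, TEID) -> context

    def add(self, local_ip, ctx):
        self._tx[(ctx.netdev, ctx.ms_ip)] = ctx
        self._rx[(local_ip, ctx.local_teid)] = ctx

    def lookup_tx(self, netdev, ms_ip):
        """network device + MS IP -> (peer IP, peer TEID)."""
        ctx = self._tx.get((netdev, ms_ip))
        return None if ctx is None else (ctx.peer_ip, ctx.peer_teid)

    def lookup_rx(self, local_ip, teid, inner_src_ip):
        """local GTP-U IP + TEID -> network device, after the MS IP check."""
        ctx = self._rx.get((local_ip, teid))
        if ctx is None or ctx.ms_ip != inner_src_ip:
            return None  # unknown tunnel or address mismatch: drop
        return ctx.netdev
```

Because the transmit key includes the network device, two APNs with overlapping address ranges can hold the same MS IP without colliding.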


@@ -0,0 +1,285 @@
Identifier Locator Addressing (ILA)
Introduction
============
Identifier-locator addressing (ILA) is a technique used with IPv6 that
differentiates between location and identity of a network node. Part of an
address expresses the immutable identity of the node, and another part
indicates the location of the node which can be dynamic. Identifier-locator
addressing can be used to efficiently implement overlay networks for
network virtualization as well as solutions for use cases in mobility.
ILA can be thought of as means to implement an overlay network without
encapsulation. This is accomplished by performing network address
translation on destination addresses as a packet traverses a network. To
the network, an ILA translated packet appears to be no different than any
other IPv6 packet. For instance, if the transport protocol is TCP then an
ILA translated packet looks like just another TCP/IPv6 packet. The
advantage of this is that ILA is transparent to the network so that
optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work.
The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila.
ILA terminology
===============
- Identifier A number that identifies an addressable node in the network
independent of its location. ILA identifiers are sixty-four
bit values.
- Locator A network prefix that routes to a physical host. Locators
provide the topological location of an addressed node. ILA
locators are sixty-four bit prefixes.
- ILA mapping
A mapping of an ILA identifier to a locator (or to a
locator and meta data). An ILA domain maintains a database
that contains mappings for all destinations in the domain.
- SIR address
An IPv6 address composed of a SIR prefix (upper sixty-
four bits) and an identifier (lower sixty-four bits).
SIR addresses are visible to applications and provide a
means for them to address nodes independent of their
location.
- ILA address
An IPv6 address composed of a locator (upper sixty-four
bits) and an identifier (low order sixty-four bits). ILA
addresses are never visible to an application.
- ILA host An end host that is capable of performing ILA translations
on transmit or receive.
- ILA router A network node that performs ILA translation and forwarding
of translated packets.
- ILA forwarding cache
A type of ILA router that only maintains a working set
cache of mappings.
- ILA node A network node capable of performing ILA translations. This
can be an ILA router, ILA forwarding cache, or ILA host.
Operation
=========
There are two fundamental operations with ILA:
- Translate a SIR address to an ILA address. This is performed on ingress
to an ILA overlay.
- Translate an ILA address to a SIR address. This is performed on egress
from the ILA overlay.
ILA can be deployed either on end hosts or intermediate devices in the
network; these are provided by "ILA hosts" and "ILA routers" respectively.
Configuration and datapath for these two points of deployment is somewhat
different.
The diagram below illustrates the flow of packets through ILA as well
as showing ILA hosts and routers.
+--------+ +--------+
| Host A +-+ +--->| Host B |
| | | (2) ILA (') | |
+--------+ | ...addressed.... ( ) +--------+
V +---+--+ . packet . +---+--+ (_)
(1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR
addressed +->|router| . . |router|->-+ addressed
packet +---+--+ . IPv6 . +---+--+ packet
/ . Network .
/ . . +--+-++--------+
+--------+ / . . |ILA || Host |
| Host +--+ . .- -|host|| |
| | . . +--+-++--------+
+--------+ ................
Transport checksum handling
===========================
When an address is translated by ILA, an encapsulated transport checksum
that includes the translated address in a pseudo header may be rendered
incorrect on the wire. This is a problem for intermediate devices,
including checksum offload in NICs, that process the checksum. There are
three options to deal with this:
- no action Allow the checksum to be incorrect on the wire. Before
a receiver verifies a checksum the ILA to SIR address
translation must be done.
- adjust transport checksum
When ILA translation is performed the packet is parsed
and if a transport layer checksum is found then it is
adjusted to reflect the correct checksum per the
translated address.
- checksum neutral mapping
When an address is translated the difference can be offset
elsewhere in a part of the packet that is covered by the
checksum. The low order sixteen bits of the identifier
are used. This method is preferred since it doesn't require
parsing a packet beyond the IP header and in most cases the
adjustment can be precomputed and saved with the mapping.
Note that the checksum neutral adjustment affects the low order sixteen
bits of the identifier. When ILA to SIR address translation is done on
egress the low order bits are restored to the original value which
restores the identifier as it was originally sent.
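
Checksum neutral mapping can be illustrated with ones-complement arithmetic on the eight 16-bit words of an IPv6 address. The Python sketch below replaces the upper sixty-four bits and folds the difference into the low order sixteen bits so that the ones-complement sum of the whole address is unchanged. It is an illustration of the arithmetic only, with hypothetical function names; it ignores the C-bit and does not treat the case where the low word is zero, in which 0x0000 and 0xffff are two encodings of the same ones-complement value.

```python
import ipaddress
import struct


def ones_add(a, b):
    """Add two 16-bit values with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def ones_sum(words):
    total = 0
    for w in words:
        total = ones_add(total, w)
    return total


def address_words(addr):
    return list(struct.unpack("!8H", ipaddress.IPv6Address(addr).packed))


def translate_neutral(addr, new_prefix_words):
    """Swap the upper 64 bits and keep the address checksum unchanged."""
    words = address_words(addr)
    # adjustment = sum(old prefix) - sum(new prefix), ones-complement
    adj = ones_add(ones_sum(words[:4]), ~ones_sum(new_prefix_words) & 0xFFFF)
    words[:4] = new_prefix_words
    words[7] = ones_add(words[7], adj)  # low order 16 bits of identifier
    return str(ipaddress.IPv6Address(struct.pack("!8H", *words)))
```

Applying the same function again with the original prefix undoes the adjustment, which mirrors the ILA to SIR translation on egress. The adjustment depends only on the two prefixes, which is why it can be precomputed and stored with the mapping.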
Identifier types
================
ILA defines different types of identifiers for different use cases.
The defined types are:
0: interface identifier
1: locally unique identifier
2: virtual networking identifier for IPv4 address
3: virtual networking identifier for IPv6 unicast address
4: virtual networking identifier for IPv6 multicast address
5: non-local address identifier
In the current implementation of kernel ILA only locally unique identifiers
(LUID) are supported. LUID allows for a generic, unformatted 64 bit
identifier.
Identifier formats
==================
Kernel ILA supports two optional fields in an identifier for formatting:
"C-bit" and "identifier type". The presence of these fields is determined
by configuration as demonstrated below.
If the identifier type is present it occupies the three highest order
bits of an identifier. The possible values are given in the above list.
If the C-bit is present, this is used as an indication that checksum
neutral mapping has been done. The C-bit can only be set in an
ILA address, never a SIR address.
In the simplest format the identifier types, C-bit, and checksum
adjustment value are not present so an identifier is considered an
unstructured sixty-four bit value.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The checksum neutral adjustment may be configured to always be
present using neutral-map-auto. In this case there is no C-bit, but the
checksum adjustment is in the low order 16 bits. The identifier is
still sixty-four bits.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Checksum-neutral adjustment |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The C-bit may be used to explicitly indicate that checksum neutral
mapping has been applied to an ILA address. The format is:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |C| Identifier |
| +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Checksum-neutral adjustment |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The identifier type field may be present to indicate the identifier
type. If it is not present then the type is inferred based on mapping
configuration. The checksum neutral adjustment may be used
automatically with the identifier type as illustrated below.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type| Identifier |
+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Checksum-neutral adjustment |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The identifier type and the C-bit can be present simultaneously, in
which case the identifier format is:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type|C| Identifier |
+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | Checksum-neutral adjustment |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
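The field layout above can be sketched as a small decoder. This is only an illustration: the function name is hypothetical, and the position of the C-bit (the bit immediately below the three type bits) is read off the diagrams rather than taken from the kernel source.

```python
def parse_identifier(ident: int, has_type: bool, has_cbit: bool) -> dict:
    """Decode the optional fields of a 64-bit ILA identifier.

    has_type / has_cbit mirror the configuration: a field is only
    interpreted when the configuration says it is present.
    """
    if not 0 <= ident < (1 << 64):
        raise ValueError("identifier must fit in 64 bits")
    # Identifier type: three highest order bits (bits 63..61).
    ident_type = (ident >> 61) & 0x7 if has_type else None
    # C-bit: assumed to be bit 60, as drawn in the diagrams.
    c_bit = bool((ident >> 60) & 0x1) if has_cbit else None
    return {"type": ident_type, "c_bit": c_bit}
```

For an identifier whose first 16 bits are 0x2000, the three highest bits are 001, so the decoded type is 1 and the C-bit is clear.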
Configuration
=============
There are two methods to configure ILA mappings. One is by using LWT routes
and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat
is intended to be used in the receive path for ILA hosts.
An ILA router has also been implemented in XDP. Description of that is
outside the scope of this document.
The usage for ILA LWT routes is:
ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR
Destination (DEST) can either be a SIR address (for an ILA host or ingress
ILA router) or an ILA address (egress ILA router). LOC is the sixty-four
bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four
bits of the destination address. Checksum MODE is one of "no-action",
"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is
set then the C-bit will be present. Identifier TYPE is one of "luid" or
"use-format". In the case of use-format, the identifier type field is
present and the effective type is taken from that.
The usage of ila_xlat is:
ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE
MATCH indicates the incoming locator that must be matched to apply
the translation. LOC is the locator that overwrites the upper
sixty-four bits of the destination address. MODE and TYPE have the
same meanings as described above.
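The locator substitution described above can be sketched in a few lines. This is an illustration only, assuming csum-mode no-action (no checksum adjustment); the helper names are hypothetical and do not correspond to kernel functions.

```python
import ipaddress


def parse_locator(loc: str) -> int:
    """Parse a sixty-four bit locator written as W:X:Y:Z (hex groups)."""
    parts = loc.split(":")
    if len(parts) != 4:
        raise ValueError("locator must have exactly four groups")
    value = 0
    for part in parts:
        group = int(part, 16)
        if not 0 <= group <= 0xFFFF:
            raise ValueError("each group must fit in 16 bits")
        value = (value << 16) | group
    return value


def apply_locator(dest: str, loc: str) -> ipaddress.IPv6Address:
    """Overwrite the upper sixty-four bits of dest with the locator."""
    addr = int(ipaddress.IPv6Address(dest))
    ident = addr & 0xFFFFFFFFFFFFFFFF  # keep the identifier (low 64 bits)
    return ipaddress.IPv6Address((parse_locator(loc) << 64) | ident)
```

With the values from the first LWT example, apply_locator("3333:0:0:1:2000:0:1:87", "2001:0:87:0") keeps the identifier 2000:0:1:87 and replaces only the locator half of the address.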
Some examples
=============
# Configure an ILA route that uses checksum neutral mapping as well
# as type field. Note that the type field is set in the SIR address
# (the 2000 implies type is 1 which is LUID).
ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \
csum-mode neutral-map ident-type use-format
# Configure an ILA LWT route that uses auto checksum neutral mapping
# (no C-bit) and configure identifier type to be LUID so that the
# identifier type field will not be present.
ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \
csum-mode neutral-map-auto ident-type luid
ila_xlat configuration
# Configure an ILA to SIR mapping that matches a locator and overwrites
# it with a SIR address (3333:0:0:1 in this example). The C-bit and
# identifier field are used.
ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
csum-mode neutral-map ident-type use-format
# Configure an ILA to SIR mapping where checksum neutral is automatically
# set without the C-bit and the identifier type is configured to be LUID
# so that the identifier type field is not present.
ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
csum-mode neutral-map-auto ident-type luid


@@ -289,8 +289,7 @@ tcp_ecn_fallback - BOOLEAN
Default: 1 (fallback enabled)
tcp_fack - BOOLEAN
Enable FACK congestion avoidance and fast retransmission.
The value is not used, if tcp_sack is not enabled.
This is a legacy option, it has no effect anymore.
tcp_fin_timeout - INTEGER
The length of time an orphaned (no longer referenced by any
@@ -454,6 +453,7 @@ tcp_recovery - INTEGER
RACK: 0x1 enables the RACK loss detection for fast detection of lost
retransmissions and tail drops.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
Default: 0x1
@@ -1385,6 +1385,30 @@ mld_qrv - INTEGER
Default: 2 (as specified by RFC3810 9.1)
Minimum: 1 (as specified by RFC6636 4.5)
max_dst_opts_cnt - INTEGER
Maximum number of non-padding TLVs allowed in a Destination
options extension header. If this value is less than zero
then unknown options are disallowed and the number of known
TLVs allowed is the absolute value of this number.
Default: 8
max_hbh_opts_cnt - INTEGER
Maximum number of non-padding TLVs allowed in a Hop-by-Hop
options extension header. If this value is less than zero
then unknown options are disallowed and the number of known
TLVs allowed is the absolute value of this number.
Default: 8
max_dst_opts_len - INTEGER
Maximum length allowed for a Destination options extension
header.
Default: INT_MAX (unlimited)
max_hbh_opts_len - INTEGER
Maximum length allowed for a Hop-by-Hop options extension
header.
Default: INT_MAX (unlimited)
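The sign convention of max_dst_opts_cnt and max_hbh_opts_cnt can be read as follows. This is a minimal sketch of the documented semantics only; the function name is hypothetical and is not a kernel interface.

```python
def tlv_policy(max_opts_cnt: int) -> tuple:
    """Interpret a max_*_opts_cnt value.

    Returns (limit, unknown_allowed):
      limit           - maximum number of non-padding TLVs accepted
      unknown_allowed - False when the sysctl value is negative, in which
                        case the limit applies to known TLVs only
    """
    if max_opts_cnt < 0:
        return (-max_opts_cnt, False)
    return (max_opts_cnt, True)
```

Under this reading, the default of 8 accepts up to eight non-padding TLVs including unknown ones, while -8 accepts up to eight known TLVs and rejects unknown options.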
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
@@ -1707,6 +1731,15 @@ ndisc_notify - BOOLEAN
1 - Generate unsolicited neighbour advertisements when device is brought
up or hardware address changes.
ndisc_tclass - INTEGER
The IPv6 Traffic Class to use by default when sending IPv6 Neighbor
Discovery (Router Solicitation, Router Advertisement, Neighbor
Solicitation, Neighbor Advertisement, Redirect) messages.
These 8 bits can be interpreted as 6 high order bits holding the DSCP
value and 2 low order bits representing ECN (which you probably want
to leave cleared).
0 - (default)
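The bit layout described for ndisc_tclass can be sketched as below. The helper names are hypothetical; the code only shows how a DSCP value and ECN bits combine into the 8-bit Traffic Class value that would be written to the sysctl.

```python
def make_tclass(dscp: int, ecn: int = 0) -> int:
    """Build an 8-bit Traffic Class: 6 high bits DSCP, 2 low bits ECN."""
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 0x3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tclass(tclass: int) -> tuple:
    """Split an 8-bit Traffic Class into (dscp, ecn)."""
    if not 0 <= tclass <= 0xFF:
        raise ValueError("Traffic Class must fit in 8 bits")
    return (tclass >> 2, tclass & 0x3)
```

For example, a DSCP value of 48 with the ECN bits left cleared corresponds to a Traffic Class of 192.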
mldv1_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
MLDv1 report retransmit will take place.


@@ -22,9 +22,21 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
There are no module parameters for this driver and it can be configured
using IProute2/ip utility.
ip link add link <master-dev> name <slave-dev> type ipvlan mode { l2 | l3 | l3s }
ip link add link <master> name <slave> type ipvlan [ mode MODE ] [ FLAGS ]
where
MODE: l3 (default) | l3s | l2
FLAGS: bridge (default) | private | vepa
e.g. ip link add link eth0 name ipvl0 type ipvlan mode l2
e.g.
(a) Following will create IPvlan link with eth0 as master in
L3 bridge mode
bash# ip link add link eth0 name ipvl0 type ipvlan
(b) This command will create IPvlan link in L2 bridge mode.
bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge
(c) This command will create an IPvlan device in L2 private mode.
bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private
(d) This command will create an IPvlan device in L2 vepa mode.
bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa
4. Operating modes:
@@ -54,7 +66,29 @@ works in this mode and hence it is L3-symmetric (L3s). This will have slightly l
performance but that shouldn't matter since you are choosing this mode over plain-L3
mode to make conn-tracking work.
5. What to choose (macvlan vs. ipvlan)?
5. Mode flags:
At this time the following mode flags are available:
5.1 bridge:
This is the default option. To configure the IPvlan port in this mode,
the user can either add this option on the command line or specify
nothing. This is the traditional mode where slaves can cross-talk among
themselves apart from talking through the master device.
5.2 private:
If this option is added to the command line, the port is set in private
mode, i.e. the port won't allow cross communication between slaves.
5.3 vepa:
If this option is added to the command line, the port is set in VEPA mode,
i.e. the port will offload switching functionality to the external entity as
described in 802.1Qbg.
Note: VEPA mode in IPvlan has limitations. IPvlan uses the MAC address of the
master device, so packets emitted in this mode for the adjacent neighbor
will have the same source and destination MAC address. This will make the
switch / router send a redirect message.
6. What to choose (macvlan vs. ipvlan)?
These two devices are very similar in many regards and the specific use
case could very well define which device to choose. If one of the following
situations defines your use case then you can choose to use ipvlan -


@@ -64,7 +64,10 @@ A: To understand this, you need to know a bit of background information
If you aren't subscribed to netdev and/or are simply unsure if net-next
has re-opened yet, simply check the net-next git repository link above for
any new networking-related commits.
any new networking-related commits. You may also check the following
website for the current status:
http://vger.kernel.org/~davem/net-next.html
The "net" tree continues to collect fixes for the vX.Y content, and
is fed back to Linus at regular (~weekly) intervals. Meaning that the


@@ -19,12 +19,12 @@ Features
Receive Side Scaling
--------------------
Hyper-V supports receive side scaling. For TCP, packets are
distributed among available queues based on IP address and port
Hyper-V supports receive side scaling. For TCP & UDP, packets can
be distributed among available queues based on IP address and port
number.
For UDP, we can switch UDP hash level between L3 and L4 by ethtool
command. UDP over IPv4 and v6 can be set differently. The default
For TCP & UDP, we can switch hash level between L3 and L4 by ethtool
command. TCP/UDP over IPv4 and v6 can be set differently. The default
hash level is L4. We currently only allow switching TX hash level
from within the guests.


@@ -19,6 +19,14 @@ core regulatory domain all wireless devices should adhere to.
How to get regulatory domains to the kernel
-------------------------------------------
When the regulatory domain is first set up, the kernel will request a
database file (regulatory.db) containing all the regulatory rules. It
will then use that database when it needs to look up the rules for a
given country.
How to get regulatory domains to the kernel (old CRDA solution)
---------------------------------------------------------------
Userspace gets a regulatory domain in the kernel by having
a userspace agent build it and send it via nl80211. Only
expected regulatory domains will be respected by the kernel.
@@ -192,23 +200,5 @@ Then in some part of your code after your wiphy has been registered:
Statically compiled regulatory database
---------------------------------------
In most situations the userland solution using CRDA as described
above is the preferred solution. However in some cases a set of
rules built into the kernel itself may be desirable. To account
for this situation, a configuration option has been provided
(i.e. CONFIG_CFG80211_INTERNAL_REGDB). With this option enabled,
the wireless database information contained in net/wireless/db.txt is
used to generate a data structure encoded in net/wireless/regdb.c.
That option also enables code in net/wireless/reg.c which queries
the data in regdb.c as an alternative to using CRDA.
The file net/wireless/db.txt should be kept up-to-date with the db.txt
file available in the git repository here:
git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/wireless-regdb.git
Again, most users in most situations should be using the CRDA package
provided with their distribution, and in most other situations users
should be building and using CRDA on their own rather than using
this option. If you are not absolutely sure that you should be using
CONFIG_CFG80211_INTERNAL_REGDB then _DO_NOT_USE_IT_.
When a database should be fixed into the kernel, it can be provided as a
firmware file at build time that is then linked into the kernel.


@@ -280,6 +280,18 @@ Interaction with the user of the RxRPC socket:
nominated by a socket option.
Notes on sendmsg:
(*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
making progress at accepting packets within a reasonable time such that we
manage to queue up all the data for transmission. This requires the
client to accept at least one packet per 2*RTT time period.
If this isn't set, sendmsg() will return immediately, either returning
EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data
consumed.
Notes on recvmsg:
(*) If there's a sequence of data messages belonging to a particular call on
@@ -782,7 +794,9 @@ The kernel interface functions are as follows:
struct key *key,
unsigned long user_call_ID,
s64 tx_total_len,
gfp_t gfp);
gfp_t gfp,
rxrpc_notify_rx_t notify_rx,
bool upgrade);
This allocates the infrastructure to make a new RxRPC call and assigns
call and connection numbers. The call will be made on the UDP port that
@@ -803,6 +817,13 @@ The kernel interface functions are as follows:
allows the kernel to encrypt directly to the packet buffers, thereby
saving a copy. The value may not be less than -1.
notify_rx is a pointer to a function to be called when events such as
incoming data packets or remote aborts happen.
upgrade should be set to true if a client operation should request that
the server upgrade the service to a better one. The resultant service ID
is returned by rxrpc_kernel_recv_data().
If this function is successful, an opaque reference to the RxRPC call is
returned. The caller now holds a reference on this and it must be
properly ended.
@@ -850,7 +871,8 @@ The kernel interface functions are as follows:
size_t size,
size_t *_offset,
bool want_more,
u32 *_abort)
u32 *_abort,
u16 *_service)
This is used to receive data from either the reply part of a client call
or the request part of a service call. buf and size specify how much
@@ -873,6 +895,9 @@ The kernel interface functions are as follows:
If a remote ABORT is detected, the abort code received will be stored in
*_abort and ECONNABORTED will be returned.
The service ID that the call ended up with is returned into *_service.
This can be used to see if a call got a service upgrade.
(*) Abort a call.
void rxrpc_kernel_abort_call(struct socket *sock,
@@ -1020,6 +1045,30 @@ The kernel interface functions are as follows:
It returns 0 if the call was requeued and an error otherwise.
(*) Get call RTT.
u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call);
Get the RTT time to the peer in use by a call. The value returned is in
nanoseconds.
(*) Check call still alive.
u32 rxrpc_kernel_check_life(struct socket *sock,
struct rxrpc_call *call);
This returns a number that is updated when ACKs are received from the peer
(notably including PING RESPONSE ACKs which we can elicit by sending PING
ACKs to see if the call still exists on the server). The caller should
compare the values returned by two successive invocations to see if the
call is still alive after waiting for a suitable interval.
This allows the caller to work out if the server is still contactable and
if the call is still alive on the server whilst waiting for the server to
process a client operation.
This function may transmit a PING ACK.
=======================
CONFIGURABLE PARAMETERS

Some files were not shown because too many files have changed in this diff