Add skb->hash to the __sk_buff offset map, so it can be accessed from
an eBPF program. We currently already do this for classic BPF filters,
but not yet on eBPF, it might be useful as a demuxer in combination with
helpers like bpf_clone_redirect(), toy example:
__section("cls-lb") int ingress_main(struct __sk_buff *skb)
{
unsigned int which = 3 + (skb->hash & 7);
/* bpf_skb_store_bytes(skb, ...); */
/* bpf_l{3,4}_csum_replace(skb, ...); */
bpf_clone_redirect(skb, which, 0);
return -1;
}
I was thinking whether to add skb_get_hash(), but then concluded the
raw skb->hash seems fine in this case: we can directly access the hash
w/o extra eBPF helper function call, it's filled out by many NICs on
ingress, and in case the entropy level would not be sufficient, people
can still implement their own specific sw fallback hash mix anyway.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
tlb_dynamic_lb could be set only via sysfs, this patch allows it to be
set via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two vxlan driver flags FLOWBASED and COLLECT_METADATA need to be set to
make use of its new flow mode. The former already exposed. Expose the latter.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
skb: pointer to skb
key: pointer to 'struct bpf_tunnel_key'
size: size of 'struct bpf_tunnel_key'
flags: room for future extensions
First eBPF program that uses these helpers will allocate per_cpu
metadata_dst structures that will be used on TX.
On RX metadata_dst is allocated by tunnel driver.
Typical usage for TX:
struct bpf_tunnel_key tkey;
... populate tkey ...
bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
RX:
struct bpf_tunnel_key tkey = {};
bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
... lookup or redirect based on tkey ...
'struct bpf_tunnel_key' will be extended in the future by adding
elements to the end and the 'size' argument will indicate which fields
are populated, thereby keeping backwards compatibility.
The 'flags' argument may be used as well when the 'size' is not enough or
to indicate completely different layout of bpf_tunnel_key.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6fd99094de ("ipv6: Don't reduce hop limit for an interface")
disabled accept hop limit from RA if it is smaller than the current hop
limit for security stuff. But this behavior kind of break the RFC definition.
RFC 4861, 6.3.4. Processing Received Router Advertisements
A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
and Retrans Timer) may contain a value denoting that it is
unspecified. In such cases, the parameter should be ignored and the
host should continue using whatever value it is already using.
If the received Cur Hop Limit value is non-zero, the host SHOULD set
its CurHopLimit variable to the received value.
So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
hop limit value they can accept from RA. And set default to 1 to meet RFC
standards.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull virtio/vhost fixes from Michael Tsirkin:
"Bugfixes and documentation fixes.
Igor's patch that allows users to tweak memory table size is
borderline, but it does fix known crashes, so I merged it"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: add max_mem_regions module parameter
vhost: extend memory regions allocation to vmalloc
9p/trans_virtio: reset virtio device on remove
virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio
MAINTAINERS: separate section for s390 virtio drivers
virtio: define virtio_pci_cfg_cap in header.
virtio: Fix typecast of pointer in vring_init()
virtio scsi: fix unused variable warning
vhost: use binary search instead of linear in find_region()
virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS
Per RFC 6724, section 4, "Candidate Source Addresses":
It is RECOMMENDED that the candidate source addresses be the set
of unicast addresses assigned to the interface that will be used
to send to the destination (the "outgoing" interface).
Add a sysctl to enable this behaviour.
Signed-off-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.
ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200
A new static key controls the collection of metadata at tunnel level
upon demand.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.
Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst. The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.
This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.
It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.
Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.
Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implementation uses lwtunnel infrastructure to register
hooks for mpls tunnel encaps.
It picks cues from iptunnel_encaps infrastructure and previous
mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provides infrastructure to parse/dump/store encap information for
light weight tunnels like mpls. Encap information for such tunnels
is associated with fib routes.
This infrastructure is based on previous suggestions from
Eric Biederman to follow the xfrm infrastructure.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces two new RTA attributes to attach encap
data to fib routes.
Example iproute2 command to attach mpls encap data to ipv4 routes
$ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via
helper functions. These functions may change skb->data/hlen which are
cached by some JITs to improve performance of ld_abs/ld_ind instructions.
Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
re-compute header len and re-cache skb->data/hlen back into cpu registers.
Note, skb->data/hlen are not directly accessible from the programs,
so any changes to skb->data done either by these helpers or by other
TC actions are safe.
eBPF JIT supported by three architectures:
- arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
- x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
(experiments showed that conditional re-caching is slower).
- s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
in the program (re-caching is tbd).
These helpers allow more scalable handling of vlan from the programs.
Instead of creating thousands of vlan netdevs on top of eth0 and attaching
TC+ingress+bpf to all of them, the program can be attached to eth0 directly
and manipulate vlans as necessary.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-07-17
This series contains updates to igb, ixgbe, ixgbevf, i40e, bnx2x,
freescale, siena and dp83640.
Jacob provides several patches to clarify the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info(). It is okay to support
the specific filters in SIOCSHWTSTAMP by upscaling them to the generic
filters.
Alex Duyck provides a igb patch to pull the time stamp from the fragment
before it gets added to the skb, to avoid a possible issue in which the
fragment can possibly be less than IGB_RX_HDR_LEN due to the time stamp
being pulled after the copybreak check. Also provides a ixgbevf patch to
fold the ixgbevf_pull_tail() call into ixgbevf_add_rx_frag(), which gives
the advantage that the fragment does not have to be modified after it is
added to the skb.
Fan provides patches for ixgbe/ixgbevf to set the receive hash type
based on receive descriptor RSS type.
Todd provides a fix for igb where on check for link on any media other
than copper was not being detected since it was looking on the incorrect
PHY page (due to the page being used gets switched before the function
to check link gets executed).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It would be very useful to retrieve the net_cls's classid from an eBPF
program to allow for a more fine-grained classification, it could be
directly used or in conjunction with additional policies. I.e. docker,
but also tooling such as cgexec, can easily run applications via net_cls
cgroups:
cgcreate -g net_cls:/foo
echo 42 > foo/net_cls.classid
cgexec -g net_cls:foo <prog>
Thus, their respecitve classid cookie of foo can then be looked up on
the egress path to apply further policies. The helper is desigend such
that a non-zero value returns the cgroup id.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds some clarification about the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info. The HWTSTAMP API has
several Rx filters which are very specific, as well as more general
filters. The specific filters really only exist to support some broken
hardware which can't fully implement the generic filters. This patch
adds clarification that it is okay to support the specific filters in
SIOCSHWTSTAMP by upscaling them to the generic filters. In addition,
update the header for ethtool_ts_info to specify that drivers ought to
only report the filters they support without upscaling in this manner.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Reviewed-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Conflicts:
net/bridge/br_mdb.c
Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).
Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Missing list head init in bluetooth hidp session creation, from Tedd
Ho-Jeong An.
2) Don't leak SKB in bridge netfilter error paths, from Florian
Westphal.
3) ipv6 netdevice private leak in netfilter bridging, fixed by Julien
Grall.
4) Fix regression in IP over hamradio bpq encapsulation, from Ralf
Baechle.
5) Fix race between rhashtable resize events and table walks, from Phil
Sutter.
6) Missing validation of IFLA_VF_INFO netlink attributes, fix from
Daniel Borkmann.
7) Missing security layer socket state initialization in tipc code,
from Stephen Smalley.
8) Fix shared IRQ handling in boomerang 3c59x interrupt handler, from
Denys Vlasenko.
9) Missing minor_idr destroy on module unload on macvtap driver, from
Johannes Thumshirn.
10) Various pktgen kernel thread races, from Oleg Nesterov.
11) Fix races that can cause packets to be processed in the backlog even
after a device attached to that SKB has been fully unregistered.
From Julian Anastasov.
12) bcmgenet driver doesn't account packet drops vs. errors properly,
fix from Petri Gynther.
13) Array index validation and off by one fix in DSA layer from Florian
Fainelli
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (66 commits)
can: replace timestamp as unique skb attribute
ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
can: c_can: Fix default pinmux glitch at init
can: rcar_can: unify error messages
can: rcar_can: print request_irq() error code
can: rcar_can: fix typo in error message
can: rcar_can: print signed IRQ #
can: rcar_can: fix IRQ check
net: dsa: Fix off-by-one in switch address parsing
net: dsa: Test array index before use
net: switchdev: don't abort unsupported operations
net: bcmgenet: fix accounting of packet drops vs errors
cdc_ncm: update specs URL
Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
bridge: mdb: allow the user to delete mdb entry if there's a querier
net: call rcu_read_lock early in process_backlog
net: do not process device backlog during unregistration
bridge: fix potential crash in __netdev_pick_tx()
net: axienet: Fix devm_ioremap_resource return value check
...