Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with distinct port.
The OVS VXLAN port is kept unaware of extensions at this point.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements supports for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
SELinux allows to manage label to secure local resources. However,
distributed applications require ACLs to implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow to map security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
Host 1: Host 2:
Group A Group B Group B Group A
+-----+ +-------------+ +-------+ +-----+
| lxc | | SELinux CTX | | httpd | | VM |
+--+--+ +--+----------+ +---+---+ +--+--+
\---+---/ \----+---/
| |
+---+---+ +---+---+
| vxlan | | vxlan |
+---+---+ +---+---+
+------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:
ip link add [...] type vxlan [...] gbp
In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
Examples:
iptables:
host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
OVS:
# ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
# ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.
Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.
Remote checksum offload for VXLAN is described in:
https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
With UDP checksums and Remote Checksum Offload
IPv4
Client
11.84% CPU utilization
Server
12.96% CPU utilization
9197 Mbps
IPv6
Client
12.46% CPU utilization
Server
14.48% CPU utilization
8963 Mbps
With UDP checksums, no remote checksum offload
IPv4
Client
15.67% CPU utilization
Server
14.83% CPU utilization
9094 Mbps
IPv6
Client
16.21% CPU utilization
Server
14.32% CPU utilization
9058 Mbps
No UDP checksums
IPv4
Client
15.03% CPU utilization
Server
23.09% CPU utilization
9089 Mbps
IPv6
Client
16.18% CPU utilization
Server
26.57% CPU utilization
8954 Mbps
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleans up the header flags of VXLAN in anticipation of
defining some new ones:
- Move header related definitions from vxlan.c to vxlan.h
- Change VXLAN_FLAGS to be VXLAN_HF_VNI (only currently defined flag)
- Move check for unknown flags to after we find vxlan_sock, this
assumes that some flags may be processed based on tunnel
configuration
- Add a comment about why the stack treating unknown set flags as an
error instead of ignoring them
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the only tunnel protocol that supports GRO with encapsulated
Ethernet is VXLAN. This pulls out the Ethernet code into a proper layer
so that it can be used by other tunnel protocols such as GRE and Geneve.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of error vxlan_xmit_one() can free already freed skb.
Also fixes memory leak of dst-entry.
Fixes: acbf74a763 ("vxlan: Refactor vxlan driver to make use
of the common UDP tunnel functions").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when trying to reuse a socket, vxlan_sock_add will grab
vn->sock_lock, locate a reusable socket, inc refcount and release
vn->sock_lock.
But vxlan_sock_release() will first decrement refcount, and then grab
that lock. refcnt operations are atomic but as currently we have
deferred works which hold vs->refcnt each, this might happen, leading to
a use after free (specially after vxlan_igmp_leave):
CPU 1 CPU 2
deferred work vxlan_sock_add
... ...
spin_lock(&vn->sock_lock)
vs = vxlan_find_sock();
vxlan_sock_release
dec vs->refcnt, reaches 0
spin_lock(&vn->sock_lock)
vxlan_sock_hold(vs), refcnt=1
spin_unlock(&vn->sock_lock)
hlist_del_rcu(&vs->hlist);
vxlan_notify_del_rx_port(vs)
spin_unlock(&vn->sock_lock)
So when we look for a reusable socket, we check if it wasn't freed
already before reusing it.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Fixes: 7c47cedf43 ("vxlan: move IGMP join/leave to work queue")
Signed-off-by: David S. Miller <davem@davemloft.net>
In "vxlan: Call udp_sock_create" there was a logic error that resulted in
the default for IPv6 VXLAN tunnels going from using checksums to not using
checksums. Since there is currently no support in iproute2 for setting
these values it means that a kernel after the change cannot talk over a IPv6
VXLAN tunnel to a kernel prior the change.
Fixes: 3ee64f3 ("vxlan: Call udp_sock_create")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use them to push skb->vlan_tci into the payload and avoid code
duplication.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most NICs that report NETIF_F_GSO_UDP_TUNNEL support VXLAN, and not
other UDP-based encapsulation protocols where the format and size of the
header differs. This patch implements a generic ndo_gso_check() for
VXLAN which will only advertise GSO support when the skb looks like it
contains VXLAN (or no UDP tunnelling at all).
Implementation shamelessly stolen from Tom Herbert:
http://thread.gmane.org/gmane.linux.network/332428/focus=333111
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/chelsio/cxgb4vf/sge.c
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.
ixgbe_phy.c was a set of overlapping whitespace changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we only match against local port number in order to reuse
socket. But if this new vxlan wants an IPv6 socket and a IPv4 one bound
to that port, vxlan will reuse an IPv4 socket as IPv6 and a panic will
follow. The following steps reproduce it:
# ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
srcport 5000 6000 dev eth0
# ip link add vxlan7 type vxlan id 43 group ff0e::110 \
srcport 5000 6000 dev eth0
# ip link set vxlan6 up
# ip link set vxlan7 up
<panic>
[ 4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
...
[ 4.188076] Call Trace:
[ 4.188085] [<ffffffff81667c4a>] ? ipv6_sock_mc_join+0x3a/0x630
[ 4.188098] [<ffffffffa05a6ad6>] vxlan_igmp_join+0x66/0xd0 [vxlan]
[ 4.188113] [<ffffffff810a3430>] process_one_work+0x220/0x710
[ 4.188125] [<ffffffff810a33c4>] ? process_one_work+0x1b4/0x710
[ 4.188138] [<ffffffff810a3a3b>] worker_thread+0x11b/0x3a0
[ 4.188149] [<ffffffff810a3920>] ? process_one_work+0x710/0x710
So address family must also match in order to reuse a socket.
Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing GRO processing for UDP tunnels, we never add
SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
is added (such as SKB_GSO_TCPV4). The result is that if the packet is
later resegmented we will do GSO but not treat it as a tunnel. This
results in UDP fragmentation of the outer header instead of (i.e.) TCP
segmentation of the inner header as was originally on the wire.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add definition to vxlan nla_policy for UDP checksum. This is necessary
to enable UDP checksums on VXLAN.
In some instances, enabling UDP checksums can improve performance on
receive for devices that return legacy checksum-unnecessary for UDP/IP.
Also, UDP checksum provides some protection against VNI corruption.
Testing:
Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.
TCP_STREAM
IPv4, without UDP checksums
14.41% TX CPU utilization
25.71% RX CPU utilization
9083.4 Mbps
IPv4, with UDP checksums
13.99% TX CPU utilization
13.40% RX CPU utilization
9095.65 Mbps
TCP_RR
IPv4, without UDP checksums
94.08% TX CPU utilization
156/248/462 90/95/99% latencies
1.12743e+06 tps
IPv4, with UDP checksums
94.43% TX CPU utilization
158/250/462 90/95/99% latencies
1.13345e+06 tps
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload
Fixes: 91269e390d ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when netif_rx() is done, the netif_rx handled skb maybe be freed,
and should not be used.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>