Commit Graph

4461 Commits

Author SHA1 Message Date
Shmulik Ladkani 636c262808 net: skbuff: Remove errornous length validation in skb_vlan_pop()
In 93515d53b1
  "net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.

In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
considered a no-op).

This validation was moved as is into the new common 'skb_vlan_pop'.

Alas, in its original location (openvswitch), there was a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.

For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non
hw-accel case) or to fail moving next tag into hw-accel tag.

Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
It is superfluous since inner '__skb_vlan_pop' already verifies there
are VLAN_ETH_HLEN writable bytes at the mac_header.

Note this presents a slight change to skb_vlan_pop() users:
In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
returns an error, as opposed to previous "no-op" behavior.
Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
'skb_vlan_pop' fails.

Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:35:57 -04:00
Shmulik Ladkani bfca4c520f net: skbuff: Export __skb_vlan_pop
This exports the functionality of extracting the tag from the payload,
without moving next vlan tag into hw accel tag.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
Daniel Borkmann 36bbef52c7 bpf: direct packet write and access for helpers for clsact progs
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:32:11 -04:00
Steffen Klassert 07b26c9454 gso: Support partial splitting at the frag_list pointer
Since commit 8a29111c7 ("net: gro: allow to build full sized skb")
gro may build buffers with a frag_list. This can hurt forwarding
because most NICs can't offload such packets, they need to be
segmented in software. This patch splits buffers with a frag_list
at the frag_list pointer into buffers that can be TSO offloaded.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 20:59:34 -04:00
Johannes Weiner d979a39d72 cgroup: duplicate cgroup reference when cloning sockets
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Nogah Frankel 69ae6ad2ff net: core: Add offload stats to if_stats_msg
Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:33:42 -04:00
David S. Miller b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Javier Martinez Canillas 181402a5c7 net: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:10 -07:00
Daniel Borkmann f3694e0012 bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.

This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.

BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.

I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann 374fb54eea bpf: add own ctx rewriter on ifindex for clsact progs
When fetching ifindex, we don't need to test dev for being NULL since
we're always guaranteed to have a valid dev for clsact programs. Thus,
avoid this test in fast path.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann f035a51536 bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros
Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit
which otherwise often result in overly long bytes_to_bpf_size(sizeof())
and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
check in convert_bpf_extensions(), but we should rather make that generic
as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
users to detect any rewriter size issues at compile time. Note, there are
currently none, but we want to assert that it stays this way.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann 6088b5823b bpf: minor cleanups in helpers
Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:03 -07:00
stephen hemminger b8b867e132 rtnetlink: remove unused ifla_stats_policy
This structure is defined but never used. Flagged with W=1

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 16:52:43 -07:00
Yaogong Wang 9f5afeae51 tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:25:58 -07:00
Mahesh Bandewar 24b27fc4cd bonding: Fix bonding crash
Following few steps will crash kernel -

  (a) Create bonding master
      > modprobe bonding miimon=50
  (b) Create macvlan bridge on eth2
      > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
	   type macvlan
  (c) Now try adding eth2 into the bond
      > echo +eth2 > /sys/class/net/bond0/bonding/slaves
      <crash>

Bonding does lots of things before checking if the device enslaved is
busy or not.

In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.

This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:41:12 -07:00
WANG Cong bc51dddf98 netns: avoid disabling irq for netns id
We never read or change netns id in hardirq context,
the only place we read netns id in softirq context
is in vxlan_xmit(). So, it should be enough to just
disable BH.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:39:59 -07:00
WANG Cong 38f507f1ba vxlan: call peernet2id() in fdb notification
netns id should be already allocated each time we change
netns, that is, in dev_change_net_namespace() (more precisely
in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:39:58 -07:00
Gao Feng 635c223cfa rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
The original codes depend on that the function parameters are evaluated from
left to right. But the parameter's evaluation order is not defined in C
standard actually.

When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys,
hashrnd) with some compilers or environment, the keys passed to
flow_keys_have_l4 is not initialized.

Fixes: 6db61d79c1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:45:03 -07:00
Roopa Prabhu d297653dd6 rtnetlink: fdb dump: optimize by saving last interface markers
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real    1m11.791s
user    0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real    0m2.017s
user    0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:56:15 -07:00
stephen hemminger 3ee5256da0 netns: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger 85bae4bd8a drop_monitor: make genl_multicast_group const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
Roopa Prabhu 14972cbd34 net: lwtunnel: Handle fragmentation
Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.

To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
Eric Dumazet 41852497a9 net: batch calls to flush_all_backlogs()
After commit 145dd5f9c8 ("net: flush the softnet backlog in process
context"), we can easily batch calls to flush_all_backlogs() for all
devices processed in rollback_registered_many()

Tested:

Before patch, on an idle host.

modprobe dummy numdummies=10000
perf stat -e context-switches -a rmmod dummy

 Performance counter stats for 'system wide':

         1,211,798      context-switches

       1.302137465 seconds time elapsed

After patch:

perf stat -e context-switches -a rmmod dummy

 Performance counter stats for 'system wide':

           225,523      context-switches

       0.721623566 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:17:20 -07:00
Ido Schimmel 6bc506b4fb bridge: switchdev: Add forward mark support for stacked devices
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.

It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.

This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.

The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.

Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).

Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 13:13:36 -07:00
Paolo Abeni 145dd5f9c8 net: flush the softnet backlog in process context
Currently in process_backlog(), the process_queue dequeuing is
performed with local IRQ disabled, to protect against
flush_backlog(), which runs in hard IRQ context.

This patch moves the flush operation to a work queue and runs the
callback with bottom half disabled to protect the process_queue
against dequeuing.
Since process_queue is now always manipulated in bottom half context,
the irq disable/enable pair around the dequeue operation are removed.

To keep the flush time as low as possible, the flush
works are scheduled on all online cpu simultaneously, using the
high priority work-queue and statically allocated, per cpu,
work structs.

Overall this change increases the time required to destroy a device
to improve slightly the packets reinjection performances.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 11:51:07 -07:00