Commit Graph

474 Commits

Author SHA1 Message Date
Herbert Xu 213dd74aee skbuff: Do not scrub skb mark within the same name space
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> Le 15/04/2015 15:57, Herbert Xu a écrit :
> >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> >Subject: skbuff: Do not scrub skb mark within the same name space
> >
> >The commit ea23192e8e ("tunnels:
> Maybe add a Fixes tag?
> Fixes: ea23192e8e ("tunnels: harmonize cleanup done on skb on rx path")
>
> >harmonize cleanup done on skb on rx path") broke anyone trying to
> >use netfilter marking across IPv4 tunnels.  While most of the
> >fields that are cleared by skb_scrub_packet don't matter, the
> >netfilter mark must be preserved.
> >
> >This patch rearranges skb_scurb_packet to preserve the mark field.
> nit: s/scurb/scrub
>
> Else it's fine for me.

Sure.

PS I used the wrong email for James the first time around.  So
let me repeat the question here.  Should secmark be preserved
or cleared across tunnels within the same name space? In fact,
do our security models even support name spaces?

---8<---
The commit ea23192e8e ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels.  While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.

This patch rearranges skb_scrub_packet to preserve the mark field.

Fixes: ea23192e8e ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:20:40 -04:00
Herbert Xu 4c0ee414e8 Revert "net: Reset secmark when scrubbing packet"
This patch reverts commit b8fb4e0648
because the secmark must be preserved even when a packet crosses
namespace boundaries.  The reason is that security labels apply to
the system as a whole and is not per-namespace.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:19:02 -04:00
Sheng Yong 8bc0034cf6 net: remove extra newlines
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 22:24:37 -04:00
David S. Miller 0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Willem de Bruijn 3a8dd9711e sock: fix possible NULL sk dereference in __skb_tstamp_tx
Test that sk != NULL before reading sk->sk_tsflags.

Fixes: 49ca0d8bfa ("net-timestamp: no-payload option")
Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 00:09:55 -04:00
Eric Dumazet c29390c6df xps: must clear sender_cpu before forwarding
John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John <jw@nuclearfallout.net>
Bisected-by: John <jw@nuclearfallout.net>
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 23:51:18 -04:00
Eric Dumazet 58025e46ea net: gro: remove obsolete code from skb_gro_receive()
Some drivers use copybreak to copy tiny frames into smaller skb,
and this smaller skb might not have skb->head_frag set for various
reasons.

skb_gro_receive() currently doesn't allow to aggregate the smaller skb
into the previous GRO packet if this GRO packet has at least 2 MSS in
it.

Following workload easily demonstrates the problem.

netperf -t TCP_RR -H target -- -r 3000,3000

(tcpdump shows one GRO packet with 2 MSS, plus one additional packet of
104 bytes that should have been appended.)

It turns out that we can remove code from skb_gro_receive(), because
commit 8a29111c7c ("net: gro: allow to build full sized skb") and its
followups removed the assumption that a GRO packet with a frag_list had
to have an empty head.

Removing this code allows the aggregation of the last (incomplete) frame
in some RPC workloads. Note that tcp_gro_receive() already takes care of
forcing a flush if necessary, including this case.

If we want to avoid using frag_list in the first place (in forwarding
workloads for example, as the outgoing NIC is generally not able to cope
with skbs having a frag_list), we need to address this separately.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 21:50:55 -05:00
David S. Miller 71a83a6db6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/rocker/rocker.c

The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03 21:16:48 -05:00
Bojan Prtvar 059a2440fd net: Remove state argument from skb_find_text()
Although it is clear that textsearch state is intentionally passed to
skb_find_text() as uninitialized argument, it was never used by the
callers. Therefore, we can simplify skb_find_text() by making it
local variable.

Signed-off-by: Bojan Prtvar <prtvar.b@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22 15:59:54 -05:00
Eric Dumazet 997d5c3f44 sock: sock_dequeue_err_skb() needs hard irq safety
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.

Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 364a9e9324 ("sock: deduplicate errqueue dequeue")
Fixes: cb820f8e4b ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20 15:52:21 -05:00
Eric Dumazet 2bd82484bb xps: fix xps for stacked devices
A typical qdisc setup is the following :

bond0 : bonding device, using HTB hierarchy
eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc

XPS allows to spread packets on specific tx queues, based on the cpu
doing the send.

Problem is that dequeues from bond0 qdisc can happen on random cpus,
due to the fact that qdisc_run() can dequeue a batch of packets.

CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0

CPUB -> dequeue packet P1 from bond0
        enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0
        enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)

get_xps_queue() then might select wrong queue for P1, since current cpu
might be different than CPUA.

P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)

Effect of this bug is TCP reorders, and more generally not optimal
TX queue placement. (A victim bulk flow can be migrated to the wrong TX
queue for a while)

To fix this, we have to record sender cpu number the first time
dev_queue_xmit() is called for one tx skb.

We can union napi_id (used on receive path) and sender_cpu,
granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
this union idea)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 13:02:54 -08:00
Willem de Bruijn b245be1f4d net-timestamp: no-payload only sysctl
Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

Changes
  (v1 -> v2)
  - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
  (rfc -> v1)
  - document the sysctl in Documentation/sysctl/net.txt
  - fix access control race: read .._OPT_TSONLY only once,
        use same value for permission check and skb generation.
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 18:46:51 -08:00
Willem de Bruijn 49ca0d8bfa net-timestamp: no-payload option
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.

Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

Changes (rfc -> v1)
  - add documentation
  - remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02 18:46:51 -08:00
Jiri Pirko df8a39defa net: rename vlan_tx_* helpers since "tx" is misleading there
The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 17:51:08 -05:00
Florian Westphal e8768f9715 net: skbuff: don't zero tc members when freeing skb
Not needed, only four cases:
 - kfree_skb (or one of its aliases).
   Don't need to zero, memory will be freed.
 - kfree_skb_partial and head was stolen:  memory will be freed.
 - skb_morph:  The skb header fields (including tc ones) will be
   copied over from the 'to-be-morphed' skb right after
   skb_release_head_state returns.
 - skb_segment:  Same as before, all the skb header
   fields are copied over from the original skb right away.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-02 16:04:29 -05:00
Thomas Graf b8fb4e0648 net: Reset secmark when scrubbing packet
skb_scrub_packet() is called when a packet switches between a context
such as between underlay and overlay, between namespaces, or between
L3 subnets.

While we already scrub the packet mark, connection tracking entry,
and cached destination, the security mark/context is left intact.

It seems wrong to inherit the security context of a packet when going
from overlay to underlay or across forwarding paths.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-24 00:21:43 -05:00
Alexander Duyck fd11a83dd3 net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb
This change pulls the core functionality out of __netdev_alloc_skb and
places them in a new function named __alloc_rx_skb.  The reason for doing
this is to make these bits accessible to a new function __napi_alloc_skb.
In addition __alloc_rx_skb now has a new flags value that is used to
determine which page frag pool to allocate from.  If the SKB_ALLOC_NAPI
flag is set then the NAPI pool is used.  The advantage of this is that we
do not have to use local_irq_save/restore when accessing the NAPI pool from
NAPI context.

In my test setup I saw at least 11ns of savings using the napi_alloc_skb
function versus the netdev_alloc_skb function, most of this being due to
the fact that we didn't have to call local_irq_save/restore.

The main use case for napi_alloc_skb would be for things such as copybreak
or page fragment based receive paths where an skb is allocated after the
data has been received instead of before.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 13:31:57 -05:00
Alexander Duyck ffde7328a3 net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache.  By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context.  The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.

I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag.  First I updated the allocation to do either a 32K
allocation or an order 0 page.  This is based on the changes in commmit
d9b2938aa where it was found that latencies could be reduced in case of
failures.  Then I also rewrote the logic to work from the end of the page to
the start.  By doing this the size value doesn't have to be used unless we
have run out of space for page fragments.  Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set.  This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 13:31:57 -05:00
Eric Dumazet 6ffe75eb53 net: avoid two atomic operations in fast clones
Commit ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
took the wrong way to save one atomic operation.

It is actually possible to avoid two atomic operations, if we
do not change skb->fclone values, and only rely on clone_ref
content to signal if the clone is available or not.

skb_clone() can simply use the fast clone if clone_ref is 1.

kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.

Note that because we usually free the clone before the original skb,
this particular attempt is only done for the original skb to have better
branch prediction.

SKB_FCLONE_FREE is removed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 13:40:20 -05:00
David S. Miller 1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Eric Dumazet e7820e39b7 net: Revert "net: avoid one atomic operation in skb_clone()"
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.

By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.

We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.

Reported-by: Chris Mason <clm@fb.com>
Fixes: ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 15:26:32 -05:00
Jiri Pirko 93515d53b1 net: move vlan pop/push functions into common code
So it can be used from out of openvswitch code.
Did couple of cosmetic changes on the way, namely variable naming and
adding support for 8021AD proto.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:18 -05:00
Jiri Pirko e21951212f net: move make_writable helper into common code
note that skb_make_writable already exists in net/netfilter/core.c
but does something slightly different.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Tom Herbert e585f23636 udp: Changes to udp_offload to support remote checksum offload
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).

Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 16:30:03 -05:00
Toshiaki Makita 432c856fcf net: skb_segment() should preserve backpressure
This patch generalizes commit d6a4a10411 ("tcp: GSO should be TSQ
friendly") to protocols using skb_set_owner_w()

TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
as explained in commit 6ff50cd555 ("tcp: gso: do not generate out of
order packets")

This allows UDP sockets using UFO to get proper backpressure,
thus avoiding qdisc drops and excessive cpu usage.

Here are performance test results (macvlan on vlan):

- Before
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      144096 1224195    1258.56
212992           60.00          51              0.45

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.23      0.00     25.26      0.08      0.00     74.43

- After
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      109593      0     957.20
212992           60.00      109593            957.20

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.18      0.00      8.38      0.02      0.00     91.43

[edumazet] Rewrote patch and changelog.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 14:47:19 -04:00