dev_forward_skb loops an skb back into host networking
stack which might hang on the memory indefinitely.
In particular, this can happen in macvtap in bridged mode.
Copy the userspace fragments to avoid blocking the
sender in that case.
As this patch makes skb_copy_ubufs extern now,
I also added some documentation and made it clear
the SKBTX_DEV_ZEROCOPY flag automatically instead
of doing it in all callers. This can be made into a separate
patch if people feel it's worth it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RX rings should use GFP_KERNEL allocations if possible, add
__netdev_alloc_skb_ip_align() helper to ease this.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rearrange struct sk_buff members comments to follow their
definition order. Also, add missing comments for ooo_okay
and dropcount members.
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace buffers support in skb shared info. A new
struct skb_ubuf_info is needed to maintain the userspace buffers
argument and index, a callback is used to notify userspace to release
the buffers once lower device has done DMA (Last reference to that skb
has gone).
If there is any userspace apps to reference these userspace buffers,
then these userspaces buffers will be copied into kernel. This way we
can prevent userspace apps from holding these userspace buffers too long.
Use destructor_arg to point to the userspace buffer info; a new tx flags
SKBTX_DEV_ZEROCOPY is added for zero-copy buffer check.
Signed-off-by: Shirley Ma <xma@...ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comment for the skb_tx_timestamp() function suggests calling it just
after a buffer is released to the hardware for transmission. However,
for drivers that free the buffer in an ISR, this produces a race between
the time stamp code and the ISR. This commit changes the comment to advise
placing the call just before handing the buffer over to the hardware.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
but rather in vlan_do_receive. Otherwise the vlan header
will not be properly put on the packet in the case of
vlan header accelleration.
As we remove the check from vlan_check_reorder_header
rename it vlan_reorder_header to keep the naming clean.
Fix up the skb->pkt_type early so we don't look at the packet
after adding the vlan tag, which guarantees we don't goof
and look at the wrong field.
Use a simple if statement instead of a complicated switch
statement to decided that we need to increment rx_stats
for a multicast packet.
Hopefully at somepoint we will just declare the case where
VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
the code. Until then this keeps it working correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
bnx2x: allow device properly initialize after hotplug
bnx2x: fix DMAE timeout according to hw specifications
bnx2x: properly handle CFC DEL in cnic flow
bnx2x: call dev_kfree_skb_any instead of dev_kfree_skb
net: filter: move forward declarations to avoid compile warnings
pktgen: refactor pg_init() code
pktgen: use vzalloc_node() instead of vmalloc_node() + memset()
net: skb_trim explicitely check the linearity instead of data_len
ipv4: Give backtrace in ip_rt_bug().
net: avoid synchronize_rcu() in dev_deactivate_many
net: remove synchronize_net() from netdev_set_master()
rtnetlink: ignore NETDEV_RELEASE and NETDEV_JOIN event
net: rename NETDEV_BONDING_DESLAVE to NETDEV_RELEASE
bridge: call NETDEV_JOIN notifiers when add a slave
netpoll: disable netpoll when enslave a device
macvlan: Forward unicast frames in bridge mode to lowerdev
net: Remove linux/prefetch.h include from linux/skbuff.h
ipv4: Include linux/prefetch.h in fib_trie.c
netlabel: Remove prefetches from list handlers.
drivers/net: add prefetch header for prefetch users
...
Fixed up prefetch parts: removed a few duplicate prefetch.h includes,
fixed the location of the igb prefetch.h, took my version of the
skbuff.h code without the extra parentheses etc.
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h. The skbuff
list traversal still had them.
Quoth David Miller:
"Please just remove the prefetches.
Those are modelled after list.h as I intend to eventually convert
SKB list handling to "struct list_head" but we're not there yet.
Therefore if we kill prefetches from list.h we should kill it from
these things in skbuff.h too."
Requested-by: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of the check on data_len is to check linearity, so use the inline
helper for this. No overhead and more explicit.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes build errors on s390 and probably other archs as well:
In file included from net/ipv4/ip_forward.c:32:0:
include/net/udp.h: In function 'udp_csum_outgoing':
include/net/udp.h:141:2: error: implicit declaration of function 'prefetch'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit a715dea3c8 ("net: Always
allocate at least 16 skb frags regardless of page size"), the value
of MAX_SKB_FRAGS can now take on either an "unsigned long" or an
"int" value.
This causes warnings like:
net/packet/af_packet.c: In function ‘tpacket_fill_skb’:
net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’
Fix by forcing the constant to be unsigned long, otherwise we have
a situation where the type of a system wide constant is variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
When analysing performance of the cxgb3 on a ppc64 box I noticed that
we weren't doing much GRO merging. It turns out we are limited by the
number of SKB frags:
#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
With a 4kB page size we have 18 frags, but with a 64kB page size we
only have 3 frags.
I ran a single stream TCP bandwidth test to compare the performance of
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.
kernel-doc for rx_handler_result is taken from Nicolas' patch.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.
Occurences found by `grep -r` on net/, drivers/net, include/
[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.
This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.
Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce skb_checksum_start_offset() to replace repetitive calculation.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>