deliver_no_wcard is not being set in skb_copy_header.
In the skb_cloned case it is not being cleared and
may cause the skb to be dropped when the loopback device
pushes it back up the stack.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct sk_forward_alloc handling for error_queue would need to use a
backlog of frames that softirq handler could not deliver because socket
is owned by user thread. Or extend backlog processing to be able to
process normal and error packets.
Another possibility is to not use mem charge for error queue, this is
what I implemented in this patch.
Note: this reverts commit 29030374
(net: fix sk_forward_alloc corruptions), since we dont need to lock
socket anymore.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As David found out, sock_queue_err_skb() should be called with socket
lock hold, or we risk sk_forward_alloc corruption, since we use non
atomic operations to update this field.
This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
(BH already disabled)
1) skb_tstamp_tx()
2) Before calling ip_icmp_error(), in __udp4_lib_err()
3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
drivers/net/usb/asix.c: Fix pointer cast.
be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
proc_dointvec: write a single value
hso: add support for new products
Phonet: fix potential use-after-free in pep_sock_close()
ath9k: remove VEOL support for ad-hoc
ath9k: change beacon allocation to prefer the first beacon slot
sock.h: fix kernel-doc warning
cls_cgroup: Fix build error when built-in
macvlan: do proper cleanup in macvlan_common_newlink() V2
be2net: Bug fix in init code in probe
net/dccp: expansion of error code size
ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
wireless: fix sta_info.h kernel-doc warnings
wireless: fix mac80211.h kernel-doc warnings
iwlwifi: testing the wrong variable in iwl_add_bssid_station()
ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
ath9k_htc: dereferencing before check in hif_usb_tx_cb()
rt2x00: Fix rt2800usb TX descriptor writing.
rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
...
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When GRO produces fraglist entries, and the resulting skb hits
an interface that is incapable of TSO but capable of FRAGLIST,
we end up producing a bogus packet with gso_size non-zero.
This was reported in the field with older versions of KVM that
did not set the TSO bits on tuntap.
This patch fixes that.
Reported-by: Igor Zhang <yugzhang@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use low order bit of skb->_skb_dst to tell dst is not refcounted.
Change _skb_dst to _skb_refdst to make sure all uses are catched.
skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.
New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
(with lockdep check)
skb_dst_drop() drops a reference only if skb dst was refcounted.
skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.
Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().
Use skb_dst_force() in dev_requeue_skb().
Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With following patch I can reach maximum rate of my pktgen+udpsink
simulator :
- 'old' machine : dual quad core E5450 @3.00GHz
- 64 UDP rx flows (only differ by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- rps dispatched on 7 other cores. (~130.000 IPI per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC
- 1.080.000 pps without a single drop at NIC level.
Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch
first sk_buff cache line, the second to prefetch the shinfo part.
Also using one memset() to initialize all skb_shared_info fields instead
of one by one to reduce number of instructions, using long word moves.
All skb_shared_info fields before 'dataref' are cleared in
__alloc_skb().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot")
we uninlined skb_pull.
But in some critical paths it makes sense to inline this thing
and it helps performance significantly.
Create an skb_pull_inline() so that we can do this in a way that
serves also as annotation.
Based upon a patch by Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in skbuff.c ; this patch
removes these two exports. It also marks these functions as 'static'
and removeS the extern declarations of them from
include/linux/skbuff.h
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements software receive side packet steering (RPS). RPS
distributes the load of received packet processing across multiple CPUs.
Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load. This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.
This solution queues packets early on in the receive path on the backlog queues
of other CPUs. This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel. For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask. The IPI mechanism is used to raise networking receive
softirqs between CPUs. This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical
bit maps for receive queues in the device (numbered by <n>). If a device
does not support multi-queue, a single variable is used for the device (rx-0).
Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization. Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy. Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.
e1000e on 8 core Intel
Without RPS: 108K tps at 33% CPU
With RPS: 311K tps at 64% CPU
forcedeth on 16 core AMD
Without RPS: 156K tps at 15% CPU
With RPS: 404K tps at 49% CPU
bnx2x on 16 core AMD
Without RPS 567K tps at 61% CPU (4 HW RX queues)
Without RPS 738K tps at 96% CPU (8 HW RX queues)
With RPS: 854K tps at 76% CPU (4 HW RX queues)
Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet. In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets. It's
probably best not change the masks too frequently.
Signed-off-by: Tom Herbert <therbert@google.com>
include/linux/netdevice.h | 32 ++++-
include/linux/skbuff.h | 3 +
net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++--------
net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++-
net/core/skbuff.c | 2 +
5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we've merged skb's with page frags, and subsequently receive
a trailer skb (< MSS) that is not completely non-linear (this can
occur on Intel NICs if the packet size falls below the threshold),
GRO ends up producing an illegal GSO skb with a frag_list.
This is harmless unless the skb is then forwarded through an
interface that requires software GSO, whereupon the GSO code
will BUG.
This patch detects this case in GRO and avoids merging the
trailer skb.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
NAPI drivers try to recycle SKBs in their polling routine, but we
generally don't know the context in which the polling will be called,
and the skb recycling itself may require IRQs to be enabled.
This patch adds irqs_disabled() test to the skb_recycle_check()
routine, so that we'll not let the drivers hit the skb recycling
path with IRQs disabled.
As a side effect, this patch actually disables skb recycling for some
[broken] drivers. E.g. gianfar driver grabs an irqsave spinlock during
TX ring processing, and then tries to recycle an skb, and that caused
the following badness:
nf_conntrack version 0.5.0 (1008 buckets, 4032 max)
------------[ cut here ]------------
Badness at kernel/softirq.c:143
NIP: c003e3c4 LR: c423a528 CTR: c003e344
...
NIP [c003e3c4] local_bh_enable+0x80/0xc4
LR [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
Call Trace:
[c15d1b60] [c003e32c] local_bh_disable+0x1c/0x34 (unreliable)
[c15d1b70] [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
[c15d1b80] [c02c6370] nf_conntrack_destroy+0x3c/0x70
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211 required this due to the master netdev, but now
it can put all information into skb->cb and this can go.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
netxen: fix tx ring accounting
netxen: fix detection of cut-thru firmware mode
forcedeth: fix dma api mismatches
atm: sk_wmem_alloc initial value is one
net: correct off-by-one write allocations reports
via-velocity : fix no link detection on boot
Net / e100: Fix suspend of devices that cannot be power managed
TI DaVinci EMAC : Fix rmmod error
net: group address list and its count
ipv4: Fix fib_trie rebalancing, part 2
pkt_sched: Update drops stats in act_police
sky2: version 1.23
sky2: add GRO support
sky2: skb recycling
sky2: reduce default transmit ring
sky2: receive counter update
sky2: fix shutdown synchronization
sky2: PCI irq issues
sky2: more receive shutdown
sky2: turn off pause during shutdown
...
Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
The skb mac_header field is sometimes NULL (or ~0u) as a sentinel
value. The places where skb is expanded add an offset which would
change this flag into an invalid pointer (or offset).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looking at the crash in log_martians(), one suspect is that the check for
mac header being set is not correct. The value of mac_header defaults to
0 on allocation, therefore skb_mac_header_was_set will always be true on
platforms using NET_SKBUFF_USES_OFFSET.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>