Commit Graph

273 Commits

Author SHA1 Message Date
Linus Torvalds 53ee7569ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  bnx2x: allow device properly initialize after hotplug
  bnx2x: fix DMAE timeout according to hw specifications
  bnx2x: properly handle CFC DEL in cnic flow
  bnx2x: call dev_kfree_skb_any instead of dev_kfree_skb
  net: filter: move forward declarations to avoid compile warnings
  pktgen: refactor pg_init() code
  pktgen: use vzalloc_node() instead of vmalloc_node() + memset()
  net: skb_trim explicitely check the linearity instead of data_len
  ipv4: Give backtrace in ip_rt_bug().
  net: avoid synchronize_rcu() in dev_deactivate_many
  net: remove synchronize_net() from netdev_set_master()
  rtnetlink: ignore NETDEV_RELEASE and NETDEV_JOIN event
  net: rename NETDEV_BONDING_DESLAVE to NETDEV_RELEASE
  bridge: call NETDEV_JOIN notifiers when add a slave
  netpoll: disable netpoll when enslave a device
  macvlan: Forward unicast frames in bridge mode to lowerdev
  net: Remove linux/prefetch.h include from linux/skbuff.h
  ipv4: Include linux/prefetch.h in fib_trie.c
  netlabel: Remove prefetches from list handlers.
  drivers/net: add prefetch header for prefetch users
  ...

Fixed up prefetch parts: removed a few duplicate prefetch.h includes,
fixed the location of the igb prefetch.h, took my version of the
skbuff.h code without the extra parentheses etc.
2011-05-23 08:39:24 -07:00
Linus Torvalds a1e4891fd4 Remove prefetch() from <linux/skbuff.h> and "netlabel_addrlist.h"
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h.  The skbuff
list traversal still had them.

Quoth David Miller:
  "Please just remove the prefetches.

  Those are modelled after list.h as I intend to eventually convert
  SKB list handling to "struct list_head" but we're not there yet.

  Therefore if we kill prefetches from list.h we should kill it from
  these things in skbuff.h too."

Requested-by: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 21:43:41 -07:00
Emmanuel Grumbach c4264f27e8 net: skb_trim explicitely check the linearity instead of data_len
The purpose of the check on data_len is to check linearity, so use the inline
helper for this. No overhead and more explicit.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 21:01:21 -04:00
David S. Miller 67f11f4ded net: Remove linux/prefetch.h include from linux/skbuff.h
No longer needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 20:54:11 -04:00
David S. Miller 0fcbe742ea net: Remove prefetches from SKB list handlers.
Noticed by Linus.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 20:35:29 -04:00
Heiko Carstens 34ea646c9f net: add missing prefetch.h include
Fixes build errors on s390 and probably other archs as well:

  In file included from net/ipv4/ip_forward.c:32:0:
  include/net/udp.h: In function 'udp_csum_outgoing':
  include/net/udp.h:141:2: error: implicit declaration of function 'prefetch'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 11:26:02 -07:00
Eric Dumazet 0a14842f5a net: filter: Just In Time compiler for x86-64
In order to speedup packet filtering, here is an implementation of a
JIT compiler for x86_64

It is disabled by default, and must be enabled by the admin.

echo 1 >/proc/sys/net/core/bpf_jit_enable

It uses module_alloc() and module_free() to get memory in the 2GB text
kernel range since we call helpers functions from the generated code.

EAX : BPF A accumulator
EBX : BPF X accumulator
RDI : pointer to skb   (first argument given to JIT function)
RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
r9d : skb->len - skb->data_len (headlen)
r8  : skb->data

To get a trace of generated code, use :

echo 2 >/proc/sys/net/core/bpf_jit_enable

Example of generated code :

# tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

flen=18 proglen=147 pass=3 image=ffffffffa00b5000
JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
JIT code: ffffffffa00b5090: c0 c9 c3

BPF program is 144 bytes long, so native program is almost same size ;)

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 8
(002) ld       [26]
(003) and      #0xffffff00
(004) jeq      #0xc0a81400      jt 16   jf 5
(005) ld       [30]
(006) and      #0xffffff00
(007) jeq      #0xc0a81400      jt 16   jf 17
(008) jeq      #0x806           jt 10   jf 9
(009) jeq      #0x8035          jt 10   jf 17
(010) ld       [28]
(011) and      #0xffffff00
(012) jeq      #0xc0a81400      jt 16   jf 13
(013) ld       [38]
(014) and      #0xffffff00
(015) jeq      #0xc0a81400      jt 16   jf 17
(016) ret      #65535
(017) ret      #0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-27 23:05:08 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
David S. Miller eec009548e net: Fix warnings caused by MAX_SKB_FRAGS change.
After commit a715dea3c8 ("net: Always
allocate at least 16 skb frags regardless of page size"), the value
of MAX_SKB_FRAGS can now take on either an "unsigned long" or an
"int" value.

This causes warnings like:

net/packet/af_packet.c: In function ‘tpacket_fill_skb’:
net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’

Fix by forcing the constant to be unsigned long, otherwise we have
a situation where the type of a system wide constant is variable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-29 23:34:08 -07:00
Anton Blanchard a715dea3c8 net: Always allocate at least 16 skb frags regardless of page size
When analysing performance of the cxgb3 on a ppc64 box I noticed that
we weren't doing much GRO merging. It turns out we are limited by the
number of SKB frags:

#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)

With a 4kB page size we have 18 frags, but with a 64kB page size we
only have 3 frags.

I ran a single stream TCP bandwidth test to compare the performance of

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-28 22:26:32 -07:00
Jiri Pirko 8a4eb5734e net: introduce rx_handler results and logic around that
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.

kernel-doc for rx_handler_result is taken from Nicolas' patch.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-16 12:53:54 -07:00
Michał Mirosław 04ed3e741d net: change netdev->features to u32
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:32:47 -08:00
David S. Miller 686a295553 net: Add safe reverse SKB queue walkers.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 22:47:32 -08:00
KOVACS Krisztian 2fc72c7b84 netfilter: fix compilation when conntrack is disabled but tproxy is enabled
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.

This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.

Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-01-12 20:25:08 +01:00
Michał Mirosław 04fb451eff net: Introduce skb_checksum_start_offset()
Introduce skb_checksum_start_offset() to replace repetitive calculation.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:43:14 -08:00
Vladislav Zolotarov a3d22a68d7 bnx2x: Take the distribution range definition out of skb_tx_hash()
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).

Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:15:53 -08:00
Tom Herbert 3853b5841c xps: Improvements in TX queue selection
In dev_pick_tx, don't do work in calculating queue
index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:19 -08:00
Eric Dumazet 27b75c95f1 net: avoid RCU for NOCACHE dst
There is no point using RCU for dst we allocate for a very short time
(used once).

Change dst_release() to take DST_NOCACHE into account, but also change
skb_dst_set_noref() to force a refcount increment for such dst.

This is a _huge_ gain, because we dont waste memory to store xx thousand
of dsts. Instead of queueing them to RCU, we can free them instantly.

CPU caches can stay hot, re-using same memory blocks to hold temporary
dsts.

Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
since atomic_dec_return() implies a full memory barrier.

Stress test, 160.000.000 udp frames sent, IP route cache disabled
(DDOS).

Before:

real    0m38.091s
user    0m13.189s
sys     7m53.018s

After:

real	0m29.946s
user	0m12.157s
sys	7m40.605s

For reference, if IP route cache was enabled :

real	0m32.030s
user	0m10.521s
sys	8m15.243s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 03:02:23 -07:00
Eric Dumazet 564824b0c5 net: allocate skbs on local node
commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-16 11:13:19 -07:00
Eric Dumazet cb4dfe562c net: skb_frag_t can be smaller on small arches
On 32bit arches, if PAGE_SIZE is smaller than 65536, we can use 16bit
offset and size fields. This patch saves 72 bytes per skb on i386, or
128 bytes after rounding.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 18:31:13 -07:00
Eric Dumazet a02cec2155 net: return operator cleanup
Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-23 14:33:39 -07:00
Eric Dumazet bc8acf2c8c drivers/net: avoid some skb->ip_summed initializations
fresh skbs have ip_summed set to CHECKSUM_NONE (0)

We can avoid setting again skb->ip_summed to CHECKSUM_NONE in drivers.

Introduce skb_checksum_none_assert() helper so that we keep this
assertion documented in driver sources.

Change most occurrences of :

skb->ip_summed = CHECKSUM_NONE;

by :

skb_checksum_none_assert(skb);

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 19:06:22 -07:00
David S. Miller 21dc330157 net: Rename skb_has_frags to skb_has_frag_list
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 00:13:46 -07:00
Oliver Hartkopp 2244d07bfa net: simplify flags for tx timestamping
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 00:08:30 -07:00