Commit Graph

49696 Commits

Author SHA1 Message Date
Eric Dumazet
e4ae004b84 netem: add ECN capability
Add ECN (Explicit Congestion Notification) marking capability to netem

tc qdisc add dev eth0 root netem drop 0.5 ecn

Instead of dropping packets, try to ECN mark them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:39:48 -04:00
Eric Dumazet
18d0700024 net: skb_peek()/skb_peek_tail() cleanups
remove useless casts and rename variables for less confusion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:39:48 -04:00
Chris Elston
a32e0eec70 l2tp: introduce L2TPv3 IP encapsulation support for IPv6
L2TPv3 defines an IP encapsulation packet format where data is carried
directly over IP (no UDP). The kernel already has support for L2TP IP
encapsulation over IPv4 (l2tp_ip). This patch introduces support for
L2TP IP encapsulation over IPv6.

The implementation is derived from ipv6/raw and ipv4/l2tp_ip.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:30:55 -04:00
Chris Elston
f9bac8df90 l2tp: netlink api for l2tpv3 ipv6 unmanaged tunnels
This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using
the netlink API. We already support unmanaged L2TPv3 tunnels over
IPv4. A patch to iproute2 to make use of this feature will be
submitted separately.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:30:55 -04:00
James Chapman
9d4ec1aeda pppox: Replace __attribute__((packed)) in if_pppox.h
Checkpatch warns about the use of __attribute__((packed)). So use the
recommended __packed syntax instead.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:30:55 -04:00
Eric Dumazet
d7e8883cfc net: make GRO aware of skb->head_frag
GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself

This avoids the frag_list fallback, and permits to build true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces number of cache misses when user makes its copy, since a
single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30 21:35:49 -04:00
Eric Dumazet
d3836f21b0 net: allow skb->head to be a page fragment
skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback the data cannot be converted to a page fragment if
needed.

We have three spots were it hurts :

1) GRO aggregation

 When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, very inefficient since we keep all struct sk_buff
around. So drivers enabling GRO but delivering linear skbs to network
stack aren't enabling full GRO power.

2) splice(socket -> pipe).

 We must copy the linear part to a page fragment.
 This kind of defeats splice() purpose (zero copy claim)

3) TCP coalescing.

 Recently introduced, this permits to group several contiguous segments
into a single skb. This shortens queue lengths and save kernel memory,
and greatly reduce probabilities of TCP collapses. This coalescing
doesnt work on linear skbs (or we would need to copy data, this would be
too slow)

Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.

build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non zero
value, set to the fragment size.

Then, on situations we need to convert the skb head to a frag in itself,
we can check if skb->head_frag is set and avoid the copies or various
fallbacks we have.

This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize). (thats 512 or 1024 bytes saved per skb). This also makes
bpf/netfilter faster since the 'first frag' will be part of skb linear
part, no need to copy data.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30 21:35:11 -04:00
Benjamin LaHaise
d2cf336167 net/l2tp: add support for L2TP over IPv6 UDP
Now that encap_rcv() works on IPv6 UDP sockets, wire L2TP up to IPv6.
Support has been tested with and without hardware offloading.  This
version fixes the L2TP over localhost issue with incorrect checksums
being reported.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-28 22:21:51 -04:00
Benjamin LaHaise
d7f3f62167 net/ipv6/udp: UDP encapsulation: introduce encap_rcv hook into IPv6
Now that the sematics of udpv6_queue_rcv_skb() match IPv4's
udp_queue_rcv_skb(), introduce the UDP encap_rcv() hook for IPv6.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-28 22:21:51 -04:00
Eric Dumazet
6746960140 ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing
Quoting Tore Anderson from :
https://bugzilla.kernel.org/show_bug.cgi?id=42572

When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
size does not take into account the size of the IPv6 Fragmentation
header that needs to be included in outbound packets, causing every
transmitted TCP segment to be fragmented across two IPv6 packets, the
latter of which will only contain 8 bytes of actual payload.

RTAX_FEATURE_ALLFRAG is typically set on a route in response to
receving a ICMPv6 Packet Too Big message indicating a Path MTU of less
than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
PTBs with MTU < 1280 are still valid, in particular when an IPv6
packet is sent to an IPv4 destination through a stateless translator.
Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
the path will be translated to ICMPv6 PTB which may then indicate an
MTU of less than 1280.

The Linux kernel refuses to reduce the effective MTU to anything below
1280 bytes, instead it sets it to exactly 1280 bytes, and
RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
instead of 1232 (additionally taking into account the 8 bytes required
by the IPv6 Fragmentation extension header).

This in turn results in rather inefficient transmission, as every
transmitted TCP segment now is split in two fragments containing
1232+8 bytes of payload.

After this patch, all the outgoing packets that includes a
Fragmentation header all are "atomic" or "non-fragmented" fragments,
i.e., they both have Offset=0 and More Fragments=0.

With help from David S. Miller

Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-27 00:03:34 -04:00
David S. Miller
a85c9bb895 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next 2012-04-26 15:54:45 -04:00
John W. Linville
d9b8ae6bd8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-testmode.c
2012-04-26 15:03:48 -04:00
Pavel Emelyanov
de248a75c3 tcp repair: Fix unaligned access when repairing options (v2)
Don't pick __u8/__u16 values directly from raw pointers, but instead use
an array of structures of code:value pairs. This is OK, since the buffer
we take options from is not an skb memory, but a user-to-kernel one.

For those options which don't require any value now, require this to be
zero (for potential future extension of this API).

v2: Changed tcp_repair_opt to use two __u32-s as spotted by David Laight.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-26 06:13:51 -04:00
Shan Wei
8dcf01fc00 net: sock_diag_handler structs can be const
read only, so change it to const.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-25 20:46:59 -04:00
Eric Dumazet
38ba0a65fa net: skb_can_coalesce returns a boolean
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-24 00:18:02 -04:00
David S. Miller
f24001941c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")

The former moved around the sysctl register/unregister calls, the
later simply removed them.

With help from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:15:17 -04:00
Eric Dumazet
f545a38f74 net: add a limit parameter to sk_add_backlog()
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.

No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:28:28 -04:00
Eric W. Biederman
b98985073b net ax25: Fix the build when sysctl support is disabled.
Randy Dunlap <rdunlap@xenotime.net> reported:

> On 04/23/2012 12:07 AM, Stephen Rothwell wrote:
>
>> Hi all,
>>
>> Changes since 20120420:
>
>
> include/net/ax25.h:447:75: error: expected ';' before '}' token
>
> static inline int ax25_register_dev_sysctl(ax25_dev *ax25_dev) { return 0 };
> static inline void ax25_unregister_dev_sysctl(ax25_dev *ax25_dev) {};
>
> First function:  move ';' inside braces.
> Second function:  drop the ';'.

Put the semicolons where it makes sense.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:14:47 -04:00
Eric W. Biederman
48c7495857 net sysctl: Add place holder functions for when sysctl support is compiled out of the kernel.
Randy Dunlap <rdunlap@xenotime.net> reported:
> On 04/23/2012 12:07 AM, Stephen Rothwell wrote:
>
>> Hi all,
>>
>> Changes since 20120420:
>
>
>
> ERROR: "unregister_net_sysctl_table" [net/phonet/phonet.ko] undefined!
> ERROR: "register_net_sysctl" [net/phonet/phonet.ko] undefined!
>
> when CONFIG_SYSCTL is not enabled.

Add static inline stub functions to gracefully handle the case when sysctl
support is not present.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 19:24:28 -04:00
Wey-Yi Guy
0d8a0a1728 mac80211: declare ieee80211_ave_rssi as EXPORT
ieee80211_ave_rssi need to be declare as export for driver to use it.

Signed-off-by: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-04-23 15:37:41 -04:00
Linus Torvalds
205b9c9c6e Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes from Benjamin Herrenschmidt:
 "Here are a few fixes for powerpc.  Note the addition to the generic
  irq.h.  This is part of a 3-patches regression fix for mpic due to
  changes in how IRQ_TYPE_NONE is being handled.  Thomas agreed to the
  addition of the new IRQ_TYPE_DEFAULT contant, however he hasn't
  replied with an Ack to the actual patch yet.  I don't to wait much
  longer with these patches tho."

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/mpic: Properly set default triggers
  irq: Add IRQ_TYPE_DEFAULT for use by PIC drivers
  powerpc/mpic: Fix confusion between hw_irq and virq
  powerpc/pmac: Don't add_timer() twice
  powerpc/eeh: Fix crash caused by null eeh_dev
  powerpc/mpc85xx: add MPIC message dts node
  powerpc/mpic_msgr: fix offset error when setting mer register
  powerpc/mpic_msgr: add lock for MPIC message global variable
  powerpc/mpic_msgr: fix compile error when SMP disabled
  powerpc: fix build when CONFIG_BOOKE_WDT is enabled
  powerpc/85xx: don't call of_platform_bus_probe() twice
2012-04-22 21:07:51 -07:00
Linus Torvalds
7e29629543 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix namespace init and cleanup in phonet to fix some oopses, from
    Eric W. Biederman.

 2) Missing kfree_skb() in AF_KEY, from Julia Lawall.

 3) Refcount leak and source address handling fix in l2tp from James
    Chapman.

 4) Memory leak fix in CAIF from Tomasz Gregorek.

 5) When routes are cloned from ipv6 addrconf routes, we don't process
    expirations properly.  Fix from Gao Feng.

 6) Fix panic on DMA errors in atl1 driver, from Tony Zelenoff.

 7) Only enable interrupts in 8139cp driver after we've registered the
    IRQ handler.  From Jason Wang.

 8) Fix too many reads of KS_CIDER register in ks8851 during probe,
    fixing crashes on spurious interrupts.  From Matt Renzelmann.

 9) Missing include in ath5k driver and missing iounmap on probe
    failure, from Jonathan Bither.

10) Fix RX packet handling in smsc911x driver, from Will Deacon.

11) Fix ixgbe WoL on fiber by leaving the laser on during shutdown.

12) ks8851 needs MAX_RECV_FRAMES increased otherwise the internal MAC
    buffers are easily overflown.  Fix from Davide Cimingahi.

13) Fix memory leaks in peak_usb CAN driver, from Jesper Juhl.

14) gred packet scheduler can dump in WRED more when doing a netlink
    dump.  Fix from David Ward.

15) Fix MTU in USB smsc75xx driver, from Stephane Fillod.

16) Dummy device needs ->ndo_uninit handler to properly handle
    ->ndo_init failures.  From Hiroaki SHIMODA.

17) Fix TX fragmentation in ath9k driver, from Sujith Manoharan.

18) Missing RTNL lock in ixgbe PM resume, from Benjamin Poirier.

19) Missing iounmap in farsync WAN driver, from Julia Lawall.

20) With LRO/GRO, tcp_grow_window() is easily tricked into not growing
    the receive window properly, and this hurts performance.  Fix from
    Eric Dumazet.

21) Network namespace init failure can leak net_generic data, fix from
    Julian Anastasov.

22) Fix skb_over_panic due to mis-accounting in TCP for partially ACK'd
    SKBs.  From Eric Dumazet.

23) New IDs for qmi_wwan driver, from Bjørn Mork.

24) Fix races in ax25_exit(), from Eric W. Biederman.

25) IPV6 TCP doesn't handle TCP_MAXSEG socket option properly, copy over
    logic from the IPV4 side.  From Neal Cardwell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
  tcp: fix TCP_MAXSEG for established IPv6 passive sockets
  drivers/net: Do not free an IRQ if its request failed
  drop_monitor: allow more events per second
  ks8851: Fix request_irq/free_irq mismatch
  net/hyperv: Adding cancellation to ensure rndis filter is closed
  ks8851: Fix mutex deadlock in ks8851_net_stop()
  net ax25: Reorder ax25_exit to remove races.
  icplus: fix interrupt for IC+ 101A/G and 1001LF
  net: qmi_wwan: support Sierra Wireless MC77xx devices in QMI mode
  bnx2x: off by one in bnx2x_ets_e3b0_sp_pri_to_cos_set()
  ksz884x: don't copy too much in netdev_set_mac_address()
  tcp: fix retransmit of partially acked frames
  netns: do not leak net_generic data on failed init
  net/sock.h: fix sk_peek_off kernel-doc warning
  tcp: fix tcp_grow_window() for large incoming frames
  drivers/net/wan/farsync.c: add missing iounmap
  davinci_mdio: Fix MDIO timeout check
  ipv6: clean up rt6_clean_expires
  ipv6: fix rt6_update_expires
  arcnet: rimi: Fix device name in debug output
  ...
2012-04-22 21:02:57 -07:00
Benjamin Herrenschmidt
3fca40c704 irq: Add IRQ_TYPE_DEFAULT for use by PIC drivers
This is meant typically to allow a PIC driver's irq domain map() callback
to establish sane defaults for the interrupt (and make sure that the HW
and the irq_desc are in sync as far as the trigger is concerned).

The irq core may not call the set_trigger callback if it thinks the
trigger is already set to the right setting, so we need to ensure new
descriptors are properly synchronized with the hardware.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-04-23 11:04:29 +10:00
Neal Cardwell
900f65d361 tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()
This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().

Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).

When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 16:36:42 -04:00
Jiri Pirko
19a0b58e50 team: allow to enable/disable ports
This patch changes content of hashlist (used to get port struct by
computed index (0...en_port_count-1)). Now the hash list contains only
enabled ports so userspace will be able to say what ports can be used
for tx/rx. This becomes handy when userspace will need to disable ports
which does not belong to active aggregator. By default, newly added port
is enabled.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 16:26:33 -04:00