Commit Graph

4457 Commits

Author SHA1 Message Date
Jun Zhao 5eb81e8916 ipv4 : igmp : Delete useless parameter in ip_mc_add1_src()
Need not to used 'delta' flag when add single-source to interface
filter source list.

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>
2011-11-30 23:10:01 -05:00
David Miller 32092ecf06 atm: clip: Use device neigh support on top of "arp_tbl".
Instead of instantiating an entire new neigh_table instance
just for ATM handling, use the neigh device private facility.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 18:51:03 -05:00
David Miller 76cc714ed5 neigh: Do not set tbl->entry_size in ipv4/ipv6 neigh tables.
Let the core self-size the neigh entry based upon the key length.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 18:46:43 -05:00
Eric Dumazet d8a6e65f8b tcp: inherit listener congestion control for passive cnx
Rick Jones reported that TCP_CONGESTION sockopt performed on a listener
was ignored for its children sockets : right after accept() the
congestion control for new socket is the system default one.

This seems an oversight of the initial design (quoted from Stephen)

Based on prior investigation and patch from Rick.

Reported-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Yuchung Cheng <ycheng@google.com>
Tested-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 16:55:26 -05:00
RongQing.Li e92036a651 ipv4: remove useless codes in ipmr_device_event()
Commit 7dc00c82 added a 'notify' parameter for vif_delete() to
distinguish whether to unregister the device.

When notify=1 means we does not need to unregister the device,
so calling unregister_netdevice_many is useless.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 18:32:33 -05:00
Eric Dumazet f07d960df3 tcp: avoid frag allocation for small frames
tcp_sendmsg() uses select_size() helper to choose skb head size when a
new skb must be allocated.

If GSO is enabled for the socket, current strategy is to force all
payload data to be outside of headroom, in PAGE fragments.

This strategy is not welcome for small packets, wasting memory.

Experiments show that best results are obtained when using 2048 bytes
for skb head (This includes the skb overhead and various headers)

This patch provides better len/truesize ratios for packets sent to
loopback device, and reduce memory needs for in-flight loopback packets,
particularly on arches with big pages.

If a sender sends many 1-byte packets to an unresponsive application,
receiver rmem_alloc will grow faster and will stop queuing these packets
sooner, or will collapse its receive queue to free excess memory.

netperf -t TCP_RR results are improved by ~4 %, and many workloads are
improved as well (tbench, mysql...)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 13:17:03 -05:00
Neal Cardwell 6b5a5c0dbb tcp: do not scale TSO segment size with reordering degree
Since 2005 (c1b4a7e695)
tcp_tso_should_defer has been using tcp_max_burst() as a target limit
for deciding how large to make outgoing TSO packets when not using
sysctl_tcp_tso_win_divisor. But since 2008
(dd9e0dda66) tcp_max_burst() returns the
reordering degree. We should not have tcp_tso_should_defer attempt to
build larger segments just because there is more reordering. This
commit splits the notion of deferral size used in TSO from the notion
of burst size used in cwnd moderation, and returns the TSO deferral
limit to its original value.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 00:29:41 -05:00
Eric Dumazet b90e5794c5 net: dont call jump_label_dec from irq context
Igor Maravic reported an error caused by jump_label_dec() being called
from IRQ context :

 BUG: sleeping function called from invalid context at kernel/mutex.c:271
 in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
 1 lock held by swapper/0:
  #0:  (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
 Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
Call Trace:
 <IRQ>  [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
 [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
 [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8109a37f>] ? local_clock+0x6f/0x80
 [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
 [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
 [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
 [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
 [<ffffffff81566525>] net_disable_timestamp+0x15/0x20
 [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
 [<ffffffff81557b00>] __sk_free+0x80/0x200
 [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
 [<ffffffff81557cba>] sock_wfree+0x3a/0x70
 [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
 [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
 [<ffffffff8155c119>] kfree_skb+0x49/0x170
 [<ffffffff815e936e>] arp_error_report+0x3e/0x90
 [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
 [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
 [<ffffffff81578d20>] ? neigh_update+0x640/0x640
 [<ffffffff81073558>] __do_softirq+0xc8/0x3a0

Since jump_label_{inc|dec} must be called from process context only,
we must defer jump_label_dec() if net_disable_timestamp() is called
from interrupt context.

Reported-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 00:26:25 -05:00
Eric Dumazet 690e99c4ba tcp: tcp_sendmsg() wrong access to sk_route_caps
Now sk_route_caps is u64, its dangerous to use an integer to store
result of an AND operator. It wont work if NETIF_F_SG is moved on the
upper part of u64.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-28 18:58:24 -05:00
Neal Cardwell 8cd6d6162d tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
The problem: Senders were overriding cwnd values picked during an undo
by calling tcp_moderate_cwnd() in tcp_try_to_open().

The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
TCP_CA_Open, since doing so is generally unnecessary and specifically
would override a DSACK-based undo of a cwnd reduction made in fast
recovery.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 18:54:09 -05:00
Neal Cardwell f698204bd0 tcp: allow undo from reordered DSACKs
Previously, SACK-enabled connections hung around in TCP_CA_Disorder
state while snd_una==high_seq, just waiting to accumulate DSACKs and
hopefully undo a cwnd reduction. This could and did lead to the
following unfortunate scenario: if some incoming ACKs advance snd_una
beyond high_seq then we were setting undo_marker to 0 and moving to
TCP_CA_Open, so if (due to reordering in the ACK return path) we
shortly thereafter received a DSACK then we were no longer able to
undo the cwnd reduction.

The change: Simplify the congestion avoidance state machine by
removing the behavior where SACK-enabled connections hung around in
the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
snd_una advances to high_seq or beyond we typically move to
TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
TCP_CA_Disorder if we later receive enough DSACKs.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 18:54:09 -05:00
Neal Cardwell e95ae2f2cf tcp: use SACKs and DSACKs that arrive on ACKs below snd_una
The bug: When the ACK field is below snd_una (which can happen when
ACKs are reordered), senders ignored DSACKs (preventing undo) and did
not call tcp_fastretrans_alert, so they did not increment
prr_delivered to reflect newly-SACKed sequence ranges, and did not
call tcp_xmit_retransmit_queue, thus passing up chances to send out
more retransmitted and new packets based on any newly-SACKed packets.

The change: When the ACK field is below snd_una (the "old_ack" goto
label), call tcp_fastretrans_alert to allow undo based on any
newly-arrived DSACKs and try to send out more packets based on
newly-SACKed packets.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 18:54:09 -05:00
Neal Cardwell 5628adf1a0 tcp: use DSACKs that arrive when packets_out is 0
The bug: Senders ignored DSACKs after recovery when there were no
outstanding packets (a common scenario for HTTP servers).

The change: when there are no outstanding packets (the "no_queue" goto
label), call tcp_fastretrans_alert() in order to use DSACKs to undo
congestion window reductions.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 18:54:09 -05:00
Neal Cardwell 7d2b55f80d tcp: make is_dupack a parameter to tcp_fastretrans_alert()
Allow callers to decide whether an ACK is a duplicate ACK. This is a
prerequisite to allowing fastretrans_alert to be called from new
contexts, such as the no_queue and old_ack code paths, from which we
have extra info that tells us whether an ACK is a dupack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 18:54:08 -05:00
David S. Miller 6dec4ac4ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/inet_diag.c
2011-11-26 14:47:03 -05:00
Steffen Klassert 261663b0ee ipv4: Don't use the cached pmtu informations for input routes
The pmtu informations on the inetpeer are visible for output and
input routes. On packet forwarding, we might propagate a learned
pmtu to the sender. As we update the pmtu informations of the
inetpeer on demand, the original sender of the forwarded packets
might never notice when the pmtu to that inetpeer increases.
So use the mtu of the outgoing device on packet forwarding instead
of the pmtu to the final destination.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:52 -05:00
Steffen Klassert 618f9bc74a net: Move mtu handling down to the protocol depended handlers
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:51 -05:00
Steffen Klassert ebb762f27f net: Rename the dst_opt default_mtu method to mtu
We plan to invoke the dst_opt->default_mtu() method unconditioally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Steffen Klassert 6b600b26c0 route: Use the device mtu as the default for blackhole routes
As it is, we return null as the default mtu of blackhole routes.
This may lead to a propagation of a bogus pmtu if the default_mtu
method of a blackhole route is invoked. So return dst->dev->mtu
as the default mtu instead.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Li Wei ac8a48106b ipv4: Save nexthop address of LSRR/SSRR option to IPCB.
We can not update iph->daddr in ip_options_rcv_srr(), It is too early.
When some exception ocurred later (eg. in ip_forward() when goto
sr_failed) we need the ip header be identical to the original one as
ICMP need it.

Add a field 'nexthop' in struct ip_options to save nexthop of LSRR
or SSRR option.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23 19:19:32 -05:00
Jun Zhao 685f94e6db ipv4 : igmp : fix error handle in ip_mc_add_src()
When add sources to interface failure, need to roll back the sfcount[MODE]
to before state. We need to match it corresponding.

Acked-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23 17:31:39 -05:00
David S. Miller 46a246c4df netfilter: Remove NOTRACK/RAW dependency on NETFILTER_ADVANCED.
Distributions are using this in their default scripts, so don't hide
them behind the advanced setting.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23 16:07:00 -05:00
Alexey Dobriyan 4e3fd7a06d net: remove ipv6_addr_copy()
C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22 16:43:32 -05:00
Maciej Żenczykowski 717b6d8366 net-netlink: fix diag to export IPv4 tos for dual-stack IPv6 sockets
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22 16:03:10 -05:00
David S. Miller efd0bf97de Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The forcedeth changes had a conflict with the conversion over
to atomic u64 statistics in net-next.

The libertas cfg.c code had a conflict with the bss reference
counting fix by John Linville in net-next.

Conflicts:
	drivers/net/ethernet/nvidia/forcedeth.c
	drivers/net/wireless/libertas/cfg.c
2011-11-21 13:50:33 -05:00