Neal Cardwell says:
====================
fix stretch ACK bugs in TCP CUBIC and Reno
This patch series fixes the TCP CUBIC and Reno congestion control
modules to properly handle stretch ACKs in their respective additive
increase modes, and in the transitions from slow start to additive
increase.
This finishes the project started by commit 9f9843a751 ("tcp:
properly handle stretch acks in slow start"), which fixed behavior for
TCP congestion control when handling stretch ACKs in slow start mode.
Motivation: In the Jan 2015 netdev thread 'BW regression after "tcp:
refine TSO autosizing"', Eyal Perry documented a regression that Eric
Dumazet determined was caused by improper handling of TCP stretch
ACKs.
Background: LRO, GRO, delayed ACKs, and middleboxes can cause "stretch
ACKs" that cover more than the RFC-specified maximum of 2
packets. These stretch ACKs can cause serious performance shortfalls
in common congestion control algorithms, like Reno and CUBIC, which
were designed and tuned years ago with receiver hosts that were not
using LRO or GRO, and were instead ACKing every other packet.
Testing: at Google we have been using this approach for handling
stretch ACKs for CUBIC datacenter and Internet traffic for several
years, with good results.
v2:
* fixed return type of tcp_slow_start() to be u32 instead of int
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug in CUBIC that causes cwnd to increase slightly
too slowly when multiple ACKs arrive in the same jiffy.
If cwnd is supposed to increase at a rate of more than once per jiffy,
then CUBIC was sometimes too slow. Because the bic_target is
calculated for a future point in time, calculated with time in
jiffies, the cwnd can increase over the course of the jiffy while the
bic_target calculated as the proper CUBIC cwnd at time
t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
increases on jiffy tick boundaries.
So since the cnt is set to:
ca->cnt = cwnd / (bic_target - cwnd);
as cwnd increases but bic_target does not increase due to jiffy
granularity, the cnt becomes too large, causing cwnd to increase
too slowly.
For example:
- suppose at the beginning of a jiffy, cwnd=40, bic_target=44
- so CUBIC sets:
ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
- suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
increases cwnd to 41
- so CUBIC sets:
ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13
So now CUBIC will wait for 13 packets to be ACKed before increasing
cwnd to 42, insted of 10 as it should.
The fix is to avoid adjusting the slope (determined by ca->cnt)
multiple times within a jiffy, and instead skip to compute the Reno
cwnd, the "TCP friendliness" code path.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change CUBIC to properly handle stretch ACKs in additive increase mode
by passing in the count of ACKed packets to tcp_cong_avoid_ai().
In addition, because we are now precisely accounting for stretch ACKs,
including delayed ACKs, we can now remove the delayed ACK tracking and
estimation code that tracked recent delayed ACK behavior in
ca->delayed_ack.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Reno to properly handle stretch ACKs in additive increase mode
by passing in the count of ACKed packets to tcp_cong_avoid_ai().
In addition, if snd_cwnd crosses snd_ssthresh during slow start
processing, and we then exit slow start mode, we need to carry over
any remaining "credit" for packets ACKed and apply that to additive
increase by passing this remaining "acked" count to
tcp_cong_avoid_ai().
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on
"stretch ACKs" -- cases where the receiver ACKed more than 1 packet in
a single ACK. For example, suppose w is 10 and we get a stretch ACK
for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2
(since acked/w = 20/10 = 2), but instead we were only increasing cwnd
by 1. This patch fixes that behavior.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
cover more than the RFC-specified maximum of 2 packets. These stretch
ACKs can cause serious performance shortfalls in common congestion
control algorithms that were designed and tuned years ago with
receiver hosts that were not using LRO or GRO, and were instead
politely ACKing every other packet.
This patch series fixes Reno and CUBIC to handle stretch ACKs.
This patch prepares for the upcoming stretch ACK bug fix patches. It
adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
changes all congestion control algorithms to pass in 1 for the ACKed
count. It also changes tcp_slow_start() to return the number of packet
ACK "credits" that were not processed in slow start mode, and can be
processed by the congestion control module in additive increase mode.
In future patches we will fix tcp_cong_avoid_ai() to handle stretch
ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
and additive increase mode.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Don't OOPS on socket AIO, from Christoph Hellwig.
2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
Grumbach.
3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.
4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
Starovoitov.
5) Lots of crash, memory leak, short TX packet et al bug fixes in
sh_eth from Ben Hutchings.
6) Fix memory corruption in SCTP wrt. INIT collitions, from Daniel
Borkmann.
7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
From Eric Dumazet and Govindarajulu Varadarajan.
8) Header length calculation fix in mac80211 from Fred Chou.
9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
From Ezequiel Garcia.
10) udp_diag has bogus logic in it's hash chain skipping, copy same fix
tcp diag used. From Herbert Xu.
11) amd-xgbe programs wrong rx flow control register, from Thomas
Lendacky.
12) Fix race leading to use after free in ping receive path, from Subash
Abhinov Kasiviswanathan.
13) Cache redirect routes otherwise we can get a heavy backlog of rcu
jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
net: don't OOPS on socket aio
stmmac: prevent probe drivers to crash kernel
bnx2x: fix napi poll return value for repoll
ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
sh_eth: Fix DMA-API usage for RX buffers
sh_eth: Check for DMA mapping errors on transmit
sh_eth: Ensure DMA engines are stopped before freeing buffers
sh_eth: Remove RX overflow log messages
ping: Fix race in free in receive path
udp_diag: Fix socket skipping within chain
can: kvaser_usb: Fix state handling upon BUS_ERROR events
can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
can: kvaser_usb: Send correct context to URB completion
can: kvaser_usb: Do not sleep in atomic context
ipv4: try to cache dst_entries which would cause a redirect
samples: bpf: relax test_maps check
bpf: rcu lock must not be held when calling copy_to_user()
net: sctp: fix slab corruption from use after free on INIT collisions
net: mv643xx_eth: Fix highmem support in non-TSO egress path
sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
...
In the case when alloc_netdev fails we return NULL to a caller. But there is no
check for NULL in the probe drivers. This patch changes NULL to an error
pointer. The function description is amended to reflect what we may get
returned.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull powerpc fixes from Michael Ellerman:
"Two powerpc fixes"
* tag 'powerpc-3.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/powernv: Restore LPCR with LPCR_PECE1 cleared
powerpc/xmon: Fix another endiannes issue in RTAS call from xmon
Pull one more module fix from Rusty Russell:
"SCSI was using module_refcount() to figure out when the module was
unloading: this broke with new atomic refcounting. The code is still
suspicious, but this solves the WARN_ON()"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
scsi: always increment reference count
With the commit d75b1ade56 ("net: less interrupt masking in NAPI") napi
repoll is done only when work_done == budget. When in busy_poll is we return 0
in napi_poll. We should return budget.
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
ipsec 2015-01-26
Just two small fixes for _decode_session6() where we
might decode to wrong header information in some rare
situations.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lubomir Rintel reported that during replacing a route the interface
reference counter isn't correctly decremented.
To quote bug <https://bugzilla.kernel.org/show_bug.cgi?id=91941>:
| [root@rhel7-5 lkundrak]# sh -x lal
| + ip link add dev0 type dummy
| + ip link set dev0 up
| + ip link add dev1 type dummy
| + ip link set dev1 up
| + ip addr add 2001:db8:8086::2/64 dev dev0
| + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20
| + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10
| + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20
| + ip link del dev0 type dummy
| Message from syslogd@rhel7-5 at Jan 23 10:54:41 ...
| kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
|
| Message from syslogd@rhel7-5 at Jan 23 10:54:51 ...
| kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
During replacement of a rt6_info we must walk all parent nodes and check
if the to be replaced rt6_info got propagated. If so, replace it with
an alive one.
Fixes: 4a287eba2d ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
Reported-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings says:
====================
Fixes for sh_eth #3
I'm continuing review and testing of Ethernet support on the R-Car H2
chip. This series fixes the last of the more serious issues I've found.
These are not tested on any of the other supported chips.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use the return value of dma_map_single(), rather than calling
virt_to_page() separately
- Check for mapping failue
- Call dma_unmap_single() rather than dma_sync_single_for_cpu()
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
dma_map_single() may fail if an IOMMU or swiotlb is in use, so
we need to check for this.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we try to clear EDRRR and EDTRR and immediately continue to
free buffers. This is unsafe because:
- In general, register writes are not serialised with DMA, so we still
have to wait for DMA to complete somehow
- The R8A7790 (R-Car H2) manual states that the TX running flag cannot
be cleared by writing to EDTRR
- The same manual states that clearing the RX running flag only stops
RX DMA at the next packet boundary
I applied this patch to the driver to detect DMA writes to freed
buffers:
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -1098,7 +1098,14 @@ static void sh_eth_ring_free(struct net_device *ndev)
> /* Free Rx skb ringbuffer */
> if (mdp->rx_skbuff) {
> for (i = 0; i < mdp->num_rx_ring; i++)
> + memcpy(mdp->rx_skbuff[i]->data,
> + "Hello, world", 12);
> + msleep(100);
> + for (i = 0; i < mdp->num_rx_ring; i++) {
> + WARN_ON(memcmp(mdp->rx_skbuff[i]->data,
> + "Hello, world", 12));
> dev_kfree_skb(mdp->rx_skbuff[i]);
> + }
> }
> kfree(mdp->rx_skbuff);
> mdp->rx_skbuff = NULL;
then ran the loop:
while ethtool -G eth0 rx 128 ; ethtool -G eth0 rx 64; do echo -n .; done
and 'ping -f' toward the sh_eth port from another machine. The
warning fired several times a minute.
To fix these issues:
- Deactivate all TX descriptors rather than writing to EDTRR
- As there seems to be no way of telling when RX DMA is stopped,
perform a soft reset to ensure that both DMA enginess are stopped
- To reduce the possibility of the reset truncating a transmitted
frame, disable egress and wait a reasonable time to reach a
packet boundary before resetting
- Update statistics before resetting
(The 'reasonable time' does not allow for CS/CD in half-duplex
mode, but half-duplex no longer seems reasonable!)
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
If RX traffic is overflowing the FIFO or DMA ring, logging every time
this happens just makes things worse. These errors are visible in the
statistics anyway.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde says:
====================
pull-request: can 2015-01-27
this is another pull request for net/master which consists of 4 patches.
All 4 patches are contributed by Ahmed S. Darwish, he fixes more problems in
the kvaser_usb driver.
David, please merge net/master to net-next/master, as we have more kvaser_usb
patches in the queue, that target net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An exception is seen in ICMP ping receive path where the skb
destructor sock_rfree() tries to access a freed socket. This happens
because ping_rcv() releases socket reference with sock_put() and this
internally frees up the socket. Later icmp_rcv() will try to free the
skb and as part of this, skb destructor is called and which leads
to a kernel panic as the socket is freed already in ping_rcv().
-->|exception
-007|sk_mem_uncharge
-007|sock_rfree
-008|skb_release_head_state
-009|skb_release_all
-009|__kfree_skb
-010|kfree_skb
-011|icmp_rcv
-012|ip_local_deliver_finish
Fix this incorrect free by cloning this skb and processing this cloned
skb instead.
This patch was suggested by Eric Dumazet
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on rhashtable walking I noticed that the UDP diag
dumping code is buggy. In particular, the socket skipping within
a chain never happens, even though we record the number of sockets
that should be skipped.
As this code was supposedly copied from TCP, this patch does what
TCP does and resets num before we walk a chain.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While being in an ERROR_WARNING state, and receiving further
bus error events with error counters still in the ERROR_WARNING
range of 97-127 inclusive, the state handling code erroneously
reverts back to ERROR_ACTIVE.
Per the CAN standard, only revert to ERROR_ACTIVE when the
error counters are less than 96.
Moreover, in certain Kvaser models, the BUS_ERROR flag is
always set along with undefined bits in the M16C status
register. Thus use bitwise operators instead of full equality
for checking that register against bus errors.
Signed-off-by: Ahmed S. Darwish <ahmed.darwish@valeo.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
On some x86 laptops, plugging a Kvaser device again after an
unplug makes the firmware always ignore the very first command.
For such a case, provide some room for retries instead of
completely exiting the driver init code.
Signed-off-by: Ahmed S. Darwish <ahmed.darwish@valeo.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Send expected argument to the URB completion hander: a CAN
netdevice instead of the network interface private context
`kvaser_usb_net_priv'.
This was discovered by having some garbage in the kernel
log in place of the netdevice names: can0 and can1.
Signed-off-by: Ahmed S. Darwish <ahmed.darwish@valeo.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>