Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking whether a DATA chunk fits into the estimated rwnd a
full sizeof(struct sk_buff) is added to the needed chunk size. This
quickly exhausts the available rwnd space and leads to packets being
sent which are much below the PMTU limit. This can lead to much worse
performance.
The reason for this behaviour was to avoid putting too much memory
pressure on the receiver. The concept is not completely irational
because a Linux receiver does in fact clone an skb for each DATA chunk
delivered. However, Linux also reserves half the available socket
buffer space for data structures therefore usage of it is already
accounted for.
When proposing to change this the last time it was noted that this
behaviour was introduced to solve a performance issue caused by rwnd
overusage in combination with small DATA chunks.
Trying to reproduce this I found that with the sk_buff overhead removed,
the performance would improve significantly unless socket buffer limits
are increased.
The following numbers have been gathered using a patched iperf
supporting SCTP over a live 1 Gbit ethernet network. The -l option
was used to limit DATA chunk sizes. The numbers listed are based on
the average of 3 test runs each. Default values have been used for
sk_(r|w)mem.
Chunk
Size Unpatched No Overhead
-------------------------------------
4 15.2 Kbit [!] 12.2 Mbit [!]
8 35.8 Kbit [!] 26.0 Mbit [!]
16 95.5 Kbit [!] 54.4 Mbit [!]
32 106.7 Mbit 102.3 Mbit
64 189.2 Mbit 188.3 Mbit
128 331.2 Mbit 334.8 Mbit
256 537.7 Mbit 536.0 Mbit
512 766.9 Mbit 766.6 Mbit
1024 810.1 Mbit 808.6 Mbit
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes BUG that the ASCONF receiver transmits DATA chunks
to the newly added UNCONFIRMED destination.
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
When initiating a graceful shutdown while having data chunks
on the retransmission queue with a peer which is in zero
window mode the shutdown is never completed because the
retransmission error count is reset periodically by the
following two rules:
- Do not timeout association while doing zero window probe.
- Reset overall error count when a heartbeat request has
been acknowledged.
The graceful shutdown will wait for all outstanding TSN to
be acknowledged before sending the SHUTDOWN request. This
never happens due to the peer's zero window not acknowledging
the continuously retransmitted data chunks. Although the
error counter is incremented for each failed retransmission,
the receiving of the SACK announcing the zero window clears
the error count again immediately. Also heartbeat requests
continue to be sent periodically. The peer acknowledges these
requests causing the error counter to be reset as well.
This patch changes behaviour to only reset the overall error
counter for the above rules while not in shutdown. After
reaching the maximum number of retransmission attempts, the
T5 shutdown guard timer is scheduled to give the receiver
some additional time to recover. The timer is stopped as soon
as the receiver acknowledges any data.
The issue can be easily reproduced by establishing a sctp
association over the loopback device, constantly queueing
data at the sender while not reading any at the receiver.
Wait for the window to reach zero, then initiate a shutdown
by killing both processes simultaneously. The association
will never be freed and the chunks on the retransmission
queue will be retransmitted indefinitely.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this case, the SCTP association transmits an ASCONF packet
including addition of the new IP address and deletion of the old
address. This patch implements this functionality.
In this case, the ASCONF chunk is added to the beginning of the
queue, because the other chunks cannot be transmitted in this state.
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is still data waiting to retransmit and remain in
retransmit queue, while doing the next retransmit, if the
chunk is abandoned, we should move it to abandoned list.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have have to remove a transport due to ASCONF, we move
the data to a new active path. This can trigger CACC algorithm
to not mark that data as missing when SACKs arrive. This is
because the transport passed to the CACC algorithm is the one
this data is sitting on, not the one it was sent on (that one
may be gone). So, by sending the original transport (even if
it's NULL), we may start marking data as missing.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change SCTP_DEBUG_PRINTK and SCTP_DEBUG_PRINTK_IPADDR to
use do { print } while (0) guards.
Add SCTP_DEBUG_PRINTK_CONT to fix errors in log when
lines were continued.
Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Add a missing newline in "Failed bind hash alloc"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes from net/ (but not any netfilter files)
all the unnecessary return; statements that precede the
last closing brace of void functions.
It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.
Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, if the highest tsn in the SACK doesn't change, we'll
end up scanning the transmitted lists on the transports twice:
once for locating the highest _new_ tsn, and once for actually
tagging chunks as acked. This is a waste, since we can record
the highest _new_ tsn at the same time as tagging chunks. Long
ago this was not possible because we would try to mark chunks
as missing at the same time as tagging them acked and this approach
didn't work. Now that the two steps are separate, we can re-use
the old approach.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
According to RFC 4960 Section 7.2.4:
If an endpoint is in Fast
Recovery and a SACK arrives that advances the Cumulative TSN Ack
Point, the miss indications are incremented for all TSNs reported
missing in the SACK.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
We don't need to force the T3 timer any more and it's
actually wrong to do as it causes too long of a delay.
The timer will be started if one is not running, but if
one is running, we leave it alone.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
The 'resent' bit is used to make sure that we don't update
rto estimate based on retransmitted chunks. However, we already
have the 'rto_pending' bit that we test when need to update rto,
so 'resent' bit is just extra. Additionally, we currently have
a bug in that we always set a 'resent' bit and thus rto estimate
is only updated by Heartbeats.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
sctp_chunk_is_data macro is defined to decide that
whether a chunk is data chunk or not.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
While doing retranmit, if control chunk exists, such as
FORWARD TSN chunk, and the DATA chunk can not be bundled with
this control chunk because of PMTU limit, no DATA chunk
will be retranmitted in the current implementation. This
patch makes sure to retranmit at least one DATA chunk in this case.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
PR-SCTP extension section 3.5 Sender Side Implementation of PR-SCTP:
C5) If a FORWARD TSN is sent, the sender MUST assure that at
least one T3-rtx timer is running.
So this patch fix to assure at least one T3-rtx timer is running
if a FORWARD TSN is or will to sent.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Not including net/atm/
Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When retransmitting due to T3 timeout, retransmit all the
in-flight chunks for the corresponding transport/path, including
chunks sent less then 1 rto ago.
This is the correct behaviour according to rfc4960 section 6.3.3
E3 and
"Note: Any DATA chunks that were sent to the address for which the
T3-rtx timer expired but did not fit in one MTU (rule E3 above)
should be marked for retransmission and sent as soon as cwnd
allows (normally, when a SACK arrives). ".
This fixes problems when more then one path is present and the T3
retransmission of the first chunk that timeouts stops the T3 timer
for the initial active path, leaving all the other in-flight
chunks waiting forever or until a new chunk is transmitted on the
same path and timeouts (and this will happen only if the cwnd
allows sending new chunks, but since cwnd was dropped to MTU by
the timeout => it will wait until the first heartbeat).
Example: 10 packets in flight, sent at 0.1 s intervals on the
primary path. The primary path is down and the first packet
timeouts. The first packet is retransmitted on another path, the
T3 timer for the primary path is stopped and cwnd is set to MTU.
All the other 9 in-flight packets will not be retransmitted
(unless more new packets are sent on the primary path which depend
on cwnd allowing it, and even in this case the 9 packets will be
retransmitted only after a new packet timeouts which even in the
best case would be more then RTO).
This commit reverts d0ce92910b and
also removes the now unused transport->last_rto, introduced in
b6157d8e03.
p.s The problem is not only when multiple paths are there. It
can happen in a single homed environment. If the application
stops sending data, it possible to have a hung association.
Signed-off-by: Andrei Pelinescu-Onciul <andrei@iptel.org>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation of max.burst ends up limiting new
data during cwnd decay period. The decay is happening becuase
the connection is idle and we are allowed to fill the congestion
window. The point of max.burst is to limit micro-bursts in response
to large acks. This still happens, as max.burst is still applied
to each transmit opportunity. It will also apply if a very large
send is made (greater then allowed by burst).
Tested-by: Florian Niederbacher <florian.niederbacher@student.uibk.ac.at>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>