This patch adds the TCP-specific portion of GRO. The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation. Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.
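To illustrate the merge rule, here is a minimal user-space sketch
(not the kernel code; the helper name and the separate contiguity
check are my own, and real GRO must also compare any TCP options
byte for byte):

    #include <stdbool.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/tcp.h>

    /* Two segments are mergeable only if the fixed TCP headers match
     * exactly once the checksum is masked out, and the second segment
     * starts precisely where the first one's payload ends. */
    static bool tcp_gro_mergeable(const struct tcphdr *a,
                                  const struct tcphdr *b,
                                  unsigned int a_payload_len)
    {
            struct tcphdr ta = *a, tb = *b;

            ta.check = tb.check = 0;  /* checksum may legitimately differ */
            ta.seq = tb.seq = 0;      /* contiguity is checked below */
            if (memcmp(&ta, &tb, sizeof(ta)))
                    return false;
            return ntohl(b->seq) == ntohl(a->seq) + a_payload_len;
    }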
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers.
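The conversion pattern, roughly (a sketch rather than the literal
patch; the wrapper function names here are made up, and note that
percpu_counter_init() grew a gfp_t argument in later kernels):

    #include <linux/percpu_counter.h>

    /* Before: one shared atomic_t, so every CPU bounces the same
     * cache line:
     *     static atomic_t tcp_orphan_count = ATOMIC_INIT(0);
     *     atomic_inc(&tcp_orphan_count);
     *
     * After: per-CPU deltas, folded into a global count on reads. */
    static struct percpu_counter tcp_orphan_count;

    static int __init orphan_count_setup(void)
    {
            return percpu_counter_init(&tcp_orphan_count, 0);
    }

    static void sock_orphaned(void)
    {
            percpu_counter_inc(&tcp_orphan_count); /* CPU-local update */
    }

    static s64 orphans_now(void)
    {
            /* exact but slower read; approximate readers can use
             * percpu_counter_read_positive() instead */
            return percpu_counter_sum_positive(&tcp_orphan_count);
    }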
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers.
Note: we revert commit 248969ae31 ("net: af_unix can make
unix_nr_socks visible in /proc"), since that counter is no longer
used after the addition of sock_prot_inuse_add().
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS-sized chunks.
We then run into trouble when the cleanup work for them has to be
done as a large cumulative ACK arrives. Instead, try to return to
the pre-split state already while more and more SACK info is being
discovered, by combining each newly SACKed area with the previous
skb if that one is SACKed as well (see the sketch below).
This approach has a number of benefits:
1) The processing overhead is spread more equally over the RTT
2) The write queue has fewer skbs to process (this affects
everything that has to walk the queue past the SACKed areas)
3) The write queue stays consistent the whole time, so no other
part of TCP has to be aware of this (which was not the case with
an earlier approach that was quite intrusive all around)
4) clean_rtx_queue can release most of the pages with a single
put_page() instead of the previous PAGE_SIZE/mss+1 calls
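In code, the combining step boils down to something like this
(a much-simplified sketch: the function name is made up, skb_shift()
is the primitive this series introduces, and the real code has many
more preconditions plus careful seq/pcount bookkeeping):

    static bool try_merge_into_prev_sacked(struct sock *sk,
                                           struct sk_buff *skb)
    {
            struct sk_buff *prev;
            int len = skb->len;

            if (skb == tcp_write_queue_head(sk))
                    return false;           /* nothing before us */
            prev = tcp_write_queue_prev(sk, skb);
            if (!(TCP_SKB_CB(prev)->sacked & TCPCB_SACKED_ACKED))
                    return false;           /* prev must be SACKed too */
            if (skb_cloned(skb) || skb_cloned(prev))
                    return false;           /* clones cannot be edited */
            if (!skb_shift(prev, skb, len)) /* move paged data into prev */
                    return false;

            TCP_SKB_CB(prev)->end_seq += len; /* real code fixes up more */
            tcp_unlink_write_queue(skb, sk);
            sk_wmem_free_skb(sk, skb);
            return true;
    }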
In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too, which allows construction of skbs
that are even larger than what TSO split them to, and it handles
the hole-on-every-nth-segment patterns that often occur during
slow start overshoot pretty nicely. For this to be really useful,
though, a retransmission would also have to get lost, since
cumulative ACKs advance one hole at a time in the most typical
case.
TODO: handle upwards-only merging. That should be rather easy
when the segment is fully SACKed, but I'm leaving it as a future
work item (it won't make a very large difference anyway, since
the current approach already covers quite a lot of the normal
cases).
I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment, but later realized
that storing the timestamp of the last segment is not really
necessary at all. The cases that can occur are basically either:
1) ambiguous => no sensible measurement can be taken anyway
2) non-ambiguous due to reordering => keeping the timestamp
of the last segment just skews things further off rather
than doing any good, since the ACK was triggered by one of
the holes (besides some subtle issues that would make
determining the right hole/skb an even harder problem).
Either way, it has nothing to do with this change.
I chose to route some abnormal-looking cases through goto noop;
some could be handled differently (e.g., by stopping the walk at
that skb), but in general they either shouldn't happen at all or
are rare enough to make no difference in practice.
In theory this change (as a whole) could cause some macro-scale
(global) regression because of cache misses taken over the
round-trip time, but it very likely comes out ahead because of
many fewer (local) cache misses for the other write queue walkers
and for the big cumulative ACK that clears the recovery.
Worth noting is that these benefits would be very easy to get
also without TSO/GSO being on, as long as the data is in pages so
that we can merge them. Currently I won't let that happen, because
DSACK splitting at a fragment boundary would mess up the pcounts
due to sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragmenting
is avoided, some of these conditions can be made less strict.
TODO: I will probably have to convert the excessive pointer
passing to struct sacktag_state... :-)
My testing revealed that a considerable number of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...
[The rest is left as future work instead, since I got repeatable
EFAULTs from tcpdump's recvfrom when I added pskb_expand_head to
deal with clones, so I separated that into another, later patch.]
...To counter that, I gave up on the fifth advantage:
5) When growing the previous SACK block, fewer allocations for
new skbs are done; basically a new alloc is needed only when a
new hole is detected and when the previous skb runs out of frag
space
...which now only happens if reclaim is fast enough to dispose of
the clone before the SACK block comes in (the window is one RTT
long); otherwise we'll have to allocate some.
With clones being handled I got these numbers (they will be
somewhat worse without that), taken with fine-grained MIBs:
TCPSackShifted 398
TCPSackMerged 877
TCPSackShiftFallback 320
TCPSACKCOLLAPSEFALLBACKGSO 0
TCPSACKCOLLAPSEFALLBACKSKBBITS 0
TCPSACKCOLLAPSEFALLBACKSKBDATA 0
TCPSACKCOLLAPSEFALLBACKBELOW 0
TCPSACKCOLLAPSEFALLBACKFIRST 1
TCPSACKCOLLAPSEFALLBACKPREVBITS 318
TCPSACKCOLLAPSEFALLBACKMSS 1
TCPSACKCOLLAPSEFALLBACKNOHEAD 0
TCPSACKCOLLAPSEFALLBACKSHIFT 0
TCPSACKCOLLAPSENOOPSEQ 0
TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
TCPSACKCOLLAPSENOOPSMALLLEN 0
TCPSACKCOLLAPSEHOLE 12
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The urg_ptr field is not used anywhere and is merely confusing.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
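The wrapper itself is trivial, on the order of:

    /* include/net/sock.h */
    static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
    {
            return sk->sk_backlog_rcv(sk, skb);
    }

Callers then say sk_backlog_rcv(sk, skb) instead of invoking the
function pointer directly, giving one central place to hook extra
behaviour later.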
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current TCP code relies on the local port of the listening socket
being the same as the destination port of the incoming connection.
Port redirection, as used by many transparent proxying techniques,
obviously breaks this, so we have to store the original destination
port.
This patch extends struct inet_request_sock and stores the incoming
destination port value there. It also modifies the handshake code to
use that value as the source port when sending reply packets.
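A sketch of the idea (simplified, not the literal diff; the field
name and assignment sites here are paraphrased from the description
above):

    struct inet_request_sock {
            /* ... */
            __be16 loc_port;  /* original destination port of the SYN */
            /* ... */
    };

    /* On SYN reception, remember where the packet was actually sent: */
    inet_rsk(req)->loc_port = tcp_hdr(skb)->dest;

    /* When building the SYN-ACK, reply from that port rather than
     * from the listener's (possibly redirected-to) local port: */
    th->source = inet_rsk(req)->loc_port;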
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This minor cleanup simplifies later changes which will convert
struct sk_buff and friends over to using struct list_head.
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly, avoid doing this on a cumulative ACK. Not
clearing the hint means that we no longer need O(n^2) processing
when resolving each fast recovery.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both loops are quite similar, so they can be combined
with little effort. As a result, forward_skb_hint becomes
obsolete as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main benefit of this is that we can then freely point
retransmit_skb_hint anywhere we want, because there is no longer
any need to know what count changes would be involved; and since
the hint is really used only as a terminator, the unnecessary work
is a one-time walk at most, and if some retransmissions become
necessary past that point later on, the walk is not a complete
waste of time anyway.
Since retransmit_high must be kept valid, everything that marks
segments lost must maintain it.
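With that, the retransmission walk can simply use retransmit_high
as its terminator, along these lines (a sketch only; error handling
and the other loop details are omitted):

    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb = tp->retransmit_skb_hint;

    /* Walk from the (now freely placed) hint, stopping at the
     * highest sequence ever marked lost. */
    tcp_for_write_queue_from(skb, sk) {
            if (!before(TCP_SKB_CB(skb)->seq, tp->retransmit_high))
                    break;  /* worst case: one wasted partial walk */
            if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
                    tcp_retransmit_skb(sk, skb);
    }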
I have now also learned how those "holes" in the retransmittable
skbs can appear: MTU probing creates them. So I removed the
misleading comment as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I.e., the difference between partial and full clearing no longer
exists, since the SACK optimizations were dropped by the sacktag
rewrite.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch consolidates the code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in a packet-oriented manner (RFC 4341).
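In packet-oriented form, RFC 3390's initial window of
min(4*MSS, max(2*MSS, 4380 bytes)) reduces to a small helper,
essentially:

    static inline u32 rfc3390_bytes_to_packets(const u32 smss)
    {
            /* 4 segments if MSS <= 1095, 2 if MSS > 2190, else 3 */
            return smss <= 1095 ? 4 : (smss > 2190 ? 2 : 3);
    }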
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This should fix the following bugs:
* Connections with MD5 signatures produce invalid packets whenever SACK
options are included
* MD5 signatures are counted twice in the MSS calculations
Behaviour changes:
* A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK
This is because we can't fit any SACK blocks in a packet with MD5 + TS
options (see the option-space arithmetic below). There was discussion
about disabling SACK rather than TS in order to fit in better with old,
buggy kernels, but that was deemed unnecessary.
* SYNs with MD5 don't include a TS option
See above.
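For reference, the option-space arithmetic behind this (lengths per
RFC 2385 and RFC 1323, padded to 4-byte alignment):

    40 bytes of TCP option space
    - 20 for the MD5 signature option (18 bytes, padded)
    - 12 for timestamps (10 bytes, padded)
    =  8 bytes left, while even a single SACK block needs 2 + 8 = 10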
Additionally, it removes a bunch of duplicated logic for
calculating options, which should help avoid this sort of issue
in the future.
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Proc temporarily uses stats from init_net.
BTW, the TCP_XXX_STATS macros are beautiful again (no do { } while (0)
wrapping) :)
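That is, they can again be plain one-liners along these lines
(a sketch; the exact field path may differ):

    #define TCP_INC_STATS(net, field) \
            SNMP_INC_STATS((net)->mib.tcp_statistics, field)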
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_enter_memory_pressure() calls NET_INC_STATS, but has nowhere
to get the net from.
I decided to add an sk argument, not the net itself, only to factor
all the required sock_net(sk) calls inside the enter_memory_pressure
callback itself.
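The resulting shape, roughly (a sketch of the signature change):

    /* before: void tcp_enter_memory_pressure(void) */
    void tcp_enter_memory_pressure(struct sock *sk)
    {
            if (!tcp_memory_pressure) {
                    NET_INC_STATS(sock_net(sk),
                                  LINUX_MIB_TCPMEMORYPRESSURES);
                    tcp_memory_pressure = 1;
            }
    }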
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>