Commit Graph

57 Commits

Author SHA1 Message Date
Ilpo Järvinen
33f5f57eeb tcp: kill pointless urg_mode
It all started from me noticing that this urgent check in
tcp_clean_rtx_queue is unnecessarily inside the loop. Then
I took a longer look to it and found out that the users of
urg_mode can trivially do without, well almost, there was
one gotcha.

Bonus: those funny people who use urg with >= 2^31 write_seq -
snd_una could now rejoice too (that's the only purpose for the
between being there, otherwise a simple compare would have done
the thing). Not that I assume that the rest of the tcp code
happily lives with such mind-boggling numbers :-). Alas, it
turned out to be impossible to set wmem to such numbers anyway,
yes I really tried a big sendfile after setting some wmem but
nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
So I hacked a bit variable to long and found out that it seems
to work... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:43:06 -07:00
Ilpo Järvinen
0e1c54c2a4 tcp: reorganize retransmit code loops
Both loops are quite similar, so they can be combined
with little effort. As a result, forward_skb_hint becomes
obsolete as well.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-20 21:24:21 -07:00
Ilpo Järvinen
006f582c73 tcp: convert retransmit_cnt_hint to seqno
Main benefit in this is that we can then freely point
the retransmit_skb_hint to anywhere we want to because
there's no longer need to know what would be the count
changes involve, and since this is really used only as a
terminator, unnecessary work is one time walk at most,
and if some retransmissions are necessary after that
point later on, the walk is not full waste of time
anyway.

Since retransmit_high must be kept valid, all lost
markers must ensure that.

Now I also have learned how those "holes" in the
rexmittable skbs can appear, mtu probe does them. So
I removed the misleading comment as well.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-20 21:20:20 -07:00
Adam Langley
4389dded77 tcp: Remove redundant checks when setting eff_sacks
Remove redundant checks when setting eff_sacks and make the number of SACKs a
compile time constant. Now that the options code knows how many SACK blocks can
fit in the header, we don't need to have the SACK code guessing at it.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19 00:07:02 -07:00
David S. Miller
4ae127d1b6 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/smc911x.c
2008-06-13 20:52:39 -07:00
David S. Miller
ec0a196626 tcp: Revert 'process defer accept as established' changes.
This reverts two changesets, ec3c0982a2
("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
the follow-on bug fix 9ae27e0adb
("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar
as a distcc-over-loopback regression where connections were getting
stuck.

Ilpo Järvinen first spotted the locking problems.  The new function
added by this code, tcp_defer_accept_check(), only has the
child socket locked, yet it is modifying state of the parent
listening socket.

Fixing that is non-trivial at best, because we can't simply just grab
the parent listening socket lock at this point, because it would
create an ABBA deadlock.  The normal ordering is parent listening
socket --> child socket, but this code path would require the
reverse lock ordering.

Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> 		goto death;
> 	}
>
>+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+		tcp_send_active_reset(sk, GFP_ATOMIC);
>+		goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done()
will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
release this sk) as socket is not DEAD. Therefore socket sk will be lost for
freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any
real value or advantage to these new semantics even if we fix all
of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only
is the only thing which is impossible with old approach. Is this really
so valuable? My opinion: no, this is nothing but a new loophole
to consume memory without control.
----------------------------------------

So revert this thing for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-12 16:34:35 -07:00
Ilpo Järvinen
b79eeeb9e4 tcp: Reorganize tcp_sock to fill 64-bit holes & improve locality
I tried to group recovery related fields nearby (non-CA_Open related
variables, to be more accurate) so that one to three cachelines would
not be necessary in CA_Open. These are now contiguously deployed:

  struct sk_buff_head        out_of_order_queue;   /*  1968    80 */
  /* --- cacheline 32 boundary (2048 bytes) --- */
  struct tcp_sack_block      duplicate_sack[1];    /*  2048     8 */
  struct tcp_sack_block      selective_acks[4];    /*  2056    32 */
  struct tcp_sack_block      recv_sack_cache[4];   /*  2088    32 */
  /* --- cacheline 33 boundary (2112 bytes) was 8 bytes ago --- */
  struct sk_buff *           highest_sack;         /*  2120     8 */
  int                        lost_cnt_hint;        /*  2128     4 */
  int                        retransmit_cnt_hint;  /*  2132     4 */
  u32                        lost_retrans_low;     /*  2136     4 */
  u8                         reordering;           /*  2140     1 */
  u8                         keepalive_probes;     /*  2141     1 */

  /* XXX 2 bytes hole, try to pack */

  u32                        prior_ssthresh;       /*  2144     4 */
  u32                        high_seq;             /*  2148     4 */
  u32                        retrans_stamp;        /*  2152     4 */
  u32                        undo_marker;          /*  2156     4 */
  int                        undo_retrans;         /*  2160     4 */
  u32                        total_retrans;        /*  2164     4 */

...and they're then followed by URG slowpath & keepalive related
variables.

Head of the out_of_order_queue always needed for empty checks, if
that's empty (and TCP is in CA_Open), following ~200 bytes (in 64-bit)
shouldn't be necessary for anything. If only OFO queue exists but TCP
is in CA_Open, selective_acks (and possibly duplicate_sack) are
necessary besides the out_of_order_queue but the rest of the block
again shouldn't be (ie., the other direction had losses).

As the cacheline boundaries depend on many factors in the preceeding
stuff, trying to align considering them doesn't make too much sense.

Commented one ordering hazard.

There are number of low utilized u8/16s that could be combined get 2
bytes less in total so that the hole could be made to vanish (includes
at least ecn_flags, urg_data, urg_mode, frto_counter, nonagle).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-29 03:25:23 -07:00
Ilpo Järvinen
4b74944044 tcp: Make prior_ssthresh a u32
If previous window was above representable values of u16,
strange things will happen if undo with the truncated value
is called for. Alternatively, this could be fixed by some
max trickery but that would limit undoing high-speed undos.

Adds 16-bit hole but there isn't anything to fill it with.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-21 17:40:05 -07:00
Patrick McManus
ec3c0982a2 [TCP]: TCP_DEFER_ACCEPT updates - process as established
Change TCP_DEFER_ACCEPT implementation so that it transitions a
connection to ESTABLISHED after handshake is complete instead of
leaving it in SYN-RECV until some data arrvies. Place connection in
accept queue when first data packet arrives from slow path.

Benefits:
  - established connection is now reset if it never makes it
   to the accept queue

 - diagnostic state of established matches with the packet traces
   showing completed handshake

 - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
   enforced with reasonable accuracy instead of rounding up to next
   exponential back-off of syn-ack retry.

Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 16:33:01 -07:00
Ilpo Järvinen
68f8353b48 [TCP]: Rewrite SACK block processing & sack_recv_cache use
Key points of this patch are:

  - In case new SACK information is advance only type, no skb
    processing below previously discovered highest point is done
  - Optimize cases below highest point too since there's no need
    to always go up to highest point (which is very likely still
    present in that SACK), this is not entirely true though
    because I'm dropping the fastpath_skb_hint which could
    previously optimize those cases even better. Whether that's
    significant, I'm not too sure.

Currently it will provide skipping by walking. Combined with
RB-tree, all skipping would become fast too regardless of window
size (can be done incrementally later).

Previously a number of cases in TCP SACK processing fails to
take advantage of costly stored information in sack_recv_cache,
most importantly, expected events such as cumulative ACK and new
hole ACKs. Processing on such ACKs result in rather long walks
building up latencies (which easily gets nasty when window is
huge). Those latencies are often completely unnecessary
compared with the amount of _new_ information received, usually
for cumulative ACK there's no new information at all, yet TCP
walks whole queue unnecessary potentially taking a number of
costly cache misses on the way, etc.!

Since the inclusion of highest_sack, there's a lot information
that is very likely redundant (SACK fastpath hint stuff,
fackets_out, highest_sack), though there's no ultimate guarantee
that they'll remain the same whole the time (in all unearthly
scenarios). Take advantage of this knowledge here and drop
fastpath hint and use direct access to highest SACKed skb as
a replacement.

Effectively "special cased" fastpath is dropped. This change
adds some complexity to introduce better coveraged "fastpath",
though the added complexity should make TCP behave more cache
friendly.

The current ACK's SACK blocks are compared against each cached
block individially and only ranges that are new are then scanned
by the high constant walk. For other parts of write queue, even
when in previously known part of the SACK blocks, a faster skip
function is used (if necessary at all). In addition, whenever
possible, TCP fast-forwards to highest_sack skb that was made
available by an earlier patch. In typical case, no other things
but this fast-forward and mandatory markings after that occur
making the access pattern quite similar to the former fastpath
"special case".

DSACKs are special case that must always be walked.

The local to recv_sack_cache copying could be more intelligent
w.r.t DSACKs which are likely to be there only once but that
is left to a separate patch.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:07 -08:00
Ilpo Järvinen
fd6dad616d [TCP]: Earlier SACK block verification & simplify access to them
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:07 -08:00
Ilpo Järvinen
a47e5a988a [TCP]: Convert highest_sack to sk_buff to allow direct access
It is going to replace the sack fastpath hint quite soon... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:03 -08:00
Ilpo Järvinen
f78a1b3892 [TCP]: Make snd_cwnd_cnt 32-bit
Very little point of having 32-bit snd_cnwd if this is not
32-bit as well, as a number of snd_cwnd incrementation formulas
assume that snd_cwnd_cnt can be at least as large as snd_cwnd.

Whether 32-bit is useful was discussed when e0ef57cc56
was made:
  http://marc.info/?l=linux-netdev&m=117218144409825&w=2

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:59:43 -07:00
Ilpo Järvinen
b08d6cb22c [TCP]: Limit processing lost_retrans loop to work-to-do cases
This addition of lost_retrans_low to tcp_sock might be
unnecessary, it's not clear how often lost_retrans worker is
executed when there wasn't work to do.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:36:13 -07:00
Ilpo Järvinen
c79e335716 [TCP]: Comment fastpath_cnt_hint off-by-one trap
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:47 -07:00
Ilpo Järvinen
13dae42631 [TCP]: Update comment about highest_sack validity
This stale info came from the original idea, which proved to be
unnecessarily complex, sacked_out > 0 is easy to do and that when
it's going to be needed anyway (it _can_ be valid also when
sacked_out == 0 but there's not going to be a guarantee about it
for now).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:00 -07:00
Ilpo Järvinen
b5860bbac7 [TCP]: Tighten tcp_sock's belt, drop left_out
It is easily calculable when needed and user are not that many
after all.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:55 -07:00
Ilpo Järvinen
539d243fdd [TCP]: Access to highest_sack obsoletes forward_cnt_hint
In addition, added a reference about the purpose of the loop.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:53 -07:00
Ilpo Järvinen
d738cd8fca [TCP]: Add highest_sack seqno, points to globally highest SACK
It is guaranteed to be valid only when !tp->sacked_out. In most
cases this seqno is available in the last ACK but there is no
guarantee for that. The new fast recovery loss marking algorithm
needs this as entry point.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:50 -07:00
Arnaldo Carvalho de Melo
9c70220b73 [SK_BUFF]: Introduce skb_transport_header(skb)
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo
aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo
ab6a5bb6b2 [TCP]: Introduce tcp_hdrlen() and tcp_optlen()
The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses and to
avoid the longer, open coded equivalent.

Ditched a no-op in bnx2 in the process.

I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:24 -07:00
David S. Miller
e0ef57cc56 [TCP]: Make snd_cwnd_clamp a u32.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:21 -07:00
Eric Dumazet
54287cc178 [TCP]: Keep copied_seq, rcv_wup and rcv_next together.
I noticed in oprofile study a cache miss in tcp_rcv_established() to read
copied_seq.

ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293  
2.0400 */

 55493  0.0281 :ffffffff80400bc9:   mov    0x4c8(%r12),%eax copied_seq
543103  0.2746 :ffffffff80400bd1:   cmp    0x3e0(%r12),%eax   rcv_nxt    

if (tp->copied_seq == tp->rcv_nxt &&
        len - tcp_header_len <= tp->ucopy.len) {

In this function, the cache line 0x4c0 -> 0x500 is used only for this
reading 'copied_seq' field.

rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of
active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...)

As you suggested, I changed tcp_create_openreq_child() so that these fields
are changed together, to avoid adding a new store buffer stall.

Patch is 64bit friendly (no new hole because of alignment constraints)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:21 -07:00
Baruch Even
6f74651ae6 [TCP]: Seperate DSACK from SACK fast path
Move DSACK code outside the SACK fast-path checking code. If the DSACK
determined that the information was too old we stayed with a partial cache
copied. Most likely this matters very little since the next packet will not be
DSACK and we will find it in the cache. but it's still not good form and there
is little reason to couple the two checks.

Since the SACK receive cache doesn't need the data to be in host order we also
remove the ntohl in the checking loop.

Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:49 -08:00