Commit Graph

1668 Commits

Author SHA1 Message Date
Al Viro f53f4137ba fix endianness bug in inet_lro
all uses of and almost all assignments to lro_desc->tcp_ack assume that it's
net-endian; one converts net-endian to host-endian and sticks it in
lro_desc->tcp_ack.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:52 -07:00
Al Viro 9df7c98a0f inet_lro: trivial endianness annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:52 -07:00
Ilpo Järvinen b08d6cb22c [TCP]: Limit processing lost_retrans loop to work-to-do cases
This addition of lost_retrans_low to tcp_sock might be
unnecessary, it's not clear how often lost_retrans worker is
executed when there wasn't work to do.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:36:13 -07:00
Ilpo Järvinen f785a8e28b [TCP]: Fix lost_retrans loop vs fastpath problems
Detection implemented with lost_retrans must work also when
fastpath is taken, yet most of the queue is skipped including
(very likely) those retransmitted skb's we're interested in.
This problem appeared when the hints got added, which removed
a need to always walk over the whole write queue head.
Therefore decicion for the lost_retrans worker loop entry must
be separated from the sacktag processing more than it was
necessary before.

It turns out to be problematic to optimize the worker loop
very heavily because ack_seqs of skb may have a number of
discontinuity points. Maybe similar approach as currently is
implemented could be attempted but that's becoming more and
more complex because the trend is towards less skb walking
in sacktag marker. Trying a simple work until all rexmitted
skbs heve been processed approach.

Maybe after(highest_sack_end_seq, tp->high_seq) checking is not
sufficiently accurate and causes entry too often in no-work-to-do
cases. Since that's not known, I've separated solution to that
from this patch.

Noticed because of report against a related problem from TAKANO
Ryousei <takano@axe-inc.co.jp>. He also provided a patch to
that part of the problem. This patch includes solution to it
(though this patch has to use somewhat different placement).
TAKANO's description and patch is available here:

  http://marc.info/?l=linux-netdev&m=119149311913288&w=2

...In short, TAKANO's problem is that end_seq the loop is using
not necessarily the largest SACK block's end_seq because the
current ACK may still have higher SACK blocks which are later
by the loop.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:35:41 -07:00
Ilpo Järvinen 4cd829995b [TCP]: No need to re-count fackets_out/sacked_out at RTO
Both sacked_out and fackets_out are directly known from how
parameter. Since fackets_out is accurate, there's no need for
recounting (sacked_out was previously unnecessarily counted
in the loop anyway).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:34:57 -07:00
Ilpo Järvinen d193594299 [TCP]: Extract tcp_match_queue_to_sack from sacktag code
This is necessary for upcoming DSACK bugfix. Reduces sacktag
length which is not very sad thing at all... :-)

Notice that there's a need to handle out-of-mem at caller's
place.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:34:25 -07:00
Ilpo Järvinen f6fb128d27 [TCP]: Kill almost unused variable pcount from sacktag
It's on the way for future cutting of that function.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:33:55 -07:00
Ilpo Järvinen 3eec0047d9 [TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L
This condition (plain R) can arise at least in recovery that
is triggered after tcp_undo_loss. There isn't any reason why
they should not be marked as lost, not marking makes in_flight
estimator to return too large values.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:33:11 -07:00
Ilpo Järvinen 16e906812f [TCP]: Add bytes_acked (ABC) clearing to FRTO too
I was reading tcp_enter_loss while looking for Cedric's bug and
noticed bytes_acked adjustment is missing from FRTO side.

Since bytes_acked will only be used in tcp_cong_avoid, I think
it's safe to assume RTO would be spurious. During FRTO cwnd
will be not controlled by tcp_cong_avoid and if FRTO calls for
conventional recovery, cwnd is adjusted and the result of wrong
assumption is cleared from bytes_acked. If RTO was in fact
spurious, we did normal ABC already and can continue without
any additional adjustments.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-11 17:32:31 -07:00
David S. Miller 28f7b0360f [NETLINK]: fib_frontend build fixes
1) fibnl needs to be declared outside of config ifdefs,
   and also should not be explicitly initialized to NULL
2) nl_fib_input() args are wrong for netlink_kernel_create()
   input method

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:32:39 -07:00
Denis V. Lunev cd40b7d398 [NET]: make netlink user -> kernel interface synchronious
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced 
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:15:29 -07:00
Stephen Hemminger 227b60f510 [INET]: local port range robustness
Expansion of original idea from Denis V. Lunev <den@openvz.org>

Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.

The locking might seem like overkill, but there are
cases where sysadmin might want to change value in the
middle of a DoS attack.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 17:30:46 -07:00
Herbert Xu 631a6698d0 [IPSEC]: Move IP protocol setting from transforms into xfrm4_input.c
This patch makes the IPv4 x->type->input functions return the next protocol
instead of setting it directly.  This is identical to how we do things in
IPv6 and will help us merge common code on the input path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:56 -07:00
Herbert Xu ceb1eec829 [IPSEC]: Move IP length/checksum setting out of transforms
This patch moves the setting of the IP length and checksum fields out of
the transforms and into the xfrmX_output functions.  This would help future
efforts in merging the transforms themselves.

It also adds an optimisation to ipcomp due to the fact that the transport
offset is guaranteed to be zero.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:56 -07:00
Herbert Xu 87bdc48d30 [IPSEC]: Get rid of ipv6_{auth,esp,comp}_hdr
This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since
they're identical to the IPv4 versions.  Duplicating them would only create
problems for ourselves later when we need to add things like extended
sequence numbers.

I've also added transport header type conversion headers for these types
which are now used by the transforms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:55 -07:00
Herbert Xu 37fedd3aab [IPSEC]: Use IPv6 calling convention as the convention for x->mode->output
The IPv6 calling convention for x->mode->output is more general and could
help an eventual protocol-generic x->type->output implementation.  This
patch adopts it for IPv4 as well and modifies the IPv4 type output functions
accordingly.

It also rewrites the IPv6 mac/transport header calculation to be based off
the network header where practical.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:54 -07:00
Herbert Xu 7b277b1a5f [IPSEC]: Set skb->data to payload in x->mode->output
This patch changes the calling convention so that on entry from
x->mode->output and before entry into x->type->output skb->data
will point to the payload instead of the IP header.

This is essentially a redistribution of skb_push/skb_pull calls
with the aim of minimising them on the common path of tunnel +
ESP.

It'll also let us use the same calling convention between IPv4
and IPv6 with the next patch.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:54 -07:00
Herbert Xu 8bd1707504 [IPSEC] esp: Remove NAT-T checksum invalidation for BEET
I pointed this out back when this patch was first proposed but it looks like
it got lost along the way.

The checksum only needs to be ignored for NAT-T in transport mode where
we lose the original inner addresses due to NAT.  With BEET the inner
addresses will be intact so the checksum remains valid.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:53 -07:00
Ilpo Järvinen 1c1e87edb9 [TCP]: Separate lost_retrans loop into own function
Follows own function for each task principle, this is really
somewhat separate task being done in sacktag. Also reduces
indentation.

In addition, added ack_seq local var to break some long
lines & fixed coding style things.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:51 -07:00
Pavel Emelyanov e2da591338 [NETFILTER]: Make netfilter code use the seq_open_private
Just switch to the consolidated calls.

ipt_recent() has to initialize the private, so use
the __seq_open_private() helper.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:34 -07:00
Pavel Emelyanov cf7732e4cc [NET]: Make core networking code use seq_open_private
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:33 -07:00
Herbert Xu b7c6538cd8 [IPSEC]: Move state lock into x->type->output
This patch releases the lock on the state before calling x->type->output.
It also adds the lock to the spots where they're currently needed.

Most of those places (all except mip6) are expected to disappear with
async crypto.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:03 -07:00
Herbert Xu 007f0211a8 [IPSEC]: Store IPv6 nh pointer in mac_header on output
Current the x->mode->output functions store the IPv6 nh pointer in the
skb network header.  This is inconvenient because the network header then
has to be fixed up before the packet can leave the IPsec stack.  The mac
header field is unused on output so we can use that to store this instead.

This patch does that and removes the network header fix-up in xfrm_output.

It also uses ipv6_hdr where appropriate in the x->type->output functions.

There is also a minor clean-up in esp4 to make it use the same code as
esp6 to help any subsequent effort to merge the two.

Lastly it kills two redundant skb_set_* statements in BEET that were
simply copied over from transport mode.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:00 -07:00
Herbert Xu 436a0a4022 [IPSEC]: Move output replay code into xfrm_output
The replay counter is one of only two remaining things in the output code
that requires a lock on the xfrm state (the other being the crypto).  This
patch moves it into the generic xfrm_output so we can remove the lock from
the transforms themselves.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:54 -07:00
Herbert Xu 406ef77c89 [IPSEC]: Move common output code to xfrm_output
Most of the code in xfrm4_output_one and xfrm6_output_one are identical so
this patch moves them into a common xfrm_output function which will live
in net/xfrm.

In fact this would seem to fix a bug as on IPv4 we never reset the network
header after a transform which may upset netfilter later on.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:53 -07:00