Commit Graph

2514 Commits

Author SHA1 Message Date
Ilpo Järvinen
23aeeec365 [TCP] FRTO: Plug potential LOST-bit leak
It might be possible that, in some extreme scenario that
I just cannot now construct in my mind, end_seq <=
frto_highmark check does not match causing the lost_out
and LOST bits become out-of-sync due to clearing and
recounting in the loop.

This may fix LOST-bit leak reported by Chazarain Guillaume
<guichaz@yahoo.fr>.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:03:13 -08:00
Ilpo Järvinen
746aa32d28 [TCP] FRTO: Limit snd_cwnd if TCP was application limited
Otherwise TCP might violate packet ordering principles that FRTO
is based on. If conventional recovery path is chosen, this won't
be significant at all. In practice, any small enough value will
be sufficient to provide proper operation for FRTO, yet other
users of snd_cwnd might benefit from a "close enough" value.

FRTO's formula is now equal to what tcp_enter_cwr() uses.

FRTO used to check application limitedness a bit differently but
I changed that in commit 575ee7140d
and as a result checking for application limitedness became
completely non-existing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:01:23 -08:00
Li Zefan
e0bf9cf15f [NETFILTER]: nf_nat: fix memset error
The size passing to memset is the size of a pointer.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 02:57:16 -08:00
Pavel Emelyanov
d71209ded2 [INET]: Use list_head-s in inetpeer.c
The inetpeer.c tracks the LRU list of inet_perr-s, but makes
it by hands. Use the list_head-s for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:27:28 -08:00
Adrian Bunk
22649d1afb [IPVS]: Remove unused exports.
This patch removes the following unused EXPORT_SYMBOL's:
- ip_vs_try_bind_dest
- ip_vs_find_dest

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:25:24 -08:00
Denis V. Lunev
2994c63863 [INET]: Small possible memory leak in FIB rules
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
	ip rule flush

Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:12:03 -08:00
Pavel Emelyanov
358352b8b8 [INET]: Cleanup the xfrm4_tunnel_(un)register
Both check for the family to select an appropriate tunnel list.
Consolidate this check and make the for() loop more readable.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:48:54 -08:00
Pavel Emelyanov
99f933263a [INET]: Add missed tunnel64_err handler
The tunnel64_protocol uses the tunnel4_protocol's err_handler and
thus calls the tunnel4_protocol's handlers.

This is not very good, as in case of (icmp) error the wrong error
handlers will be called (e.g. ipip ones instead of sit) and this
won't be noticed at all, because the error is not reported.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:47:39 -08:00
Pavel Emelyanov
03f49f3457 [NET]: Make helper to get dst entry and "use" it
There are many places that get the dst entry, increase the
__use counter and set the "lastuse" time stamp.

Make a helper for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:28:34 -08:00
Pavel Emelyanov
b1667609cd [IPV4]: Remove bugus goto-s from ip_route_input_slow
Both places look like

        if (err == XXX) 
               goto yyy;
   done:

while both yyy targets look like

        err = XXX;
        goto done;

so this is ok to remove the above if-s.

yyy labels are used in other places and are not removed.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:26:41 -08:00
Ilpo Järvinen
fbd52eb2bd [TCP]: Split SACK FRTO flag clearing (fixes FRTO corner case bug)
In case we run out of mem when fragmenting, the clearing of
FLAG_ONLY_ORIG_SACKED might get missed which then feeds FRTO
with false information. Move clearing outside skb processing
loop so that it will get executed even if the skb loop
terminates prematurely due to out-of-mem.

Besides, now the core of the loop truly deals with a single
skb only, which also enables creation a more self-contained
of tcp_sacktag_one later on.

In addition, small reorganization of if branches was made.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:24:19 -08:00
Ilpo Järvinen
e49aa5d456 [TCP]: Add unlikely() to sacktag out-of-mem in fragment case
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:23:08 -08:00
Ilpo Järvinen
c7caf8d3ed [TCP]: Fix reord detection due to snd_una covered holes
Fixes subtle bug like the one with fastpath_cnt_hint happening
due to the way the GSO and hints interact. Because hints are not
reset when just a GSOed skb is partially ACKed, there's no
guarantee that the relevant part of the write queue is going to
be processed in sacktag at all (skbs below snd_una) because
fastpath hint can fast forward the entrypoint.

This was also on the way of future reductions in sacktag's skb
processing. Also future cleanups in sacktag can be made after
this (in 2.6.25).

This may make reordering update in tcp_try_undo_partial
redundant but I'm not too sure so I left it there.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:22:18 -08:00
Ilpo Järvinen
8dd71c5d28 [TCP]: Consider GSO while counting reord in sacktag
Reordering detection fails to take account that the reordered
skb may have pcount larger than 1. In such case the lowest of
them had the largest reordering, the old formula used the
highest of them which is pcount - 1 packets less reordered.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:20:59 -08:00
Eric Dumazet
230140cffa [INET]: Remove per bucket rwlock in tcp/dccp ehash table.
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.

On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.

Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.

This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:11 -08:00
Rumen G. Bogdanovski
efac52762b [IPVS]: Synchronize closing of Connections
This patch makes the master daemon to sync the connection when it is about
to close.  This makes the connections on the backup to close or timeout
according their state.  Before the sync was performed only if the
connection is in ESTABLISHED state which always made the connections to
timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch
([IPVS]: use proper timeout instead of fixed value) effectively did nothing
more than increasing this to 15 minutes (Established state timeout).  So
this patch makes use of proper timeout since it syncs the connections on
status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout).
However if the backup misses CLOSE hopefully it did not miss FIN_WAIT.
Otherwise we will just have to wait for the ESTABLISHED state timeout. As
it is without this patch.  This way the number of the hanging connections
on the backup is kept to minimum. And very few of them will be left to
timeout with a long timeout.

This is important if we want to make use of the fix for the real server
overcommit on master/backup fail-over.

Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:10 -08:00
Rumen G. Bogdanovski
1e356f9cdf [IPVS]: Bind connections on stanby if the destination exists
This patch fixes the problem with node overload on director fail-over.
Given the scenario: 2 nodes each accepting 3 connections at a time and 2
directors, director failover occurs when the nodes are fully loaded (6
connections to the cluster) in this case the new director will assign
another 6 connections to the cluster, If the same real servers exist
there.

The problem turned to be in not binding the inherited connections to
the real servers (destinations) on the backup director. Therefore:
"ipvsadm -l" reports 0 connections:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  test2.local:5999 wlc
  -> node473.local:5999           Route   1000   0          0
  -> node484.local:5999           Route   1000   0          0

while "ipvs -lnc" is right
root@test2:~# ipvsadm -lnc
IPVS connection entries
pro expire state       source             virtual            destination
TCP 14:56  ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999
192.168.0.51:5999
TCP 14:59  ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999
192.168.0.52:5999

So the patch I am sending fixes the problem by binding the received
connections to the appropriate service on the backup director, if it
exists, else the connection will be handled the old way. So if the
master and the backup directors are synchronized in terms of real
services there will be no problem with server over-committing since
new connections will not be created on the nonexistent real services
on the backup. However if the service is created later on the backup,
the binding will be performed when the next connection update is
received. With this patch the inherited connections will show as
inactive on the backup:

root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  test2.local:5999 wlc
  -> node473.local:5999           Route   1000   0          1
  -> node484.local:5999           Route   1000   0          1

rumen@test2:~$ cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  C0A800DE:176F wlc
  -> C0A80033:176F      Route   1000   0          1
  -> C0A80032:176F      Route   1000   0          1

Regards,
Rumen Bogdanovski

Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2007-11-07 04:15:09 -08:00
Herbert Xu
4999f3621f [IPSEC]: Fix crypto_alloc_comp error checking
The function crypto_alloc_comp returns an errno instead of NULL
to indicate error.  So it needs to be tested with IS_ERR.

This is based on a patch by Vicenç Beltran Querol.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:03 -08:00
Pavel Emelyanov
c3e9a353d8 [IPV4]: Compact some ifdefs in the fib code.
There are places that check for CONFIG_IP_MULTIPLE_TABLES
twice in the same file, but the internals of these #ifdefs
can be merged.

As a side effect - remove one ifdef from inside a function.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:11:41 -08:00
Eric Dumazet
47a31a6ffc [IPV4]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure
Trivial patch to make "tcp,udp,udplite,raw" protocols uses the fast
"inuse sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:58 -08:00
Eric Dumazet
286ab3d460 [NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA way.
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.

If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.

In this patch, I tried to :

- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation

Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP

If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.

When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:57 -08:00
Pavel Emelyanov
6a9fb9479f [IPV4]: Clean the ip_sockglue.c from some ugly ifdefs
The #idfed CONFIG_IP_MROUTE is sometimes places inside the if-s,
which looks completely bad. Similar ifdefs inside the functions
looks a bit better, but they are also not recommended to be used.

Provide an ifdef-ed ip_mroute_opt() helper to cleanup the code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:55 -08:00
Pavel Emelyanov
429f08e950 [IPV4]: Consolidate the ip cork destruction in ip_output.c
The ip_push_pending_frames and ip_flush_pending_frames do the
same things to flush the sock's cork. Move this into a separate
function and save ~80 bytes from the .text

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:25 -08:00
Patrick McHardy
d1332e0ab8 [NETFILTER]: remove unneeded rcu_dereference() calls
As noticed by Paul McKenney, the rcu_dereference calls in the init path
of NAT modules are unneeded, remove them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:23 -08:00
Jan Engelhardt
0795c65d9f [NETFILTER]: Clean up Makefile
Sort matches and targets in the NF makefiles.

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:22 -08:00