As we invalidate the inetpeer tree along with the routing cache now,
we don't need a genid to reset the redirect handling when the routing
cache is flushed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.
To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kmemcheck complains that ->redirect_genid doesn't get initialized.
Presumably it should be set to zero.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=2 CF="-D__CHECK_ENDIAN__" M=net
And fix flowi4_init_output() prototype for sport
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.
MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)
Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation. So the periodic
regeneration and 8-bit counter have been removed. We compute and
use a full 32-bit sequence number.
For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 fragment identification generation is way beyond what we use for
IPv4 : It uses a single generator. Its not scalable and allows DOS
attacks.
Now inetpeer is IPv6 aware, we can use it to provide a more secure and
scalable frag ident generator (per destination, instead of system wide)
This patch :
1) defines a new secure_ipv6_id() helper
2) extends inet_getid() to provide 32bit results
3) extends ipv6_select_ident() with a new dest parameter
Reported-by: Fernando Gont <fernando@gont.com.ar>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently can free inetpeer entries too early :
[ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
[ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
[ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
[ 782.636694] ^
[ 782.636696]
[ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
[ 782.636702] EIP: 0060:[<c13fefbb>] EFLAGS: 00010286 CPU: 0
[ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0
[ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
[ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
[ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
[ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 782.636717] DR6: ffff4ff0 DR7: 00000400
[ 782.636718] [<c13fbf76>] rt_set_nexthop.clone.45+0x56/0x220
[ 782.636722] [<c13fc449>] __ip_route_output_key+0x309/0x860
[ 782.636724] [<c141dc54>] tcp_v4_connect+0x124/0x450
[ 782.636728] [<c142ce43>] inet_stream_connect+0xa3/0x270
[ 782.636731] [<c13a8da1>] sys_connect+0xa1/0xb0
[ 782.636733] [<c13a99dd>] sys_socketcall+0x25d/0x2a0
[ 782.636736] [<c149deb8>] sysenter_do_call+0x12/0x28
[ 782.636738] [<ffffffff>] 0xffffffff
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andi Kleen and Tim Chen reported huge contention on inetpeer
unused_peers.lock, on memcached workload on a 40 core machine, with
disabled route cache.
It appears we constantly flip peers refcnt between 0 and 1 values, and
we must insert/remove peers from unused_peers.list, holding a contended
spinlock.
Remove this list completely and perform a garbage collection on-the-fly,
at lookup time, using the expired nodes we met during the tree
traversal.
This removes a lot of code, makes locking more standard, and obsoletes
two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
removes two pointers in inet_peer structure.
There is still a false sharing effect because refcnt is in first cache
line of object [were the links and keys used by lookups are located], we
might move it at the end of inet_peer structure to let this first cache
line mostly read by cpus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several crashes in cleanup_once() were reported in recent kernels.
Commit d6cc1d642d (inetpeer: various changes) added a race in
unlink_from_unused().
One way to avoid taking unused_peers.lock before doing the list_empty()
test is to catch 0->1 refcnt transitions, using full barrier atomic
operations variants (atomic_cmpxchg() and atomic_inc_return()) instead
of previous atomic_inc() and atomic_add_unless() variants.
We then call unlink_from_unused() only for the owner of the 0->1
transition.
Add a new atomic_add_unless_return() static helper
With help from Arun Sharma.
Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772
Reported-by: Arun Sharma <asharma@fb.com>
Reported-by: Maximilian Engelhardt <maxi@daemonizer.de>
Reported-by: Yann Dupont <Yann.Dupont@univ-nantes.fr>
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 7b46ac4e77 (inetpeer: Don't disable BH for initial
fast RCU lookup.), we should use call_rcu() to wait proper RCU grace
period.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4
(Destination unreachable (Fragmentation needed)),
icmp_unreach
-> ip_rt_frag_needed
(peer->pmtu_expires is set here)
-> tcp_v4_err
-> do_pmtu_discovery
-> ip_rt_update_pmtu
(peer->pmtu_expires is already set,
so check_peer_pmtu is skipped.)
-> check_peer_pmtu
check_peer_pmtu is skipped and MTU is not updated.
To fix this, let check_peer_pmtu execute unconditionally.
And some minor fixes
1) Avoid potential peer->pmtu_expires set to be zero.
2) In check_peer_pmtu, argument of time_before is reversed.
3) check_peer_pmtu expects peer->pmtu_orig is initialized as zero,
but not initialized.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If modifications on other cpus are ok, then modifications to
the tree during lookup done by the local cpu are ok too.
Signed-off-by: David S. Miller <davem@davemloft.net>
David noticed :
------------------
Eric, I was profiling the non-routing-cache case and something that
stuck out is the case of calling inet_getpeer() with create==0.
If an entry is not found, we have to redo the lookup under a spinlock
to make certain that a concurrent writer rebalancing the tree does
not "hide" an existing entry from us.
This makes the case of a create==0 lookup for a not-present entry
really expensive. It is on the order of 600 cpu cycles on my
Niagara2.
I added a hack to not do the relookup under the lock when create==0
and it now costs less than 300 cycles.
This is now a pretty common operation with the way we handle COW'd
metrics, so I think it's definitely worth optimizing.
-----------------
One solution is to use a seqlock instead of a spinlock to protect struct
inet_peer_base.
After a failed avl tree lookup, we can easily detect if a writer did
some changes during our lookup. Taking the lock and redo the lookup is
only necessary in this case.
Note: Add one private rcu_deref_locked() macro to place in one spot the
access to spinlock included in seqlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Validity of the cached PMTU information is indicated by it's
expiration value being non-zero, just as per dst->expires.
The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.
In this way PMTU expiration does not kill off the cached route as is
done currently.
Redirect information is permanent, or at least until another redirect
is received.
Signed-off-by: David S. Miller <davem@davemloft.net>
Future changes will add caching information, and some of
these new elements will be addresses.
Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.
Signed-off-by: David S. Miller <davem@davemloft.net>
Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.
If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the RTAX_LOCKED metric to INETPEER_METRICS_NEW (basically,
all ones) on fresh inetpeer entries.
This way code can determine if default metrics have been loaded
in from a routing table entry already.
Signed-off-by: David S. Miller <davem@davemloft.net>
Family was hard-coded to AF_INET but should be daddr->family.
This fixes crashes when unlinking ipv6 peer entries, since the
unlink code was looking up the base pointer properly.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>