inetpeer sequence numbers are no longer incremented, so no need to
check and flush the tree. The function that increments the sequence
number was already dead code and removed in in "ipv4: remove unused
function" (068a6e18). Remove the code that checks for a change, too.
Verifying that v4_seq and v6_seq are never incremented and thus that
flush_check compares bp->flush_seq to 0 is trivial.
The second part of the change removes flush_check completely even
though bp->flush_seq is exactly !0 once, at initialization. This
change is correct because the time this branch is true is when
bp->root == peer_avl_empty_rcu, in which the branch and
inetpeer_invalidate_tree are a NOOP.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed we were sending wrong IPv4 ID in TCP flows when MTU discovery
is disabled.
Note how GSO/TSO packets do not have monotonically incrementing ID.
06:37:41.575531 IP (id 14227, proto: TCP (6), length: 4396)
06:37:41.575534 IP (id 14272, proto: TCP (6), length: 65212)
06:37:41.575544 IP (id 14312, proto: TCP (6), length: 57972)
06:37:41.575678 IP (id 14317, proto: TCP (6), length: 7292)
06:37:41.575683 IP (id 14361, proto: TCP (6), length: 63764)
It appears I introduced this bug in linux-3.1.
inet_getid() must return the old value of peer->ip_id_count,
not the new one.
Lets revert this part, and remove the prevention of
a null identification field in IPv6 Fragment Extension Header,
which is dubious and not even done properly.
Fixes: 87c48fa3b4 ("ipv6: make fragment identifications less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.
Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implementation can deal with having many inetpeer roots, which is
a necessary prerequisite for per-FIB table rooted peer tables.
Each family (AF_INET, AF_INET6) has a sequence number which we bump
when we get a family invalidation request.
Each peer lookup cheaply checks whether the flush sequence of the
root we are using is out of date, and if so flushes it and updates
the sequence number.
Signed-off-by: David S. Miller <davem@davemloft.net>
We encode the pointer(s) into an unsigned long with one state bit.
The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.
Later the peer roots will be per-FIB table, and this change works to
facilitate that.
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.
Signed-off-by: David S. Miller <davem@davemloft.net>
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.
and modify some places to provide net for inet_getpeer_v[4,6]
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
now inetpeer doesn't support namespace,the information will
be leaking across namespace.
this patch move the global vars v4_peers and v6_peers to
netns_ipv4 and netns_ipv6 as a field peers.
add struct pernet_operations inetpeer_ops to initial pernet
inetpeer data.
and change family_to_base and inet_getpeer to support namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 5faa5df1fa (inetpeer: Invalidate the inetpeer tree along with
the routing cache) added a race :
Before freeing an inetpeer, we must respect a RCU grace period, and make
sure no user will attempt to increase refcnt.
inetpeer_invalidate_tree() waits for a RCU grace period before inserting
inetpeer tree into gc_list and waking the worker. At that time, no
concurrent lookup can find a inetpeer in this tree.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we invalidate the inetpeer tree along with the routing cache now,
we don't need a genid to reset the redirect handling when the routing
cache is flushed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.
To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now inetpeer is the place where we cache redirect information for ipv4
destinations, we must be able to invalidate informations when a route is
added/removed on host.
As inetpeer is not yet namespace aware, this patch adds a shared
redirect_genid, and a per inetpeer redirect_genid. This might be changed
later if inetpeer becomes ns aware.
Cache information for one inerpeer is valid as long as its
redirect_genid has the same value than global redirect_genid.
Reported-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com>
Tested-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 fragment identification generation is way beyond what we use for
IPv4 : It uses a single generator. Its not scalable and allows DOS
attacks.
Now inetpeer is IPv6 aware, we can use it to provide a more secure and
scalable frag ident generator (per destination, instead of system wide)
This patch :
1) defines a new secure_ipv6_id() helper
2) extends inet_getid() to provide 32bit results
3) extends ipv6_select_ident() with a new dest parameter
Reported-by: Fernando Gont <fernando@gont.com.ar>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Profiles show false sharing in addr_compare() because refcnt/dtime
changes dirty the first inet_peer cache line, where are lying the keys
used at lookup time. If many cpus are calling inet_getpeer() and
inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
"perf top".
Before patch, my udpflood bench (16 threads) on my 2x4x2 machine :
5784.00 9.7% csum_partial_copy_generic [kernel]
3356.00 5.6% addr_compare [kernel]
2638.00 4.4% fib_table_lookup [kernel]
2625.00 4.4% ip_fragment [kernel]
1934.00 3.2% neigh_lookup [kernel]
1617.00 2.7% udp_sendmsg [kernel]
1608.00 2.7% __ip_route_output_key [kernel]
1480.00 2.5% __ip_append_data [kernel]
1396.00 2.3% kfree [kernel]
1195.00 2.0% kmem_cache_free [kernel]
1157.00 1.9% inet_getpeer [kernel]
1121.00 1.9% neigh_resolve_output [kernel]
1012.00 1.7% dev_queue_xmit [kernel]
# time ./udpflood.sh
real 0m44.511s
user 0m20.020s
sys 11m22.780s
# time ./udpflood.sh
real 0m44.099s
user 0m20.140s
sys 11m15.870s
After patch, no more addr_compare() in profiles :
4171.00 10.7% csum_partial_copy_generic [kernel]
1787.00 4.6% fib_table_lookup [kernel]
1756.00 4.5% ip_fragment [kernel]
1234.00 3.2% udp_sendmsg [kernel]
1191.00 3.0% neigh_lookup [kernel]
1118.00 2.9% __ip_append_data [kernel]
1022.00 2.6% kfree [kernel]
993.00 2.5% __ip_route_output_key [kernel]
841.00 2.2% neigh_resolve_output [kernel]
816.00 2.1% kmem_cache_free [kernel]
658.00 1.7% ia32_sysenter_target [kernel]
632.00 1.6% kmem_cache_alloc_node [kernel]
# time ./udpflood.sh
real 0m41.587s
user 0m19.190s
sys 10m36.370s
# time ./udpflood.sh
real 0m41.486s
user 0m19.290s
sys 10m33.650s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>