Commit Graph

110 Commits

Author SHA1 Message Date
Eric Dumazet 7ab4551f3b tcp: fix TFO regression
Fengguang Wu reported various panics and bisected to commit
8336886f78 (tcp: TCP Fast Open Server - support TFO listeners)

Fix this by making sure socket is a TCP socket before accessing TFO data
structures.

[  233.046014] kfree_debugcheck: out of range ptr ea6000000bb8h.
[  233.047399] ------------[ cut here ]------------
[  233.048393] kernel BUG at /c/kernel-tests/src/stable/mm/slab.c:3074!
[  233.048393] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  233.048393] Modules linked in:
[  233.048393] CPU 0
[  233.048393] Pid: 3929, comm: trinity-watchdo Not tainted 3.6.0-rc3+
#4192 Bochs Bochs
[  233.048393] RIP: 0010:[<ffffffff81169653>]  [<ffffffff81169653>]
kfree_debugcheck+0x27/0x2d
[  233.048393] RSP: 0018:ffff88000facbca8  EFLAGS: 00010092
[  233.048393] RAX: 0000000000000031 RBX: 0000ea6000000bb8 RCX:
00000000a189a188
[  233.048393] RDX: 000000000000a189 RSI: ffffffff8108ad32 RDI:
ffffffff810d30f9
[  233.048393] RBP: ffff88000facbcb8 R08: 0000000000000002 R09:
ffffffff843846f0
[  233.048393] R10: ffffffff810ae37c R11: 0000000000000908 R12:
0000000000000202
[  233.048393] R13: ffffffff823dbd5a R14: ffff88000ec5bea8 R15:
ffffffff8363c780
[  233.048393] FS:  00007faa6899c700(0000) GS:ffff88001f200000(0000)
knlGS:0000000000000000
[  233.048393] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  233.048393] CR2: 00007faa6841019c CR3: 0000000012c82000 CR4:
00000000000006f0
[  233.048393] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  233.048393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  233.048393] Process trinity-watchdo (pid: 3929, threadinfo
ffff88000faca000, task ffff88000faec600)
[  233.048393] Stack:
[  233.048393]  0000000000000000 0000ea6000000bb8 ffff88000facbce8
ffffffff8116ad81
[  233.048393]  ffff88000ff588a0 ffff88000ff58850 ffff88000ff588a0
0000000000000000
[  233.048393]  ffff88000facbd08 ffffffff823dbd5a ffffffff823dbcb0
ffff88000ff58850
[  233.048393] Call Trace:
[  233.048393]  [<ffffffff8116ad81>] kfree+0x5f/0xca
[  233.048393]  [<ffffffff823dbd5a>] inet_sock_destruct+0xaa/0x13c
[  233.048393]  [<ffffffff823dbcb0>] ? inet_sk_rebuild_header
+0x319/0x319
[  233.048393]  [<ffffffff8231c307>] __sk_free+0x21/0x14b
[  233.048393]  [<ffffffff8231c4bd>] sk_free+0x26/0x2a
[  233.048393]  [<ffffffff825372db>] sctp_close+0x215/0x224
[  233.048393]  [<ffffffff810d6835>] ? lock_release+0x16f/0x1b9
[  233.048393]  [<ffffffff823daf12>] inet_release+0x7e/0x85
[  233.048393]  [<ffffffff82317d15>] sock_release+0x1f/0x77
[  233.048393]  [<ffffffff82317d94>] sock_close+0x27/0x2b
[  233.048393]  [<ffffffff81173bbe>] __fput+0x101/0x20a
[  233.048393]  [<ffffffff81173cd5>] ____fput+0xe/0x10
[  233.048393]  [<ffffffff810a3794>] task_work_run+0x5d/0x75
[  233.048393]  [<ffffffff8108da70>] do_exit+0x290/0x7f5
[  233.048393]  [<ffffffff82707415>] ? retint_swapgs+0x13/0x1b
[  233.048393]  [<ffffffff8108e23f>] do_group_exit+0x7b/0xba
[  233.048393]  [<ffffffff8108e295>] sys_exit_group+0x17/0x17
[  233.048393]  [<ffffffff8270de10>] tracesys+0xdd/0xe2
[  233.048393] Code: 59 01 5d c3 55 48 89 e5 53 41 50 0f 1f 44 00 00 48
89 fb e8 d4 b0 f0 ff 84 c0 75 11 48 89 de 48 c7 c7 fc fa f7 82 e8 0d 0f
57 01 <0f> 0b 5f 5b 5d c3 55 48 89 e5 0f 1f 44 00 00 48 63 87 d8 00 00
[  233.048393] RIP  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
[  233.048393]  RSP <ffff88000facbca8>

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Tested-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "H.K. Jerry Chu" <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-06 14:21:10 -04:00
Jerry Chu 8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Christoph Paasch 1a7b27c97c ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()
Since 0e73441992 ("ipv4: Use inet_csk_route_child_sock() in DCCP and
TCP."), inet_csk_route_child_sock() is called instead of
inet_csk_route_req().

However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
Thus, inside inet_csk_route_child_sock() opt is always NULL and the
SRR-options are not respected anymore.
Packets sent by the server won't have the correct destination-IP.

This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
inside inet_csk_route_child_sock().

Reported-by: Luca Boccassi <luca.boccassi@gmail.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21 14:49:11 -07:00
David S. Miller ba3f7f04ef ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:54 -07:00
David S. Miller f8126f1d51 ipv4: Adjust semantics of rt->rt_gateway.
In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
2012-07-20 13:31:20 -07:00
Eric Dumazet 5abf7f7e0f ipv4: fix rcu splat
free_nh_exceptions() should use rcu_dereference_protected(..., 1)
since its called after one RCU grace period.

Also add some const-ification in recent code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 13:47:33 -07:00
David S. Miller 6700c2709c net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()
This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more.  In the
future the routes will be without destination address, source address,
etc. keying.  One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information.  This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here.  Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 03:29:28 -07:00
David S. Miller 80d0a69fc5 ipv4: Add helper inet_csk_update_pmtu().
This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need it's own local version of this helper.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16 03:28:06 -07:00
David S. Miller 3e12939a2a inet: Kill FLOWI_FLAG_PRECOW_METRICS.
No longer needed.  TCP writes metrics, but now in it's own special
cache that does not dirty the route metrics.  Therefore there is no
longer any reason to pre-cow metrics in this way.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:12 -07:00
Eric Dumazet 7586eceb0a ipv4: tcp: dont cache output dst for syncookies
Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 21:47:33 -07:00
Eric Dumazet 7433819a1e tcp: do not create inetpeer on SYNACK message
Another problem on SYNFLOOD/DDOS attack is the inetpeer cache getting
larger and larger, using lots of memory and cpu time.

tcp_v4_send_synack()
->inet_csk_route_req()
 ->ip_route_output_flow()
  ->rt_set_nexthop()
   ->rt_init_metrics()
    ->inet_getpeer( create = true)

This is a side effect of commit a4daad6b09 (net: Pre-COW metrics for
TCP) added in 2.6.39

Possible solution :

Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS

Before patch :

# grep peer /proc/slabinfo
inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0

Samples: 41K of event 'cycles', Event count (approx.): 30716565122
+  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
+   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
+   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
+   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
+   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
+   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
+   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
+   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
+   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
+   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
+   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
+   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
+   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
+   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
+   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
+   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
+   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
+   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
+   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
+   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01 14:22:11 -04:00
Pavel Emelyanov 4a17fd5229 sock: Introduce named constants for sk_reuse
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Eric Dumazet 95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Alex Copot aacd9289af tcp: bind() use stronger condition for bind_conflict
We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.

We achieve this by adding a relaxation parameter to
inet_csk_bind_conflict. When 'relax' parameter is off
we return a conflict whenever the current searched
pair (addr, port) is not unique.

This tries to address the problems reported in patch:
	8d238b25b1
	Revert "tcp: bind() fix when many ports are bound"

Tests where ran for creating and binding(0) many sockets
on 100 IPs. The results are, on average:

	* 60000 sockets, 600 ports / IP:
		* 0.210 s, 620 (IP, port) duplicates without patch
		* 0.219 s, no duplicates with patch
	* 100000 sockets, 1000 ports / IP:
		* 0.371 s, 1720 duplicates without patch
		* 0.373 s, no duplicates with patch
	* 200000 sockets, 2000 ports / IP:
		* 0.766 s, 6900 duplicates without patch
		* 0.768 s, no duplicates with patch
	* 500000 sockets, 5000 ports / IP:
		* 2.227 s, 41500 duplicates without patch
		* 2.284 s, no duplicates with patch

Signed-off-by: Alex Copot <alex.mihai.c@gmail.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:28:55 -04:00
Eric Dumazet c72e118334 inet: makes syn_ack_timeout mandatory
There are two struct request_sock_ops providers, tcp and dccp.

inet_csk_reqsk_queue_prune() can avoid testing syn_ack_timeout being
NULL if we make it non NULL like syn_ack_timeout

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:24:26 -04:00
Eric Dumazet fd4f2cead6 tcp: RFC6298 supersedes RFC2988bis
Updates some comments to track RFC6298

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-14 15:24:26 -04:00
Flavio Leitner fddb7b5761 tcp: bind() optimize port allocation
Port autoselection finds a port and then drop the lock,
then right after that, gets the hash bucket again and lock it.

Fix it to go direct.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-25 21:50:43 -05:00
Flavio Leitner 2b05ad33e1 tcp: bind() fix autoselection to share ports
The current code checks for conflicts when the application
requests a specific port.  If there is no conflict, then
the request is granted.

On the other hand, the port autoselection done by the kernel
fails when all ports are bound even when there is a port
with no conflict available.

The fix changes port autoselection to check if there is a
conflict and use it if not.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-25 21:50:43 -05:00
Eric Dumazet dfd56b8b38 net: use IS_ENABLED(CONFIG_IPV6)
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-11 18:25:16 -05:00
Eric Dumazet e56c57d0d3 net: rename sk_clone to sk_clone_lock
Make clear that sk_clone() and inet_csk_clone() return a locked socket.

Add _lock() prefix and kerneldoc.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08 17:07:07 -05:00
Eric Dumazet c4dbe54ed7 seqlock: Get rid of SEQLOCK_UNLOCKED
All static seqlock should be initialized with the lockdep friendly
__SEQLOCK_UNLOCKED() macro.

Remove legacy SEQLOCK_UNLOCKED() macro.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/%3C1306238888.3026.31.camel%40edumazet-laptop%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-24 15:22:17 +02:00
David S. Miller 6bd023f3dd ipv4: Make caller provide flowi4 key to inet_csk_route_req().
This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-18 18:32:03 -04:00
David S. Miller 77357a9552 ipv4: Create inet_csk_route_child_sock().
This is just like inet_csk_route_req() except that it operates after
we've created the new child socket.

In this way we can use the new socket's cork flow for proper route
key storage.

This will be used by DCCP and TCP child socket creation handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 14:34:22 -07:00
David S. Miller 072d8c9414 ipv4: Get route daddr from flow key in inet_csk_route_req().
Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 23:50:09 -07:00
Eric Dumazet f6d8bd051c inet: add RCU protection to inet->opt
We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 13:16:35 -07:00