Commit Graph

67 Commits

Eric Dumazet
3a5d1c0e7c inet: reqsk_alloc() needs to take care of dead listeners
We'll soon no longer take a refcount on listeners,
so reqsk_alloc() cannot assume the listener refcount is
non-zero. We need to use atomic_inc_not_zero().
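
A minimal sketch of the resulting allocation path (illustrative, not
the exact diff; it assumes the attach_listener flag introduced earlier
in this series):

    struct request_sock *req;

    if (attach_listener) {
            /* The listener may already be dying: only take a
             * reference if its refcount is still non-zero.
             */
            if (unlikely(!atomic_inc_not_zero(&sk_listener->sk_refcnt)))
                    return NULL;
    }
    req = kmem_cache_alloc(ops->slab, GFP_ATOMIC | __GFP_NOWARN);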

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 22:11:19 -04:00
Florian Westphal
b1f0a0e99c net: add inet_sk_transparent() helper
Avoids cluttering tcp_v4_send_reset when a follow-up patch extends
it to deal with timewait sockets.
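
The helper centralizes the "is this socket transparent?" test for
full, timewait and request sockets; it looks roughly like:

    static inline bool inet_sk_transparent(const struct sock *sk)
    {
            switch (sk->sk_state) {
            case TCP_TIME_WAIT:
                    return inet_twsk(sk)->tw_transparent;
            case TCP_NEW_SYN_RECV:
                    return inet_rsk(inet_reqsk(sk))->no_srccheck;
            }
            return inet_sk(sk)->transparent;
    }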

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:03:05 -05:00
Eric Dumazet
ebb516af60 tcp/dccp: fix race at listener dismantle phase
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop().

We need to test whether the listener is still active before queueing
a child in inet_csk_reqsk_queue_add().

Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop().
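
A sketch of the resulting queueing logic (simplified):

    struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
                                          struct request_sock *req,
                                          struct sock *child)
    {
            struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;

            spin_lock(&queue->rskq_lock);
            if (unlikely(sk->sk_state != TCP_LISTEN)) {
                    /* Listener is being dismantled: drop the child
                     * instead of queueing it.
                     */
                    inet_child_forget(sk, req, child);
                    child = NULL;
            } else {
                    /* append req at the accept queue tail */
                    req->sk = child;
                    req->dl_next = NULL;
                    if (!queue->rskq_accept_head)
                            queue->rskq_accept_head = req;
                    else
                            queue->rskq_accept_tail->dl_next = req;
                    queue->rskq_accept_tail = req;
            }
            spin_unlock(&queue->rskq_lock);
            return child;
    }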

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:19 -07:00
Eric Dumazet
ed53d0ab76 net: shrink struct sock and request_sock by 8 bytes
One 32-bit hole follows skc_refcnt; use it.
skc_incoming_cpu can also be a union with the request_sock rcv_wnd field.
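
In struct sock_common this ends up roughly as (a sketch, surrounding
fields elided):

    atomic_t        skc_refcnt;
    /* fills the 32-bit hole that followed skc_refcnt: */
    union {
            int     skc_incoming_cpu;   /* full sockets        */
            u32     skc_rcv_wnd;        /* struct request_sock */
            u32     skc_tw_rcv_nxt;     /* tcp_timewait_sock   */
    };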

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:28:22 -07:00
Eric Dumazet
8e5eb54d30 net: align sk_refcnt on 128 bytes boundary
sk->sk_refcnt is dirtied for every incoming TCP/UDP packet.
This is a performance issue if multiple cpus hit a common socket,
or if multiple sockets are chained due to SO_REUSEPORT.

By moving sk_refcnt 8 bytes further, the first 128 bytes of a socket
are mostly read. As they contain the lookup keys, this has
a considerable performance impact: cpus can keep them cached.

These 8 bytes are not wasted; we use them as a placeholder
for various fields, depending on the socket type.
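
Conceptually (an illustrative sketch only, not the real struct sock
layout):

    struct example_sock {
            /* first 128 bytes: read-mostly lookup keys, kept
             * clean in every cpu's cache
             */
            u8       lookup_keys[128];
            /* dirtied for every packet: starts at a 128-byte
             * boundary, so writes never evict the keys above
             */
            atomic_t refcnt;
    };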

Tested:
 SYN flood hitting a NIC with 16 RX queues.
 TCP listener using 16 sockets, with SO_REUSEPORT
 and SO_INCOMING_CPU for proper siloing.

 Could process 6.0 Mpps SYN instead of 4.2 Mpps

 Kernel profile looked like :
    11.68%  [kernel]  [k] sha_transform
     6.51%  [kernel]  [k] __inet_lookup_listener
     5.07%  [kernel]  [k] __inet_lookup_established
     4.15%  [kernel]  [k] memcpy_erms
     3.46%  [kernel]  [k] ipt_do_table
     2.74%  [kernel]  [k] fib_table_lookup
     2.54%  [kernel]  [k] tcp_make_synack
     2.34%  [kernel]  [k] tcp_conn_request
     2.05%  [kernel]  [k] __netif_receive_skb_core
     2.03%  [kernel]  [k] kmem_cache_alloc

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:28:22 -07:00
Eric Dumazet
a1a5344ddb tcp: avoid two atomic ops for syncookies
inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Later,
syncookie validation also uses a temporary request.

These paths already hold a reference on the listener,
so we can avoid a couple of atomic operations.
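
The allocator therefore takes an attach_listener flag, letting the
cookie paths skip the refcount pair (sketch):

    /* syncookie generation/validation: caller already holds
     * the listener, leave sk_refcnt alone
     */
    req = inet_reqsk_alloc(&tcp_request_sock_ops, sk, false);

    /* regular SYN processing: attach and reference the listener */
    req = inet_reqsk_alloc(&tcp_request_sock_ops, sk, true);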

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:27 -07:00
Eric Dumazet
004a5d0140 net: use sk_fullsock() in __netdev_pick_tx()
SYN_RECV & TIMEWAIT sockets are not full-blown sockets; they do not
have a sk_dst_cache pointer.
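
sk_fullsock() filters such pseudo sockets out by state; the helper is
essentially:

    static inline bool sk_fullsock(const struct sock *sk)
    {
            return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
    }

so __netdev_pick_tx() only dereferences sk->sk_dst_cache when it
returns true.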

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:25 -07:00
Eric Dumazet
e96f78ab27 tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets
Before letting request sockets be put in the regular TCP/DCCP
ehash table, we need to either:

- add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
- add an RCU grace period before freeing them.

Since we carefully respect the SLAB_DESTROY_BY_RCU protocol,
as we already do for ESTABLISHED and TIMEWAIT sockets, use it here.

req_prot_init() being used only by TCP and DCCP, I did not add
a new slab_flags field to their rsk_prot, but reuse prot->slab_flags.

Since all reqsk_alloc() users deal correctly with a failure,
add the __GFP_NOWARN flag to avoid allocation traces under pressure.
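
A sketch of the two changes:

    /* req_prot_init(): inherit the parent prot's slab flags,
     * including SLAB_DESTROY_BY_RCU
     */
    rsk_prot->slab = kmem_cache_create(rsk_prot->slab_name,
                                       rsk_prot->obj_size, 0,
                                       prot->slab_flags, NULL);

    /* reqsk_alloc(): callers handle NULL, so stay quiet */
    req = kmem_cache_alloc(ops->slab, GFP_ATOMIC | __GFP_NOWARN);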

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 13:25:20 -07:00
Eric Dumazet
ef547f2ac1 tcp: remove max_qlen_log
This control variable was set at the first listen(fd, backlog)
call, but not updated if the application later tried to increase
or decrease the backlog. It made sense back when the listener had
a non-resizable hash table.

Also, rounding the backlog to a power of two was not very friendly.
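
The queue-full test can then compare the live queue length against
the socket's current backlog instead of a frozen log2 value; roughly:

    static inline int inet_csk_reqsk_queue_is_full(const struct sock *sk)
    {
            return inet_csk_reqsk_queue_len(sk) >= sk->sk_max_ack_backlog;
    }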

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:44 -07:00
Eric Dumazet
10cbc8f179 tcp/dccp: remove struct listen_sock
It is enough to check the listener's sk_state; no extra
condition is needed.

max_qlen_log can be moved into struct request_sock_queue.

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
81b496b31a tcp/dccp: shrink struct listen_sock
We no longer use hash_rnd, nr_table_entries and syn_table[].

For a listener with a backlog of 10 million sockets, this
saves 80 MBytes of vmalloced memory (10M syn_table[] slots
at 8 bytes per pointer).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:42 -07:00
Eric Dumazet
079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into the regular TCP/DCCP
ehash table (where ESTABLISHED and TIMEWAIT sockets live)
instead of using the per-listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on the listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.
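
On the receive side, a single ehash lookup can now return a SYN_RECV
pseudo socket directly (sketch; handle_request_ack() is a hypothetical
placeholder):

    sk = __inet_lookup_established(net, &tcp_hashinfo, saddr, sport,
                                   daddr, hnum, dif);
    if (sk && sk->sk_state == TCP_NEW_SYN_RECV)
            /* ACK completing a pending request, handled without
             * finding or locking the listener
             */
            handle_request_ack(inet_reqsk(sk), skb);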

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet
b267cdd107 tcp/dccp: init sk_prot and call sk_node_init() in reqsk_alloc()
We plan to use generic functions to insert request sockets
into the ehash table.

sk_prot needs to be set (to retrieve sk_prot->h.hashinfo),
and sk_node needs to be cleared.
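
Roughly, reqsk_alloc() gains:

    req_to_sk(req)->sk_prot = sk_listener->sk_prot; /* for sk_prot->h.hashinfo */
    sk_node_init(&req_to_sk(req)->sk_node);         /* not hashed yet */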

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:38 -07:00
Eric Dumazet
8d2675f1e4 tcp: move synflood_warned into struct request_sock_queue
The long-term plan is to remove struct listen_sock once its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:37 -07:00
Eric Dumazet
aac065c50a tcp: move qlen/young out of struct listen_sock
qlen_inc & young_inc were protected by the listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for the upcoming lockless listener.

Also move qlen/young into request_sock_queue, as we'll get rid
of struct listen_sock eventually.
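
After the move, both counters are plain atomics in struct
request_sock_queue (sketch):

    atomic_t        qlen;   /* pending request sockets              */
    atomic_t        young;  /* requests whose SYNACK was not
                             * retransmitted yet                    */

    static inline void reqsk_queue_added(struct request_sock_queue *queue)
    {
            atomic_inc(&queue->young);
            atomic_inc(&queue->qlen);
    }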

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
Eric Dumazet
fff1f3001c tcp: add a spinlock to protect struct request_sock_queue
struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock).

We need a private spinlock instead, so that softirq handlers
creating children do not have to worry about the backlog notion
that the listener 'lock' carries.
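
The new field sits next to the accept FIFO it protects (sketch):

    struct request_sock_queue {
            spinlock_t              rskq_lock;  /* guards the FIFO below */
            struct request_sock     *rskq_accept_head;
            struct request_sock     *rskq_accept_tail;
            /* ... */
    };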

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:36 -07:00
Eric Dumazet
0536fcc039 tcp: prepare fastopen code for upcoming listener changes
While auditing the TCP stack for the upcoming 'lockless' listener
changes, I found I had to change fastopen_init_queue() to properly
init the object before publishing it.

Otherwise another cpu could try to lock the spinlock before it gets
properly initialized.

Instead of adding appropriate barriers, just remove the dynamic memory
allocation:
- The structure is 28 bytes on 64-bit arches; using an additional
  8 bytes to hold a pointer seems overkill.
- Two listeners could share the same cache line, and performance
  would suffer.

If we really want to save a few bytes, we would instead dynamically
allocate the whole struct request_sock_queue in the future.
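
i.e. the fastopen state becomes an embedded member (sketch):

    struct request_sock_queue {
            /* ... */
            struct fastopen_queue   fastopenq;  /* was a separately
                                                 * allocated pointer */
    };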

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:10 -07:00
Eric Dumazet
2985aaac01 tcp: constify tcp_syn_flood_action() socket argument
tcp_syn_flood_action() will soon be called with an unlocked socket.
In order to avoid the SYN flood warning being emitted multiple times,
use xchg().
Extend the max_qlen_log and synflood_warned fields in struct
listen_sock to u32.
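
xchg() makes the warning once-only without relying on the socket lock
(sketch; at this point synflood_warned still lives in struct
listen_sock):

    /* xchg() returns the previous value: only the first caller
     * to flip 0 -> 1 prints the warning.
     */
    if (xchg(&lopt->synflood_warned, 1) == 0)
            pr_info("%s: Possible SYN flooding on port %d.\n",
                    proto, ntohs(tcp_hdr(skb)->dest));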

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:10 -07:00
Eric Dumazet
a00e74442b tcp/dccp: constify send_synack and send_reset socket argument
None of these functions needs to change the socket, so make it
const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet
1b70e977ce inet: constify inet_rtx_syn_ack() sock argument
SYNACK packets are sent on behalf of unlocked listeners
or fastopen sockets. Mark the socket as const to catch future changes
that might break this assumption.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
ea3bea3a1d tcp/dccp: constify rtx_synack() and friends
This is done to make sure we do not change the listener socket
while sending SYNACK packets without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
cd8ae85299 tcp: provide SYN headers for passive connections
This patch allows a server application to get the TCP SYN headers for
its passive connections.  This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.

Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.

The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.

The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.

TCP_SAVED_SYN can be read only once: reading it frees the saved SYN
headers.

The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
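
Typical userspace usage (a sketch; error handling trimmed, and
fingerprint() is a hypothetical consumer):

    int one = 1;
    char syn[512];
    socklen_t len = sizeof(syn);

    /* on the listener, before or after listen(): */
    setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN, &one, sizeof(one));

    /* on an accepted socket: network + TCP headers of its SYN */
    if (getsockopt(conn_fd, IPPROTO_TCP, TCP_SAVED_SYN, syn, &len) == 0)
            fingerprint(syn, len);

    /* a second read returns no data: the kernel freed the
     * saved headers on the first one
     */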

The original patch was written by Tom Herbert; I changed it to not
hold a full skb (and its associated dst and conntrack references).

We have used such a patch for about 3 years at Google.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 16:02:34 -04:00
Eric Dumazet
b357a364c5 inet: fix possible panic in reqsk_queue_unlink()
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
 0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328 ("inet: get rid of central tcp/dccp listener
timer"), the listener spinlock was held and the race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and return a status so the caller can conditionally
release a refcount on the request sock.

This also means that tcp_check_req() in the non-fastopen case might
or might not consume the req refcount, so tcp_v6_hnd_req() &
tcp_v4_hnd_req() have to handle this properly.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.
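
The drop path then becomes roughly:

    void inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req)
    {
            /* reqsk_queue_unlink() now reports whether req was
             * still hashed; only then do we drop its refcount.
             */
            if (reqsk_queue_unlink(&inet_csk(sk)->icsk_accept_queue, req)) {
                    reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
                    reqsk_put(req);
            }
    }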

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24 11:39:15 -04:00
Eric Dumazet
b282705336 net: convert syn_wait_lock to a spinlock
This is low-hanging fruit, as we'll get rid of syn_wait_lock
eventually.

We hold syn_wait_lock for such short sections that it makes no sense
to use a read/write lock; a spinlock is simply faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->rsk_listener points to the listener anyway.
The request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00