82 Commits

Author SHA1 Message Date
Al Viro dd0fc66fb3 [PATCH] gfp flags annotations - part 1
 - added typedef unsigned int __nocast gfp_t;

 - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
   the same warnings as far as sparse is concerned, doesn't change
   generated code (from gcc point of view we replaced unsigned int with
   typedef) and documents what's going on far better.
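
A minimal userspace sketch of the idea, not the kernel headers themselves: the
__nocast definition follows the usual sparse convention of expanding to an
attribute only under __CHECKER__, and the DEMO_* flag values and
demo_may_sleep() are hypothetical names used for illustration only.

```c
/*
 * sparse defines __CHECKER__ and understands the nocast attribute;
 * for gcc the annotation expands to nothing, so gfp_t is a plain
 * unsigned int and the generated code does not change.
 */
#ifdef __CHECKER__
# define __nocast __attribute__((nocast))
#else
# define __nocast
#endif

typedef unsigned int __nocast gfp_t;

/* Hypothetical flag values, for illustration only. */
#define DEMO_GFP_WAIT 0x10u
#define DEMO_GFP_IO   0x40u

/* Taking gfp_t instead of a bare unsigned int documents the parameter. */
static int demo_may_sleep(gfp_t flags)
{
	return (flags & DEMO_GFP_WAIT) != 0;
}
```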

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 15:00:57 -07:00
Eric Dumazet 81c3d5470e [INET]: speedup inet (tcp/dccp) lookups
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)

(The other important patch moves skc_refcnt to a separate cache line,
so that SMP/NUMA performance doesn't suffer from cache line ping-pong)

1) First some performance data :
--------------------------------

tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()

The most time critical code is :

sk_for_each(sk, node, &head->chain) {
     if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
         goto hit; /* You sunk my battleship! */
}

sk_for_each() does use prefetch() hints, but only the beginning of
"struct sock" is prefetched.

As INET_MATCH's first comparison uses inet_sk(__sk)->daddr, which is far
away from the beginning of "struct sock", it has to bring a cold cache
line into the CPU cache. Each iteration has to use at least 2 cache
lines.

This can be problematic if some chains are very long.

2) The goal
-----------

The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.

3) Description of the patch
---------------------------

Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32-bit hole on 64-bit platforms.

struct sock_common {
	unsigned short		skc_family;
	volatile unsigned char	skc_state;
	unsigned char		skc_reuse;
	int			skc_bound_dev_if;
	struct hlist_node	skc_node;
	struct hlist_node	skc_bind_node;
	atomic_t		skc_refcnt;
+	unsigned int		skc_hash;
	struct proto		*skc_prot;
};

Store the full hash in this 32-bit field, not masked by (ehash_size -
1). Using this full hash as the first comparison done in INET_MATCH
lets us immediately skip the element without touching a second cache
line in case of a miss.

Remove the sk_hashent/tw_hashent fields, since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1).

File include/net/inet_hashtables.h

64-bit platforms:
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))                         &&  \
     ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie))   &&  \
     ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

32-bit platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))                 &&  \
     (inet_sk(__sk)->daddr          == (__saddr))   &&  \
     (inet_sk(__sk)->rcv_saddr      == (__daddr))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
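
The hash-first comparison can be sketched in userspace roughly as follows.
This is a simplified model rather than the kernel code: struct demo_sock,
demo_lookup() and demo_slot() are hypothetical names, the chain is a plain
singly linked list, and the address/port fields merely stand in for data that
would live in a second cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a socket on an ehash chain. */
struct demo_sock {
	struct demo_sock *next;
	uint32_t hash;   /* full hash, NOT masked by (ehash_size - 1) */
	/* Fields below model data that sits in a different cache line. */
	uint32_t raddr;
	uint32_t laddr;
	uint32_t ports;
};

/* The slot number is recovered from the full hash by masking. */
static unsigned int demo_slot(uint32_t hash, unsigned int ehash_size)
{
	return hash & (ehash_size - 1);
}

/* Walk one chain; compare the cheap full hash before anything else. */
static struct demo_sock *demo_lookup(struct demo_sock *head, uint32_t hash,
				     uint32_t raddr, uint32_t laddr,
				     uint32_t ports)
{
	struct demo_sock *sk;

	for (sk = head; sk != NULL; sk = sk->next) {
		if (sk->hash != hash)
			continue;   /* miss: the far fields are never read */
		if (sk->raddr == raddr && sk->laddr == laddr &&
		    sk->ports == ports)
			return sk;
	}
	return NULL;
}
```

Two sockets whose full hashes differ can still share a slot, which is exactly
the case where the first comparison rejects the entry early.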


- Adds a prefetch(head->chain.first) in 
__inet_lookup_established()/__tcp_v4_check_established() and 
__inet6_lookup_established()/__tcp_v6_check_established() and 
__dccp_v4_check_established() to bring into cache the first element of the 
list, before the {read|write}_lock(&head->lock);

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 14:13:38 -07:00
Arnaldo Carvalho de Melo 88f964db6e [DCCP]: Introduce CCID getsockopt for the CCIDs
Allocation for the optnames is similar to the DCCP options, with a
range for rx and tx half connection CCIDs.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18 00:19:32 -07:00
Arnaldo Carvalho de Melo 561713cf47 [DCCP]: Don't use necessarily the same CCID for tx and rx
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18 00:18:52 -07:00
Arnaldo Carvalho de Melo 65299d6c3c [CCID3]: Introduce include/linux/tfrc.h
Moving the TFRC sender and receiver variables to separate structs, so
that we can copy these structs to userspace thru getsockopt,
dccp_diag, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18 00:18:32 -07:00
Arnaldo Carvalho de Melo ae31c3399d [DCCP]: Move the ack vector code to net/dccp/ackvec.[ch]
Isolate it, as it will be used when we introduce a CCID2 (TCP-Like)
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18 00:17:51 -07:00
Arnaldo Carvalho de Melo 67e6b62921 [DCCP]: Introduce DCCP_SOCKOPT_SERVICE
As discussed in the dccp@vger mailing list:

Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().

An array of unsigned ints can be passed, meaning that the listening sock accepts
connection requests for several services.

With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).
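
From the application side, the call might look like the sketch below. The
helper demo_set_services() and the DEMO_MAX_SERVICES limit are hypothetical;
the fallback values for SOL_DCCP and DCCP_SOCKOPT_SERVICE mirror the Linux
headers, and converting the codes with htonl() is an assumption about the
expected byte order, not something this commit message states.

```c
#include <arpa/inet.h>   /* htonl */
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>  /* setsockopt, socklen_t */

/* Fallbacks for libcs that do not expose these constants. */
#ifndef SOL_DCCP
#define SOL_DCCP 269
#endif
#ifndef DCCP_SOCKOPT_SERVICE
#define DCCP_SOCKOPT_SERVICE 2
#endif

#define DEMO_MAX_SERVICES 8   /* hypothetical limit for this sketch */

/*
 * Set one or more service codes on a DCCP socket.
 * Returns 0 on success and -1 on failure: either n is out of range
 * for this sketch, or setsockopt() failed and set errno.
 */
static int demo_set_services(int fd, const uint32_t *codes, size_t n)
{
	uint32_t wire[DEMO_MAX_SERVICES];
	size_t i;

	if (n == 0 || n > DEMO_MAX_SERVICES)
		return -1;
	for (i = 0; i < n; i++)
		wire[i] = htonl(codes[i]);   /* assumption: network byte order */

	return setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
			  wire, (socklen_t)(n * sizeof(wire[0])));
}
```

The helper would be called on the socket before listen() or connect(), as
described above; on a descriptor that is not a valid socket it simply returns
-1.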

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16 16:58:40 -07:00
Arnaldo Carvalho de Melo 0c10c5d968 [DCCP]: More precisely set reset_code when sending RESET packets
Moving the setting of DCCP_SKB_CB(skb)->dccpd_reset_code to the places
where events happen that trigger sending a RESET packet.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16 16:58:33 -07:00
Arnaldo Carvalho de Melo 2b80230a7f [DCCP]: Handle SYNC packets in dccp_rcv_state_process
A SYNC packet elicits a SYNCACK in response. Until now we were handling
SYNC packets only in the DCCP_OPEN state, in dccp_rcv_established.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-13 19:05:08 -03:00
Arnaldo Carvalho de Melo 811265b8e8 [DCCP]: Check if already in the CLOSING state in dccp_rcv_closereq
It is possible to receive more than one CLOSEREQ packet if the
CLOSE packet sent in response is somehow lost, so change the state
to DCCP_CLOSING only on the first CLOSEREQ packet received.
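
A toy model of that check, with hypothetical demo_* names rather than the real
dccp_rcv_closereq(). It assumes that every CLOSEREQ is still answered with a
CLOSE, since the earlier one may have been lost; only the state change is
guarded.

```c
enum demo_dccp_state { DEMO_DCCP_OPEN, DEMO_DCCP_CLOSING };

struct demo_dccp_sock {
	enum demo_dccp_state state;
	unsigned int state_changes;   /* how often the state was switched */
	unsigned int closes_sent;     /* how many CLOSE packets were sent */
};

static void demo_rcv_closereq(struct demo_dccp_sock *sk)
{
	/* Only the first CLOSEREQ moves the socket to CLOSING. */
	if (sk->state != DEMO_DCCP_CLOSING) {
		sk->state = DEMO_DCCP_CLOSING;
		sk->state_changes++;
	}
	/* Assumption: each CLOSEREQ is answered with a CLOSE. */
	sk->closes_sent++;
}
```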

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-13 19:03:15 -03:00
Arnaldo Carvalho de Melo 59c2353dd0 [CCID3]: Listen socks doesn't have a private CCID block
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-12 14:16:58 -07:00
Arnaldo Carvalho de Melo 59d203f9e9 [CCID3] Cleanup ccid3 debug calls
Also use some BUG_ON where appropriate, and use LIMIT_NETDEBUG for the
unlikely cases that we want to know about at this stage but that haven't
appeared on the radar in my tests.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 20:01:25 -03:00
Arnaldo Carvalho de Melo dc19336c76 [DCCP] Only call the HC _exit() routines in dccp_v4_destroy_sock
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 19:59:26 -03:00
Arnaldo Carvalho de Melo d7e0fb985c [CCID3] Initialize ccid3hctx_t_ipi to 250ms
To match more closely what is described in RFC 3448.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz>
2005-09-09 19:58:18 -03:00
Arnaldo Carvalho de Melo 59725dc2a2 [CCID3] Introduce ccid3_hc_[rt]x_sk() for overall consistency
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:40:58 -03:00
Arnaldo Carvalho de Melo b0e567806d [DCCP] Introduce dccp_timestamp
To start the timestamps at 0.0ms, easing the integer maths in the CCIDs. This
will probably be reworked to use the to-be-introduced struct timeval_offset
infrastructure out of skb_get_timestamp, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:38:35 -03:00
Arnaldo Carvalho de Melo 954ee31f36 [CCID3] Initialize more fields in ccid3_hc_rx_init
The initialization of ccid3hcrx_rtt to 5ms is just a band-aid. I'll continue
auditing the CCID3 HC rx codebase to fix this properly, probably by adding a
feedback timer as suggested in the CCID3 draft.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:37:05 -03:00
Arnaldo Carvalho de Melo b3a3077d96 [CCID3] Make the ccid3hcrx_rtt calc look more like the ccid3hctx_rtt one
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:34:10 -03:00
Arnaldo Carvalho de Melo 1a28599a2c [CCID3] Use ELAPSED_TIME in the HC TX RTT estimation
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:32:56 -03:00
Arnaldo Carvalho de Melo 1c14ac0ae8 [DCCP] Give precedence to the biggest ELAPSED_TIME
We can get this value in a TIMESTAMP_ECHO and/or in an ELAPSED_TIME option; if
we receive both, give precedence to the biggest one.

In my tests they are very close if not equal at all times, so we may well think
about removing the code in CCID3 that inserts this option and leaving this to
the core, and perhaps even use just TIMESTAMP_ECHO including the elapsed time.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:32:01 -03:00
Arnaldo Carvalho de Melo 27ae543e6f [CCID3] Calculate ccid3hcrx_x_recv using usecs_div
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:31:07 -03:00
Arnaldo Carvalho de Melo 507d37cf26 [CCID] Only call the HC insert_options methods when requested
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:30:07 -03:00
Arnaldo Carvalho de Melo 0ba7a3ba66 [CCID3] Avoid unsigned integer overflows in usecs_div
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09 02:28:47 -03:00
Arnaldo Carvalho de Melo c530cfb1ce [CCID3]: Call sk->sk_write_space(sk) when receiving a feedback packet
This makes the send rate calculations behave way more closely to what
is specified, with the jitter previously seen on x and x_recv
disappearing completely on non-lossy setups.

This resembles the tcp_data_snd_check code, that possibly we'll end up
using in DCCP as well, perhaps moving this code to
inet_connection_sock.

For now I'm doing the simplest implementation tho.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:13:46 -07:00
Arnaldo Carvalho de Melo a84ffe4303 [DCCP]: Introduce DCCP_SOCKOPT_PACKET_SIZE
So that applications can set dccp_sock->dccps_pkt_size, that in turn
is used in the CCID3 half connection init routines to set
ccid3hc[tr]x_s and use it in its rate calculations.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:13:37 -07:00