Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.
I could make at least one BUILD_BUG_ON conversion.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change struct proto destroy function pointer to return void. Noticed
by Al Viro.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
None of these files use any of the functionality promised by
asm/semaphore.h. It's possible that they rely on it dragging in some
unrelated header file, but I can't build all these files, so we'll have
fix any build failures as they come up.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
dev_queue_xmit() and the other IP output functions expect to get a skb
with clear or properly initialized skb->cb. Unlike TCP and UDP, the
dccp_skb_cb doesn't contain a struct inet_skb_parm at the beginning,
so the DCCP-specific data is interpreted by the IP output functions.
This can cause false negatives for the conditional POST_ROUTING hook
invocation, making the packet bypass the hook.
Add a inet_skb_parm/inet6_skb_parm union to the beginning of
dccp_skb_cb to avoid clashes. Also add a BUILD_BUG_ON to make
sure it fits in the cb.
[ Combined with patch from Gerrit Renker to remove two now unnecessary
memsets of IPCB(skb)->opt ]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces 4 individual assignments for `len' with a single
one, placed where the control flow of those 4 leads to.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a socket option and signalling support for the case where the server
holds timewait state on closing the connection, as described in RFC 4340, 8.3.
Since holding timewait state at the server is the non-usual case, it is enabled
via a socket option. Documentation for this socket option has been added.
The setsockopt statement has been made resilient against different possible cases
of expressing boolean `true' values using a suggestion by Ian McDonald.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing active close, RFC 4340, 8.3. requires to retransmit the
Close/CloseReq with a backoff-retransmit timer starting at intially 2 RTTs.
This patch shifts the existing code for active-close retransmit timer
into output.c, so that the retransmit timer is started when the first
Close/CloseReq is sent. Previously, the timer was started when, after
releasing the socket in dccp_close(), the actively-closing side had not yet
reached the CLOSED/TIMEWAIT state.
The patch further reduces the initial timeout from 3 seconds to the required
2 RTTs, where - in absence of a known RTT - the fallback value specified in
RFC 4340, 3.4 is used.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the necessary state transitions for the two forms of passive-close
* PASSIVE_CLOSE - which is entered when a host receives a Close;
* PASSIVE_CLOSEREQ - which is entered when a client receives a CloseReq.
Here is a detailed account of what the patch does in each state.
1) Receiving CloseReq
The pseudo-code in 8.5 says:
Step 13: Process CloseReq
If P.type == CloseReq and S.state < CLOSEREQ,
Generate Close
S.state := CLOSING
Set CLOSING timer.
This means we need to address what to do in CLOSED, LISTEN, REQUEST, RESPOND, PARTOPEN, and OPEN.
* CLOSED: silently ignore - it may be a late or duplicate CloseReq;
* LISTEN/RESPOND: will not appear, since Step 7 is performed first (we know we are the client);
* REQUEST: perform Step 13 directly (no need to enqueue packet);
* OPEN/PARTOPEN: enter PASSIVE_CLOSEREQ so that the application has a chance to process unread data.
When already in PASSIVE_CLOSEREQ, no second CloseReq is enqueued. In any other state, the CloseReq is ignored.
I think that this offers some robustness against rare and pathological cases: e.g. a simultaneous close where
the client sends a Close and the server a CloseReq. The client will then be retransmitting its Close until it
gets the Reset, so ignoring the CloseReq while in state CLOSING is sane.
2) Receiving Close
The code below from 8.5 is unconditional.
Step 14: Process Close
If P.type == Close,
Generate Reset(Closed)
Tear down connection
Drop packet and return
Thus we need to consider all states:
* CLOSED: silently ignore, since this can happen when a retransmitted or late Close arrives;
* LISTEN: dccp_rcv_state_process() will generate a Reset ("No Connection");
* REQUEST: perform Step 14 directly (no need to enqueue packet);
* RESPOND: dccp_check_req() will generate a Reset ("Packet Error") -- left it at that;
* OPEN/PARTOPEN: enter PASSIVE_CLOSE so that application has a chance to process unread data;
* CLOSEREQ: server performed active-close -- perform Step 14;
* CLOSING: simultaneous-close: use a tie-breaker to avoid message ping-pong (see comment);
* PASSIVE_CLOSEREQ: ignore - the peer has a bug (sending first a CloseReq and now a Close);
* TIMEWAIT: packet is ignored.
Note that the condition of receiving a packet in state CLOSED here is different from the condition "there
is no socket for such a connection": the socket still exists, but its state indicates it is unusable.
Last, dccp_finish_passive_close sets either DCCP_CLOSED or DCCP_CLOSING = TCP_CLOSING, so that
sk_stream_wait_close() will wait for the final Reset (which will trigger CLOSING => CLOSED).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds two auxiliary states to deal with passive closes:
* PASSIVE_CLOSE (reached from OPEN via reception of Close) and
* PASSIVE_CLOSEREQ (reached from OPEN via reception of CloseReq)
as internal intermediate states.
These states are used to allow a receiver to process unread data before
acknowledging the received connection-termination-request (the Close/CloseReq).
Without such support, it will happen that passively-closed sockets enter CLOSED
state while there is still unprocessed data in the queue; leading to unexpected
and erratic API behaviour.
PASSIVE_CLOSE has been mapped into TCPF_CLOSE_WAIT, so that the code will
seamlessly work with inet_accept() (which tests for this state).
The state names are thanks to Arnaldo, who suggested this naming scheme
following an earlier revision of this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This continues from the previous patch and adds support for actively aborting
a DCCP connection, using a Reset Code 2, "Aborted" to inform the peer of an
abortive release.
I have tried this in various client/server settings and it works as expected.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes one FIXME with regard to close when there is still unread data.
The mechanism is implemented similar to TCP: with regard to DCCP-specifics,
a Reset with Code 2, "Aborted" is sent to the peer.
This corresponds in part to RFC 4340, 8.1.1 and 8.1.5.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is because in the next patch CCID2 will assume that dccps_mss_cache is
non-zero.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends the DCCP socket API by honouring any shutdown(2) option set by the user.
The behaviour is, as much as possible, made consistent with the API for TCP's shutdown.
This patch exploits the information provided by the user via the socket API to reduce
processing costs:
* if the read end is closed (SHUT_RD), it is not necessary to deliver to input CCID;
* if the write end is closed (SHUT_WR), the same idea applies, but with a difference -
as long as the TX queue has not been drained, we need to receive feedback to keep
congestion-control rates up to date. Hence SHUT_WR is honoured only after the last
packet (under congestion control) has been sent;
* although SHUT_RDWR seems nonsensical, it is nevertheless supported in the same manner
as for TCP (and agrees with test for SHUTDOWN_MASK in dccp_poll() in net/dccp/proto.c).
Furthermore, most of the code already honours the sk_shutdown flags (dccp_recvmsg() for
instance sets the read length to 0 if SHUT_RD had been called); CCID handling is now added
to this by the present patch.
There will also no longer be any delivery when the socket is in the final stages, i.e. when
one of dccp_close(), dccp_fin(), or dccp_done() has been called - which is fine since at
that stage the connection is its final stages.
Motivation and background are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/shutdown
A FIXME has been added to notify the other end if SHUT_RD has been set (RFC 4340, 11.7).
Note: There is a comment in inet_shutdown() in net/ipv4/af_inet.c which asks to "make
sure the socket is a TCP socket". This should probably be extended to mean
`TCP or DCCP socket' (the code is also used by UDP and raw sockets).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.
On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.
Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.
This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables applications to query the current value of the Maximum
Packet Size via a socket option, suggested as a SHOULD in (RFC 4340,
p. 102).
This socket option is useful to avoid the annoying bail-out via
`-EMSGSIZE'. In particular, as fragmentation is not currently
supported (and its use is partly discouraged in RFC 4340).
With this option, it is possible to size buffers accordingly, e.g.
int buflen = dccp_get_cur_mps(sockfd);
/* or */
if (msgsize > dccp_get_cur_mps(sockfd))
die("message is too large for this path");
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since DCCP requires to close both ends of a connection simultaneously,
permission to write in state DCCP_CLOSING is removed in dccp_sendmsg():
* if the sending end closed, it would encounter a write error anyhow;
* if the other end has closed the connection, it accepts no more data.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This implements a SHOULD from RFC 4340, 7.5.4:
"To protect against denial-of-service attacks, DCCP implementations SHOULD
impose a rate limit on DCCP-Syncs sent in response to sequence-invalid packets,
such as not more than eight DCCP-Syncs per second."
The rate-limit is maintained on a per-socket basis. This is a more stringent
policy than enforcing the rate-limit on a per-source-address basis and
protects against attacks with forged source addresses.
Moreover, the mechanism is deliberately kept simple. In contrast to
xrlim_allow(), bursts of Sync packets in reply to sequence-invalid packets
are not supported. This foils such attacks where the receipt of a Sync
triggers further sequence-invalid packets. (I have tested this mechanism against
xrlim_allow algorithm for Syncs, permitting bursts just increases the problems.)
In order to keep flexibility, the timeout parameter can be set via sysctl; and
the whole mechanism can even be disabled (which is however not recommended).
The algorithm in this patch has been improved with regard to wrapping issues
thanks to a suggestion by Arnaldo.
Commiter note: Rate limited the step 6 DCCP_WARN too, as it says we're
sending a sync.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
This provides a timesource, conveniently used for DCCP timestamps, which
returns the elapsed time in 10s of microseconds since initialisation.
This makes for a wrap-around time of about 11.9 hours, which should be
sufficient for most applications.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>