Commit Graph

703 Commits

Author SHA1 Message Date
Gerrit Renker 49aebc66d6 dccp: Deprecate old setsockopt framework
The previous setsockopt interface, which passed socket options via struct
dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
ugly code since the old approach did not distinguish between NN and SP values.

This patch removes the old setsockopt interface and replaces it with two new
functions to register NN/SP values for feature negotiation. 
These are essentially wrappers around the internal __feat_register functions,
with checking added to avoid

 * wrong usage (type);
 * changing values while the connection is in progress.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 22:51:23 -08:00
Gerrit Renker 0c1168398e dccp: Mechanism to resolve CCID dependencies
This adds a hook to resolve features whose value depends on the choice of
CCID. It is done at the server since it can only be done after the CCID
values have been negotiated; i.e. the client will add its CCID preference
list on the Change options sent in the Request, which will be reconciled
with the local preference list of the server.

The concept is documented on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
				implementation_notes.html#ccid_dependencies

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 22:49:52 -08:00
Eric Dumazet 3ab5aee7fe net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:40:17 -08:00
Gerrit Renker 9eca0a47de dccp: Resolve dependencies of features on choice of CCID
This provides a missing link in the code chain, as several features implicitly
depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector
feature, but also Ack Ratio and Send Loss Event Rate (also taken care of).

For Send Ack Vector, the situation is as follows:
 * since CCID2 mandates the use of Ack Vectors, there is no point in allowing 
   endpoints which use CCID2 to disable Ack Vector features such a connection;

 * a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer
   with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4);

 * for all other CCIDs, the use of (Send) Ack Vector is optional and thus
   negotiable. However, this implies that the code negotiating the use of Ack
   Vectors also supports it (i.e. is able to supply and to either parse or
   ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack
   Vector support), the use of Ack Vectors is here disabled, with a comment
   in the source code.

An analogous consideration arises for the Send Loss Event Rate feature,
since the CCID-3 implementation does not support the loss interval options
of RFC 4342. To make such use explicit, corresponding feature-negotiation
options are inserted which signal the use of the loss event rate option,
as it is used by the CCID3 code.

Lastly, the values of the Ack Ratio feature are matched to the choice of CCID.

The patch implements this as a function which is called after the user has
made all other registrations for changing default values of features.

The table is variable-length, the reserved (and hence for feature-negotiation
invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0'
is used to mark the end of the table.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 00:48:44 -08:00
Gerrit Renker d90ebcbfa7 dccp: Query supported CCIDs
This provides a data structure to record which CCIDs are locally supported
and three accessor functions:
 - a test function for internal use which is used to validate CCID requests
   made by the user;
 - a copy function so that the list can be used for feature-negotiation;   
 - documented getsockopt() support so that the user can query capabilities.

The data structure is a table which is filled in at compile-time with the
list of available CCIDs (which in turn depends on the Kconfig choices).

Using the copy function for cloning the list of supported CCIDs is useful for
feature negotiation, since the negotiation is now with the full list of available
CCIDs (e.g. {2, 3}) instead of the default value {2}. This means negotiation 
will not fail if the peer requests to use CCID3 instead of CCID2. 

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 00:47:26 -08:00
Gerrit Renker e8ef967a54 dccp: Registration routines for changing feature values
Two registration routines, for SP and NN features, are provided by this patch,
replacing a previous routine which was used for both feature types.

These are internal-only routines and therefore start with `__feat_register'.

It further exports the known limits of Sequence Window and Ack Ratio as symbolic
constants.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 00:43:40 -08:00
Gerrit Renker f74e91b6cc dccp: Limit feature negotiation to connection setup phase
This patch limits feature (capability) negotation to the connection setup phase:

 1. Although it is theoretically possible to perform feature negotiation at any
    time (and RFC 4340 supports this), in practice this is prohibitively complex,
    as it requires to put traffic on hold for each new negotiation.
 2. As a byproduct of restricting feature negotiation to connection setup, the
    feature-negotiation retransmit timer is no longer required. This part is now
    mapped onto the protocol-level retransmission.
    Details indicating why timers are no longer needed can be found on
    http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
	                                      implementation_notes.html

This patch disables anytime negotiation, subsequent patches work out full
feature negotiation support for connection setup.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 00:42:58 -08:00
Gerrit Renker d99a7bd210 dccp: Cleanup routines for feature negotiation
This inserts the required de-allocation routines for memory allocated
by feature negotiation in the socket destructors, replacing
dccp_feat_clean() in one instance.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:56:30 -08:00
Gerrit Renker ac75773c27 dccp: Per-socket initialisation of feature negotiation
This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.

It also resolves a FIXME regarding the congestion control
initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:55:49 -08:00
Gerrit Renker 61e6473efb dccp: List management for new feature negotiation
This adds list initial fields and list management functions for the
new feature negotiation implementation.

Thanks to Arnaldo for suggestions and improvements.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:54:04 -08:00
Gerrit Renker 7d43d1a0f2 dccp: Implement lookup table for feature-negotiation information
A lookup table for feature-negotiation information, extracted from RFC
4340/42, is provided by this patch. All currently known features can
be found in this table, along with their feature location, their
default value, and type.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:43:47 -08:00
Gerrit Renker bd012f2e7b dccp: Basic data structure for feature negotiation
This patch prepares for the new and extended feature-negotiation
routines.

The following feature-negotiation data structures are provided:
	* a container for the various (SP or NN) values,
	* symbolic state names to track feature states,
	* an entry struct which holds all current information together,
	* elementary functions to fill in and process these structures.

Entry structs are arranged as FIFO for the following reason: RFC 4340
specifies that if multiple options of the same type are present, they
are processed in the order of their appearance in the packet; which
means that this order needs to be preserved in the local data
structure (the later insertion code also respects this order).

The struct list_head has been chosen for the following reasons: the most
frequent operations are

 * add new entry at tail (when receiving Change or setting socket
   options);
 * delete entry (when Confirm has been received);
 * deep copy of entire list (cloning from listening socket onto
   request socket).

The NN value has been set to 64 bit, which is a currently sufficient
upper limit (Sequence Window feature has 48 bit).

Thanks to Arnaldo, who contributed the streamlined layout of the entry
struct.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:38:20 -08:00
Harvey Harrison 21454aaad3 net: replace NIPQUAD() in net/*/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:54:56 -07:00
Gerrit Renker 944f750227 dccp: Port redirection support for DCCP
Commit a3116ac5c2 from 1st October ("tcp: Port
redirection support for TCP") broke DCCP skb lookup by changing inet_csk_clone,
which is used by DCCP to generate the child socket after the handshake.

This patch updates DCCP to use 'loc_port' instead of 'sport', which fixes the
problem, and thus inheriting port redirection support via the new interface.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-19 23:36:47 -07:00
Johannes Berg 95a5afca4a net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-16 15:24:51 -07:00
Denis V. Lunev e41b5368e0 ipv6: added net argument to ICMP6_INC_STATS_BH
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 11:14:13 -07:00
Arnaldo Carvalho de Melo 9a1f27c480 inet_hashtables: Add inet_lookup_skb helpers
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 11:41:57 -07:00
Gerrit Renker 410e27a49b This reverts "Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/dccp_exp"
as it accentally contained the wrong set of patches. These will be
submitted separately.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-09 13:27:22 +02:00
Gerrit Renker a3cbdde8e9 dccp ccid-3: Preventing Oscillations
This implements [RFC 3448, 4.5], which performs congestion avoidance behaviour
by reducing the transmit rate as the queueing delay (measured in terms of
long-term RTT) increases.

Oscillation can be turned on/off via a module option (do_osc_prev) and via sysfs
(using mode 0644), the default is off.

Overflow analysis:
------------------
 * oscillation prevention is done after update_x(), so that t_ipi <= 64000;
 * hence the multiplication "t_ipi * sqrt(R_sample)" needs 64 bits;
 * done using u64 for sqrt_sample and explicit typecast of t_ipi;
 * the divisor, R_sqmean, is non-zero because oscillation prevention is first
   called when receiving the second feedback packet, and tfrc_scaled_rtt() > 0.

A detailed discussion of the algorithm (with plots) is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid3/sender_notes/oscillation_prevention/

The algorithm has negative side effects:
  * when allowing to decrease t_ipi (leads to a large RTT) and
  * when using it during slow-start;
both uses are therefore disabled.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:43 +02:00
Gerrit Renker 53ac9570c8 dccp ccid-3: Simplify computing and range-checking of t_ipi
This patch simplifies the computation of t_ipi, avoiding expensive computations
to enforce the minimum sending rate.

Both RFC 3448 and rfc3448bis (revision #06), as well as RFC 4342 sec 5., require
at various stages that at least one packet must be sent per t_mbi = 64 seconds.
This requires frequent divisions of the type X_min = s/t_mbi, which are later
converted back into an inter-packet-interval t_ipi_max = s/X_min = t_mbi.

The patch removes the expensive indirection; in the unlikely case of having
a sending rate less than one packet per 64 seconds, it also re-adjusts X.

The following cases document conformance with RFC 3448  / rfc3448bis-06:
 1) Time until receiving the first feedback packet:
   * if the sender has no initial RTT sample then X = s/1 Bps > s/t_mbi;
   * if the sender has an initial RTT sample or when the first feedback
     packet is received, X = W_init/R > s/t_mbi.

 2) Slow-start (p == 0 and feedback packets come in):
   * RFC 3448  (current code) enforces a minimum of s/R > s/t_mbi;
   * rfc3448bis (future code) enforces an even higher minimum of W_init/R.

 3) Congestion avoidance with no absence of feedback (p > 0):
   * when X_calc or X_recv/2 are too low, the minimum of X_min = s/t_mbi
     is enforced in update_x() when calling update_send_interval();
   * update_send_interval() is, as before, only called when X changes
     (i.e. either when increasing or decreasing, not when in equilibrium).

 4) Reduction of X without prior feedback or during slow-start (p==0):
   * both RFC 3448 and rfc3448bis here halve X directly;
   * the associated constraint X >= s/t_mbi is nforced here by send_interval().

 5) Reduction of X when p > 0:
   * X is modified indirectly via X_recv (RFC 3448) or X_recv_set (rfc3448bis);
   * in both cases, control goes back to section 4.3 (in both documents);
   * since p > 0, both documents use X = max(min(...), s/t_mbi), which is
     enforced in this patch by calling send_interval() from update_x().

I think that this analysis is exhaustive. Should I have forgotten a case,
the worst-case consideration arises when X sinks below s/t_mbi, and is then
increased back up to this minimum value. Even under this assumption, the
behaviour is correct, since all lower limits of X in RFC 3448 / rfc3448bis
are either equal to or greater than s/t_mbi.

Note on the condition X >= s/t_mbi  <==> t_ipi = s/X <= t_mbi: since X is
scaled by 64, and all time units are in microseconds, the coded condition is:

    t_ipi = s * 64 * 10^6 usec / X <= 64 * 10^6 usec

This simplifies to s / X <= 1 second <==> X * 1 second >= s > 0.
(A zero `s' is not allowed by the CCID-3 code).	

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:43 +02:00
Gerrit Renker c8f41d50ad dccp ccid-3: Measuring the packet size s with regard to rfc3448bis-06
rfc3448bis allows three different ways of tracking the packet size `s': 

 1. using the MSS/MPS (at initialisation, 4.2, and in 4.1 (1));
 2. using the average of `s' (in 4.1);
 3. using the maximum of `s' (in 4.2).

Instead of hard-coding a single interpretation of rfc3448bis, this implements
a choice of all three alternatives and suggests the first as default, since it
is the option which is most consistent with other parts of the specification.

The patch further deprecates the update of t_ipi whenever `s' changes. The
gains of doing this are only small since a change of s takes effect at the
next instant X is updated:
 * when the next feedback comes in (within one RTT or less);
 * when the nofeedback timer expires (within at most 4 RTTs).
 
Further, there are complications caused by updating t_ipi whenever s changes:
 * if t_ipi had previously been updated to effect oscillation prevention (4.5),
   then it is impossible to make the same adjustment to t_ipi again, thus
   counter-acting the algorithm;
 * s may be updated any time and a modification of t_ipi depends on the current
   state (e.g. no oscillation prevention is done in the absence of feedback);
 * in rev-06 of rfc3448bis, there are more possible cases, depending on whether
   the sender is in slow-start (t_ipi <= R/W_init), or in congestion-avoidance,
   limited by X_recv or the throughput equation (t_ipi <= t_mbi).

Thus there are side effects of always updating t_ipi as s changes. These may not
be desirable. The only case I can think of where such an update makes sense is
to recompute X_calc when p > 0 and when s changes (not done by this patch).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:42 +02:00
Gerrit Renker 891e4d8a40 dccp ccid-3: Tidy up CCID-Kconfig dependencies
The per-CCID menu has several dependencies on EXPERIMENTAL. These are redundant,
since net/dccp/ccids/Kconfig is sourced by net/dccp/Kconfig and since the
latter menu in turn asserts a dependency on EXPERIMENTAL.

The patch removes the redundant dependencies as well as the repeated reference
within the sub-menu.

Further changes:
----------------
Two single dependencies on CCID-3 are replaced with a single enclosing `if'.
    
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:42 +02:00
Gerrit Renker 9d497a2c91 dccp ccid-3: Implement rfc3448bis change to initial-rate computation
The patch updates CCID-3 with regard to the latest rfc3448bis-06: 
 * in the first revisions of the draft, MSS was used for the RFC 3390 window; 
 * then (from revision #1 to revision #2), it used the packet size `s';
 * now, in this revision (and apparently final), the value is back to MSS.

This change has an implication for the case when no RTT sample is available,
at the time of sending the first packet:

 * with RTT sample, 2*MSS/RTT <= initial_rate <= 4*MSS/RTT;
 * without RTT sample, the initial rate is one packet (s bytes) per second
   (sec. 4.2), but using s instead of MSS here creates an imbalance, since
   this would further reduce the initial sending rate.

Hence the patch uses MSS (called MPS in RFC 4340) in all places.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:42 +02:00
Gerrit Renker 88e97a9334 dccp ccid-3: Update the RX history records in one place
This patch is a requirement for enabling ECN support later on. With that change
in mind, the following preparations are done:
 * renamed handle_loss() into congestion_event() since it returns true when a
   congestion event happens (it will eventually also take care of ECN packets);
 * lets tfrc_rx_congestion_event() always update the RX history records, since
   this routine needs to be called for each non-duplicate packet anyway;
 * made all involved boolean-type functions to have return type `bool';

Updating the RX history records is now only necessary for the packets received
up to sending the first feedback. The receiver code becomes again simpler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:42 +02:00
Gerrit Renker 68c89ee535 dccp ccid-3: Update the computation of X_recv
This updates the computation of X_recv with regard to Errata 610/611 for
RFC 4342 and draft rfc3448bis-06, ensuring that at least an interval of 1
RTT is used to compute X_recv.  The change is wrapped into a new function
ccid3_hc_rx_x_recv().

Further changes:
----------------
 * feedback is not sent when no data packets arrived (bytes_recv == 0), as per
   rfc3448bis-06, 6.2;
 * take the timestamp for the feedback /after/ dccp_send_ack() returns, to avoid
   taking the transmission time into account (in case layer-2 is busy);
 * clearer handling of failure in ccid3_first_li().

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2008-09-04 07:45:42 +02:00