Same as the ->hash one, this is easily consolidated.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having the raw_hashinfo it's easy to consolidate the
raw[46]_hash functions.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4/raw.c and ipv6/raw.c contain many common code (most
of which is proc interface) which can be consolidated.
Most of the places to consolidate deal with the raw sockets
hashtable, so introduce a struct raw_hashinfo which describes
the raw sockets hash.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The raw sockets functions are explicitly used from
inside the kernel in two places:
1. in ip_local_deliver_finish to intercept skb-s
2. in icmp_error
For this purposes many functions and even data structures,
that are naturally internal for raw protocol, are exported.
Compact the API to two functions and hide all the other
(including hash table and rwlock) inside the net/ipv4/raw.c
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With this patch the synced connections are created with their real state,
which can be changed on the next synchronizations if necessary. This way
on fail-over all the connections will be treated according to their actual
state, causing no scheduling problems (the active and the nonactive
connections have different weights in the schedulers).
The backwards compatibility is preserved and the existing tools will show
the true connection states even on the backup director.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch labels the sync-created connections with IP_VS_CONN_F_SYNC
flag and creates /proc/net/ip_vs_conn_sync to enable monitoring of the
origin of the connections, if they are local or created by the
synchronization.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously one of the in-block skip branches was missing it.
Also, drop it from tail-fully-processed case because the next
iteration will do exactly the same thing, i.e., process the
SACK block that contains the DSACK information.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_rt_acct needs 4096 bytes per cpu to perform some accounting.
It is actually allocated as a single huge array [4096*NR_CPUS]
(rounded up to a power of two)
Converting it to a per cpu variable is wanted to :
- Save space on machines were num_possible_cpus() < NR_CPUS
- Better NUMA placement (each cpu gets memory on its node)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Key points of this patch are:
- In case new SACK information is advance only type, no skb
processing below previously discovered highest point is done
- Optimize cases below highest point too since there's no need
to always go up to highest point (which is very likely still
present in that SACK), this is not entirely true though
because I'm dropping the fastpath_skb_hint which could
previously optimize those cases even better. Whether that's
significant, I'm not too sure.
Currently it will provide skipping by walking. Combined with
RB-tree, all skipping would become fast too regardless of window
size (can be done incrementally later).
Previously a number of cases in TCP SACK processing fails to
take advantage of costly stored information in sack_recv_cache,
most importantly, expected events such as cumulative ACK and new
hole ACKs. Processing on such ACKs result in rather long walks
building up latencies (which easily gets nasty when window is
huge). Those latencies are often completely unnecessary
compared with the amount of _new_ information received, usually
for cumulative ACK there's no new information at all, yet TCP
walks whole queue unnecessary potentially taking a number of
costly cache misses on the way, etc.!
Since the inclusion of highest_sack, there's a lot information
that is very likely redundant (SACK fastpath hint stuff,
fackets_out, highest_sack), though there's no ultimate guarantee
that they'll remain the same whole the time (in all unearthly
scenarios). Take advantage of this knowledge here and drop
fastpath hint and use direct access to highest SACKed skb as
a replacement.
Effectively "special cased" fastpath is dropped. This change
adds some complexity to introduce better coveraged "fastpath",
though the added complexity should make TCP behave more cache
friendly.
The current ACK's SACK blocks are compared against each cached
block individially and only ranges that are new are then scanned
by the high constant walk. For other parts of write queue, even
when in previously known part of the SACK blocks, a faster skip
function is used (if necessary at all). In addition, whenever
possible, TCP fast-forwards to highest_sack skb that was made
available by an earlier patch. In typical case, no other things
but this fast-forward and mandatory markings after that occur
making the access pattern quite similar to the former fastpath
"special case".
DSACKs are special case that must always be walked.
The local to recv_sack_cache copying could be more intelligent
w.r.t DSACKs which are likely to be there only once but that
is left to a separate patch.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Worker function that implements the main logic of
the inner-most loop of tcp_sacktag_write_queue().
Idea was originally presented by David S. Miller.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highest_sack_end_seq is no longer calculated in the loop,
thus it can be pushed to the worker function altogether
making that function independent of the sacktag.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many assumptions that are true when no reordering or other
strange events happen are not a part of the RFC3517. FACK
implementation is based on such assumptions. Previously (before
the rewrite) the non-FACK SACK was basically doing fast rexmit
and then it times out all skbs when first cumulative ACK arrives,
which cannot really be called SACK based recovery :-).
RFC3517 SACK disables these things:
- Per SKB timeouts & head timeout entry to recovery
- Marking at least one skb while in recovery (RFC3517 does this
only for the fast retransmission but not for the other skbs
when cumulative ACKs arrive in the recovery)
- Sacktag's loss detection flavors B and C (see comment before
tcp_sacktag_write_queue)
This does not implement the "last resort" rule 3 of NextSeg, which
allows retransmissions also when not enough SACK blocks have yet
arrived above a segment for IsLost to return true [RFC3517].
The implementation differs from RFC3517 in these points:
- Rate-halving is used instead of FlightSize / 2
- Instead of using dupACKs to trigger the recovery, the number
of SACK blocks is used as FACK does with SACK blocks+holes
(which provides more accurate number). It seems that the
difference can affect negatively only if the receiver does not
generate SACK blocks at all even though it claimed to be
SACK-capable.
- Dupthresh is not a constant one. Dynamical adjustments include
both holes and sacked segments (equal to what FACK has) due to
complexity involved in determining the number sacked blocks
between highest_sack and the reordered segment. Thus it's will
be an over-estimate.
Implementation note:
tcp_clean_rtx_queue doesn't need a lost_cnt tweak because head
skb at that point cannot be SACKED_ACKED (nor would such
situation last for long enough to cause problems).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements more accurately what is stated in sacktag's
overall comment:
"Both of these heuristics are not used in Loss state, when
we cannot account for retransmits accurately."
When CA_Loss state is entered, the state changer ensures that
undo_marker is only set if no TCPCB_RETRANS skbs were found,
thus having non-zero undo_marker in CA_Loss basically tells
that the R-bits still accurately reflect the current state
of TCP.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
All intermediate conditions include it already, make them
simpler as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
After changeset:
[NETFILTER]: Introduce NF_INET_ hook values
It always evaluates to NF_INET_POST_ROUTING.
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for async resumptions on input. To do so, the
transform would return -EINPROGRESS and subsequently invoke the
function xfrm_input_resume to resume processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nhoff field isn't actually necessary in xfrm_input. For tunnel
mode transforms we now throw away the output IP header so it makes no
sense to fill in the nexthdr field. For transport mode we can now let
the function transport_finish do the setting and it knows where the
nexthdr field is.
The only other thing that needs the nexthdr field to be set is the
header extraction code. However, we can simply move the protocol
extraction out of the generic header extraction.
We want to minimise the amount of info we have to carry around between
transforms as this simplifies the resumption process for async crypto.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>