In 5e140dfc1f "net: reorder struct Qdisc
for better SMP performance" the definition of struct gnet_stats_basic
changed incompatibly, as copies of this struct are shipped to
userland via netlink.
Restoring the old layout is not desirable, for performance reasons.
The fix is to use a private structure for the kernel and to teach
gnet_stats_copy_basic() to convert from the kernel layout to userland,
using the legacy structure (struct gnet_stats_basic).
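A minimal sketch of the idea; the name of the kernel-private structure is
illustrative, and gnet_stats_copy() stands for the existing copy-to-netlink
step:

    /* legacy layout, still shipped to userland via netlink */
    struct gnet_stats_basic {
        __u64   bytes;
        __u32   packets;
    };

    /* kernel-private layout, free to change for SMP performance
     * (name illustrative) */
    struct gnet_stats_basic_kernel {
        u64     bytes;
        u32     packets;
    };

    int gnet_stats_copy_basic(struct gnet_dump *d,
                              struct gnet_stats_basic_kernel *b)
    {
        struct gnet_stats_basic sb;

        sb.bytes   = b->bytes;
        sb.packets = b->packets;
        /* convert to the legacy layout before copying to userland */
        return gnet_stats_copy(d, TCA_STATS_BASIC, &sb, sizeof(sb));
    }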
Based on a report and initial patch from Michael Spang.
Reported-by: Michael Spang <mspang@csclub.uwaterloo.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since the slab allocator may hand back a freed object that is still in
use by lockless readers.
In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
always being valid (i.e. containing a valid 'nulls' value, or a valid pointer to the
next object in the hash chain).
kmem_cache_zalloc() zeroes the object, but NULL is not a valid value
for ct->tuplehash[xxx].hnnode.next.
The fix is to call kmem_cache_alloc() and do the zeroing ourselves.
As spotted by Patrick, we also need to make sure the lookup keys are committed to
memory before setting the refcount to 1, or a lockless reader could get a reference
on the old version of the object and its subsequent key re-check could wrongly succeed.
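A sketch of the described fix (simplified; error handling and the exact
offsets in the real code are omitted):

    ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
    if (ct == NULL)
        return ERR_PTR(-ENOMEM);

    /* Do not use kmem_cache_zalloc(): with SLAB_DESTROY_BY_RCU the object
     * may still be examined by lockless readers, and hnnode.next must never
     * hold a plain NULL.  Only zero the part after the tuplehash array. */
    memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
           sizeof(*ct) - offsetof(struct nf_conn, tuplehash[IP_CT_DIR_MAX]));

    ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *orig;
    ct->tuplehash[IP_CT_DIR_REPLY].tuple = *repl;

    /* Commit the lookup keys to memory before setting the refcount to 1,
     * so a lockless reader cannot take a reference and then have its key
     * re-check pass against stale contents. */
    smp_wmb();
    atomic_set(&ct->ct_general.use, 1);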
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When NAT helpers change the TCP packet size, the highest seen sequence
number needs to be corrected. This is currently only done upwards; when
the packet size is reduced, the sequence number is left unchanged. This causes
TCP conntrack to falsely detect unacknowledged data and decrease the
timeout.
Fix by updating the highest seen sequence number in both directions after
packet mangling.
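Conceptually (a sketch only, not the exact kernel code; field names follow
struct ip_ct_tcp):

    /* end of the segment as it looks after the helper mangled it */
    end = ntohl(tcph->seq) + skb->len - dataoff - tcph->doff * 4;

    /* track the highest seen sequence number even if the packet shrank,
     * so conntrack does not mistake the missing bytes for unacknowledged
     * data and shorten the timeout */
    ct->proto.tcp.seen[dir].td_end = end;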
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
An RCU barrier, rcu_barrier(), is inserted in two places.
In nf_conntrack_expect.c nf_conntrack_expect_fini(), before the
kmem_cache_destroy(): firstly to make sure the nf_ct_expect_free_rcu()
callback code is still around, and secondly because I'm unsure about the
consequences of having in-flight nf_ct_expect_free_rcu()/kmem_cache_free()
calls while doing a kmem_cache_destroy() slab destroy.
And in nf_conntrack_extend.c nf_ct_extend_unregister(), in order to
wait for completion of the __nf_ct_ext_free_rcu() callbacks, which are
scheduled by __nf_ct_ext_add(). It might be more efficient to call
rcu_barrier() in nf_conntrack_core.c nf_conntrack_cleanup_net(), but
that would make the code harder to read (as the callback code
is located in nf_conntrack_extend.c).
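For the expectation case the ordering is, roughly:

    /* wait for all pending nf_ct_expect_free_rcu() callbacks to finish
     * (freeing their objects back into the cache) before the cache
     * itself is destroyed */
    rcu_barrier();
    kmem_cache_destroy(nf_ct_expect_cachep);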
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As noticed by Török Edwin <edwintorok@gmail.com>:
Compiling the kernel with clang has shown this warning:
net/netfilter/xt_rateest.c:69:16: warning: self-comparison always results in a
constant value
ret &= pps2 == pps2;
^
Looking at the code:
    if (info->flags & XT_RATEEST_MATCH_BPS)
        ret &= bps1 == bps2;
    if (info->flags & XT_RATEEST_MATCH_PPS)
        ret &= pps2 == pps2;
Judging from the MATCH_BPS case it seems to be a typo, with the intention of
comparing pps1 with pps2.
http://bugzilla.kernel.org/show_bug.cgi?id=13535
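With the typo fixed, the PPS case matches the BPS one:

    if (info->flags & XT_RATEEST_MATCH_PPS)
        ret &= pps1 == pps2;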
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit v2.6.29-rc5-872-gacc738f ("xtables: avoid pointer to self")
forgot to copy the initial quota value supplied by iptables into the
private structure, thus counting from whatever was in the memory
kmalloc returned.
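The missing step is essentially one assignment in the checkentry hook
(the private-structure field names here are illustrative):

    /* seed the private counter from the value iptables supplied,
     * instead of leaving it uninitialized */
    priv->quota = info->quota;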
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
net/netfilter/xt_NFQUEUE.c:46:9: warning: incorrect type in assignment (different base types)
net/netfilter/xt_NFQUEUE.c:46:9: expected unsigned int [unsigned] [usertype] ipaddr
net/netfilter/xt_NFQUEUE.c:46:9: got restricted unsigned int
net/netfilter/xt_NFQUEUE.c:68:10: warning: incorrect type in assignment (different base types)
net/netfilter/xt_NFQUEUE.c:68:10: expected unsigned int [unsigned] <noident>
net/netfilter/xt_NFQUEUE.c:68:10: got restricted unsigned int
net/netfilter/xt_NFQUEUE.c:69:10: warning: incorrect type in assignment (different base types)
net/netfilter/xt_NFQUEUE.c:69:10: expected unsigned int [unsigned] <noident>
net/netfilter/xt_NFQUEUE.c:69:10: got restricted unsigned int
net/netfilter/xt_NFQUEUE.c:70:10: warning: incorrect type in assignment (different base types)
net/netfilter/xt_NFQUEUE.c:70:10: expected unsigned int [unsigned] <noident>
net/netfilter/xt_NFQUEUE.c:70:10: got restricted unsigned int
net/netfilter/xt_NFQUEUE.c:71:10: warning: incorrect type in assignment (different base types)
net/netfilter/xt_NFQUEUE.c:71:10: expected unsigned int [unsigned] <noident>
net/netfilter/xt_NFQUEUE.c:71:10: got restricted unsigned int
net/netfilter/xt_cluster.c:20:55: warning: incorrect type in return expression (different base types)
net/netfilter/xt_cluster.c:20:55: expected unsigned int
net/netfilter/xt_cluster.c:20:55: got restricted unsigned int const [usertype] ip
net/netfilter/xt_cluster.c:20:55: warning: incorrect type in return expression (different base types)
net/netfilter/xt_cluster.c:20:55: expected unsigned int
net/netfilter/xt_cluster.c:20:55: got restricted unsigned int const [usertype] ip
Signed-off-by: Patrick McHardy <kaber@trash.net>
The RCU protected conntrack hash lookup only checks whether the entry
has a refcount of zero to decide whether it is stale. This is not
sufficient: entries are explicitly removed while there is at least
one reference left, possibly more. Explicitly check whether the entry
has been marked as dying to fix this.
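A sketch of the strengthened check in the lookup path (helper names as in
nf_conntrack.h):

    ct = nf_ct_tuplehash_to_ctrack(h);
    /* skip entries already marked as dying, not only those whose
     * refcount has already dropped to zero */
    if (unlikely(nf_ct_is_dying(ct) ||
                 !atomic_inc_not_zero(&ct->ct_general.use)))
        h = NULL;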
Signed-off-by: Patrick McHardy <kaber@trash.net>
New connection tracking entries are inserted into the hash before they
are fully set up, namely the CONFIRMED bit is not set and the timer not
started yet. This can theoretically lead to a race with the timer, which
would set the timeout value to a relative value, most likely already in
the past.
Perform hash insertion as the final step to fix this.
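A sketch of the resulting order in the confirm path:

    /* finish setting up the entry first ... */
    ct->timeout.expires += jiffies;
    add_timer(&ct->timeout);
    atomic_inc(&ct->ct_general.use);
    set_bit(IPS_CONFIRMED_BIT, &ct->status);

    /* ... and only then make it visible to lockless readers */
    __nf_conntrack_hash_insert(ct, hash, repl_hash);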
Signed-off-by: Patrick McHardy <kaber@trash.net>
death_by_timeout() might delete a conntrack from the hash list
and insert it into the dying list:

    nf_ct_delete_from_lists(ct);
    nf_ct_insert_dying_list(ct);
I believe a (lockless) reader could *catch* ct while doing a lookup
and miss the end of its chain.
(The nulls lookup algorithm must check the null value at the end of the lookup
and should restart if the null value is not the expected one;
cf. Documentation/RCU/rculist_nulls.txt for details.)
We need to change nf_conntrack_init_net() and use a different "null" value,
guaranteed not to be used in the regular lists. Choose very large values, since
the hash table uses the null values [0..size-1].
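A sketch of the change in nf_conntrack_init_net() (macro names illustrative):

    /* nulls values outside [0..hashsize-1] can never be confused with
     * the per-bucket values used by the regular hash chains */
    #define UNCONFIRMED_NULLS_VAL   ((1 << 30) + 0)
    #define DYING_NULLS_VAL         ((1 << 30) + 1)

    INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);
    INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL);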
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch improves ctnetlink event reliability if one broadcast
listener has set the NETLINK_BROADCAST_ERROR socket option.
The logic is the following: if an event delivery fails, we keep
the undelivered events in the missed event cache. Once the next
packet arrives, we add the new events (if any) to the missed
events in the cache and we try a new delivery, and so on. Thus,
if ctnetlink fails to deliver an event, we retry the delivery
once we see a new packet. Therefore, we may lose state
transitions, but the userspace process gets back in sync at some point.
In the worst case, if no events were delivered to userspace, we make
sure that destroy events are successfully delivered. Basically,
if ctnetlink fails to deliver the destroy event, we remove the
conntrack entry from the hashes and insert it into the dying
list, which contains inactive entries. Then, the conntrack timer
is added with an extra grace timeout of random32() % 15 seconds
to trigger the event again (this grace timeout is tunable via
/proc). The use of a limited random timeout value allows
distributing the "destroy" resends, thus avoiding the accumulation of
lots of "destroy" events at the same time. Events may be delivered
out of order, but we can identify them by means of the tuple plus
the conntrack ID.
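A sketch of the retry mechanics described above (helper and timer setup
simplified):

    /* delivery of the destroy event failed: park the entry on the dying
     * list and rearm its timer with a small random grace period so the
     * event is resent later */
    nf_ct_delete_from_lists(ct);
    nf_ct_insert_dying_list(ct);
    ct->timeout.expires = jiffies + (random32() % (15 * HZ));
    add_timer(&ct->timeout);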
The maximum number of conntrack entries (active or inactive) is
still handled by nf_conntrack_max. Thus, we may start dropping
packets at some point if we accumulate a lot of inactive conntrack
entries that did not successfully report the destroy event to
userspace.
During my stress tests, consisting of setting a very small buffer
of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
flag, and generating lots of very small connections, I noticed
very few destroy entries on the fly waiting to be resent.
A simple way to test this patch consists of creating a lot of
entries, setting a very small Netlink buffer in conntrackd (plus a patch,
which is not in the git tree, to set the BROADCAST_ERROR flag)
and invoking `conntrack -F'.
For expectations, no changes are introduced in this patch.
Currently, event delivery is only done for new expectations (no
events from expectation expiration, removal and confirmation).
To cover those, a per-expectation event cache would be needed to
implement the same idea that this patch introduces.
This patch can be useful to provide reliable flow-accounting. We
still have to add a new conntrack extension to store the creation
and destroy time.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch moves the helper destruction to a function that lives
in nf_conntrack_helper.c. This new function is used in the patch
to add ctnetlink reliable event delivery.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch reworks the per-cpu event caching to use the conntrack
extension infrastructure.
The main drawback is that we consume more memory per conntrack
if event delivery is enabled. This patch is required by the
reliable event delivery patch that follows.
BTW, this patch allows you to enable/disable event delivery via
/proc/sys/net/netfilter/nf_conntrack_events at runtime, although
you can still disable event caching as a compile-time option.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use mod_timer_pending() instead of an atomic sequence of del_timer()/
add_timer(). mod_timer_pending() does not rearm an inactive timer,
so we don't need the conntrack lock anymore to make sure we don't
accidentally rearm the timer of a conntrack which is in the process
of being destroyed.
With this change, we don't need to take the global lock at all anymore;
counter updates can be performed under the per-conntrack lock.
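The pattern change, roughly:

    /* before: only safe under the global conntrack lock, because
     * del_timer()+add_timer() could otherwise rearm a timer that
     * death_by_timeout() is tearing down */
    if (del_timer(&ct->timeout)) {
        ct->timeout.expires = newtime;
        add_timer(&ct->timeout);
    }

    /* after: mod_timer_pending() refuses to rearm an inactive timer,
     * so no global lock is needed */
    mod_timer_pending(&ct->timeout, newtime);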
Signed-off-by: Patrick McHardy <kaber@trash.net>
'.ko' is normally not included in Kconfig help; make it consistent.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Introduce per-conntrack locks and use them instead of the global protocol
locks to avoid contention. Especially tcp_lock shows up very high in
profiles on larger machines.
This will also allow the upcoming reliable event delivery patches to be simplified.
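A sketch of the intended usage in the TCP tracker (field names follow
struct nf_conn and struct ip_ct_tcp):

    /* protocol-private state is now serialized per conntrack instead of
     * under the global tcp_lock */
    spin_lock_bh(&ct->lock);
    ct->proto.tcp.state = new_state;
    spin_unlock_bh(&ct->lock);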
Signed-off-by: Patrick McHardy <kaber@trash.net>