When zones were originally introduced, the expectation functions were
all extended to perform lookup using the zone. However, insertion was
not modified to check the zone. This means that two expectations which
are intended to apply for different connections that have the same tuple
but exist in different zones cannot both be tracked.
Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones")
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
IPVS Fixes for v4.2
please consider this fix for v4.2.
For reasons that are not clear to me it is a bumper crop.
It seems to me that they are all relevant to stable.
Please let me know if you need my help to get the fixes into stable.
* ipvs: fix ipv6 route unreach panic
This problem appears to be present since IPv6 support was added to
IPVS in v2.6.28.
* ipvs: skb_orphan in case of forwarding
This appears to resolve a problem resulting from a side effect of
41063e9dd1 ("ipv4: Early TCP socket demux.") which was included in v3.6.
* ipvs: do not use random local source address for tunnels
This appears to resolve a problem introduced by
026ace060d ("ipvs: optimize dst usage for real server") in v3.10.
* ipvs: fix crash if scheduler is changed
This appears to resolve a problem introduced by
ceec4c3816 ("ipvs: convert services to rcu") in v3.10.
Julian has provided backports of the fix:
* [PATCHv2 3.10.81] ipvs: fix crash if scheduler is changed
http://www.spinics.net/lists/lvs-devel/msg04008.html
* [PATCHv2 3.12.44,3.14.45,3.18.16,4.0.6] ipvs: fix crash if scheduler is changed
http://www.spinics.net/lists/lvs-devel/msg04007.html
Please let me know how you would like to handle guiding these
backports into stable.
* ipvs: fix crash with sync protocol v0 and FTP
This appears to resolve a problem introduced by
749c42b620 ("ipvs: reduce sync rate with time thresholds") in v3.5
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Quoting Daniel Borkmann:
"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.
Minimal example:
ip netns add foo
ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
ip netns del foo
What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.
Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net->ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.
This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."
Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.
Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.
Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Fix crash in 3.5+ if FTP is used after switching
sync_version to 0.
Fixes: 749c42b620 ("ipvs: reduce sync rate with time thresholds")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it. In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
Additionally, skb_set_owner_w is totally broken for non full-socks.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
I overlooked the svc->sched_data usage from schedulers
when the services were converted to RCU in 3.10. Now
the rare ipvsadm -E command can change the scheduler
but due to the reverse order of ip_vs_bind_scheduler
and ip_vs_unbind_scheduler we provide new sched_data
to the old scheduler resulting in a crash.
To fix it without changing the scheduler methods we
have to use synchronize_rcu() only for the editing case.
It means all svc->scheduler readers should expect a
NULL value. To avoid breakage for the service listing
and ipvsadm -R we can use the "none" name to indicate
that scheduler is not assigned, a state when we drop
new connections.
Reported-by: Alexander Vasiliev <a.vasylev@404-group.com>
Fixes: ceec4c3816 ("ipvs: convert services to rcu")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Michael Vallaly reports about wrong source address used
in rare cases for tunneled traffic. Looks like
__ip_vs_get_out_rt in 3.10+ is providing uninitialized
dest_dst->dst_saddr.ip because ip_vs_dest_dst_alloc uses
kmalloc. While we retry after seeing EINVAL from routing
for data that does not look like valid local address, it
still succeeded when this memory was previously used from
other dests and with different local addresses. As result,
we can use valid local address that is not suitable for
our real server.
Fix it by providing 0.0.0.0 every time our cache is refreshed.
By this way we will get preferred source address from routing.
Reported-by: Michael Vallaly <lvs@nolatency.com>
Fixes: 026ace060d ("ipvs: optimize dst usage for real server")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Previously there was a trivial panic
unshare -n /bin/bash <<EOF
ip addr add dev lo face::1/128
ipvsadm -A -t [face::1]:15213
ipvsadm -a -t [face::1]:15213 -r b00c::1
echo boom | nc face::1 15213
EOF
This patch allows us to replicate the net logic above and simply capture
the skb_dst(skb)->dev and use that for the purpose of the invocation.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
We have to put back the references to the master conntrack and the expectation
that we just created, otherwise we'll leak them.
Fixes: 0ef71ee1a5 ("netfilter: ctnetlink: refactor ctnetlink_create_expect")
Reported-by: Tim Wiess <Tim.Wiess@watchguard.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After a fresh boot with no modules in place at all and a large rulesets, the
existing nfnetlink_rcv_batch() funcion can take long time to commit the ruleset
due to the many abort path. This is specifically a problem for the existing
client of this code, ie. nf_tables, since it results in several
synchronize_rcu() call in a row.
This patch changes the policy to keep full batch processing on missing modules
errors so we abort only once.
Reported-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If someone sends packets from one of the netdevice ingress hooks to
the a userspace queue, and then userspace later accepts the packet,
the netfilter code can enter an infinite loop as the list head will
never be found.
Pass in the saved list_head to avoid this.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currenlty nf_tables chains added in one network namespace are being
run in all network namespace. The issues are myriad with the simplest
being an unprivileged user can cause any network packets to be dropped.
Address this by simply not running nf_tables chains in the wrong
network namespace.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to pull the full definitions in that file, a simple forward
declaration is enough.
Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a
compilation error due to missing declarations, ie.
net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat,
^
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This appears to have been a dead macro in both nfnetlink_log.c and
nfnetlink_queue_core.c since these pieces of code were added in 2005.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xt_socket is useful for matching sockets with IP_TRANSPARENT and
taking some action on the matching packets. However, it lacks the
ability to match only a small subset of transparent sockets.
Suppose there are 2 applications, each with its own set of transparent
sockets. The first application wants all matching packets dropped,
while the second application wants them forwarded somewhere else.
Add the ability to retore the skb->mark from the sk_mark. The mark
is only restored if a matching socket is found and the transparent /
nowildcard conditions are satisfied.
Now the 2 hypothetical applications can differentiate their sockets
based on a mark value set with SO_MARK.
iptables -t mangle -I PREROUTING -m socket --transparent \
--restore-skmark -j action
iptables -t mangle -A action -m mark --mark 10 -j action2
iptables -t mangle -A action -m mark --mark 11 -j action3
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds an additional attribute when sending
packet information via netlink in netfilter_queue module.
It will send additional security context data, so that
userspace applications can verify this context against
their own security databases.
Signed-off-by: Roman Kubiak <r.kubiak@samsung.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.
This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.
This patch reworks ebddf1a8d7 ("netfilter: nf_tables: allow to bind table to
net_device").
Note that this adds a dev_name field in the nft_base_chain structure which is
required the netdev notification subscription that follows up in a patch to
handle gone net_devices.
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore :
Only one copy of table is kept, instead of one copy per cpu.
We also can avoid a dereference if we put table data right after
xt_table_info. It reduces register pressure and helps compiler.
Then, we attempt a kmalloc() if total size is under order-3 allocation,
to reduce TLB pressure, as in many cases, rules fit in 32 KB.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jozsef Kadlecsik says:
====================
ipset patches for nf-next
Please consider to apply the next bunch of patches for ipset. First
comes the small changes, then the bugfixes and at the end the RCU
related patches.
* Use MSEC_PER_SEC consistently instead of the number.
* Use SET_WITH_*() helpers to test set extensions from Sergey Popovich.
* Check extensions attributes before getting extensions from Sergey Popovich.
* Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich.
* Make sure we always return line number on batch in the case of error
from Sergey Popovich.
* Check CIDR value only when attribute is given from Sergey Popovich.
* Fix cidr handling for hash:*net* types, reported by Jonathan Johnson.
* Fix parallel resizing and listing of the same set so that the original
set is kept for the whole dumping.
* Make sure listing doesn't grab a set which is just being destroyed.
* Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU.
* Replace rwlock_t with spinlock_t in "struct ip_set", change the locking
in the core and simplifications in the timeout routines.
* Introduce RCU locking in bitmap:* types with a slight modification in the
logic on how an element is added.
* Introduce RCU locking in hash:* types. This is the most complex part of
the changes.
* Introduce RCU locking in list type where standard rculist is used.
* Fix coding styles reported by checkpatch.pl.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>