We may lost race if we flush the rule-set (which happens asynchronously
via call_rcu) and we try to remove the table (that userspace assumes
to be empty).
Fix this by recovering synchronous rule and chain deletion. This was
introduced time ago before we had no batch support, and synchronous
rule deletion performance was not good. Now that we have the batch
support, we can just postpone the purge of old rule in a second step
in the commit phase. All object deletions are synchronous after this
patch.
As a side effect, we save memory as we don't need rcu_head per rule
anymore.
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.
Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().
A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.
CPU1 CPU2
____nf_conntrack_find()
nf_ct_put()
destroy_conntrack()
...
init_conntrack
__nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
if (!l4proto->new(ct, skb, dataoff, timeouts))
nf_conntrack_free(ct); (use = 2 !!!)
...
__nf_conntrack_alloc (set use = 1)
if (!nf_ct_key_equal(h, tuple, zone))
nf_ct_put(ct); (use = 0)
destroy_conntrack()
/* continue to work with CT */
After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():
<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.
Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For L3-proto independant rules we need to get at the L4 protocol value
directly. Add it to the nft_pktinfo struct and use the meta expression
to retrieve it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintainance in dual-stack setups.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Multi-family tables need the AF from the hook ops. Add a pointer to the
hook ops and replace usage of the hooknum member in struct nft_pktinfo.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says: <pablo@netfilter.org>
====================
nftables updates for net-next
The following patchset contains nftables updates for your net-next tree,
they are:
* Add set operation to the meta expression by means of the select_ops()
infrastructure, this allows us to set the packet mark among other things.
From Arturo Borrero Gonzalez.
* Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel
Borkmann.
* Add new queue expression to nf_tables. These comes with two previous patches
to prepare this new feature, one to add mask in nf_tables_core to
evaluate the queue verdict appropriately and another to refactor common
code with xt_NFQUEUE, from Eric Leblond.
* Do not hide nftables from Kconfig if nfnetlink is not enabled, also from
Eric Leblond.
* Add the reject expression to nf_tables, this adds the missing TCP RST
support. It comes with an initial patch to refactor common code with
xt_NFQUEUE, again from Eric Leblond.
* Remove an unused variable assignment in nf_tables_dump_set(), from Michal
Nazarewicz.
* Remove the nft_meta_target code, now that Arturo added the set operation
to the meta expression, from me.
* Add help information for nf_tables to Kconfig, also from me.
* Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is
available to other nf_tables objects, requested by Arturo, from me.
* Expose the table usage counter, so we can know how many chains are using
this table without dumping the list of chains, from Tomasz Bursztyka.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The following code is not used in current upstream code.
Some of this seems to be old hooks, other might be used by some
out of tree module (which I don't care about breaking), and
the need_ipv4_conntrack was used by old NAT code but no longer
called.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the addition of TCP reset support in
the nft_reject module by moving reusable code into a header
file.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the addition of nft_queue module by moving
reusable code into a header file.
This patch also converts NFQUEUE to use prandom_u32 to initialize
the random jhash seed as suggested by Florian Westphal.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pushing original fragments through causes several problems. For example
for matching, frags may not be matched correctly. Take following
example:
<example>
On HOSTA do:
ip6tables -I INPUT -p icmpv6 -j DROP
ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
and on HOSTB you do:
ping6 HOSTA -s2000 (MTU is 1500)
Incoming echo requests will be filtered out on HOSTA. This issue does
not occur with smaller packets than MTU (where fragmentation does not happen)
</example>
As was discussed previously, the only correct solution seems to be to use
reassembled skb instead of separete frags. Doing this has positive side
effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams
dances in ipvs and conntrack can be removed.
Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
entirely and use code in net/ipv6/reassembly.c instead.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This is another batch containing Netfilter/IPVS updates for your net-next
tree, they are:
* Six patches to make the ipt_CLUSTERIP target support netnamespace,
from Gao feng.
* Two cleanups for the nf_conntrack_acct infrastructure, introducing
a new structure to encapsulate conntrack counters, from Holger
Eitzenberger.
* Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann.
* Skip checksum recalculation in SCTP support for IPVS, also from
Daniel Borkmann.
* Fix behavioural change in xt_socket after IP early demux, from
Florian Westphal.
* Fix bogus large memory allocation in the bitmap port set type in ipset,
from Jozsef Kadlecsik.
* Fix possible compilation issues in the hash netnet set type in ipset,
also from Jozsef Kadlecsik.
* Define constants to identify netlink callback data in ipset dumps,
again from Jozsef Kadlecsik.
* Use sock_gen_put() in xt_socket to replace xt_socket_put_sk,
from Eric Dumazet.
* Improvements for the SH scheduler in IPVS, from Alexander Frolkin.
* Remove extra delay due to unneeded rcu barrier in IPVS net namespace
cleanup path, from Julian Anastasov.
* Save some cycles in ip6t_REJECT by skipping checksum validation in
packets leaving from our stack, from Stanislav Fomichev.
* Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from
Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Encapsulate counters for both directions into nf_conn_acct. During
that process also consistently name pointers to the extend 'acct',
not 'counters'. This patch is a cleanup.
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for tracing the packet travel through
the ruleset, in a similar fashion to x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a batch support to nfnetlink. Basically, it adds
two new control messages:
* NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
the nfgenmsg->res_id indicates the nfnetlink subsystem ID.
* NFNL_MSG_BATCH_END, that results in the invocation of the
ss->commit callback function. If not specified or an error
ocurred in the batch, the ss->abort function is invoked
instead.
The end message represents the commit operation in nftables, the
lack of end message results in an abort. This patch also adds the
.call_batch function that is only called from the batch receival
path.
This patch adds atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.
The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:
00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
^
gencursor
Once you invoke the transition to the next generation, the global
gencursor is updated:
00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor
If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.
This new operation can be used from the user-space utility
that controls the firewall, eg.
nft -f restore
The rule updates contained in `file' will be applied atomically.
cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop #2
-EOF-
Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.
There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the commit list that
contain rules that require updates is finished.
Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.
This patch squashed the following patches from Pablo:
* nf_tables: atomic rule updates and dumps
* nf_tables: get rid of per rule list_head for commits
* nf_tables: use per netns commit list
* nfnetlink: add batch support and use it from nf_tables
* nf_tables: all rule updates are transactional
* nf_tables: attach replacement rule after stale one
* nf_tables: do not allow deletion/replacement of stale rules
* nf_tables: remove unused NFTA_RULE_FLAGS
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.
This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.
In order to get this compatibility layer working, I've done the
following things:
* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.
* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.
* add support for default policy to base chains, required to emulate
x_tables.
* add NFTA_CHAIN_USE attribute to obtain the number of references to
chains, required by x_tables emulation.
* add chain packet/byte counters using per-cpu.
* support 32-64 bits compat.
For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.
From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled
From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes
From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT
From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>