The netlink portid is an unsigned integer, use this type
also in netfilter.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
They are:
* nf_tables set timeout infrastructure from Patrick Mchardy.
1) Add support for set timeout support.
2) Add support for set element timeouts using the new set extension
infrastructure.
4) Add garbage collection helper functions to get rid of stale elements.
Elements are accumulated in a batch that are asynchronously released
via RCU when the batch is full.
5) Add garbage collection synchronization helpers. This introduces a new
element busy bit to address concurrent access from the netlink API and the
garbage collector.
5) Add timeout support for the nft_hash set implementation. The garbage
collector peridically checks for stale elements from the workqueue.
* iptables/nftables cgroup fixes:
6) Ignore non full-socket objects from the input path, otherwise cgroup
match may crash, from Daniel Borkmann.
7) Fix cgroup in nf_tables.
8) Save some cycles from xt_socket by skipping packet header parsing when
skb->sk is already set because of early demux. Also from Daniel.
* br_netfilter updates from Florian Westphal.
9) Save frag_max_size and restore it from the forward path too.
10) Use a per-cpu area to restore the original source MAC address when traffic
is DNAT'ed.
11) Add helper functions to access physical devices.
12) Use these new physdev helper function from xt_physdev.
13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
state information.
14) Annotate original layer 2 protocol number in nf_bridge info, instead of
using kludgy flags.
15) Also annotate the pkttype mangling when the packet travels back and forth
from the IP to the bridge layer, instead of using a flag.
* More nf_tables set enhancement from Patrick:
16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending
stale elements scheduled to be released.
18) Restrict loop checks to set maps.
19) Add support for dynamic set updates from the packet path.
20) Add support to store optional user data (eg. comments) per set element.
BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
More recent GCC warns about two kinds of switch statement uses:
1) Switching on an enumeration, but not having an explicit case
statement for all members of the enumeration. To show the
compiler this is intentional, we simply add a default case
with nothing more than a break statement.
2) Switching on a boolean value. I think this warning is dumb
but nevertheless you get it wholesale with -Wswitch.
This patch cures all such warnings in netfilter.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add an userdata set extension and allow the user to attach arbitrary
data to set elements. This is intended to hold TLV encoded data like
comments or DNS annotations that have no meaning to the kernel.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new "dynset" expression for dynamic set updates.
A new set op ->update() is added which, for non existant elements,
invokes an initialization callback and inserts the new element.
For both new or existing elements the extenstion pointer is returned
to the caller to optionally perform timer updates or other actions.
Element removal is not supported so far, however that seems to be a
rather exotic need and can be added later on.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently a set binding is assumed to be related to a lookup and, in
case of maps, a data load.
In order to use bindings for set updates, the loop detection checks
must be restricted to map operations only. Add a flags member to the
binding struct to hold the set "action" flags such as NFT_SET_MAP,
and perform loop detection based on these.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use atomic operations for the element count to avoid races with async
updates.
To properly handle the transactional semantics during netlink updates,
deleted but not yet committed elements are accounted for seperately and
are treated as being already removed. This means for the duration of
a netlink transaction, the limit might be exceeded by the amount of
elements deleted. Set implementations must be prepared to handle this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NFT_SET_TIMEOUT flag is ignore in nft_select_set_ops, which may
lead to selection of a set implementation that doesn't actually
support timeouts.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
right now we store this in the nf_bridge_info struct, accessible
via skb->nf_bridge. This patch prepares removal of this pointer from skb:
Instead of using skb->nf_bridge->x, we use helpers to obtain the in/out
device (or ifindexes).
Followup patches to netfilter will then allow nf_bridge_info to be
obtained by a call into the br_netfilter core, rather than keeping a
pointer to it in sk_buff.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently in xt_socket, we take advantage of early demuxed sockets
since commit 00028aa370 ("netfilter: xt_socket: use IP early demux")
in order to avoid a second socket lookup in the fast path, but we
only make partial use of this:
We still unnecessarily parse headers, extract proto, {s,d}addr and
{s,d}ports from the skb data, accessing possible conntrack information,
etc even though we were not even calling into the socket lookup via
xt_socket_get_sock_{v4,v6}() due to skb->sk hit, meaning those cycles
can be spared.
After this patch, we only proceed the slower, manual lookup path
when we have a skb->sk miss, thus time to match verdict for early
demuxed sockets will improve further, which might be i.e. interesting
for use cases such as mentioned in 681f130f39 ("netfilter: xt_socket:
add XT_SOCKET_NOWILDCARD flag").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is currently always set to NULL, but nf_queue is adjusted to be
prepared for it being set to a real socket by taking and releasing a
reference to that socket when necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>
That way we don't have to reinstantiate another nf_hook_state
on the stack of the nf_reinject() path.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of passing a large number of arguments down into the nf_hook()
entry points, create a structure which carries this state down through
the hook processing layers.
This makes is so that if we want to change the types or signatures of
any of these pieces of state, there are less places that need to be
changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to stop iterating on the rule expressions if the cgroup
mismatches. Moreover, make sure a non-full socket from the input path
leads us to a crash.
Fixes: ce67417 ("netfilter: nft_meta: add cgroup support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While originally only being intended for outgoing traffic, commit
a00e76349f ("netfilter: x_tables: allow to use cgroup match for
LOCAL_IN nf hooks") enabled xt_cgroups for the NF_INET_LOCAL_IN hook
as well, in order to allow for nfacct accounting.
Besides being currently limited to early demuxes only, commit
a00e76349f forgot to add a check if we deal with full sockets,
i.e. in this case not with time wait sockets. TCP time wait sockets
do not have the same memory layout as full sockets, a lower memory
footprint and consequently also don't have a sk_classid member;
probing for sk_classid member there could potentially lead to a
crash.
Fixes: a00e76349f ("netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks")
Cc: Alexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for element timeouts to nft_hash. The lookup and walking
functions are changed to ignore timed out elements, a periodic garbage
collection task cleans out expired entries.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
GC is expected to happen asynchrously to the netlink interface. In the
netlink path, both insertion and removal of elements consist of two
steps, insertion followed by activation or deactivation followed by
removal, during which the element must not be freed by GC.
The synchronization helpers use an unused bit in the genmask field to
atomically mark an element as "busy", meaning it is either currently
being handled through the netlink API or by GC.
Elements being processed by GC will never survive, netlink will simply
ignore them. Elements being currently processed through netlink will be
skipped by GC and reprocessed during the next run.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helpers for GC batch destruction: since element destruction needs
a RCU grace period for all set implementations, add some helper functions
for asynchronous batch destruction. Elements are collected in a batch
structure, which is asynchronously released using RCU once its full.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add API support for set element timeouts. Elements can have a individual
timeout value specified, overriding the sets' default.
Two new extension types are used for timeouts - the timeout value and
the expiration time. The timeout value only exists if it differs from
the default value.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add set timeout support to the netlink API. Sets with timeout support
enabled can have a default timeout value and garbage collection interval
specified.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:
1) Move netns to basechain and use recently added possible_net_t, from
Patrick McHardy.
2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.
3) Restore nf_log_trace that was accidentally removed during conflict
resolution.
4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
all patches from Patrick McHardy.
5) Use raw_smp_processor_id() in nft_meta.
Then, several patches to prepare ground for the new set extension
infrastructure:
6) Pass object length to the hash callback in rhashtable as needed by
the new set extension infrastructure.
7) Cleanup patch to restore struct nft_hash as wrapper for struct
rhashtable
8) Another small source code readability cleanup for nft_hash.
9) Convert nft_hash to rhashtable callbacks.
And finally...
10) Add the new set extension infrastructure.
11) Convert the nft_hash and nft_rbtree sets to use it.
12) Batch set element release to avoid several RCU grace period in a row
and add new function nft_set_elem_destroy() to consolidate set element
release.
13) Return the set extension data area from nft_lookup.
14) Refactor existing transaction code to add some helper functions
and document it.
15) Complete the set transaction support, using similar approach to what we
already use, to activate/deactivate elements in an atomic fashion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>