This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:
- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets
- addition of set elements
- removal of set elements
- dumping of all set elements
Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.
Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.
Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.
Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.
The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.
Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.
This patch includes the new 'lookup' expression to look up for element
matching in the set.
This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.
From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection
From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.
This is an adapted version of the original patch from Patrick McHardy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1) We need to take a timestamp only for skb that should be cloned.
Other skbs are not in write queue and no rtt estimation is done on them.
2) the unlikely() hint is wrong for receivers (they send pure ACK)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: MF Nowlan <fitz@cs.yale.edu>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Included changes:
- update emails for A. Quartulli and M. Lindner in MAINTAINERS
- switch to the next on-the-wire protocol version
- introduce the T(ype) V(ersion) L(ength) V(alue) framework
- adjust the existing components to make them use the new TVLV code
- make the TT component use CRC32 instead of CRC16
- totally remove the VIS functionality (has been moved to userspace)
- reorder packet types and flags
- add static checks on packet format
- remove __packed from batadv_ogm_packet
In commit 634fb979e8 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :
skc_dport is __be16 (network order), but skc_num is __u16 (host order)
So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16
Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 5 :
We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.
This patch includes the needed struct sock_common in front
of struct request_sock
This means there is no more inet6_request_sock IPv6 specific
structure.
Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.
loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.
This still builds very efficient skbs, and allows reaching 45 MSS per
skb.
(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)
High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)
Forwarding setups could be hurt, as such skbs will need to be
linearized, although its not a new problem, as GRO could already
provide skbs with a frag_list.
We could make the 65536 bytes threshold a tunable to mitigate this.
(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a enhancement.
for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.
Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vis flag is not needed anymore, and since we do a compat bump we
can start with the first bit again
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
As we decreased the struct size from 26 to 24 byte, we can remove
__packed as the compiler will not add any more padding.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Reordering the packet type numbers allows us to handle unicast
packets in a general way - even if we don't know the specific packet
type, we can still forward it. There was already code handling
this for a couple of unicast packets, and this is the more
generalized version to do that.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Since we removed the __packed from most of the packets, we should
make sure that the offset generated by the compiler are correct for
sent/received data.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Client flags from bit 0 to 7 are sent over the wire.
BATADV_TT_CLIENT_TEMP is a local flag and is not supposed
to be sent to the network. Therefore it has occupy a
higher bit.
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
CRC32C has to be preferred to CRC16 because of its possible
HW native support and because of the reduced collision
probability. With this change the Translation Table
component now uses CRC32C to compute the local and global
table checksum.
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Instead of generating roaming specific packets the TVLV unicast API is
used to send roaming information.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Instead of generating TT specific packets the TVLV unicast API is used
to send translation table data.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The translation table meta data (version number, crc checksum, etc)
as well as the translation table diff propgated within OGMs now uses
the newly introduced tvlv infrastructure.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Prior to this patch batman-adv read the advertised uplink bandwidth
from userspace and compressed this information into a single byte
called "gateway class".
Now the download & upload bandwidth information is sent as-is. No
userspace change is necessary since the sysfs API always allowed
to specify a bandwidth.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Spyros Gasteratos <morfeas3000@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The goal is to provide the infrastructure for sending, receiving and
parsing information 'containers' while preserving backward
compatibility. TVLV (based on the commonly known Type Length Value
technique) was chosen as the format for those containers. Even if a
node does not know the tvlv type of a certain container it can simply
skip the current container and proceed with the next. Past experience
has shown features evolve over time, so a 'version' field was added
right from the start to allow differentiating between feature
variants - hence the name: T(ype) V(ersion) L(ength) V(alue).
This patch introduces the basic TVLV infrastructure:
* register / unregister tvlv containers to be sent with each OGM
(on primary interfaces only)
* register / unregister callback handlers to be called upon
finding the corresponding tvlv type in a tvlv buffer
* unicast tvlv send / receive API calls
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Spyros Gasteratos <morfeas3000@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>