Add a operations structure that allows a network interface to export
the fact that it supports package forwarding in hardware between
physical interfaces and other mac layer devices assigned to it (such
as macvlans). This operaions structure can be used by virtual mac
devices to bypass software switching so that forwarding can be done
in hardware more efficiently.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
This work contains a lightweight BPF-based traffic classifier that can
serve as a flexible alternative to ematch-based tree classification, i.e.
now that BPF filter engine can also be JITed in the kernel. Naturally, tc
actions and policies are supported as well with cls_bpf. Multiple BPF
programs/filter can be attached for a class, or they can just as well be
written within a single BPF program, that's really up to the user how he
wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
The notion of a BPF program's return/exit codes is being kept as follows:
0: No match
-1: Select classid given in "tc filter ..." command
else: flowid, overwrite the default one
As a minimal usage example with iproute2, we use a 3 band prio root qdisc
on a router with sfq each as leave, and assign ssh and icmp bpf-based
filters to band 1, http traffic to band 2 and the rest to band 3. For the
first two bands we load the bytecode from a file, in the 2nd we load it
inline as an example:
echo 1 > /proc/sys/net/core/bpf_jit_enable
tc qdisc del dev em1 root
tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev em1 parent 1:1 sfq perturb 16
tc qdisc add dev em1 parent 1:2 sfq perturb 16
tc qdisc add dev em1 parent 1:3 sfq perturb 16
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
BPF programs can be easily created and passed to tc, either as inline
'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
compile opcodes, for example:
1) People familiar with tcpdump-like filters:
tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
2) People that want to low-level program their filters or use BPF
extensions that lack support by libpcap's compiler:
bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
ssh.ops example code:
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4 * ([14] & 0xf)
ldh [%x + 14]
jeq #0x16, pass
ldh [%x + 16]
jne #0x16, drop
pass: ret #-1
drop: ret #0
It was chosen to load bytecode into tc, since the reverse operation,
tc filter list dev em1, is then able to show the exact commands again.
Possible follow-up work could also include a small expression compiler
for iproute2. Tested with the help of bmon. This idea came up during
the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
Eric Dumazet!
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a socket is freed/reallocated, we need to clear time_next_packet
or else we can inherit a prior value and delay first packets of the
new flow.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The transition from markov state "3 => lost packets within a burst
period" to "1 => successfully transmitted packets within a gap period"
has no *additional* loss event. The loss already happen for transition
from 1 -> 3, this additional loss will make things go wild.
E.g. transition probabilities:
p13: 10%
p31: 100%
Expected:
Ploss = p13 / (p13 + p31)
Ploss = ~9.09%
... but it isn't. Even worse: we get a double loss - each time.
So simple don't return true to indicate loss, rather break and return
false.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Stefano Salsano <stefano.salsano@uniroma2.it>
Cc: Fabio Ludovici <fabio.ludovici@yahoo.it>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy found that if network namespace not enabled then
nd_net does not exist and would cause compilation failure.
This is handled correctly by using the dev_net() macro.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netem can leak memory because packets get stored in red-black
tree and it is not cleared on reset.
Reported by: Сергеев Сергей <adron@yapic.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Steinar reported FQ pacing was not working for UDP flows.
It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)
Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_FQ_INITIAL_QUANTUM should set q->initial_quantum
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_tree_decrease_qlen() is called when some packets are dropped
on a qdisc, and we want to notify parents of qlen changes.
We also can increment parents qdisc qstats drop counters.
This permits more accurate drop counters up to root qdisc.
For example a graft operation typically resets a qdisc
(drops all packets) and call qdisc_tree_decrease_qlen()
Note that callers are responsible for their drop counters.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown. Resulting in subtle memory
corruption.
This is the second bug from sharing the storage between the close_list
and the unreg_list. The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.
v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter updates for your net-next tree,
mostly ipset improvements and enhancements features, they are:
* Don't call ip_nest_end needlessly in the error path from me, suggested
by Pablo Neira Ayuso, from Jozsef Kadlecsik.
* Fixed sparse warnings about shadowed variable and missing rcu annotation
and fix of "may be used uninitialized" warnings, also from Jozsef.
* Renamed simple macro names to avoid namespace issues, reported by David
Laight, again from Jozsef.
* Use fix sized type for timeout in the extension part, and cosmetic
ordering of matches and targets separatedly in xt_set.c, from Jozsef.
* Support package fragments for IPv4 protos without ports from Anders K.
Pedersen. For example this allows a hash:ip,port ipset containing the
entry 192.168.0.1,gre:0 to match all package fragments for PPTP VPN
tunnels to/from the host. Without this patch only the first package
fragment (with fragment offset 0) was matched.
* Introduced a new operation to get both setname and family, from Jozsef.
ip[6]tables set match and SET target need to know the family of the set
in order to reject adding rules which refer to a set with a non-mathcing
family. Currently such rules are silently accepted and then ignored
instead of generating an error message to the user.
* Reworked extensions support in ipset types from Jozsef. The approach of
defining structures with all variations is not manageable as the
number of extensions grows. Therefore a blob for the extensions is
introduced, somewhat similar to conntrack. The support of extensions
which need a per data destroy function is added as well.
* When an element timed out in a list:set type of set, the garbage
collector skipped the checking of the next element. So the purging
was delayed to the next run of the gc, fixed by Jozsef.
* A small Kconfig fix: NETFILTER_NETLINK cannot be selected and
ipset requires it.
* hash:net,net type from Oliver Smith. The type provides the ability to
store pairs of subnets in a set.
* Comment for ipset entries from Oliver Smith. This makes possible to
annotate entries in a set with comments, for example:
ipset n foo hash:net,net comment
ipset a foo 10.0.0.0/21,192.168.1.0/24 comment "office nets A and B"
* Fix of hash types resizing with comment extension from Jozsef.
* Fix of new extensions for list:set type when an element is added
into a slot from where another element was pushed away from Jozsef.
* Introduction of a common function for the listing of the element
extensions from Jozsef.
* Net namespace support for ipset from Vitaly Lavrov.
* hash:net,port,net type from Oliver Smith, which makes possible
to store the triples of two subnets and a protocol, port pair in
a set.
* Get xt_TCPMSS working with net namespace, by Gao feng.
* Use the proper net netnamespace to allocate skbs, also by Gao feng.
* A couple of cleanups for the conntrack SIP helper, by Holger
Eitzenberger.
* Extend cttimeout to allow setting default conntrack timeouts via
nfnetlink, so we can get rid of all our sysctl/proc interfaces in
the future for timeout tuning, from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/usb/qmi_wwan.c
drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
include/net/netfilter/nf_conntrack_synproxy.h
include/net/secure_seq.h
The conflicts are of two varieties:
1) Conflicts with Joe Perches's 'extern' removal from header file
function declarations. Usually it's an argument signature change
or a function being added/removed. The resolutions are trivial.
2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
a new value, another changes an existing value. That sort of
thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
fq_reset() should drops all packets in queue, including
throttled flows.
This patch moves code from fq_destroy() to fq_reset()
to do the cleaning.
fq_change() must stop calling fq_dequeue() if all remaining
packets are from throttled flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than returning earlier value (EINVAL), return ENOMEM if
kzalloc fails. Found while reviewing to find another EINVAL condition.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds netns support for ipset.
Major changes were made in ip_set_core.c and ip_set.h.
Global variables are moved to per net namespace.
Added initialization code and the destruction of the network namespace ipset subsystem.
In the prototypes of public functions ip_set_* added parameter "struct net*".
The remaining corrections related to the change prototypes of public functions ip_set_*.
The patch for git://git.netfilter.org/ipset.git commit 6a4ec96c0b8caac5c35474e40e319704d92ca347
Signed-off-by: Vitaly Lavrov <lve@guap.ru>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
HTB already can deal with 64bit rates, we only have to add two new
attributes so that tc can use them to break the current 32bit ABI
barrier.
TCA_HTB_RATE64 : class rate (in bytes per second)
TCA_HTB_CEIL64 : class ceil (in bytes per second)
This allows us to setup HTB on 40Gbps links, as 32bit limit is
actually ~34Gbps
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an extra u64 rate parameter to psched_ratecfg_precompute()
so that some qdisc can opt-in for 64bit rates in the future,
to overcome the ~34 Gbits limit.
psched_ratecfg_getrate() reports a legacy structure to
tc utility, so if actual rate is above the 32bit rate field,
cap it to the 34Gbit limit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>