Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Vijay reported that a loop as simple as ...
while true; do
tc qdisc add dev foo root handle 1: prio
tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
tc qdisc del dev foo root
rmmod cls_u32
done
... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab0 ("netlink: Re-add
locking to netlink_lookup() and seq walker").
The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3e ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):
When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.
After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.
Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.
Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.
By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.
One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.
synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.
Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.
Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.
Fixes: 25d8c0d55f ("net: rcu-ify tcf_proto")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix compile error in net/sched/cls_flower.c
net/sched/cls_flower.c: In function ‘fl_set_key’:
net/sched/cls_flower.c:240:3: error: implicit declaration of
function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);
Introduced in 77b9900ef5
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.
Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a flow-based filter. So far, the very essential
packet fields are supported.
This patch is only the first step. There is a lot of potential performance
improvements possible to implement. Also a lot of features are missing
now. They will be addressed in follow-up patches.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.
skb_act_clone is now identical to skb_clone, so just use that.
Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
protocol ip u32 match u32 0 0 police rate 10Kbit burst \
64000 mtu 1500 action reclassify
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
parameters configured, then packets for the default VQ are not subjected
to RED and are only dropped if the queue is larger than the net_device's
tx_queue_len. This behavior is useful for WRED mode, since these packets
will still influence the calculated average queue length and (therefore)
the drop probability for all of the other VQs. However, for some drivers
tx_queue_len is zero. In other cases the user may wish to make the limit
the same for all VQs (including the default VQ with no drop parameters).
This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
in bytes, during qdisc setup. (This limit is in bytes to be consistent
with the drop parameters.) The default limit is the same as for a bfifo
queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are
configured with a smaller limit than the GRED queue limit, that VQ will
still observe the smaller limit instead.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only left enqueue_root() user is netem, and it looks not necessary :
qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).
It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a601a ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.
The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.
Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For DCTCP or similar ECN based deployments on fabrics with shallow
buffers, hosts are responsible for a good part of the buffering.
This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
so that DCTCP can have feedback from queuing in the host.
A DCTCP enabled egress port simply have a queue occupancy threshold
above which ECT packets get CE mark.
In codel language this translates to a sojourn time, so that one doesn't
have to worry about bytes or bandwidth but delays.
This makes the host an active participant in the health of the whole
network.
This also helps experimenting DCTCP in a setup without DCTCP compliant
fabric.
On following example, ce_threshold is set to 1ms, and we can see from
'ldelay xxx us' that TCP is not trying to go around the 5ms codel
target.
Queue has more capacity to absorb inelastic bursts (say from UDP
traffic), as queues are maintained to an optimal level.
lpaa23:~# ./tc -s -d qd sh dev eth1
qdisc mq 1: dev eth1 root
Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
backlog 3108242b 364p requeues 42961
qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
count 0 lastcount 0 ldelay 1.0ms drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
count 0 lastcount 0 ldelay 694us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
count 0 lastcount 0 ldelay 889us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcf_destroy() returns true, tp could be already destroyed,
we should not use tp->next after that.
For long term, we probably should move tp list to list_head.
Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call make_flow_keys_digest to get a digest from flow keys and
use that to pass skbuff cb and for comparing flows.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
useless. In following example, not a single packet was ever dropped,
while average delay in codel queue is ~100 ms !
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
backlog 13626b 3p requeues 0
count 0 lastcount 0 ldelay 96.9ms drop_next 0us
maxpacket 9084 ecn_mark 0 drop_overlimit 0
This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.
codel intent was to use MTU of the device.
After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
backlog 6056b 3p requeues 0
count 1 lastcount 1 ldelay 36.3ms drop_next 0us
maxpacket 10598 ecn_mark 0 drop_overlimit 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used.
pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.
And the latter isn't tested anywhere. So lets remove both
TC_MUNGED and TC_OK2MUNGE.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action is meant to be passive, i.e. we should not alter
skb->nfct: If nfct is present just leave it alone.
Compile tested only.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>