Commit Graph

128 Commits

Author SHA1 Message Date
Florian Westphal
e578d9c025 net: sched: use counter to break reclassify loops
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.

skb_act_clone is now identical to skb_clone, so just use that.

Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
 protocol ip u32 match u32 0 0 police rate 10Kbit burst \
 64000 mtu 1500 action reclassify

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:08:14 -04:00
Eric Dumazet
b396cca6fa net: sched: deprecate enqueue_root()
Only left enqueue_root() user is netem, and it looks not necessary :

qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 14:17:32 -04:00
Florian Westphal
4749c3ef85 net: sched: remove TC_MUNGED bits
Not used.

pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.

And the latter isn't tested anywhere. So lets remove both
TC_MUNGED and TC_OK2MUNGE.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-02 22:25:17 -04:00
Cong Wang
1e052be69d net_sched: destroy proto tp when all filters are gone
Kernel automatically creates a tp for each
(kind, protocol, priority) tuple, which has handle 0,
when we add a new filter, but it still is left there
after we remove our own, unless we don't specify the
handle (literally means all the filters under
the tuple). For example this one is left:

  # tc filter show dev eth0
  filter parent 8001: protocol arp pref 49152 basic

The user-space is hard to clean up these for kernel
because filters like u32 are organized in a complex way.
So kernel is responsible to remove it after all filters
are gone.  Each type of filter has its own way to
store the filters, so each type has to provide its
way to check if all filters are gone.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 15:35:55 -04:00
Eric Dumazet
0d32ef8cef net: sched: fix panic in rate estimators
Doing the following commands on a non idle network device
panics the box instantly, because cpu_bstats gets overwritten
by stats.

tc qdisc add dev eth0 root <your_favorite_qdisc>
... some traffic (one packet is enough) ...
tc qdisc replace dev eth0 root est 1sec 4sec <your_favorite_qdisc>

[  325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
[  325.362609] IP: [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
[  325.369158] PGD 1fa7067 PUD 0
[  325.372254] Oops: 0000 [#1] SMP
[  325.375514] Modules linked in: ...
[  325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
[  325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
[  325.419518] RIP: 0010:[<ffffffff81541c9e>]  [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
[  325.428506] RSP: 0018:ffff881ff2fa7928  EFLAGS: 00010286
[  325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
[  325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
[  325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
[  325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
[  325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
[  325.469536] FS:  00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
[  325.477630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
[  325.490510] Stack:
[  325.492524]  ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
[  325.499990]  ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
[  325.507460]  00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
[  325.514956] Call Trace:
[  325.517412]  [<ffffffff815424ee>] gen_new_estimator+0x8e/0x230
[  325.523250]  [<ffffffff815427aa>] gen_replace_estimator+0x4a/0x60
[  325.529349]  [<ffffffff815718ab>] tc_modify_qdisc+0x52b/0x590
[  325.535117]  [<ffffffff8155edd0>] rtnetlink_rcv_msg+0xa0/0x240
[  325.540963]  [<ffffffff8155ed30>] ? __rtnl_unlock+0x20/0x20
[  325.546532]  [<ffffffff8157f811>] netlink_rcv_skb+0xb1/0xc0
[  325.552145]  [<ffffffff8155b355>] rtnetlink_rcv+0x25/0x40
[  325.557558]  [<ffffffff8157f0d8>] netlink_unicast+0x168/0x220
[  325.563317]  [<ffffffff8157f47c>] netlink_sendmsg+0x2ec/0x3e0

Lets play safe and not use an union : percpu 'pointers' are mostly read
anyway, and we have typically few qdiscs per host.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 22e0f8b932 ("net: sched: make bstats per cpu and estimator RCU safe")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-31 17:49:37 -08:00
Jiri Pirko
57d743a3de net: sched: cls: remove unused op put from tcf_proto_ops
It is never called and implementations are void. So just remove it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 14:49:02 -05:00
Jesper Dangaard Brouer
5772e9a346 qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
 (1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet.  The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
 (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ.  This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()).  In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit().  In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits.  In-effect bulking will only happen for BQL enabled drivers.

Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.

For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets.  They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.

Jointed work with Hannes, Daniel and Florian.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 12:37:06 -07:00
John Fastabend
b0ab6f9275 net: sched: enable per cpu qstats
After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
John Fastabend
25331d6ce4 net: sched: implement qstat helper routines
This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
John Fastabend
22e0f8b932 net: sched: make bstats per cpu and estimator RCU safe
In order to run qdisc's without locking statistics and estimators
need to be handled correctly.

To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
David S. Miller
1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
Eric Dumazet
2571178626 net: sched: shrink struct qdisc_skb_cb to 28 bytes
We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
or increasing skb->cb[] size.

Commit e0f31d8498 ("flow_keys: Record IP layer protocol in
skb_flow_dissect()") broke IPoIB.

Only current offender is sch_choke, and this one do not need an
absolutely precise flow key.

If we store 17 bytes of flow key, its more than enough. (Its the actual
size of flow_keys if it was a packed structure, but we might add new
fields at the end of it later)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e0f31d8498 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 14:21:47 -04:00
John Fastabend
25d8c0d55f net: rcu-ify tcf_proto
rcu'ify tcf_proto this allows calling tc_classify() without holding
any locks. Updaters are protected by RTNL.

This patch prepares the core net_sched infrastracture for running
the classifier/action chains without holding the qdisc lock however
it does nothing to ensure cls_xxx and act_xxx types also work without
locking. Additional patches are required to address the fall out.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 12:30:25 -04:00
John Fastabend
46e5da40ae net: qdisc: use rcu prefix and silence sparse warnings
Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 12:30:25 -04:00
Govindarajulu Varadarajan
e0f31d8498 flow_keys: Record IP layer protocol in skb_flow_dissect()
skb_flow_dissect() dissects only transport header type in ip_proto. It dose not
give any information about IPv4 or IPv6.

This patch adds new member, n_proto, to struct flow_keys. Which records the
IP layer type. i.e IPv4 or IPv6.

This can be used in netdev->ndo_rx_flow_steer driver function to dissect flow.

Adding new member to flow_keys increases the struct size by around 4 bytes.
This causes BUILD_BUG_ON(sizeof(qcb->data) < sz); to fail in
qdisc_cb_private_validate()

So increase data size by 4

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-23 14:32:19 -07:00
Cong Wang
2f7ef2f879 sched, cls: check if we could overwrite actions when changing a filter
When actions are attached to a filter, they are a part of the filter
itself, so when changing a filter we should allow to overwrite the actions
inside as well.

In my specific case, when I tried to _append_ a new action to an existing
filter which already has an action, I got EEXIST since kernel refused
to overwrite the existing one in kernel.

This patch checks if we are changing the filter checking NLM_F_CREATE flag
(Sigh, filters don't use NLM_F_REPLACE...) and then passes the boolean down
to actions. This fixes the problem above.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-27 23:42:39 -04:00
WANG Cong
832d1d5bfa net_sched: add struct net pointer to tcf_proto_ops->dump
It will be needed by the next patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:50:14 -08:00
WANG Cong
3627287463 net_sched: convert tcf_proto_ops to use struct list_head
We don't need to maintain our own singly linked list code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:08 -05:00
Eric Dumazet
3e1e3aae1f net_sched: add u64 rate to psched_ratecfg_precompute()
Add an extra u64 rate parameter to psched_ratecfg_precompute()
so that some qdisc can opt-in for 64bit rates in the future,
to overcome the ~34 Gbits limit.

psched_ratecfg_getrate() reports a legacy structure to
tc utility, so if actual rate is above the 32bit rate field,
cap it to the 34Gbit limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-20 14:41:02 -04:00
stephen hemminger
d2a7f269f9 qdisc: make args to qdisc_create_default const
Fixes warnings introduced by the qdisc default patch.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 18:09:45 -04:00
stephen hemminger
6da7c8fcbc qdisc: allow setting default queuing discipline
By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.

This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.

This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fast

Also removes leftover random # in sysctl net core.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31 00:32:32 -04:00
David S. Miller
2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Jesper Dangaard Brouer
8a8e3d84b1 net_sched: restore "linklayer atm" handling
commit 56b765b79 ("htb: improved accuracy at high rates")
broke the "linklayer atm" handling.

 tc class add ... htb rate X ceil Y linklayer atm

The linklayer setting is implemented by modifying the rate table
which is send to the kernel.  No direct parameter were
transferred to the kernel indicating the linklayer setting.

The commit 56b765b79 ("htb: improved accuracy at high rates")
removed the use of the rate table system.

To keep compatible with older iproute2 utils, this patch detects
the linklayer by parsing the rate table.  It also supports future
versions of iproute2 to send this linklayer parameter to the
kernel directly. This is done by using the __reserved field in
struct tc_ratespec, to convey the choosen linklayer option, but
only using the lower 4 bits of this field.

Linklayer detection is limited to speeds below 100Mbit/s, because
at high rates the rtab is gets too inaccurate, so bad that
several fields contain the same values, this resembling the ATM
detect.  Fields even start to contain "0" time to send, e.g. at
1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
been more broken than we first realized.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:43:08 -07:00
Joe Perches
5c15257f93 net: Remove extern from include/net/ scheduling prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:24:22 -07:00
Eric Dumazet
130d3d68b5 net_sched: psched_ratecfg_precompute() improvements
Before allowing 64bits bytes rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 22:39:47 -07:00