With netem reordering, a gap of N is supposed to reorder every Nth packet with
given reorder probability. However, the code currently skips N packets and
reorders every (N+1)th packet.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an optional Random Early Detection on each SFQ flow queue.
Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.
1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.
2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]
Example of use :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
limit 3000 headdrop flows 512 divisor 16384 \
redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
ewma 6 min 8000b max 60000b probability 0.2 ecn
prob_mark 0 prob_mark_head 4876 prob_drop 6131
forced_mark 0 forced_mark_head 0 forced_drop 0
Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)
If same test is run, without RED, we can check backlog is much bigger.
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Tested-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the red_parms structure into two components.
One holding the RED 'constant' parameters, and one containing the
variables.
This permits a size reduction of GRED qdisc, and is a preliminary step
to add an optional RED unit to SFQ.
SFQRED will have a single red_parms structure shared by all flows, and a
private red_vars per flow.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ as implemented in Linux is very limited, with at most 127 flows
and limit of 127 packets. [ So if 127 flows are active, we have one
packet per flow ]
This patch brings to SFQ following features to cope with modern needs.
- Ability to specify a smaller per flow limit of inflight packets.
(default value being at 127 packets)
- Ability to have up to 65408 active flows (instead of 127)
- Ability to have head drops instead of tail drops
(to drop old packets from a flow)
Example of use : No more than 20 packets per flow, max 8000 flows, max
20000 packets in SFQ qdisc, hash table of 65536 slots.
tc qdisc add ... sfq \
flows 8000 \
depth 20 \
headdrop \
limit 20000 \
divisor 65536
Ram usage :
2 bytes per hash table entry (instead of previous 1 byte/entry)
32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
better cache hit ratio.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not now, but it looks you are correct. q->qdisc is NULL until another
additional qdisc is attached (beside tfifo). See 50612537e9.
The following patch should work.
From: Hagen Paul Pfeifer <hagen@jauu.net>
netem: catch NULL pointer by updating the real qdisc statistic
Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ q->perturbation is used in sfq_hash() as an input to Jenkins hash.
We currently randomize this 32bit value only if a perturbation timer is
setup.
Its much better to always initialize it to defeat attackers, or else
they can predict very well what kind of packets they have to forge to
hit a particular flow.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 817fb15dfd (net_sched: sfq: allow divisor to be a
parameter), we can leave perturbation timer armed if a memory allocation
error aborts sfq_init().
Memory containing active struct timer_list is freed and kernel can
crash.
Call sfq_destroy() from sfq_init() to properly dismantle qdisc.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trying to allocate ~32768 qdiscs using autohandle mechanism, we can
fill the space managed by kernel (handles in [8000-FFFF]:0000 range)
But O(N^2) qdisc_alloc_handle() loops 0x10000 times instead of 0x8000
time tc add qdisc add dev eth0 parent 10:7fff pfifo limit 10
RTNETLINK answers: Cannot allocate memory
real 1m54.826s
user 0m0.000s
sys 0m0.004s
INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies)
Half number of loops, and add a cond_resched() call.
We hold rtnl at this point.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can underestimate q->wsum in case of "tc class replace ... qfq"
and/or qdisc_create_dflt() error.
wsum is not really used in fast path, only at qfq qdisc/class setup,
to catch user error.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
grp->slot_shift is between 22 and 41, so using 32bit wide variables is
probably a typo.
This could explain QFQ hangs Dave reported to me, after 2^23 packets ?
(23 = 64 - 41)
Reported-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ enqueue algo puts a new flow _behind_ all pre-existing flows in the
circular list. In fact this is probably an old SFQ implementation bug.
100 Mbits = ~8333 full frames per second, or ~8 frames per ms.
With 50 flows, it means your "new flow" will have to wait 50 packets
being sent before its own packet. Thats the ~6ms.
We certainly can change SFQ to give a priority advantage to new flows,
so that next dequeued packet is taken from a new flow, not an old one.
Reported-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 10f6dfcfde (Revert "sch_netem: Remove classful functionality")
reintroduced classful functionality to netem, but broke basic netem
behavior :
netem uses an t(ime)fifo queue, and store timestamps in skb->cb[]
If qdisc is changed, time constraints are not respected and other qdisc
can destroy skb->cb[] and block netem at dequeue time.
Fix this by always using internal tfifo, and optionally attach a child
qdisc to netem (or a tree of qdiscs)
Example of use :
DEV=eth3
tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms
tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \
burst 20480 limit 20480 mtu 1514 rate 32000bps
qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms 10.0ms
Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0)
rate 18416bit 3pps backlog 0b 0p requeues 0
qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0)
backlog 0b 5p requeues 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6373a9a286 (netem: use vmalloc for distribution table) added a
regression, since vfree() is called while holding a spinlock and BH
being disabled.
Fix this by doing the pointers swap in critical section, and freeing
after spinlock release.
Also add __GFP_NOWARN to the kmalloc() try, since we fallback to
vmalloc().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
The new netem loss model is configured with nested netlink messages.
This code is being overly strict about sizes, and is easily confused
by padding (or possible future expansion). Also message
for gemodel is incorrect.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add backlog (byte count) information in hfsc classes and qdisc, so that
"tc -s" can report it to user, instead of 0 values :
qdisc hfsc 1: root refcnt 6 default 20
Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
rate 1492Kbit 126pps backlog 103226b 74p requeues 0
...
class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
backlog 81822b 56p requeues 0
period 23 work 49451576 bytes rtwork 13277552 bytes level 0
...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: John A. Sullivan III <jsullivan@opensourcedevel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace may not provide TCA_OPTIONS, in fact tc currently does
so not do so if no arguments are specified on the command line.
Return EINVAL instead of panicing.
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A known Out Of Order (OOO) problem hurts SFQ when timer changes
perturbation value, since all new packets delivered to SFQ enqueue might
end on different slots than previous in-flight packets.
With round robin delivery, we can thus deliver packets in a different
order.
Since SFQ is limited to small amount of in-flight packets, we can rehash
packets so that this OOO problem is fixed.
This rehashing is performed only if internal flow classifier is in use.
We now store in skb->cb[] the "struct flow_keys" so that we dont call
skb_flow_dissect() again while rehashing.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In control path, its better to use GFP_KERNEL allocations where
possible.
Before taking qdisc spinlock, we preallocate memory just in case we'll
need it in gred_change_vq()
This is a followup to commit 3f1e6d3fd3 (sch_gred: should not use
GFP_KERNEL while holding a spinlock)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its better to use a predefined size for this small automatic variable.
Removes a sparse error as well :
net/sched/cls_flow.c:288:13: error: bad constant expression
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.
packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.
cell_size and cell_overhead can be used to simulate link layer schemes,
based on cells, like some TDMA schemes. Another application area are MAC
schemes using a link layer fragmentation with a (small) header each.
Cell size is the maximum amount of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell-overhead
(e.g. 5 byte header per fragment).
Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
cell overhead 5 byte):
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>