The comment for the skb_tx_timestamp() function suggests calling it just
after a buffer is released to the hardware for transmission. However,
for drivers that free the buffer in an ISR, this produces a race between
the time stamp code and the ISR. This commit changes the comment to advise
placing the call just before handing the buffer over to the hardware.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hash:net,iface type makes possible to store network address and
interface name pairs in a set. It's mostly suitable for egress
and ingress filtering. Examples:
# ipset create test hash:net,iface
# ipset add test 192.168.0.0/16,eth0
# ipset add test 192.168.0.0/24,eth1
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
With the change the sets can use any parameter available for the match
and target extensions, like input/output interface. It's required for
the hash:net,iface set type.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The patch "Fix adding ranges to hash types" had got a mistypeing
in the timeout variant of the hash types, which actually made
the patch ineffective. Fixed!
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The range internally is converted to the network(s) equal to the range.
Example:
# ipset new test hash:net
# ipset add test 10.2.0.0-10.2.1.12
# ipset list test
Name: test
Type: hash:net
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 16888
References: 0
Members:
10.2.1.12
10.2.1.0/29
10.2.0.0/24
10.2.1.8/30
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
A set type may have multiple revisions, for example when syntax is
extended. Support continuous revision ranges in set types.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When ranges are added to hash types, the elements may trigger rehashing
the set. However, the last successfully added element was not kept track
so the adding started again with the first element after the rehashing.
Bug reported by Mr Dash Four.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Current listing makes possible to list sets with full content only.
The patch adds support partial listings, i.e. listing just
the existing setnames or listing set headers, without set members.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The support makes possible to specify the timeout value for
the SET target and a flag to reset the timeout for already existing
entries.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When an element to a set with timeout added, one can change the timeout
by "readding" the element with the "-exist" flag. That means the timeout
value is reset to the specified one (or to the default from the set
specification if the "timeout n" option is not used). Example
ipset add foo 1.2.3.4 timeout 10
ipset add foo 1.2.3.4 timeout 600 -exist
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
After restructuring, there is some unused or empty functions
left to be removed.
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Remove variable ctl_key from struct netns_ipvs,
it's a leftover from early netns work.
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
SNMP mibs use two percpu arrays, one used in BH context, another in USER
context. With increasing number of cpus in machines, and fact that ipv6
uses per network device ipstats_mib, this is consuming a lot of memory
if many network devices are registered.
commit be281e554e (ipv6: reduce per device ICMP mib sizes) shrinked
percpu needs for ipv6, but we can reduce memory use a bit more.
With recent percpu infrastructure (irqsafe_cpu_inc() ...), we no longer
need this BH/USER separation since we can update counters in a single
x86 instruction, regardless of the BH/USER context.
Other arches than x86 might need to disable irq in their
irqsafe_cpu_inc() implementation : If this happens to be a problem, we
can make SNMP_ARRAY_SZ arch dependent, but a previous poll
( https://lkml.org/lkml/2011/3/17/174 ) to arch maintainers did not
raise strong opposition.
Only on 32bit arches, we need to disable BH for 64bit counters updates
done from USER context (currently used for IP MIB)
This also reduces vmlinux size :
1) x86_64 build
$ size vmlinux.before vmlinux.after
text data bss dec hex filename
7853650 1293772 1896448 11043870 a8841e vmlinux.before
7850578 1293772 1896448 11040798 a8781e vmlinux.after
2) i386 build
$ size vmlinux.before vmlinux.afterpatch
text data bss dec hex filename
6039335 635076 3670016 10344427 9dd7eb vmlinux.before
6037342 635076 3670016 10342434 9dd022 vmlinux.afterpatch
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Tejun Heo <tj@kernel.org>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org
CC: linux-arch@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no need for the guest to validate the checksum if it have been
validated by host nics. So this patch introduces a new flag -
VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
examing in guest. The backend (tap/macvtap) may set this flag when
met skbs with CHECKSUM_UNNECESSARY to save cpu utilization.
No feature negotiation is needed as old driver just ignore this flag.
Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
when gro is on no difference as it produces skb with partial
checksum. But when gro is disabled, 20% or even higher improvement
could be measured by netperf.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Profiles show false sharing in addr_compare() because refcnt/dtime
changes dirty the first inet_peer cache line, where are lying the keys
used at lookup time. If many cpus are calling inet_getpeer() and
inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
"perf top".
Before patch, my udpflood bench (16 threads) on my 2x4x2 machine :
5784.00 9.7% csum_partial_copy_generic [kernel]
3356.00 5.6% addr_compare [kernel]
2638.00 4.4% fib_table_lookup [kernel]
2625.00 4.4% ip_fragment [kernel]
1934.00 3.2% neigh_lookup [kernel]
1617.00 2.7% udp_sendmsg [kernel]
1608.00 2.7% __ip_route_output_key [kernel]
1480.00 2.5% __ip_append_data [kernel]
1396.00 2.3% kfree [kernel]
1195.00 2.0% kmem_cache_free [kernel]
1157.00 1.9% inet_getpeer [kernel]
1121.00 1.9% neigh_resolve_output [kernel]
1012.00 1.7% dev_queue_xmit [kernel]
# time ./udpflood.sh
real 0m44.511s
user 0m20.020s
sys 11m22.780s
# time ./udpflood.sh
real 0m44.099s
user 0m20.140s
sys 11m15.870s
After patch, no more addr_compare() in profiles :
4171.00 10.7% csum_partial_copy_generic [kernel]
1787.00 4.6% fib_table_lookup [kernel]
1756.00 4.5% ip_fragment [kernel]
1234.00 3.2% udp_sendmsg [kernel]
1191.00 3.0% neigh_lookup [kernel]
1118.00 2.9% __ip_append_data [kernel]
1022.00 2.6% kfree [kernel]
993.00 2.5% __ip_route_output_key [kernel]
841.00 2.2% neigh_resolve_output [kernel]
816.00 2.1% kmem_cache_free [kernel]
658.00 1.7% ia32_sysenter_target [kernel]
632.00 1.6% kmem_cache_alloc_node [kernel]
# time ./udpflood.sh
real 0m41.587s
user 0m19.190s
sys 10m36.370s
# time ./udpflood.sh
real 0m41.486s
user 0m19.290s
sys 10m33.650s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andi Kleen and Tim Chen reported huge contention on inetpeer
unused_peers.lock, on memcached workload on a 40 core machine, with
disabled route cache.
It appears we constantly flip peers refcnt between 0 and 1 values, and
we must insert/remove peers from unused_peers.list, holding a contended
spinlock.
Remove this list completely and perform a garbage collection on-the-fly,
at lookup time, using the expired nodes we met during the tree
traversal.
This removes a lot of code, makes locking more standard, and obsoletes
two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
removes two pointers in inet_peer structure.
There is still a false sharing effect because refcnt is in first cache
line of object [were the links and keys used by lookups are located], we
might move it at the end of inet_peer structure to let this first cache
line mostly read by cpus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.
It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.
The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to remove all support for displaying an ntuple as
strings via ETHTOOL_GRXNTUPLE. The reason for this change is due to the
fact that multiple issues have been found including:
- Multiple buffer overruns for strings being displayed.
- Incorrect filters displayed, cleared filters with ring of -2 are displayed
- Setting get_rx_ntuple displays no rules if defined.
- Endianess wrong on displayed values.
- Hard limit of 1024 filters makes display functionality extremely limited
The only driver that had supported this interface was ixgbe. Since it no
longer uses the interface and due to the issues mentioned above I am
submitting this patch to remove it.
v2:
Updated based on comments from Ben Hutchings
- Left ETH_SS_NTUPLE_FILTERS in code but commented on it being deprecated
- Removed ethtool_rx_ntuple_list and ethtool_rx_ntuple_flow_spec_container
- Left ETHTOOL_GRXNTUPLE but commented it as deprecated
Also cleaned up set_rx_ntuple since there is no flow spec container to
maintain we can drop all the code for the alloc and free of it and just
return ops->set_rx_ntuple().
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>