Commit Graph

6950 Commits

Author SHA1 Message Date
WANG Cong 89819dc01f net_sched: convert tcf_hashinfo to hlist and use spinlock
So that we don't need to play with singly linked list,
and since the code is not on hot path, we can use spinlock
instead of rwlock.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong 369ba56787 net_sched: init struct tcf_hashinfo at register time
It looks weird to store the lock out of the struct but
still points to a static variable. Just move them into the struct.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong 5da57f422d net_sched: cls: refactor out struct tcf_ext_map
These information can be saved in tcf_exts, and this will
simplify the code.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong 33be627159 net_sched: act: use standard struct list_head
Currently actions are chained by a singly linked list,
therefore it is a bit hard to add and remove a specific
entry. Convert it to struct list_head so that in the
latter patch we can remove an action without finding
its head.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
WANG Cong d84231d3a2 net_sched: remove get_stats from tc_action_ops
It is not used.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 12:52:07 -05:00
Tom Herbert 7539fadcb8 net: Add utility functions to clear rxhash
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb.  This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash.  This patch adds an explict function to clear
all the rxhash related information in the skb properly.

skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.

Fixed up places where 'skb->rxhash = 0' was being called.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Eric Dumazet d4589926d7 tcp: refine TSO splits
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We can avoid this split by including the Nagle tests at the right place.

Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.

tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>

Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized
         8,449,968 context-switches          #    0.041 M/sec
         1,935,997 CPU-migrations            #    0.009 M/sec
           160,541 page-faults               #    0.780 K/sec
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized
         5,635,458 context-switches          #    0.052 M/sec
         1,374,707 CPU-migrations            #    0.013 M/sec
           160,920 page-faults               #    0.001 M/sec
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.

Many thanks to Neal for review and ideas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:15:25 -05:00
David S. Miller 6ea09d8a09 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
Please pull this batch of updates for the 3.14 stream...

For the Bluetooth bits, Gustavo says:

"This is the first batch of patches intended for 3.14. There is
nothing big here.  Most of the code are refactors, clean up, small
fixes, plus some new device id support."

And...

"More patches to 3.14. Here we have the support for Low Energy
Connection Oriented Channels (LE CoC). Basically, as the name says,
this adds supports for connection oriented channels in the same way
we already have them for BR/EDR connections so profiles/protocols
that work on top of BR/EDR can now work on LE plus a plenty of new
possibilities for LE."

For the ath10k bits, Kalle says:

"Janusz and Marek implemented DFS support to ath10k, but the code is
not enabled yet due to missing cfg80211/mac80211 patches (it will be
enabled in the next pull request). Michal did some device reset fixes
and made it possible for ath10k to share an interrupt with another
device. And lots of smaller fixes from different people."

For the iwlwifi bits, Emmanuel says:

"I have here a big rework of the rate control by Eyal. This is obviously
the biggest part of this batch.
I also have enhancement of protection flags by Avri and a few bits for
WoWLAN by Eliad and Luca. Johannes cleans up the debugfs plus a few
fixes. I provided a few things for Bluetooth coexistence.
Besides this we have an implementation for low priority scan."

Along with all that, there are big batches of updates to mwifiex and
ath9k, Jeff Kirsher's FSF address fix patches, and a handful of other
bits here and there.

Please let me know if there are problems!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:08:17 -05:00
wangweidong be78cfcb25 sctp: Reorder 'struc association' members to reduce its size
Members of 'struct association' are not in appropriate order to
reuse compiler added padding on 64bit architectures. In this patch
we reorder those struct members and help reduce the size of the
structure from 2776 bytes to 2720 bytes on 64 bit architectures.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 14:32:43 -05:00
Janusz Dziedzic 204e35a91c nl80211: add VHT support for set_bitrate_mask
Add VHT MCS/NSS set support for nl80211_set_tx_bitrate_mask().
This should be used mainly for test purpose, to check
different MCS/NSS VHT combinations.

Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-16 16:05:17 +01:00
Fan Du 776e9dd90c xfrm: export verify_userspi_info for pkfey and netlink interface
In order to check against valid IPcomp spi range, export verify_userspi_info
for both pfkey and netlink interface.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-12-16 12:54:02 +01:00
Felix Fietkau 70dabeb74e mac80211: let the driver reserve extra tailroom in beacons
Can be used to add extra IEs (such as P2P NoA) without having to
reallocate the buffer.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-16 12:14:07 +01:00
Johannes Berg 6a9d1b91f3 mac80211: add pre-RCU-sync sta removal driver operation
Currently, mac80211 allows drivers to keep RCU-protected station
references that are cleared when the station is removed from the
driver and consequently needs to synchronize twice, once before
removing the station from the driver (so it can guarantee that
the station is no longer used in TX towards the driver) and once
after the station is removed from the driver.

Add a new pre-RCU-synchronisation station removal operation to
the API to allow drivers to clear/invalidate their RCU-protected
station pointers before the RCU synchronisation.

This will allow removing the second synchronisation by changing
the driver API so that the driver may no longer assume a valid
RCU-protected pointer after sta_remove/sta_state returns.

The alternative to this would be to synchronize_rcu() in all the
drivers that currently rely on this behaviour (only iwlmvm) but
that would defeat the purpose.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-16 11:29:44 +01:00
Johannes Berg c4de673b77 Merge remote-tracking branch 'wireless-next/master' into mac80211-next 2013-12-16 11:23:45 +01:00
John W. Linville f647a52e15 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2013-12-13 13:14:28 -05:00
Jesper Dangaard Brouer 8cf4d6a224 net: reorder struct netns_ct for better cache-line usage
Reorder struct netns_ct so that atomic_t "count" changes don't
slowdown users of read mostly fields.

This is based on Eric Dumazet's proposed patch:
 "netfilter: conntrack: remove the central spinlock"
 http://thread.gmane.org/gmane.linux.network/268758/focus=47306

The tricky part of cache-aligning this structure, that it is getting
inlined in struct net (include/net/net_namespace.h), thus changes to
other netns_xxx structures affects our alignment.

Eric's original patch contained an ambiguity on 32-bit regarding
alignment in struct net.  This patch also takes 32-bit into account,
and in case of changed (struct net) alignment sysctl_xxx entries have
been ordered according to how often they are accessed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-13 12:55:55 +01:00
Jiri Benc 7e98056964 ipv6: router reachability probing
RFC 4191 states in 3.5:

   When a host avoids using any non-reachable router X and instead sends
   a data packet to another router Y, and the host would have used
   router X if router X were reachable, then the host SHOULD probe each
   such router X's reachability by sending a single Neighbor
   Solicitation to that router's address.  A host MUST NOT probe a
   router's reachability in the absence of useful traffic that the host
   would have sent to the router if it were reachable.  In any case,
   these probes MUST be rate-limited to no more than one per minute per
   router.

Currently, when the neighbour corresponding to a router falls into
NUD_FAILED, it's never considered again. Introduce a new rt6_nud_state
value, RT6_NUD_FAIL_PROBE, which suggests the route should not be used but
should be probed with a single NS. The probe is ratelimited by the existing
code. To better distinguish meanings of the failure values, rename
RT6_NUD_FAIL_SOFT to RT6_NUD_FAIL_DO_RR.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 16:02:58 -05:00
Jukka Rissanen 18722c2470 Bluetooth: Enable 6LoWPAN support for BT LE devices
This is initial version of
http://tools.ietf.org/html/draft-ietf-6lo-btle-00

By default the 6LoWPAN support is not activated and user
needs to tweak /sys/kernel/debug/bluetooth/hci0/6lowpan
file.

The kernel needs IPv6 support before 6LoWPAN is usable.

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2013-12-11 12:57:55 -08:00
Neil Horman 9f70f46bd4 sctp: properly latch and use autoclose value from sock to association
Currently, sctp associations latch a sockets autoclose value to an association
at association init time, subject to capping constraints from the max_autoclose
sysctl value.  This leads to an odd situation where an application may set a
socket level autoclose timeout, but sliently sctp will limit the autoclose
timeout to something less than that.

Fix this by modifying the autoclose setsockopt function to check the limit, cap
it and warn the user via syslog that the timeout is capped.  This will allow
getsockopt to return valid autoclose timeout values that reflect what subsequent
associations actually use.

While were at it, also elimintate the assoc->autoclose variable, it duplicates
whats in the timeout array, which leads to multiple sources for the same
information, that may differ (as the former isn't subject to any capping).  This
gives us the timeout information in a canonical place and saves some space in
the association structure as well.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
CC: Wang Weidong <wangweidong1@huawei.com>
CC: David Miller <davem@davemloft.net>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:41:26 -05:00
Jiri Pirko 9a32b86043 dn_dev: add support for IFA_FLAGS nl attribute
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 21:50:00 -05:00
Paul Moore 10ae76faa9 cipso: cleanup cipso_v4_translate() when !CONFIG_NETLABEL
Don't needlessly recompute 'opt[opt_iter + 1]' as we already have it
stored in 'tag_len'.

Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 17:56:54 -05:00
Florent Fourcot 3308de2b84 ipv6: add ip6_flowlabel helper
And use it if possible.

Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 21:03:49 -05:00
Florent Fourcot 37cfee909c ipv6: move IPV6_TCLASS_MASK definition in ipv6.h
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 21:03:49 -05:00
Jiri Pirko bba24896f0 neigh: ipv6: respect default values set before an address is assigned to device
Make the behaviour similar to ipv4. This will allow user to set sysctl
default neigh param values and these values will be respected even by
devices registered before (that ones what do not have address set yet).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
Jiri Pirko 1d4c8c2984 neigh: restore old behaviour of default parms values
Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.

Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).

This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.

Introduced by:
commit 8030f54499
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 22 01:53:47 2007 +0900

    [IPV4] devinet: Register inetdev earlier.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00