Commit Graph

506251 Commits

Author SHA1 Message Date
Fan Du dcd8fb8533 ipv4: Raise tcp PMTU probe mss base size
Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:41 -05:00
Eric W. Biederman aaa4e70404 DECnet: Only use neigh_ops for adding the link layer header
Other users users of the neighbour table use neigh->output as the method
to decided when and which link-layer header to place on a packet.
DECnet has been using neigh->output to decide which DECnet headers to
place on a packet depending which neighbour the packet is destined for.

The DECnet usage isn't totally wrong but it can run into problems if the
neighbour output function is run for a second time as the teql driver
and the bridge netfilter code can do.

Therefore to avoid pathologic problems later down the line and make the
neighbour code easier to understand by refactoring the decnet output
code to only use a neighbour method to add a link layer header to a
packet.

This is done by moving the neigbhour operations lookup from
dn_to_neigh_output to dn_neigh_output_packet.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:54:22 -05:00
Mahesh Bandewar 616f45416c bonding: implement bond_poll_controller()
This patches implements the poll_controller support for all
bonding driver. If the slaves have poll_controller net_op defined,
this implementation calls them. This is mode agnostic implementation
and iterates through all slaves (based on mode) and calls respective
handler.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:40:42 -05:00
Scott Feldman 8ea696384a rocker: fix some sparse warnings
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 12:43:54 -05:00
Scott Feldman e1315db17d switchdev: fix CONFIG_IP_MULTIPLE_TABLES compile issue
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 12:43:54 -05:00
David S. Miller 23375a0fd5 ipv4: Fix unused variable warnings in fib_table_flush_external.
net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’:
net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable]
  int found = 0;
      ^
net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable]
  unsigned char slen;
                ^

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:38:35 -05:00
David S. Miller fabe7bed11 Merge branch 'l3_hw_offload'
Scott Feldman says:

====================
switchdev: add IPv4 routing offload

v4:

  - Add NETIF_F_NETNS_LOCAL to rocker port feature list to keep rocker
    ports in the default netns.  Rocker hardware can't be partitioned
    to support multiple namespaces, currently.  It would be interesting
    to add netns support to rocker device by basically adding another
    match field to each table to match on some unique netns ID, with
    a port knowing it's netns ID.  Future work TDB.
  - Up-level the RTNH_F_EXTERNAL marking of routes installed to offload
    device from driver to switchdev common code.  Now driver can't skip
    routes.  Either it can install the route or it cannot.  Yes or No.
    If no on any route, all offloading is aborted by removing routes
    from offload device and setting ipv4.fib_offload_disabled so no more
    routes can be offloaded.  This is harsh, but it's our starting point.
    We can refine the policies in follow-up work.
  - Add new net.ipv4.fib_offload_disabled bool that is set if anything
    goes wrong with route offloading.  We can refine this later to make
    the setting per-device or per-device-port-netdev, but let's start
    here simple and refine in follow-up work.
  - Rebase against Alex's latest FIB changes.  I think I did everything
    correctly, and didn't run into any issues with testing, but I'd like
    Alex to look over the changes and maybe follow-up with any cleanups.

v3:

Changes based on v2 review comments:

  - Move check for custom rules up earlier in patch set, to keep git bisect
    safe.
  - Simplify the route add/modify failure handling to simple try until
    failure, and then on failure, undo everything.  The switchdev driver
    will return err when route can normally be installed to device, but
    the install fails for one reason or another (no space left on device,
    etc).  If a failure happens, uninstall all routes from the device,
    punting forwarding for all routes back to the kernel.
  - Scan route's full nexthop list, ensuring all nexthop devs belong
    to the same switchdev device, otherwise don't try to install route
    to device.

v2:

Changes based on v1 review comments and discussions at netconf:

  - Allow route modification, but use same ndo op used for adding route.
    Driver/device is expected to modify route in-place, if it can, to avoid
    interruption of service.
  - Add new RTNH_F_EXTERNAL flag to mark FIB entries offloaded externally.
  - Don't offload routes if using custom IP rules.  If routes are already
    offloaded, and custom IP rules are turned on, flush routes from offload
    device.  (Offloaded routes are marked with RTNH_F_EXTERNAL).
  - Use kernel's neigh resolution code to resolve route's nexthops' neigh
    MAC addrs.  (Thanks davem, works great!).
  - Use fib->fib_priority in rocker driver to give priorities to routes in
    OF-DPA unicast route table.

v1:

This patch set adds L3 routing offload support for IPv4 routes.  The idea is to
mirror routes installed in the kernel's FIB down to a hardware switch device to
offload the data forwarding path for L3.  Only the data forwarding path is
intercepted.  Control and management of the kernel's FIB remains with the
kernel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:26:16 -05:00
Scott Feldman c1beeef7a3 rocker: implement IPv4 fib offloading
The driver implements ndo_switch_fib_ipv4_add/del ops to add/del/mod IPv4
routes to/from switchdev device.  Once a route is added to the device, and the
route's nexthops are resolved to neighbor MAC address, the device will forward
matching pkts rather than the kernel.  This offloads the L3 forwarding path
from the kernel to the device.  Note that control and management planes are
still mananged by Linux; only the data plane is offloaded.  Standard routing
control protocols such as OSPF and BGP run on Linux and manage the kernel's FIB
via standard rtm netlink msgs...nothing changes here.

A new hash table is added to rocker to track neighbors.  The driver listens for
neighbor updates events using netevent notifier NETEVENT_NEIGH_UPDATE.  Any ARP
table updates for ports on this device are recorded in this table.  Routes
installed to the device with nexthops that reference neighbors in this table
are "qualified".  In the case of a route with nexthops not resolved in the
table, the kernel is asked to resolve the nexthop.

The driver uses fib_info->fib_priority for the priority field in rocker's
unicast routing table.

The device can only forward to pkts matching route dst to resolved nexthops.
Currently, the device only supports single-path routes (i.e. routes with one
nexthop).  Equal Cost Multipath (ECMP) route support will be added in followup
patches.

This patch is driver support for unicast IPv4 routing only.  Followup patches
will add driver and infrastructure for IPv6 routing and multicast routing.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 8e05fd7166 fib: hook IPv4 fib for hardware offload
Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB.  The switchdev driver may or
may not install the route to the offload device.  In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, route forwarding falls back to the kernel, and no more routes are
offloading.

We can refine this logic later.  For now, use the simplist model of offloading
routes up to the point of failure, and then on failure, undo everything and
mark IPv4 offloading disabled.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 448b128a14 ipv4: add net bool fib_offload_disabled
If something goes wrong with IPv4 FIB offload, mark entire net offload
disabled.  This is brute force policy to basically shut down IPv4 FIB offload
permanently if there is a problem offloading any route to an external device.
We can refine the policy in the future, to handle failures on a per-device or
per-route basis, but for now, this policy is per-net.

What we're trying to avoid is an inconsistent split between the kernel's FIB
and the offload device's FIB.  We don't want the device to fwd a pkt
inconsitent with what the kernel would do.  An example of a split is if device
has 10.0.0.0/16 and kernel has 10.0.0.0/16 and 10.0.0.0/24, the device wouldn't
see the longest prefix 10.0.0.0/24 and potentially forward pkts incorrectly.

Limited capacity or limited capability are two ways a route may fail to install
to the offload device.  We'll not differentiate between failures at this time,
and treat any failure as fatal and mark the net as fib_offload_disabled.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman b5d6fbdeed switchdev: implement IPv4 fib ndo wrappers
Flesh out ndo wrappers to call into device driver.  To call into device driver,
the wrapper must interate over route's nexthops to ensure all nexthop devs
belong to the same switch device.  Currently, there is no support for route's
nexthops spanning offloaded and non-offloaded devices, or spanning ports of
multiple offload devices.

Since switch device ports may be stacked under virtual interfaces (bonds and/or
bridges), and the route's nexthop may be on the virtual interface, the wrapper
will traverse the nexthop dev down to the base dev.  It's the base dev that's
passed to the switchdev driver's ndo ops.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 104616e74e switchdev: don't support custom ip rules, for now
Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 5e8d90497d switchdev: add IPv4 fib ndo ops wrappers
Add IPv4 fib ndo wrapper funcs and stub them out for now.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 4586f1bb91 netdevice: add IPv4 fib add/del ops
Add two new ndo ops for IPv4 fib offload support, add and del.  Add uses
modifiy semantics if fib entry already offloaded.  Drivers implementing the new
ndo ops will return err<0 if programming device fails, for example if device's
tables are full.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 37ed949369 rtnetlink: add RTNH_F_EXTERNAL flag for fib offload
Add new RTNH_F_EXTERNAL flag to mark fib entries offloaded externally, for
example to a switchdev switch device.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:57 -05:00
Eric Dumazet 24d2e4a507 tg3: use napi_complete_done()
Using napi_complete_done() instead of napi_complete() allows
us to use /sys/class/net/ethX/gro_flush_timeout

GRO layer can aggregate more packets if the flush is delayed a bit,
without having to set too big coalescing parameters that impact
latencies.

Tested:

lpx:~# echo 0 >/sys/class/net/eth1/gro_flush_timeout

lpx:~# sar -n DEV 1 10 | grep eth1
10:36:25 AM      eth1  81290.00  40617.00 120479.67   2777.01      0.00      0.00      0.00
10:36:26 AM      eth1  81283.00  40608.00 120481.81   2778.13      0.00      0.00      1.00
10:36:27 AM      eth1  81304.00  40639.00 120518.42   2778.28      0.00      0.00      0.00
10:36:28 AM      eth1  81255.00  40605.00 120437.34   2775.95      0.00      0.00      1.00
10:36:29 AM      eth1  81306.00  40630.00 120521.44   2777.70      0.00      0.00      0.00
10:36:30 AM      eth1  81286.00  40564.00 120480.20   2773.31      0.00      0.00      0.00
10:36:31 AM      eth1  81256.00  40599.00 120438.81   2776.27      0.00      0.00      0.00
10:36:32 AM      eth1  81287.00  40594.00 120480.69   2776.69      0.00      0.00      0.00
10:36:33 AM      eth1  81279.00  40601.00 120478.53   2775.84      0.00      0.00      0.00
10:36:34 AM      eth1  81277.00  40610.00 120476.94   2776.25      0.00      0.00      0.00
Average:         eth1  81282.30  40606.70 120479.39   2776.54      0.00      0.00      0.20

lpx:~# echo 13000 >/sys/class/net/eth1/gro_flush_timeout

lpx:~# sar -n DEV 1 10 | grep eth1
10:36:43 AM      eth1  81257.00   7747.00 120437.44    530.00      0.00      0.00      0.00
10:36:44 AM      eth1  81278.00   7748.00 120480.00    529.85      0.00      0.00      0.00
10:36:45 AM      eth1  81282.00   7752.00 120479.09    531.09      0.00      0.00      0.00
10:36:46 AM      eth1  81282.00   7751.00 120478.80    530.90      0.00      0.00      0.00
10:36:47 AM      eth1  81276.00   7745.00 120478.31    529.64      0.00      0.00      0.00
10:36:48 AM      eth1  81278.00   7747.00 120478.50    529.81      0.00      0.00      0.00
10:36:49 AM      eth1  81282.00   7749.00 120478.88    530.01      0.00      0.00      0.00
10:36:50 AM      eth1  81284.00   7751.00 120481.52    530.20      0.00      0.00      0.00
10:36:51 AM      eth1  81299.00   7769.00 120481.74    533.81      0.00      0.00      0.00
10:36:52 AM      eth1  81281.00   7748.00 120478.62    529.96      0.00      0.00      0.00
Average:         eth1  81279.90   7750.70 120475.29    530.53      0.00      0.00      0.00

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:20:09 -05:00
David S. Miller 050f7dcd27 Merge branch 'dsa-next'
Florian Fainelli says:

====================
net: dsa: code re-organization

This pull request contains the first part of the patches required to implement
the grand plan detailed here:

http://www.spinics.net/lists/netdev/msg295942.html

These are mostly code re-organization and function bodies re-arrangement to
allow different callers of lower-level initialization functions for 'struct
dsa_switch' and 'struct dsa_switch_tree' to be later introduced.

There is no functional code change at this point.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:29 -05:00
Florian Fainelli c86e59b9e6 net: dsa: extract dsa switch tree setup and removal
Extract the core logic that setups a 'struct dsa_switch_tree' and
removes it, update dsa_probe() and dsa_remove() to use the two helper
functions. This will be useful to allow for other callers to setup
this structure differently.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:20 -05:00
Florian Fainelli 5929903103 net: dsa: let switches specify their tagging protocol
In order to support the new DSA device driver model, a dsa_switch should
be able to advertise the type of tagging protocol supported by the
underlying switch device. This also removes constraints on how tagging
can be stacked to each other.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:20 -05:00
Florian Fainelli df197195a5 net: dsa: split dsa_switch_setup into two functions
Split the part of dsa_switch_setup() which is responsible for allocating
and initializing a 'struct dsa_switch' and the part which is doing a
given switch device setup and slave network device creation.

This is a preliminary change to allow a separate caller of
dsa_switch_setup_one() which may have externally initialized the
dsa_switch structure, outside of dsa_switch_setup().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:20 -05:00
Florian Fainelli b324c07ac4 net: dsa: allow deferred probing
In preparation for allowing a different model to register DSA switches,
update dsa_of_probe() and dsa_probe() to return -EPROBE_DEFER where
appropriate.

Failure to find a phandle or Device Tree property is still fatal, but
looking up the internal device structure associated with a Device Tree
node is something that might need to be delayed based on driver probe
ordering.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:20 -05:00
Florian Fainelli f1a26a062f net: dsa: update dsa_of_{probe, remove} to use a device pointer
In preparation for allowing a different mechanism to register DSA switch
devices and driver, update dsa_of_probe and dsa_of_remove to take a
struct device pointer since neither of these two functions uses the
struct platform_device pointer.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:18:20 -05:00
Eric Dumazet 496127290f inet_diag: remove duplicate code from inet_twsk_diag_dump()
timewait sockets now share a common base with established sockets.

inet_twsk_diag_dump() can use inet_diag_bc_sk() instead of duplicating
code, granted that inet_diag_bc_sk() does proper userlocks
initialization.

twsk_build_assert() will catch any future changes that could break
the assumptions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 22:55:44 -05:00
Jeff Kirsher e815665e1a i40e: Fix mismatching type for ioremap_len
As pointed out by Ben Hutchings, ioremap uses unsigned long as
its parameter type, so we should be using that instead of u32
or int.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 22:11:38 -05:00
Erik Hugne d0f91938be tipc: add ip/udp media type
The ip/udp bearer can be configured in a point-to-point
mode by specifying both local and remote ip/hostname,
or it can be enabled in multicast mode, where links are
established to all tipc nodes that have joined the same
multicast group. The multicast IP address is generated
based on the TIPC network ID, but can be overridden by
using another multicast address as remote ip.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 22:08:42 -05:00