Commit Graph

30102 Commits

Author SHA1 Message Date
Eric Dumazet
46d3ceabd8 tcp: TCP Small Queues
This introduce TSQ (TCP Small Queues)

TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.

sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklest per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 18:12:59 -07:00
David S. Miller
48ee3569f3 ipv6: Move ipv6 twsk accessors outside of CONFIG_IPV6 ifdefs.
Fixes build when ipv6 is disabled.

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 02:39:24 -07:00
David S. Miller
87a50699cb rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo().
Nobody provides non-zero values any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:13 -07:00
David S. Miller
b6242b9b45 tcp: Remove tw->tw_peer
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:09 -07:00
Johannes Berg
ad7eee98be etherdevice: introduce eth_broadcast_addr
A lot of code has either the memset or an inefficient copy
from a static array that contains the all-ones broadcast
address. Introduce eth_broadcast_addr() to fill an address
with all ones, making the code clearer and allowing us to
get rid of some constant arrays.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 18:06:44 -07:00
Christian Hohnstaedt
d5bf9071e7 phylib: Support registering a bunch of drivers
If registering of one of them fails, all already registered drivers
of this module will be unregistered.

Use the new register/unregister functions in all drivers
registering more than one driver.

amd.c, realtek.c: Simplify: directly return registration result.

Tested with broadcom.c
All others compile-tested.

Signed-off-by: Christian Hohnstaedt <chohnstaedt@innominate.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-09 00:10:56 -07:00
David S. Miller
8f961faef7 Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next 2012-07-07 16:29:29 -07:00
Hadar Hen Zion
592e49dda8 net/mlx4: Implement promiscuous mode with device managed flow-steering
The device managed flow steering API has three promiscuous modes:

1. Uplink - captures all the packets that arrive to the port.
2. Allmulti - captures all multicast packets arriving to the port.
3. Function port - for future use, this mode is not implemented yet.

Use these modes with the flow_attach and flow_detach firmware commands
according to the promiscuous state of the netdevice.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-07 16:23:06 -07:00
Hadar Hen Zion
0ff1fb654b {NET, IB}/mlx4: Add device managed flow steering firmware API
The driver is modified to support three operation modes.

If supported by firmware use the device managed flow steering
API, that which we call device managed steering mode. Else, if
the firmware supports the B0 steering mode use it, and finally,
if none of the above, use the A0 steering mode.

When the steering mode is device managed, the code is modified
such that L2 based rules set by the mlx4_en driver for Ethernet
unicast and multicast, and the IB stack multicast attach calls
done through the mlx4_ib driver are all routed to use the device
managed API.

When attaching rule using device managed flow steering API,
the firmware returns a 64 bit registration id, which is to be
provided during detach.

Currently the firmware is always programmed during HCA initialization
to use standard L2 hashing. Future work should be done to allow
configuring the flow-steering hash function with common, non
proprietary means.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-07 16:23:05 -07:00
Hadar Hen Zion
8fcfb4db74 net/mlx4_core: Add firmware commands to support device managed flow steering
Add support for firmware commands to attach/detach a new device managed
steering mode. Such network steering rules allow the user to provide an
L2/L3/L4 flow specification to the firmware and have the device to steer
traffic that matches that specification to the provided QP.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-07 16:23:05 -07:00
Hadar Hen Zion
c96d97f4d1 net/mlx4: Set steering mode according to device capabilities
Instead of checking the firmware supported steering mode in various
places in the code, add a dedicated field in the mlx4 device capabilities
structure which is written once during the initialization flow and read
across the code.

This also set the grounds for add new steering modes. Currently two modes
are supported, and are named after the ConnectX HW versions A0 and B0.

A0 steering uses mac_index, vlan_index and priority to steer traffic
into pre-defined range of QPs.

B0 steering uses Ethernet L2 hashing rules and is enabled only
if the firmware supports both unicast and multicast B0 steering,

The current steering modes are relevant for Ethernet traffic only,
such that Infiniband steering remains untouched.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-07 16:23:05 -07:00
David S. Miller
d3a5ea6e21 Merge branch 'master' of git://1984.lsi.us.es/nf-next 2012-07-07 16:18:50 -07:00
David S. Miller
c90a9bb907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-05 03:44:25 -07:00
Yuval Mintz
16917b87a2 net-next: Add netif_get_num_default_rss_queues
Most multi-queue networking driver consider the number of online cpus when
configuring RSS queues.
This patch adds a wrapper to the number of cpus, setting an upper limit on the
number of cpus a driver should consider (by default) when allocating resources
for his queues.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 03:06:44 -07:00
Krishna Kumar
46ba5a25f5 netfilter: nfnetlink_queue: do not allow to set unsupported flag bits
Allow setting of only supported flag bits in queue->flags.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-07-04 19:51:50 +02:00
Rostislav Lisovy
f057bbb6f9 net: em_canid: Ematch rule to match CAN frames according to their identifiers
This ematch makes it possible to classify CAN frames (AF_CAN) according
to their identifiers. This functionality can not be easily achieved with
existing classifiers, such as u32, because CAN identifier is always stored
in native endianness, whereas u32 expects Network byte order.

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2012-07-04 13:07:05 +02:00
Linus Torvalds
a3da2c6913 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block bits from Jens Axboe:
 "As vacation is coming up, thought I'd better get rid of my pending
  changes in my for-linus branch for this iteration.  It contains:

   - Two patches for mtip32xx.  Killing a non-compliant sysfs interface
     and moving it to debugfs, where it belongs.

   - A few patches from Asias.  Two legit bug fixes, and one killing an
     interface that is no longer in use.

   - A patch from Jan, making the annoying partition ioctl warning a bit
     less annoying, by restricting it to !CAP_SYS_RAWIO only.

   - Three bug fixes for drbd from Lars Ellenberg.

   - A fix for an old regression for umem, it hasn't really worked since
     the plugging scheme was changed in 3.0.

   - A few fixes from Tejun.

   - A splice fix from Eric Dumazet, fixing an issue with pipe
     resizing."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scsi: Silence unnecessary warnings about ioctl to partition
  block: Drop dead function blk_abort_queue()
  block: Mitigate lock unbalance caused by lock switching
  block: Avoid missed wakeup in request waitqueue
  umem: fix up unplugging
  splice: fix racy pipe->buffers uses
  drbd: fix null pointer dereference with on-congestion policy when diskless
  drbd: fix list corruption by failing but already aborted reads
  drbd: fix access of unallocated pages and kernel panic
  xen/blkfront: Add WARN to deal with misbehaving backends.
  blkcg: drop local variable @q from blkg_destroy()
  mtip32xx: Create debugfs entries for troubleshooting
  mtip32xx: Remove 'registers' and 'flags' from sysfs
  blkcg: fix blkg_alloc() failure path
  block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
  block: fix return value on cfq_init() failure
  mtip32xx: Remove version.h header file inclusion
  xen/blkback: Copy id field when doing BLKIF_DISCARD.
2012-07-03 15:45:10 -07:00
Giuseppe CAVALLARO
a59a4d1921 phy: add the EEE support and the way to access to the MMD registers.
This patch adds the support for the Energy-Efficient Ethernet (EEE)
to the Physical Abstraction Layer.
To support the EEE we have to access to the MMD registers 3.20 and
7.60/61. So two new functions have been added to read/write the MMD
registers (clause 45).

An Ethernet driver (I tested the stmmac) can invoke the phy_init_eee to properly
check if the EEE is supported by the PHYs and it can also set the clock
stop enable bit in the 3.0 register.
The phy_get_eee_err can be used for reporting the number of time where
the PHY failed to complete its normal wake sequence.

In the end, this patch also adds the EEE ethtool support implementing:
 o phy_ethtool_set_eee
 o phy_ethtool_get_eee

v1: initial patch
v2: fixed some errors especially on naming convention
v3: renamed again the mmd read/write functions thank to Ben's feedback
v4: moved file to phy.c and added the ethtool support.
v5: fixed phy_adv_to_eee, phy_eee_to_supported, phy_eee_to_adv return
    values according to ethtool API (thanks to Ben's feedback).
    Renamed some macros to avoid too long names.
v6: fixed kernel-doc comments to be properly parsed.
    Fixed the phy_init_eee function: we need to check which link mode
    was autonegotiated and then the corresponding bits in 7.60 and 7.61
    registers.
v7: reviewed the way to get the negotiated settings.
v8: fixed a problem in the phy_init_eee return value erroneously added
    when included the phy_read_status call.
v9: do not remove the MDIO_AN_EEE_ADV_100TX and MDIO_AN_EEE_ADV_1000T
    and fixed the eee_{cap,lp,adv} declaration as "int" instead of u16.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-01 03:34:50 -07:00
Randy Dunlap
87fac28808 linux/irq.h: fix kernel-doc warning
Fix kernel-doc warning.  This struct member was removed in commit
875682648b ("irq: Remove irq_chip->release()") so remove its
associated kernel-doc entry also.

  Warning(include/linux/irq.h:338): Excess struct/union/enum/typedef member 'release' description in 'irq_chip'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-30 15:56:40 -07:00
Jiri Pirko
bb35f67195 net: introduce new priv_flag indicating iface capable of change mac when running
Introduce IFF_LIVE_ADDR_CHANGE priv_flag and use it to disable
netif_running() check in eth_mac_addr()

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-30 01:08:00 -07:00
Pablo Neira Ayuso
03292745b0 netlink: add nlk->netlink_bind hook for module auto-loading
This patch adds a hook in the binding path of netlink.

This is used by ctnetlink to allow module autoloading for the case
in which one user executes:

 conntrack -E

So far, this resulted in nfnetlink loaded, but not
nf_conntrack_netlink.

I have received in the past many complains on this behaviour.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-29 16:46:06 -07:00
Pablo Neira Ayuso
a31f2d17b3 netlink: add netlink_kernel_cfg parameter to netlink_kernel_create
This patch adds the following structure:

struct netlink_kernel_cfg {
        unsigned int    groups;
        void            (*input)(struct sk_buff *skb);
        struct mutex    *cb_mutex;
};

That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.

I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.

That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.

This patch also adapts all callers to use this new interface.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-29 16:46:02 -07:00
John W. Linville
8732baafc3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
	drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
2012-06-29 12:42:14 -04:00
David S. Miller
b26d344c6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/caif/caif_hsi.c
	drivers/net/usb/qmi_wwan.c

The qmi_wwan merge was trivial.

The caif_hsi.c, on the other hand, was not.  It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.

I did my best with that one and will ask Sjur to check it out.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 17:37:00 -07:00
Linus Torvalds
f3747e2f2c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking update from David Miller:

 1) Pairing and deadlock fixes in bluetooth from Johan Hedberg.

 2) Add device IDs for AR3011 and AR3012 bluetooth chips.  From
    Giancarlo Formicuccia and Marek Vasut.

 3) Fix wireless regulatory deadlock, from Eliad Peller.

 4) Fix full TX ring panic in bnx2x driver, from Eric Dumazet.

 5) Revert the two commits that added skb_orphan_try(), it causes
    erratic bonding behavior with UDP clients and the gains it used to
    give are mostly no longer happening due to how BQL works.  From Eric
    Dumazet.

 6) It took two tries, but Thomas Graf fixed a problem wherein we
    registered ipv6 routing procfs files before their backend data were
    initialized properly.

 7) Fix max GSO size setting in be2net, from Sarveshwar Bandi.

 8) PHY device id mask is wrong for KSZ9021 and KS8001 chips, fix from
    Jason Wang.

 9) Fix use of stale SKB data pointer after skb_linearize() call in
    batman-adv, from Antonio Quartulli.

10) Fix memory leak in IXGBE due to missing __GFP_COMP, from Alexander
    Duyck.

11) Fix probing of Gobi devices in qmi_wwan usbnet driver, from Bjørn
    Mork.

12) Fix suspend/resume and open failure handling in usbnet from Ming
    Lei.

13) Attempt to fix device r8169 hangs for certain chips, from Francois
    Romieu.

14) Fix advancement of RX dirty pointer in some situations in sh_eth
    driver, from Yoshihiro Shimoda.

15) Attempt to fix restart of IPV6 routing table dumps when there is an
    intervening table update.  From Eric Dumazet.

16) Respect security_inet_conn_request() return value in ipv6 TCP.  From
    Neal Cardwell.

17) Add another iPAD device ID to ipheth driver, from Davide Gerhard.

18) Fix access to freed SKB in l2tp_eth_dev_xmit(), and fix l2tp lockdep
    splats, from Eric Dumazet.

19) Make sure all bridge devices, regardless of whether they were
    created via netlink or ioctls, have their rtnetlink ops hooked up.
    From Thomas Graf and Stephen Hemminger.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
  9p: fix min_t() casting in p9pdu_vwritef()
  can: flexcan: use be32_to_cpup to handle the value of dt entry
  xen/netfront: teardown the device before unregistering it.
  bridge: Assign rtnl_link_ops to bridge devices created via ioctl (v2)
  vhost: use USER_DS in vhost_worker thread
  ixgbe: Do not pad FCoE frames as this can cause issues with FCoE DDP
  net: l2tp_eth: use LLTX to avoid LOCKDEP splats
  mac802154: add missed braces
  net: l2tp_eth: fix l2tp_eth_dev_xmit race
  net/mlx4_en: Release QP range in free_resources
  net/mlx4: Use single completion vector after NOP failure
  net/mlx4_en: Set correct port parameters during device initialization
  ipheth: add support for iPad
  caif-hsi: Add missing return in error path
  caif-hsi: Bugfix - Piggyback'ed embedded CAIF frame lost
  caif: Clear shutdown mask to zero at reconnect.
  tcp: heed result of security_inet_conn_request() in tcp_v6_conn_request()
  ipv6: fib: fix fib dump restart
  batman-adv: fix race condition in TT full-table replacement
  batman-adv: only drop packets of known wifi clients
  ...
2012-06-28 11:20:31 -07:00