We saw a kernel oops in our regression testing when a multicast "join
finish" occurred just after the interface was -- this is
<https://bugs.openfabrics.org/show_bug.cgi?id=1040>. The test
randomly causes the HCA physical port to go down then up.
The cause of this is that ipoib_mcast_join_finish() processing happened
just after ipoib_mcast_dev_flush() was invoked (in which case the
broadcast pointer is NULL). This patch tests for and handles the case
where priv->broadcast is NULL.
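The guard can be modelled in userspace roughly as follows. This is only a sketch: the structure and function names are hypothetical stand-ins rather than the driver's definitions, and the -1 return value is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the driver structures. */
struct mcast_group { int joined; };
struct dev_priv { struct mcast_group *broadcast; };

/*
 * Model of the join-finish guard: returns 0 if the join result was
 * applied, -1 if the broadcast group had already been flushed.
 */
static int join_finish(struct dev_priv *priv)
{
    assert(priv != NULL);
    if (!priv->broadcast)   /* flush ran first; nothing to update */
        return -1;
    priv->broadcast->joined = 1;
    return 0;
}
```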
Cc: <stable@kernel.org>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f56bcd80 ("IPoIB: Use separate CQ for UD send completions")
introduced a bug where the transmit queue could get stopped and never
woken up. The problem is that send completions are only polled at the
end of the xmit function, so if the send queue fills up and the xmit
path stops the queue, then there is no way for send completions to
ever get polled, and so the transmit queue stays stopped forever.
Fix this by arming the send CQ just before posting the last send
request that fills the send queue. Then, when the completion event
handler is called, drain the send CQ. Since it is possible that not
enough send completions are in the CQ, verify that the net queue
has been woken up after draining the send CQ, and if not, arm a timer
and drain again in the timer function.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a child interface, copy the MTU information from the
parent. Otherwise when the child's multicast join completes, the MTU
will not be updated since the code does
    dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
and priv->admin_mtu will be set to 0.
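A small userspace model of the problem and the fix, under stated assumptions: the struct below is a hypothetical reduction of the private data to its MTU fields, the choice of which fields to copy is a guess for illustration, and 2044 is only an example MTU value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the private data to the MTU fields. */
struct priv_mtu {
    unsigned int mcast_mtu;  /* MTU of the joined broadcast group */
    unsigned int admin_mtu;  /* MTU requested by the administrator */
    unsigned int dev_mtu;    /* stands in for dev->mtu */
};

/* The fix: a new child inherits the parent's MTU settings. */
static void child_inherit_mtu(struct priv_mtu *child,
                              const struct priv_mtu *parent)
{
    assert(child != NULL && parent != NULL);
    child->mcast_mtu = parent->mcast_mtu;
    child->admin_mtu = parent->admin_mtu;
    child->dev_mtu = parent->dev_mtu;
}

/* What happens when the multicast join completes. */
static void join_complete(struct priv_mtu *p, unsigned int group_mtu)
{
    assert(p != NULL);
    p->mcast_mtu = group_mtu;
    /* same selection as the min() quoted above */
    p->dev_mtu = p->mcast_mtu < p->admin_mtu ? p->mcast_mtu : p->admin_mtu;
}
```

With admin_mtu left at 0, join_complete() yields a device MTU of 0 regardless of the group MTU; after child_inherit_mtu() it yields the expected value.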
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a dedicated CQ for UD send completions. Also, do not arm the UD
send CQ, which reduces the number of interrupts generated. This patch
further reduces overhead by not calling poll CQ for every posted send
WR -- it polls only when there are 16 or more outstanding work requests.
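The polling rule can be sketched as a pair of ring counters. The names below are hypothetical and the model ignores locking and the actual verbs calls; only the threshold of 16 comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define POLL_THRESHOLD 16u  /* figure quoted in the description above */

/* Hypothetical counters for one send ring. */
struct tx_ring {
    unsigned int head;  /* sends posted so far */
    unsigned int tail;  /* sends completed so far */
};

/* Post one send; returns 1 if the send path should now poll the CQ. */
static int post_and_check(struct tx_ring *r)
{
    assert(r != NULL);
    r->head++;
    return (r->head - r->tail) >= POLL_THRESHOLD;
}

/* Account for n completions reaped from the CQ. */
static void reap_completions(struct tx_ring *r, unsigned int n)
{
    assert(r != NULL && r->head - r->tail >= n);
    r->tail += n;
}
```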
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch enables IPoIB to use 4K UD messages (when the underlying
device and fabrics support a 4K MTU) by using two scatter buffers when
PAGE_SIZE is less than or equal to the HCA IB MTU size. The first
buffer is for IPoIB header + GRH header, and the second buffer is the
IPoIB payload, which is 4K-4.
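The buffer sizes work out as in the sketch below. The helper names are made up for illustration; the 4-byte IPoIB encapsulation header and 40-byte GRH are the standard sizes, and the two-buffer condition is the one stated above.

```c
#include <assert.h>
#include <stdbool.h>

#define IPOIB_HDR_LEN 4u   /* IPoIB encapsulation header */
#define IB_GRH_LEN    40u  /* InfiniBand Global Route Header */

/* Two scatter buffers are needed when one page cannot hold it all. */
static bool needs_two_buffers(unsigned int page_size, unsigned int ib_mtu)
{
    return page_size <= ib_mtu;
}

/* First buffer: IPoIB header + GRH only. */
static unsigned int head_buf_len(void)
{
    return IPOIB_HDR_LEN + IB_GRH_LEN;
}

/* Second buffer: the IPoIB payload, i.e. the IB MTU minus the header. */
static unsigned int payload_buf_len(unsigned int ib_mtu)
{
    assert(ib_mtu > IPOIB_HDR_LEN);
    return ib_mtu - IPOIB_HDR_LEN;
}
```

For a 4096-byte IB MTU this gives a 44-byte first buffer and a 4092-byte second buffer, which is the "4K-4" figure above.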
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a P_Key is deleted and then re-added at the same index, then IPoIB
gets confused because __ipoib_ib_dev_flush() only checks whether the
index is the same without checking whether the P_Key was present, so
the interface is stopped when the P_Key is deleted, but the event when
the P_Key is re-added gets ignored and the interface never gets
restarted.
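The failure mode can be modelled as a small state machine. Everything here is a hypothetical simplification (the real flush path does much more); the check_present switch selects between the old index-only comparison and a comparison that also remembers whether the P_Key was present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pkey_action { PKEY_IGNORE, PKEY_STOP, PKEY_RESTART };

/* Hypothetical record of the last P_Key lookup. */
struct pkey_state {
    bool present;          /* was the P_Key found last time? */
    unsigned short index;  /* table index where it was found */
};

static enum pkey_action handle_pkey_event(struct pkey_state *s, bool found,
                                          unsigned short idx,
                                          bool check_present)
{
    bool was_present;

    assert(s != NULL);
    was_present = s->present;

    if (!found) {               /* P_Key deleted: stop the interface */
        s->present = false;
        return PKEY_STOP;
    }
    s->present = true;
    /* Old behaviour (check_present == false) looks at the index only. */
    if (idx == s->index && (was_present || !check_present))
        return PKEY_IGNORE;
    s->index = idx;
    return PKEY_RESTART;
}
```

With check_present false, a delete followed by a re-add at the same index ends in PKEY_IGNORE and the interface stays down; with it true, the re-add produces PKEY_RESTART.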
Also, switch to using ib_find_pkey() instead of ib_find_cached_pkey()
everywhere in IPoIB, since none of the places that look for P_Keys are
in a fast path or in non-sleeping context, and in general we want to
kill off the whole caching infrastructure eventually. This also fixes
consistency problems caused because some IPoIB queries were cached and
some were uncached during the window where the cache was not updated.
Thanks to Venkata Subramonyam <vsubramo@cisco.com> for debugging this
problem and testing this fix.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This can be used to tune at run time the parameters controlling the
event (interrupt) generation rate and thus reduce the overhead
incurred by handling interrupts, resulting in better throughput. Since
IPoIB uses a single CQ for both RX and TX, RX is chosen to dictate
configuration for both RX and TX.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Just add the infrastructure so we can add functionality later.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For HCAs that support TCP segmentation offload (IB_DEVICE_UD_TSO), set
NETIF_F_TSO and use HW LSO to offload TCP segmentation.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert list_splice() + INIT_LIST_HEAD() to the equivalent list_splice_init()
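For reference, the equivalence can be shown with a minimal userspace re-implementation of the circular list. This is a sketch that mirrors the semantics of the kernel helpers, not the kernel's own code.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Move all entries of src to the front of dst; src is left stale. */
static void list_splice(struct list_head *src, struct list_head *dst)
{
    assert(src != dst);
    if (list_empty(src))
        return;
    src->next->prev = dst;          /* first entry now follows dst */
    src->prev->next = dst->next;    /* last entry precedes old first */
    dst->next->prev = src->prev;
    dst->next = src->next;
}

/* Same as list_splice() followed by re-initialising src. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
    list_splice(src, dst);
    init_list_head(src);
}
```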
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For HCAs that support checksum offload (i.e., that set IB_DEVICE_UD_IP_CSUM
in the device capabilities flags), have IPoIB set NETIF_F_IP_CSUM and
use the HCA to generate and verify IP checksums.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 7143740d ("IPoIB: Add send gather support") made struct
ipoib_tx_buf significantly larger, since the mapping member changed
from a single u64 to an array with MAX_SKB_FRAGS + 1 entries. This
means that allocating tx_rings with kzalloc() may fail because there
is not enough contiguous memory for the new, much bigger size. Fix
this regression by allocating the rings with vmalloc() instead.
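The growth in ring size can be estimated with the sketch below. The struct layouts are hypothetical reductions of the TX buffer to a pointer plus the mapping member, and MAX_SKB_FRAGS is assumed to be 18 (a typical value for 4K pages); the exact figures depend on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SKB_FRAGS 18  /* assumption: typical value with 4K pages */

/* Hypothetical layouts before and after send gather support. */
struct tx_buf_old { void *skb; uint64_t mapping; };
struct tx_buf_new { void *skb; uint64_t mapping[MAX_SKB_FRAGS + 1]; };

/* Bytes of contiguous memory needed for a ring of 'entries' buffers. */
static size_t ring_bytes(size_t entry_size, size_t entries)
{
    assert(entries == 0 || entry_size <= SIZE_MAX / entries);
    return entry_size * entries;
}
```

Comparing ring_bytes(sizeof(struct tx_buf_old), n) with ring_bytes(sizeof(struct tx_buf_new), n) for a given ring size n shows roughly how much more physically contiguous memory a kzalloc() of the ring has to find, which vmalloc() does not require.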
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 7143740d ("IPoIB: Add send gather support") made it possible
for tx_wr.num_sge to be != 1 -- this happens if send gather support is
enabled. However, the code in the connected mode post_send() function
assumes the old invariant, namely that tx_wr.num_sge is always 1. Fix
this by explicitly setting tx_wr.num_sge to 1 in the CM post_send().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When set_multicast_list() is called the multicast task is restarted
and the IPOIB_MCAST_STARTED bit is cleared. As a result, for some
window of time, multicast packets are neither transmitted nor queued but
rather dropped by ipoib_mcast_send(). These dropped packets are
painful in two cases:
- bonding fail-over which both calls set_multicast_list() on the new
active slave and sends Gratuitous ARP through that slave.
- IP_DROP_MEMBERSHIP code which both calls set_multicast_list() on the
device and issues IGMP leave.
In both these cases, depending on the scheduling of the IPoIB
multicast task, the packets would be dropped. As a result, in the
bonding case, the failover would not be detected by the peers until
their neighbour is renewed (which takes a few tens of
seconds). In the IGMP case, the IP router doesn't get an IGMP leave
and would only learn of that from further probes on the group (also a
delay of at least a few tens of seconds).
Fix this by allowing transmission (or queuing) depending on the
IPOIB_FLAG_OPER_UP flag instead of the IPOIB_MCAST_STARTED flag.
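The change in the send decision can be illustrated as follows. The bit positions and function names are hypothetical; only the choice of which flag gates transmission reflects the description above.

```c
#include <assert.h>

/* Hypothetical bit positions for the two flags. */
#define FLAG_OPER_UP       (1u << 0)
#define FLAG_MCAST_STARTED (1u << 1)

enum mcast_verdict { MCAST_DROP, MCAST_SEND_OR_QUEUE };

/* Old rule: drop whenever the multicast task is being restarted. */
static enum mcast_verdict mcast_send_old(unsigned int flags)
{
    return (flags & FLAG_MCAST_STARTED) ? MCAST_SEND_OR_QUEUE : MCAST_DROP;
}

/* New rule: send or queue as long as the interface is operational. */
static enum mcast_verdict mcast_send_new(unsigned int flags)
{
    return (flags & FLAG_OPER_UP) ? MCAST_SEND_OR_QUEUE : MCAST_DROP;
}
```

During the restart window the interface is up but the started bit is clear, so the old rule drops the packet while the new rule sends or queues it.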
Signed-off-by: Olga Shern <olgas@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit efcd9971 ("IPoIB/cm: Factor out ipoib_cm_free_rx_reap_list()")
introduced a bug in ipoib_cm_dev_stop() when the receive drain times
out. In that case, the function moves all the pending rx stuff into a
private list but then calls ipoib_cm_free_rx_reap_list(), which
handles a different list.
Fix this by moving everything to the rx_reap_list that will actually
get freed up.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=906>.
Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ipoib_cm_tx.ibwc is unused since commit 1b524963 ("IPoIB/cm:
Use common CQ for CM send completions"), so remove it.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
In P_Key event handling, if the old P_Key is no longer available, the
driver must call ipoib_ib_dev_stop() -- just as it does when the P_Key
is still available (see procedure __ipoib_ib_dev_flush()).
When a P_Key becomes available, the driver will perform ipoib_open(),
which assumes that the QP is in RESET, the cm_id has been
destroyed/deleted, etc. If ipoib_ib_dev_stop() is not called as
described above, then these assumptions will be false, and the attempt
to bring the interface up will fail.
Found by Mellanox QA.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch acts as a preparation for using checksum offload for IB
devices capable of inserting/verifying checksum in IP packets. The
patch does not actually turn on NETIF_F_SG - we defer that to the
patches adding checksum offload capabilities.
We only add support for send gathers for datagram mode, since existing
HW does not support checksum offload on connected QPs.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
All current InfiniBand devices can handle all DMA addresses, and it's
hard to imagine anyone would be silly enough to build a new device
that couldn't. Therefore, enable the NETIF_F_HIGHDMA feature for IPoIB.
This has no effect for now, but is needed when we enable gather/scatter
support and checksum stateless offloads.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 732a2170 ("IB/ipoib: Bound the net device to the ipoib_neigh
structue") left a misleading debug print (n->dev would be a bond
device only if bonding is used). Clean it up.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move up the code that checks for a situation where the remote GID
stored in the ipoib_neigh is different than the one present in the
neighbour (handle gratuitous ARP) or that a bonding fail over has
happened but the neighbour still has a pointer to an ipoib_neigh
created by a different device than the current slave. This will cause
the driver to apply the check also for connected mode neighbours.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
qdisc_run() now tests for queue_stopped() before calling
__qdisc_run(), and the same check is done in every iteration of
__qdisc_run(), so another check is not required in the driver xmit.
This means that ipoib_start_xmit() no longer needs to test
netif_queue_stopped(); the test was added to fix earlier kernels,
where the networking stack did not guarantee that the xmit method of
an LLTX driver would not be called after the queue was stopped, but
current kernels do provide this guarantee.
To validate, I put a debug print in the TX_BUSY path, which was never hit
with 64 threads running overnight and exercising this code a few hundred
million times.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some HCAs (such as ehca2) support SRQ, but only support fewer than 16 SG
entries for SRQs. Currently IPoIB/CM implicitly assumes all HCAs will
support 16 SG entries for SRQs (to handle a 64K MTU with 4K pages). This
patch removes that restriction by limiting the maximum MTU in connected
mode to what the maximum number of SRQ SG entries allows.
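The resulting limit can be sketched as below. The cap of 16 entries and the 64K figure come from the description above; the 16-byte reserve subtracted from the total and the names used are assumptions made for illustration.

```c
#include <assert.h>

#define CM_MAX_FRAGS 16u  /* 64K MTU with 4K pages, as described above */
#define CM_RESERVE   16u  /* assumption: headroom kept below the buffer size */

/* Largest connected-mode MTU the SRQ scatter list can hold. */
static unsigned int cm_max_mtu(unsigned int max_srq_sge,
                               unsigned int page_size)
{
    unsigned int frags = max_srq_sge < CM_MAX_FRAGS ? max_srq_sge
                                                    : CM_MAX_FRAGS;

    assert(frags > 0 && page_size > CM_RESERVE);
    return frags * page_size - CM_RESERVE;
}
```

Under these assumptions an HCA with 16 or more SRQ SG entries keeps the full 65520-byte MTU, while one with only 4 entries and 4K pages is limited to 16368 bytes.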
This patch addresses <https://bugs.openfabrics.org/show_bug.cgi?id=728>
Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs. The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised. This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.
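A userspace sketch of the IPv4 mapping with the extra parameter is shown below. It follows the multicast GID layout of RFC 4391 as I understand it; the function name is hypothetical and this is not the kernel routine itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IB_ADDR_LEN 20  /* 4 bytes of flags/QPN followed by a 16-byte GID */

/*
 * Map an IPv4 multicast address (host byte order) to a 20-byte IPoIB
 * link-level address, taking the scope and P_Key from the link-level
 * broadcast address instead of hard-coding them.
 */
static void ipv4_mc_map(uint32_t addr, const uint8_t *broadcast, uint8_t *buf)
{
    assert(broadcast != NULL && buf != NULL);
    memset(buf, 0, IB_ADDR_LEN);
    buf[1] = 0xff;                          /* multicast QPN */
    buf[2] = 0xff;
    buf[3] = 0xff;
    buf[4] = 0xff;                          /* multicast GID prefix */
    buf[5] = 0x10 | (broadcast[5] & 0x0f);  /* flags | scope from broadcast */
    buf[6] = 0x40;                          /* IPv4 signature 0x401b */
    buf[7] = 0x1b;
    buf[8] = broadcast[8];                  /* P_Key from broadcast */
    buf[9] = broadcast[9];
    buf[16] = (addr >> 24) & 0x0f;          /* low 28 bits of the group */
    buf[17] = (addr >> 16) & 0xff;
    buf[18] = (addr >> 8) & 0xff;
    buf[19] = addr & 0xff;
}
```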
The next step will be to add a way to configure the scope for an IPoIB
interface.
Signed-off-by: Rolf Manderscheid <rvm@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>