Commit Graph

93 Commits

Author SHA1 Message Date
Moni Shoua 2efdd6a038 RDMA/cma: Don't allow IPoIB port space for IBoE
This patch fixes a kernel crash in cma_set_qkey().

When the link layer is Ethernet, it is wrong to use IPoIB port space
since no IPoIB interface is available.  Specifically, setting the
Q_Key when port space is RDMA_PS_IPOIB requires MGID calculation and
an SA query, which doesn't make sense over Ethernet.

Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18 16:50:25 -07:00
Jack Morgenstein 0c9361fcde RDMA/cma: Avoid assigning an IS_ERR value to cm_id pointer in CMA id object
Avoid assigning an IS_ERR value to the cm_id pointer.  This fixes a
few anomalies in the error flow due to confusion about checking for
NULL vs IS_ERR, and eliminates the need to test for the IS_ERR value
every time we wish to determine if the cma_id object has a cm device
associated with it.

Also, eliminate the now-unnecessary procedure cma_has_cm_dev (we can
check directly for the existence of the device pointer -- for a
non-NULL check, makes no difference if it is the iwarp or the ib
pointer).

Finally, make a few code changes here to improve coding consistency.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18 09:44:55 -07:00
Nir Muchtar 83e9502d8d RDMA/cma: Save PID of ID's owner
Save the PID associated with an RDMA CM ID for reporting via netlink.

Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
Nir Muchtar 753f618ae0 RDMA/cma: Add support for netlink statistics export
Add callbacks and data types for statistics export of all current
devices/ids.  The schema for RDMA CM is a series of netlink messages.
Each one contains an rdma_cm_stat struct.  Additionally, two netlink
attributes are created for the addresses for each message (if
applicable).

Their types used are:
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR (The source address for this ID)
RDMA_NL_RDMA_CM_ATTR_DST_ADDR (The destination address for this ID)
sockaddr_* structs are encapsulated within these attributes.

In other words, every transaction contains a series of messages like:

-------message 1-------
struct rdma_cm_id_stats {
       __u32 qp_num;
       __u32 bound_dev_if;
       __u32 port_space;
       __s32 pid;
       __u8 cm_state;
       __u8 node_type;
       __u8 port_num;
       __u8 reserved;
}
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute - contains the source address
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute - contains the destination address
-------end 1-------
-------message 2-------
struct rdma_cm_id_stats
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute
-------end 2-------

Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
Sean Hefty b26f9b9949 RDMA/cma: Pass QP type into rdma_create_id()
The RDMA CM currently infers the QP type from the port space selected
by the user.  In the future (eg with RDMA_PS_IB or XRC), there may not
be a 1-1 correspondence between port space and QP type.  For netlink
export of RDMA CM state, we want to export the QP type to userspace,
so it is cleaner to explicitly associate a QP type to an ID.

Modify rdma_create_id() to allow the user to specify the QP type, and
use it to make our selections of datagram versus connected mode.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:23 -07:00
Nir Muchtar 550e5ca77e RDMA/cma: Export enum cma_state in <rdma/rdma_cm.h>
Move cma.c's internal definition of enum cma_state to enum rdma_cm_state
in an exported header so that it can be exported via RDMA netlink.

Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-25 13:46:22 -07:00
Hefty, Sean a9bb79128a RDMA/cma: Add an ID_REUSEADDR option
Lustre requires that clients bind to a privileged port number before
connecting to a remote server.  On larger clusters (typically more
than about 1000 nodes), the number of privileged ports is exhausted,
resulting in lustre being unusable.

To handle this, we add support for reusable addresses to the rdma_cm.
This mimics the behavior of the socket option SO_REUSEADDR.  A user
may set an rdma_cm_id to reuse an address before calling
rdma_bind_addr() (explicitly or implicitly).  If set, other
rdma_cm_id's may be bound to the same address, provided that they all
have reuse enabled, and there are no active listens.

If rdma_listen() is called on an rdma_cm_id that has reuse enabled, it
will only succeed if there are no other id's bound to that same
address.  The reuse option is exported to user space.  The behavior of
the kernel reuse implementation was verified against that given by
sockets.

This patch is derived from a path by Ira Weiny <weiny2@llnl.gov>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:10 -07:00
Hefty, Sean 43b752daae RDMA/cma: Fix handling of IPv6 addressing in cma_use_port
cma_use_port() assumes that the sockaddr is an IPv4 address.  Since
IPv6 addressing is supported (and also to support other address
families) make the code more generic in its address handling.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:10 -07:00
Sean Hefty a396d43a35 RDMA/cma: Replace global lock in rdma_destroy_id() with id-specific one
rdma_destroy_id currently uses the global rdma cm 'lock' to test if an
rdma_cm_id has been bound to a device.  This prevents an active
address resolution callback handler from assigning a device to the
rdma_cm_id after rdma_destroy_id checks for one.

Instead, we can replace the use of the global lock around the check to
the rdma_cm_id device pointer by setting the id state to destroying,
then flushing all active callbacks.  The latter is accomplished by
acquiring and releasing the handler_mutex.  Any active handler will
complete first, and any newly scheduled handlers will find the
rdma_cm_id in an invalid state.

In addition to optimizing the current locking scheme, the use of the
rdma_cm_id mutex is a more intuitive synchronization mechanism than
that of the global lock.  These changes are based on feedback from
Doug Ledford <dledford@redhat.com> while he was trying to debug a
crash in the rdma cm destroy path.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-03-15 10:57:34 -07:00
Sean Hefty 25ae21a101 RDMA/cma: Fix crash in request handlers
Doug Ledford and Red Hat reported a crash when running the rdma_cm on
a real-time OS.  The crash has the following call trace:

    cm_process_work
       cma_req_handler
          cma_disable_callback
          rdma_create_id
             kzalloc
             init_completion
          cma_get_net_info
          cma_save_net_info
          cma_any_addr
             cma_zero_addr
          rdma_translate_ip
             rdma_copy_addr
          cma_acquire_dev
             rdma_addr_get_sgid
             ib_find_cached_gid
             cma_attach_to_dev
          ucma_event_handler
             kzalloc
             ib_copy_ah_attr_to_user
          cma_comp

[ preempted ]

    cma_write
        copy_from_user
        ucma_destroy_id
           copy_from_user
           _ucma_find_context
           ucma_put_ctx
           ucma_free_ctx
              rdma_destroy_id
                 cma_exch
                 cma_cancel_operation
                 rdma_node_get_transport

        rt_mutex_slowunlock
        bad_area_nosemaphore
        oops_enter

They were able to reproduce the crash multiple times with the
following details:

    Crash seems to always happen on the:
            mutex_unlock(&conn_id->handler_mutex);
    as conn_id looks to have been freed during this code path.

An examination of the code shows that a race exists in the request
handlers.  When a new connection request is received, the rdma_cm
allocates a new connection identifier.  This identifier has a single
reference count on it.  If a user calls rdma_destroy_id() from another
thread after receiving a callback, rdma_destroy_id will proceed to
destroy the id and free the associated memory.  However, the request
handlers may still be in the process of running.  When control returns
to the request handlers, they can attempt to access the newly created
identifiers.

Fix this by holding a reference on the newly created rdma_cm_id until
the request handler is through accessing it.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-03-15 10:00:28 -07:00
Eli Cohen af7bd46376 IB/core: Add VLAN support for IBoE
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:

    GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.

The 3 bits user priority field of the packets are identical to the 3
bits of the SL.

In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits effectively taking to 3 MS bits
of the TOS field.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-25 10:20:39 -07:00
Eli Cohen 3c86aa70bf RDMA/cm: Add RDMA CM support for IBoE devices
Add support for IBoE device binding and IP --> GID resolution.  Path
resolving and multicast joining are implemented within cma.c by
filling in the responses and running callbacks in the CMA work queue.

IP --> GID resolution always yields IPv6 link local addresses; remote
GIDs are derived from the destination MAC address of the remote port.
Multicast GIDs are always mapped to multicast MACs as is done in IPv6.
(IPv4 multicast is enabled by translating IPv4 multicast addresses to
IPv6 multicast as described in
<http://www.mail-archive.com/ipng@sunroof.eng.sun.com/msg02134.html>.)

Some helper functions are added to ib_addr.h.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-10-13 15:46:43 -07:00
Roland Dreier ffebedb7ab Merge branches 'amso1100', 'bkl', 'cma', 'cxgb3', 'cxgb4', 'ipoib', 'iser', 'masked-atomics', 'misc', 'mthca' and 'nes' into for-next 2010-05-15 20:06:01 -07:00
Julia Lawall 9893e742a0 IB/core: Use kmemdup() instead of kmalloc()+memcpy()
Use kmemdup when some other buffer is immediately copied into the
allocated region.

A simplified version of the semantic patch that makes this change is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
statement S;
@@

-  to = \(kmalloc\|kzalloc\)(size,flag);
+  to = kmemdup(from,size,flag);
   if (to==NULL || ...) S
-  memcpy(to, from, size);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-15 20:05:07 -07:00
Tetsuo Handa 5d7220e8dc RDMA/cma: Randomize local port allocation
Randomize local port allocation in the way sctp_get_port_local() does.
Update rover at the end of loop since we're likely to pick a valid port
on the first try.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-04-21 16:18:40 -07:00
Linus Torvalds 0eddb519b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  IB/mlx4: Check correct variable for allocation failure
  RDMA/nes: Correct cap.max_inline_data assignment in nes_query_qp()
  RDMA/cm: Set num_paths when manually assigning path records
  IB/cm: Fix device_create() return value check
2010-04-09 11:53:06 -07:00
Sean Hefty ae2d9293d7 RDMA/cm: Set num_paths when manually assigning path records
When manually assigning the path records to use for a connection, save
the number of paths that were set.  Otherwise, checks against num_path
will show 0, even though path record data is available.

This was discovered by manually setting the path records from user
space, then querying the kernel to see if the correct path records
were assigned, only to discover that the kernel returned 0 path
records to the query.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-04-07 14:13:22 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Sean Hefty 8523c04809 RDMA/cm: Revert association of an RDMA device when binding to loopback
Revert the following change from commit 6f8372b6 ("RDMA/cm: fix
loopback address support")

   The defined behavior of rdma_bind_addr is to associate an RDMA
   device with an rdma_cm_id, as long as the user specified a non-
   zero address.  (ie they weren't just trying to reserve a port)
   Currently, if the loopback address is passed to rdma_bind_addr,
   no device is associated with the rdma_cm_id.  Fix this.

It turns out that important apps such as Open MPI depend on
rdma_bind_addr() NOT associating any RDMA device when binding to a
loopback address.  Open MPI is being updated to deal with this, but at
least until a new Open MPI release is available, maintain the previous
behavior: allow rdma_bind_addr() to succeed, but do not bind to a
device.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-02-10 12:00:48 -08:00
Robert P. J. Day fd4582a399 IB/addr: Correct CONFIG_IPv6 to CONFIG_IPV6
Correct misspelled "CONFIG_IPv6" that was introduced in commit
d14714df ("IB/addr: Fix IPv6 routing lookup").  The config variable
should be all uppercase.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>

[ This was my fault when I munged the original patch.  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-01-06 13:16:30 -08:00
Sean Hefty d14714df61 IB/addr: Fix IPv6 routing lookup
Include link scope as part of address resolution.  Combine local
and remote address resolution into a single, simpler code path.
Fix error checking in the IPv6 routing lookups.

Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthorpe@obsidianresearch.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ Fix up cma_check_linklocal() for !IPV6 case.  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 16:46:25 -08:00
Sean Hefty 6f8372b69c RDMA/cm: fix loopback address support
The RDMA CM is intended to support the use of a loopback address
when establishing a connection; however, the behavior of the CM
when loopback addresses are used is confusing and does not always
work, depending on whether loopback was specified by the server,
the client, or both.

The defined behavior of rdma_bind_addr is to associate an RDMA
device with an rdma_cm_id, as long as the user specified a non-
zero address.  (ie they weren't just trying to reserve a port)
Currently, if the loopback address is passed to rdam_bind_addr,
no device is associated with the rdma_cm_id.  Fix this.

If a loopback address is specified by the client as the destination
address for a connection, it will fail to establish a connection.
This is true even if the server is listing across all addresses or
on the loopback address itself.  The issue is that the server tries
to translate the IP address carried in the REQ message to a local
net_device address, which fails.  The translation is not needed in
this case, since the REQ carries the actual HW address that should
be used.

Finally, cleanup loopback support to be more transport neutral.
Replace separate calls to get/set the sgid and dgid from the
device address to a single call that behaves correctly depending
on the format of the device address.  And support both IPv4 and
IPv6 address formats.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ Fixed RDS build by s/ib_addr_get/rdma_addr_get/  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 13:26:06 -08:00
Sean Hefty c4315d85f9 IB/addr: Store net_device type instead of translating to RDMA transport
The struct rdma_dev_addr stores net_device address information:
the source device address, destination hardware address, and
broadcast address.  For consistency, store the net_device type
rather than converting it to the rdma_node_type.

The type indicates the format of the various hardware addresses,
which is what we're concerned with, and not the RDMA node type
that the address may map to.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 12:57:18 -08:00
Sean Hefty 6266ed6e41 RDMA/cma: Replace net_device pointer with index
Provide the device interface when resolving route information to
ensure that the correct outbound device is used.  This will also
simplify processing of sin6_scope_id for IPv6 support.

Based on work from:
David Wilder <dwilder@us.ibm.com>
Jason Gunthorpe <jgunthrope@obsidianresearch.com>

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 12:55:22 -08:00
Jason Gunthorpe e2e626972e RDMA/cma: Fix AF_INET6 support in multicast joining
If joining to an AF_INET6 address, we need to map the address to a MGID
in the same way as the IP stack.  The old code would just fall through to
the IPv4 case and generate garbage.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 12:55:22 -08:00