Commit Graph

314 Commits

Author SHA1 Message Date
Keith Busch d48756228e nvme/pci: Don't free queues on error
The nvme_remove function tears down all allocated resources in the correct
order, so no need to free queues on error during initialization. This
fixes possible use-after-free errors when queues are still associated
with a blk-mq hctx.

Reported-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 12:39:57 -07:00
Sagi Grimberg 14c862dbb0 nvmet-rdma: drain the queue-pair just before freeing it
draining the qp right after disconnect might not suffice because
the nvmet sq is not fully drained (in nvmet_sq_destroy) and we might
see completions after the drain. Instead, drain right before the
qp destroy which comes after the sq destruction and we can be sure
that no posts come after the drain.

Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:53 +02:00
Steve Wise c8dbc37cd8 nvme-rdma: stop and free io queues on connect failure
While testing nvme-rdma with the spdk nvmf target over iw_cxgb4, I
configured the target (mistakenly) to generate an error creating the
NVMF IO queues.  This resulted a "Invalid SQE Parameter" error sent back
to the host on the first IO queue connect:

[ 9610.928182] nvme nvme1: queue_size 128 > ctrl maxcmd 120, clamping down
[ 9610.938745] nvme nvme1: creating 32 I/O queues.

So nvmf_connect_io_queue() returns an error to
nvmf_connect_io_queue() / nvmf_connect_io_queues(), and that
is returned to nvme_rdma_create_io_queues().  In the error path,
nvmf_rdma_create_io_queues() frees the queue tagset memory _before_
stopping and freeing the IB queues, which causes yet another
touch-after-free crash due to SQ CQEs being flushed after the ib_cqe
structs pointed-to by the flushed WRs have been freed (since they are
part of the nvme_rdma_request struct).

The fix is to stop and free the queues in nvmf_connect_io_queues()
if there is an error connecting any of the queues.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:53 +02:00
Sagi Grimberg 766dbb179d nvmet-rdma: don't forget to delete a queue from the list of connection failed
In case we accepted a queue connection and it failed, we might not
remove the queue from the list until we unload and clean it up.
We should delete it from the queue list on the relevant handler.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:52 +02:00
Sagi Grimberg 8242ddac1b nvmet: Don't queue fatal error work if csts.cfs is set
In the transport, in case of an interal queue error like
error completion in rdma we trigger a fatal error. However,
multiple queues in the same controller can serr error completions
and we don't want to trigger fatal error work more than once.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:51 +02:00
Christoph Hellwig 553cd9ef82 nvme-rdma: reject non-connect commands before the queue is live
If we reconncect we might have command queue up that get resent as soon
as the queue is restarted.  But until the connect command succeeded we
can't send other command.  Add a new flag that marks a queue as live when
connect finishes, and delay any non-connect command until the queue is
live based on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
[sagig: fixes admin queue LIVE setting]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:51 +02:00
Bart Van Assche fa14a0acea nvmet-rdma: Fix possible NULL deref when handling rdma cm events
When we initiate queue teardown sequence we call rdma_destroy_qp
which clears cm_id->qp, afterwards we call rdma_destroy_id, but
we might see a rdma_cm event in between with a cleared cm_id->qp
so watch out for that and silently ignore the event because this
means that the queue teardown sequence is in progress.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-11-14 02:08:50 +02:00
Matias Bjørling 409ae5a76e lightnvm: invalid offset calculation for lba_shift
The ns->lba_shift assumes its value to be the logarithmic of the
LA size. A previous patch duplicated the lba_shift calculation into
lightnvm. It prematurely also subtracted a 512byte shift, which commonly
is applied per-command. The 512byte shift being subtracted twice led to
data loss when restoring the logical to physical mapping table from
device and when issuing I/O commands using rrpc.

Fix offset by removing the 512byte shift subtraction when calculating
lba_shift.

Fixes: b0b4e09c1a "lightnvm: control life of nvm_dev in driver"
Reported-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 18:27:32 -07:00
Linus Torvalds ecd06f2883 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes that missed the merge window, mostly due to me being
  away around that time.

  Nothing major here, a mix of nvme cleanups and fixes, and one fix for
  the badblocks handling"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: use symbolic constants for CNS values
  nvme: use symbolic constants for CNS values
  nvme.h: add an enum for cns values
  nvme.h: don't use uuid_be
  nvme.h: resync with nvme-cli
  nvme: Add tertiary number to NVME_VS
  nvme : Add sysfs entry for NVMe CMBs when appropriate
  nvme: don't schedule multiple resets
  nvme: Delete created IO queues on reset
  nvme: Stop probing a removed device
  badblocks: fix overlapping check for clearing
2016-10-21 10:54:01 -07:00
Christoph Hellwig e9c9346e20 nvmet: use symbolic constants for CNS values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Christoph Hellwig fa60682677 nvme: use symbolic constants for CNS values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Christoph Hellwig a446c0840e nvme.h: resync with nvme-cli
Import a few updates to nvme.h from nvme-cli.  This mostly includes a few
new fields and error codes, but also a few renames that so far are only
used in user space.  Also one field is moved from an array of two le64
values to one of 16 u8 values so that we can more easily access it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Gabriel Krisman Bertazi 8ef2074d28 nvme: Add tertiary number to NVME_VS
NVMe 1.2.1 specification adds a tertiary element to the version number.
This updates the macro and its callers to include the final number and
fixup a single place in nvmet where the version was generated manually.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Stephen Bates 202021c1a6 nvme : Add sysfs entry for NVMe CMBs when appropriate
Add a sysfs attribute that contains salient information about the NVMe
Controller Memory Buffer when one is present. For now, just display the
information about the CMB available from the control registers. We attach
the CMB attribute file to the existing nvme_ctrl sysfs group so it can
handle the sysfs teardown.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 11:10:28 -06:00
Keith Busch c5f6ce97c1 nvme: don't schedule multiple resets
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:

	WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.

This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg<sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:24:39 -06:00
Keith Busch 7065906096 nvme: Delete created IO queues on reset
The driver was decrementing the online_queues prior to attempting to
delete those IO queues, so the driver ended up not requesting the
controller delete any. This patch saves the online_queues prior to
suspending them, and adds that parameter for deleting io queues.

Fixes: c21377f8 ("nvme: Suspend all queues before deletion")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:22:16 -06:00
Keith Busch 0df1e4f5e0 nvme: Stop probing a removed device
There is no reason the nvme controller can ever return all 1's from
reading the CSTS register. This patch returns an error if we observe
that status. Without this, we may incorrectly proceed with controller
initialization and unnecessarilly rely on error handling to clean this.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 08:40:13 -06:00
Mauricio Faria de Oliveira 2b6b535d91 nvme: use the DMA_ATTR_NO_WARN attribute
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Linus Torvalds 12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Linus Torvalds b9044ac829 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
Linus Torvalds 513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Andy Lutomirski 1a6fe74dfd nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are no currently callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-24 10:56:26 -06:00
Andy Lutomirski 26501db8dc nvme/scsi: Remove power management support
As far as I can tell, there is basically nothing correct about this
code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse.  It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.

Since this code appears to be useless, just delete it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-24 10:56:25 -06:00
Alexander Solganik 2e5d0baa04 nvmet: Make dsm number of ranges zero based
This caused the nvmet request data length to be
incorrect.

Signed-off-by: Alexander Solganik <sashas@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-09-23 15:37:38 -07:00
Sagi Grimberg 9b349b080c nvmet: Use direct IO for writes
We're designed to work with high-end devices where
direct IO makes perfect sense. We noticed that we
context switch by scheduling kblockd instead of going
directly to the device without REQ_SYNC for writes.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2016-09-23 15:37:38 -07:00