This was originally set to 4 times the IO timeout, but that was when
the IO timeout was 5 seconds instead of 30. 20 seconds for total time
to failure seemed more reasonable than 2 minutes for most, but other
users have requested to make this a module parameter instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod'
can fail. So check the return value.
Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Quiesce and shutdown the device prior to reset, then restart the device and
resume IO after.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Since _nvme_check_size() wasn't being called from anywhere, the compiler
was optimising it away ... along with all the link-time build failures
that would result if any of the structures were the wrong size. Call it
from nvme_exit() for no particular reason.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
It is possible a filesystem may send a flush flagged bio with write
data. There is no such composite NVMe command, so the driver sends flush
and write separately.
The device is allowed to execute these commands in any order, so it was
possible the driver ends the bio after the write completes, but while the
flush is still active. We don't want to let a filesystem believe flush
succeeded before it really has; this could cause data corruption on a
power loss between these events. To fix, this patch splits the flush
and write into chained bios.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This configures an nvme request_queue as flush capable if the device
has a volatile write cache present.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Adding tracepoints for bio_complete and block_split into nvme to help
with gathering IO info using blktrace and blkparse.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
If a misbehaving device posts a CQE with a command id < depth but for
one that was never allocated, the command info will have a callback
function set to NULL and we don't want to try invoking that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Help people diagnose what is going wrong at initialisation time by
printing out which command has gone wrong and what the device returned.
Also fix the error message printed while waiting for reset.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Make the copyright dates accurate and remove the final paragraph that
includes the address of the FSF.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Pull NVMe driver updates from Matthew Wilcox:
"Various updates to the NVMe driver. The most user-visible change is
that drive hotplugging now works and CPU hotplug while an NVMe drive
is installed should also work better"
* git://git.infradead.org/users/willy/linux-nvme:
NVMe: Retry failed commands with non-fatal errors
NVMe: Add getgeo to block ops
NVMe: Start-stop nvme_thread during device add-remove.
NVMe: Make I/O timeout a module parameter
NVMe: CPU hot plug notification
NVMe: per-cpu io queues
NVMe: Replace DEFINE_PCI_DEVICE_TABLE
NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds
NVMe: IOCTL path RCU protect queue access
NVMe: RCU protected access to io queues
NVMe: Initialize device reference count earlier
NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions
For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.
Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.
The nvme_iod conviently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Increase the default timeout to 30 seconds to match SCSI.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[use byte instead of ushort]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Registers with hot cpu notification to rebalance, and potentially allocate
additional, io queues.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
The device's IO queues are associated with CPUs, so we can use a per-cpu
variable to map the a qid to a cpu. This provides a convienient way
to optimally assign queues to multiple cpus when the device supports
fewer queues than the host has cpus. The previous implementation may
have assigned these poorly in these situations. This patch addresses
this by sharing queues among cpus that are "close" together and should
have a lower lock contention penalty.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Pull block driver update from Jens Axboe:
"On top of the core pull request, here's the pull request for the
driver related changes for 3.15. It contains:
- Improvements for msi-x registration for block drivers (mtip32xx,
skd, cciss, nvme) from Alexander Gordeev.
- A round of cleanups and improvements for drbd from Andreas
Gruenbacher and Rashika Kheria.
- A round of clanups and improvements for bcache from Kent.
- Removal of sleep_on() and friends in DAC960, ataflop, swim3 from
Arnd Bergmann.
- Bug fix for a bug in the mtip32xx async completion code from Sam
Bradshaw.
- Bug fix for accidentally bouncing IO on 32-bit platforms with
mtip32xx from Felipe Franciosi"
* 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits)
bcache: remove nested function usage
bcache: Kill bucket->gc_gen
bcache: Kill unused freelist
bcache: Rework btree cache reserve handling
bcache: Kill btree_io_wq
bcache: btree locking rework
bcache: Fix a race when freeing btree nodes
bcache: Add a real GC_MARK_RECLAIMABLE
bcache: Add bch_keylist_init_single()
bcache: Improve priority_stats
bcache: Better alloc tracepoints
bcache: Kill dead cgroup code
bcache: stop moving_gc marking buckets that can't be moved.
bcache: Fix moving_pred()
bcache: Fix moving_gc deadlocking with a foreground write
bcache: Fix discard granularity
bcache: Fix another bug recovering from unclean shutdown
bcache: Fix a bug recovering from unclean shutdown
bcache: Fix a journalling reclaim after recovery bug
bcache: Fix a null ptr deref in journal replay
...
Checkpatch has started warning against using DEFINE_PCI_DEVICE_TABLE,
so replace it. Also update the copyright date and bump the module
version number to 0.9.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to a queue in the nvme IOCTL path
to fix potential races between a surprise removal and queue usage in
nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
the nvme_queue from freeing while this path is executing so it can't
sleep, and so this path will no longer wait for a available command
id should they all be in use at the time a passthrough IOCTL request
is received.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to nvme_queue to fix a race between a
surprise removal freeing the queue and a thread with open reference on
a NVMe block device using that queue.
The queues do not need to be rcu protected during the initialization or
shutdown parts, so I've added a helper function for raw deferencing
to get around the sparse errors.
There is still a hole in the IOCTL path for the same problem, which is
fixed in a subsequent patch.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
If an NVMe device becomes ready but fails to create IO queues, the driver
creates a character device handle so the device can be managed. The
device reference count needs to be initialized before creating the
character device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.
drivers/block/nvme-core.c:2541:12: warning: 'nvme_suspend' defined but not used [-Wunused-function]
drivers/block/nvme-core.c:2550:12: warning: 'nvme_resume' defined but not used [-Wunused-function]
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
PREPARE_[DELAYED_]WORK() are being phased out. They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.
nvme_dev->reset_work is multiplexed with multiple work functions.
Introduce nvme_reset_workfn() which invokes nvme_dev->reset_workfn and
always use it as the work function and update the users to set the
->reset_workfn field instead of overriding the work function using
PREPARE_WORK().
It would probably be best to route this with other related updates
through the workqueue tree.
Compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-nvme@lists.infradead.org