Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Race conditions are theoretically possible between the NVMe PCI device
removal and the generic PCI bus rescan and device removal that can be
triggered via sysfs.
To avoid those race conditions make the NVMe code use
pci_stop_and_remove_bus_device_locked().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is a minor refactor for handling devices that are incapable of IO.
The driver previously used special error codes to know that IO queues
are unavailable, but we have an online queue count now.
This also fixes an issue where the driver successfully sets the queue
count, but either is unable to allocate an IO queue or the device can't
create one for some reason.
If the driver can successfully enable the device and get responses to
admin commands, the driver will bring up a character device for managment
but not create block devices.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Change the behavior of nvme_enable_ctrl to set EN.
Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl.
Remove reading of the CC register and manage the state in
dev->ctrl_config.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
[removed an unwanted write to CC]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the zeroing function instead of dma_alloc_coherent & memset(,0,)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.
Historically, all block devices have automatically made entropy
contributions. But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
- On SSD disks, the completion times aren't as random as they
are for rotational drives. So it's questionable whether they
should contribute to the random pool in the first place.
- Calling add_disk_randomness() has a lot of overhead.
There are more reliable sources for randomness than non-rotational block
devices. From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull NVMe update from Matthew Wilcox:
"Mostly bugfixes again for the NVMe driver. I'd like to call out the
exported tracepoint in the block layer; I believe Keith has cleared
this with Jens.
We've had a few reports from people who're really pounding on NVMe
devices at scale, hence the timeout changes (and new module
parameters), hotplug cpu deadlock, tracepoints, and minor performance
tweaks"
[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
end up going away when mq conversion happens ]
* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
NVMe: Use Log Page constants in SCSI emulation
NVMe: Define Log Page constants
NVMe: Fix hot cpu notification dead lock
NVMe: Rename io_timeout to nvme_io_timeout
NVMe: Use last bytes of f/w rev SCSI Inquiry
NVMe: Adhere to request queue block accounting enable/disable
NVMe: Fix nvme get/put queue semantics
NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
NVMe: Make admin timeout a module parameter
NVMe: Make iod bio timeout a parameter
NVMe: Prevent possible NULL pointer dereference
NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
NVMe: Update data structures for NVMe 1.2
NVMe: Enable BUILD_BUG_ON checks
NVMe: Update namespace and controller identify structures to the 1.1a spec
NVMe: Flush with data support
NVMe: Configure support for block flush
NVMe: Add tracepoints
NVMe: Protect against badly formatted CQEs
...
There is a potential dead lock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.
The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
It's positively immoral to have a global variable called 'io_timeout'.
Keep the module parameter called io_timeout, though.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Recently, a new sysfs control "iostats" was added to selectively
enable or disable io statistics collection for request queues. This
patch hooks that control.
IO statistics collection is rather expensive on large, multi-node
machines with drives pushing millions of iops. Having the ability to
disable collection if not needed can improve throughput significantly.
As a data point, on a quad E5-4640, I see more than 50% throughput
improvement when io statistics accounting is disabled during heavily
multi-threaded small block random read benchmarks where device
performance is in the million iops+ range.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
The routines to get and lock nvme queues required the caller to "put"
or "unlock" them even if getting one returned NULL. This patch fixes that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This was originally set to 4 times the IO timeout, but that was when
the IO timeout was 5 seconds instead of 30. 20 seconds for total time
to failure seemed more reasonable than 2 minutes for most, but other
users have requested to make this a module parameter instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod'
can fail. So check the return value.
Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Quiesce and shutdown the device prior to reset, then restart the device and
resume IO after.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Since _nvme_check_size() wasn't being called from anywhere, the compiler
was optimising it away ... along with all the link-time build failures
that would result if any of the structures were the wrong size. Call it
from nvme_exit() for no particular reason.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
It is possible a filesystem may send a flush flagged bio with write
data. There is no such composite NVMe command, so the driver sends flush
and write separately.
The device is allowed to execute these commands in any order, so it was
possible the driver ends the bio after the write completes, but while the
flush is still active. We don't want to let a filesystem believe flush
succeeded before it really has; this could cause data corruption on a
power loss between these events. To fix, this patch splits the flush
and write into chained bios.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This configures an nvme request_queue as flush capable if the device
has a volatile write cache present.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Adding tracepoints for bio_complete and block_split into nvme to help
with gathering IO info using blktrace and blkparse.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
If a misbehaving device posts a CQE with a command id < depth but for
one that was never allocated, the command info will have a callback
function set to NULL and we don't want to try invoking that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Help people diagnose what is going wrong at initialisation time by
printing out which command has gone wrong and what the device returned.
Also fix the error message printed while waiting for reset.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Make the copyright dates accurate and remove the final paragraph that
includes the address of the FSF.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>