The way NVMe uses this field is entirely different from the older
SCSI/BLOCK_PC usage, so move it into struct nvme_request.
Also reduce the size of the file to a unsigned char so that we leave
space for additional smaller fields that will appear soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't pass the status explicitly but derive it from the requeust,
and unwind the complex condition to be more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command. We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
This driver is for pre-IDE hardisk that are only found in PC from the
stoneage of personal computing, and which we don't support elsewhere
in the kernel these days.
It's also been marked broken forever.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
postgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:
8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
8,0 3 0 8.354386476 0 m N cfq schedule dispatch
8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
...
8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
...
8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
8,0 1 3831 8.389908102 5839 A W 145978336 + 24 <- (8,4) 44000
8,0 1 3832 8.389910477 5839 Q W 145978336 + 24 [kworker/u9:2]
8,0 1 3833 8.389914248 5839 I W 145962968 + 24 (28146) [kworker/u9:2]
8,0 1 0 8.389919137 0 m N cfq293A / insert_request
8,0 1 0 8.389924305 0 m N cfq293A / add_to_rr
8,0 1 3834 8.389933175 5839 UT N [kworker/u9:2] 1
...
8,0 0 0 9.455290997 0 m N cfq workload slice:40000000
8,0 0 0 9.455294769 0 m N cfq293A / set_active wl_class:0 wl_type:0
8,0 0 0 9.455303499 0 m N cfq293A / fifo=ffff880003166090
8,0 0 0 9.455306851 0 m N cfq293A / dispatch_insert
8,0 0 0 9.455311251 0 m N cfq293A / dispatched a request
8,0 0 0 9.455314324 0 m N cfq293A / activate rq, drv=1
8,0 0 2043 9.455316210 6204 D W 145962968 + 24 (1065401962) [pgioperf]
8,0 0 0 9.455392407 0 m N cfq293A / Not idling. st->count:1
8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
...
8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
...
8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
8,0 1 4373 9.455730159 5839 A W 145986616 + 24 <- (8,4) 52280
8,0 1 4374 9.455732674 5839 Q W 145986616 + 24 [kworker/u9:2]
8,0 1 4375 9.455737563 5839 I W 145978336 + 24 (27448) [kworker/u9:2]
8,0 1 0 9.455742871 0 m N cfq293A / insert_request
8,0 1 0 9.455747550 0 m N cfq293A / add_to_rr
8,0 1 4376 9.455756629 5839 UT N [kworker/u9:2] 1
So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.
Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
a read throughput which roughly matches readers/writers ratio and
overall the system stalls are much shorter.
Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.
I have tested this patch with the null block device driver with the following
parameters.
null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1
I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.
I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This avoids duplicating the logic four times, and it also allows to keep
some helpers static in core.c or just opencode them.
Note that this loses printing the aborted status on completions in the
PCI driver as that uses a data structure not available any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
A requeue means we go through nvme_fc_start_fcp_op again and get
another controller reference. To make sure the refcount doesn't
leak we also need to drop it for every completion that came from
the LLDD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way our max retry limit holds as well.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way our max retry limit holds as well.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way our max retry limit holds as well.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
As Dan Carpenter pointed out: mixing 16-bit nvme status with 32-bit
error status from driver. Corrected comment on fcp request struct
status field, and converted done routine to explicitly set nvme status
codes for nvme status.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Update FC-NVME definitions to match FC-NVME r1.14 (16-020vB) plus
change voted in by 2/22 FC-NVME Adhoc (see HOSTID below).
Includes the following:
- Addition of "status_code" field to ERSP IU
- Addition of FC-NVME LS RJT reason_codes and reason_explanations
- CreateAssociation payload, HostID field shortened to 16 bytes
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Before scheduling a reconnect attempt, check
nr_reconnects against max_reconnects, if not
exhausted (or max_reconnects is not -1), schedule
a reconnect attempts, otherwise schedule ctrl
removal.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a host sense that its controller session is damaged,
it tries to re-establish it periodically (reconnect every
reconnect_delay). It may very well be that the controller
is gone and never coming back, in this case the host will
try to reconnect forever.
Add a ctrl_loss_tmo to bound the number of reconnect attempts
to a specific controller (default to a reasonable 10 minutes).
The timeout configuration is actually translated into number of
reconnect attempts and not a schedule on its own but rather
divided with reconnect_delay. This is useful to prevent
racing flows of remove and reconnect, and it doesn't really
matter if we remove slightly sooner than what the user requested.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
we already have it in opts.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
If nvmf_register_transport happened to fail
(it can't, but theoretically) we leak memory.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>