Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
Pull device mapper updates from Mike Snitzer:
- Significant refactoring and fixing of how DM core does bio-based IO
accounting with focus on fixing wildly inaccurate IO stats for
dm-crypt (and other DM targets that defer bio submission in their own
workqueues). End result is proper IO accounting, made possible by
targets being updated to use the new dm_submit_bio_remap() interface.
- Add hipri bio polling support (REQ_POLLED) to bio-based DM.
- Reduce dm_io and dm_target_io structs so that a single dm_io (which
contains dm_target_io and first clone bio) weighs in at 256 bytes.
For reference the bio struct is 128 bytes.
- Various other small cleanups, fixes or improvements in DM core and
targets.
- Update MAINTAINERS with my kernel.org email address to allow
distinction between my "upstream" and "Red" Hats.
* tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits)
dm: consolidate spinlocks in dm_io struct
dm: reduce size of dm_io and dm_target_io structs
dm: switch dm_target_io booleans over to proper flags
dm: switch dm_io booleans over to proper flags
dm: update email address in MAINTAINERS
dm: return void from __send_empty_flush
dm: factor out dm_io_complete
dm cache: use dm_submit_bio_remap
dm: simplify dm_sumbit_bio_remap interface
dm thin: use dm_submit_bio_remap
dm: add WARN_ON_ONCE to dm_submit_bio_remap
dm: support bio polling
block: add ->poll_bio to block_device_operations
dm mpath: use DMINFO instead of printk with KERN_INFO
dm: stop using bdevname
dm-zoned: remove the ->name field in struct dmz_dev
dm: remove unnecessary local variables in __bind
dm: requeue IO if mapping table not yet available
dm io: remove stale comment block for dm_io()
dm thin metadata: remove unused dm_thin_remove_block and __remove
...
Prepare for supporting IO polling for bio-based driver.
Add ->poll_bio callback so that bio-based driver can provide their own
logic for polling bio.
Also fix ->submit_bio_bio typo in comment block above
__submit_bio_noacct.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
First code becomes more clean by switching to xarray from plain array.
Second use-after-free on q->queue_hw_ctx can be fixed because
queue_for_each_hw_ctx() may be run when updating nr_hw_queues is
in-progress. With this patch, q->hctx_table is defined as xarray, and
this structure will share same lifetime with request queue, so
queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add sysfs files that expose the inline encryption capabilities of
request queues:
/sys/block/$disk/queue/crypto/max_dun_bits
/sys/block/$disk/queue/crypto/modes/$mode
/sys/block/$disk/queue/crypto/num_keyslots
Userspace can use these new files to decide what encryption settings to
use, or whether to use inline encryption at all. This also brings the
crypto capabilities in line with the other queue properties, which are
already discoverable via the queue directory in sysfs.
Design notes:
- Place the new files in a new subdirectory "crypto" to group them
together and to avoid complicating the main "queue" directory. This
also makes it possible to replace "crypto" with a symlink later if
we ever make the blk_crypto_profiles into real kobjects (see below).
- It was necessary to define a new kobject that corresponds to the
crypto subdirectory. For now, this kobject just contains a pointer
to the blk_crypto_profile. Note that multiple queues (and hence
multiple such kobjects) may refer to the same blk_crypto_profile.
An alternative design would more closely match the current kernel
data structures: the blk_crypto_profile could be a kobject itself,
located directly under the host controller device's kobject, while
/sys/block/$disk/queue/crypto would be a symlink to it.
I decided not to do that for now because it would require a lot more
changes, such as no longer embedding blk_crypto_profile in other
structures, and also because I'm not sure we can rule out moving the
crypto capabilities into 'struct queue_limits' in the future. (Even
if multiple queues share the same crypto engine, maybe the supported
data unit sizes could differ due to other queue properties.) It
would also still be possible to switch to that design later without
breaking userspace, by replacing the directory with a symlink.
- Use "max_dun_bits" instead of "max_dun_bytes". Currently, the
kernel internally stores this value in bytes, but that's an
implementation detail. It probably makes more sense to talk about
this value in bits, and choosing bits is more future-proof.
- "modes" is a sub-subdirectory, since there may be multiple supported
crypto modes, sysfs is supposed to have one value per file, and it
makes sense to group all the mode files together.
- Each mode had to be named. The crypto API names like "xts(aes)" are
not appropriate because they don't specify the key size. Therefore,
I assigned new names. The exact names chosen are arbitrary, but
they happen to match the names used in log messages in fs/crypto/.
- The "num_keyslots" file is a bit different from the others in that
it is only useful to know for performance reasons. However, it's
included as it can still be useful. For example, a user might not
want to use inline encryption if there aren't very many keyslots.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.
Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a method to notify the driver that the gendisk is about to be freed.
This allows drivers to tie the lifetime of their private data to that of
the gendisk and thus deal with device removal races without expensive
synchronization and boilerplate code.
A new flag is added so that ->free_disk is only called after a successful
call to add_disk, which significantly simplifies the error handling path
during probing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well. Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.
Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.
So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.
Kill the queue flag for this while at it, since we can just check the
assigned pointer now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block inode sync updates from Jens Axboe:
"This contains improvements to how bdev inode syncing is handled,
unifying the API"
* tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block:
block: simplify the block device syncing code
ntfs3: use sync_blockdev_nowait
fat: use sync_blockdev_nowait
btrfs: use sync_blockdev
xen-blkback: use sync_blockdev
block: remove __sync_blockdev
fs: remove __sync_filesystem
Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe:
"This contains a series leading to the removal of the
QUEUE_FLAG_SCSI_PASSTHROUGH queue flag"
* tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block:
block: remove blk_{get,put}_request
block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
block: remove the initialize_rq_fn blk_mq_ops method
scsi: add a scsi_alloc_request helper
bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
sd: implement ->get_unique_id
block: add a ->get_unique_id method