Pull block fixes from Jens Axboe:
- MD pull request via Song:
- Fix for clustered raid (Guoqing Jiang)
- req_op fix (Bart Van Assche)
- Fix race condition in raid recreate (David Sloan)
- loop configuration overflow fix (Siddh)
- Fix missing commit_rqs call for certain conditions (Yu)
* tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block:
md: call __md_stop_writes in md_stop
Revert "md-raid: destroy the bitmap after destroying the thread"
md: Flush workqueue md_rdev_misc_wq in md_alloc()
md/raid10: Fix the data type of an r10_sync_page_io() argument
loop: Check for overflow while configuring loop
blk-mq: fix io hung due to missing commit_rqs
A race condition still exists when removing and re-creating md devices
in test cases. However, it is only seen on some setups.
The race condition was tracked down to a reference still being held
to the kobject by the rdev in the md_rdev_misc_wq which will be released
in rdev_delayed_delete().
md_alloc() waits for previous deletions by waiting on the md_misc_wq,
but the md_rdev_misc_wq may still be holding a reference to a recently
removed device.
To fix this, also flush the md_rdev_misc_wq in md_alloc().
Signed-off-by: David Sloan <david.sloan@eideticom.com>
[logang@deltatee.com: rewrote commit message]
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Fix the following sparse warning:
drivers/md/raid10.c:2647:60: sparse: sparse: incorrect type in argument 5 (different base types) @@ expected restricted blk_opf_t [usertype] opf @@ got int rw @@
This patch does not change any functionality since REQ_OP_READ = READ = 0
and since REQ_OP_WRITE = WRITE = 1.
Cc: Rong A Chen <rong.a.chen@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Fixes: 4ce4c73f66 ("md/core: Combine two sync_page_io() arguments")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Song Liu <song@kernel.org>
Pull device mapper fixes from Mike Snitzer:
- A few fixes for the DM verity and bufio changes in this merge window
- A smatch warning fix for DM writecache locking in writecache_map
* tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix some cases where the code sleeps with spinlock held
dm writecache: fix smatch warning about invalid return from writecache_map
dm verity: fix verity_parse_opt_args parsing
dm verity: fix DM_VERITY_OPTS_MAX value yet again
dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking
Commit b32d45824a ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
added a "NO_SLEEP" mode, it replaces a mutex with a spinlock, and it
is only usable when the device is in read-only mode (because the write
path may be sleeping while holding the dm_bufio_client lock).
However, there are still two points where the code could sleep even in
read-only mode. One is in __get_unclaimed_buffer -> __make_buffer_clean.
The other is in __try_evict_buffer -> __make_buffer_clean. These functions
will call __make_buffer_clean which sleeps if the buffer is being read.
Fix these cases so that if c->no_sleep is set __make_buffer_clean
will not be called and the buffer will be skipped instead.
Fixes: b32d45824a ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There's a smatch warning "inconsistent returns '&wc->lock'" in
dm-writecache. The reason for the warning is that writecache_map()
doesn't drop the lock on the impossible path.
Fix this warning by adding wc_unlock() after the BUG statement (so
that it will be compiled-away anyway).
Fixes: df699cc16e ("dm writecache: report invalid return from writecache_map helpers")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit df326e7a06 ("dm verity: allow optional args to alter primary
args handling") introduced a bug where verity_parse_opt_args() wouldn't
properly shift past an optional argument's additional params (by
ignoring them).
Fix this by avoiding returning with error if an unknown argument is
encountered when @only_modifier_opts=true is passed to
verity_parse_opt_args().
In practice this regressed the cryptsetup testsuite's FEC testing
because unknown optional arguments were encountered, wherey
short-circuiting ever testing FEC mode. With this fix all of the
cryptsetup testsuite's verity FEC tests pass.
Fixes: df326e7a06 ("dm verity: allow optional args to alter primary args handling")
Reported-by: Milan Broz <gmazyland@gmail.com>>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Must account for the possibility that "try_verify_in_tasklet" is used.
This is the same issue that was fixed with commit 160f99db94 -- it
is far too easy to miss that additional a new argument(s) require
bumping DM_VERITY_OPTS_MAX accordingly.
Fixes: 5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Historically none of the bufio code runs in interrupt context but with
the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit
5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
That said, the new tasklet usecase still doesn't require interrupts be
disabled by bufio (let alone conditionally restore them).
Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care
must be taken to properly synchronize between softirq and process
context, otherwise ABBA deadlock may occur. While it is unnecessary to
disable bottom-half preemption within a tasklet, we must consistently do
so in process context to ensure locking is in the proper order.
Fix these issues by switching from spin_lock_irq{save,restore} to using
spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member
in dm_bufio_client struct (that can be used unsafely if bufio must
recurse on behalf of some caller, e.g. block layer's submit_bio).
Fixes: 5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull more device mapper updates from Mike Snitzer:
- Add flags argument to dm_bufio_client_create and introduce
DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather
than mutex for its locking.
- Add optional "try_verify_in_tasklet" feature to DM verity target.
This feature gives users the option to improve IO latency by using a
tasklet to verify, using hashes in bufio's cache, rather than wait to
schedule a work item via workqueue. But if there is a bufio cache
miss, or an error, then the tasklet will fallback to using workqueue.
- Incremental changes to both dm-bufio and the DM verity target to use
jump_label to minimize cost of branching associated with the niche
"try_verify_in_tasklet" feature. DM-bufio in particular is used by
quite a few other DM targets so it doesn't make sense to incur
additional bufio cost in those targets purely for the benefit of this
niche verity feature if the feature isn't ever used.
- Optimize verity_verify_io, which is used by both workqueue and
tasklet based verification, if FEC is not configured or tasklet based
verification isn't used.
- Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE
flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if
"try_verify_in_tasklet" is specified.
* tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet"
dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND
dm verity: only copy bvec_iter in verity_verify_io if in_tasklet
dm verity: optimize verity_verify_io if FEC not configured
dm verity: conditionally enable branching for "try_verify_in_tasklet"
dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP
dm verity: allow optional args to alter primary args handling
dm verity: Add optional "try_verify_in_tasklet" feature
dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag
dm bufio: Add flags argument to dm_bufio_client_create
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
Pull block driver updates from Jens Axboe:
- NVMe pull requests via Christoph:
- add support for In-Band authentication (Hannes Reinecke)
- handle the persistent internal error AER (Michael Kelley)
- use in-capsule data for TCP I/O queue connect (Caleb Sander)
- remove timeout for getting RDMA-CM established event (Israel
Rukshin)
- misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni,
Guixin Liu, Xiang wangx)
- use command_id instead of req->tag in trace_nvme_complete_rq()
(Bean Huo)
- various fixes for the new authentication code (Lukas Bulwahn,
Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes
Reinecke)
- small cleanups (Liu Song, Christoph Hellwig)
- restore compat_ioctl support (Nick Bowler)
- make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg)
- enable generic interface (/dev/ngXnY) for unknown command sets
(Joel Granados, Christoph Hellwig)
- don't always build constants.o (Christoph Hellwig)
- print the command name of aborted commands (Christoph Hellwig)
- MD pull requests via Song:
- Improve raid5 lock contention, by Logan Gunthorpe.
- Misc fixes to raid5, by Logan Gunthorpe.
- Fix race condition with md_reap_sync_thread(), by Guoqing Jiang.
- Fix potential deadlock with raid5_quiesce and
raid5_get_active_stripe, by Logan Gunthorpe.
- Refactoring md_alloc(), by Christoph"
- Fix md disk_name lifetime problems, by Christoph Hellwig
- Convert prepare_to_wait() to wait_woken() api, by Logan
Gunthorpe;
- Fix sectors_to_do bitmap issue, by Logan Gunthorpe.
- Work on unifying the null_blk module parameters and configfs API
(Vincent)
- drbd bitmap IO error fix (Lars)
- Set of rnbd fixes (Guoqing, Md Haris)
- Remove experimental marker on bcache async device registration (Coly)
- Series from cleaning up the bio splitting (Christoph)
- Removal of the sx8 block driver. This hardware never really
widespread, and it didn't receive a lot of attention after the
initial merge of it back in 2005 (Christoph)
- A few fixes for s390 dasd (Eric, Jiang)
- Followup set of fixes for ublk (Ming)
- Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang)
- Fixes for the dio dma alignment (Keith)
- Misc fixes and cleanups (Ming, Yu, Dan, Christophe
* tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits)
s390/dasd: Establish DMA alignment
s390/dasd: drop unexpected word 'for' in comments
ublk_drv: add support for UBLK_IO_NEED_GET_DATA
ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA
ublk_drv: cleanup ublksrv_ctrl_dev_info
ublk_drv: add SET_PARAMS/GET_PARAMS control command
ublk_drv: fix ublk device leak in case that add_disk fails
ublk_drv: cancel device even though disk isn't up
block: fix leaking page ref on truncated direct io
block: ensure bio_iov_add_page can't fail
block: ensure iov_iter advances for added pages
drivers:md:fix a potential use-after-free bug
md/raid5: Ensure batch_last is released before sleeping for quiesce
md/raid5: Move stripe_request_ctx up
md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()
md/raid5: Make is_inactive_blocked() helper
md/raid5: Refactor raid5_get_active_stripe()
block: pass struct queue_limits to the bio splitting helpers
block: move bio_allowed_max_sectors to blk-merge.c
block: move the call to get_max_io_size out of blk_bio_segment_split
...
Allow verify_wq to preempt softirq since verification in tasklet will
fall-back to using it for error handling (or if the bufio cache
doesn't have required hashes).
Suggested-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Avoid extra bvec_iter copy unless it is needed to allow retrying
verification, that failed from a tasklet, from a workqueue.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Only declare and copy bvec_iter if CONFIG_DM_VERITY_FEC is defined and
FEC enabled for the verity device.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Use jump_label to limit the need for branching unless the optional
"try_verify_in_tasklet" feature is used.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The previous commit ("dm verity: Add optional "try_verify_in_tasklet"
feature") imposed that CRYPTO_ALG_ASYNC mask be used even if the
optional "try_verify_in_tasklet" feature was not specified. This was
because verity_parse_opt_args() was called after handling the primary
args (due to it having data dependencies on having first parsed all
primary args).
Enhance verity_ctr() so that simple optional args, that don't have a
data dependency on primary args parsing, can alter how the primary
args are handled. In practice this means verity_parse_opt_args() gets
called twice. First with the new 'only_modifier_opts' arg set to true,
then again with it set to false _after_ parsing all primary args.
This allows the v->use_tasklet flag to be properly set and then used
when verity_ctr() parses the primary args and then calls
crypto_alloc_ahash() with CRYPTO_ALG_ASYNC conditionally set.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Using tasklets for disk verification can reduce IO latency. When there
are accelerated hash instructions it is often better to compute the
hash immediately using a tasklet rather than deferring verification to
a work-queue. This reduces time spent waiting to schedule work-queue
jobs, but requires spending slightly more time in interrupt context.
If the dm-bufio cache does not have the required hashes we fallback to
the work-queue implementation. FEC is only possible using work-queue
because code to support the FEC feature may sleep.
The following shows a speed comparison of random reads on a dm-verity
device. The dm-verity device uses a 1G ramdisk for data and a 1G
ramdisk for hashes. One test was run using tasklets and one test was
run using the existing work-queue solution. Both tests were run when
the dm-bufio cache was hot. The tasklet implementation performs
significantly better since there is no time spent waiting for
work-queue jobs to be scheduled.
READ: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s),
io=512MiB (537MB), run=2827-2827msec
READ: bw=23.6MiB/s (24.8MB/s), 23.6MiB/s-23.6MiB/s (24.8MB/s-24.8MB/s),
io=512MiB (537MB), run=21688-21688msec
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and
may cause sh to be released. However, sh is subsequently used in lines
2886 "if (sh->batch_head && sh != sh->batch_head)". This may result in an
use-after-free bug.
It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of
the function.
Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A race condition exists where if raid5_quiesce() is called in the
middle of a request that has set batch_last, it will deadlock.
batch_last will hold a reference to a stripe when raid5_quiesce() is
called. This will cause the next raid5_get_active_stripe() call to
sleep waiting for the quiesce to finish, but the raid5_quiesce() thread
will wait for active_stripes to go to zero which will never happen
because request thread is waiting for the quiesce to stop.
Fix this by creating a special __raid5_get_active_stripe() function
which takes the request context and clears the last_batch before
sleeping.
While we're at it, change the arguments of raid5_get_active_stripe()
to bools.
Fixes: 3312e6c887 ("md/raid5: Keep a reference to last stripe_head for batch")
Reported-by: David Sloan <David.Sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move stripe_request_ctx up. No functional changes intended.
This will be necessary in the next patch to release the batch_last
in the context before sleeping.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>