We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have updated limits after calling wbt_set_min_lat(). No need to
update again.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The default hard disk param sets latency targets at 50ms. As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.
This is excessively low. A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.
Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some IOs may span multiple periods. As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate. nr_lagging tracks these IOs to
avoid those situations. Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.
This has the following two problems.
* When latency target percentiles aren't set, vrate adjustment should
only be governed by queue depth depletion; however, the current code
keeps nr_lagging active which pulls in latency results and can keep
down vrate unexpectedly.
* When lagging condition is detected, it resets the entire negative
busy_level. This turned out to be way too aggressive on some
devices which sometimes experience extended latencies on a small
subset of commands. In addition, a lagging IO will be accounted as
latency target miss on completion anyway and resetting busy_level
amplifies its impact unnecessarily.
This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero. busy_level turning to zero can sometimes
be as interesting an event. This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cecf5d87ff ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.
Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.
The biggest issue is that commit cecf5d87ff assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.
Fixes the issue by not releasing queue's sysfs lock during switching
elevator.
Fixes: cecf5d87ff ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit c48dac137a ("block: don't hold q->sysfs_lock in elevator_init_mq")
removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
lockdep_assert_held() called in blk_mq_sched_free_requests() which is
run in failure path of elevator_init_mq().
blk_mq_sched_free_requests() is called in the following 3 functions:
elevator_init_mq()
elevator_exit()
blk_cleanup_queue()
In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
by 'mutex_lock(&q->sysfs_lock)'.
So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
into elevator_exit() for fixing the report by syzbot.
Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
Fixed: c48dac137a ("block: don't hold q->sysfs_lock in elevator_init_mq")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more block updates from Jens Axboe:
"Some later additions that weren't quite done for the first pull
request, and also a few fixes that have arrived since.
This contains:
- Kill silly pktcdvd warning on attempting to register a non-scsi
passthrough device (me)
- Use symbolic constants for the block t10 protection types, and
switch to handling it in core rather than in the drivers (Max)
- libahci platform missing node put fix (Nishka)
- Small series of fixes for BFQ (Paolo)
- Fix possible nbd crash (Xiubo)"
* tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
block: drop device references in bsg_queue_rq()
block: t10-pi: fix -Wswitch warning
pktcdvd: remove warning on attempting to register non-passthrough dev
ata: libahci_platform: Add of_node_put() before loop exit
nbd: fix possible page fault for nbd disk
nbd: rename the runtime flags as NBD_RT_ prefixed
block, bfq: push up injection only after setting service time
block, bfq: increase update frequency of inject limit
block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
block, bfq: update inject limit only after injection occurred
block: centralize PI remapping logic to the block layer
block: use symbolic constants for t10_pi type
Make sure that bsg_queue_rq() calls put_device() if an error is
encountered after get_device() was successful.
Fixes: cd2f076f1d ("bsg: convert to use blk-mq")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Changing the switch() statement to symbolic constants made the compiler
(at least clang-9, did not check gcc) notice that there is one enum value
that is not handled here:
block/t10-pi.c:62:11: error: enumeration value 'T10_PI_TYPE0_PROTECTION'
not handled in switch [-Werror,-Wswitch]
Add a BUG_ON statement if we ever get to t10_pi_verify function with
TYPE0 and replace the switch() statement with if/else clause for the
valid types.
Fixes: 9b2061b1a262 ("block: use symbolic constants for t10_pi type")
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull dma-mapping updates from Christoph Hellwig:
- add dma-mapping and block layer helpers to take care of IOMMU merging
for mmc plus subsequent fixups (Yoshihiro Shimoda)
- rework handling of the pgprot bits for remapping (me)
- take care of the dma direct infrastructure for swiotlb-xen (me)
- improve the dma noncoherent remapping infrastructure (me)
- better defaults for ->mmap, ->get_sgtable and ->get_required_mask
(me)
- cleanup mmaping of coherent DMA allocations (me)
- various misc cleanups (Andy Shevchenko, me)
* tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits)
mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE
mmc: queue: Fix bigger segments usage
arm64: use asm-generic/dma-mapping.h
swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page
swiotlb-xen: simplify cache maintainance
swiotlb-xen: use the same foreign page check everywhere
swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable
xen: remove the exports for xen_{create,destroy}_contiguous_region
xen/arm: remove xen_dma_ops
xen/arm: simplify dma_cache_maint
xen/arm: use dev_is_dma_coherent
xen/arm: consolidate page-coherent.h
xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance
arm: remove wrappers for the generic dma remap helpers
dma-mapping: introduce a dma_common_find_pages helper
dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
vmalloc: lift the arm flag for coherent mappings to common code
dma-mapping: provide a better default ->get_required_mask
dma-mapping: remove the dma_declare_coherent_memory export
remoteproc: don't allow modular build
...
If equal to 0, the injection limit for a bfq_queue is pushed to 1
after a first sample of the total service time of the I/O requests of
the queue is computed (to allow injection to start). Yet, because of a
mistake in the branch that performs this action, the push may happen
also in some other case. This commit fixes this issue.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The update period of the injection limit has been tentatively set to
100 ms, to reduce fluctuations. This value however proved to cause,
occasionally, the limit to be decremented for some bfq_queue only
after the queue underwent excessive injection for a lot of time. This
commit reduces the period to 10 ms.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Upon an increment attempt of the injection limit, the latter is
constrained not to become higher than twice the maximum number
max_rq_in_driver of I/O requests that have happened to be in service
in the drive. This high bound allows the injection limit to grow
beyond max_rq_in_driver, which may then cause max_rq_in_driver itself
to grow.
However, since the limit is incremented by only one unit at a time,
there is no need for such a high bound, and just max_rq_in_driver+1 is
enough.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ updates the injection limit of each bfq_queue as a function of how
much the limit inflates the service times experienced by the I/O
requests of the queue. So only service times affected by injection
must be taken into account. Unfortunately, in the current
implementation of this update scheme, the service time of an I/O
request rq not affected by injection may happen to be considered in
the following case: there is no I/O request in service when rq
arrives.
This commit fixes this issue by making sure that only service times
affected by injection are considered for updating the injection
limit. In particular, the service time of an I/O request rq is now
considered only if at least one of the following two conditions holds:
- the destination bfq_queue for rq underwent injection before rq
arrival, and there is still I/O in service in the drive on rq arrival
(the service of such unfinished I/O may delay the service of rq);
- injection occurs between the arrival and the completion time of rq.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently t10_pi_prepare/t10_pi_complete functions are called during the
NVMe and SCSi layers command preparetion/completion, but their actual
place should be the block layer since T10-PI is a general data integrity
feature that is used by block storage protocols. Introduce .prepare_fn
and .complete_fn callbacks within the integrity profile that each type
can implement according to its needs.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
isn't defined in the config.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace all hard-coded values with T10_PI_TYPES to make the code more
readable.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
In __blk_mq_end_request() if block stats needs update, we should
ensure now is valid instead of 0 even when iostat is disabled.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.
We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.
Fixes: 4bc6339a58 ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 795fe54c2a ("bfq: Add per-device weight"), bfq uses
blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it
causes linkage error if bfq compiled as a module.
Fixes: 795fe54c2a ("bfq: Add per-device weight")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cecf5d87ff ("block: split .sysfs_lock into two locks") starts to
release & actuire sysfs_lock again during switching elevator. So it
isn't enough to prevent switching elevator from happening by simply
clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because
in-progress switch still can move on after re-acquiring the lock,
meantime the flag of QUEUE_FLAG_REGISTERED won't get checked.
Fixes this issue by checking 'q->elevator' directly & locklessly after
q->kobj is removed in blk_unregister_queue(), this way is safe because
q->elevator can't be changed at that time.
Fixes: cecf5d87ff ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some devices may skip blk_pm_runtime_init() and have null pointer
in its request_queue->dev. For example, SCSI devices of UFS Well-Known
LUNs.
Currently the null pointer is checked by the user of
blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to
check it by blk_set_runtime_active() itself instead of by its users.
Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>