If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Fixes: 63581af3f3 ("blk-mq: remove non-blocking pass in blk_mq_map_request")
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes one issue reported by Kent, which can
be triggered in bcachefs over sata disk. Actually it
is a generic issue in block flush vs. blk-tag.
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When bandblocks_set acknowledges a range or badblocks_clear a range,
it's possible all badblocks are acknowledged. We should update
unacked_exist if this occurs.
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Tested-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A set of fixes that missed the merge window, mostly due to me being
away around that time.
Nothing major here, a mix of nvme cleanups and fixes, and one fix for
the badblocks handling"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: use symbolic constants for CNS values
nvme: use symbolic constants for CNS values
nvme.h: add an enum for cns values
nvme.h: don't use uuid_be
nvme.h: resync with nvme-cli
nvme: Add tertiary number to NVME_VS
nvme : Add sysfs entry for NVMe CMBs when appropriate
nvme: don't schedule multiple resets
nvme: Delete created IO queues on reset
nvme: Stop probing a removed device
badblocks: fix overlapping check for clearing
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
Pull cgroup updates from Tejun Heo:
- tracepoints for basic cgroup management operations added
- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()
- non-critical bug fixes
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()
Current bad block clear implementation assumes the range to clear
overlaps with at least one bad block already stored. If given range to
clear precedes first bad block in a list, the first entry is incorrectly
updated.
Check not only if stored block end is past clear block end but also if
stored block start is before clear block end.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Patch series "fallocate for block devices", v11.
This is a patchset to fix page cache coherency with BLKZEROOUT and
implement fallocate for block devices.
The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
the page cache if the zeroing command to the underlying device succeeds.
Without this patch we still have the pagecache coherence bug that's been
in the kernel forever.
The second patch changes the internal block device functions to reject
attempts to discard or zeroout that are not aligned to the logical block
size. Previously, we only checked that the start/len parameters were
512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
devices.
The third patch creates an fallocate handler for block devices, wires up
the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
fallocate interface between files and block devices. It also allows the
combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.
Test cases for the new block device fallocate are now in xfstests as
generic/349-351.
This patch (of 3):
Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
returning stale cache contents at a later time.
Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.
These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.
Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull blk-mq CPU hotplug update from Jens Axboe:
"This is the conversion of blk-mq to the new hotplug state machine"
* 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
blk-mq: fixup "Convert to new hotplug state machine"
blk-mq: Convert to new hotplug state machine
blk-mq/cpu-notif: Convert to new hotplug state machine
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.
As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.
This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.
- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.
- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.
- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
avoids that smatch reports the following error:
block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
Fixes: 06b285bd11 ("blkcg: fix blkcg_policy_data allocation bug")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
This provides the caller a feedback that a given hctx is not mapped and thus
no command can be sent on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.
The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):
ffffffffb90c57b1 finish_task_switch
ffffffffb97dffb5 schedule
ffffffffb97e310c schedule_timeout
ffffffffb97e1f12 __down
ffffffffb90ea821 down
ffffffffc046a9dc xfs_buf_lock
ffffffffc046abfb _xfs_buf_find
ffffffffc046ae4a xfs_buf_get_map
ffffffffc046babd xfs_buf_read_map
ffffffffc0499931 xfs_trans_read_buf_map
ffffffffc044a561 xfs_da_read_buf
ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
ffffffffc0452b90 xfs_dir2_leaf_lookup_int
ffffffffc0452e0f xfs_dir2_leaf_lookup
ffffffffc044d9d3 xfs_dir_lookup
ffffffffc047d1d9 xfs_lookup
ffffffffc0479e53 xfs_vn_lookup
ffffffffb925347a path_openat
ffffffffb9254a71 do_filp_open
ffffffffb9242a94 do_sys_open
ffffffffb9242b9e sys_open
ffffffffb97e42b2 entry_SYSCALL_64_fastpath
00007fb0698162ed [unknown]
Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.
Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:
8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240
8,0 11 153 81.092472889 804 Q WM 141698288 + 8 [xfsaild/sda1]
8,0 11 154 81.092473207 804 G WM 141698288 + 8 [xfsaild/sda1]
8,0 11 206 81.092496118 804 I WM 141698288 + 8 ( 22911) [xfsaild/sda1]
<==== 'I' means Inserted (into the IO scheduler) ===================================>
8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]
<==== Only 15s later the CFQ scheduler dispatches the request ======================>
As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):
8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
8,0 1 0 81.117914755 0 m N cfq767A / resid=40
8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0
where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.
The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.
Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.
Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.
When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).
This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)
Care is taken during preempt to still allow rt requests to preempt us
regardless.
Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:
Latency histogram of xfs_buf_lock acquisition (microseconds):
value |-------------------------------------------------- count
0 | 11
1 |@@@@ 161
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
4 |@ 54
8 | 36
16 | 7
32 | 0
64 | 0
~
1024 | 0
2048 | 0
4096 | 1
8192 | 1
16384 | 2
32768 | 0
65536 | 0
131072 | 1
262144 | 0
524288 | 0
Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing
a copy. Since we might never had an online callback we could end up with
a ZERO mask which in turn leads to crash as test robot demonstarted.
Fixes: 65d5291eee ("blk-mq: Convert to new hotplug state machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always
calls it from a safe context.
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Josef Bacik <jbacik@fb.com>
bt_get already does a non-blocking pass as well as running the queue
when scheduling internally, no need to duplicate it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Two cases:
1) blk_mq_alloc_request() needlessly re-runs the queue, after
calling into the tag allocation without NOWAIT set. We don't
need to do that.
2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
the async flag set to false.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Enable devices without a gendisk instance to register itself with blk-mq
and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>