Commit Graph

3327 Commits

Author SHA1 Message Date
Bart Van Assche b425e50492 block: Avoid that blk_exit_rl() triggers a use-after-free
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
essential that the memory allocated for struct request_queue
stays around until all blk_exit_rl() calls have finished. Hence
make blk_init_rl() take a reference on struct request_queue.

This patch fixes the following crash:

general protection fault: 0000 [#2] SMP
CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
task: ffff88013a108040 task.stack: ffffc9000071c000
RIP: 0010:free_request_size+0x1a/0x30
RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
Call Trace:
 mempool_destroy.part.10+0x21/0x40
 mempool_destroy+0xe/0x10
 blk_exit_rl+0x12/0x20
 blkg_free+0x4d/0xa0
 __blkg_release_rcu+0x59/0x170
 rcu_process_callbacks+0x260/0x4e0
 __do_softirq+0x116/0x250
 smpboot_thread_fn+0x123/0x1e0
 kthread+0x109/0x140
 ret_from_fork+0x31/0x40

Fixes: commit e9c787e65c ("scsi: allocate scsi_cmnd structures as part of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-01 13:07:55 -06:00
Hou Tao 5be6b75610 cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
as the delay of cfq_group's vdisktime if there have been other cfq_groups
already.

When cfq is under iops mode, commit 9a7f38c42c ("cfq-iosched: Convert
from jiffies to nanoseconds") could result in a large iops delay and
lead to an abnormal io schedule delay for the added cfq_group. To fix
it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
when iops mode is enabled.

Despite having the same value, the delay of a cfq_queue in idle class
and the delay of cfq_group are different things, so I define two new
macros for the delay of a cfq_group under time-slice mode and iops mode.

Fixes: 9a7f38c42c ("cfq-iosched: Convert from jiffies to nanoseconds")
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-31 09:25:21 -06:00
Keith Busch e4dc2b32df blk-mq: Take tagset lock when updating hw queues
The tagset lock needs to be held when iterating the tag_list, so a
lockdep assert was added when updating number of hardware queues. The
drivers calling this API, however, were unaware of the new requirement,
so are failing the assertion.

This patch takes the lock within the blk-mq function so the drivers do
not have to be modified in order to be safe.

Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-30 12:32:37 -06:00
Jens Axboe 8aa6382907 Merge branch 'nvme-4.12' of git://git.infradead.org/nvme into for-linus
Christoph writes:

"A couple of fixes for the next rc on the nvme front. Various FC fixes
from James, controller removal fixes from Ming (including a block layer
patch), a APST related device quirk from Andy, a RDMA fix for small
queue depth device from Marta, as well as fixes for the lack of
metadata support in non-PCIe drivers and the printk logging format from
me."
2017-05-26 09:11:19 -06:00
Bart Van Assche a8ecdd7117 blk-mq: Only register debugfs attributes for blk-mq queues
The code in blk-mq-debugfs.c assumes that it is working on a blk-mq
queue and is not intended to work on a blk-sq queue. Hence only
register blk-mq debugfs attributes for blk-mq queues.

Fixes: commit 9c1051aacd ("blk-mq: untangle debugfs and sysfs")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-26 07:25:13 -06:00
Richard 223220356d partitions/msdos: FreeBSD UFS2 file systems are not recognized
The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
and NetBSD partitions and does a reasonable job picking out OpenBSD
and NetBSD UFS subpartitions.

But for FreeBSD the subpartitions are always "bad".

    Kernel: <bsd:bad subpartition - ignored

Though all 3 of these BSD systems use UFS as a file system, only
FreeBSD uses relative start addresses in the subpartition
declarations.

The following patch fixes this for FreeBSD partitions and leaves
the code for OpenBSD and NetBSD intact:

Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-23 09:16:07 -06:00
Dan Carpenter 7bd897cfce block: fix an error code in add_partition()
We don't set an error code on this path.  It means that we return NULL
instead of an error pointer and the caller does a NULL dereference.

Fixes: 6d1d8050b4 ("block, partition: add partition_meta_info to hd_struct")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-23 08:41:59 -06:00
Shaohua Li b4f428ef28 blk-throttle: force user to configure all settings for io.low
Default value of io.low limit is 0. If user doesn't configure the limit,
last patch makes cgroup be throttled to very tiny bps/iops, which could
stall the system. A cgroup with default settings of io.low limit really
means nothing, so we force user to configure all settings, otherwise
io.low limit doesn't take effect. With this stragety, default setting of
latency/idle isn't important, so just set them to very conservative and
safe value.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-22 14:47:12 -06:00
Shaohua Li 9bb67aeb96 blk-throttle: respect 0 bps/iops settings for io.low
If a cgroup with low limit 0 for both bps/iops, the cgroup's low limit
is ignored and we throttle the cgroup with its max limit. In this way,
other cgroups with a low limit will not get protected. To fix this, we
don't do the exception any more. cgroup will be throttled to a limit 0
if it uese default setting. To avoid completed stall, we give such
cgroup tiny IO resources.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-22 14:47:12 -06:00
Shaohua Li 4cff729f62 blk-throttle: output some debug info in trace
These info are important to understand what's happening and help debug.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-22 14:47:12 -06:00
Shaohua Li 5b81fc3cc6 blk-throttle: add hierarchy support for latency target and idle time
For idle time, children's setting should not be bigger than parent's.
For latency target, children's setting should not be smaller than
parent's. The leaf nodes will adjust their settings according to the
hierarchy and compare their IO with the settings and do
upgrade/downgrade. parents nodes don't need to track their IO
latency/idle time.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-22 14:47:12 -06:00
Ming Lei 7254a50a5d blk-mq: remove blk_mq_abort_requeue_list()
No one uses it any more, so remove it.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-22 20:50:11 +02:00
Linus Torvalds 0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00
Christoph Hellwig ed6565e734 block: handle partial completions for special payload requests
SCSI devices can return short writes on Write Same just like for normal
writes, so we need to handle this case for our special payload requests
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-11 08:08:53 -06:00
Wen Xiong f36ea50ca0 blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.

Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error

We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.

1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
	dsmgmt=0x0, AT=0x0 & RT=0x505
	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
	AT=0x0 & RT=0x605  ==> This op fails and subsequent 5 retires..
	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

With the fix, It showed us both of the first op and the second op have
correct bi and pi address.

1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
	dsmgmt=0x0, AT=0x0 & RT=0x505
	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
	0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
	AT=0x0 & RT=0x605
	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
	0x00003028

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 08:02:58 -06:00
Jens Axboe d373812398 blk-stat: don't use this_cpu_ptr() in a preemptable section
If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough
for us to use this_cpu_ptr() in that section. Use the safer
get/put_cpu_ptr() variants instead.

Reported-by: Mike Galbraith <efault@gmx.de>
Fixes: 34dbad5d26 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 07:40:18 -06:00
Jens Axboe 340ff32167 elevator: remove redundant warnings on IO scheduler switch
We warn twice for switching to a scheduler, if that switch fails.
As we also report the failure in the return value to the
sysfs write, remove the dmesg induced failures.

Keep the failure print for warning to switch to the kconfig
selected IO scheduler, as we can't report errors for that in
any other way.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 07:40:04 -06:00
Paolo Valente 43c1b3d6e5 block, bfq: stress that low_latency must be off to get max throughput
The introduction of the BFQ and Kyber I/O schedulers has triggered a
new wave of I/O benchmarks. Unfortunately, comments and discussions on
these benchmarks confirm that there is still little awareness that it
is very hard to achieve, at the same time, a low latency and a high
throughput. In particular, virtually all benchmarks measure
throughput, or throughput-related figures of merit, but, for BFQ, they
use the scheduler in its default configuration. This configuration is
geared, instead, toward a low latency. This is evidently a sign that
BFQ documentation is still too unclear on this important aspect. This
commit addresses this issue by stressing how BFQ configuration must be
(easily) changed if the only goal is maximum throughput.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 07:39:43 -06:00
Paolo Valente a66c38a171 block, bfq: use pointer entity->sched_data only if set
In the function __bfq_deactivate_entity, the pointer
entity->sched_data could happen to be used before being properly
initialized. This led to a NULL pointer dereference. This commit fixes
this bug by just using this pointer only where it is safe to do so.

Reported-by: Tom Harrison <l12436.tw@gmail.com>
Tested-by: Tom Harrison <l12436.tw@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 07:39:43 -06:00
Dan Williams ef51042472 block, dax: move "select DAX" from BLOCK to FS_DAX
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08 10:55:27 -07:00
Colin Ian King ebd7685795 blk-mq: make __blk_mq_stop_hw_queues static
Making __blk_mq_stop_hw_queues static fixes sparse warning:

  block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
  declared. Should it be static?

Fixes: 2719aa217e ("blk-mq: don't use sync workqueue flushing from drivers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-08 08:09:39 -06:00
Wanpeng Li 51d638b1f5 block/mq: fix potential deadlock during cpu hotplug
This can be triggered by hot-unplug one cpu.

======================================================
 [ INFO: possible circular locking dependency detected ]
 4.11.0+ #17 Not tainted
 -------------------------------------------------------
 step_after_susp/2640 is trying to acquire lock:
  (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110

 but task is already holding lock:
  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (cpu_hotplug.lock){+.+.+.}:
        lock_acquire+0x11c/0x230
        __mutex_lock+0x92/0x990
        mutex_lock_nested+0x1b/0x20
        get_online_cpus+0x64/0x80
        blk_mq_init_allocated_queue+0x3a0/0x4e0
        blk_mq_init_queue+0x3a/0x60
        loop_add+0xe5/0x280
        loop_init+0x124/0x177
        do_one_initcall+0x53/0x1c0
        kernel_init_freeable+0x1e3/0x27f
        kernel_init+0xe/0x100
        ret_from_fork+0x31/0x40

 -> #0 (all_q_mutex){+.+...}:
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        __mutex_lock+0x92/0x990
        mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        do_syscall_64+0x8f/0x710
        return_from_SYSCALL_64+0x0/0x7a

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(cpu_hotplug.lock);
                                lock(all_q_mutex);
                                lock(cpu_hotplug.lock);
   lock(all_q_mutex);

  *** DEADLOCK ***

 8 locks held by step_after_susp/2640:
  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
  #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
  #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
  #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
  #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
  #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
  #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

 stack backtrace:
 CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
 Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
 Call Trace:
  dump_stack+0x99/0xce
  print_circular_bug+0x1fa/0x270
  __lock_acquire+0x189a/0x18a0
  lock_acquire+0x11c/0x230
  ? lock_acquire+0x11c/0x230
  ? blk_mq_queue_reinit_work+0x18/0x110
  ? blk_mq_queue_reinit_work+0x18/0x110
  __mutex_lock+0x92/0x990
  ? blk_mq_queue_reinit_work+0x18/0x110
  ? kmem_cache_free+0x2cb/0x330
  ? anon_transport_class_unregister+0x20/0x20
  ? blk_mq_queue_reinit_work+0x110/0x110
  mutex_lock_nested+0x1b/0x20
  ? mutex_lock_nested+0x1b/0x20
  blk_mq_queue_reinit_work+0x18/0x110
  blk_mq_queue_reinit_dead+0x1c/0x20
  cpuhp_invoke_callback+0x1f2/0x810
  ? __flow_cache_shrink+0x160/0x160
  cpuhp_down_callbacks+0x42/0x80
  _cpu_down+0xb2/0xe0
  freeze_secondary_cpus+0xb6/0x390
  suspend_devices_and_enter+0x3b3/0xa40
  ? rcu_read_lock_sched_held+0x79/0x80
  pm_suspend+0x129/0x490
  state_store+0x82/0xf0
  kobj_attr_store+0xf/0x20
  sysfs_kf_write+0x45/0x60
  kernfs_fop_write+0x135/0x1c0
  __vfs_write+0x37/0x160
  ? rcu_read_lock_sched_held+0x79/0x80
  ? rcu_sync_lockdep_assert+0x2f/0x60
  ? __sb_start_write+0xd9/0x1c0
  ? vfs_write+0x1ad/0x1d0
  vfs_write+0xcd/0x1d0
  SyS_write+0x58/0xc0
  ? rcu_read_lock_sched_held+0x79/0x80
  do_syscall_64+0x8f/0x710
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL64_slow_path+0x25/0x25

The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting
queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will
contend these two locks in the inversion order. This is due to commit eabe06595d
(blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion
issue because of hotplug rework, however the hotplug rework is still work-in-progress
and lives in a -tip branch and mainline cannot yet trigger that splat. The commit
breaks the linus's tree in the merge window, so this patch reverts the lock order
and avoids to splat linus's tree.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-07 19:50:11 -06:00
Linus Torvalds 044f1daaaa Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes and updates from Jens Axboe:
 "Some fixes and followup features/changes that should go in, in this
  merge window. This contains:

   - Two fixes for lightnvm from Javier, fixing problems in the new code
     merge previously in this merge window.

   - A fix from Jan for the backing device changes, fixing an issue in
     NFS that causes a failure to mount on certain setups.

   - A change from Christoph, cleaning up the blk-mq init and exit
     request paths.

   - Remove elevator_change(), which is now unused. From Bart.

   - A fix for queue operation invocation on a dead queue, from Bart.

   - A series fixing up mtip32xx for blk-mq scheduling, removing a
     bandaid we previously had in place for this. From me.

   - A regression fix for this series, fixing a case where we wait on
     workqueue flushing from an invalid (non-blocking) context. From me.

   - A fix/optimization from Ming, ensuring that we don't both quiesce
     and freeze a queue at the same time.

   - A fix from Peter on lock ordering for CPU hotplug. Not a real
     problem right now, but will be once the CPU hotplug rework goes in.

   - A series from Omar, cleaning up out blk-mq debugfs support, and
     adding support for exporting info from schedulers in debugfs as
     well. This is really useful in debugging stalls or livelocks. From
     Omar"

* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
  mq-deadline: add debugfs attributes
  kyber: add debugfs attributes
  blk-mq-debugfs: allow schedulers to register debugfs attributes
  blk-mq: untangle debugfs and sysfs
  blk-mq: move debugfs declarations to a separate header file
  blk-mq: Do not invoke queue operations on a dead queue
  blk-mq-debugfs: get rid of a bunch of boilerplate
  blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
  blk-mq-debugfs: don't open code strstrip()
  blk-mq-debugfs: error on long write to queue "state" file
  blk-mq-debugfs: clean up flag definitions
  blk-mq-debugfs: separate flags with |
  nfs: Fix bdi handling for cloned superblocks
  block/mq: Cure cpu hotplug lock inversion
  lightnvm: fix bad back free on error path
  lightnvm: create cmd before allocating request
  blk-mq: don't use sync workqueue flushing from drivers
  mtip32xx: convert internal commands to regular block infrastructure
  mtip32xx: cleanup internal tag assumptions
  block: don't call blk_mq_quiesce_queue() after queue is frozen
  ...
2017-05-06 11:25:08 -07:00
Linus Torvalds 53ef7d0e20 Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
 "The bulk of this has been in multiple -next releases. There were a few
  late breaking fixes and small features that got added in the last
  couple days, but the whole set has received a build success
  notification from the kbuild robot.

  Change summary:

   - Region media error reporting: A libnvdimm region device is the
     parent to one or more namespaces. To date, media errors have been
     reported via the "badblocks" attribute attached to pmem block
     devices for namespaces in "raw" or "memory" mode. Given that
     namespaces can be in "device-dax" or "btt-sector" mode this new
     interface reports media errors generically, i.e. independent of
     namespace modes or state.

     This subsequently allows userspace tooling to craft "ACPI 6.1
     Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
     requests and submit them via the ioctl path for NVDIMM root bus
     devices.

   - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
     by a request from Linus and feedback from Christoph this allows for
     dax capable drivers to publish their own custom dax operations.
     This fixes the broken assumption that all dax operations are
     related to a persistent memory device, and makes it easier for
     other architectures and platforms to add customized persistent
     memory support.

   - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
     available for storage appliance applications to manually trigger
     memory controllers to drain write-pending buffers that would
     otherwise be flushed automatically by the platform ADR
     (asynchronous-DRAM-refresh) mechanism at a power loss event.
     Support for "locked" DIMMs is included to prevent namespaces from
     surfacing when the namespace label data area is locked. Finally,
     fixes for various reported deadlocks and crashes, also tagged for
     -stable.

   - ACPI / nfit driver updates: General updates of the nfit driver to
     add DSM command overrides, ACPI 6.1 health state flags support, DSM
     payload debug available by default, and various fixes.

  Acknowledgements that came after the branch was pushed:

   - commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
     Tested-by: Yi Zhang <yizhan@redhat.com>

   - commit 23f4984483 "libnvdimm: rework region badblocks clearing"
     Tested-by: Toshi Kani <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
  libnvdimm, pfn: fix 'npfns' vs section alignment
  libnvdimm: handle locked label storage areas
  libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
  brd: fix uninitialized use of brd->dax_dev
  block, dax: use correct format string in bdev_dax_supported
  device-dax: fix sysfs attribute deadlock
  libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
  libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
  libnvdimm: rework region badblocks clearing
  acpi, nfit: kill ACPI_NFIT_DEBUG
  libnvdimm: fix clear length of nvdimm_forget_poison()
  libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
  libnvdimm, region: sysfs trigger for nvdimm_flush()
  libnvdimm: fix phys_addr for nvdimm_clear_poison
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  block: remove block_device_operations ->direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  filesystem-dax: convert to dax_direct_access()
  Revert "block: use DAX for partition table reads"
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  ...
2017-05-05 18:49:20 -07:00
Omar Sandoval daaadb3e94 mq-deadline: add debugfs attributes
Expose the fifo lists, cached next requests, batching state, and
dispatch list. It'd also be possible to add the sorted lists, but there
aren't already seq_file helpers for rbtrees.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 08:25:17 -06:00