Pull block core updates from Jens Axboe:
"It's a big(ish) round this time, lots of development effort has gone
into blk-mq in the last 3 months. Generally we're heading to where
3.16 will be a feature complete and performant blk-mq. scsi-mq is
progressing nicely and will hopefully be in 3.17. A nvme port is in
progress, and the Micron pci-e flash driver, mtip32xx, is converted
and will be sent in with the driver pull request for 3.16.
This pull request contains:
- Lots of prep and support patches for scsi-mq have been integrated.
All from Christoph.
- API and code cleanups for blk-mq from Christoph.
- Lots of good corner case and error handling cleanup fixes for
blk-mq from Ming Lei.
- A flew of blk-mq updates from me:
* Provide strict mappings so that the driver can rely on the CPU
to queue mapping. This enables optimizations in the driver.
* Provided a bitmap tagging instead of percpu_ida, which never
really worked well for blk-mq. percpu_ida relies on the fact
that we have a lot more tags available than we really need, it
fails miserably for cases where we exhaust (or are close to
exhausting) the tag space.
* Provide sane support for shared tag maps, as utilized by scsi-mq
* Various fixes for IO timeouts.
* API cleanups, and lots of perf tweaks and optimizations.
- Remove 'buffer' from struct request. This is ancient code, from
when requests were always virtually mapped. Kill it, to reclaim
some space in struct request. From me.
- Remove 'magic' from blk_plug. Since we store these on the stack
and since we've never caught any actual bugs with this, lets just
get rid of it. From me.
- Only call part_in_flight() once for IO completion, as includes two
atomic reads. Hopefully we'll get a better implementation soon, as
the part IO stats are now one of the more expensive parts of doing
IO on blk-mq. From me.
- File migration of block code from {mm,fs}/ to block/. This
includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me,
from a discussion on lkml.
That should describe the meat of the pull request. Also has various
little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong,
Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam
Bradshaw"
* 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits)
blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()
blk-mq: remember to start timeout handler for direct queue
block: ensure that the timer is always added
blk-mq: blk_mq_unregister_hctx() can be static
blk-mq: make the sysfs mq/ layout reflect current mappings
blk-mq: blk_mq_tag_to_rq should handle flush request
block: remove dead code in scsi_ioctl:blk_verify_command
blk-mq: request initialization optimizations
block: add queue flag for disabling SG merging
block: remove 'magic' from struct blk_plug
blk-mq: remove alloc_hctx and free_hctx methods
blk-mq: add file comments and update copyright notices
blk-mq: remove blk_mq_alloc_request_pinned
blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
blk-mq: remove blk_mq_wait_for_tags
blk-mq: initialize request in __blk_mq_alloc_request
blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
blk-mq: add helper to insert requests from irq context
blk-mq: remove stale comment for blk_mq_complete_request()
blk-mq: allow non-softirq completions
...
Pull device-mapper fixes from Mike Snitzer:
"A dm-cache stable fix to split discards on cache block boundaries
because dm-cache cannot yet handle discards that span cache blocks.
Really fix a dm-mpath LOCKDEP warning that was introduced in -rc1.
Add a 'no_space_timeout' control to dm-thinp to restore the ability to
queue IO indefinitely when no data space is available. This fixes a
change in behavior that was introduced in -rc6 where the timeout
couldn't be disabled"
* tag 'dm-3.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm mpath: really fix lockdep warning
dm cache: always split discards on cache block boundaries
dm thin: add 'no_space_timeout' dm-thin-pool module param
lockdep complains about a circular locking. And indeed, we need to
release the lock before calling dm_table_run_md_queue_async().
As such, commit 4cdd2ad ("dm mpath: fix lock order inconsistency in
multipath_ioctl") must also be reverted in addition to fixing the
lock order in the other dm_table_run_md_queue_async() callers.
Reported-by: Bart van Assche <bvanassche@acm.org>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The DM cache target cannot cope with discards that span multiple cache
blocks, so each discard bio that spans more than one cache block must
get split by the DM core.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.9+
Pull device mapper fixes from Mike Snitzer:
"A dm-crypt fix for a cpu hotplug crash that switches from using
per-cpu data to a mempool allocation (which offers allocation with cpu
locality, and there is no inter-cpu communication on slab allocation).
A couple dm-thinp stable fixes to address "out-of-data-space" issues.
A dm-multipath fix for a LOCKDEP warning introduced in 3.15-rc1"
* tag 'dm-3.15-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm mpath: fix lock order inconsistency in multipath_ioctl
dm thin: add timeout to stop out-of-data-space mode holding IO forever
dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode
dm crypt: fix cpu hotplug crash by removing per-cpu structure
Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
holding IO forever") introduced a fixed 60 second timeout. Users may
want to either disable or modify this timeout.
Allow the out-of-data-space timeout to be configured using the
'no_space_timeout' dm-thin-pool module param. Setting it to 0 will
disable the timeout, resulting in IO being queued until more data space
is added to the thin-pool.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
Commit 3e9f1be1b4 ("dm mpath: remove process_queued_ios()") did not
consistently take the multipath device's spinlock (m->lock) before
calling dm_table_run_md_queue_async() -- which takes the q->queue_lock.
Found with code inspection using hint from reported lockdep warning.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pool runs out of data space, dm-thin can be configured to
either error IOs that would trigger provisioning, or hold those IOs
until the pool is resized. Unfortunately, holding IOs until the pool is
resized can result in a cascade of tasks hitting the hung_task_timeout,
which may render the system unavailable.
Add a fixed timeout so IOs can only be held for a maximum of 60 seconds.
If LVM is going to resize a thin-pool that is out of data space it needs
to be prompt about it.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced
a regression in the metadata commit() method by returning an error if
the pool is in PM_OUT_OF_DATA_SPACE mode. This oversight caused a thin
device to return errors even if the default queue_if_no_space ENOSPC
handling mode is used.
Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode.
Reported-by: qindehua@163.com
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
The DM crypt target used per-cpu structures to hold pointers to a
ablkcipher_request structure. The code assumed that the work item keeps
executing on a single CPU, so it didn't use synchronization when
accessing this structure.
If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
the work item could be moved to another CPU. This causes dm-crypt
crashes, like the following, because the code starts using an incorrect
ablkcipher_request:
smpboot: CPU 7 is now offline
BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
...
Call Trace:
[<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
[<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
[<ffffffff81052a28>] ? process_one_work+0x168/0x470
[<ffffffff8105366b>] ? worker_thread+0x10b/0x390
[<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
[<ffffffff81058d9f>] ? kthread+0xaf/0xc0
[<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
[<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
Fix this bug by removing the per-cpu definition. The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.
This change may undermine performance improvements intended by commit
c0297721 ("dm crypt: scale to multiple cpus") on select hardware. In
practice no performance difference was observed on recent hardware. But
regardless, correctness is more important than performance.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull md bugfixes from Neil Brown:
"Two bugfixes for md in 3.15
Both tagged for -stable"
* tag 'md/3.15-fixes' of git://neil.brown.name/md:
md: avoid possible spinning md thread at shutdown.
md/raid10: call wait_barrier() for each request submitted.
If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:
1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
and md_check_recover enters an infinite loop.
Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.
Cc: stable@vger.kernel.org (any kernel)
Signed-off-by: NeilBrown <neilb@suse.de>
wait_barrier() includes a counter, so we must call it precisely once
(unless balanced by allow_barrier()) for each request submitted.
Since
commit 20d0189b10
block: Introduce new bio_split()
in 3.14-rc1, we don't call it for the extra requests generated when
we need to split a bio.
When this happens the counter goes negative, any resync/recovery will
never start, and "mdadm --stop" will hang.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14+)
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull device mapper fixes from Mike Snitzer:
"A few dm-thinp fixes for changes merged in 3.15-rc1.
A dm-verity fix for an immutable biovec regression that affects 3.14+.
A dm-cache fix to properly quiesce when using writethrough mode (3.14+)"
* tag 'dm-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache: fix writethrough mode quiescing in cache_map
dm thin: use INIT_WORK_ONSTACK in noflush_work to avoid ODEBUG warning
dm verity: fix biovecs hash calculation regression
dm thin: fix rcu_read_lock being held in code that can sleep
dm thin: irqsave must always be used with the pool->lock spinlock
Commit 2ee57d5873 ("dm cache: add passthrough mode") inadvertently
removed the deferred set reference that was taken in cache_map()'s
writethrough mode support. Restore taking this reference.
This issue was found with code inspection.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
Use INIT_WORK_ONSTACK to silence "ODEBUG: object is on stack, but not
annotated".
Reported-by: Zdeněk Kabeláč <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Pull md bugfix from Neil Brown:
"One BUG fix for md for recent commit"
* tag '3.15-fixes' of git://neil.brown.name/md:
raid5: fix a race of stripe count check
I hit another BUG_ON with e240c1839d. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.
V2: keeps the BUG_ON suggested by Neil.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This was used in the olden days, back when onions were proper
yellow. Basically it mapped to the current buffer to be
transferred. With highmem being added more than a decade ago,
most drivers map pages out of a bio, and rq->buffer isn't
pointing at anything valid.
Convert old style drivers to just use bio_data().
For the discard payload use case, just reference the page
in the bio.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 003b5c5719 ("block: Convert drivers
to immutable biovecs") incorrectly converted biovec iteration in
dm-verity to always calculate the hash from a full biovec, but the
function only needs to calculate the hash from part of the biovec (up to
the calculated "todo" value).
Fix this issue by limiting hash input to only the requested data size.
This problem was identified using the cryptsetup regression test for
veritysetup (verity-compat-test).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
Pull md updates from Neil Brown:
"Just a few md patches for the 3.15 merge window.
Not much happening in md/raid at the moment. Just a few bug fixes
(one for -stable) and a couple of performance tweaks"
* tag 'md/3.15' of git://neil.brown.name/md:
raid5: get_active_stripe avoids device_lock
raid5: make_request does less prepare wait
md: avoid oops on unload if some process is in poll or select.
md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
md/bitmap: don't abuse i_writecount for bitmap files.
For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe. This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>