Commit Graph

2476 Commits

Author SHA1 Message Date
Linus Torvalds
23b7776290 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - lockless wakeup support for futexes and IPC message queues
     (Davidlohr Bueso, Peter Zijlstra)

   - Replace spinlocks with atomics in thread_group_cputimer(), to
     improve scalability (Jason Low)

   - NUMA balancing improvements (Rik van Riel)

   - SCHED_DEADLINE improvements (Wanpeng Li)

   - clean up and reorganize preemption helpers (Frederic Weisbecker)

   - decouple page fault disabling machinery from the preemption
     counter, to improve debuggability and robustness (David
     Hildenbrand)

   - SCHED_DEADLINE documentation updates (Luca Abeni)

   - topology CPU masks cleanups (Bartosz Golaszewski)

   - /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  sched/deadline: Remove needless parameter in dl_runtime_exceeded()
  sched: Remove superfluous resetting of the p->dl_throttled flag
  sched/deadline: Drop duplicate init_sched_dl_class() declaration
  sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
  sched/deadline: Make init_sched_dl_class() __init
  sched/deadline: Optimize pull_dl_task()
  sched/preempt: Add static_key() to preempt_notifiers
  sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
  sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
  sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
  sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
  sched/debug: Properly format runnable tasks in /proc/sched_debug
  sched/numa: Only consider less busy nodes as numa balancing destinations
  Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
  sched/fair: Prevent throttling in early pick_next_task_fair()
  preempt: Reorganize the notrace definitions a bit
  preempt: Use preempt_schedule_context() as the official tracing preemption point
  sched: Make preempt_schedule_context() function-tracing safe
  x86: Remove cpu_sibling_mask() and cpu_core_mask()
  x86: Replace cpu_**_mask() with topology_**_cpumask()
  ...
2015-06-22 15:52:04 -07:00
Dan Williams
4d66e5e9b6 block: fix ext_dev_lock lockdep report
=================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
   [<ffffffff810c1947>] lock_acquire+0xb7/0x290
   [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
   [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
[..]
  [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
  [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
  [<ffffffff810c1947>] lock_acquire+0xb7/0x290
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
  [<ffffffff8143bfec>] part_release+0x1c/0x50
  [<ffffffff8158edf6>] device_release+0x36/0xb0
  [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
  [<ffffffff8145aad0>] kobject_put+0x30/0x70
  [<ffffffff8158f147>] put_device+0x17/0x20
  [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
  [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
  [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
  [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
  [<ffffffff81067e2e>] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dd "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092dd are
candidates for -stable.

Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: <stable@vger.kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-11 09:01:40 -06:00
Ming Lei
c3b4afca70 blk-mq: free hctx->ctxs in queue's release handler
Now blk_cleanup_queue() can be called before calling
del_gendisk()[1], inside which hctx->ctxs is touched
from blk_mq_unregister_hctx(), but the variable has
been freed by blk_cleanup_queue() at that time.

So this patch moves freeing of hctx->ctxs into queue's
release handler for fixing the oops reported by Stefan.

[1], 6cd18e711d (block: destroy bdi before blockdev is
unregistered)

Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-09 15:32:38 -06:00
Ingo Molnar
f407a82586 Merge branch 'linus' into sched/core, to resolve conflict
Conflicts:
	arch/sparc/include/asm/topology_64.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-02 08:05:42 +02:00
NeilBrown
aad653a0bc block: discard bdi_unregister() in favour of bdi_destroy()
bdi_unregister() now contains very little functionality.

It contains a "WARN_ON" if bdi->dev is NULL.  This warning is of no
real consequence as bdi->dev isn't needed by anything else in the function,
and it triggers if
   blk_cleanup_queue() -> bdi_destroy()
is called before bdi_unregister, which happens since
  Commit: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")

So this isn't wanted.

It also calls bdi_set_min_ratio().  This needs to be called after
writes through the bdi have all been flushed, and before the bdi is destroyed.
Calling it early is better than calling it late as it frees up a global
resource.

Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
perfectly fits these requirements.

So bdi_unregister() can be discarded with the important content moved to
bdi_destroy(), as can the
  writeback_bdi_unregister
event which is already not used.

Reported-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org (v4.0)
Fixes: c4db59d31e ("fs: don't reassign dirty inodes to default_backing_dev_info")
Fixes: 6cd18e711d ("block: destroy bdi before blockdev is unregistered.")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-28 10:12:42 -06:00
Bartosz Golaszewski
06931e6224 sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask()
Rename topology_thread_cpumask() to topology_sibling_cpumask()
for more consistency with scheduler code.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 15:22:15 +02:00
Mike Snitzer
336b7e1f23 block: remove export for blk_queue_bio
With commit ff36ab345 ("dm: remove request-based logic from
make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
so remove its export.  Doing so required a forward declaration in
blk-core.c.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-12 17:21:22 -04:00
Shaohua Li
9ba52e5812 blk-mq: don't lose requests if a stopped queue restarts
Normally if driver is busy to dispatch a request the logic is like below:
block layer:					driver:
	__blk_mq_run_hw_queue
a.						blk_mq_stop_hw_queue
b.	rq add to ctx->dispatch

later:
1.						blk_mq_start_hw_queue
2.	__blk_mq_run_hw_queue

But it's possible step 1-2 runs between a and b. And since rq isn't in
ctx->dispatch yet, step 2 will not run rq. The rq might get lost if
there are no subsequent requests kick in.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-04 14:32:48 -06:00
NeilBrown
6cd18e711d block: destroy bdi before blockdev is unregistered.
Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
	blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().

Therefore it is important that all visible artifacts of the previous
device are removed before this call.  In particular, the 'bdi'.

Since:
commit c4db59d31e
Author: Christoph Hellwig <hch@lst.de>
    fs: don't reassign dirty inodes to default_backing_dev_info

moved the
   device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().

The new device finds that the bdi name is already registered in sysfs
and complains

> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.

Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk().  As loop.c devices are also created on demand by
opening the device node, we make the same change there.

Fixes: c4db59d31e
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27 10:27:20 -06:00
Wang YanQing
393a339705 block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
Commit d2c5e30c9a
("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter")
convert statistic of nr_bounce to per zone and one global value in vm_stat,
but it call inc_|dec_zone_page_state on different pages, then different
zones, and cause us to get unexpected value of NR_BOUNCE.

Below is the result on my machine:
Mar  2 09:26:08 udknight kernel: [144766.778265] Mem-Info:
Mar  2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu:
Mar  2 09:26:08 udknight kernel: [144766.778268] CPU    0: hi:    0, btch:   1 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778269] CPU    1: hi:    0, btch:   1 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu:
Mar  2 09:26:08 udknight kernel: [144766.778271] CPU    0: hi:  186, btch:  31 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778273] CPU    1: hi:  186, btch:  31 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu:
Mar  2 09:26:08 udknight kernel: [144766.778275] CPU    0: hi:  186, btch:  31 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778276] CPU    1: hi:  186, btch:  31 usd:   0
Mar  2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0
Mar  2 09:26:08 udknight kernel: [144766.778279]  active_file:105085 inactive_file:139432 isolated_file:0
Mar  2 09:26:08 udknight kernel: [144766.778279]  unevictable:653 dirty:0 writeback:0 unstable:0
Mar  2 09:26:08 udknight kernel: [144766.778279]  free:178957 slab_reclaimable:6419 slab_unreclaimable:9966
Mar  2 09:26:08 udknight kernel: [144766.778279]  mapped:4426 shmem:305277 pagetables:784 bounce:0
Mar  2 09:26:08 udknight kernel: [144766.778279]  free_cma:0
Mar  2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar  2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754
Mar  2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes
Mar  2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451
Mar  2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Mar  2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0

You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global.

This patch fix it.

Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27 09:24:07 -06:00
Chao Yu
8406a4d56e elevator: fix double release of elevator module
Our issue is descripted in below call path:
->elevator_init
 ->elevator_init_fn
  ->{cfq,deadline,noop}_init_queue
   ->elevator_alloc
    ->kzalloc_node
   fail to call kzalloc_node and then put module in elevator_alloc;
fail to call elevator_init_fn and then put module again in elevator_init.

Remove elevator_put invoking in error path of elevator_alloc to avoid
double release issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23 10:47:44 -06:00
Ming Lei
2a34c0872a blk-mq: fix CPU hotplug handling
hctx->tags has to be set as NULL in case that it is to be unmapped
no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue()
because shared tags can be freed already from another request queue.

The same situation has to be considered during handling CPU online too.
Unmapped hw queue can be remapped after CPU topo is changed, so we need
to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags
allocation for hw queue can be removed in hctx cpu online notifier, and it
is reasonable to do that after mapping is updated.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23 10:27:38 -06:00
Ming Lei
f054b56c95 blk-mq: fix race between timeout and CPU hotplug
Firstly during CPU hotplug, even queue is freezed, timeout
handler still may come and access hctx->tags, which may cause
use after free, so this patch deactivates timeout handler
inside CPU hotplug notifier.

Secondly, tags can be shared by more than one queues, so we
have to check if the hctx has been unmapped, otherwise
still use-after-free on tags can be triggered.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23 10:27:35 -06:00
Jens Axboe
569fd0ce96 blk-mq: fix iteration of busy bitmap
Commit 889fa31f00 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 889fa31f00 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-17 08:31:12 -06:00
Linus Torvalds
d82312c808 Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-block
Pull block layer core bits from Jens Axboe:
 "This is the core pull request for 4.1.  Not a lot of stuff in here for
  this round, mostly little fixes or optimizations.  This pull request
  contains:

   - An optimization that speeds up queue runs on blk-mq, especially for
     the case where there's a large difference between nr_cpu_ids and
     the actual mapped software queues on a hardware queue.  From Chong
     Yuan.

   - Honor node local allocations for requests on legacy devices.  From
     David Rientjes.

   - Cleanup of blk_mq_rq_to_pdu() from me.

   - exit_aio() fixup from me, greatly speeding up exiting multiple IO
     contexts off exit_group().  For my particular test case, fio exit
     took ~6 seconds.  A typical case of both exposing RCU grace periods
     to user space, and serializing exit of them.

   - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
     wait if __GFP_WAIT is set.  From Keith Busch.

   - blk-mq exports and two added helpers from Mike Snitzer, which will
     be used by the dm-mq code.

   - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

* 'for-4.1/core' of git://git.kernel.dk/linux-block:
  blk-mq: reduce unnecessary software queue looping
  aio: fix serial draining in exit_aio()
  blk-mq: cleanup blk_mq_rq_to_pdu()
  blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
  block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
  block: allocate request memory local to request queue
  blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
  blk-mq: export blk_mq_run_hw_queues
  blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
2015-04-16 21:49:16 -04:00
Chong Yuan
889fa31f00 blk-mq: reduce unnecessary software queue looping
In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size
times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8.
Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary.
Change ->map_size to contain the actually mapped software queues, so we
only loop for as many iterations as we have to.

And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues()
since they are all re-done in blk_mq_map_swqueue().
blk_mq_map_swqueue().

Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>

Updated by me for formatting and commenting.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-15 11:39:29 -06:00
Linus Torvalds
ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
Al Viro
8f7e885a4c blk_rq_map_user(): use import_single_range()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:27:13 -04:00
Al Viro
e272b89ff8 sg_io(): use import_iovec()
... and don't skip access_ok() validation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:27:13 -04:00
Linus Torvalds
ac2111753c blk-mq: initialize 'struct request' and associated data to zero
Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
pointer in scsi_init_cmd_errh() with the blk-mq code.

The sense_buffer pointer should have been initialized by the call to
scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
some non-repeatable memory corruptor.

This patch makes sure we initialize the whole struct request allocation
(and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
using __GFP_ZERO in the allocation.  The old code initialized a couple
of individual fields, leaving the rest undefined (although many of them
are then initialized in later phases, like blk_mq_rq_ctx_init() etc.

It's not entirely clear why this matters, but it's the rigth thing to do
regardless, and with 4.0 imminent this is the defensive "let's just make
sure everything is initialized properly" patch.

Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-11 13:42:16 -07:00
Mike Snitzer
e9637415a9 block: fix blk_stack_limits() regression due to lcm() change
Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
caused blk_stack_limits() to not properly stack queue_limits for stacked
devices (e.g. DM).

Fix this regression by establishing lcm_not_zero() and switching
blk_stack_limits() over to using it.

DM uses blk_set_stacking_limits() to establish the initial top-level
queue_limits that are then built up based on underlying devices' limits
using blk_stack_limits().  In the case of optimal_io_size (io_opt)
blk_set_stacking_limits() establishes a default value of 0.  With commit
69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
the underlying devices' io_opt.

Test:
$ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
$ cat /sys/block/sde/queue/optimal_io_size
786432
$ dmsetup create node --table "0 100 linear /dev/sde 0"

Before this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
0

After this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
786432

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-31 09:45:50 -06:00
Wei Fang
c76cbbcf40 blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
Don't assign ->rq_timeout twice.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30 09:07:00 -06:00
Xiaoguang Wang
f9018ac930 block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
At the beginning of blk_mq_alloc_tag_set(), we have already checked whether
'set->nr_hw_queues' is zero, so here remove this redundant check.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30 09:04:27 -06:00
David Rientjes
271508dba2 block: allocate request memory local to request queue
blk_init_rl() allocates a mempool using mempool_create_node() with node
local memory.  This only allocates the mempool and element list locally
to the requeue queue node.

What we really want to do is allocate the request itself local to the
queue.  To do this, we need our own alloc and free functions that will
allocate from request_cachep and pass the request queue node in to prefer
node local memory.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-24 20:00:07 -06:00
Wenbo Wang
7ee8e4f398 Fix bug in blk_rq_merge_ok
Use the right array index to reference the last
element of rq->biotail->bi_io_vec[]

Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com>
Reviewed-by: Chong Yuan <chong.yuan@memblaze.com>
Fixes: 66cb45aa41 ("block: add support for limiting gaps in SG lists")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-20 08:50:41 -06:00