Commit Graph

2648 Commits

Author SHA1 Message Date
Randy Dunlap
ccc2600b8a block: fix blk-core.c kernel-doc warning
Fix kernel-doc warning in blk-core.c:

Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-11 09:36:57 -07:00
Jens Axboe
1fa8cc52f4 blk-mq: mark __blk_mq_complete_request() static
It's no longer used outside of blk-mq core.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-11 09:36:56 -07:00
Linus Torvalds
3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Jens Axboe
05229beedd block: add block polling support
Add basic support for polling for specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.

This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:47 -07:00
Jens Axboe
7b371636fb blk-mq: return tag/queue combo in the make_request_fn handlers
Return a cookie, blk_qc_t, from the blk-mq make request functions, that
allows a later caller to uniquely identify a specific IO. The cookie
doesn't mean anything to the caller, but the caller can use it to later
pass back to the block layer. The block layer can then identify the
hardware queue and request from that cookie.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:47 -07:00
Jens Axboe
dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Ben Segall
8639b46139 pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
the current pid namespace.  This is in contrast to both the other modes of
setpriority and the example of kill(-1).  Fix this.  getpriority and
ioprio have the same failure mode, fix them too.

Eric said:

: After some more thinking about it this patch sounds justifiable.
:
: My goal with namespaces is not to build perfect isolation mechanisms
: as that can get into ill defined territory, but to build well defined
: mechanisms.  And to handle the corner cases so you can use only
: a single namespace with well defined results.
:
: In this case you have found the two interfaces I am aware of that
: identify processes by uid instead of by pid.  Which quite frankly is
: weird.  Unfortunately the weird unexpected cases are hard to handle
: in the usual way.
:
: I was hoping for a little more information.  Changes like this one we
: have to be careful of because someone might be depending on the current
: behavior.  I don't think they are and I do think this make sense as part
: of the pid namespace.

Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ambrose Feinstein <ambrose@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
71baba4b92 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep.  Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep.  The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds
69234acee5 Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "The cgroup core saw several significant updates this cycle:

   - percpu_rwsem for threadgroup locking is reinstated.  This was
     temporarily dropped due to down_write latency issues.  Oleg's
     rework of percpu_rwsem which is scheduled to be merged in this
     merge window resolves the issue.

   - On the v2 hierarchy, when controllers are enabled and disabled, all
     operations are atomic and can fail and revert cleanly.  This allows
     ->can_attach() failure which is necessary for cpu RT slices.

   - Tasks now stay associated with the original cgroups after exit
     until released.  This allows tracking resources held by zombies
     (e.g.  pids) and makes it easy to find out where zombies came from
     on the v2 hierarchy.  The pids controller was broken before these
     changes as zombies escaped the limits; unfortunately, updating this
     behavior required too many invasive changes and I don't think it's
     a good idea to backport them, so the pids controller on 4.3, the
     first version which included the pids controller, will stay broken
     at least until I'm sure about the cgroup core changes.

   - Optimization of a couple common tests using static_key"

* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
  cgroup: fix race condition around termination check in css_task_iter_next()
  blkcg: don't create "io.stat" on the root cgroup
  cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
  cgroup: replace error handling in cgroup_init() with WARN_ON()s
  cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  cgroup: keep zombies associated with their original cgroups
  cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  cgroup: don't hold css_set_rwsem across css task iteration
  cgroup: reorganize css_task_iter functions
  cgroup: factor out css_set_move_task()
  cgroup: keep css_set and task lists in chronological order
  cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  cgroup: make css_sets pin the associated cgroups
  cgroup: relocate cgroup_[try]get/put()
  cgroup: move check_for_release() invocation
  cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  cgroup: make cgroup->nr_populated count the number of populated css_sets
  cgroup: remove an unused parameter from cgroup_task_migrate()
  cgroup: fix too early usage of static_branch_disable()
  cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  ...
2015-11-05 14:51:32 -08:00
Linus Torvalds
ccf21b69a8 Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-block
Pull block reservation support from Jens Axboe:
 "This adds support for persistent reservations, both at the core level,
  as well as for sd and NVMe"

[ Background from the docs: "Persistent Reservations allow restricting
  access to block devices to specific initiators in a shared storage
  setup.  All implementations are expected to ensure the reservations
  survive a power loss and cover all connections in a multi path
  environment" ]

* 'for-4.4/reservations' of git://git.kernel.dk/linux-block:
  NVMe: Precedence error in nvme_pr_clear()
  nvme: add missing endianess annotations in nvme_pr_command
  NVMe: Add persistent reservation ops
  sd: implement the Persistent Reservation API
  block: add an API for Persistent Reservations
  block: cleanup blkdev_ioctl
2015-11-04 21:01:27 -08:00
Linus Torvalds
527d1529e3 Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
2015-11-04 20:51:48 -08:00
Linus Torvalds
d9734e0d1c Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This is the core block pull request for 4.4.  I've got a few more
  topic branches this time around, some of them will layer on top of the
  core+drivers changes and will come in a separate round.  So not a huge
  chunk of changes in this round.

  This pull request contains:

   - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

   - Unused prototype removal in blk-mq from Christoph.

   - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
     xchg()'s, from Davidlohr.

   - A plug flush fix from Jeff.

   - Also from Jeff, a fix that means we don't have to update shared tag
     sets at init time unless we do a state change.  This cuts down boot
     times on thousands of devices a lot with scsi/blk-mq.

   - blk-mq waitqueue barrier fix from Kosuke.

   - Various fixes from Ming:

        - Fixes for segment merging and splitting, and checks, for
          the old core and blk-mq.

        - Potential blk-mq speedup by marking ctx pending at the end
          of a plug insertion batch in blk-mq.

        - direct-io no page dirty on kernel direct reads.

   - A WRITE_SYNC fix for mpage from Roman"

* 'for-4.4/core' of git://git.kernel.dk/linux-block:
  blk-mq: avoid excessive boot delays with large lun counts
  blktrace: re-write setting q->blk_trace
  blk-mq: mark ctx as pending at batch in flush plug path
  blk-mq: fix for trace_block_plug()
  block: check bio_mergeable() early before merging
  blk-mq: check bio_mergeable() early before merging
  block: avoid to merge splitted bio
  block: setup bi_phys_segments after splitting
  block: fix plug list flushing for nomerge queues
  blk-mq: remove unused blk_mq_clone_flush_request prototype
  blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
  fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
  fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  block: kmemleak: Track the page allocations for struct request
2015-11-04 20:28:10 -08:00
Jeff Moyer
2404e607a9 blk-mq: avoid excessive boot delays with large lun counts
Hi,

Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled.  He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one.  We don't need to do this.  After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, and
updating the newly added queue's hctxs if the flag is set.

This problem was introduced by commit 0d2602ca30 (blk-mq: improve
support for shared tags maps).

Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-03 08:42:19 -07:00
Ming Lin
a22c4d7e34 block: re-add discard_granularity and alignment checks
In commit b49a087("block: remove split code in
blkdev_issue_{discard,write_same}"), discard_granularity and alignment
checks were removed. Ideally, with bio late splitting, the upper layers
shouldn't need to depend on device's limits.

Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe
device when mkfs.xfs. We have not found the root cause yet.

This patch re-adds discard_granularity and alignment checks by reverting
the related changes in commit b49a087. The good thing is now we can
remove the 2G discard size cap and just use UINT_MAX to avoid bi_size
overflow.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-28 09:12:58 +09:00
Tejun Heo
ca0752c5e3 blkcg: don't create "io.stat" on the root cgroup
The stat files on the root cgroup shows stats for the whole system and
usually don't contain any information which isn't available through
the usual system monitoring mechanisms.  Some controllers skip
collecting these duplicate stats to optimize cases where cgroup isn't
used and later try to emulate the result on demand.

This leads to complexities and subtle differences in the information
shown through different channels.  This is entirely unnecessary and
cgroup v2 is dropping stat files which are duplicate from all
controllers.  This patch removes "io.stat" from the root hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
2015-10-22 17:58:26 +09:00
Ming Lei
cfd0c552a8 blk-mq: mark ctx as pending at batch in flush plug path
Most of times, flush plug should be the hottest I/O path,
so mark ctx as pending after all requests in the list are
inserted.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:58 -06:00
Ming Lei
676d06077f blk-mq: fix for trace_block_plug()
The trace point is for tracing plug event of each request
queue instead of each task, so we should check the request
count in the plug list from current queue instead of
current task.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:56 -06:00
Ming Lei
7460d389c0 block: check bio_mergeable() early before merging
After bio splitting is introduced, one bio can be splitted
and it is marked as NOMERGE because it is too fat to be merged,
so check bio_mergeable() earlier to avoid to try to merge it
unnecessarily.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:54 -06:00
Ming Lei
e18378a60e blk-mq: check bio_mergeable() early before merging
It isn't necessary to try to merge the bio which is marked
as NOMERGE.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:53 -06:00
Ming Lei
6ac45aeb6b block: avoid to merge splitted bio
The splitted bio has been already too fat to merge, so mark it
as NOMERGE.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:51 -06:00
Ming Lei
bdced438ac block: setup bi_phys_segments after splitting
The number of bio->bi_phys_segments is always obtained
during bio splitting, so it is natural to setup it
just after bio splitting, then we can avoid to compute
nr_segment again during merge.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:50 -06:00
Jeff Moyer
0809e3ac62 block: fix plug list flushing for nomerge queues
Request queues with merging disabled will not flush the plug list after
BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
on blk_attempt_plug_merge to compute the request_count.  Fix this by
computing the number of queued requests even for nomerge queues.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:48 -06:00
Christoph Hellwig
bbd3e06436 block: add an API for Persistent Reservations
This commits adds a driver API and ioctls for controlling Persistent
Reservations s/genericly/generically/ at the block layer.  Persistent
Reservations are supported by SCSI and NVMe and allow controlling who gets
access to a device in a shared storage setup.

Note that we add a pr_ops structure to struct block_device_operations
instead of adding the members directly to avoid bloating all instances
of devices that will never support Persistent Reservations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:46:56 -06:00
Christoph Hellwig
d8e4bb8103 block: cleanup blkdev_ioctl
Split out helpers for all non-trivial ioctls to make this function simpler,
and also start passing around a pointer version of the argument, as that's
what most ioctl handlers actually need.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:46:55 -06:00