Commit Graph

3992 Commits

Author SHA1 Message Date
Linus Torvalds 5bed49adfe Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Jianchao Wang f5bbbbe4d6 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:56 -06:00
Jianchao Wang d48ece209f blk-mq: init hctx sched after update ctx and hctx mapping
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:55 -06:00
Chaitanya Kulkarni fcedba42d9 block: remove duplicate initialization
This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-17 12:33:36 -06:00
Chengguang Xu 599d067dd3 block: change return type to bool
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:44:17 -06:00
Maciej S. Szmigiero fc8ebd01de block, bfq: return nbytes and not zero from struct cftype .write() method
The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.

Returning zero from write() syscall makes programs like /bin/echo or bash
spin.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:11:16 -06:00
Paolo Valente f812164869 block, bfq: improve code of bfq_bfqq_charge_time
bfq_bfqq_charge_time contains some lengthy and redundant code. This
commit trims and condenses that code.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:15 -06:00
Paolo Valente d5801088a7 block, bfq: reduce write overcharge
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In constrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. to sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput to reads.

On the opposite end, the lower this parameter, the stabler I/O
control, in the following respect.  The lower this parameter is, the
less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. to when it does reads;
- when other groups do reads, w.r.t. to when they do writes.

The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards.  This commit
introduces the resulting, new value.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:13 -06:00
Paolo Valente e02a0aa26b block, bfq: always update the budget of an entity when needed
When the next child entity to serve changes for a given parent entity,
the budget of that parent entity must be updated accordingly.
Unfortunately, this update is not performed, by mistake, for the
entities that happen to switch from having no child entity to serve,
to having one child entity to serve.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:12 -06:00
Paolo Valente 8a511ba5fe block, bfq: readd missing reset of parent-entity service
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.

Fixes: 9fae8dd59f ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:10 -06:00
Linus Torvalds 73ba2fb33c Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Ming Lei df60f6e835 blk-wbt: fix IO hang in wbt_wait()
On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight
should be updated for tracking this IO.

But commit c1c80384c8 ("block: remove external dependency on wbt_flags")
forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(),
then the inflight counter may not be increased in wbt_wait(), but decreased
in wbt_done() for this kind of IO, so this counter may become negative, then
wbt_wait() may wait forever.

This patch fixes the report in the following link:

	https://marc.info/?l=linux-block&m=153221542021033&w=2

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Cc: Josef Bacik <jbacik@fb.com>
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 11:05:52 -06:00
Jens Axboe b089cfd95d block: don't warn for flush on read-only device
Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.

Reported-by: Stefan Agner <stefan@agner.ch>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 10:52:40 -06:00
Bart Van Assche b86d865cb1 blkcg: Make blkg_root_lookup() work for queues in bypass mode
For legacy queues the only call of blkg_root_lookup() happens after
bypass mode has been enabled. Since blkg_lookup() returns NULL for
queues in bypass mode, modify the blkg_root_lookup() such that it
no longer depends on bypass mode. Rename the function into
blk_queue_root_blkg() as suggested by Tejun.

Suggested-by: Tejun Heo <tj@kernel.org>
Fixes: 6bad9b210a ("blkcg: Introduce blkg_root_lookup()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:41:25 -06:00
Liu Bo 991f61fe7e Blk-throttle: reduce tail io latency when iops limit is enforced
When an application's iops has exceeded its cgroup's iops limit, surely it
is throttled and kernel will set a timer for dispatching, thus IO latency
includes the delay.

However, the dispatch delay which is calculated by the limit and the
elapsed jiffies is suboptimal.  As the dispatch delay is only calculated
once the application's iops is (iops limit + 1), it doesn't need to wait
any longer than the remaining time of the current slice.

The difference can be proved by the following fio job and cgroup iops
setting,
-----
$ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
$ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
$ cat r2.job
[global]
name=fio-rand-read
filename=/dev/nullb1
rw=randread
bs=4k
direct=1
numjobs=1
time_based=1
runtime=60
group_reporting=1

[file1]
size=4G
ioengine=libaio
iodepth=1
rate_iops=50000
norandommap=1
thinktime=4ms
-----

wo patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
    slat (usec): min=10, max=336, avg=27.71, stdev=17.82
    clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
     lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
     | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
     | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
     | 99.99th=[28967]

w/ patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
    slat (usec): min=10, max=155, avg=23.24, stdev=16.79
    clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
     lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
     | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
     | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
     | 99.99th=[12387]

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 12:43:16 -06:00
Bart Van Assche 24ecc35853 block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0000000000000008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Fixes: a063057d7c ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:13:00 -06:00
Bart Van Assche 4cf6324b17 block: Introduce blk_exit_queue()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:12:59 -06:00
Jianchao Wang d263ed9926 blk-mq: count the hctx as active before allocating tag
Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.

To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:34:17 -06:00
Greg Edwards d6c02a9beb block: bvec_nr_vecs() returns value for wrong slab
In commit ed996a52c8 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.

bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value.  Decrement idx before performing the lookup.

Fixes: ed996a52c8 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:25:04 -06:00
Bart Van Assche f7ecb1b109 cfq: Suppress compiler warnings about comparisons
This patch does not change any functionality but avoids that gcc
reports the following warnings when building with W=1:

block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_low_latency_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:13 -06:00
Bart Van Assche 9b4f43460d cfq: Annotate fall-through in a switch statement
This patch avoids that gcc complains about fall-through when building
with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:11 -06:00
Anchal Agarwal 2887e41b91 blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 14:40:49 -06:00
Jens Axboe 05b9ba4b55 Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Linus Torvalds a32e236eb9 Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.

The reason is that the lvm snapshotting will continue to write to the
snapshow COW volume, even after the volume has been marked read-only.
End result: snapshot failure.

This has actually been fixed in newer version of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:

  https://bugzilla.kernel.org/show_bug.cgi?id=200439
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

The lvm2 fix is here

  https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check.  It now allows the
write to happen, but causes a warning, something like this:

  generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
  Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
  CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
  Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
  Workqueue: ksnaphd do_metadata
  RIP: 0010:generic_make_request_checks+0x4ac/0x600
  ...
  Call Trace:
   generic_make_request+0x64/0x400
   submit_bio+0x6c/0x140
   dispatch_io+0x287/0x430
   sync_io+0xc3/0x120
   dm_io+0x1f8/0x220
   do_metadata+0x1d/0x30
   process_one_work+0x1b9/0x3e0
   worker_thread+0x2b/0x3c0
   kthread+0x113/0x130
   ret_from_fork+0x35/0x40

Note that this is a "revert" in behavior only.  I'm leaving alone the
actual code cleanups in commit 721c7fc701, but letting the previously
uncaught request go through with a warning instead of stopping it.

Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-04 18:19:55 -07:00
Linus Torvalds 71abe04269 Merge tag 'for-linus-20180803' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
 "Just a single fix, from Ming, fixing a regression in this cycle where
  the busy tag iteration was changed to only calling the callback
  function for requests that are started. We really want all non-free
  requests.

  This fixes a boot regression on certain VM setups"

* tag 'for-linus-20180803' of git://git.kernel.dk/linux-block:
  blk-mq: fix blk_mq_tagset_busy_iter
2018-08-03 10:43:56 -07:00