Commit Graph

154 Commits

Author SHA1 Message Date
Linus Torvalds 33c8846c81 Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - mq-deadline accounting improvements (Bart)

 - blk-wbt timer fix (Andrea)

 - Untangle the block layer includes (Christoph)

 - Rework the poll support to be bio based, which will enable adding
   support for polling for bio based drivers (Christoph)

 - Block layer core support for multi-actuator drives (Damien)

 - blk-crypto improvements (Eric)

 - Batched tag allocation support (me)

 - Request completion batching support (me)

 - Plugging improvements (me)

 - Shared tag set improvements (John)

 - Concurrent queue quiesce support (Ming)

 - Cache bdev in ->private_data for block devices (Pavel)

 - bdev dio improvements (Pavel)

 - Block device invalidation and block size improvements (Xie)

 - Various cleanups, fixes, and improvements (Christoph, Jackie,
   Masahira, Tejun, Yu, Pavel, Zheng, me)

* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
  blk-mq-debugfs: Show active requests per queue for shared tags
  block: improve readability of blk_mq_end_request_batch()
  virtio-blk: Use blk_validate_block_size() to validate block size
  loop: Use blk_validate_block_size() to validate block size
  nbd: Use blk_validate_block_size() to validate block size
  block: Add a helper to validate the block size
  block: re-flow blk_mq_rq_ctx_init()
  block: prefetch request to be initialized
  block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
  block: add rq_flags to struct blk_mq_alloc_data
  block: add async version of bio_set_polled
  block: kill DIO_MULTI_BIO
  block: kill unused polling bits in __blkdev_direct_IO()
  block: avoid extra iter advance with async iocb
  block: Add independent access ranges support
  blk-mq: don't issue request directly in case that current is to be blocked
  sbitmap: silence data race warning
  blk-cgroup: synchronize blkg creation against policy deactivation
  block: refactor bio_iov_bvec_set()
  block: add single bio async direct IO helper
  ...
2021-11-01 09:19:50 -07:00
Christoph Hellwig ccdf774189 mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>
Move inode_to_bdi out of line to avoid having to include blkdev.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Matthew Wilcox (Oracle) bd3488e7b4 mm/writeback: Rename __add_wb_stat() to wb_stat_mod()
Make this look like the newly renamed vmstat functions.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18 07:49:39 -04:00
Linus Torvalds 14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Jan Kara fee468fdf4 writeback: reliably update bandwidth estimation
Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback().  However neither of these
need to trigger when the system is relatively idle and writeback is
triggered e.g.  from fsync(2).  Make sure writeback estimates happen
reliably by triggering them from do_writepages().

Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Christoph Hellwig a11d7fc2d0 block: remove the bd_bdi in struct block_device
Just retrieve the bdi from the disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 11:53:26 -06:00
Christoph Hellwig f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 823423ef55 bdi: invert BDI_CAP_NO_ACCT_WB
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious.  Also remove the pointless
bdi_cap_account_writeback wrapper that just obsfucates the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig a8b456d01c bdi: remove BDI_CAP_SYNCHRONOUS_IO
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decided if ->rw_page can be used on a block device.  Just check up for
the method instead.  The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig ed7b6b4f6e bdi: remove BDI_CAP_CGROUP_WRITEBACK
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 21cf866145 writeback: remove bdi->congested_fn
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures.  And
pktdvd is a deprecated driver that isn't useful in stack setup
either.  So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:20:46 -06:00
Christoph Hellwig 8c911f3d4c writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Christoph Hellwig aef33c2ff8 bdi: simplify bdi_alloc
Merge the _node vs normal version and drop the superflous gfp_t argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig 3c5d202b55 bdi: remove bdi_register_owner
Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig eb7ae5e06b bdi: move bdi_dev_name out of line
bdi_dev_name is not a fast path function, move it out of line.  This
prepares for using it from modular callers without having to export
an implementation detail like bdi_unknown_name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 08:45:47 -06:00
Theodore Ts'o 68f23b8906 memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.

With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi.  There is a refcount which prevents
the bdi object from being released (and hence, unregistered).  So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).

Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly.  It does
this without informing the file system, or the memcg code, or anything
else.  This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown.  So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk().  As a result, *boom*.

Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.

The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on.  It was triggering
several times a day in a heavily loaded production environment.

Google Bug Id: 145475544

Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:36 -08:00
Tejun Heo ed288dc0d4 writeback: Separate out wb_get_lookup() from wb_get_create()
Separate out wb_get_lookup() which doesn't try to create one if there
isn't already one from wb_get_create().  This will be used by later
patches.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 09:22:38 -06:00
Tejun Heo 34f8fe501f bdi: Add bdi->id
There currently is no way to universally identify and lookup a bdi
without holding a reference and pointer to it.  This patch adds an
non-recycling bdi->id and implements bdi_get_by_id() which looks up
bdis by their ids.  This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity and because while rb_tree does
support rcu assignment it doesn't seem to guarantee lossless walk when
walk is racing aginst tree rebalance operations.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 09:22:38 -06:00
Tejun Heo 5b9cce4c7e writeback: Generalize and expose wb_completion
wb_completion is used to track writeback completions.  We want to use
it from memcg side for foreign inode flushes.  This patch updates it
to remember the target waitq instead of assuming bdi->wb_waitq and
expose it outside of fs-writeback.c.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 09:22:38 -06:00
Tejun Heo d3f77dfdc7 blkcg: implement REQ_CGROUP_PUNT
When a shared kthread needs to issue a bio for a cgroup, doing so
synchronously can lead to priority inversions as the kthread can be
trapped waiting for that cgroup.  This patch implements
REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
to a dedicated per-blkcg work item to avoid such priority inversions.

This will be used to fix priority inversions in btrfs compression and
should be generally useful as we grow filesystem support for
comprehensive IO control.

Cc: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-10 09:00:57 -06:00
Greg Thelen a9519defc7 writeback: fix inode cgroup switching comment
Commit 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.

Switch the comments to inode_switch_wbs_work_fn().

Link: http://lkml.kernel.org/r/20190305004617.142590-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Sebastian Andrzej Siewior e58dd0de5e bdi: use refcount_t for reference counting instead atomic_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter.  This permits avoiding
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Greg Thelen 2e898e4c0a writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00