Commit Graph

178 Commits

Author SHA1 Message Date
Linus Torvalds 10f36b1e80 Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - Online capacity resizing (Balbir)

 - Number of hardware queue change fixes (Bart)

 - null_blk fault injection addition (Bart)

 - Cleanup of queue allocation, unifying the node/no-node API
   (Christoph)

 - Cleanup of genhd, moving code to where it makes sense (Christoph)

 - Cleanup of the partition handling code (Christoph)

 - disk stat fixes/improvements (Konstantin)

 - BFQ improvements (Paolo)

 - Various fixes and improvements

* tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits)
  block: return NULL in blk_alloc_queue() on error
  block: move bio_map_* to blk-map.c
  Revert "blkdev: check for valid request queue before issuing flush"
  block: simplify queue allocation
  bcache: pass the make_request methods to blk_queue_make_request
  null_blk: use blk_mq_init_queue_data
  block: add a blk_mq_init_queue_data helper
  block: move the ->devnode callback to struct block_device_operations
  block: move the part_stat* helpers from genhd.h to a new header
  block: move block layer internals out of include/linux/genhd.h
  block: move guard_bio_eod to bio.c
  block: unexport get_gendisk
  block: unexport disk_map_sector_rcu
  block: unexport disk_get_part
  block: mark part_in_flight and part_in_flight_rw static
  block: mark block_depr static
  block: factor out requeue handling from dispatch code
  block/diskstats: replace time_in_queue with sum of request times
  block/diskstats: accumulate all per-cpu counters in one pass
  block/diskstats: more accurate approximation of io_ticks for slow disks
  ...
2020-03-30 11:20:13 -07:00
Christoph Hellwig 348e114bbd block: move the ->devnode callback to struct block_device_operations
There really isn't any good reason to stash a method directly into
struct gendisk.  Move it together with the other block device
operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 09:50:05 -06:00
Christoph Hellwig c6a564ffad block: move the part_stat* helpers from genhd.h to a new header
These macros are just used by a few files.  Move them out of genhd.h,
which is included everywhere into a new standalone header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:09 -06:00
Christoph Hellwig 581e26004a block: move block layer internals out of include/linux/genhd.h
None of this needs to be exposed to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig 6005771c17 block: mark part_in_flight and part_in_flight_rw static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig 31eb618679 block: mark block_depr static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Konstantin Khlebnikov 8cd5b8fc00 block/diskstats: replace time_in_queue with sum of request times
Column "time_in_queue" in diskstats is supposed to show total waiting time
of all requests. I.e. value should be equal to the sum of times from other
columns. But this is not true, because column "time_in_queue" is counted
separately in jiffies rather than in nanoseconds as other times.

This patch removes redundant counter for "time_in_queue" and shows total
time of read, write, discard and flush requests.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 08:49:12 -06:00
Konstantin Khlebnikov ea18e0f0a6 block/diskstats: accumulate all per-cpu counters in one pass
Reading /proc/diskstats iterates over all cpus for summing each field.
It's faster to sum all fields in one pass.

Hammering /proc/diskstats with fio shows 2x performance improvement:

fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_report

	  JOBS=1	JOBS=10
Before:	  7k iops	64k iops
After:	 18k iops      120k iops

Also this way code is more compact:

add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
Function                                     old     new   delta
part_stat_read_all                             -     194    +194
diskstats_show                              1344     631    -713
part_stat_show                              1219     392    -827
Total: Before=14966947, After=14965601, chg -0.01%

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 08:49:10 -06:00
Konstantin Khlebnikov 2b8bd42361 block/diskstats: more accurate approximation of io_ticks for slow disks
Currently io_ticks is approximated by adding one at each start and end of
requests if jiffies counter has changed. This works perfectly for requests
shorter than a jiffy or if one of requests starts/ends at each jiffy.

If disk executes just one request at a time and they are longer than two
jiffies then only first and last jiffies will be accounted.

Fix is simple: at the end of request add up into io_ticks jiffies passed
since last update rather than just one jiffy.

Example: common HDD executes random read 4k requests around 12ms.

fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb

Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:

Before:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43

After:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99

Now io_ticks does not loose time between start and end of requests, but
for queue-depth > 1 some I/O time between adjacent starts might be lost.

For load estimation "%util" is not as useful as average queue length,
but it clearly shows how often disk queue is completely empty.

Fixes: 5b18b5a737 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 08:49:08 -06:00
Christoph Hellwig 3f4fc59c13 block: move the various x86 Unix label formats out of genhd.h
All these are just used in block/partitions/msdos.c, so move them out of the
genhd.h driver included by every driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:08 -06:00
Christoph Hellwig 0226e9ead4 block: move the *_PARTITION enum out of genhd.h
The enum containing the *_PARTITION symbolic names is only relevant
for the partition parser.  More specifically most values are MSDOS
partition table system indicators and thus should go straight into
msdos.c.  One value is only used by the sun partition parser, and the
sun and sgi partition parsers use the same value as the x86 Linux
RAID indicator to also indicate RAID autodetection.  Duplicate them
in sun.c and sgi.c given that the different partition types use
entirely different values otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:08 -06:00
Christoph Hellwig 1442f76d43 block: move struct partition out of genhd.h
struct partition is the on-disk format of a MSDOS partition table entry.
Move it out of genhd.h into a new msdos_partition.h header and give it
a msdos_ prefix to avoid confusion.
Also move the magic number from block/partitions/msdos.h to the new
header so that it can be used by the SCSI drivers looking at the DOS
partition tables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:08 -06:00
Christoph Hellwig f17c21c1ec block: remove alloc_part_info and free_part_info
There isn't any good reason not to simply open code the allocation and
freeing of the partition_meta_info structure.  Especially as one of
the branches in alloc_part_info is entirely dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00
Christoph Hellwig 3ad5cee5cd block: move sysfs methods shared by disks and partitions to genhd.c
Move the sysfs _show methods that are used both on the full disk and
partition nodes to genhd.c instead of hiding them in the partitioning
code.  Also move the declaration for these methods to block/blk.h so
that we don't expose them to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00
Balbir Singh e598a72fae block/genhd: Notify udev about capacity change
Allow block/genhd to notify user space (via udev) about disk size changes
using a new helper set_capacity_revalidate_and_notify(), which is a wrapper
on top of set_capacity(). set_capacity_revalidate_and_notify() will only
notify via udev if the current capacity or the target capacity is not zero
and iff the capacity changes.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Someswarudu Sangaraju <ssomesh@amazon.com>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-18 15:13:21 -06:00
Shin'ichiro Kawasaki b53df2e744 block: Fix partition support for host aware zoned block devices
Commit b72053072c ("block: allow partitions on host aware zone
devices") introduced the helper function disk_has_partitions() to check
if a given disk has valid partitions. However, since this function result
directly depends on the disk partition table length rather than the
actual existence of valid partitions in the table, it returns true even
after all partitions are removed from the disk. For host aware zoned
block devices, this results in zone management support to be kept
disabled even after removing all partitions.

Fix this by changing disk_has_partitions() to walk through the partition
table entries and return true if and only if a valid non-zero size
partition is found.

Fixes: b72053072c ("block: allow partitions on host aware zone devices")
Cc: stable@vger.kernel.org # 5.5
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-12 07:54:39 -06:00
Stephen Kitt 9243c6f3e0 block: Document genhd capability flags
The kernel documentation includes a brief section about genhd
capabilities, but it turns out that the only documented
capability (GENHD_FL_MEDIA_CHANGE_NOTIFY) isn't used any more.

This patch removes that flag, and documents the rest, based on my
understanding of the current uses of these flags in the kernel. The
documentation is kept in the header file, alongside the declarations,
in the hope that it will be kept up-to-date in future; the kernel
documentation is changed to include the documentation generated from
the header file.

Because the ultimate goal is to provide some end-user
documentation (or end-administrator documentation), the comments are
perhaps more user-oriented than might be expected. Since the values
are shown to users in hexadecimal, the documentation lists them in
hexadecimal, and the constant declarations are adjusted to match.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Stephen Kitt <steve@sk2.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-12 07:47:22 -06:00
Linus Torvalds c677124e63 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...
2020-01-28 10:07:09 -08:00
Christoph Hellwig b72053072c block: allow partitions on host aware zone devices
Host-aware SMR drives can be used with the commands to explicitly manage
zone state, but they can also be used as normal disks.  In the former
case it makes perfect sense to allow partitions on them, in the latter
it does not, just like for host managed devices.  Add a check to
add_partition to allow partitions on host aware devices, but give
up any zone management capabilities in that case, which also catches
the previously missed case of adding a partition vs just scanning it.

Because sd can rescan the attribute at runtime it needs to check if
a disk has partitions, for which a new helper is added to genhd.h.

Fixes: 5eac3eb30c ("block: Remove partition support for zoned block devices")
Reported-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-26 09:59:08 -07:00
Thomas Gleixner 2496396fcb sched/rt, fs: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the i_size() and part_nr_sects_…() code over to use
CONFIG_PREEMPTION. Update the comment for fsstack_copy_inode_size() also
to refer to CONFIG_PREEMPTION.

[bigeasy: +PREEMPT comments]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-24-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:36 +01:00
Christoph Hellwig 142fe8f4bb block: fix bdev_disk_changed for non-partitioned devices
We still have to set the capacity to 0 if invalidating or call
revalidate_disk if not even if the disk has no partitions.  Fix
that by merging rescan_partitions into bdev_disk_changed and just
stubbing out blk_add_partitions and blk_drop_partitions for
non-partitioned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:53 -07:00
Christoph Hellwig a1548b6744 block: move rescan_partitions to fs/block_dev.c
Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:21 -07:00
Christoph Hellwig 6917d06899 block: merge invalidate_partitions into rescan_partitions
A lot of the logic in invalidate_partitions and rescan_partitions is
shared.  Merge the two functions to simplify things.  There is a small
behavior change in that we now send the kevent change notice also if we
were not invalidating but no partitions were found, which seems like
the right thing to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:42:41 -07:00
Yufen Yu 6fcc44d1d7 block: fix use-after-free on gendisk
commit 2da78092dd "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 09:48:12 -06:00
Martin Wilck c92e2f04b3 block: disk_events: introduce event flags
Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.

The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.

This patch doesn't change behavior.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12 13:35:24 -06:00