Commit Graph

157 Commits

Author SHA1 Message Date
Linus Torvalds 3645e6d0dc Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "This update mainly fixes bugs:

   - Make raid5 ppl support several ppl from Pawel

   - Several raid5-cache bug fixes from Song

   - Bitmap fixes from Neil and Me

   - One raid1/10 regression fix since 4.12 from Me

   - Other small fixes and cleanup"

* tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/bitmap: disable bitmap_resize for file-backed bitmaps.
  raid5-ppl: Recovery support for multiple partial parity logs
  md: Runtime support for multiple ppls
  md/raid0: attach correct cgroup info in bio
  lib/raid6: align AVX512 constants to 512 bits, not bytes
  raid5: remove raid5_build_block
  md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
  md: replace seq_release_private with seq_release
  md: notify about new spare disk in the container
  md/raid1/10: reset bio allocated from mempool
  md/raid5: release/flush io in raid5_do_work()
  md/bitmap: copy correct data for bitmap super
2017-09-07 12:41:48 -07:00
Pawel Baldysiak ddc088238c md: Runtime support for multiple ppls
Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger then space left in a buffer, rewind the buffer to the
start (don't wrap it).

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28 07:45:48 -07:00
Shaohua Li 8a8e6f84ad md/raid0: attach correct cgroup info in bio
The discard bio doesn't attach the original bio cgroup info. Normal bio
is cloned, so is fine.

Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25 10:21:48 -07:00
Christoph Hellwig 74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
NeilBrown cc27b0c78c md: fix deadlock between mddev_suspend() and md_write_start()
If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix <nix@esperi.org.uk>
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-13 10:18:01 -07:00
Shaohua Li 29efc390b9 md/md0: optimize raid0 discard handling
There are complaints that raid0 discard handling is slow. Currently we
divide discard request into chunks and dispatch to underlayer disks. The
block layer will do merge to form big requests. This causes a lot of
request split/merge and uses significant CPU time.

A simple idea is to calculate the range for each raid disk for an IO
request and send a discard request to raid disks, which will avoid the
split/merge completely. Previously Coly tried the approach, but the
implementation was too complex because of raid0 zones. This patch always
split bio in zone boundary and handle bio within one zone. It simplifies
the implementation a lot.

Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-08 21:18:03 -07:00
Shaohua Li e265eb3a30 Merge branch 'md-next' into md-linus 2017-05-01 14:09:21 -07:00
NeilBrown f00d7c85be md/raid0: fix up bio splitting.
raid0_make_request() should use a private bio_set rather than the
shared fs_bio_set, which is only meant for filesystems to use.

raid0_make_request() shouldn't loop around using the bio_set
multiple times as that can deadlock.

So use mddev->bio_set and pass the tail to generic_make_request()
instead of looping on it.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:18:09 -07:00
Christoph Hellwig 3deff1a70d md: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Artur Paszkiewicz ea0213e0c7 md: superblock changes for PPL
Include information about PPL location and size into mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:53 -07:00
Linus Torvalds a682e00354 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull md updates from Shaohua Li:
 "Mainly fixes bugs and improves performance:

   - Improve scalability for raid1 from Coly

   - Improve raid5-cache read performance, disk efficiency and IO
     pattern from Song and me

   - Fix a race condition of disk hotplug for linear from Coly

   - A few cleanup patches from Ming and Byungchul

   - Fix a memory leak from Neil

   - Fix WRITE SAME IO failure from me

   - Add doc for raid5-cache from me"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
  md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
  md/raid1: handle flush request correctly
  md/linear: shutup lockdep warnning
  md/raid1: fix a use-after-free bug
  RAID1: avoid unnecessary spin locks in I/O barrier code
  RAID1: a new I/O barrier implementation to remove resync window
  md/raid5: Don't reinvent the wheel but use existing llist API
  md: fast clone bio in bio_clone_mddev()
  md: remove unnecessary check on mddev
  md/raid1: use bio_clone_bioset_partial() in case of write behind
  md: fail if mddev->bio_set can't be created
  block: introduce bio_clone_bioset_partial()
  md: disable WRITE SAME if it fails in underlayer disks
  md/raid5-cache: exclude reclaiming stripes in reclaim check
  md/raid5-cache: stripe reclaim only counts valid stripes
  MD: add doc for raid5-cache
  Documentation: move MD related doc into a separate dir
  md: ensure md devices are freed before module is unloaded.
  md/r5cache: improve journal device efficiency
  md/r5cache: enable chunk_aligned_read with write back cache
  ...
2017-02-24 14:42:19 -08:00
Shaohua Li 26483819f8 md: disable WRITE SAME if it fails in underlayer disks
This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Suggested-by: NeilBrown <neilb@suse.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 19:24:16 -08:00
Jan Kara dc3b17cc8b block: Use pointer to backing_dev_info from request_queue
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:20:48 -07:00
Shaohua Li 394ed8e474 md: cleanup mddev flag clear for takeover
Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:45:18 -08:00
Shaohua Li 6995f0b247 md: takeover should clear unrelated bits
When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
will be accidentally set, but raid5 doesn't support it. The same is true
for the MD_HAS_JOURNAL bit.

Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:00:11 -08:00
NeilBrown 109e376530 md: add block tracing for bio_remapping
The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.

When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the originial bio,
or a secondary mapping for the bio that errors.  This patch uses
the second option.

When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 09:32:50 -08:00
NeilBrown 766038846e md/raid0: replace printk() with pr_*()
This makes md/raid0 much less verbose as the messages about
the array geometry are now pr_debug()

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 796a5cf083 md: use bio op accessors
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Michał Pecio b297874a2d md/raid0: remove empty line printk from dump_zones
Remove the final printk. All preceding output is already properly
newline-terminated and the printk isn't even KERN_CONT to begin with,
so it only adds one empty line to the log.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-25 08:43:58 -07:00
Dan Carpenter 7dedd15dd2 md/raid0: fix uninitialized variable bug
If this function fails the callers expect that *private_conf is set to
an ERR_PTR() but that isn't true for the first error path where we can't
allocate "conf".  It leads to some uninitialized variable bugs.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-14 09:57:59 -07:00
Masanari Iida e3d132d123 treewide: Fix typos in printk
This patch fix multiple spelling typos found in
various part of kernel.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-08 14:59:19 +01:00
NeilBrown 66eefe5de1 md/raid0: apply base queue limits *before* disk_stack_limits
Calling e.g. blk_queue_max_hw_sectors() after calls to
disk_stack_limits() discards the settings determined by
disk_stack_limits().
So we need to make those calls first.

Fixes: 199dc6ed51 ("md/raid0: update queue parameter in a safer location.")
Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed51).
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown e89c6fdf9e Merge linux-block/for-4.3/core into md/for-linux
There were a few conflicts that are fairly easy to resolve.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-05 11:08:32 +02:00