linear_make_request() uses fs_bio_set, which is meant for filesystems
to use, and loops, possible allocating from the same bio set multiple
times.
These behaviors can theoretically cause deadlocks, though as
linear requests are hardly ever split, it is unlikely in practice.
Change to use mddev->bio_set - otherwise unused for linear, and submit
the tail of a split request to generic_make_request() for it to
handle.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull md updates from Shaohua Li:
"Mainly fixes bugs and improves performance:
- Improve scalability for raid1 from Coly
- Improve raid5-cache read performance, disk efficiency and IO
pattern from Song and me
- Fix a race condition of disk hotplug for linear from Coly
- A few cleanup patches from Ming and Byungchul
- Fix a memory leak from Neil
- Fix WRITE SAME IO failure from me
- Add doc for raid5-cache from me"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
md/raid1: handle flush request correctly
md/linear: shutup lockdep warnning
md/raid1: fix a use-after-free bug
RAID1: avoid unnecessary spin locks in I/O barrier code
RAID1: a new I/O barrier implementation to remove resync window
md/raid5: Don't reinvent the wheel but use existing llist API
md: fast clone bio in bio_clone_mddev()
md: remove unnecessary check on mddev
md/raid1: use bio_clone_bioset_partial() in case of write behind
md: fail if mddev->bio_set can't be created
block: introduce bio_clone_bioset_partial()
md: disable WRITE SAME if it fails in underlayer disks
md/raid5-cache: exclude reclaiming stripes in reclaim check
md/raid5-cache: stripe reclaim only counts valid stripes
MD: add doc for raid5-cache
Documentation: move MD related doc into a separate dir
md: ensure md devices are freed before module is unloaded.
md/r5cache: improve journal device efficiency
md/r5cache: enable chunk_aligned_read with write back cache
...
Commit 03a9e24(md linear: fix a race between linear_add() and
linear_congested()) introduces the warnning.
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.
We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.
Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581
Suggested-by: NeilBrown <neilb@suse.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.
There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend(). After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.
Here I explain how the race still exists in current code. For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.
Now I use a possible code execution time sequence to demo how the possible
race happens,
seq linear_add() linear_congested()
0 conf=mddev->private
1 oldconf=mddev->private
2 mddev->raid_disks++
3 for (i=0; i<mddev->raid_disks;i++)
4 bdev_get_queue(conf->disks[i].rdev->bdev)
5 mddev->private=newconf
In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.
To fix this race, there are 2 parts of modification in the patch,
1) Add 'int raid_disks' in struct linear_conf, as a copy of
mddev->raid_disks. It is initialized in linear_conf(), always being
consistent with pointers number of 'struct dev_info disks[]'. When
iterating conf->disks[] in linear_congested(), use conf->raid_disks to
replace mddev->raid_disks in the for-loop, then NULL pointer deference
will not happen again.
2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
free oldconf memory. Because oldconf may be referenced as mddev->private
in linear_congested(), kfree_rcu() makes sure that its memory will not
be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.
This patch can be applied for kernels since v4.0 after commit:
3be260cc18 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.
Changelog:
- V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
replace rcu_call() in linear_add().
- v2: add RCU stuffs by suggestion from Shaohua and Neil.
- v1: initial effort.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.
When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the originial bio,
or a secondary mapping for the bio that errors. This patch uses
the second option.
When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that the ->stop function only frees the private data,
rename is accordingly.
Also pass in the private pointer as an arg rather than using
mddev->private. This flexibility will be useful in level_store().
Finally, don't clear ->private. It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before ->to_remove was
introduced).
Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>
Each md personality has a 'stop' operation which does two
things:
1/ it finalizes some aspects of the array to ensure nothing
is accessing the ->private data
2/ it frees the ->private data.
All the steps in '1' can apply to all arrays and so can be
performed in common code.
This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.
So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>
The use of 'rcu' to protect accesses to ->private_data so that
the ->private_data could be updated predates the introduction
of mddev_suspend/mddev_resume.
These are a cleaner mechanism for providing stability while
swapping in a new ->private data - it is used by level_store()
to support changing of raid levels.
So get rid of the RCU stuff and just use mddev_suspend, mddev_resume.
As these function call ->quiesce(), we add an empty function for
linear just like for raid0.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.
So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NeilBrown <neilb@suse.de>
There is currently no locking around calls to the 'congested'
bdi function. If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.
So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.
When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock. Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.
Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.
(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>