There is a bug in 'check_reshape' for raid5.c To checks
that the new minimum number of devices is large enough (which is
good), but it does so also after the reshape has started (bad).
This is bad because
- the calculation is now wrong as mddev->raid_disks has changed
already, and
- it is pointless because it is now too late to stop.
So only perform that test when reshape has not been committed to.
Signed-off-by: NeilBrown <neilb@suse.de>
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
If we write to a known-bad-block it will be flags as having
a ReadError by analyse_stripe, but the write will proceed anyway
(as it should). Then the read-error handling will kick in an
write again, then re-read.
We don't need that 'write-again', so set R5_ReWrite so it looks like
it has already been done. Then we will just get the re-read, which we
want.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As the function call is the most expensive of these tests it should be
done later in the chain so that it can be avoided in some cases.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This reverts commit 3a366e614d.
Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.
Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).
The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."
Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull md fixes from NeilBrown:
"A few bugfixes for md
- recent regressions in raid5
- recent regressions in dmraid
- a few instances of CONFIG_MULTICORE_RAID456 linger
Several tagged for -stable"
* tag 'md-3.9-fixes' of git://neil.brown.name/md:
md: remove CONFIG_MULTICORE_RAID456 entirely
md/raid5: ensure sync and DISCARD don't happen at the same time.
MD: Prevent sysfs operations on uninitialized kobjects
MD RAID5: Avoid accessing gendisk or queue structs when not available
md/raid5: schedule_construction should abort if nothing to do.
Had to shuffle the code around a bit (where bi_rw and bi_end_io were
set), but shouldn't really be anything tricky here
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
A number of problems can occur due to races between
resync/recovery and discard.
- if sync_request calls handle_stripe() while a discard is
happening on the stripe, it might call handle_stripe_clean_event
before all of the individual discard requests have completed
(so some devices are still locked, but not all).
Since commit ca64cae960
md/raid5: Make sure we clear R5_Discard when discard is finished.
this will cause R5_Discard to be cleared for the parity device,
so handle_stripe_clean_event() will not be called when the other
devices do become unlocked, so their ->written will not be cleared.
This ultimately leads to a WARN_ON in init_stripe and a lock-up.
- If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
time for resync, it can lead to s->uptodate being less than disks
in handle_parity_checks5(), which triggers a BUG (because it is
one).
So:
- keep R5_Discard on the parity device until all other devices have
completed their discard request
- make sure we don't try to have a 'discard' and a 'sync' action at
the same time.
This involves a new stripe flag to we know when a 'discard' is
happening, and the use of R5_Overlap on the parity disk so when a
discard is wanted while a sync is active, so we know to wake up
the discard at the appropriate time.
Discard support for RAID5 was added in 3.7, so this is suitable for
any -stable kernel since 3.7.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID5: Fix kernel oops when RAID4/5/6 is used via device-mapper
Commit a9add5d (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
However, when device-mapper is used to create RAID4/5/6 arrays, the
mddev->gendisk and mddev->queue fields are not setup. Therefore, calling
things like trace_block_bio_remap will cause a kernel oops. This patch
conditionalizes those calls on whether the proper fields exist to make
the calls. (Device-mapper will call trace_block_bio_remap on its own.)
This patch is suitable for the 3.8.y stable kernel.
Cc: stable@vger.kernel.org (v3.8+)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit 1ed850f356
md/raid5: make sure to_read and to_write never go negative.
It has been possible for handle_stripe_dirtying to be called
when there isn't actually any work to do.
It then calls schedule_reconstruction() which will set R5_LOCKED
on the parity block(s) even when nothing else is happening.
This then causes problems in do_release_stripe().
So add checks to schedule_reconstruction() so that if it doesn't
find anything to do, it just aborts.
This bug was introduced in v3.7, so the patch is suitable
for -stable kernels since then.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md updates from NeilBrown:
"Mostly little bugfixes.
Only "feature" is a new RAID10 layout which slightly improves the
number of sets of devices that can concurrently fail, without data
loss."
* tag 'md-3.9' of git://neil.brown.name/md:
md: expedite metadata update when switching read-auto -> active
md: remove CONFIG_MULTICORE_RAID456
md/raid1,raid10: fix deadlock with freeze_array()
md/raid0: improve error message when converting RAID4-with-spares to RAID0
md: raid0: fix error return from create_stripe_zones.
md: fix two bugs when attempting to resize RAID0 array.
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
MD RAID10: Minor non-functional code changes
md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
md: protect against crash upon fsync on ro array
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This doesn't seem to actually help and we have an alternate
multi-threading approach waiting in the wings, so just get
rid of this config option and associated code.
As a bonus, we remove one use of CONFIG_EXPERIMENTAL
Cc: Dan Williams <djbw@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...