If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.
With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.
Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.
I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Track overwrite disk count, so we can know if a stripe is a full stripe write.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.
The default minimum might have made sense many years ago
but the drives have become faster. Changing the default
to match the times isn't really a long term solution.
This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.
Testing shows that:
- for some loads, the resync speed is unchanged. For those loads
increasing the minimum doesn't change the speed either.
So this is a good result. To increase resync speed under such
loads we would probably need to increase the resync window
size.
- for other loads, resync speed does increase to a reasonable
fraction (e.g. 20%) of maximum possible, and throughput of
the load only drops a little bit (e.g. 10%)
- for other loads, throughput of the non-sync load drops quite a bit
more. These seem to be latency-sensitive loads.
So it isn't a perfect solution, but it is mostly an improvement.
Signed-off-by: NeilBrown <neilb@suse.de>
This option is not well justified and testing suggests that
it hardly ever makes any difference.
The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.
So just remove it to simplify reasoning about speed limiting.
This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.
Signed-off-by: NeilBrown <neilb@suse.de>
There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.
So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.
Signed-off-by: NeilBrown <neilb@suse.de>
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:
1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
and copies them into the local bitmap. It does not clear the bitmap
from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.
This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.
This facilitates the removal of failed devices.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This is required by the clustering module (patches to follow) to
find the device to remove or re-add.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Since commit 20d0189b10
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.
This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.
So restore that first arg before re-using the variable.
Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 345.00 0.00 0.00 0.00 0.00 100.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 58779.00 0.00 0.00 0.00 0.00 100.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.00 0.00 0.00 0.00 0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.
Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull networking fixes from David Miller:
1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
previously SACK'd, from Neal Cardwell.
2) Need to hold RNL mutex in ipv4 multicast code namespace cleanup,
from Cong WANG.
3) Similarly we have to hold RNL mutex for fib_rules_unregister(), also
from Cong WANG.
4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.
5) When we encapsulate for a tunnel device, skb->sk still points to the
user socket. So this leads to cases where we retraverse the
ipv4/ipv6 output path with skb->sk being of some other address
family (f.e. AF_PACKET). This can cause things to crash since the
ipv4 output path is dereferencing an AF_PACKET socket as if it were
an ipv4 one.
The short term fix for 'net' and -stable is to elide these socket
checks once we've entered an encapsulation sequence by testing
xmit_recursion.
Longer term we have a better solution wherein we pass the tunnel's
socket down through the output paths, but that is way too invasive
for 'net' and -stable.
From Hannes Frederic Sowa.
6) l2tp_init() failure path forgets to unregister per-net ops, from
Cong WANG.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
net: dsa: fix filling routing table from OF description
l2tp: unregister l2tp_net_ops on failure path
mvneta: dont call mvneta_adjust_link() manually
ipv6: protect skb->sk accesses from recursive dereference inside the stack
netns: don't allocate an id for dead netns
Revert "netns: don't clear nsid too early on removal"
ip6mr: call del_timer_sync() in ip6mr_free_table()
net: move fib_rules_unregister() under rtnl lock
ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
tcp: fix FRTO undo on cumulative ACK of SACKed range
xen-netfront: transmit fully GSO-sized packets
Commit 1daa4303b4 ("net/mlx4_core: Deprecate error message at
ConnectX-2 cards startup to debug") did the deprecation only for port 1
of the card. Need to deprecate for port 2 as well.
Fixes: 1daa4303b4 ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to description in 'include/net/dsa.h', in cascade switches
configurations where there are more than one interconnected devices,
'rtable' array in 'dsa_chip_data' structure is used to indicate which
port on this switch should be used to send packets to that are destined
for corresponding switch.
However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
of the _target_ switch, but not current one.
This commit removes redundant devicetree parsing and adds needed port
number as a function argument. So dsa_of_setup_routing_table() now just
looks for target switch number by parsing parent of 'link' device node.
To remove possible misunderstandings with the way of determining target
switch number, a corresponding comment was added to the source code and
to the DSA device tree bindings documentation file.
This was tested on a custom board with two Marvell 88E6095 switches with
following corresponding routing tables: { -1, 10 } and { 8, -1 }.
Signed-off-by: Pavel Nakonechny <pavel.nakonechny@skitlab.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull input fixes from Dmitry Torokhov:
"Updates for the input subsystem - two more tweaks for ALPS driver to
work out kinks after splitting the touchpad, trackstick, and potential
external PS/2 mouse into separate input devices.
Changes to support ALPS SS4 devices (protocol V8) will be coming in
4.1..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: alps - document stick behavior for protocol V2
Input: alps - report V2 Dualpoint Stick events via the right evdev node
Input: alps - report interleaved bare PS/2 packets via dev3
mvneta_adjust_link() is a callback for of_phy_connect() and should
not be called directly. The result of calling it directly is as below:
Signed-off-by: David S. Miller <davem@davemloft.net>