Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths
@@ -0,0 +1,126 @@
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member.  Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member the I/O will be forwarded as required.  This
forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.
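
To make the scale concrete, the following back-of-the-envelope sketch
counts how many table lines a one-line-per-region mapping would need.
The LUN size (15 TiB) and region size (15 MiB) are illustrative
assumptions, not values taken from any particular array.

```python
# Rough count of dm table lines needed if every address region had its
# own table line.  Both sizes below are assumptions chosen purely for
# illustration.
TIB = 1 << 40
MIB = 1 << 20

lun_bytes = 15 * TIB      # hypothetical LUN size
region_bytes = 15 * MIB   # hypothetical region size ("tens of MBs")

# One table line per region, rounding up for a partial final region.
entries = -(-lun_bytes // region_bytes)
print(entries)
```

Under these assumed sizes the count already exceeds one million lines,
each of which would need its own target instance in the table.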

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier – Determine which array member the I/O should be sent to.
    Lower Tier – Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
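
The density argument can be illustrated with a small sketch of a packed
region table.  This is not the kernel implementation: the class and
function names are hypothetical, and a Python integer stands in for the
packed bit array.

```python
def bits_per_entry(num_paths):
    """Smallest bit width able to hold path numbers 0..num_paths-1."""
    return max(1, (num_paths - 1).bit_length())


class RegionTable:
    """Region-to-path table with fixed-width entries packed end to end."""

    def __init__(self, num_regions, num_paths):
        self.num_regions = num_regions
        self.num_paths = num_paths
        self.bits = bits_per_entry(num_paths)
        self.mask = (1 << self.bits) - 1
        self.table = 0  # one large integer used as a bit field

    def _check(self, region):
        if not 0 <= region < self.num_regions:
            raise ValueError("region out of range")

    def set(self, region, path):
        self._check(region)
        if not 0 <= path < self.num_paths:
            raise ValueError("path out of range")
        shift = region * self.bits
        # Clear the old entry, then store the new path number.
        self.table = (self.table & ~(self.mask << shift)) | (path << shift)

    def get(self, region):
        self._check(region)
        return (self.table >> (region * self.bits)) & self.mask

    def size_bytes(self):
        """Storage needed for the packed entries, rounded up to bytes."""
        return (self.num_regions * self.bits + 7) // 8


table = RegionTable(num_regions=1 << 20, num_paths=16)
table.set(5, 9)
print(table.bits, table.get(5), table.size_bytes())
```

With 16 paths each entry needs 4 bits, so about a million regions fit
in half a mebibyte of packed entries in this sketch.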

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region.  Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments.  Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors).  This number is added to the sector number when
    forwarding the request to the specific path.  Typically it is zero.
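
As a sketch of how these parameters combine into one table line, the
helper below assembles the string that would be handed to
"dmsetup create --table".  The function name and the device length used
in the call are hypothetical; the leading "0 <length> switch" fields
follow the usual device-mapper table layout of start sector, length in
sectors and target name.

```python
def switch_table(length_sectors, region_size, paths):
    """Build a dm-switch table line.

    paths is a list of (dev_path, offset) tuples, one per path.
    """
    if not paths:
        raise ValueError("at least one path is required")
    # <start> <length> switch <num_paths> <region_size> <num_optional_args>
    fields = [0, length_sectors, "switch", len(paths), region_size, 0]
    for dev_path, offset in paths:
        fields += [dev_path, offset]
    return " ".join(str(field) for field in fields)


line = switch_table(
    2097152,  # hypothetical device length in 512-byte sectors
    128,      # region size in sectors
    [("/dev/vg1/switch0", 0), ("/dev/vg1/switch1", 0), ("/dev/vg1/switch2", 0)],
)
print(line)
```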

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).
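
The message syntax can be sketched with a small encoder that writes
both numbers in unprefixed hexadecimal and omits the index whenever it
is simply the previous index plus one.  The function name is
hypothetical and the helper only builds the message string; it does not
send anything to the kernel.

```python
def region_mappings_message(mappings):
    """Encode (region_index, path_nr) pairs as a set_region_mappings message."""
    args = []
    previous = None
    for index, path_nr in mappings:
        if previous is not None and index == previous + 1:
            # Consecutive region: the index may be left out.
            args.append(":%x" % path_nr)
        else:
            args.append("%x:%x" % (index, path_nr))
        previous = index
    return "set_region_mappings " + " ".join(args)


print(region_mappings_message([(10, 15), (11, 1), (255, 2)]))
```

Here regions 10 and 11 are consecutive, so the second argument drops
its index, while region 255 must be given explicitly.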

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2, all
of the same size.

Create a switch device with a 64KiB region size (128 sectors of 512 bytes):
    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
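
To check one's reading of the example, the sketch below works out which
underlying volume would receive a given sector once the seven mappings
above are in place.  It models only the regions set by the message; the
function name is hypothetical.

```python
REGION_SIZE = 128  # sectors per region, as in the table line above

DEVICES = ["/dev/vg1/switch0", "/dev/vg1/switch1", "/dev/vg1/switch2"]

# Path numbers for regions 0..6, as set by the message above.
MAPPINGS = [0, 1, 2, 0, 1, 2, 1]


def device_for_sector(sector):
    """Return the volume serving a sector, for the seven mapped regions."""
    region = sector // REGION_SIZE
    if region >= len(MAPPINGS):
        raise LookupError("region %d is not set by the example" % region)
    return DEVICES[MAPPINGS[region]]


for sector in (0, 127, 128, 645, 768):
    print(sector, device_for_sector(sector))
```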