mirror of
https://github.com/Dasharo/linux.git
synced 2026-03-06 15:25:10 -08:00
dm zoned: drive-managed zoned block device target
The dm-zoned device mapper target provides transparent write access to zoned block devices (ZBC and ZAC compliant block devices). dm-zoned hides from the device user (a file system or an application doing raw block device accesses) any constraint imposed on write requests by the device, equivalent to a drive-managed zoned block device model.

Write requests are processed using a combination of on-disk buffering using the device conventional zones and direct in-place processing for requests aligned to a zone sequential write pointer position. A background reclaim process implemented using dm_kcopyd_copy ensures that conventional zones are always available for executing unaligned write requests. The reclaim process overhead is minimized by managing buffer zones in a least-recently-written order and first targeting the oldest buffer zones. Doing so, blocks under regular write access (such as metadata blocks of a file system) remain stored in conventional zones, resulting in no apparent overhead.

The dm-zoned implementation focuses on simplicity and on minimizing overhead (CPU, memory and storage overhead). For a 14TB host-managed disk with 256 MB zones, dm-zoned memory usage per disk instance is at most about 3 MB and as few as 5 zones will be used internally for storing metadata and performing buffer zone reclaim operations. This is achieved using zone level indirection rather than a full block indirection system for managing block movement between zones.

The primary target of dm-zoned is host-managed zoned block devices, but it can also be used with host-aware device models to mitigate potential device-side performance degradation due to excessive random writing.

Zoned block devices can be formatted and checked for use with the dm-zoned target using the dmzadm utility available at:

https://github.com/hgst/dm-zoned-tools

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to clean up the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This commit is contained in:
committed by
Mike Snitzer
parent
b73c67c2cb
commit
3b1a94c88b
144
Documentation/device-mapper/dm-zoned.txt
Normal file
@@ -0,0 +1,144 @@
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as useable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may be used also for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).
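
To illustrate the 4096-byte block granularity described above, here is a
minimal user-space sketch of the block/sector arithmetic. It assumes
512-byte kernel sectors; all sketch_* and SKETCH_* names are hypothetical
and exist only for this example (the header added by this commit defines
its own dmz_blk2sect and dmz_sect2blk macros). The one-bit-per-block
tracking is an assumption used to show why larger blocks mean less
metadata.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT		9	/* 512-byte sectors (assumed) */
#define SKETCH_BLOCK_SHIFT		12	/* 4096-byte logical blocks */
#define SKETCH_BLOCK_SECTORS_SHIFT	(SKETCH_BLOCK_SHIFT - SKETCH_SECTOR_SHIFT)
#define SKETCH_BLOCK_SECTORS		(1u << SKETCH_BLOCK_SECTORS_SHIFT)

/* Convert a 4 KB block number to the 512 B sector where the block starts. */
static uint64_t sketch_blk2sect(uint64_t block)
{
	return block << SKETCH_BLOCK_SECTORS_SHIFT;
}

/* Convert a 512 B sector number to the 4 KB block that contains it. */
static uint64_t sketch_sect2blk(uint64_t sector)
{
	return sector >> SKETCH_BLOCK_SECTORS_SHIFT;
}

/*
 * Number of validity bits needed to track a block-aligned range of
 * sectors, assuming one bit per 4 KB block.
 */
static uint64_t sketch_validity_bits(uint64_t nr_sectors)
{
	assert(nr_sectors % SKETCH_BLOCK_SECTORS == 0);
	return sketch_sect2blk(nr_sectors);
}
```

With these definitions a 256 MB zone (524288 sectors) needs 65536
validity bits, eight times fewer than tracking each 512-byte sector
would require.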

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the amount and position of metadata blocks
on disk.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.
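
The chunk indexing and bitmap sizing implied by the layout above can be
sketched as follows. This is an illustration only: the sketch_* names are
hypothetical, and storing one validity bit per 4096-byte block is an
assumption of this example, not a statement of the exact on-disk format.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE	4096u
#define SKETCH_BITS_PER_BLOCK	(SKETCH_BLOCK_SIZE * 8u)

/* Index into the mapping table: which chunk holds this logical block. */
static uint64_t sketch_chunk_of_block(uint64_t block, uint64_t zone_nr_blocks)
{
	assert(zone_nr_blocks != 0);
	return block / zone_nr_blocks;
}

/* Offset of the logical block inside its chunk. */
static uint64_t sketch_block_in_chunk(uint64_t block, uint64_t zone_nr_blocks)
{
	assert(zone_nr_blocks != 0);
	return block % zone_nr_blocks;
}

/* 4 KB bitmap blocks needed to hold one validity bit per block of a zone. */
static uint64_t sketch_bitmap_blocks_per_zone(uint64_t zone_nr_blocks)
{
	return (zone_nr_blocks + SKETCH_BITS_PER_BLOCK - 1) /
		SKETCH_BITS_PER_BLOCK;
}
```

For a 256 MB zone (65536 blocks of 4 KB), logical block 65537 falls in
chunk 1 at offset 1, and the validity bitmap of one zone occupies two
4 KB blocks under the one-bit-per-block assumption.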

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
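
The write path decision described above can be sketched as a small
function. The types and names below (struct zone_sketch,
choose_write_path) are hypothetical and much simpler than the structures
in the driver; the sketch only shows the direct versus buffered choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum write_path {
	WRITE_DIRECT,	/* write in place to the zone mapping the chunk */
	WRITE_BUFFERED,	/* write to a conventional buffer zone */
};

/* Hypothetical, simplified view of the zone mapping a chunk. */
struct zone_sketch {
	bool sequential;	/* false for a conventional zone */
	unsigned int wp_block;	/* write pointer, in blocks from zone start */
};

/* Decide how a write starting at chunk_block (offset in the chunk) is handled. */
static enum write_path choose_write_path(const struct zone_sketch *zone,
					 unsigned int chunk_block)
{
	assert(zone != NULL);

	/* Conventional zones accept writes at any offset. */
	if (!zone->sequential)
		return WRITE_DIRECT;

	/* Sequential zones accept only writes aligned on the write pointer. */
	if (chunk_block == zone->wp_block)
		return WRITE_DIRECT;

	return WRITE_BUFFERED;
}
```

In the buffered case the driver additionally allocates a buffer zone if
the chunk has none and invalidates the overwritten blocks in the
sequential zone; those steps are left out of this sketch.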

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
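
As an illustration of the read logic above, the following sketch picks
the source of a single block from two validity bitmaps. The structure and
function names are hypothetical, and the in-memory bitmap layout (bit n
of the bitmap describes block n of the chunk) is an assumption of the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum read_source {
	READ_ZERO,		/* unmapped chunk or invalid block */
	READ_DATA_ZONE,		/* block valid in the zone mapping the chunk */
	READ_BUFFER_ZONE,	/* block valid in the chunk buffer zone */
};

/* Hypothetical, simplified view of a chunk and its validity bitmaps. */
struct chunk_sketch {
	bool mapped;				/* chunk has a data zone */
	bool buffered;				/* chunk also has a buffer zone */
	const unsigned char *data_valid;	/* one bit per block, data zone */
	const unsigned char *buf_valid;		/* one bit per block, buffer zone */
};

static bool sketch_bit_set(const unsigned char *bitmap, unsigned int bit)
{
	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

/* Choose where one block of the chunk must be read from. */
static enum read_source choose_read_source(const struct chunk_sketch *chunk,
					   unsigned int block)
{
	assert(chunk != NULL);

	if (!chunk->mapped)
		return READ_ZERO;
	if (chunk->buffered && sketch_bit_set(chunk->buf_valid, block))
		return READ_BUFFER_ZONE;
	if (sketch_bit_set(chunk->data_valid, block))
		return READ_DATA_ZONE;
	return READ_ZERO;
}
```

Because a block is valid in at most one of the two zones, the order of
the two bitmap checks does not change the result.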

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.
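
The selection step of the reclaim process can be sketched as a scan for
the least recently used conventional zone. The structure below and its
last_use field are hypothetical: the text does not say how recency is
recorded, so a simple counter is assumed here, where the driver could
equally keep zones on an ordered list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a conventional zone. */
struct buf_zone_sketch {
	bool in_use;		/* zone currently maps or buffers a chunk */
	unsigned long last_use;	/* assumed "last written" tick, lower = older */
};

/*
 * Return the index of the in-use zone with the oldest last_use value,
 * or -1 if no zone is in use and there is nothing to reclaim.
 */
static int pick_reclaim_candidate(const struct buf_zone_sketch *zones,
				  size_t nr_zones)
{
	int candidate = -1;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	for (i = 0; i < nr_zones; i++) {
		if (!zones[i].in_use)
			continue;
		if (candidate < 0 ||
		    zones[i].last_use < zones[candidate].last_use)
			candidate = (int)i;
	}

	return candidate;
}
```

After a candidate is chosen, its valid blocks would be copied to a free
sequential zone and the chunk mapping updated, as described above.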

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set. A generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata block
updates can be done in the primary metadata set. This ensures that one
of the sets is always consistent (all modifications committed or none at
all). Flush operations are used as a commit point. Upon reception of a
flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.
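
One way a generation counter can be used when the device is reopened
after a crash is sketched below. This is an assumption made for
illustration: the text above only states that the counter marks the set
holding the newest metadata, and does not describe the recovery
procedure. The structure, its valid flag (standing in for whatever
integrity check the super block supports) and the function name are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one metadata set's super block. */
struct sb_sketch {
	bool valid;	/* assumed result of an integrity check */
	uint64_t gen;	/* generation counter read from the super block */
};

/*
 * Return 0 to use the primary set, 1 to use the secondary set, or -1 if
 * neither super block is usable. On a tie the primary set is preferred.
 */
static int pick_metadata_set(const struct sb_sketch *primary,
			     const struct sb_sketch *secondary)
{
	assert(primary != NULL && secondary != NULL);

	if (!primary->valid && !secondary->valid)
		return -1;
	if (!secondary->valid)
		return 0;
	if (!primary->valid)
		return 1;

	return secondary->gen > primary->gen ? 1 : 0;
}
```

If the secondary set carries the higher generation, a crash interrupted
the in-place update of the primary set, so the staged copy is the
consistent one under this sketch's assumptions.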

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
@@ -521,6 +521,23 @@ config DM_INTEGRITY
	  To compile this code as a module, choose M here: the module will
	  be called dm-integrity.

config DM_ZONED
	tristate "Drive-managed zoned block device target support"
	depends on BLK_DEV_DM
	depends on BLK_DEV_ZONED
	---help---
	  This device-mapper target takes a host-managed or host-aware zoned
	  block device and exposes most of its capacity as a regular block
	  device (drive-managed zoned block device) without any write
	  constraints. This is mainly intended for use with file systems that
	  do not natively support zoned block devices but still want to
	  benefit from the increased capacity offered by SMR disks. Other uses
	  by applications using raw block devices (for example object stores)
	  are also possible.

	  To compile this code as a module, choose M here: the module will
	  be called dm-zoned.

	  If unsure, say N.

endif # MD

@@ -20,6 +20,7 @@ dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o

# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_ERA) += dm-era.o
obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
obj-$(CONFIG_DM_ZONED) += dm-zoned.o

ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o

2509
drivers/md/dm-zoned-metadata.c
Normal file
File diff suppressed because it is too large
570
drivers/md/dm-zoned-reclaim.c
Normal file
File diff suppressed because it is too large
967
drivers/md/dm-zoned-target.c
Normal file
File diff suppressed because it is too large
228
drivers/md/dm-zoned.h
Normal file
@@ -0,0 +1,228 @@
/*
 * Copyright (C) 2017 Western Digital Corporation or its affiliates.
 *
 * This file is released under the GPL.
 */

#ifndef DM_ZONED_H
#define DM_ZONED_H

#include <linux/types.h>
#include <linux/blkdev.h>
#include <linux/device-mapper.h>
#include <linux/dm-kcopyd.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/rwsem.h>
#include <linux/rbtree.h>
#include <linux/radix-tree.h>
#include <linux/shrinker.h>

/*
 * dm-zoned creates block devices with 4KB blocks, always.
 */
#define DMZ_BLOCK_SHIFT		12
#define DMZ_BLOCK_SIZE		(1 << DMZ_BLOCK_SHIFT)
#define DMZ_BLOCK_MASK		(DMZ_BLOCK_SIZE - 1)

#define DMZ_BLOCK_SHIFT_BITS	(DMZ_BLOCK_SHIFT + 3)
#define DMZ_BLOCK_SIZE_BITS	(1 << DMZ_BLOCK_SHIFT_BITS)
#define DMZ_BLOCK_MASK_BITS	(DMZ_BLOCK_SIZE_BITS - 1)

#define DMZ_BLOCK_SECTORS_SHIFT	(DMZ_BLOCK_SHIFT - SECTOR_SHIFT)
#define DMZ_BLOCK_SECTORS	(DMZ_BLOCK_SIZE >> SECTOR_SHIFT)
#define DMZ_BLOCK_SECTORS_MASK	(DMZ_BLOCK_SECTORS - 1)

/*
 * 4KB block <-> 512B sector conversion.
 */
#define dmz_blk2sect(b)		((sector_t)(b) << DMZ_BLOCK_SECTORS_SHIFT)
#define dmz_sect2blk(s)		((sector_t)(s) >> DMZ_BLOCK_SECTORS_SHIFT)

#define dmz_bio_block(bio)	dmz_sect2blk((bio)->bi_iter.bi_sector)
#define dmz_bio_blocks(bio)	dmz_sect2blk(bio_sectors(bio))

/*
 * Zoned block device information.
 */
struct dmz_dev {
	struct block_device	*bdev;

	char			name[BDEVNAME_SIZE];

	sector_t		capacity;

	unsigned int		nr_zones;

	sector_t		zone_nr_sectors;
	unsigned int		zone_nr_sectors_shift;

	sector_t		zone_nr_blocks;
	sector_t		zone_nr_blocks_shift;
};

#define dmz_bio_chunk(dev, bio)	((bio)->bi_iter.bi_sector >> \
				 (dev)->zone_nr_sectors_shift)
#define dmz_chunk_block(dev, b)	((b) & ((dev)->zone_nr_blocks - 1))

/*
 * Zone descriptor.
 */
struct dm_zone {
	/* For listing the zone depending on its state */
	struct list_head	link;

	/* Zone type and state */
	unsigned long		flags;

	/* Zone activation reference count */
	atomic_t		refcount;

	/* Zone write pointer block (relative to the zone start block) */
	unsigned int		wp_block;

	/* Zone weight (number of valid blocks in the zone) */
	unsigned int		weight;

	/* The chunk that the zone maps */
	unsigned int		chunk;

	/*
	 * For a sequential data zone, pointer to the random zone
	 * used as a buffer for processing unaligned writes.
	 * For a buffer zone, this points back to the data zone.
	 */
	struct dm_zone		*bzone;
};

/*
 * Zone flags.
 */
enum {
	/* Zone write type */
	DMZ_RND,
	DMZ_SEQ,

	/* Zone critical condition */
	DMZ_OFFLINE,
	DMZ_READ_ONLY,

	/* How the zone is being used */
	DMZ_META,
	DMZ_DATA,
	DMZ_BUF,

	/* Zone internal state */
	DMZ_ACTIVE,
	DMZ_RECLAIM,
	DMZ_SEQ_WRITE_ERR,
};

/*
 * Zone data accessors.
 */
#define dmz_is_rnd(z)		test_bit(DMZ_RND, &(z)->flags)
#define dmz_is_seq(z)		test_bit(DMZ_SEQ, &(z)->flags)
#define dmz_is_empty(z)		((z)->wp_block == 0)
#define dmz_is_offline(z)	test_bit(DMZ_OFFLINE, &(z)->flags)
#define dmz_is_readonly(z)	test_bit(DMZ_READ_ONLY, &(z)->flags)
#define dmz_is_active(z)	test_bit(DMZ_ACTIVE, &(z)->flags)
#define dmz_in_reclaim(z)	test_bit(DMZ_RECLAIM, &(z)->flags)
#define dmz_seq_write_err(z)	test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags)

#define dmz_is_meta(z)		test_bit(DMZ_META, &(z)->flags)
#define dmz_is_buf(z)		test_bit(DMZ_BUF, &(z)->flags)
#define dmz_is_data(z)		test_bit(DMZ_DATA, &(z)->flags)

#define dmz_weight(z)		((z)->weight)

/*
 * Message functions.
 */
#define dmz_dev_info(dev, format, args...)	\
	DMINFO("(%s): " format, (dev)->name, ## args)

#define dmz_dev_err(dev, format, args...)	\
	DMERR("(%s): " format, (dev)->name, ## args)

#define dmz_dev_warn(dev, format, args...)	\
	DMWARN("(%s): " format, (dev)->name, ## args)

#define dmz_dev_debug(dev, format, args...)	\
	DMDEBUG("(%s): " format, (dev)->name, ## args)

struct dmz_metadata;
struct dmz_reclaim;

/*
 * Functions defined in dm-zoned-metadata.c
 */
int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **zmd);
void dmz_dtr_metadata(struct dmz_metadata *zmd);
int dmz_resume_metadata(struct dmz_metadata *zmd);

void dmz_lock_map(struct dmz_metadata *zmd);
void dmz_unlock_map(struct dmz_metadata *zmd);
void dmz_lock_metadata(struct dmz_metadata *zmd);
void dmz_unlock_metadata(struct dmz_metadata *zmd);
void dmz_lock_flush(struct dmz_metadata *zmd);
void dmz_unlock_flush(struct dmz_metadata *zmd);
int dmz_flush_metadata(struct dmz_metadata *zmd);

unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone);
sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone);
sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone);
unsigned int dmz_nr_chunks(struct dmz_metadata *zmd);

#define DMZ_ALLOC_RND		0x01
#define DMZ_ALLOC_RECLAIM	0x02

struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags);
void dmz_free_zone(struct dmz_metadata *zmd, struct dm_zone *zone);

void dmz_map_zone(struct dmz_metadata *zmd, struct dm_zone *zone,
		  unsigned int chunk);
void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone);
unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd);
unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd);

void dmz_activate_zone(struct dm_zone *zone);
void dmz_deactivate_zone(struct dm_zone *zone);

int dmz_lock_zone_reclaim(struct dm_zone *zone);
void dmz_unlock_zone_reclaim(struct dm_zone *zone);
struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd);

struct dm_zone *dmz_get_chunk_mapping(struct dmz_metadata *zmd,
				      unsigned int chunk, int op);
void dmz_put_chunk_mapping(struct dmz_metadata *zmd, struct dm_zone *zone);
struct dm_zone *dmz_get_chunk_buffer(struct dmz_metadata *zmd,
				     struct dm_zone *dzone);

int dmz_validate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone,
			sector_t chunk_block, unsigned int nr_blocks);
int dmz_invalidate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone,
			  sector_t chunk_block, unsigned int nr_blocks);
int dmz_block_valid(struct dmz_metadata *zmd, struct dm_zone *zone,
		    sector_t chunk_block);
int dmz_first_valid_block(struct dmz_metadata *zmd, struct dm_zone *zone,
			  sector_t *chunk_block);
int dmz_copy_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone,
			  struct dm_zone *to_zone);
int dmz_merge_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone,
			   struct dm_zone *to_zone, sector_t chunk_block);

/*
 * Functions defined in dm-zoned-reclaim.c
 */
int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd,
		    struct dmz_reclaim **zrc);
void dmz_dtr_reclaim(struct dmz_reclaim *zrc);
void dmz_suspend_reclaim(struct dmz_reclaim *zrc);
void dmz_resume_reclaim(struct dmz_reclaim *zrc);
void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc);
void dmz_schedule_reclaim(struct dmz_reclaim *zrc);

#endif /* DM_ZONED_H */