mirror of https://github.com/linux-apfs/linux-apfs.git
Merge tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - based on Jens' 'for-4.7/core' to have DM thinp's discard support use
   bio_inc_remaining() and the block core's new async
   __blkdev_issue_discard() interface

 - make DM multipath's fast code-paths lockless, using
   lockless_dereference(), to significantly improve large NUMA
   performance when using blk-mq.  The m->lock spinlock contention was
   a serious bottleneck.

 - a few other small code cleanups and Documentation fixes

* tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: unroll issue_discard() to create longer discard bio chains
  dm thin: use __blkdev_issue_discard for async discard support
  dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining()
  dm raid: make sure no feature flags are set in metadata
  dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call
  dm stats: fix spelling mistake in Documentation
  dm cache: update cache-policies.txt now that mq is an alias for smq
  dm mpath: eliminate use of spinlock in IO fast-paths
  dm mpath: move trigger_event member to the end of 'struct multipath'
  dm mpath: use atomic_t for counting members of 'struct multipath'
  dm mpath: switch to using bitops for state flags
  dm thin: Remove return statement from void function
  dm: remove unused mapped_device argument from free_tio()
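For context on the multipath bullet: lockless_dereference() (from
<linux/compiler.h>) is a dependency-ordered pointer read that lets the IO
fast-path follow a pointer published by the slow path without taking
m->lock. A minimal sketch of the idiom, with invented structure and field
names rather than the real dm-mpath code:

	struct example_pgpath {
		struct dm_dev *path_dev;
	};

	struct example_multipath {
		/* written by the slow path under m->lock, read locklessly */
		struct example_pgpath *current_pgpath;
	};

	static struct example_pgpath *choose_path(struct example_multipath *m)
	{
		/* dependency-ordered load, paired with the publishing store */
		return lockless_dereference(m->current_pgpath);
	}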
Documentation/device-mapper/cache-policies.txt
@@ -11,7 +11,7 @@ Every bio that is mapped by the target is referred to the policy.
 The policy can return a simple HIT or MISS or issue a migration.
 
 Currently there's no way for the policy to issue background work,
-e.g. to start writing back dirty blocks that are going to be evicte
+e.g. to start writing back dirty blocks that are going to be evicted
 soon.
 
 Because we map bios, rather than requests it's easy for the policy
@@ -48,7 +48,7 @@ with the multiqueue (mq) policy.
 
 The smq policy (vs mq) offers the promise of less memory utilization,
 improved performance and increased adaptability in the face of changing
-workloads. SMQ also does not have any cumbersome tuning knobs.
+workloads. smq also does not have any cumbersome tuning knobs.
 
 Users may switch from "mq" to "smq" simply by appropriately reloading a
 DM table that is using the cache target. Doing so will cause all of the
@@ -57,47 +57,45 @@ degrade slightly until smq recalculates the origin device's hotspots
 that should be cached.
 
 Memory usage:
-The mq policy uses a lot of memory; 88 bytes per cache block on a 64
+The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.
 
-SMQ uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement it's data structures rather than
 pointers. It avoids storing an explicit hit count for each block. It
-has a 'hotspot' queue rather than a pre cache which uses a quarter of
+has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
 cache block).
 
-All these mean smq uses ~25bytes per cache block. Still a lot of
+All this means smq uses ~25bytes per cache block. Still a lot of
 memory, but a substantial improvement nontheless.
 
 Level balancing:
-MQ places entries in different levels of the multiqueue structures
-based on their hit count (~ln(hit count)). This means the bottom
-levels generally have the most entries, and the top ones have very
-few. Having unbalanced levels like this reduces the efficacy of the
+mq placed entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)). This meant the bottom
+levels generally had the most entries, and the top ones had very
+few. Having unbalanced levels like this reduced the efficacy of the
 multiqueue.
 
-SMQ does not maintain a hit count, instead it swaps hit entries with
-the least recently used entry from the level above. The over all
+smq does not maintain a hit count, instead it swaps hit entries with
+the least recently used entry from the level above. The overall
 ordering being a side effect of this stochastic process. With this
 scheme we can decide how many entries occupy each multiqueue level,
 resulting in better promotion/demotion decisions.
 
 Adaptability:
-The MQ policy maintains a hit count for each cache block. For a
+The mq policy maintained a hit count for each cache block. For a
 different block to get promoted to the cache it's hit count has to
-exceed the lowest currently in the cache. This means it can take a
+exceed the lowest currently in the cache. This meant it could take a
 long time for the cache to adapt between varying IO patterns.
-Periodically degrading the hit counts could help with this, but I
-haven't found a nice general solution.
 
-SMQ doesn't maintain hit counts, so a lot of this problem just goes
+smq doesn't maintain hit counts, so a lot of this problem just goes
 away. In addition it tracks performance of the hotspot queue, which
 is used to decide which blocks to promote. If the hotspot queue is
 performing badly then it starts moving entries more quickly between
 levels. This lets it adapt to new IO patterns very quickly.
 
 Performance:
-Testing SMQ shows substantially better performance than MQ.
+Testing smq shows substantially better performance than mq.
 
 cleaner
 -------
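The "Level balancing" paragraph above is terse, so here is a toy model of
the swap it describes: one LRU list per level, promotion on hit by trading
places with the least recently used entry of the level above. All names
here are invented for illustration; the real structures live in
drivers/md/dm-cache-policy-smq.c and are considerably more involved:

	#include <linux/list.h>

	#define TOY_NR_LEVELS 64

	struct toy_entry {
		struct list_head list;
		unsigned level;
	};

	/*
	 * levels[i] is an LRU list (assume INIT_LIST_HEAD() was called
	 * on each); the head's first element is the coldest entry.
	 */
	static struct list_head levels[TOY_NR_LEVELS];

	static void toy_hit(struct toy_entry *e)
	{
		struct toy_entry *cold;

		if (e->level == TOY_NR_LEVELS - 1) {
			/* already at the top: just mark most recently used */
			list_move_tail(&e->list, &levels[e->level]);
			return;
		}

		if (!list_empty(&levels[e->level + 1])) {
			/* demote the LRU entry of the level above... */
			cold = list_first_entry(&levels[e->level + 1],
						struct toy_entry, list);
			cold->level = e->level;
			list_move_tail(&cold->list, &levels[e->level]);
		}

		/* ...and promote the hit entry into its place */
		e->level++;
		list_move_tail(&e->list, &levels[e->level]);
	}

No hit counters are stored, and level occupancy stays fixed because every
promotion is paired with a demotion, which is what lets smq dictate how
many entries occupy each level.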
Documentation/device-mapper/statistics.txt
@@ -205,7 +205,7 @@ statistics on them:
 
   dmsetup message vol 0 @stats_create - /100
 
-Set the auxillary data string to "foo bar baz" (the escape for each
+Set the auxiliary data string to "foo bar baz" (the escape for each
 space must also be escaped, otherwise the shell will consume them):
 
   dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
drivers/md/dm-ioctl.c
@@ -1723,7 +1723,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	if (!dmi) {
 		unsigned noio_flag;
 		noio_flag = memalloc_noio_save();
-		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
+		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
 		memalloc_noio_restore(noio_flag);
 		if (dmi)
 			*param_flags |= DM_PARAMS_VMALLOC;
(+195 -156: file diff suppressed because it is too large)
drivers/md/dm-raid.c
@@ -1037,6 +1037,11 @@ static int super_validate(struct raid_set *rs, struct md_rdev *rdev)
 	if (!mddev->events && super_init_validation(mddev, rdev))
 		return -EINVAL;
 
+	if (le32_to_cpu(sb->features)) {
+		rs->ti->error = "Unable to assemble array: No feature flags supported yet";
+		return -EINVAL;
+	}
+
 	/* Enable bitmap creation for RAID levels != 0 */
 	mddev->bitmap_info.offset = (rs->raid_type->level) ? to_sector(4096) : 0;
 	rdev->mddev->bitmap_info.default_offset = mddev->bitmap_info.offset;
@@ -1718,7 +1723,7 @@ static void raid_resume(struct dm_target *ti)
 
 static struct target_type raid_target = {
 	.name = "raid",
-	.version = {1, 7, 0},
+	.version = {1, 8, 0},
 	.module = THIS_MODULE,
 	.ctr = raid_ctr,
 	.dtr = raid_dtr,
drivers/md/dm-thin.c (+75 -90)
@@ -322,56 +322,6 @@ struct thin_c {
 
 /*----------------------------------------------------------------*/
 
-/**
- * __blkdev_issue_discard_async - queue a discard with async completion
- * @bdev:	blockdev to issue discard for
- * @sector:	start sector
- * @nr_sects:	number of sectors to discard
- * @gfp_mask:	memory allocation flags (for bio_alloc)
- * @flags:	BLKDEV_IFL_* flags to control behaviour
- * @parent_bio: parent discard bio that all sub discards get chained to
- *
- * Description:
- *    Asynchronously issue a discard request for the sectors in question.
- */
-static int __blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
-					sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
-					struct bio *parent_bio)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-	int type = REQ_WRITE | REQ_DISCARD;
-	struct bio *bio;
-
-	if (!q || !nr_sects)
-		return -ENXIO;
-
-	if (!blk_queue_discard(q))
-		return -EOPNOTSUPP;
-
-	if (flags & BLKDEV_DISCARD_SECURE) {
-		if (!blk_queue_secdiscard(q))
-			return -EOPNOTSUPP;
-		type |= REQ_SECURE;
-	}
-
-	/*
-	 * Required bio_put occurs in bio_endio thanks to bio_chain below
-	 */
-	bio = bio_alloc(gfp_mask, 1);
-	if (!bio)
-		return -ENOMEM;
-
-	bio_chain(bio, parent_bio);
-
-	bio->bi_iter.bi_sector = sector;
-	bio->bi_bdev = bdev;
-	bio->bi_iter.bi_size = nr_sects << 9;
-
-	submit_bio(type, bio);
-
-	return 0;
-}
-
 static bool block_size_is_power_of_two(struct pool *pool)
 {
 	return pool->sectors_per_block_shift >= 0;
@@ -384,14 +334,55 @@ static sector_t block_to_sectors(struct pool *pool, dm_block_t b)
 		(b * pool->sectors_per_block);
 }
 
-static int issue_discard(struct thin_c *tc, dm_block_t data_b, dm_block_t data_e,
-			 struct bio *parent_bio)
+/*----------------------------------------------------------------*/
+
+struct discard_op {
+	struct thin_c *tc;
+	struct blk_plug plug;
+	struct bio *parent_bio;
+	struct bio *bio;
+};
+
+static void begin_discard(struct discard_op *op, struct thin_c *tc, struct bio *parent)
+{
+	BUG_ON(!parent);
+
+	op->tc = tc;
+	blk_start_plug(&op->plug);
+	op->parent_bio = parent;
+	op->bio = NULL;
+}
+
+static int issue_discard(struct discard_op *op, dm_block_t data_b, dm_block_t data_e)
 {
+	struct thin_c *tc = op->tc;
 	sector_t s = block_to_sectors(tc->pool, data_b);
 	sector_t len = block_to_sectors(tc->pool, data_e - data_b);
 
-	return __blkdev_issue_discard_async(tc->pool_dev->bdev, s, len,
-					    GFP_NOWAIT, 0, parent_bio);
+	return __blkdev_issue_discard(tc->pool_dev->bdev, s, len,
+				      GFP_NOWAIT, REQ_WRITE | REQ_DISCARD, &op->bio);
+}
+
+static void end_discard(struct discard_op *op, int r)
+{
+	if (op->bio) {
+		/*
+		 * Even if one of the calls to issue_discard failed, we
+		 * need to wait for the chain to complete.
+		 */
+		bio_chain(op->bio, op->parent_bio);
+		submit_bio(REQ_WRITE | REQ_DISCARD, op->bio);
+	}
+
+	blk_finish_plug(&op->plug);
+
+	/*
+	 * Even if r is set, there could be sub discards in flight that we
+	 * need to wait for.
+	 */
+	if (r && !op->parent_bio->bi_error)
+		op->parent_bio->bi_error = r;
+	bio_endio(op->parent_bio);
 }
 
 /*----------------------------------------------------------------*/
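The three helpers added above form a small bracketed API: begin_discard()
plugs the queue and records the parent bio, issue_discard() may be called
repeatedly to extend the chain, and end_discard() submits the final
chained bio and completes the parent. Condensed from the passdown hunk
further down (error handling elided):

	struct discard_op op;
	int r;

	begin_discard(&op, tc, m->bio);
	r = issue_discard(&op, m->data_block,
			  m->data_block + (m->virt_end - m->virt_begin));
	end_discard(&op, r);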
@@ -632,7 +623,7 @@ static void error_retry_list(struct pool *pool)
 {
 	int error = get_pool_io_error_code(pool);
 
-	return error_retry_list_with_code(pool, error);
+	error_retry_list_with_code(pool, error);
 }
 
 /*
@@ -1006,24 +997,28 @@ static void process_prepared_discard_no_passdown(struct dm_thin_new_mapping *m)
 	mempool_free(m, tc->pool->mapping_pool);
 }
 
-static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
+/*----------------------------------------------------------------*/
+
+static void passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
 {
 	/*
 	 * We've already unmapped this range of blocks, but before we
 	 * passdown we have to check that these blocks are now unused.
 	 */
-	int r;
+	int r = 0;
 	bool used = true;
 	struct thin_c *tc = m->tc;
 	struct pool *pool = tc->pool;
 	dm_block_t b = m->data_block, e, end = m->data_block + m->virt_end - m->virt_begin;
+	struct discard_op op;
 
+	begin_discard(&op, tc, m->bio);
 	while (b != end) {
 		/* find start of unmapped run */
 		for (; b < end; b++) {
 			r = dm_pool_block_is_used(pool->pmd, b, &used);
 			if (r)
-				return r;
+				goto out;
 
 			if (!used)
 				break;
@@ -1036,20 +1031,20 @@ static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
 		for (e = b + 1; e != end; e++) {
 			r = dm_pool_block_is_used(pool->pmd, e, &used);
 			if (r)
-				return r;
+				goto out;
 
 			if (used)
 				break;
 		}
 
-		r = issue_discard(tc, b, e, m->bio);
+		r = issue_discard(&op, b, e);
 		if (r)
-			return r;
+			goto out;
 
 		b = e;
 	}
 
-	return 0;
+out:
+	end_discard(&op, r);
 }
 
 static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
@@ -1059,20 +1054,21 @@ static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
 	struct pool *pool = tc->pool;
 
 	r = dm_thin_remove_range(tc->td, m->virt_begin, m->virt_end);
-	if (r)
+	if (r) {
 		metadata_operation_failed(pool, "dm_thin_remove_range", r);
-	else if (m->maybe_shared)
-		r = passdown_double_checking_shared_status(m);
-	else
-		r = issue_discard(tc, m->data_block, m->data_block + (m->virt_end - m->virt_begin), m->bio);
+		bio_io_error(m->bio);
 
-	/*
-	 * Even if r is set, there could be sub discards in flight that we
-	 * need to wait for.
-	 */
-	m->bio->bi_error = r;
-	bio_endio(m->bio);
+	} else if (m->maybe_shared) {
+		passdown_double_checking_shared_status(m);
+
+	} else {
+		struct discard_op op;
+		begin_discard(&op, tc, m->bio);
+		r = issue_discard(&op, m->data_block,
+				  m->data_block + (m->virt_end - m->virt_begin));
+		end_discard(&op, r);
+	}
+
 	cell_defer_no_holder(tc, m->cell);
 	mempool_free(m, pool->mapping_pool);
 }
@@ -1494,17 +1490,6 @@ static void process_discard_cell_no_passdown(struct thin_c *tc,
 		pool->process_prepared_discard(m);
 }
 
-/*
- * __bio_inc_remaining() is used to defer parent bios's end_io until
- * we _know_ all chained sub range discard bios have completed.
- */
-static inline void __bio_inc_remaining(struct bio *bio)
-{
-	bio->bi_flags |= (1 << BIO_CHAIN);
-	smp_mb__before_atomic();
-	atomic_inc(&bio->__bi_remaining);
-}
-
 static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t end,
 				 struct bio *bio)
 {
@@ -1554,13 +1539,13 @@ static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t
 
 		/*
 		 * The parent bio must not complete before sub discard bios are
-		 * chained to it (see __blkdev_issue_discard_async's bio_chain)!
+		 * chained to it (see end_discard's bio_chain)!
 		 *
 		 * This per-mapping bi_remaining increment is paired with
 		 * the implicit decrement that occurs via bio_endio() in
-		 * process_prepared_discard_{passdown,no_passdown}.
+		 * end_discard().
 		 */
-		__bio_inc_remaining(bio);
+		bio_inc_remaining(bio);
 		if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
 			pool->process_prepared_discard(m);
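The pairing described in that comment is the block core's bio remaining
count: bio_chain() and bio_inc_remaining() each take a reference on the
parent bio, and the parent's bi_end_io only runs once bio_endio() has
dropped them all. A hypothetical sketch of the pattern (not code from
this commit; submit_bio()'s two-argument form matches this kernel's API):

	#include <linux/bio.h>

	/* Defer @parent's completion until @n chained sub-bios finish. */
	static void sketch_issue_subs(struct bio *parent, struct bio **subs,
				      unsigned n)
	{
		unsigned i;

		for (i = 0; i < n; i++) {
			bio_chain(subs[i], parent);	/* +1 on parent's count */
			submit_bio(REQ_WRITE | REQ_DISCARD, subs[i]);
		}

		/*
		 * Drop our own reference; parent->bi_end_io runs only after
		 * every chained sub-bio has completed as well.
		 */
		bio_endio(parent);
	}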
@@ -3899,7 +3884,7 @@ static struct target_type pool_target = {
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 18, 0},
+	.version = {1, 19, 0},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
@@ -4273,7 +4258,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 18, 0},
+	.version = {1, 19, 0},
 	.module = THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
drivers/md/dm.c (+4 -6)
@@ -674,7 +674,7 @@ static void free_io(struct mapped_device *md, struct dm_io *io)
 	mempool_free(io, md->io_pool);
 }
 
-static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
+static void free_tio(struct dm_target_io *tio)
 {
 	bio_put(&tio->clone);
 }
@@ -1055,7 +1055,7 @@ static void clone_endio(struct bio *bio)
 		    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors))
 			disable_write_same(md);
 
-	free_tio(md, tio);
+	free_tio(tio);
 	dec_pending(io, error);
 }
@@ -1517,7 +1517,6 @@ static void __map_bio(struct dm_target_io *tio)
 {
 	int r;
 	sector_t sector;
-	struct mapped_device *md;
 	struct bio *clone = &tio->clone;
 	struct dm_target *ti = tio->ti;
@@ -1540,9 +1539,8 @@ static void __map_bio(struct dm_target_io *tio)
 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
 		/* error the io and bail out, or requeue it if needed */
-		md = tio->io->md;
 		dec_pending(tio->io, r);
-		free_tio(md, tio);
+		free_tio(tio);
 	} else if (r != DM_MAPIO_SUBMITTED) {
 		DMWARN("unimplemented target map return value: %d", r);
 		BUG();
@@ -1663,7 +1661,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	tio->len_ptr = len;
 	r = clone_bio(tio, bio, sector, *len);
 	if (r < 0) {
-		free_tio(ci->md, tio);
+		free_tio(tio);
 		break;
 	}
 	__map_bio(tio);