Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
boundaries
- calculate the cache's discard_nr_blocks relative to the determined
discard_block_size rather than using oblock_to_dblock()
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target. And the DM cache target passes in
the same old root to array_resize() anyway.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
rcu_dereference() should be used in sections protected by rcu_read_lock.
For writers, holding some kind of mutex or lock,
rcu_dereference_protected() is the way to go, adding explicit lockdep
bits.
In __unbind(), we are the last user of this mapped device, so can use
the constant '1' instead of a lockdep_is_held(), not consistent with
other uses of rcu_dereference_protected() which use md->suspend_lock
mutex.
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33423974bf ("dm: Use rcu_dereference() for accessing rcu pointer")
Cc: Pranith Kumar <bobby.prani@gmail.com>
[snitzer: allow lines longer than 80 columns, refine subject]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Simplify the pool_io_hints code that works to establish a max_sectors
value that is a power-of-2 factor of the thin-pool's blocksize. The
biggest associated improvement is that the DM thin-pool is no longer
concerning itself with the data device's max_hw_sectors when adjusting
max_sectors.
This fixes the relative fragility of the original "dm thin: adjust
max_sectors_kb based on thinp blocksize" commit that only became
apparent when testing was performed using a DM thin-pool ontop of a
virtio_blk device. One proposed upstream patch detailed the problems
inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
So even though virtio_blk incorrectly set its max_hw_sectors it actually
helped make it clear that we need DM thinp to be tolerant of any future
Linux driver that incorrectly sets max_hw_sectors.
We only need to be concerned with modifying the thin-pool device's
max_sectors limit if it is smaller than the thin-pool's blocksize. In
this case the value of max_sectors does become a limiting factor when
upper layers (e.g. filesystems) construct their bios. But if the
hardware can support IOs larger than the thin-pool's blocksize the user
is encouraged to adjust the thin-pool's data device's max_sectors
accordingly -- doing so will enable the thin-pool to inherit the
established user-defined max_sectors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before this change it was expected that userspace would first suspend
all active thin devices, reload/resize the thin-pool target, then resume
all active thin devices. Now the thin-pool suspend/resume will trigger
the suspend/resume of all active thins via appropriate calls to
dm_internal_suspend and dm_internal_resume.
Store the mapped_device for each thin device in struct thin_c to make
these calls possible.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.
Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared. Add lockdep annotation to dm_suspend() and dm_resume().
The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().
Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend". A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.
But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Otherwise IO could be issued to the pool while it is suspended.
Care was taken to properly interlock between the thin and thin-pool
targets when accessing the pool's 'suspended' flag. The thin_ctr will
not add a new thin device to the pool's active_thins list if the pool is
susepended.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
As long as struct thin_c is in the list, anyone can grab a reference of
it. Consequently, we must wait for the reference count to drop to zero
*after* we remove the structure from the list, not before.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Loading and saving millions of block mappings takes time. We may as
well explain what's going on, and encourage people to use a larger
cache block size.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support. This also improves
discard performance considerly because larger discards are issued to the
dm-cache device. The discard blocksize was always intended to be
greater than the cache blocksize. But until now it wasn't implemented
safely.
Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset. Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size. For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset! Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit d132cc6d9e because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 64ab346a36 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ranges will be placed in the same cell if they overlap.
Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.
Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block. Also add a fudge factor to discourage demoting dirty blocks.
With some tests this has a sizeable difference, because the old code
was too eager to demote blocks. For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When creating new devices dm_sync_table() calls
synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be
flushed. This causes a latency overhead that is especially noticeable
when creating lots of devices.
And all of this is pointless as there are no old maps to be
disconnected, and hence no stale pointers which would need to be
cleared up.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference()
while accessing it.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells(). This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>