Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder. There is a race here
though between releasing an empty cell, and other threads adding new
inmates. So this function makes the decision with its lock held.
This function can have two outcomes:
i) An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We only allow non critical writeback if the origin is idle. It is up
to the policy to decide what writeback work is critical.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.
FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.
This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.
Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg. Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace. Such
recovery takes a long time due to a full resync.
Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interrupton. To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type
Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.
The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
to allow the raid0 personality to calculate the MD array
size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
(wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
so that userspace checks for 100% sync will succeed and allow
for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
that resulted from bio payloads that grew too large. This problem
did not occur with the other raid levels because it either did not
apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.
Also, backfill dm-raid target version 1.6.0 documentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The overwrite has only ever about optimizing away the need to zero a
block if the entire block was being overwritten. As such it is only
relevant when zeroing is enabled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Introduce a single common method for cleaning up a DM device's
mapped_device. No functional change, just eliminates duplication of
delicate mapped_device cleanup code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized). Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".
Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn"). In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int. This code made
assumptions about the value of BIO_MAX_SECTORS.
A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At
this point the cast from sector_t to int resulted in a zero value. The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.
This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
dm-multipath accepts 0 path mapping.
# echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev
Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.
However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:
# echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
| dmsetup create mpath1
# echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
device-mapper: reload ioctl on mpath1 failed: Invalid argument
Command failed
With following kernel message:
device-mapper: ioctl: can't change device type after initial table load.
DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.
Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.
Fixes: 65803c2059 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull m,ore md bugfixes gfrom Neil Brown:
"Assorted fixes for new RAID5 stripe-batching functionality.
Unfortunately this functionality was merged a little prematurely. The
necessary testing and code review is now complete (or as complete as
it can be) and to code passes a variety of tests and looks quite
sensible.
Also a fix for some recent locking changes - a race was introduced
which causes a reshape request to sometimes fail. No data safety
issues"
* tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md:
md: fix race when unfreezing sync_action
md/raid5: break stripe-batches when the array has failed.
md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
md/raid5: be more selective about distributing flags across batch.
md/raid5: add handle_flags arg to break_stripe_batch_list.
md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
md/raid5: remove condition test from check_break_stripe_batch_list.
md/raid5: Ensure a batch member is not handled prematurely.
md/raid5: close race between STRIPE_BIT_DELAY and batching.
md/raid5: ensure whole batch is delayed for all required bitmap updates.
When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.
The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request"). But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.
Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).
Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A recent change removed the need for locking around writing
to "sync_action" (and various other places), but introduced a
subtle race.
When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
flag is cleared before 'reshape' is set, so the md thread can
get in and start trying recovery - which isn't wanted.
So instead of clearing MD_RECOVERY_FROZEN for any command
except 'frozen', only clear it when each specific command
is parsed. This allows the handling of 'reshape' to clear
the bit while a lock is held.
Also remove some places where we set MD_RECOVERY_NEEDED,
as it is always set on non-error exit of the function.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")