This patch fixes a bug which was triggering a case where the primary leg
could not be changed on failure even when the mirror was in-sync.
The case involves the failure of the primary device along with
the transient failure of the log device. The problem is that
bios can be put on the 'failures' list (due to log failure)
before 'fail_mirror' is called due to the primary device failure.
Normally, this is fine, but if the log device failure is transient,
a subsequent iteration of the work thread, 'do_mirror', will
reset 'log_failure'. The 'do_failures' function then resets
the 'in_sync' variable when processing bios on the failures list.
The 'in_sync' variable is what is used to determine if the
primary device can be switched in the event of a failure. Since
this has been reset, the primary device is incorrectly assumed
to be not switchable.
The case has been seen in the cluster mirror context, where one
machine realizes the log device is dead before the other machines.
As the responsibilities of the server migrate from one node to
another (because the mirror is being reconfigured due to the failure),
the new server may think for a moment that the log device is fine -
thus resetting the 'log_failure' variable.
In any case, it is inappropiate for us to reset the 'log_failure'
variable. The above bug simply illustrates that it can actually
hurt us.
Cc: stable@kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Incorrect device area lengths are being passed to device_area_is_valid().
The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.
With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes). An example of a
consequent incorrect error message is:
device-mapper: table: 254:0: sdb too small for target
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.
The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing). Otherwise the bios could sit on the list indefinitely.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add .iterate_devices to 'struct target_type' to allow a function to be
called for all devices in a DM target. Implemented it for all targets
except those in dm-snap.c (origin and snapshot).
(The raid1 version number jumps to 1.12 because we originally reserved
1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
instead.)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: martin.petersen@oracle.com
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The logging API needs an extra function to make cluster mirroring
possible. This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster. This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.
Cluster-aware log modules will implement this function. Single
machine log modules will not. So, there is no performance
penalty for single machine mirrors.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
is significantly increased (the vector list takes 3072 bytes on 32-bit machines
and 4096 bytes on 64-bit machines).
The structure dm_raid1_read_record used to be allocated with kmalloc,
but kmalloc aligns the size on the next power-of-two so an object
slightly greater than 4096 will allocate 8192 bytes of memory and half of
that memory will be wasted.
This patch turns kmalloc into a slab cache which doesn't have this
padding so it will reduce the memory consumed.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move log size validation from mirror target to log constructor.
Removed PAGE_SIZE restriction we no longer think necessary.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change dm_unregister_target to return void and use BUG() for error
reporting.
dm_unregister_target can only fail because of programming bug in the
target driver. It can't fail because of user's behavior or disk errors.
This patch changes unregister_target to return void and use BUG if
someone tries to unregister non-registered target or unregister target
that is in use.
This patch removes code duplication (testing of error codes in all dm
targets) and reports bugs in just one place, in dm_unregister_target. In
some target drivers, these return codes were ignored, which could lead
to a situation where bugs could be missed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Always increase the error count when I/O on a leg of a mirror fails.
The error count is used to decide whether to select an alternative
mirror leg. If the target doesn't use the "handle_errors" feature, the
error count is not updated and the bio can get requeued forever by the
read callback.
Fix it by increasing error_count before the handle_errors feature
checking.
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
We queue work on keventd queue --- so this queue must be flushed in the
destructor. Otherwise, keventd could access mirror_set after it was freed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
Separate the region hash code from raid1 so it can be shared by forthcoming
targets. Use BUG_ON() for failed async dm_io() calls.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.
Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail. (Even for stripes it seems rather
unlikely.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally
when assigning kcopyd work. kcopyd is responsible for copying an
assigned section of disk to one or more other disks. The
'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way:
When not set:
kcopyd will immediately stop the copy operation when an error is
encountered.
When set:
kcopyd will try to proceed regardless of errors and try to continue
copying any remaining amount.
Since dm-raid1 tracks regions of the address space that are (or
are not) in sync and it now has the ability to handle these
errors, we can safely enable this optimization. This optimization
is conditional on whether mirror error handling has been enabled.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.
It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).
The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.
This was happenning:
When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.
fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.
wait_on_buffer() starts waiting until the IO is finished.
kmirrord is scheduled, kmirrord takes bios and submits them to the devices.
The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)
So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.
This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.
Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.
This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.
It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
This patch replaces the schedule() in the main kmirrord thread with a timer.
The schedule() could introduce an unwanted delay when work is ready to be
processed.
The code instead calls wake() when there's work to be done immediately, and
delayed_wake() after a failure to give a short delay before retrying.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-log interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the kcopyd interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-io interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the dirty region log code into a separate module so
other targets can share the code.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>