Philipp writes:
This are the updates we have in the drbd-8.3 tree. They are intended
for your "for-3.5/drivers" drivers branch.
These changes include one new feature:
* Allow detach from frozen backing devices with the new --force option;
configurable timeout for backing devices by the new disk-timeout option
And huge number of bug fixes:
* Fixed a write ordering problem on SyncTarget nodes for a write
to a block that gets resynced at the same time. The bug can
only be triggered with a device that has a firmware that
actually reorders writes to the same block
* Fixed a race between disconnect and receive_state, that could cause
a IO lockup
* Fixed resend/resubmit for requests with disk or network timeout
* Make sure that hard state changed do not disturb the connection
establishing process (I.e. detach due to an IO error). When the
bug was triggered it caused a retry in the connect process
* Postpone soft state changes to no disturb the connection
establishing process (I.e. becoming primary). When the bug
was triggered it could cause both nodes going into SyncSource state
* Fixed a refcount leak that could cause failures when trying to
unload a protocol family modules, that was used by DRBD
* Dedicated page pool for meta data IOs
* Deny normal detach (as opposed to --forced) if the user tries
to detach from the last UpToDate disk in the resource
* Fixed a possible protocol error that could be caused by
"unusual" BIOs.
* Enforce the disk-timeout option also on meta-data IO operations
* Implemented stable bitmap pages when we do a full write out of
the bitmap
* Fixed a rare compatibility issue with DRBD's older than 8.3.7
when negotiating the bio_size
* Fixed a rare race condition where an empty resync could stall with
if pause/unpause events happen in parallel
* Made the re-establishing of connections quicker, if it got a broken pipe
once. Previously there was a bug in the code caused it to waste the first
successful established connection after a broken pipe event.
PS: I am postponing the drbd-8.4 for mainline for one or two kernel
development cycles more (the ~400 patchets set).
Konrad writes:
Please git pull the following branch:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.5
in your for-3.5/drivers branch. The changes in it are rather simple - cleaning
up some code and adding proper mechanism to unload without leaking memory.
I have fought the current maintainer to the death in the Thunderdome
[1] https://lkml.org/lkml/2012/5/16/370
Umm, actually, there is noone taking care of the driver due to lack
of real hardware, and I still have some. The driver exposes bugs on
emulated/virtualized devices (mostly due to timing), but it's essential
to verify the fixes against a real hardware as well (which has been
holding back some of the fixes).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Block layer now handles O_EXCL in a generic way for block devices.
The semantics is however different for floppy and all other block devices,
as floppy driver contains its own O_EXCL handling.
The semantics for all-but-floppy bdevs is "there can be at most one O_EXCL
open of this file", while for floppy bdev the semantics is "if someone has
the bdev open with O_EXCL, noone else can open it".
There is actual userspace-observable change in behavior because of this
since commit e525fd89d3 ("block: make blkdev_get/put() handle exclusive
access") -- on kernels containing this commit, mount of /dev/fd0 causes
the fd0 block device be claimed with _EXCL, preventing subsequent
open(/dev/fd0).
Bring things back into shape, i.e. make it possible, analogically to
other block devices, to mount the floppy and open() it afterwards --
remove the floppy-specific handling and let the generic bdev code O_EXCL
handling take over.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There are several races in floppy driver between bottom half
(scheduled_work) and timers (fd_timeout, fd_timer). Due to slowness
of the actual floppy devices, those races are never (at least to my
knowledge) triggered on a bare floppy metal. However on virtualized
(emulated) floppy drives, which are of course magnitudes faster
than the real ones, these races trigger reliably. They usually exhibit
themselves as NULL pointer dereferences during DMA setup, such as
BUG: unable to handle kernel NULL pointer dereference at 0000000a
[ ... snip ... ]
EIP: 0060:[<c02053d5>] EFLAGS: 00010293 CPU: 0
EAX: ffffe000 EBX: 0000000a ECX: 00000000 EDX: 0000000a
ESI: c05d2718 EDI: 00000000 EBP: 00000000 ESP: f540fe44
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=f540e000 task=c082d5a0 task.ti=c0826000)
Stack:
ffffe000 00001ffc 00000000 00000000 00000000 c05d2718 c0708b40 f540fe80
c020470f c05d2718 c0708b40 00000000 f540fe80 0000000a f540fee4 00000000
c0708b40 f540fee4 00000000 00000000 c020526b 00000000 c05d2718 c0708b40
Call Trace:
[<c020470f>] dump_trace+0xaf/0x110
[<c020526b>] show_trace_log_lvl+0x4b/0x60
[<c0205298>] show_trace+0x18/0x20
[<c05c5811>] dump_stack+0x6d/0x72
[<c0248527>] warn_slowpath_common+0x77/0xb0
[<c02485f3>] warn_slowpath_fmt+0x33/0x40
[<f7ec593c>] setup_DMA+0x14c/0x210 [floppy]
[<f7ecaa95>] setup_rw_floppy+0x105/0x190 [floppy]
[<c0256d08>] run_timer_softirq+0x168/0x2a0
[<c024e762>] __do_softirq+0xc2/0x1c0
[<c02042ed>] do_softirq+0x7d/0xb0
[<f54d8a00>] 0xf54d89ff
but other instances can be easily seen as well. This can be observed at least under
VMWare, VirtualBox and KVM.
This patch converts all the timers and bottom halfs to be processed in a single
workqueue. This aproach has been already discussed back in 2010 if I remember
correctly, and Acked by Linus [1], but it then never made it to the tree.
This all is based on original idea and code of Stephen Hemminger. I have
ported original Stepen's code to the current state of the floppy driver, and
performed quite some testing (on real hardware), which didn't reveal any issues
(this includes not only writing and reading data, but also formatting
(unfortunately I didn't find any Double-Density disks any more)). Ability to
handle errors properly (supplying known bad floppies) has also been verified.
[1] http://kerneltrap.org/mailarchive/linux-kernel/2010/6/11/4582092
Based-on-patch-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The blkdev major must be released upon exit, or else the module can't
attach to devices using the same majors upon being loaded again. Also
avoid leaking the minor tracking bitmap.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- devices beyond xvdzz didn't get proper names assigned at all
- extended devices with minors not representable within the kernel's
major/minor bit split spilled into foreign majors
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
THIS_MODULE is NULL only when drbd is compiled as built-in,
so the #ifdef CONFIG_MODULES should be #ifdef MODULE instead.
This fixes the warning:
drivers/block/drbd/drbd_main.c: In function ‘drbd_buildtag’:
drivers/block/drbd/drbd_main.c:4187:24: warning: the comparison will always evaluate as ‘true’ for the address of ‘__this_module’ will never be NULL [-Waddress]
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Don't rely on availability of bios from the global fs_bio_set,
we should use our own bio_set for meta data IO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
...).
Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Symptom: messages similar to
"FIXME asender in bm_change_bits_to,
bitmap locked for 'write from resync_finished' by worker"
If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered. If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.
To fix this, introduce the drbd_bm_write_copy_pages() variant.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.
If *in that very moment* a new verify or resync is triggered,
this can race:
ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.
This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.
Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).
Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".
Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit
drbd: Implemented real timeout checking for request processing time
if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.
This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Changes to the role and disk state should be delayed or rejected
while we establish a connection.
This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.
The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
One invocation in the endio handler is good enough,
we don't need mention it for each of the different ways
it calls __req_mod().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Just because this request happened during a resync does
not mean it may pretend to have been barrier-acked.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
READ_RETRY_REMOTE_CANCELED needs to be grouped with the other _CANCELED
cases, not with CONNECTION_LOST_WHILE_PENDING, as that would complete
(fail) the bio even if the device became suspended.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
OOS_HANDED_TO_NETWORK should not be grouped with the various
*_CANCELED/*_FAILED cases.
Also, not only clear the RQ_NET_QUEUED flag, but also mark it RQ_NET_DONE,
so it can be distinguished from a local-only request even after that.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We used to have a barrier implementation where barrier_nr 0 was
reserved. That is long gone. Just use the full sequence space.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>