The index on the page must be set before it is inserted in the radix
tree. Otherwise there is a small race which can occur during lookup
where the page can be found with the incorrect index. This will trigger
the BUG_ON() in brd_lookup_page().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Chris Wedgwood <cw@f00f.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph fixes from Sage Weil:
"Yes, this is a much larger pull than I would like after -rc1. There
are a few things included:
- a few fixes for leaks and incorrect assertions
- a few patches fixing behavior when mapped images are resized
- handling for cloned/layered images that are flattened out from
underneath the client
The last bit was non-trivial, and there is some code movement and
associated cleanup mixed in. This was ready and was meant to go in
last week but I missed the boat on Friday. My only excuse is that I
was waiting for an all clear from the testing and there were many
other shiny things to distract me.
Strictly speaking, handling the flatten case isn't a regression and
could wait, so if you like we can try to pull the series apart, but
Alex and I would much prefer to have it all in as it is a case real
users will hit with 3.10."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits)
rbd: re-submit flattened write request (part 2)
rbd: re-submit write request for flattened clone
rbd: re-submit read request for flattened clone
rbd: detect when clone image is flattened
rbd: reference count parent requests
rbd: define parent image request routines
rbd: define rbd_dev_unparent()
rbd: don't release write request until necessary
rbd: get parent info on refresh
rbd: ignore zero-overlap parent
rbd: support reading parent page data for writes
rbd: fix parent request size assumption
libceph: init sent and completed when starting
rbd: kill rbd_img_request_get()
rbd: only set up watch for mapped images
rbd: set mapping read-only flag in rbd_add()
rbd: support reading parent page data
rbd: fix an incorrect assertion condition
rbd: define rbd_dev_v2_header_info()
rbd: get rid of trivial v1 header wrappers
...
Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.
Kill off some redundant assertions.
This completes the resolution for:
http://tracker.ceph.com/issues/3763
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case. (See the previous commit for more
reasoning about why this is appropriate.)
Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.
The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:
rbd client parent image/osd
---------- ----------------
original object ENOENT;
issue parent read
respond to parent read
child image flattened
original image header refresh
<--- original object written independently here
parent read response received
Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated. Once this is complete, the
clone's association with the parent is abolished.
Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly. We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.
There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away. Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.
We leverage the fact that a zero parent overlap renders an image
effectively unlayered. We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.
This and the next few patches resolve:
http://tracker.ceph.com/issues/3763
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Keep a reference count for uses of the parent information for an rbd
device.
An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap). Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.
The initial reference is dropped when the image gets torn down.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects. For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read. The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().
Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request. We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get parent info for format 2 images on every refresh (rather than
just during the initial probe). This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened). Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)
Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered. It will soon be possible for
a layered image to become unlayered while a request is in flight.
This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created. The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.
This and the next several patches are related to:
http://tracker.ceph.com/issues/3763
NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist. This is not really correct, and is a scenario that needs
to be addressed. For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all. Detect this case and treat
such an image as non-layered. Issue a warning to be sure the user
knows what's going on.
This resolves:
http://tracker.ceph.com/issues/5028
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data. But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.
This is handling the same kind of issue as was resolved by this
commit:
5b2ab72d rbd: support reading parent page data
This resolves:
http://tracker.ceph.com/issues/5027
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.
That is not necessarily the case. The parent overlap could reduce
the request size below that. To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request. Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.
This resolves:
http://tracker.ceph.com/issues/5038
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull NVMe driver update from Matthew Wilcox:
"Lots of exciting new features in the NVM Express driver this time,
including support for emulating SCSI commands, discard support and the
ability to submit per-sector metadata with I/Os.
It's still mostly bugfixes though!"
* git://git.infradead.org/users/willy/linux-nvme: (27 commits)
NVMe: Use user defined admin ioctl timeout
NVMe: Simplify Firmware Activate code slightly
NVMe: Only clear the enable bit when disabling controller
NVMe: Wait for device to acknowledge shutdown
NVMe: Schedule timeout for sync commands
NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
NVMe: Device specific stripe size handling
NVMe: Split non-mergeable bio requests
NVMe: Remove dead code in nvme_dev_add
NVMe: Check for NULL memory in nvme_dev_add
NVMe: Fix error clean-up on nvme_alloc_queue
NVMe: Free admin queue on request_irq error
NVMe: Add scsi unmap to SG_IO
NVMe: queue usage fixes in nvme-scsi
NVMe: Set TASK_INTERRUPTIBLE before processing queues
NVMe: Add a character device for each nvme device
NVMe: Fix endian-related problems in user I/O submission path
NVMe: Fix I/O cancellation status on big-endian machines
NVMe: Fix sparse warnings in scsi emulation
NVMe: Don't fail initialisation unnecessarily
...
Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.
We can't set up the watch request until we know the name of the
header object though. So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.
Change the second parameter to rbd_dev_header_watch_sync() be
Boolean while we're at it.
This resolves:
http://tracker.ceph.com/issues/4941
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.
So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.
This resolves:
http://tracker.ceph.com/issues/4940
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data. But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.
Fortunately, it's not hard to add support for page data.
This resolves:
http://tracker.ceph.com/issues/4939
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object. But
assertion was looking at the parent image order rather than the
original one, and these values can differ.
Fix that.
This resolves:
http://tracker.ceph.com/issues/4938
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info(). While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively. So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().
Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.
Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.
Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe(). Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.
Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.
This highlights that are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.
Now rbd_dev_v1_probe() *is* a trivial wrapper function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>