Without a way to flush the osd client's notify workqueue, a watch
event that is unregistered could continue receiving callbacks
indefinitely.
Unregistering the event simply means no new notifies are added to the
queue, but there may still be events in the queue that will call the
watch callback for the event. If the queue is flushed after the event
is unregistered, the caller can be sure no more watch callbacks will
occur for the canceled watch.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
handle_reply() calls complete_request() only if the first OSD reply
has ONDISK flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This creates a new source file "net/ceph/snapshot.c" to contain
utility routines related to ceph snapshot contexts. The main
motivation was to define ceph_create_snap_context() as a common way
to create these structures, but I've moved the definitions of
ceph_get_snap_context() and ceph_put_snap_context() there too.
(The benefit of inlining those is very small, and I'd rather
keep this collection of functions together.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A ceph timespec contains 32-bit unsigned values for its seconds and
nanoseconds components. For a standard timespec, both fields are
signed, and the seconds field is almost surely 64 bits.
Add some explicit casts so the fact that this conversion is taking
place is obvious. Also trip a bug if we ever try to put out of
range (negative or too big) values into a ceph timespec.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Flesh out the limits defined in <linux/ceph/decode.h> to include the
maximum and minimum values for signed type S8, S16, S32, and S64.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add the ability to provide an array of pages as outbound request
data for object class method calls.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Allow osd request ops that aren't otherwise structured (not class,
extent, or watch ops) to specify "raw" data to be used to hold
incoming data for the op. Make use of this capability for the osd
STAT op.
Prefix the name of the private function osd_req_op_init() with "_",
and expose a new function by that (earlier) name whose purpose is to
initialize osd ops with (only) implied data.
For now we'll just support the use of a page array for an osd op
with incoming raw data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data. Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request currently has two callbacks. They inform the
initiator of the request when we've received confirmation for the
target osd that a request was received, and when the osd indicates
all changes described by the request are durable.
The only time the second callback is used is in the ceph file system
for a synchronous write. There's a race that makes some handling of
this case unsafe. This patch addresses this problem. The error
handling for this callback is also kind of gross, and this patch
changes that as well.
In ceph_sync_write(), if a safe callback is requested we want to add
the request on the ceph inode's unsafe items list. Because items on
this list must have their tid set (by ceph_osd_start_request()), the
request added *after* the call to that function returns. The
problem with this is that there's a race between starting the
request and adding it to the unsafe items list; the request may
already be complete before ceph_sync_write() even begins to put it
on the list.
To address this, we change the way the "safe" callback is used.
Rather than just calling it when the request is "safe", we use it to
notify the initiator the bounds (start and end) of the period during
which the request is *unsafe*. So the initiator gets notified just
before the request gets sent to the osd (when it is "unsafe"), and
again when it's known the results are durable (it's no longer
unsafe). The first call will get made in __send_request(), just
before the request message gets sent to the messenger for the first
time. That function is only called by __send_queued(), which is
always called with the osd client's request mutex held.
We then have this callback function insert the request on the ceph
inode's unsafe list when we're told the request is unsafe. This
will avoid the race because this call will be made under protection
of the osd client's request mutex. It also nicely groups the setup
and cleanup of the state associated with managing unsafe requests.
The name of the "safe" callback field is changed to "unsafe" to
better reflect its new purpose. It has a Boolean "unsafe" parameter
to indicate whether the request is becoming unsafe or is now safe.
Because the "msg" parameter wasn't used, we drop that.
This resolves the original problem reportedin:
http://tracker.ceph.com/issues/4706
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Right now the data for a method call is specified via a pointer and
length, and it's copied--along with the class and method name--into
a pagelist data item to be sent to the osd. Instead, encode the
data in a data item separate from the class and method names.
This will allow large amounts of data to be supplied to methods
without copying. Only rbd uses the class functionality right now,
and when it really needs this it will probably need to use a page
array rather than a page list. But this simple implementation
demonstrates the functionality on the osd client, and that's enough
for now.
This resolves:
http://tracker.ceph.com/issues/4104
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing. Currently only one data item is
ever added to a message, but that's about to change.
This resolves:
http://tracker.ceph.com/issues/2770
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch adds support to the messenger for more than one data item
in its data list.
A message data cursor has two more fields to support this:
- a count of the number of bytes left to be consumed across
all data items in the list, "total_resid"
- a pointer to the head of the list (for validation only)
The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.
When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided. The data pointer
is initialized to the first data item on the list. From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.
Advancing the cursor consumes bytes in total_resid. If the resid
field reaches zero, it means the current data item is fully
consumed. If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that. (A check is made at
this point to make sure we don't wrap around the front of the list.)
The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support. The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.
When total_resid reaches zero, we're done.
This resolves:
http://tracker.ceph.com/issues/3761
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In place of the message data pointer, use a list head which links
through message data items. For now we only support a single entry
on that list.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item. This will
allow a message cursor to be used for more than one data item.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.
Move the cursor embedded in the message data structure into the
message itself. To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.
Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.
This and the upcoming patches are part of:
http://tracker.ceph.com/issues/3761
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio is the only data item type that doesn't record its full
length. Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch:
15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have. Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Finally! Convert the osd op data pointers into real structures, and
make the switch over to using them instead of having all ops share
the in and/or out data structures in the osd request.
Set up a new function to traverse the set of ops and release any
data associated with them (pages).
This and the patches leading up to it resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Still using the osd request r_data_in and r_data_out pointer, but
we're basically only referring to it via the data pointers in the
osd ops. And we're transferring that information to the request
or reply message only when the op indicates it's needed, in
osd_req_encode_op().
To avoid a forward reference, ceph_osdc_msg_data_set() was moved up
in the file.
Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(),
because the ops array will already be zeroed anyway.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This ends up being a rather large patch but what it's doing is
somewhat straightforward.
Basically, this is replacing two calls with one. The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters. In place of those two will be a single function that
initializes the op directly.
That means we sort of fan out a set of the needed functions:
- extent ops with pages data
- extent ops with pagelist data
- extent ops with bio list data
and
- class ops with page data for receiving a response
We also have define another one, but it's only used internally:
- class ops with pagelist data for request parameters
Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields. All the osd ops refer to them for
their data. For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An object class method is formatted using a pagelist which contains
the class name, the method name, and the data concatenated into an
osd request's outbound data.
Currently when a class op is initialized in osd_req_op_cls_init(),
the lengths of and pointers to these three items are recorded.
Later, when the op is getting formatted into the request message, a
new pagelist is created and that is when these items get copied into
the pagelist.
This patch makes it so the pagelist to hold these items is created
when the op is initialized instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.
So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize. This better hides the details the
op structure (and faciltates moving the data pointers they use).
Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope. Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message. Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.
Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message. The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.
Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient. Add some assertions to verify
pointers are always set the way they're expected to be.
This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.
This is the first in a series of patches that resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>