Commit Graph

376 Commits

Author SHA1 Message Date
Ilya Dryomov
33e9c8dbfb libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
All four SERVER_LUMINOUS feature bits are implemented, switch it on!

NEW_OSDOP_ENCODING doesn't mean much for the client (it signifies
support for MOSDOp v6) but needs to be enabled in order to get the
latest (currently v25) pg_pool_t.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Sage Weil <sage@redhat.com>
2017-07-07 17:26:24 +02:00
Ilya Dryomov
0bb05da2ec libceph: osd_state is 32 bits wide in luminous
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:19 +02:00
Ilya Dryomov
6f428df47d libceph: pg_upmap[_items] infrastructure
pg_temp and pg_upmap encodings are the same (PG -> array of osds),
except for the incremental remove: it's an empty mapping in new_pg_temp
for pg_temp and a separate old_pg_upmap set for pg_upmap.  (This isn't
to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
looked at as an example for pg_upmap encoding.)

Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
__decode_pg_temp() stores into pg_temp union member, but since pg_upmap
union member is identical, reading through pg_upmap later is OK.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:18 +02:00
Ilya Dryomov
278b1d709c libceph: ceph_decode_skip_* helpers
Some of these won't be as efficient as they could be (e.g.
ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
once instead of advancing by sizeof(u32) len times), but that's fine
and not worth a bunch of extra macro code.

Replace skip_name_map() with ceph_decode_skip_map as an example.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:18 +02:00
Ilya Dryomov
a02a946dfe libceph: respect RADOS_BACKOFF backoffs
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov
76f827a7b1 libceph: make DEFINE_RB_* helpers more general
Initially for ceph_pg_mapping, ceph_spg_mapping and ceph_hobject_id,
compared with ceph_pg_compare(), ceph_spg_compare() and hoid_compare()
respectively.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov
df28152d53 libceph: avoid unnecessary pi lookups in calc_target()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov
04c7d789e2 libceph: make sure need_resend targets reflect latest map
Otherwise we may miss events like PG splits, pool deletions, etc when
we get multiple incremental maps at once.  Because check_pool_dne() can
now be fed an unlinked request, finish_request() needed to be taught to
handle unlinked requests.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:16 +02:00
Ilya Dryomov
7de030d6b1 libceph: resend on PG splits if OSD has RESEND_ON_SPLIT
Note that ceph_osd_request_target fields are updated regardless of
RESEND_ON_SPLIT.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:16 +02:00
Ilya Dryomov
8cb441c054 libceph: MOSDOp v8 encoding (actual spgid + full hash)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov
98ad5ebd15 libceph: ceph_connection_operations::reencode_message() method
Give upper layers a chance to reencode the message after the connection
is negotiated and ->peer_features is set.  OSD client will use this to
support both luminous and pre-luminous OSDs (in a single cluster): the
former need MOSDOp v8; the latter will continue to be sent MOSDOp v4.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov
dc98ff7230 libceph: introduce ceph_spg, ceph_pg_to_primary_shard()
Store both raw pgid and actual spgid in ceph_osd_request_target.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov
dc93e0e283 libceph: fold [l]req->last_force_resend into ceph_osd_request_target
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov
220abf5aa7 libceph: support SERVER_JEWEL feature bits
Only MON_STATEFUL_SUB, really.  MON_ROUTE_OSDMAP and
OSDSUBOP_NO_SNAPCONTEXT are irrelevant.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov
2d7522e0bd libceph: advertise support for OSD_POOLRESEND
The code has been in place since commit 63244fa123 ("libceph:
introduce ceph_osd_request_target, calc_target()"), and, with the
ceph_{oloc,oid}_copy() issue fixed in the previous commit, is now
in working order.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov
f179d3ba8c libceph: new features macros
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov
dcbbd97ccb libceph: remove ceph_sanitize_features() workaround
Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov
6f4dbd149d libceph: use kbasename() and kill ceph_file_part()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2017-05-23 20:32:10 +02:00
Alexander Graf
f775ff7d89 ceph: fix file open flags on ppc64
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.

Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That lead to bogus flags getting transmitted on ppc64.

This patch converts the kernel view of flags to the ceph view of file
open flags.

Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Ilya Dryomov
14bb211d32 rbd: support updating the lock cookie without releasing the lock
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 09:19:23 +02:00
Jeff Layton
58eb7932ae libceph: add an epoch_barrier field to struct ceph_osd_client
Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
aa26d662b9 libceph: remove req->r_replay_version
Nothing uses this anymore with the removal of the ack vs. commit code.
Remove the field and just encode zeroes into place in the request
encoding.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
79162547b7 ceph: make seeky readdir more efficient
Current cephfs client uses string to indicate start position of
readdir. The string is last entry of previous readdir reply.
This approach does not work for seeky readdir because we can
not easily convert the new postion to a string. For seeky readdir,
mds needs to return dentries from the beginning. Client keeps
retrying if the reply does not contain the dentry it wants.

In current version of ceph, mds sorts CDentry in its cache in
hash order. Client also uses dentry hash to compose dir postion.
For seeky readdir, if client passes the hash part of dir postion
to mds. mds can avoid replying useless dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
76201b6354 ceph: allow connecting to mds whose rank >= mdsmap::m_max_mds
mdsmap::m_max_mds is the expected count of active mds. It's not the
max rank of active mds. User can decrease mdsmap::m_max_mds, but does
not stop mds whose rank >= mdsmap::m_max_mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00