Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.
At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.
Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When trimming the caps and just after the 'session->s_cap_lock' is
released in ceph_iterate_session_caps() the cap maybe removed by
another thread, and when using the stale cap memory in the callbacks
it will trigger use-after-free crash.
We need to check the existence of the cap just after the 'ci->i_ceph_lock'
being acquired. And do nothing if it's already removed.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/43272
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds latency and size metrics for remote object copies
operations ("copyfrom"). For now, these metrics will be available on the
client only, they won't be sent to the MDS.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch modifies struct ceph_client_metric so that each metric block
(read, write and metadata) becomes an element in a array. This allows to
also re-write the helper functions that handle these blocks, making them
simpler and, above all, reduce the amount of copy&paste every time a new
metric is added.
Thus, for each of these metrics there will be a new struct ceph_metric
entry that'll will contain all the sizes and latencies fields (and a lock).
Note however that the metadata metric doesn't really use the size_fields,
and thus this metric won't be shown in the debugfs '../metrics/size' file.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, all the metrics are grouped together in a single file, making
it difficult to process this file from scripts. Furthermore, as new
metrics are added, processing this file will become even more challenging.
This patch turns the 'metric' file into a directory that will contain
several files, one for each metric.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will collect IO's total size and then calculate the average
size, and also will collect the min/max IO sizes.
The debugfs will show the size metrics in bytes and will let the
userspace applications to switch to what they need.
URL: https://tracker.ceph.com/issues/49913
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In client for each inode, it may have many opened files and may
have been pinned in more than one MDS servers. And some inodes
are idle, which have no any opened files.
This patch will show these metrics in the debugfs, likes:
item total
-----------------------------------------
opened files / total inodes 14 / 5
pinned i_caps / total inodes 7 / 5
opened inodes / total inodes 3 / 5
Will send these metrics to ceph, which will be used by the `fs top`,
later.
[ jlayton: drop unrelated hunk, count hashed inodes instead of
allocated ones ]
URL: https://tracker.ceph.com/issues/47005
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In multi-mds, the 'caps' debugfs file will have duplicate ino,
add the 'mds' column to indicate which mds session the cap belongs to.
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).
I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).
What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.
Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.
The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.
While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.
NOTE: This is a user-visible change on 32-bit arches:
1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.
2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.
The workaround for both problems is to mount with "-o ino32".
[ idryomov: changelog tweak ]
URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The variable mds is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will help to reduce using the global mdsc->mutex lock in many
places.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.
Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.
For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.
The output looks like:
item total miss hit
-------------------------------------------------
d_lease 11 7 141
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates
DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The m_num_mds here is actually the number for MDSs which are in
up:active status, and it will be duplicated to m_num_active_mds,
so remove it.
Add possible_max_rank to the mdsmap struct and this will be
the correctly possible largest rank boundary.
Remove the special case for one mds in __mdsmap_get_random_mds(),
because the validate mds rank may not always be 0.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add some visibility of tasks that are waiting for caps to the "caps"
debugfs file. Display the tgid of the waiting task, inode number, and
the caps the task needs and wants.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This fixes a build warning to that effect.
Fixes: 1a829ff2a6 ("ceph: no need to check return value of debugfs_create functions")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull ceph updates from Ilya Dryomov:
"Lots of exciting things this time!
- support for rbd object-map and fast-diff features (myself). This
will speed up reads, discards and things like snap diffs on sparse
images.
- ceph.snap.btime vxattr to expose snapshot creation time (David
Disseldorp). This will be used to integrate with "Restore Previous
Versions" feature added in Windows 7 for folks who reexport ceph
through SMB.
- security xattrs for ceph (Zheng Yan). Only selinux is supported for
now due to the limitations of ->dentry_init_security().
- support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
Layton). This is actually a single feature bit which was missing
because of the filesystem pieces. With this in, the kernel client
will finally be reported as "luminous" by "ceph features" -- it is
still being reported as "jewel" even though all required Luminous
features were implemented in 4.13.
- stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
with xattrs is to not terminate and this was causing
inconsistencies with ceph-fuse.
- change filesystem time granularity from 1 us to 1 ns, again fixing
an inconsistency with ceph-fuse (Luis Henriques).
On top of this there are some additional dentry name handling and cap
flushing fixes from Zheng. Finally, Jeff is formally taking over for
Zheng as the filesystem maintainer"
* tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
ceph: fix end offset in truncate_inode_pages_range call
ceph: use generic_delete_inode() for ->drop_inode
ceph: use ceph_evict_inode to cleanup inode's resource
ceph: initialize superblock s_time_gran to 1
MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
rbd: setallochint only if object doesn't exist
rbd: support for object-map and fast-diff
rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
libceph: export osd_req_op_data() macro
libceph: change ceph_osdc_call() to take page vector for response
libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
rbd: new exclusive lock wait/wake code
rbd: quiescing lock should wait for image requests
rbd: lock should be quiesced on reacquire
rbd: introduce copyup state machine
rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
rbd: move OSD request allocation into object request state machines
rbd: factor out __rbd_osd_setup_discard_ops()
rbd: factor out rbd_osd_setup_copyup()
rbd: introduce obj_req->osd_reqs list
...