.. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff
between kernel and userspace"). This fixes a bunch of recently pulled
coding style issues and makes includes a bit cleaner.
A patch "crush:Make the function crush_ln static" from Nicholas Krause
<xerofoify@gmail.com> is folded in as crush_ln() has been made static
in userspace as well.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Verify that the 'take' argument is a valid device or bucket.
Otherwise ignore it (do not add the value to the working vector).
Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
GFP_NOFS memory allocation is required for page writeback path.
But there is no need to use GFP_NOFS in syscall path and readpage
path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
if flushing caps were revoked, we should re-send the cap flush in
client reconnect stage. This guarantees that MDS processes the cap
flush message before issuing the flushing caps to other client.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
According to this information, MDS can trim its completed caps flush
list (which is used to detect duplicated cap flush).
Signed-off-by: Yan, Zheng <zyan@redhat.com>
So we know TID of the oldest pending caps flushing. Later patch will
send this information to MDS, so that MDS can trim its completed caps
flush list.
Tracking pending caps flushing globally also simplifies syncfs code.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Previously we do not trace accurate TID for flushing caps. when
MDS failovers, we have no choice but to re-send all flushing caps
with a new TID. This can cause problem because MDS can has already
flushed some caps and has issued the same caps to other client.
The re-sent cap flush has a new TID, which makes MDS unable to
detect if it has already processed the cap flush.
This patch adds code to track pending caps flushing accurately.
When re-sending cap flush is needed, we use its original flush
TID.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
modinfo libceph prints the module name "Ceph filesystem for Linux",
which is same as the real fs module ceph. It's confusing.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
fsync() on directory should flush dirty caps and wait for any
uncommitted directory opertions to commit. But ceph_dir_fsync()
only waits for uncommitted directory opertions.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Current ceph_fsync() only flushes dirty caps and wait for them to be
flushed. It doesn't wait for caps that has already been flushing.
This patch makes ceph_fsync() wait for pending flushing caps too.
Besides, this patch also makes caps_are_flushed() peroperly handle
tid wrapping.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when copying files to cephfs, file data may stay in page cache after
corresponding file is closed. Cached data use Fc capability. If we
include Fc capability in cap_wanted, MDS will treat files with cached
data as open files, and journal them in an EOpen event when trimming
log segment.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
As part of unmap sequence, kernel client has to talk to the OSDs to
teardown watch on the header object. If none of the OSDs are available
it would hang forever, until interrupted by a signal - when that
happens we follow through with the rest of unmap procedure (i.e.
unregister the device and put all the data structures) and the unmap is
still considired successful (rbd cli tool exits with 0). The watch on
the userspace side should eventually timeout so that's fine.
This isn't very nice, because various userspace tools (pacemaker rbd
resource agent, for example) then have to worry about setting up their
own timeouts. Timeout it with mount_timeout (60 seconds by default).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@redhat.com>
No need to bifurcate wait now that we've got ceph_timeout_jiffies().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
- return -ETIMEDOUT instead of -EIO in case of timeout
- wait_event_interruptible_timeout() returns time left until timeout
and since it can be almost LONG_MAX we had better assign it to long
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.
There is also a bug in the way mount_timeout=0 is handled. It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.
Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Previously we pre-allocate cap release messages for each caps. This
wastes lots of memory when there are large amount of caps. This patch
make the code not pre-allocate the cap release messages. Instead,
we add the corresponding ceph_cap struct to a list when releasing a
cap. Later when flush cap releases is needed, we allocate the cap
release messages dynamically.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps()
accesses snap realm's cached_context. So we need take read lock
of snap_rwsem.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when a snap notification contains no new snapshot, we can avoid
sending FLUSHSNAP message to MDS. But we still need to create
cap_snap in some case because it's required by write path and
page writeback path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
In most cases that snap context is needed, we are holding
reference of CEPH_CAP_FILE_WR. So we can set ceph inode's
i_head_snapc when getting the CEPH_CAP_FILE_WR reference,
and make codes get snap context from i_head_snapc. This makes
the code simpler.
Another benefit of this change is that we can handle snap
notification more elegantly. Especially when snap context
is updated while someone else is doing write. The old queue
cap_snap code may set cap_snap's context to ether the old
context or the new snap context, depending on if i_head_snapc
is set. The new queue capp_snap code always set cap_snap's
context to the old snap context.
Signed-off-by: Yan, Zheng <zyan@redhat.com>