Commit Graph

211536 Commits

Author SHA1 Message Date
Sage Weil cd045cb42a ceph: fix rdcache_gen usage and invalidate
We used to use rdcache_gen to indicate whether we "might" have cached
pages.  Now we just look at the mapping to determine that.  However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages.  That can happen
at any time (presumably when we carry FILE_CACHE).  We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate.  If they are
equal, we should invalidate.  On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen.  Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1.  (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil feb4cc9bb4 ceph: re-request max_size if cap auth changes
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests.  This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:23 -08:00
Sage Weil 912a9b0319 ceph: only let auth caps update max_size
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap.  Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:21 -08:00
Sage Weil 7421ab8041 ceph: fix open for write on clustered mds
Normally when we open a file we already have a cap, and simply update the
wanted set.  However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS.  Only
reuse existing caps if we are opening for read or the existing cap is auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:07:15 -08:00
Sage Weil d8b16b3d1c ceph: fix bad pointer dereference in ceph_fill_trace
We dereference *in a few lines down, but only set it on rename.  It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 08:40:43 -08:00
Sage Weil df9f86faf3 ceph: fix small seq message skipping
If the client gets out of sync with the server message sequence number, we
normally skip low seq messages (ones we already received).  The skip code
was also incrementing the expected seq, such that all subsequent messages
also appeared old and got skipped, and an eventual timeout on the osd
connection.  This resulted in some lagging requests and console messages
like

[233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
[233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
[233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
[233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
[233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
[233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
[233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
[233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
[233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
[233500.534614] ceph:  tid 6023602 timed out on osd22, will reset osd

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-01 15:49:23 -07:00
Sage Weil 2f56f56ad9 Revert "ceph: update issue_seq on cap grant"
This reverts commit d91f2438d8.

The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths.  By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.

The larger question is what workload/problem made me think it should be
updated here...

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-27 21:05:54 -07:00
Sage Weil efa4c1206e ceph: do not carry i_lock for readdir from dcache
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work.  Since we don't actually need it here anyway, remove
it.

We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:27 -07:00
Julia Lawall 61413c2f59 fs/ceph/xattr.c: Use kmemdup
Convert a sequence of kmalloc and memcpy to use kmemdup.

The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@

  a =
-  \(kmalloc\|kzalloc\)(len,flag)
+  kmemdup(arg,len,flag)
  <... when != a
  if (a == NULL || ...) S
  ...>
- memcpy(a,arg,len+1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:26 -07:00
Dan Carpenter 85b5aaa624 rbd: passing wrong variable to bvec_kunmap_irq()
We should be passing "buf" here insead of "bv".  This is tricky because
it's not the same as kmap() and kunmap().  GCC does warn about it if you
compile on i386 with CONFIG_HIGHMEM.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:25 -07:00
Dan Carpenter b8d0638a98 rbd: null vs ERR_PTR
ceph_alloc_page_vector() returns ERR_PTR(-ENOMEM) on errors.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:24 -07:00
Sage Weil 240634e9b3 ceph: fix num_pages_free accounting in pagelist
Decrement the free page counter when removing a page from the free_list.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:23 -07:00
Greg Farnum 571dba52a3 ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:23 -07:00
Yehuda Sadeh 010e3b48fc ceph: don't crash when passed bad mount options
This only happened when parse_extra_token was not passed
to ceph_parse_option() (hence, only happened in rbd).

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:22 -07:00
Randy Dunlap 6f453ed6c0 ceph: fix debugfs warnings
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:

fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:21 -07:00
Yehuda Sadeh f4cf3deef4 block: rbd: removing unnecessary test
rbd_get_segment() can't return a negative value, we don't need to check
the return output.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:20 -07:00
Vasiliy Kulikov 28f259b7cd block: rbd: fixed may leaks
rbd_client_create() doesn't free rbdc, this leads to many leaks.

seg_len in rbd_do_op() is unsigned, so (seg_len < 0) makes no sense.
Also if fixed check fails then seg_name is leaked.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:19 -07:00
Sage Weil 496e59553c ceph: switch from BKL to lock_flocks()
Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:18 -07:00
Greg Farnum fca4451acf ceph: preallocate flock state without locks held
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held.  Preallocate space without
the lock, and retry if the lock state changes out from underneath us.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:17 -07:00
Greg Farnum ac0b74d8a1 ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursor
These facilitate preallocation of pages so that we can encode into the pagelist
in an atomic context.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:16 -07:00
Sage Weil 18a38193ef ceph: use mapping->nrpages to determine if mapping is empty
This is simpler and faster.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil 93afd449aa ceph: only invalidate on check_caps if we actually have pages
The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil 4c32f5dda5 ceph: do not hide .snap in root directory
Snaps in the root directory are now supported by the MDS, and harmless on
older versions.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:14 -07:00
Yehuda Sadeh 602adf4002 rbd: introduce rados block device (rbd), based on libceph
The rados block device (rbd), based on osdblk, creates a block device
that is backed by objects stored in the Ceph distributed object storage
cluster.  Each device consists of a single metadata object and data
striped over many data objects.

The rbd driver supports read-only snapshots.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:13 -07:00
Yehuda Sadeh 3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00