Commit Graph

1160 Commits

Author SHA1 Message Date
Linus Torvalds ca4ba96e02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There are several patches from Ilya fixing RBD allocation lifecycle
  issues, a series adding a nocephx_sign_messages option (and associated
  bug fixes/cleanups), several patches from Zheng improving the
  (directory) fsync behavior, a big improvement in IO for direct-io
  requests when striping is enabled from Caifeng, and several other
  small fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: clear msg->con in ceph_msg_release() only
  libceph: add nocephx_sign_messages option
  libceph: stop duplicating client fields in messenger
  libceph: drop authorizer check from cephx msg signing routines
  libceph: msg signing callouts don't need con argument
  libceph: evaluate osd_req_op_data() arguments only once
  ceph: make fsync() wait unsafe requests that created/modified inode
  ceph: add request to i_unsafe_dirops when getting unsafe reply
  libceph: introduce ceph_x_authorizer_cleanup()
  ceph: don't invalidate page cache when inode is no longer used
  rbd: remove duplicate calls to rbd_dev_mapping_clear()
  rbd: set device_type::release instead of device::release
  rbd: don't free rbd_dev outside of the release callback
  rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
  libceph: use local variable cursor instead of &msg->cursor
  libceph: remove con argument in handle_reply()
  ceph: combine as many iovec as possile into one OSD request
  ceph: fix message length computation
  ceph: fix a comment typo
  rbd: drop null test before destroy functions
2015-11-13 09:24:40 -08:00
Michal Hocko c62d25556b mm, fs: introduce mapping_gfp_constraint()
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track.  This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Ilya Dryomov 79dbd1baa6 libceph: msg signing callouts don't need con argument
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:45 +01:00
Yan, Zheng 68cd5b4b76 ceph: make fsync() wait unsafe requests that created/modified inode
If we get a unsafe reply for request that created/modified inode,
add the unsafe request to a list in the newly created/modified
inode. So we can make fsync() wait these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Yan, Zheng 4c06ace81a ceph: add request to i_unsafe_dirops when getting unsafe reply
Previously we add request to i_unsafe_dirops when registering
request. So ceph_fsync() also waits for imcomplete requests.
This is unnecessary, ceph_fsync() only needs to wait unsafe
requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Yan, Zheng 5e804ac482 ceph: don't invalidate page cache when inode is no longer used
ceph_check_caps() invalidate page cache when inode is not used
by any open file. This behaviour is not friendly for workload
that repeatly read files.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Zhu, Caifeng b5b98989dc ceph: combine as many iovec as possile into one OSD request
Both ceph_sync_direct_write and ceph_sync_read iterate iovec elements
one by one, send one OSD request for each iovec. This is sub-optimal,
We can combine serveral iovec into one page vector, and send an OSD
request for the whole page vector.

Signed-off-by: Zhu, Caifeng <zhucaifeng@unissoft-nj.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:47 +01:00
Arnd Bergmann 777d738a5e ceph: fix message length computation
create_request_message() computes the maximum length of a message,
but uses the wrong type for the time stamp: sizeof(struct timespec)
may be 8 or 16 depending on the architecture, while sizeof(struct
ceph_timespec) is always 8, and that is what gets put into the
message.

Found while auditing the uses of timespec for y2038 problems.

Fixes: b8e69066d8 ("ceph: include time stamp in every MDS request")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:47 +01:00
Geliang Tang 1291fb950f ceph: fix a comment typo
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:47 +01:00
Benjamin Coddington 4f6563677a Move locks API users to locks_lock_inode_wait()
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait().  This
allows for some later cleanup.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22 14:57:36 -04:00
Linus Torvalds e013f74b60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few fixes for snapshot behavior with CephFS and support
  for the new keepalive protocol from Zheng, a libceph fix that affects
  both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
  and several small fixes and cleanups from Jianpeng and others"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: improve readahead for file holes
  ceph: get inode size for each append write
  libceph: check data_len in ->alloc_msg()
  libceph: use keepalive2 to verify the mon session is alive
  rbd: plug rbd_dev->header.object_prefix memory leak
  rbd: fix double free on rbd_dev->header_name
  libceph: set 'exists' flag for newly up osd
  ceph: cleanup use of ceph_msg_get
  ceph: no need to get parent inode in ceph_open
  ceph: remove the useless judgement
  ceph: remove redundant test of head->safe and silence static analysis warnings
  ceph: fix queuing inode to mdsdir's snaprealm
  libceph: rename con_work() to ceph_con_workfn()
  libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
  libceph: remove the unused macro AES_KEY_SIZE
  ceph: invalidate dirty pages after forced umount
  ceph: EIO all operations after forced umount
2015-09-11 12:33:03 -07:00
Kirill A. Shutemov 7cbea8dc01 mm: mark most vm_operations_struct const
With two exceptions (drm/qxl and drm/radeon) all vm_operations_struct
structs should be constant.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Yan, Zheng 438386853d ceph: improve readahead for file holes
When readahead encounters file holes, osd reply returns error -ENOENT,
finish_read() skips adding pages to the the page cache. So readahead
does not work for file holes. The fix is adding zero pages to the
page cache when -ENOENT is returned.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-09 09:52:29 +03:00
Yan, Zheng 55b0b31cbc ceph: get inode size for each append write
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-09 09:52:29 +03:00
Jianpeng Ma 5fdb1389e1 ceph: cleanup use of ceph_msg_get
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:29 +03:00
Jianpeng Ma e36d571d70 ceph: no need to get parent inode in ceph_open
parent inode is needed in creating new inode case.  For ceph_open,
the target inode already exists.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:29 +03:00
Jianpeng Ma a43137f7b0 ceph: remove the useless judgement
err != 0 is already handled. So skip this.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:29 +03:00
Brad Hubbard 1550d34e56 ceph: remove redundant test of head->safe and silence static analysis warnings
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:29 +03:00
Yan, Zheng 23078637e0 ceph: fix queuing inode to mdsdir's snaprealm
During MDS failovers, MClientSnap message may cause kclient to move
some inodes from root directory's snaprealm to mdsdir's snaprealm
and queue snapshots for these inodes. For a FS has never created any
snapshot, both root directory's snaprealm and mdsdir's snaprealm
share the same snapshot contexts (both are ceph_empty_snapc). This
confuses ceph_put_wrbuffer_cap_refs(), make it unable to distinguish
snapshot buffers from head buffers.

The fix is do not use ceph_empty_snapc as snaprealm's cached context.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:29 +03:00
Yan, Zheng a341d4df87 ceph: invalidate dirty pages after forced umount
After forced umount, ceph_writepages_start() skips flushing dirty
pages. To make sure inode's reference count get dropped to zero,
we need to invalidate dirty pages.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:28 +03:00
Yan, Zheng 48fec5d0a5 ceph: EIO all operations after forced umount
This patch makes try_get_cap_refs() and __do_request() check
if the file system was forced umount, and return -EIO if it was.
This patch also adds a helper function to drops dirty caps and
wakes up blocking operation.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:28 +03:00
Kees Cook a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Yan, Zheng fc927cd32f ceph: always re-send cap flushes when MDS recovers
commit e548e9b93d makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-07-31 11:38:53 +03:00
Yan, Zheng f6762cb2ca ceph: fix ceph_encode_locks_to_buffer()
posix locks should be in ctx->flc_posix list

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-07-31 11:38:47 +03:00
Linus Torvalds 1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00