The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.
This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
Fixes CVE-2014-4014.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
in non-lazy walk we need to be careful about dentry switching from
negative to positive - both ->d_flags and ->d_inode are updated,
and in some places we might see only one store. The cases where
dentry has been obtained by dcache lookup with ->i_mutex held on
parent are safe - ->d_lock and ->i_mutex provide all the barriers
we need. However, there are several places where we run into
trouble:
* do_last() fetches ->d_inode, then checks ->d_flags and
assumes that inode won't be NULL unless d_is_negative() is true.
Race with e.g. creat() - we might have fetched the old value of
->d_inode (still NULL) and new value of ->d_flags (already not
DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting
oops.
* a bunch of places checks ->d_inode for being non-NULL,
then checks ->d_flags for "is it a symlink". Race with symlink(2)
in case if our CPU sees ->d_inode update first - we see non-NULL
there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow
link here?", with subsequent unpleasantness.
Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
Reported-and-tested-by: Lin Ming <minggr@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
Pull file locking updates from Jeff Layton:
"Highlights:
- maintainership change for fs/locks.c. Willy's not interested in
maintaining it these days, and is OK with Bruce and I taking it.
- fix for open vs setlease race that Al ID'ed
- cleanup and consolidation of file locking code
- eliminate unneeded BUG() call
- merge of file-private lock implementation"
* 'locks-3.15' of git://git.samba.org/jlayton/linux:
locks: make locks_mandatory_area check for file-private locks
locks: fix locks_mandatory_locked to respect file-private locks
locks: require that flock->l_pid be set to 0 for file-private locks
locks: add new fcntl cmd values for handling file private locks
locks: skip deadlock detection on FL_FILE_PVT locks
locks: pass the cmd value to fcntl_getlk/getlk64
locks: report l_pid as -1 for FL_FILE_PVT locks
locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
locks: rename locks_remove_flock to locks_remove_file
locks: consolidate checks for compatible filp->f_mode values in setlk handlers
locks: fix posix lock range overflow handling
locks: eliminate BUG() call when there's an unexpected lock on file close
locks: add __acquires and __releases annotations to locks_start and locks_stop
locks: remove "inline" qualifier from fl_link manipulation functions
locks: clean up comment typo
locks: close potential race between setlease and open
MAINTAINERS: update entry for fs/locks.c
If flags contain RENAME_EXCHANGE then exchange source and destination files.
There's no restriction on the type of the files; e.g. a directory can be
exchanged with a symlink.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Add flags to security_path_rename() and security_inode_rename() hooks.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
If this flag is specified and the target of the rename exists then the
rename syscall fails with EEXIST.
The VFS does the existence checking, so it is trivial to enable for most
local filesystems. This patch only enables it in ext4.
For network filesystems the VFS check is not enough as there may be a race
between a remote create and the rename, so these filesystems need to handle
this flag in their ->rename() implementations to ensure atomicity.
Andy writes about why this is useful:
"The trivial answer: to eliminate the race condition from 'mv -i'.
Another answer: there's a common pattern to atomically create a file
with contents: open a temporary file, write to it, optionally fsync
it, close it, then link(2) it to the final name, then unlink the
temporary file.
The reason to use link(2) is because it won't silently clobber the destination.
This is annoying:
- It requires an extra system call that shouldn't be necessary.
- It doesn't work on (IMO sensible) filesystems that don't support
hard links (e.g. vfat).
- It's not atomic -- there's an intermediate state where both files exist.
- It's ugly.
The new rename flag will make this totally sensible.
To be fair, on new enough kernels, you can also use O_TMPFILE and
linkat to achieve the same thing even more cleanly."
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Add new renameat2 syscall, which is the same as renameat with an added
flags argument.
Pass flags to vfs_rename() and to i_op->rename() as well.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
There's actually very little difference between vfs_rename_dir() and
vfs_rename_other() so move both inline into vfs_rename() which still stays
reasonably readable.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Move the d_move() in vfs_rename_dir() up, similarly to how it's done in
vfs_rename_other(). The next patch will consolidate these two functions
and this is the only structural difference between them.
I'm not sure if doing the d_move() after the dput is even valid. But there
may be a logical explanation for that. But moving the d_move() before the
dput() (and the mutex_unlock()) should definitely not hurt.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Add d_is_dir(dentry) helper which is analogous to S_ISDIR().
To avoid confusion, rename d_is_directory() to d_can_lookup().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
As Trond pointed out, you can currently deadlock yourself by setting a
file-private lock on a file that requires mandatory locking and then
trying to do I/O on it.
Avoid this problem by plumbing some knowledge of file-private locks into
the mandatory locking code. In order to do this, we must pass down
information about the struct file that's being used to
locks_verify_locked.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
We can get false negative from __lookup_mnt() if an unrelated vfsmount
gets moved. In that case legitimize_mnt() is guaranteed to fail,
and we will fall back to non-RCU walk... unless we end up running
into a hard error on a filesystem object we wouldn't have reached
if not for that false negative. IOW, delaying that check until
the end of pathname resolution is wrong - we should recheck right
after we attempt to cross the mountpoint. We don't need to recheck
unless we see d_mountpoint() being true - in that case even if
we have just raced with mount/umount, we can simply go on as if
we'd come at the moment when the sucker wasn't a mountpoint; if we
run into a hard error as the result, it was a legitimate outcome.
__lookup_mnt() returning NULL is different in that respect, since
it might've happened due to operation on completely unrelated
mountpoint.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.
This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
"2.9.7 Thread Interactions with Regular File Operations
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links: [...]"
and one of the effects is the file position update.
This unprotected file position behavior is not new behavior, and nobody
has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.
This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.
Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This changes 'do_execve()' to get the executable name as a 'struct
filename', and to free it when it is done. This is what the normal
users want, and it simplifies and streamlines their error handling.
The controlled lifetime of the executable name also fixes a
use-after-free problem with the trace_sched_process_exec tracepoint: the
lifetime of the passed-in string for kernel users was not at all
obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
the pathname allocation lifetime with the execve() having finished,
which in turn meant that the trace point that happened after
mm_release() of the old process VM ended up using already free'd memory.
To solve the kernel string lifetime issue, this simply introduces
"getname_kernel()" that works like the normal user-space getname()
function, except with the source coming from kernel memory.
As Oleg points out, this also means that we could drop the tcomm[] array
from 'struct linux_binprm', since the pathname lifetime now covers
setup_new_exec(). That would be a separate cleanup.
Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recent changes to retry on ESTALE in linkat
(commit 442e31ca5a)
introduced a mountpoint reference leak and a small memory
leak in case a filesystem link operation returns ESTALE
which is pretty normal for distributed filesystems like
lustre, nfs and so on.
Free old_path in such a case.
[AV: there was another missing path_put() nearby - on the previous
goto retry]
Signed-off-by: Oleg Drokin: <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Factor out the code to get an ACL either from the inode or disk from
check_acl, so that it can be used elsewhere later on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.
On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).
This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Failure to grab reference to parent dentry should go through the
same cleanup as nd->seq mismatch. As it is, we might end up with
caller thinking it needs to path_put() nd->root, with obvious
nasty results once we'd hit that bug enough times to drive the
refcount of root dentry all the way to zero...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull audit updates from Eric Paris:
"Nothing amazing. Formatting, small bug fixes, couple of fixes where
we didn't get records due to some old VFS changes, and a change to how
we collect execve info..."
Fixed conflict in fs/exec.c as per Eric and linux-next.
* git://git.infradead.org/users/eparis/audit: (28 commits)
audit: fix type of sessionid in audit_set_loginuid()
audit: call audit_bprm() only once to add AUDIT_EXECVE information
audit: move audit_aux_data_execve contents into audit_context union
audit: remove unused envc member of audit_aux_data_execve
audit: Kill the unused struct audit_aux_data_capset
audit: do not reject all AUDIT_INODE filter types
audit: suppress stock memalloc failure warnings since already managed
audit: log the audit_names record type
audit: add child record before the create to handle case where create fails
audit: use given values in tty_audit enable api
audit: use nlmsg_len() to get message payload length
audit: use memset instead of trying to initialize field by field
audit: fix info leak in AUDIT_GET requests
audit: update AUDIT_INODE filter rule to comparator function
audit: audit feature to set loginuid immutable
audit: audit feature to only allow unsetting the loginuid
audit: allow unsetting the loginuid (with priv)
audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
audit: loginuid functions coding style
selinux: apply selinux checks on new audit message types
...
Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:
- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes
plus misc stuff all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...