Expand the fuse protocol to support per inode DAX.
FUSE_HAS_INODE_DAX flag is added indicating if fuse server/client
supporting per inode DAX. It can be conveyed in both FUSE_INIT request and
reply.
FUSE_ATTR_DAX flag is added indicating if DAX shall be enabled for
corresponding file. It is conveyed in FUSE_LOOKUP reply.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When a new inode is created, send its security context to server along with
creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK).
This gives server an opportunity to create new file and set security
context (possibly atomically). In all the configurations it might not be
possible to set context atomically.
Like nfs and ceph, use security_dentry_init_security() to dermine security
context of inode and send it with create, mkdir, mknod, and symlink
requests.
Following is the information sent to server.
fuse_sectx_header, fuse_secctx, xattr_name, security_context
- struct fuse_secctx_header
This contains total number of security contexts being sent and total
size of all the security contexts (including size of
fuse_secctx_header).
- struct fuse_secctx
This contains size of security context which follows this structure.
There is one fuse_secctx instance per security context.
- xattr name string
This string represents name of xattr which should be used while setting
security context.
- security context
This is the actual security context whose size is specified in
fuse_secctx struct.
Also add the FUSE_SECURITY_CTX flag for the `flags` field of the
fuse_init_out struct. When this flag is set the kernel will append the
security context for a newly created inode to the request (create, mkdir,
mknod, and symlink). The server is responsible for ensuring that the inode
appears atomically (preferrably) with the requested security context.
For example, If the server is using SELinux and backed by a "real" linux
file system that supports extended attributes it can write the security
context value to /proc/thread-self/attr/fscreate before making the syscall
to create the inode.
This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FUSE_INIT flags are close to running out, so add another 32bits worth of
space.
Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this
flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi
field is allocated to contain the high 32bits of the flags.
A flags_hi field is also added to fuse_init_out, allocated out of the
remaining unused fields.
Known userspace implementations of the fuse protocol have been checked to
accept the extended FUSE_INIT request, but this might cause problems with
other implementations. If that happens to be the case, the protocol
negotiation will have to be extended with an extra initialization request
roundtrip.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add flag returned by FUSE_OPEN and FUSE_CREATE requests to avoid flushing
data cache on close.
Different filesystems implement ->flush() is different ways:
- Most disk filesystems do not implement ->flush() at all
- Some network filesystem (e.g. nfs) flush local write cache of
FMODE_WRITE file and send a "flush" command to server
- Some network filesystem (e.g. cifs) flush local write cache of
FMODE_WRITE file without sending an additional command to server
FUSE flushes local write cache of ANY file, even non FMODE_WRITE
and sends a "flush" command to server (if server implements it).
The FUSE implementation of ->flush() seems over agressive and
arbitrary and does not make a lot of sense when writeback caching is
disabled.
Instead of deciding on another arbitrary implementation that makes
sense, leave the choice of per-file flush behavior in the hands of
the server.
Link: https://lore.kernel.org/linux-fsdevel/CAJfpegspE8e6aKd47uZtSYX8Y-1e1FWS0VL0DH2Skb9gQP5RJQ@mail.gmail.com/
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Even if POSIX doesn't mandate it, linux users legitimately expect sync() to
flush all data and metadata to physical storage when it is located on the
same system. This isn't happening with virtiofs though: sync() inside the
guest returns right away even though data still needs to be flushed from
the host page cache.
This is easily demonstrated by doing the following in the guest:
$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync() = 0 <0.024068>
and start the following in the host when the 'dd' command completes
in the guest:
$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3) = 0 <10.371640>
There are no good reasons not to honor the expected behavior of sync()
actually: it gives an unrealistic impression that virtiofs is super fast
and that data has safely landed on HW, which isn't the case obviously.
Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
request type for this purpose. Provision a 64-bit placeholder for possible
future extensions. Since the file server cannot handle the wait == 0 case,
we skip it to avoid a gratuitous roundtrip. Note that this is
per-superblock: a FUSE_SYNCFS is send for the root mount and for each
submount.
Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
the file server is treated as permanent success. This ensures
compatibility with older file servers: the client will get the current
behavior of sync() not being propagated to the file server.
Note that such an operation allows the file server to DoS sync(). Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default. Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.
Reported-by: Robert Krawitz <rlk@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When posix access ACL is set, it can have an effect on file mode and it can
also need to clear SGID if.
- None of caller's group/supplementary groups match file owner group.
AND
- Caller is not priviliged (No CAP_FSETID).
As of now fuser server is responsible for changing the file mode as
well. But it does not know whether to clear SGID or not.
So add a flag FUSE_SETXATTR_ACL_KILL_SGID and send this info with SETXATTR
to let file server know that sgid needs to be cleared as well.
Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse client needs to send additional information to file server when it
calls SETXATTR(system.posix_acl_access), so add extra flags field to the
structure.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With a 64-bit kernel build the FUSE device cannot handle ioctl requests
coming from 32-bit user space. This is due to the ioctl command
translation that generates different command identifiers that thus cannot
be used for direct comparisons without proper manipulation.
Explicitly extract type and number from the ioctl command to enable 32-bit
user space compatibility on 64-bit kernel builds.
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With FUSE_HANDLE_KILLPRIV_V2 support, server will need to kill suid/sgid/
security.capability on open(O_TRUNC), if server supports
FUSE_ATOMIC_O_TRUNC.
But server needs to kill suid/sgid only if caller does not have CAP_FSETID.
Given server does not have this information, client needs to send this info
to server.
So add a flag FUSE_OPEN_KILL_SUIDGID to fuse_open_in request which tells
server to kill suid/sgid (only if group execute is set).
This flag is added to the FUSE_OPEN request, as well as the FUSE_CREATE
request if the create was non-exclusive, since that might result in an
existing file being opened/truncated.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If fc->handle_killpriv_v2 is enabled, we expect file server to clear
suid/sgid/security.capbility upon chown/truncate/write as appropriate.
Upon truncate (ATTR_SIZE), suid/sgid are cleared only if caller does not
have CAP_FSETID. File server does not know whether caller has CAP_FSETID
or not. Hence set FATTR_KILL_SUIDGID upon truncate to let file server know
that caller does not have CAP_FSETID and it should kill suid/sgid as
appropriate.
On chown (ATTR_UID/ATTR_GID) suid/sgid need to be cleared irrespective of
capabilities of calling process, so set FATTR_KILL_SUIDGID unconditionally
in that case.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Kernel has:
ATTR_KILL_PRIV -> clear "security.capability"
ATTR_KILL_SUID -> clear S_ISUID
ATTR_KILL_SGID -> clear S_ISGID if executable
Fuse has:
FUSE_WRITE_KILL_PRIV -> clear S_ISUID and S_ISGID if executable
So FUSE_WRITE_KILL_PRIV implies the complement of ATTR_KILL_PRIV, which is
somewhat confusing. Also PRIV implies all privileges, including
"security.capability".
Change the name to FUSE_WRITE_KILL_SUIDGID and make FUSE_WRITE_KILL_PRIV an
alias to perserve API compatibility
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We already have FUSE_HANDLE_KILLPRIV flag that says that file server will
remove suid/sgid/caps on truncate/chown/write. But that's little different
from what Linux VFS implements.
To be consistent with Linux VFS behavior what we want is.
- caps are always cleared on chown/write/truncate
- suid is always cleared on chown, while for truncate/write it is cleared
only if caller does not have CAP_FSETID.
- sgid is always cleared on chown, while for truncate/write it is cleared
only if caller does not have CAP_FSETID as well as file has group execute
permission.
As previous flag did not provide above semantics. Implement a V2 of the
protocol with above said constraints.
Server does not know if caller has CAP_FSETID or not. So for the case
of write()/truncate(), client will send information in special flag to
indicate whether to kill priviliges or not. These changes are in subsequent
patches.
FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear
suid/sgid/security.capability. But with ->writeback_cache, WRITES are
cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2
and writeback_cache together. Though it probably might be good enough
for lot of use cases.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- Add fuse_attr.flags
- Add FUSE_ATTR_SUBMOUNT
This is a flag for fuse_attr.flags that indicates that the given entry
resides on a different filesystem than the parent, and as such should
have a different st_dev.
- Add FUSE_SUBMOUNTS
The client sets this flag if it supports automounting directories.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implemeting
read/write.
We make use of interval tree to keep track of per inode dax mappings.
Do not use dax for file extending writes, instead just send WRITE message
to daemon (like we do for direct I/O path). This will keep write and
i_size change atomic w.r.t crash.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Introduce two new fuse commands to setup/remove memory mappings. This
will be used to setup/tear down file mapping in dax window.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
constraints via the FUST_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
SETUPMAPPING is a command for use with 'virtiofsd', a fuse-over-virtio
implementation; it may find use in other fuse impelementations as well in
which the kernel does not have access to the address space of the daemon
directly.
A SETUPMAPPING operation causes a section of a file to be mapped into a
memory window visible to the kernel. The offsets in the file and the
window are defined by the kernel performing the operation.
The daemon may reject the request, for reasons including permissions and
limited resources.
When a request perfectly overlaps a previous mapping, the previous mapping
is replaced. When a mapping partially overlaps a previous mapping, the
previous mapping is split into one or two smaller mappings.
REMOVEMAPPING is the complement to SETUPMAPPING; it unmaps a range of
mapped files from the window visible to the kernel.
The map_alignment field communicates the alignment constraint for
FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING and allows the daemon to constrain the
addresses and file offsets chosen by the kernel.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
virtio fs tunnels fuse over a virtio channel. One issue is two sides might
be speaking different endian-ness. To detects this, host side looks at the
opcode value in the FUSE_INIT command. Works fine at the moment but might
fail if a future version of fuse will use such an opcode for
initialization. Let's reserve this opcode so we remember and don't do
this.
Same for CUSE_INIT.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In the FOPEN_DIRECT_IO case the write path doesn't call file_remove_privs()
and that means setuid bit is not cleared if unpriviliged user writes to a
file with setuid bit set.
pjdfstest chmod test 12.t tests this and fails.
Fix this by adding a flag to the FUSE_WRITE message that requests clearing
privileges on the given file. This needs
This better than just calling fuse_remove_privs(), because the attributes
may not be up to date, so in that case a write may miss clearing the
privileges.
Test case:
$ passthrough_ll /mnt/pasthrough-mnt -o default_permissions,allow_other,cache=never
$ mkdir /mnt/pasthrough-mnt/testdir
$ cd /mnt/pasthrough-mnt/testdir
$ prove -rv pjdfstests/tests/chmod/12.t
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Currently, a CUSE server running on a 64-bit kernel can tell when an ioctl
request comes from a process running a 32-bit ABI, but cannot tell whether
the requesting process is using legacy IA32 emulation or x32 ABI. In
particular, the server does not know the size of the client process's
`time_t` type.
For 64-bit kernels, the `FUSE_IOCTL_COMPAT` and `FUSE_IOCTL_32BIT` flags
are currently set in the ioctl input request (`struct fuse_ioctl_in` member
`flags`) for a 32-bit requesting process. This patch defines a new flag
`FUSE_IOCTL_COMPAT_X32` and sets it if the 32-bit requesting process is
using the x32 ABI. This allows the server process to distinguish between
requests coming from client processes using IA32 emulation or the x32 ABI
and so infer the size of the client process's `time_t` type and any other
IA32/x32 differences.
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Retroactively add changelog entry for the atime and mtime "now" flags.
This was an oversight in commit 17637cbaba ("fuse: improve utimes
support").
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This was a mistake in the comment in commit e0a43ddcc0 ("fuse: allow
umask processing in userspace").
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The FUSE_FSYNC_DATASYNC flag was introduced by commit b6aeadeda2
("[PATCH] FUSE - file operations") as a magic number. No new values have
been added to fsync_flags since.
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Starting from commit 9c225f2655 ("vfs: atomic f_pos accesses as per
POSIX") files opened even via nonseekable_open gate read and write via lock
and do not allow them to be run simultaneously. This can create read vs
write deadlock if a filesystem is trying to implement a socket-like file
which is intended to be simultaneously used for both read and write from
filesystem client. See commit 10dce8af34 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") for details and e.g. commit 581d21a2d0 ("xenbus: fix deadlock
on writes to /proc/xen/xenbus") for a similar deadlock example on
/proc/xen/xenbus.
To avoid such deadlock it was tempting to adjust fuse_finish_open to use
stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and write
handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3Dhttps://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so if we would do such a change it will break a real user.
Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
opened handler is having stream-like semantics; does not use file position
and thus the kernel is free to issue simultaneous read and write request on
opened file handle.
This patch together with stream_open() should be added to stable kernels
starting from v3.14+. This will allow to patch OSSPD and other FUSE
filesystems that provide stream-like files to return FOPEN_STREAM |
FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
kernel versions. This should work because fuse_finish_open ignores unknown
open flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel that
is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
is sufficient to implement streams without read vs write deadlock.
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>