Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether page is already under writeback, whether page is contiguous with
previous pages already added to a request).
This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.
If the max_pages tunable remains the default value, then this will mean
just one allocation that is the same size as before. If the tunable is
larger, then this adds at most 3 additional memory allocations (which is
generously compensated by the improved performance from the larger
request).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.
Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
- https://lkml.org/lkml/2012/7/5/136
We've encountered performance degradation and fixed it on a big and complex
virtual environment.
Environment to reproduce degradation and improvement:
1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c
2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.
3. run test script and perform test on tmpfs
fuse_test()
{
cd /tmp
mkdir -p fusemnt
passthrough_fh -o max_pages=$1 /tmp/fusemnt
grep fuse /proc/self/mounts
dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
count=1K bs=1M 2>&1 | grep -v records
rm fusemnt/tmp/tmp
killall passthrough_fh
}
Test results:
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s
passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s
Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.
Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When allocating page array for a request the array for the page pointers
and the array for page descriptors are allocated by two separate kmalloc()
calls. Merge these into one allocation.
Also instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.
Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL and make sure the
assumption remains true by adding a WARN_ON().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We noticed the performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of times spent in
request_find() in profiler. This function is iterating over long requests
list, and it scales bad.
The patch introduces hash table to reduce the number of iterations, we do
in this function. Hash generating algorithm is taken from hash_add()
function, while 256 lines table is used to store pending requests. This
fixes problem and improves the performance.
Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This field is not needed after the previous patch, since we can easily
convert request ID to interrupt request ID and vice versa.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using of two unconnected IDs req->in.h.unique and req->intr_unique does not
allow to link requests to a hash table. We need can't use none of them as a
key to calculate hash.
This patch changes the algorithm of allocation of IDs for a request. Plain
requests obtain even ID, while interrupt requests are encoded in the low
bit. So, in next patches we will be able to use the rest of ID bits to
calculate hash, and the hash will be the same for plain and interrupt
requests.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently, we take fc->lock there only to check for fc->connected.
But this flag is changed only on connection abort, which is very
rare operation.
So allow checking fc->connected under just fc->bg_lock and use this lock
(as well as fc->lock) when resetting fc->connected.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To reduce contention of fc->lock, this patch introduces bg_lock for
protection of fields related to background queue. These are:
max_background, congestion_threshold, num_background, active_background,
bg_queue and blocked.
This allows next patch to make async reads not requiring fc->lock, so async
reads and writes will have better performance executed in parallel.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This cleanup patch makes the function to use the primitive
instead of direct dereferencing.
Also, move fiq dereferencing out of cycle, since it's
always constant.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Using waitqueue_active() is racy. Make sure we issue a wake_up()
unconditionally after storing into fc->blocked. After that it's okay to
optimize with waitqueue_active() since the first wake up provides the
necessary barrier for all waiters, not the just the woken one.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3c18ef8117 ("fuse: optimize wake_up")
Cc: <stable@vger.kernel.org> # v3.10
The 'bufs' array contains 'pipe->buffers' elements, but the
fuse_dev_splice_write() uses only 'pipe->nrbufs' elements.
So reduce the allocation size to 'pipe->nrbufs' elements.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The amount of pipe->buffers is basically controlled by userspace by
fcntl(... F_SETPIPE_SZ ...) so it could be large. High order allocations
could be slow (if memory is heavily fragmented) or may fail if the order
is larger than PAGE_ALLOC_COSTLY_ORDER.
Since the 'bufs' doesn't need to be physically contiguous, use
the kvmalloc_array() to allocate memory. If high order
page isn't available, the kvamalloc*() will fallback to 0-order.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_dev_splice_write() reads pipe->buffers to determine the size of
'bufs' array before taking the pipe_lock(). This is not safe as
another thread might change the 'pipe->buffers' between the allocation
and taking the pipe_lock(). So we end up with too small 'bufs' array.
Move the bufs allocations inside pipe_lock()/pipe_unlock() to fix this.
Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_abort_conn() does not guarantee that all async requests have actually
finished aborting (i.e. their ->end() function is called). This could
actually result in still used inodes after umount.
Add a helper to wait until all requests are fully done. This is done by
looking at the "num_waiting" counter. When this counter drops to zero, we
can be sure that no more requests are outstanding.
Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_dev_release() assumes that it's the only one referencing the
fpq->processing list, but that's not true, since fuse_abort_conn() can be
doing the same without any serialization between the two.
Fixes: c3696046be ("fuse: separate pqueue for clones")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Refcounting of request is broken when fuse_abort_conn() is called and
request is on the fpq->io list:
- ref is taken too late
- then it is not dropped
Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a connection gets aborted while congested, FUSE can leave
nr_wb_congested[] stuck until reboot causing wait_iff_congested() to
wait spuriously which can lead to severe performance degradation.
The leak is caused by gating congestion state clearing with
fc->connected test in request_end(). This was added way back in 2009
by 26c3679101 ("fuse: destroy bdi on umount"). While the commit
description doesn't explain why the test was added, it most likely was
to avoid dereferencing bdi after it got destroyed.
Since then, bdi lifetime rules have changed many times and now we're
always guaranteed to have access to the bdi while the superblock is
alive (fc->sb).
Drop fc->connected conditional to avoid leaking congestion states.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joshua Miller <joshmiller@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v2.6.29+
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In order to support mounts from namespaces other than init_user_ns, fuse
must translate uids and gids to/from the userns of the process servicing
requests on /dev/fuse. This patch does that, with a couple of restrictions
on the namespace:
- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.
- The namespace must be the same as s_user_ns.
These restrictions simplify the implementation by avoiding the need to pass
around userns references and by allowing fuse to rely on the checks in
setattr_prepare for ownership changes. Either restriction could be relaxed
in the future if needed.
For cuse the userns used is the opener of /dev/cuse. Semantically the cuse
support does not appear safe for unprivileged users. Practically the
permissions on /dev/cuse only make it accessible to the global root user.
If something slips through the cracks in a user namespace the only users
who will be able to use the cuse device are those users mapped into the
user namespace.
Translation in the posix acl is updated to use the uuser namespace of the
filesystem. Avoiding cases which might bypass this translation is handled
in a following change.
This change is stronlgy based on a similar change from Seth Forshee and
Dongsu Park.
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Upon a cursory examinination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straight forward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.
In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.
Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
[SzM] Don't zero the context for the nofail case, just keep using the
munging version (makes sense for debugging and doesn't hurt).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process have been
killed or may have fired an asynchronous request and exited.
If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns)" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that
is not desirable.
The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages so
removing this code should not matter.
Getting the translation to a server running outside of the pid namespace of
a container can still be achieved by playing setns games at mount time. It
is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.
Fixes: 5d6d3a301c ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>