Commit Graph

516 Commits

Author SHA1 Message Date
Miklos Szeredi
698fa1d163 fuse: no RCU mode in fuse_access()
fuse_access() is never called in RCU walk, only on the final component of
access(2) and chdir(2)...

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-10-01 16:41:23 +02:00
Miklos Szeredi
6314efee3c fuse: readdirplus: fix RCU walk
Doing dput(parent) is not valid in RCU walk mode.  In RCU mode it would
probably be okay to update the parent flags, but it's actually not
necessary most of the time...

So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
recently initialized by READDIRPLUS.

This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
READDIRPLUS and only dropping out of RCU mode if this flag is set.
FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
the parent.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-10-01 16:41:22 +02:00
Miklos Szeredi
3c70b8eeda fuse: don't check_submounts_and_drop() in RCU walk
If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
with it instead of calling check_submounts_and_drop() which is not prepared
for being called from RCU walk.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-10-01 16:41:22 +02:00
Maxim Patlasov
0ab08f576b fuse: fix fallocate vs. ftruncate race
A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed
description of races between ftruncate and anyone who can extend i_size:

> 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
> changed (due to our own fuse_do_setattr()) and is going to call
> truncate_pagecache() for some  'new_size' it believes valid right now. But by
> the time that particular truncate_pagecache() is called ...
> 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
> not -- it doesn't matter).
> 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
> 4. mmap-ed write makes a page in the extended region dirty.

This patch adds necessary bits to fuse_file_fallocate() to protect from that
race.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-18 14:19:59 +02:00
Maxim Patlasov
bde52788bd fuse: wait for writeback in fuse_file_fallocate()
The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):

1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
   and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
   the page goes to write-back. The page is fully processed by fuse_writepage
   (including end_page_writeback on the page), but fuse_flush_writepages did
   nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
   by calling fuse_release_nowrite. The latter triggers processing queued
   write-back request which will write stale data to the hole soon.

Changed in v2 (thanks to Brian for suggestion):
 - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
   we can end up in returning -ENOTSUPP while user data is already punched
   from page cache. Use filemap_write_and_wait_range() instead.
Changed in v3 (thanks to Miklos for suggestion):
 - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
   instead. So far as we need a dirty-page barrier only, fuse_sync_writes()
   should be enough.
 - rebased to for-linus branch of fuse.git

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-18 14:19:59 +02:00
Kirill A. Shutemov
7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Maxim Patlasov
5a53748568 mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always check bdi
counters against bdi limits.  I.e.  even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks.  The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region.  An
hour later I observed almost all RAM consumed by fuse writeback.  Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users.  This is write-through
page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation.  Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.

And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature.  Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain.  Let's make simple
computations.  Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.

After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule.  May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:04 -07:00
Linus Torvalds
16d70e1529 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse bugfixes from Miklos Szeredi:
 "Just a bunch of bugfixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: use list_for_each_entry() for list traversing
  fuse: readdir: check for slash in names
  fuse: hotfix truncate_pagecache() issue
  fuse: invalidate inode attributes on xattr modification
  fuse: postpone end_page_writeback() in fuse_writepage_locked()
2013-09-09 09:18:23 -07:00
Anand Avati
46ea1562da fuse: drop dentry on failed revalidate
Drop a subtree when we find that it has moved or been delated.  This can be
done as long as there are no submounts under this location.

If the directory was moved and we come across the same directory in a
future lookup it will be reconnected by d_materialise_unique().

Signed-off-by: Anand Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-05 16:23:54 -04:00
Miklos Szeredi
e2a6b95236 fuse: clean up return in fuse_dentry_revalidate()
On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the
error itself from ->d_revalidate() insted of returning zero (invalid).

Also make a common label for invalidating the dentry.  This will be used by
the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-05 16:23:54 -04:00
Miklos Szeredi
5835f3390e fuse: use d_materialise_unique()
Use d_materialise_unique() instead of d_splice_alias().  This allows dentry
subtrees to be moved to a new place if there moved, even if something is
referencing a dentry in the subtree (open fd, cwd, etc..).

This will also allow us to drop a subtree if it is found to be replaced by
something else.  In this case the disconnected subtree can later be
reconnected to its new location.

d_materialise_unique() ensures that a directory entry only ever has one
alias.  We keep fc->inst_mutex around the calls for d_materialise_unique()
on directories to prevent a race with mkdir "stealing" the inode.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-05 16:23:53 -04:00
Dong Fang
05726acabe fuse: use list_for_each_entry() for list traversing
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-09-04 17:42:42 +02:00
Miklos Szeredi
efeb9e60d4 fuse: readdir: check for slash in names
Userspace can add names containing a slash character to the directory
listing.  Don't allow this as it could cause all sorts of trouble.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-03 14:28:38 +02:00
Maxim Patlasov
06a7c3c278 fuse: hotfix truncate_pagecache() issue
The way how fuse calls truncate_pagecache() from fuse_change_attributes()
is completely wrong. Because, w/o i_mutex held, we never sure whether
'oldsize' and 'attr->size' are valid by the time of execution of
truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
released fc->lock in the middle of fuse_change_attributes(), we completely
loose control of actions which may happen with given inode until we reach
truncate_pagecache. The list of potentially dangerous actions includes
mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.

The typical outcome of doing truncate_pagecache() with outdated arguments
is data corruption from user point of view. This is (in some sense)
acceptable in cases when the issue is triggered by a change of the file on
the server (i.e. externally wrt fuse operation), but it is absolutely
intolerable in scenarios when a single fuse client modifies a file without
any external intervention. A real life case I discovered by fsx-linux
looked like this:

1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
to the server synchronously, then calls fuse_change_attributes(). The
latter updates i_size, releases fc->lock, but before comparing oldsize vs
attr->size..
3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
updating attributes and i_size, but now oldsize is equal to
outarg.attr.size because i_size has just been updated (step 2). Hence,
fuse_do_setattr() returns w/o calling truncate_pagecache().
4. As soon as ftruncate(2) completes, the user extends file size by
write(2) making a hole in the middle of file, then reads data from the hole
either by read(2) or mmap-ed read. The user expects to get zero data from
the hole, but gets stale data because truncate_pagecache() is not executed
yet.

The scenario above illustrates one side of the problem: not truncating the
page cache even though we should. Another side corresponds to truncating
page cache too late, when the state of inode changed significantly.
Theoretically, the following is possible:

1. As in the previous scenario fuse_dentry_revalidate() discovered that
i_size changed (due to our own fuse_do_setattr()) and is going to call
truncate_pagecache() for some 'new_size' it believes valid right now. But
by the time that particular truncate_pagecache() is called ...
2. fuse_do_setattr() returns (either having called truncate_pagecache() or
not -- it doesn't matter).
3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
4. mmap-ed write makes a page in the extended region dirty.

The result will be the lost of data user wrote on the fourth step.

The patch is a hotfix resolving the issue in a simplistic way: let's skip
dangerous i_size update and truncate_pagecache if an operation changing
file size is in progress. This simplistic approach looks correct for the
cases w/o external changes. And to handle them properly, more sophisticated
and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
postpone it until the issue is well discussed on the mailing list(s).

Changed in v2:
 - improved patch description to cover both sides of the issue.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-03 13:41:58 +02:00
Anand Avati
d331a415ae fuse: invalidate inode attributes on xattr modification
Calls like setxattr and removexattr result in updation of ctime.
Therefore invalidate inode attributes to force a refresh.

Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-03 13:41:58 +02:00
Maxim Patlasov
4a4ac4eba1 fuse: postpone end_page_writeback() in fuse_writepage_locked()
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
   writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
   the original page by end_page_writeback.
4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
   is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
   fuse_send_write_pages do wait for fuse writeback, but for another
   page->index.
6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
   fuse_send_writepage is supposed to crop inarg->size of the request,
   but it doesn't because i_size has already been extended back.

Moving end_page_writeback to the end of fuse_writepage_locked fixes the
race because now the fact that truncate_pagecache is successfully returned
infers that fuse_writepage_locked has already called end_page_writeback.
And this, in turn, infers that fuse_flush_writepages has already called
fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
could not extend it because of i_mutex held by ftruncate(2).

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2013-09-03 13:41:57 +02:00
Greg Kroah-Hartman
b78b6b3a9a Merge 3.11-rc3 into driver-core-next
We want these fixes in this branch.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-29 12:30:13 -07:00
Greg Kroah-Hartman
4183fb9503 cuse: convert class code to use dev_groups
The dev_attrs field of struct class is going away soon, dev_groups
should be used instead.  This converts the cuse class code to use the
correct field.

Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-26 18:05:18 -07:00
Miklos Szeredi
c7263bcdc4 fuse: readdirplus: cleanup
Niels noted that we don't need the 'dentry = NULL' line.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Niels de Vos <ndevos@redhat.com>
2013-07-17 14:53:54 +02:00
Miklos Szeredi
fa2b721360 fuse: readdirplus: change attributes once
If we got the inode through fuse_iget() then the attributes are already
up-to-date.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-07-17 14:53:53 +02:00
Miklos Szeredi
2914941e31 fuse: readdirplus: fix instantiate
Fuse does instantiation slightly differently from NFS/CIFS which use
d_materialise_unique().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
2013-07-17 14:53:53 +02:00
Miklos Szeredi
a28ef45cbb fuse: readdirplus: sanity checks
Add sanity checks before adding or updating an entry with data received
from readdirplus.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
2013-07-17 14:53:53 +02:00
Niels de Vos
53ce9a3364 fuse: readdirplus: fix dentry leak
In case d_lookup() returns a dentry with d_inode == NULL, the dentry is not
returned with dput(). This results in triggering a BUG() in
shrink_dcache_for_umount_subtree():

  BUG: Dentry ...{i=0,n=...} still in use (1) [unmount of fuse fuse]

[SzM: need to d_drop() as well]

Reported-by: Justin Clift <jclift@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Brian Foster <bfoster@redhat.com>
Tested-by: Niels de Vos <ndevos@redhat.com>
CC: stable@vger.kernel.org
2013-07-17 14:53:53 +02:00
Jiang Liu
0ed5fd1385 mm: use totalram_pages instead of num_physpages at runtime
The global variable num_physpages is scheduled to be removed, so use
totalram_pages instead of num_physpages at runtime.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:35 -07:00
Linus Torvalds
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00