Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so
e.g. on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.
Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: break out of shrink_delalloc earlier
btrfs: fix not enough reserved space
btrfs: fix dip leak
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: deal with short returns from copy_from_user
Btrfs: fix regressions in copy_from_user handling
Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.
But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.
The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.
This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.
If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.
Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.
This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_link() will insert 3 items(inode ref, dir name item and dir index item)
into the b+ tree and update 2 items(its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Without this patch, inodes are not promptly freed on last close of an
unlinked file by an nfs client:
client$ mount -tnfs4 server:/export/ /mnt/
client$ tail -f /mnt/FOO
...
server$ df -i /export
server$ rm /export/FOO
(^C the tail -f)
server$ df -i /export
server$ echo 2 >/proc/sys/vm/drop_caches
server$ df -i /export
the df's will show that the inode is not freed on the filesystem until
the last step, when it could have been freed after killing the client's
tail -f. On-disk data won't be deallocated either, leading to possible
spurious ENOSPC.
This occurs because when the client does the close, it arrives in a
compound with a putfh and a close, processed like:
- putfh: look up the filehandle. The only alias found for the
inode will be DCACHE_UNHASHED alias referenced by the filp
this, so it creates a new DCACHE_DISCONECTED dentry and
returns that instead.
- close: closes the existing filp, which is destroyed
immediately by dput() since it's DCACHE_UNHASHED.
- end of the compound: release the reference
to the current filehandle, and dput() the new
DCACHE_DISCONECTED dentry, which gets put on the
unused list instead of being destroyed immediately.
Nick Piggin suggested fixing this by allowing d_obtain_alias to return
the unhashed dentry that is referenced by the filp, instead of making it
create a new dentry.
Leave __d_find_alias() alone to avoid changing behavior of other
callers.
Also nfsd doesn't need all the checks of __d_find_alias(); any dentry,
hashed or unhashed, disconnected or not, should work.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the fallocate path the kernel doesn't check for the immutable/append
flag. It's possible to have a race condition in this scenario: an
application open a file in read/write and it does something, meanwhile
root set the immutable flag on the file, the application at that point
can call fallocate with success. In addition, we don't allow to do any
unreserve operation on an append only file but only the reserve one.
Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
nfsd: wrong index used in inner loop
nfsd4: fix bad pointer on failure to find delegation
NFSD: fix decode_cb_sequence4resok
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
nd->inode is not set on the second attempt in path_walk()
unfuck proc_sysctl ->d_compare()
minimal fix for do_filp_open() race
We leave it at whatever it had been pointing to after the
first link_path_walk() had failed with -ESTALE. Things
do not work well after that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.
The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
a) struct inode is not going to be freed under ->d_compare();
however, the thing PROC_I(inode)->sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its ->unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in ->d_compare().
b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry->seq between verifying that it's still hashed / fetching
dentry->d_inode and passing it to ->d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have ->d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case of a nonempty list, the return on error here is obviously bogus;
it ends up being a pointer to the list head instead of to any valid
delegation on the list.
In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
then renew_client may oops, and it may take an embarassingly long time to
figure out why. Facepalm.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
...
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page. To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.
This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cc: stable@kernel.org
Commit 914ee295af fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.
But, there were a few problems:
1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned. The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic
We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.
2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero. The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.
3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified. If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.
With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.
We deal with this by pushing the up to date checks down into the page
prep code. This fits better with how the rest of file_write works.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
cc: stable@kernel.org
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: no .snap inside of snapped namespace
libceph: fix msgr standby handling
libceph: fix msgr keepalive flag
libceph: fix msgr backoff
libceph: retry after authorization failure
libceph: fix handling of short returns from get_user_pages
ceph: do not clear I_COMPLETE from d_release
ceph: do not set I_COMPLETE
Revert "ceph: keep reference to parent inode on ceph_dentry"