This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: prevent RAID level downgrades when space is low
Btrfs: account for missing devices in RAID allocation profiles
Btrfs: EIO when we fail to read tree roots
Btrfs: fix compiler warnings
Btrfs: Make async snapshot ioctl more generic
Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
Btrfs: Fix a crash when mounting a subvolume
Btrfs: fix sync subvol/snapshot creation
Btrfs: Fix page leak in compressed writeback path
Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Btrfs: fixup return code for btrfs_del_orphan_item
Btrfs: do not do fast caching if we are allocating blocks for tree_root
Btrfs: deal with space cache errors better
Btrfs: fix use after free in O_DIRECT
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.
This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.
But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk. This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.
The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.
The fix here is to make sure that we provide duplication when the
caller has asked for it. It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.
This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.
The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.
The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain. This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs. The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.
Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().
Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:
BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...
Steps to reproduce the bug:
# mount /dev/loop1 /mnt
# mkdir save
# btrfs subvolume snapshot /mnt save/snap1
# umount /mnt
# mount -o subvol=save/snap1 /dev/loop1 /mnt
(crash)
Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.
There's ample room for further cleanup here, but this keeps the fix simple.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Not being able to delete an orphan item isn't a horrible thing. The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.
Signed-off-by: Josef Bacik <josef@redhat.com>
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers. Instead return -ENOENT if we didn't find the item. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root. So just check to
see if the root we are on is the tree root, and just don't do the fast caching.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing. Fix this by marking the space cache as having an error so we only
get the message once. This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway. None of these problems are actual problems, they are just
annoying and sub-optimal. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a bug where we use dip after we have freed it. Instead just use the
file_offset that was passed to the function. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>