Before this patch we called find_all_leafs for a data extent, then called
find_all_roots and then looked into the extent to grab the information
we were seeking. This was done without holding the leaves locked to avoid
deadlocks. However, this can obviouly race with concurrent tree
modifications.
Instead, we now look into the extent while we're holding the lock during
find_all_leafs and store this information together with the leaf list.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The key we store with a tree block backref is only a hint. It is set when
the ref is created and can remain correct for a long time. As the tree is
rebalanced, however, eventually the key no longer points to the correct
destination.
With this patch, we change find_parent_nodes to no longer add keys unless it
knows for sure they're correct (e.g. because they're for an extent data
backref). Then when we later encounter a backref ref with no parent and no
key set, we grab the block and take the first key from the block itself.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
That one has been around since the addition of backref.c. Due to the way we
calculate our slot numbers, after adding inline refs we're missing one keyed
ref unless it's located at the beginning of a new leaf.
Reported-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
ulist_next gets the pointer to the previously returned element to find the
next element from there. However, when we call ulist_add while iteration
with ulist_next is in progress (ulist explicitly supports this), we can
realloc the ulist internal memory, which makes the pointer to the previous
element useless.
Instead, we now use an iterator parameter that's independent from the
internal pointers.
Reported-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.
But, a few callers are using it with spinlocks held. Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case. This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix that when scrub tries to repair an I/O or checksum error and one of
the devices containing the mirror is missing, it crashes in bio_add_page
because the bdev is a NULL pointer for missing devices.
Reported-by: Marco L. Crociani <marco.crociani@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix the size members of btrfs_ioctl_ino_path_args and
btrfs_ioctl_logical_ino_args. The user space btrfs-progs utilities used
__u64 and the kernel headers used __u32 before.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb. Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
add_root_to_dirty_list happens once at the very beginning of the
transaction, but it is still racey.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We're spending huge amounts of time on lock contention during
end_io processing because we unconditionally assume we are overwriting
an existing extent in the file for each IO.
This checks to see if we are outside i_size, and if so, it uses a
less expensive readonly search of the btree to look for existing
extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has an optimization where it will preallocate dentries during
readdir to fill in enough information to open the inode without an extra
lookup.
But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
that leads to deadlocks because our readdir code has tree locks held.
For now, disable this optimization. We'll fix the gfp mask in the next
merge window.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix out-of-space checking, addressing a warning and potential resource
leak when resizing the filesystem down while allocating blocks.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
may_commit_transaction() calls
spin_lock(&space_info->lock);
spin_lock(&delayed_rsv->lock);
and update_global_block_rsv() calls
spin_lock(&block_rsv->lock);
spin_lock(&sinfo->lock);
Lockdep complains about this at run time.
Everywhere except in update_global_block_rsv(), the space_info lock is
the outer lock, therefore the locking order in update_global_block_rsv()
is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add
correct locking to address this.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_map_block sets mirror_num, so that the repair code knows eventually
which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
Before this fix mirror_num was incorrectly related to our stripe index.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and
start writing them out, but it doesn't splice the list or anything so as
long as somebody is doing work on the box you could end up in this section
_forever_. So just remove it, it's not needed anyway since sync will start
writeback on all inodes anyway, all we need to do is wait for ordered
extents and then we can commit the transaction. In my horrible torture test
sync goes from taking 4 minutes to about 1.5 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bitfield member mount_opt was too small by one bit to hold the mount
option that enabled to include data extents in the integrity checker.
Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR
option was added (git rebase silently merges so that the increase of the
size of the bitfield member is lost), the bit limit was removed entirely.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When a filesystem is mounted with the degraded option, it is
possible that some of the devices are not there.
btrfs_ioctl_dev_info() crashs in this case because the device
name is a NULL pointer. This ioctl was only used for scrub.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
It is basically a good thing if we are interruptible when waiting for
free space, but the generality in which it is implemented currently
leads to system calls being interruptible that are not documented this
way. For example git can't handle interrupted unlink(), leading to
corrupt repos under space pressure.
Instead we raise the bar to only be interruptible by SIGKILL.
Thanks to David Sterba for suggesting this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
The caller expects this function to return with the lock held and
releases it immediately on error.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Fix a bug, where in case we need to adjust stripe_size so that the
length of the resulting chunk is less than or equal to max_chunk_size,
DUP chunks turn out to be only half as big as they could be.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>