If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.
Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:
# Scenario 1
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir -p /mnt/a/x
echo "hello" > /mnt/a/x/foo
echo "world" > /mnt/a/x/bar
sync
mv /mnt/a/x /mnt/a/y
mkdir /mnt/a/x
xfs_io -c fsync /mnt/a/x
<power failure happens>
The next time the fs is mounted, log tree replay happens and
the directory "y" does not exist nor do the files "foo" and
"bar" exist anywhere (neither in "y" nor in "x", nor the root
nor anywhere).
# Scenario 2
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir /mnt/a
echo "hello" > /mnt/a/foo
sync
mv /mnt/a/foo /mnt/a/bar
echo "world" > /mnt/a/foo
xfs_io -c fsync /mnt/a/foo
<power failure happens>
The next time the fs is mounted, log tree replay happens and the
file "bar" does not exists anymore. A file with the name "foo"
exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir /mnt/testdir
btrfs subvolume snapshot /mnt /mnt/testdir/snap
btrfs subvolume delete /mnt/testdir/snap
rmdir /mnt/testdir
mkdir /mnt/testdir
xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
<power failure>
The next time the fs is mounted the log replay procedure fails because
it attempts to delete the snapshot entry (which has dir item key type
of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
resulting in the following error that causes mount to fail:
[52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[52174.512570] ------------[ cut here ]------------
[52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
[52174.514681] BTRFS: Transaction aborted (error -2)
[52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
[52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1
[52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
[52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
[52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
[52174.524053] Call Trace:
[52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90
[52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
[52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
[52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284
[52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
[52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
[52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
[52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
[52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
[52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
[52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
[52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
[52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131
[52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
[52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
[52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
[52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131
[52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
[52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
[52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
[52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f
[52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
[52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:
* fstests: generic test for fsync after renaming directory
https://patchwork.kernel.org/patch/8694281/
* fstests: generic test for fsync after renaming file
https://patchwork.kernel.org/patch/8694301/
* fstests: add btrfs test for fsync after snapshot deletion
https://patchwork.kernel.org/patch/8670671/
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If device replace entry was found on disk at mounting and its num_write_errors
stats counter has non-NULL value, then replace operation will never be
finished and -EIO error will be reported by btrfs_scrub_dev() because
this counter is never reset.
# mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
# btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs
# btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error
Reset num_write_errors and num_uncorrectable_read_errors counters in the
dev_replace structure before start of replacing.
Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extents) and the accounting side. Taken together it allows us
to see what qgroup operations have happened, and what their result was.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fd we pass in may not be on a btrfs file system, so don't try to do
BTRFS_I() on it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
create_pending_snapshot() will go readonly on _any_ error return from
btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by
just making a snapshot and asking it to inherit from an invalid qgroup. For
example:
$ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo
Will cause a transaction abort.
Fix this by only throwing errors in btrfs_qgroup_inherit() when we know
going readonly is acceptable.
The following xfstests test case reproduces this bug:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# remove previous $seqres.full before test
rm -f $seqres.full
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs
_scratch_mount
_run_btrfs_util_prog quota enable $SCRATCH_MNT
# The qgroup '1/10' does not exist and should be silently ignored
_run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1
_scratch_unmount
echo "Silence is golden"
status=0
exit
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As one user in mail list report reproducible balance ENOSPC error, it's
better to add more debug info for enospc_debug mount option.
Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter's static checker has found this error, it's introduced by
commit 64c043de46
("Btrfs: fix up read_tree_block to return proper error")
It's really supposed to 'break' the loop on error like others.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- We call inode_size_ok() only if FL_KEEP_SIZE isn't specified.
- As an optimisation we can skip the call if (off + len)
isn't greater than the current size of the file. This operation
is called under the lock so the less work we do, the better.
- If we call inode_size_ok() pass to it the correct value rather
than a more conservative estimation.
Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
transaction_kthread() is calling try_to_freeze(), but that's just an
expeinsive no-op given the fact that the thread is not marked freezable.
After removing this, disk-io.c is now independent on freezer API.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
cleaner_kthread() is not marked freezable, and therefore calling
try_to_freeze() in its context is a pointless no-op.
In addition to that, as has been clearly demonstrated by 80ad623edd
("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"), it's perfectly
valid / legal for cleaner_kthread() to stay scheduled out in an arbitrary
place during suspend (in that particular example that was waiting for
reading of extent pages), so there is no need to leave any traces of
freezer in this kthread.
Fixes: 80ad623edd ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Fixes: 6962491321 ("btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c40a3d38af (Btrfs: Compute and look up csums based on
sectorsized blocks) changes around how we walk the bios while looking up
crcs. There's an inner loop that is jumping to the next bvec based on
sectors and before it derefs the next bvec, it needs to make sure we're
still in the bio.
In this case, the outer loop would have decided to stop moving forward
too, and the bvec deref is never actually used for anything. But
CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Dont print warning for ENOSPC error unless ENOSPC_DEBUG is enabled. Use
btrfs_debug if it is enabled.
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
[ preserve the WARN_ON ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's basically harmless if "ref_level" isn't initialized since it's only
used for an error message, but it causes a static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's better to show a warning message for the exceptional case
that one of objectid (in most case, inode number) reaches its
highest value. For example, if inode cache is off and this event
happens, we can't create any file even if there are not so many files.
This message ease detecting such problem.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The document in the kernel sources is yet another palce where the
documentation would need to be updated, while it is not the primary
source. We actively maintain the wiki pages.
Signed-off-by: David Sterba <dsterba@suse.com>
When logging that an inode exists, for example as part of a directory
fsync operation, we were collecting any ordered extents for the inode but
we ended up doing nothing with them except tagging them as processed, by
setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
collecting and processing them. This created a time window where a second
fsync against the inode, using the fast path, ended up not logging the
checksums for the new extents but it logged the extents since they were
part of the list of modified extents. This happened because the ordered
extents were not collected and checksums were not yet added to the csum
tree - the ordered extents have not gone through btrfs_finish_ordered_io()
yet (which is where we add them to the csum tree by calling
inode.c:add_pending_csums()).
So fix this by not collecting an inode's ordered extents if we are logging
it with the LOG_INODE_EXISTS mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>