Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future. This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.
Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios. The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.
This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.
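
The split into two lists can be pictured with a small user-space sketch. It is not the btrfs code: struct fake_bio, struct fake_device, queue_bio() and next_bio() are hypothetical names, and a plain flag stands in for WRITE_SYNC. The only point is that the submitter drains the high priority list before it touches the low priority one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; only the fields the sketch needs. */
struct fake_bio {
    int id;
    int sync;               /* nonzero models a WRITE_SYNC request */
    struct fake_bio *next;
};

/* Hypothetical per-device queue: one FIFO per priority level. */
struct fake_device {
    struct fake_bio *sync_head, *sync_tail;
    struct fake_bio *async_head, *async_tail;
};

/* Append a bio to the list matching its priority. */
static void queue_bio(struct fake_device *dev, struct fake_bio *bio)
{
    struct fake_bio **head = bio->sync ? &dev->sync_head : &dev->async_head;
    struct fake_bio **tail = bio->sync ? &dev->sync_tail : &dev->async_tail;

    bio->next = NULL;
    if (*tail)
        (*tail)->next = bio;
    else
        *head = bio;
    *tail = bio;
}

/* Pop the next bio to submit: drain the sync list before the async one. */
static struct fake_bio *next_bio(struct fake_device *dev)
{
    struct fake_bio **head = dev->sync_head ? &dev->sync_head : &dev->async_head;
    struct fake_bio **tail = dev->sync_head ? &dev->sync_tail : &dev->async_tail;
    struct fake_bio *bio = *head;

    if (!bio)
        return NULL;
    *head = bio->next;
    if (!*head)
        *tail = NULL;
    return bio;
}
```

Within each list the order stays first-in, first-out, so sync writes only jump ahead of async writes, never ahead of each other.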
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: BUG to BUG_ON changes
Btrfs: remove dead code
Btrfs: remove dead code
Btrfs: fix typos in comments
Btrfs: remove unused ftrace include
Btrfs: fix __ucmpdi2 compile bug on 32 bit builds
Btrfs: free inode struct when btrfs_new_inode fails
Btrfs: fix race in worker_loop
Btrfs: add flushoncommit mount option
Btrfs: notreelog mount option
Btrfs: introduce btrfs_show_options
Btrfs: rework allocation clustering
Btrfs: Optimize locking in btrfs_next_leaf()
Btrfs: break up btrfs_search_slot into smaller pieces
Btrfs: kill the pinned_mutex
Btrfs: kill the block group alloc mutex
Btrfs: clean up find_free_extent
Btrfs: free space cache cleanups
Btrfs: unplug in the async bio submission threads
Btrfs: keep processing bios for a given bdev if our proc is batching
Because btrfs is copy-on-write, we end up picking new locations for
blocks very often. This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.
This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation. The end result is that we tend to write sequentially to
the drive.
This happens all the time for metadata and it happens for data
when mounted -o ssd. But, the way we record it is fairly racy
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.
This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.
The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents. This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.
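
As a rough illustration of building a cluster out of smaller free space extents, here is a user-space sketch. It is not the allocator from this commit: struct free_extent, find_cluster() and the max_gap limit are assumptions made for the example. It walks extents sorted by offset and accepts a run of neighbors once their combined size is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free space extent: [start, start + len). */
struct free_extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Scan extents sorted by start offset and find the first run of neighbors
 * whose combined free bytes reach `want`, where no gap between neighbors
 * exceeds `max_gap`. On success, store the run in [*first, *first + *count)
 * and return 1; otherwise return 0.
 */
static int find_cluster(const struct free_extent *ext, size_t n,
                        uint64_t want, uint64_t max_gap,
                        size_t *first, size_t *count)
{
    size_t run_start = 0;
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i > 0) {
            uint64_t prev_end = ext[i - 1].start + ext[i - 1].len;

            /* Too far from the previous extent: start a new run here. */
            if (ext[i].start - prev_end > max_gap) {
                run_start = i;
                total = 0;
            }
        }
        total += ext[i].len;
        if (total >= want) {
            *first = run_start;
            *count = i - run_start + 1;
            return 1;
        }
    }
    return 0;
}
```

The extra pass over the extents is where the added CPU time mentioned above would come from in a scheme like this.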
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes the pinned_mutex. The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there. We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: try to free metadata pages when we free btree blocks
Btrfs: add extra flushing for renames and truncates
Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
Btrfs: optimize fsyncs on old files
Btrfs: tree logging unlink/rename fixes
Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
Btrfs: limit balancing work while flushing delayed refs
Btrfs: readahead checksums during btrfs_finish_ordered_io
Btrfs: leave btree locks spinning more often
Btrfs: Only let very young transactions grow during commit
Btrfs: Check for a blocking lock before taking the spin
Btrfs: reduce stack in cow_file_range
Btrfs: reduce stalls during transaction commit
Btrfs: process the delayed reference queue in clusters
Btrfs: try to cleanup delayed refs while freeing extents
Btrfs: reduce stack usage in some crucial tree balancing functions
Btrfs: do extent allocation and reference count updates in the background
Btrfs: don't preallocate metadata blocks during btrfs_search_slot
Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.
This is especially important for rename, which many applications use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.
If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordered mode, except it is only done on selected files.
Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which uses a sub-commit).
For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).
For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
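
The two rules above reduce to a pair of small predicates. The sketch below is only a restatement of the prose: struct inode_state and both function names are hypothetical, and the "larger files go to filemap_flush right away" step is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of the inode state the rules look at. */
struct inode_state {
    uint64_t size;        /* current file size in bytes */
    int fully_on_disk;    /* nonzero once all data has been written back */
};

/* Rename: order the source only if it has pending IO and replaces a file. */
static int rename_needs_ordering(const struct inode_state *src,
                                 int target_exists)
{
    return !src->fully_on_disk && target_exists;
}

/* Truncate: order only when a non-empty file is cut down to zero bytes. */
static int truncate_needs_ordering(const struct inode_state *inode,
                                   uint64_t new_size)
{
    return inode->size > 0 && new_size == 0;
}
```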
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.
This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.
This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.
Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree. These are processed by
finding, one at a time, records in the tree that are not currently
being processed.
This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.
This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.
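
A minimal sketch of the cursor idea, under stated assumptions: a sorted array stands in for the rbtree, and struct fake_ref and grab_cluster() are hypothetical names rather than the functions added by this commit. Each call picks up where the previous one stopped instead of searching from the front again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical delayed ref; a sorted array stands in for the rbtree. */
struct fake_ref {
    uint64_t bytenr;
    int in_progress;
};

/*
 * Collect up to `max` refs that nobody else is processing, starting at the
 * first entry with bytenr >= *cursor. The cursor then moves just past the
 * last ref taken, or back to 0 once the end of the array is reached.
 * Returns the number of refs stored in `out`.
 */
static size_t grab_cluster(struct fake_ref *refs, size_t n, uint64_t *cursor,
                           struct fake_ref **out, size_t max)
{
    size_t taken = 0;
    size_t i = 0;

    /* Linear skip to the cursor; a real tree would do an ordered lookup. */
    while (i < n && refs[i].bytenr < *cursor)
        i++;

    for (; i < n && taken < max; i++) {
        if (refs[i].in_progress)
            continue;
        refs[i].in_progress = 1;
        out[taken++] = &refs[i];
        *cursor = refs[i].bytenr + 1;
    }
    if (i == n)
        *cursor = 0;   /* wrap so the next search restarts at the front */
    return taken;
}
```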
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.
If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.
These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.
This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:
1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.
#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.
These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.
Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.
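
Benefits 1) and 2) follow from keeping the pending updates sorted and merging updates that hit the same key. The sketch below shows that idea with a sorted array in place of the rbtree; struct pending_ref, struct pending_tree, queue_ref() and MAX_PENDING are hypothetical and much simpler than the structures in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 64

/* Hypothetical pending update; ref_mod is the net reference count change. */
struct pending_ref {
    uint64_t bytenr;   /* byte number of the extent */
    uint64_t parent;   /* byte number of the parent block */
    int ref_mod;
};

/* Sorted array standing in for the rbtree described above. */
struct pending_tree {
    struct pending_ref refs[MAX_PENDING];
    size_t count;
};

/* Order by extent byte number first, then by parent byte number. */
static int ref_cmp(const struct pending_ref *r, uint64_t bytenr,
                   uint64_t parent)
{
    if (r->bytenr != bytenr)
        return r->bytenr < bytenr ? -1 : 1;
    if (r->parent != parent)
        return r->parent < parent ? -1 : 1;
    return 0;
}

/*
 * Queue a reference count change. A change for a key that is already
 * pending is folded into the existing entry, so an add followed by a drop
 * nets out to zero without a second entry. Returns 0, or -1 when full.
 */
static int queue_ref(struct pending_tree *t, uint64_t bytenr, uint64_t parent,
                     int delta)
{
    size_t i = 0;

    while (i < t->count && ref_cmp(&t->refs[i], bytenr, parent) < 0)
        i++;
    if (i < t->count && ref_cmp(&t->refs[i], bytenr, parent) == 0) {
        t->refs[i].ref_mod += delta;
        return 0;
    }
    if (t->count == MAX_PENDING)
        return -1;
    /* Shift the tail up by one slot to keep the array sorted. */
    memmove(&t->refs[i + 1], &t->refs[i],
            (t->count - i) * sizeof(t->refs[0]));
    t->refs[i] = (struct pending_ref){ bytenr, parent, delta };
    t->count++;
    return 0;
}
```

Walking the array from front to back then visits extents in byte order, which is the "close to disk order" property the list above refers to.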
Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places. But, it wasn't correct for UP compiled
kernels.
This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.
The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.
btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock ordering.
The lockdep magic came from Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
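
The state transitions described above can be modeled in a few lines. This is a single-threaded model only, with no real spinlock, wait queue or adaptive spin; struct eb_lock and the model_* functions are hypothetical stand-ins for the btrfs functions named in the text, and the lock attempt returns 0 where real code would wait and retry.

```c
#include <assert.h>

/* Hypothetical single-threaded model of one extent buffer lock. */
struct eb_lock {
    int spin_held;   /* models the spinlock being held */
    int blocking;    /* models the EXTENT_BUFFER_BLOCKING bit */
};

/*
 * Try to take the lock. Returns 1 with the spinlock held, or 0 if the
 * spinlock is busy or the blocking bit is set; a real caller would spin or
 * sleep on a wait queue and retry instead of returning.
 */
static int model_tree_lock(struct eb_lock *l)
{
    if (l->spin_held || l->blocking)
        return 0;
    l->spin_held = 1;
    return 1;
}

/* Holder is about to schedule: set the bit, then drop the spinlock. */
static void model_set_lock_blocking(struct eb_lock *l)
{
    l->blocking = 1;
    l->spin_held = 0;
}

/* Clear the bit and return with the spinlock held again. */
static void model_clear_lock_blocking(struct eb_lock *l)
{
    l->spin_held = 1;
    l->blocking = 0;
}

/* Works for both states: drop whichever form of the lock is held. */
static void model_tree_unlock(struct eb_lock *l)
{
    if (l->blocking)
        l->blocking = 0;
    else
        l->spin_held = 0;
}
```

In the model, the buffer counts as locked whenever either field is set, which matches the statement that a blocking buffer is still considered locked by the rest of the code.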
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before metadata is written to disk, it is updated to reflect that writeout
has begun. Once this update is done, the block must be cow'd before it
can be modified again.
This update was originally synchronized by using a per-fs spinlock. Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.
So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short. This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.
It allows more work to be added to the worker
without requiring notifications from other procs.
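
A sketch of the "check a few more times before sleeping" idea, with loud assumptions: recheck_before_sleep() is a hypothetical name, the pending counter stands in for the worker's queue, and the arrivals array simulates work that other procs queue during each short delay. Real code would delay and re-read the queue instead.

```c
#include <assert.h>

/*
 * Called when the worker has run out of work. Checks the queue up to
 * `max_polls` times; arrivals[i] models work queued by other procs during
 * the i-th short delay. Returns 1 if the worker should keep running, or 0
 * if it should go to sleep and wait for an explicit wakeup.
 */
static int recheck_before_sleep(int *pending, const int *arrivals,
                                int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (*pending > 0)
            return 1;
        *pending += arrivals[i];   /* stands in for the short delay */
    }
    return *pending > 0;
}
```

When the function returns 1, the producer never had to wake the worker, which is the notification saving the commit message describes.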
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.
This patch makes the following changes to fix the bug:
Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.
Don't insert the root item into the log root tree immediately
after the log tree is allocated. The root item for a log tree is
inserted when the log tree gets synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.
At tree-log sync, btrfs_sync_log first syncs the log tree,
then updates the corresponding root item in the log root tree,
then syncs the log root tree, and finally updates the super block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The data in fs_info->super_for_commit are zeros before the
first transaction commit. If a tree log sync and a system crash
both occur before the first transaction commit, the super block
will get corrupted.
This fixes it by properly filling in the super_for_commit field at
open time.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
bio_end_io for reads without checksumming on and btree writes were
happening without using async thread pools. This means the extent_io.c
code had to use spin_lock_irq and friends on the rb tree locks for
extent state.
There were some irq safe vs unsafe lock inversions between the delalloc
lock and the extent state locks. This patch gets rid of them by moving
all end_io code into the thread pools.
To avoid contention and deadlocks between the data end_io processing and the
metadata end_io processing, yet another thread pool is added to finish
off metadata writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>