For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.
This is a part of the effort to reduce number of ext4 options hence the
test matrix.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently, there exists a race between delayed allocated writes and
the writeback when bigalloc feature is in use. The race was because we
wanted to determine what blocks in a cluster are under delayed
allocation and we were using buffer_delayed(bh) check for it. But, the
writeback codepath clears this bit without any synchronization which
resulted in a race and an ext4 warning similar to:
EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
reserved data blocks
The race existed in two places.
(1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
writeback code path.
(2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
buffer_delayed(bh) is set.
To fix (1), this patch introduces a new buffer_head state bit -
BH_Da_Mapped. This bit is set under the protection of
EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
allocated blocks during the writeout time. We can now reliably check
for this bit inside ext4_find_delalloc_range() to determine whether
the reservation for the blocks have already been claimed or not.
To fix (2), it was necessary to set buffer_delay(bh) under the
protection of i_data_sem. So, I extracted the very beginning of
ext4_map_blocks into a new function - ext4_da_map_blocks() - and
performed the required setting of bh_delay bit and the quota
reservation under the protection of i_data_sem. These two fixes makes
the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
thus removing the race.
Tested: I was able to reproduce the problem by running 'dd' and
'fsync' in parallel. Also, xfstests sometimes used to reproduce this
race. After the fix both my test and xfstests were successful and no
race (warning message) was observed.
Google-Bug-Id: 4997027
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.
Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename the function so it is more clear what is going on. Also rename
the various variables so it's clearer what's happening.
Also fix a missing blocks to cluster conversion when reading the
number of reserved blocks for root.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions. So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.
Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not. That will be up the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.
We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard make changes to
support clusters. So as an initial cleanup, I've separated out the
functionality of initializing block bitmap from calculating the number
of free blocks in the new block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This adds supports for bigalloc file systems. It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).
Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported. Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.
Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds two new routines: ext4_discard_partial_page_buffers
and ext4_discard_partial_page_buffers_no_lock.
The ext4_discard_partial_page_buffers routine is a wrapper
function to ext4_discard_partial_page_buffers_no_lock.
The wrapper function locks the page and passes it to
ext4_discard_partial_page_buffers_no_lock.
Calling functions that already have the page locked can call
ext4_discard_partial_page_buffers_no_lock directly.
The ext4_discard_partial_page_buffers_no_lock function
zeros a specified range in a page, and unmaps the
corresponding buffer heads. Only block aligned regions of the
page will have their buffer heads unmapped. Unblock aligned regions
will be mapped if needed so that they can be updated with the
partial zero out. This function is meant to
be used to update a page and its buffer heads to be zeroed
and unmapped when the corresponding blocks have been released
or will be released.
This routine is used in the following scenarios:
* A hole is punched and the non page aligned regions
of the head and tail of the hole need to be discarded
* The file is truncated and the partial page beyond EOF needs
to be discarded
* The end of a hole is in the same page as EOF. After the
page is flushed, the partial page beyond EOF needs to be
discarded.
* A write operation begins or ends inside a hole and the partial
page appearing before or after the write needs to be discarded
* A write operation extends EOF and the partial page beyond EOF
needs to be discarded
This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
which is used when a write operation begins or ends in a hole.
When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only
buffer heads that are already unmapped will have the corresponding
regions of the page zeroed.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file). This was
discovered when xfstests #79 was generalized so it could run on all
file systems.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc:stable@kernel.org
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places. In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.
This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode(). Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix. Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.
This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
ext4: prevent memory leaks from ext4_mb_init_backend() on error path
ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
ext4: use ext4_msg() instead of printk in mballoc
ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
ext4: use the correct error exit path in ext4_init_inode_table()
ext4: add missing kfree() on error return path in add_new_gdb()
ext4: change umode_t in tracepoint headers to be an explicit __u16
ext4: fix races in ext4_sync_parent()
ext4: Fix overflow caused by missing cast in ext4_fallocate()
ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
ext4: simplify parameters of reserve_backup_gdb()
ext4: simplify parameters of add_new_gdb()
ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
ext4: simplify journal handling in setup_new_group_blocks()
ext4: let setup_new_group_blocks() set multiple bits at a time
ext4: fix a typo in ext4_group_extend()
ext4: let ext4_group_add_blocks() handle 0 blocks quickly
ext4: let ext4_group_add_blocks() return an error code
ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
...
Fix up conflict in fs/ext4/inode.c: commit aacfc19c62 ("fs: simplify
the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
function for the new simplified calling convention, while commit
dae1e52cb1 ("ext4: move ext4_ind_* functions from inode.c to
indirect.c") moved the function to another file.
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle error correctly.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>