As it is accessed through the struct xfs_mount and can be set up
entirely from fs/xfs/xfs_super.c
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a top-level debug directory for global debug sysfs attributes.
This directory is added and removed on XFS module initialization and
removal respectively for DEBUG mode kernels only. It typically resides
at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
hierarchy as attributes might define global behavior or behavior that
must be configured before an xfs mount is available (e.g., log recovery
behavior).
Define the global debug kobject that represents the debug sysfs
directory and add generic attribute show/store helpers to support future
attributes. No debug attributes are exported as of yet.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Workqueues must be explicitly set as freezable to ensure they are frozen
in the assocated part of the hibernation/suspend sequence. Freezing of
workqueues and kernel threads is important to ensure that modifications
are not made on-disk after the hibernation image has been created.
Otherwise, the in-memory state can become inconsistent with what is on
disk and eventually lead to filesystem corruption. We have reports of
free space btree corruptions that occur immediately after restore from
hibernate that suggest the xfs-eofblocks workqueue could be causing
such problems if it races with hibernation.
Mark all of the internal XFS workqueues as freezable to ensure nothing
changes on-disk once the freezer infrastructure freezes kernel threads
and creates the hibernation image.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Trying to support tiny disks only and saving a bit memory might have
made sense on an SGI O2 15 years ago, but is pretty pointless today.
Remove the rarely tested codepath that uses various smaller in-memory
types to reduce our test matrix and make the codebase a little bit
smaller and less complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_set_inode32() caught my eye because it had weird spacing around
the "-1's". In cleaning that up, I realized that the assignment in
the declaration of "ino" is never used; it's rewritten before it
gets read.
Drop the ino initializer from its declaration since it's not used,
and move the agino initialization into the body of the function,
mostly so that we can have pretty whitespace and not exceed 80
columns. :)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Today, if we perform an xfs_growfs which adds allocation groups,
mp->m_maxagi is not properly updated when the growfs is complete.
Therefore inodes will continue to be allocated only in the
AGs which existed prior to the growfs, and the new space
won't be utilized.
This is because of this path in xfs_growfs_data_private():
xfs_growfs_data_private
xfs_initialize_perag(mp, nagcount, &nagimax);
if (mp->m_flags & XFS_MOUNT_32BITINODES)
index = xfs_set_inode32(mp);
else
index = xfs_set_inode64(mp);
if (maxagi)
*maxagi = index;
where xfs_set_inode* iterates over the (old) agcount in
mp->m_sb.sb_agblocks, which has not yet been updated
in the growfs path. So "index" will be returned based on
the old agcount, not the new one, and new AGs are not available
for inode allocation.
Fix this by explicitly passing the proper AG count (which
xfs_initialize_perag() already has) down another level,
so that xfs_set_inode* can make the proper decision about
acceptable AGs for inode allocation in the potentially
newly-added AGs.
This has been broken since 3.7, when these two
xfs_set_inode* functions were added in commit 2d2194f.
Prior to that, we looped over "agcount" not sb_agblocks
in these calculations.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a sysfs kset to contain all sub-objects associated with the XFS
module. The kset is created and removed on module initialization and
removal respectively. The kset uses fs_obj as a parent. This leads to
the creation of a /sys/fs/xfs directory when the kset exists.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Convert all the errors the core XFs code to negative error signs
like the rest of the kernel and remove all the sign conversion we
do in the interface layers.
Errors for conversion (and comparison) found via searches like:
$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs
Negation points found via searches like:
$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs
[ with some bits I missed from Brian Foster ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.
Just nuke XFS_ERROR and associated bits.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We only have very few of these around, and allocation isn't that
much of a hot path. Remove the slab cache to simplify the code,
and to not waste any resources for the usual case of not having
any inodes that use the filestream allocator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
This reverts commit c4a391b53a. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.
CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>
Pull xfs update from Ben Myers:
"For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
refactoring. Bugfixes that stand out are the fix for the AGF/AGI
deadlock, incore extent list fixes, verifier fixes for v4 superblocks
and growfs, and memory leaks. There are some asserts, warnings, and
strings that were cleaned up. There was further rearrangement of code
to make libxfs and the kernel sync up more easily, differences between
v2 and v3 directory code were abstracted using an ops vector,
xfs_inactive was reworked, and the preallocation/hole punching code
was refactored.
- simplify kmem_zone_zalloc
- add traces for AGF/AGI read ops
- add additional AIL traces
- fix xfs_remove AGF vs AGI deadlock
- fix the extent count of new incore extent page in the indirection
array
- don't fail bad secondary superblocks verification on v4 filesystems
due to unzeroed bits after v4 fields
- fix possible NULL dereference in xlog_verify_iclog
- remove redundant assert in xfs_dir2_leafn_split
- prevent stack overflows from page cache allocation
- fix some sparse warnings
- fix directory block format verifier to check the leaf entry count
- abstract the differences in dir2/dir3 via an ops vector
- continue process of reorganization to make libxfs/kernel code
merges easier
- refactor the preallocation and hole punching code
- fix for growfs and verifiers
- remove unnecessary scary corruption error when probing non-xfs
filesystems
- remove extra newlines from strings passed to printk
- prevent deadlock trying to cover an active log
- rework xfs_inactive()
- add the inode directory type support to XFS_IOC_FSGEOM
- cleanup (remove) usage of is_bad_inode
- fix miscalculation in xfs_iext_realloc_direct which results in
oversized direct extent list
- remove unnecessary count arg to xfs_iomap_write_allocate
- fix memory leak in xlog_recover_add_to_trans
- check superblock instead of block magic to determine if dtype field
is present
- fix lockdep annotation due to project quotas
- fix regression in xfs_node_toosmall which can lead to incorrect
directory btree node collapse
- make log recovery verify filesystem uuid of recovering blocks
- fix XFS_IOC_FREE_EOFBLOCKS definition
- remove invalid assert in xfs_inode_free
- fix for AIL lock regression"
* tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs: (49 commits)
xfs: simplify kmem_{zone_}zalloc
xfs: add tracepoints to AGF/AGI read operations
xfs: trace AIL manipulations
xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
xfs: fix the extent count when allocating an new indirection array entry
xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
xfs: fix possible NULL dereference in xlog_verify_iclog
xfs:xfs_dir2_node.c: pointer use before check for null
xfs: prevent stack overflows from page cache allocation
xfs: fix static and extern sparse warnings
xfs: validity check the directory block leaf entry count
xfs: make dir2 ftype offset pointers explicit
xfs: convert directory vector functions to constants
xfs: convert directory vector functions to constants
xfs: vectorise encoding/decoding directory headers
xfs: vectorise DA btree operations
xfs: vectorise directory leaf operations
xfs: vectorise directory data operations part 2
xfs: vectorise directory data operations
xfs: vectorise remaining shortform dir2 ops
...
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
(e.g. causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.
The following script reproduces the problem:
function run_writers
{
for (( i = 0; i < 10; i++ )); do
mkdir $1/dir$i
for (( j = 0; j < 40000; j++ )); do
dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
done &
done
}
for dir in "$@"; do
run_writers $dir
done
sleep 40
time sync
Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.
To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426. With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kbuild test robot indicated that there were some new sparse
warnings in fs/xfs/xfs_dquot_buf.c. Actually, there were a lot more
that is wasn't warning about, so fix them all up.
Reported-by: kbuild test robot
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Currently the xfs_inode.h header has a dependency on the definition
of the BMAP btree records as the inode fork includes an array of
xfs_bmbt_rec_host_t objects in it's definition.
Move all the btree format definitions from xfs_btree.h,
xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
xfs_format.h to continue the process of centralising the on-disk
format definitions. With this done, the xfs inode definitions are no
longer dependent on btree header files.
The enables a massive culling of unnecessary includes, with close to
200 #include directives removed from the XFS kernel code base.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>