When we made all inode updates transactional, we no longer needed
the log recovery detection for inodes being newer on disk than the
transaction being replayed - it was redundant as replay of the log
would always result in the latest version of the inode would be on
disk. It was redundant, but left in place because it wasn't
considered to be a problem.
However, with the new "don't read inodes on create" optimisation,
flushiter has come back to bite us. Essentially, the optimisation
made always initialises flushiter to zero in the create transaction,
and so if we then crash and run recovery and the inode already on
disk has a non-zero flushiter it will skip recovery of that inode.
As a result, log recovery does the wrong thing and we end up with a
corrupt filesystem.
Because we have to support old kernel to new kernel upgrades, we
can't just get rid of the flushiter support in log recovery as we
might be upgrading from a kernel that doesn't have fully transactional
inode updates. Unfortunately, for v4 superblocks there is no way to
guarantee that log recovery knows about this fact.
We cannot add a new inode format flag to say it's a "special inode
create" because it won't be understood by older kernels and so
recovery could do the wrong thing on downgrade. We cannot specially
detect the combination of zero mode/non-zero flushiter on disk to
non-zero mode, zero flushiter in the log item during recovery
because wrapping of the flushiter can result in false detection.
Hence that makes this "don't use flushiter" optimisation limited to
a disk format that guarantees that we don't need it. And that means
the only fix here is to limit the "no read IO on create"
optimisation to version 5 superblocks....
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit e60896d8f2)
XFS_BROOT_SIZE_ADJ is an undocumented macro which accounts for
the difference in size between the on-disk and in-core btree
root. It's much clearer to just use the newly-added
XFS_BMAP_BMDR_SPACE macro which gives us the on-disk size
directly.
In one case, we must test that the if_broot exists before
applying the macro, however.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Long ago, bulkstat used to read inodes directly from the backing
buffer for speed. This had the unfortunate problem of being cache
incoherent with unlinks, and so xfs_ifree() had to mark the inode
as free directly in the backing buffer. bulkstat was changed some
time ago to use inode cache coherent lookups, and so will never see
unlinked inodes in it's lookups. Hence xfs_ifree() does not need to
touch the inode backing buffer anymore.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we are allocating a new inode, we read the inode cluster off
disk to increment the generation number. We are already using a
random generation number for newly allocated inodes, so if we are not
using the ikeep mode, we can just generate a new generation number
when we initialise the newly allocated inode.
This avoids the need for reading the inode buffer during inode
creation. This will speed up allocation of inodes in cold, partially
allocated clusters as they will no longer need to be read from disk
during allocation. It will also reduce the CPU overhead of inode
allocation by not having the process the buffer read, even on cache
hits.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The inode unlinked list manipulations operate directly on the inode
buffer, and so bypass the inode CRC calculation mechanisms. Hence an
inode on the unlinked list has an invalid CRC. Fix this by
recalculating the CRC whenever we modify an unlinked list pointer in
an inode, ncluding during log recovery. This is trivial to do and
results in unlinked list operations always leaving a consistent
inode in the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Running a CONFIG_XFS_DEBUG kernel in production environments is not
the best idea as it introduces significant overhead, can change
the behaviour of algorithms (such as allocation) to improve test
coverage, and (most importantly) panic the machine on non-fatal
errors.
There are many cases where all we want to do is run a
kernel with more bounds checking enabled, such as is provided by the
ASSERT() statements throughout the code, but without all the
potential overhead and drawbacks.
This patch converts all the ASSERT statements to evaluate as
WARN_ON(1) statements and hence if they fail dump a warning and a
stack trace to the log. This has minimal overhead and does not
change any algorithms, and will allow us to find strange "out of
bounds" problems more easily on production machines.
There are a few places where assert statements contain debug only
code. These are converted to be debug-or-warn only code so that we
still get all the assert checks in the code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a new inode version with a larger core. The primary objective is
to allow for a crc of the inode, and location information (uuid and ino)
to verify it was written in the right place. We also extend it by:
a creation time (for Samba);
a changecount (for NFSv4);
a flush sequence (in LSN format for recovery);
an additional inode flags field; and
some additional padding.
These additional fields are not implemented yet, but already laid
out in the structure.
[dchinner@redhat.com] Added LSN and flags field, some factoring and rework to
capture all the necessary information in the crc calculation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add support for larger btree blocks that contains a CRC32C checksum,
a filesystem uuid and block number for detecting filesystem
consistency and out of place writes.
[dchinner@redhat.com] Also include an owner field to allow reverse
mappings to be implemented for improved repairability and a LSN
field to so that log recovery can easily determine the last
modification that made it to disk for each buffer.
[dchinner@redhat.com] Add buffer log format flags to indicate the
type of buffer to recovery so that we don't have to do blind magic
number tests to determine what the buffer is.
[dchinner@redhat.com] Modified to fit into the verifier structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Remove the XFS_TRANS_DEBUG routines. They are no longer appropriate
and have not been used in years
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.
This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.
We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.
This patch also fixes a directory block readahead verifier issue
it exposed.
This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Metadata buffers that are read from disk have write verifiers
already attached to them, but newly allocated buffers do not. Add
appropriate write verifiers to all new metadata buffers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
These verifiers are essentially the same code as the read verifiers,
but do not require ioend processing. Hence factor the read verifier
functions and add a new write verifier wrapper that is used as the
callback.
This is done as one large patch for all verifiers rather than one
patch per verifier as the change is largely mechanical. This
includes hooking up the write verifier via the read verifier
function.
Hooking up the write verifier for buffers obtained via
xfs_trans_get_buf() will be done in a separate patch as that touches
code in many different places rather than just the verifier
functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an btree block verify callback function and pass it into the
buffer read functions. Because each different btree block type
requires different verification, add a function to the ops structure
that is called from the generic code.
Also, propagate the verification callback functions through the
readahead functions, and into the external bmap and bulkstat inode
readahead code that uses the generic btree buffer read functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an inode buffer verify callback function and pass it into the
buffer read functions. Inodes are special in that the verbose checks
will be done when reading the inode, but we still need to sanity
check the buffer when that is first read. Always verify the magic
numbers in all inodes in the buffer, rather than jus ton debug
kernels.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a verifier function callback capability to the buffer read
interfaces. This will be used by the callers to supply a function
that verifies the contents of the buffer when it is read from disk.
This patch does not provide callback functions, but simply modifies
the interfaces to allow them to be called.
The reason for adding this to the read interfaces is that it is very
difficult to tell fom the outside is a buffer was just read from
disk or whether we just pulled it out of cache. Supplying a callbck
allows the buffer cache to use it's internal knowledge of the buffer
to execute it only when the buffer is read from disk.
It is intended that the verifier functions will mark the buffer with
an EFSCORRUPTED error when verification fails. This allows the
reading context to distinguish a verification error from an IO
error, and potentially take further actions on the buffer (e.g.
attempt repair) based on the error reported.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This check is used in multiple places to determine whether we
should check for (and potentially free) post EOF blocks on an
inode. Add a helper to consolidate the check.
Note that when we remove an inode from the cache (xfs_inactive()),
we are required to trim post-EOF blocks even if the inode is marked
preallocated or append-only to maintain correct space accounting.
The 'force' parameter to xfs_can_free_eofblocks() specifies whether
we should ignore the prealloc/append-only status of the inode.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
I found some out of date comments while studying the inode allocation
code, so I believe it's worth to have these comments updated.
It basically rewrites the comment regarding to "call_again" variable,
which is not used anymore, but instead, callers of xfs_ialloc() decides
if it needs to be called again relying only if ialloc_context is NULL or
not.
Also did some small changes in another comment that I thought to be
pertinent to the current behaviour of these functions and some alignment
on both comments.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
The inode cache functions remaining in xfs_iget.c can be moved to xfs_icache.c
along with the other inode cache functions. This removes all functionality from
xfs_iget.c, so the file can simply be removed.
This move results in various functions now only having the scope of a single
file (e.g. xfs_inode_free()), so clean up all the definitions and exported
prototypes in xfs_icache.[ch] and xfs_inode.h appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_ilock() and friends really aren't related to the inode cache in
any way, so move them to xfs_inode.c with all the other inode
related functionality.
While doing this move, move the xfs_ilock() tracepoints to *before*
the lock is taken so that when a hang on a lock occurs we have
events to indicate which process and what inode we were trying to
lock when the hang occurred. This is much better than the current
silence we get on a hang...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
An inode that enters xfs_inactive has been removed from all global
lists but the inode hash, and can't be recycled in xfs_iget before
it has been marked reclaimable. Thus taking the iolock in here
is not nessecary at all, and given the amount of lockdep false
positives it has triggered already I'd rather remove the locking.
The only change outside of xfs_inactive is relaxing an assert in
xfs_itruncate_extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We can simplify check the IO_agbp pointer for being non-NULL instead of
passing another argument through two layers of function calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There is no need to keep this helper around, opencoding it in the only
caller is just as clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
All callers of xfs_imap_to_bp want the dinode pointer, so let's calculate it
inside xfs_imap_to_bp. Once that is done xfs_itobp becomes a fairly pointless
wrapper which can be replaced with direct calls to xfs_imap_to_bp.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
XFS_MAXIOFFSET() is just a simple macro that resolves to
mp->m_maxioffset. It doesn't need to exist, and it just makes the
code unnecessarily loud and shouty.
Make it quiet and easy to read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>