As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.
An identical check already appears in ext3 and ext4. This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.
[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has
already been written before calling ocfs2_sync_file() and ocfs2 doesn't use
inode's private_list for tracking metadata buffers thus sync_mapping_buffers()
is superfluous as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Thanks for the comments. I have incorportated them all.
CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
Statistics now look like -
ocfs2_write_ctxt: 2144 - 2136 = 8
ocfs2_inode_info: 1960 - 1848 = 112
ocfs2_journal: 168 - 160 = 8
ocfs2_lock_res: 336 - 304 = 32
ocfs2_refcount_tree: 512 - 472 = 40
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2, actually we don't allow any direct write pass i_size,
see the function ocfs2_prepare_inode_for_write. So we don't
need the bogus simple_setsize.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Now orphan scan worker has no trace log, so it is
very hard to tell whether it is finished or blocked.
So add 2 mlog trace log so that we can tell whether
the current orphan scan worker is blocked or not.
It does help when I analyzed a orphan scan bug.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The reason why we need this ioctl is to offer the none-privileged
end-user a possibility to get filesys info gathering.
We use OCFS2_IOC_INFO to manipulate the new ioctl, userspace passes a
structure to kernel containing an array of request pointers and request
count, such as,
* From userspace:
struct ocfs2_info_blocksize oib = {
.ib_req = {
.ir_magic = OCFS2_INFO_MAGIC,
.ir_code = OCFS2_INFO_BLOCKSIZE,
...
}
...
}
struct ocfs2_info_clustersize oic = {
...
}
uint64_t reqs[2] = {(unsigned long)&oib,
(unsigned long)&oic};
struct ocfs2_info info = {
.oi_requests = reqs,
.oi_count = 2,
}
ret = ioctl(fd, OCFS2_IOC_INFO, &info);
* In kernel:
Get the request pointers from *info*, then handle each request one bye one.
Idea here is to make the spearated request small enough to guarantee
a better backward&forward compatibility since a small piece of request
would be less likely to be broken if filesys on raw disk get changed.
Currently, the following 7 requests are supported per the requirement from
userspace tool o2info, and I believe it will grow over time:-)
OCFS2_INFO_CLUSTERSIZE
OCFS2_INFO_BLOCKSIZE
OCFS2_INFO_MAXSLOTS
OCFS2_INFO_LABEL
OCFS2_INFO_UUID
OCFS2_INFO_FS_FEATURES
OCFS2_INFO_JOURNAL_SIZE
This ioctl is only specific to OCFS2.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'fixes' of git://oss.oracle.com/git/tma/linux-2.6:
ocfs2: Fix orphan add in ocfs2_create_inode_in_orphan
ocfs2: split out ocfs2_prepare_orphan_dir() into locking and prep functions
ocfs2: allow return of new inode block location before allocation of the inode
ocfs2: use ocfs2_alloc_dinode_update_counts() instead of open coding
ocfs2: split out inode alloc code from ocfs2_mknod_locked
Ocfs2: Fix a regression bug from mainline commit(6b933c8e6f).
ocfs2: Fix deadlock when allocating page
ocfs2: properly set and use inode group alloc hint
ocfs2: Use the right group in nfs sync check.
ocfs2: Flush drive's caches on fdatasync
ocfs2: make __ocfs2_page_mkwrite handle file end properly.
ocfs2: Fix incorrect checksum validation error
ocfs2: Fix metaecc error messages
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix lock annotations
fuse: flush background queue on connection close
ocfs2_create_inode_in_orphan() is used by reflink to create the newly
reflinked inode simultaneously in the orphan dir. This allows us to easily
handle partially-reflinked files during recovery cleanup.
We have a problem though - the orphan dir stringifies inode # to determine
a unique name under which the orphan entry dirent can be created. Since
ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
before it can allocate the inode, we currently call into the orphan code:
/*
* We give the orphan dir the root blkno to fake an orphan name,
* and allocate enough space for our insertion.
*/
status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
osb->root_blkno,
orphan_name, &orphan_insert);
Using osb->root_blkno might work fine on unindexed directories, but the
orphan dir can have an index. When it has that index, the above code fails
to allocate the proper index entry. Later, when we try to remove the file
from the orphan dir (using the actual inode #), the reflink operation will
fail.
To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
newly split out orphan and inode alloc code to figure out what the inode
block number will be (once allocated) and then prepare the orphan dir from
that data.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We do this because ocfs2_create_inode_in_orphan() wants to order locking of
the orphan dir with respect to locking of the inode allocator *before*
making any changes to the directory.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
This allows code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is placed
- that code is actually in ocfs2_mknod_locked. This is a problem when the
orphan dirs are indexed as the junk inode number will create an index entry
which goes unused (and fails the later removal from the orphan dir). Now
with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
ocfs2_search_chain() makes the same updates as
ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding
the bitmap update, use our helper function.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Do this by splitting the bulk of the function away from the inode allocation
code at the very tom of ocfs2_mknod_locked(). Existing callers don't need to
change and won't see any difference. The new function created,
__ocfs2_mknod_locked() will be used shortly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
The patch is to fix the regression bug brought from commit 6b933c8...( 'ocfs2:
Avoid direct write if we fall back to buffered I/O'):
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285
The commit 6b933c8e6f changed __generic_file_aio_write
to generic_file_buffered_write, which didn't call filemap_{write,wait}_range to flush
the pagecaches when we were falling O_DIRECT writes back to buffered ones. it did hurt
the O_DIRECT semantics somehow in extented odirect writes.
This patch tries to guarantee O_DIRECT writes of 'fall back to buffered' to be correctly
flushed.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We cannot call grab_cache_page() when holding filesystem locks or with
a transaction started as grab_cache_page() calls page allocation with
GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
causing deadlocks or various assertion failures. We have to use
find_or_create_page() instead and pass it GFP_NOFS as we do with other
allocations.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from
res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under
normal (non-fragmented) circumstances. The discontig block group patches
effectively turned off that feature. Fix this by correctly calculating what
the next group hint should be.
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We have added discontig block group now, and now an inode
can be allocated in an discontig block group. So get
it in ocfs2_get_suballoc_slot_bit.
The old ocfs2_test_suballoc_bit gets group block no
from the allocation inode which is wrong. Fix it by
passing the right group.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
When 'barrier' mount option is specified, we have to issue a cache flush
during fdatasync(2). We have to do this even if inode doesn't have
I_DIRTY_DATASYNC set because we still have to get written *data* to disk so
that they are not lost in case of crash.
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Singed-off-by: Tao Ma <tao.ma@oracle.com>
__ocfs2_page_mkwrite now is broken in handling file end.
1. the last page should be the page contains i_size - 1.
2. the len in the last page is also calculated wrong.
So change them accordingly.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to
read the inode off the disk. The latter first checks to see if that block is
cached in the journal, and, if so, returns that block. That is ok.
But ocfs2_read_locked_inode() goes wrong when it tries to validate the checksum
of such blocks. Blocks that are cached in the journal may not have had their
checksum computed as yet. We should not validate the checksums of such blocks.
Fixes ossbz#1282
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Singed-off-by: Tao Ma <tao.ma@oracle.com>
Like tools, the checksum validate function now prints the values in hex.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Singed-off-by: Tao Ma <tao.ma@oracle.com>