The original goal of commonizing these funcs is to benefit defraging/extent_moving
codes in the future, based on the fact that reflink and defragmentation having
the same Copy-On-Wrtie mechanism.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
AppArmor: kill unused macros in lsm.c
AppArmor: cleanup generated files correctly
KEYS: Add an iovec version of KEYCTL_INSTANTIATE
KEYS: Add a new keyctl op to reject a key with a specified error code
KEYS: Add a key type op to permit the key description to be vetted
KEYS: Add an RCU payload dereference macro
AppArmor: Cleanup make file to remove cruft and make it easier to read
SELinux: implement the new sb_remount LSM hook
LSM: Pass -o remount options to the LSM
SELinux: Compute SID for the newly created socket
SELinux: Socket retains creator role and MLS attribute
SELinux: Auto-generate security_is_socket_class
TOMOYO: Fix memory leak upon file open.
Revert "selinux: simplify ioctl checking"
selinux: drop unused packet flow permissions
selinux: Fix packet forwarding checks on postrouting
selinux: Fix wrong checks for selinux_policycap_netpeer
selinux: Fix check for xfrm selinux context algorithm
ima: remove unnecessary call to ima_must_measure
IMA: remove IMA imbalance checking
...
all remaining callers pass LOOKUP_PARENT to it, so
flags argument can die; renamed to kern_path_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change all the "mlog(0," in fs/ocfs2/refcounttree.c to trace events.
And finally remove masklog ML_REFCOUNT.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Current refcounttree codes actually didn't writeback the new pages out in
write-back mode, due to a bug of always passing a ZERO number of clusters
to 'ocfs2_cow_sync_writeback', the patch tries to pass a proper one in.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <jlbec@evilplan.org>
SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.
This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.
Signed-off-by: Eric Paris <eparis@redhat.com>
We cannot call grab_cache_page() when holding filesystem locks or with
a transaction started as grab_cache_page() calls page allocation with
GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
causing deadlocks or various assertion failures. We have to use
find_or_create_page() instead and pass it GFP_NOFS as we do with other
allocations.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In CoW, when we meet with a readahead page, we know
it is time to move the readahead window. So carry
out a new readahead.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
The refcount record calculation in ocfs2_calc_refcount_meta_credits
is too optimistic that we can always allocate contiguous clusters
and handle an already existed refcount rec as a whole. Actually
because of file system fragmentation, we may have the chance to split
a refcount record into 3 parts during the transaction. So consider
the worst case in record calculation.
Cc: stable@kernel.org
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
During CoW, the pages after i_size don't contain valid data, so there's
no need to read and duplicate them.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size. There also
may be more unwritten extents after that.
When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster. This is a bug.
We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. They will use
ocfs2_zero_extend() to ensure the data is properly zeroed.
Older versions of ocfs2_zero_extend() simply zeroed every block between
i_size and the zeroing position. This presumes three things:
1) There is allocation for all of these blocks.
2) The extents are not unwritten.
3) The extents are not refcounted.
(1) and (2) hold true for non-sparse filesystems, which used to be the
only users of ocfs2_zero_extend(). (3) is another bug.
Since we're now using ocfs2_zero_extend() for sparse filesystems as
well, we teach ocfs2_zero_extend() to check every extent between
i_size and the zeroing position. If the extent is unwritten, it is
ignored. If it is refcounted, it is CoWed. Then it is zeroed.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
ocfs2: Silence a gcc warning.
ocfs2: Don't retry xattr set in case value extension fails.
ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
fs/ocfs2/dlm: Use kstrdup
fs/ocfs2/dlm: Drop memory allocation cast
Ocfs2: Optimize punching-hole code.
Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
ocfs2: Wrap signal blocking in void functions.
ocfs2/dlm: Increase o2dlm lockres hash size
ocfs2: Make ocfs2_extend_trans() really extend.
ocfs2/trivial: Code cleanup for allocation reservation.
ocfs2: make ocfs2_adjust_resv_from_alloc simple.
ocfs2: Make nointr a default mount option
ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2net: log socket state changes
ocfs2: print node # when tcp fails
...
Truncate is just a special case of punching holes(from new i_size to
end), we therefore could take advantage of the existing
ocfs2_remove_btree_range() to reduce the comlexity and redundancy in
alloc.c. The goal here is to make truncate more generic and
straightforward.
Several functions only used by ocfs2_commit_truncate() will smiply be
removed.
ocfs2_remove_btree_range() was originally used by the hole punching
code, which didn't take refcount trees into account (definitely a bug).
We therefore need to change that func a bit to handle refcount trees.
It must take the refcount lock, calculate and reserve blocks for
refcount tree changes, and decrease refcounts at the end. We replace
ocfs2_lock_allocators() here by adding a new func
ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
reserve. This will not hurt any other code using
ocfs2_remove_btree_range() (such as dir truncate and hole punching).
I merged the following steps into one patch since they may be
logically doing one thing, though I know it looks a little bit fat
to review.
1). Remove redundant code used by ocfs2_commit_truncate(), since we're
moving to ocfs2_remove_btree_range anyway.
2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of
accepting some extra blocks to reserve.
3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
needs. It's safe to do this since it's only being called by
truncate.
4). Change ocfs2_remove_btree_range() a bit to take refcount case into
account.
5). Finally, we change ocfs2_commit_truncate() to call
ocfs2_remove_btree_range() in a proper way.
The patch has been tested normally for sanity check, stress tests
with heavier workload will be expected.
Based on this patch, fixing the punching holes bug will be fairly easy.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
blocks. But if jbd2_journal_extend() fails, it will only restart
with the the new number of blocks. This tends to be awkward since
in most cases we want additional reserved blocks. It makes our code
harder to mantain since the caller can't be sure all the original
blocks will not be accessed and dirtied again. There are 15 callers
of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
h_buffer_credits before they call ocfs2_extend_trans(). This makes
ocfs2_extend_trans() really extend atop the original block count.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0
since before the kernel moved to git. There is no point in checking
this error.
ocfs2_journal_dirty() has been faithfully returning the status since the
beginning. All over ocfs2, we have blocks of code checking this can't
fail status. In the past few years, we've tried to avoid adding these
checks, because they are pointless. But anyone who looks at our code
assumes they are needed.
Finally, ocfs2_journal_dirty() is made a void function. All error
checking is removed from other files. We'll BUG_ON() the status of
jbd2_journal_dirty_metadata() just in case they change it someday. They
won't.
Signed-off-by: Joel Becker <joel.becker@oracle.com>