Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...
... and that - only to get the superblock. Privroot is a directory
and we don't allow hardlinks to those...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reiserfs is currently able to be deadlocked by having two NFS clients
where one has removed and recreated a file and another is accessing the
file with an open file handle.
If one client deletes and recreates a file with timing such that the
recreated file obtains the same [dirid, objectid] pair as the original
file while another client accesses the file via file handle, the create
and lookup can race and deadlock if the lookup manages to create the
in-memory inode first.
The create thread, in insert_inode_locked4, will hold the write lock
while waiting on the other inode to be unlocked. The lookup thread,
anywhere in the iget path, will release and reacquire the write lock while
it schedules. If it needs to reacquire the lock while the create thread
has it, it will never be able to make forward progress because it needs
to reacquire the lock before ultimately unlocking the inode.
This patch drops the write lock across the insert_inode_locked4 call so
that the ordering of inode_wait -> write lock is retained. Since this
would have been the case before the BKL push-down, this is safe.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
reiserfs_chown_xattrs() takes the iattr struct passed into ->setattr
and uses it to iterate over all the attrs associated with a file to change
ownership of xattrs (and transfer quota associated with the xattr files).
When the setuid bit is cleared during chown, ATTR_MODE and iattr->ia_mode
are passed to all the xattrs as well. This means that the xattr directory
will have S_IFREG added to its mode bits.
This has been prevented in practice by a missing IS_PRIVATE check
in reiserfs_acl_chmod, which caused a double-lock to occur while holding
the write lock. Since the file system was completely locked up, the
writeout of the corrupted mode never happened.
This patch temporarily clears everything but ATTR_UID|ATTR_GID for the
calls to reiserfs_setattr and adds the missing IS_PRIVATE check.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
After sleeping for filldir(), we check to see if the file system has
changed and research. The next_pos pointer is updated but its value
isn't pushed into the key used for the search itself. As a result,
the search returns the same item that the last cycle of the loop did
and filldir() is called multiple times with the same data.
The end result is that the buffer can contain the same name multiple
times. This can be returned to userspace or used internally in the
xattr code where it can manifest with the following warning:
jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2)
reiserfs_for_each_xattr uses reiserfs_readdir_dentry to iterate over
the xattr names and ends up trying to unlink the same name twice. The
second attempt fails with -ENOENT and the error is returned. At some
point I'll need to add support into reiserfsck to remove the orphaned
directories left behind when this occurs.
The fix is to push the value into the key before researching.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
same story as with the previous patches - note that return
value of blkdev_close() is lost, since there's nowhere the
caller (__fput()) could return it to.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Don't access the proc_dir_entry in ReiserFS's r_open(), r_start() r_show()
procfs interface functions.
ReiserFS stores the ->show() method pointer in PDE->data and the super_block
pointer in PDE->parent->data. This isn't changing.
Currently, ReiserFS passes the PDE pointer into seq_file::private from
r_open() so that r_start() and r_show() can then access it. Instead, use
seq_open_private() to allocate a two-pointer struct that's passed through
seq_file::private and put the ->show() method and the sb pointers in there.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After commit 21d8a15a (lookup_one_len: don't accept . and ..) reiserfs
started failing to delete xattrs from inode. This was due to a buggy
test for '.' and '..' in fill_with_dentries() which resulted in passing
'.' and '..' entries to lookup_one_len() in some cases. That returned
error and so we failed to iterate over all xattrs of and inode.
Fix the test in fill_with_dentries() along the lines of the one in
lookup_one_len().
Reported-by: Pawel Zawora <pzawora@gmail.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Calls into highlevel quota code cannot happen under the write lock. These
calls take dqio_mutex which ranks above write lock. So drop write lock
before calling back into quota code.
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
Calls into reiserfs journalling code and reiserfs_get_block() need to
be protected with write lock. We remove write lock around calls to high
level quota code in the next patch so these paths would suddently become
unprotected.
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>