Split inode_permission() into inode- and superblock-dependent parts.
This is aimed at union mounts, where the superblock from the upper layer has
to be checked rather than the superblock from the lower layer, as the upper
layer may be writable, thus allowing an unwritable file from the lower layer
to be copied up and modified.
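For illustration, a minimal sketch of how such a split might look (the helper
names __inode_permission() and sb_permission() are assumptions here, and the
inode-dependent half is reduced to a stand-in call):

static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
{
	if (unlikely(mask & MAY_WRITE)) {
		umode_t mode = inode->i_mode;

		/* Nobody gets write access to a read-only filesystem. */
		if ((sb->s_flags & MS_RDONLY) &&
		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
			return -EROFS;
	}
	return 0;
}

static int __inode_permission(struct inode *inode, int mask)
{
	/* Nobody gets write access to an immutable file. */
	if (unlikely(mask & MAY_WRITE) && IS_IMMUTABLE(inode))
		return -EACCES;

	/* Stand-in for the full ->permission()/ACL and security checks. */
	return generic_permission(inode, mask);
}

int inode_permission(struct inode *inode, int mask)
{
	int retval;

	/*
	 * A union mount would pass the upper (writable) layer's sb here
	 * instead of inode->i_sb.
	 */
	retval = sb_permission(inode->i_sb, inode, mask);
	if (retval)
		return retval;
	return __inode_permission(inode, mask);
}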
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com> (Further development)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just pass struct file *. Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int. Next: saner prototypes for parts in
namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All users of open intents have been converted to use ->atomic_{open,create}.
This patch gets rid of nd->intent.open and related infrastructure.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new inode operation which is called on the last component of an open.
Using this the filesystem can look up, possibly create and open the file in one
atomic operation. If it cannot do so (e.g. because the file type turned out to
be wrong), it may signal this by returning NULL instead of an open struct file
pointer.
i_op->atomic_open() is only called if the last component is negative or needs
lookup. Handling cached positive dentries here doesn't add much value: these
can be opened using f_op->open(). If the cached file turns out to be invalid,
the open can be retried, this time using ->atomic_open() with a fresh dentry.
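A rough sketch of that decision (hypothetical helper; the real test lives in
namei.c and the need-lookup state is tracked differently):

static bool want_atomic_open(struct inode *dir, struct dentry *dentry,
			     bool need_lookup)
{
	if (!dir->i_op->atomic_open)
		return false;

	/*
	 * Only negative or not-yet-looked-up dentries go through
	 * ->atomic_open(); cached positive dentries are opened with
	 * f_op->open() and retried here only if they prove invalid.
	 */
	return !dentry->d_inode || need_lookup;
}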
For now leave the old way of using open intents in lookup and revalidate in
place. This will be removed once all the users are converted.
David Howells noticed that if ->atomic_open() opens the file but does not create
it, handle_truncate() will be called on it even if it is not a regular file.
Fix this by checking the file type in this case too.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
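A sketch of the idea (the sentinel and helper names below are assumptions):

/* Non-NULL, but guaranteed never to alias a real struct mnt_namespace. */
#define MNT_NS_INTERNAL		ERR_PTR(-EINVAL)

static inline bool mnt_ns_fast_path_ok(struct mount *mnt)
{
	/*
	 * Either attached to a real namespace or marked as a kernel-internal
	 * mount: in both cases mntput_no_expire() may take the fast path.
	 */
	return mnt->mnt_ns != NULL;
}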
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split __dentry_open() into two functions:
do_dentry_open() - does most of the actual work, doesn't put file on failure
open_check_o_direct() - after a successful open, checks direct_IO method
This will allow i_op->atomic_open to do just the file initialization and leave
the direct_IO checking to the VFS.
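Roughly how the two halves compose afterwards (signatures simplified, not the
exact in-tree prototypes):

static int finish_open_sketch(struct file *f, struct dentry *dentry,
			      struct vfsmount *mnt, const struct cred *cred)
{
	int error;

	/*
	 * The bulk of the old __dentry_open(): set up f_mapping, f_mode,
	 * f_op and call ->open().  On failure the file is NOT put here;
	 * that is left to the caller, which is what lets ->atomic_open()
	 * use it for bare file initialization.
	 */
	error = do_dentry_open(f, dentry, mnt, cred);
	if (error)
		return error;

	/* Separate step: O_DIRECT is only valid if ->direct_IO is provided. */
	return open_check_o_direct(f);
}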
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier to maintain than a macro forest.
This will also make it possible later to dynamically allocate lglocks and
to use them in modules (both would still need some additional, but now
straightforward, code)
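Roughly what the shared data structure and API end up as (lockdep bookkeeping
omitted):

#include <linux/spinlock.h>
#include <linux/percpu.h>

struct lglock {
	arch_spinlock_t __percpu *lock;
};

#define DEFINE_LGLOCK(name)						\
	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
		= __ARCH_SPIN_LOCK_UNLOCKED;				\
	struct lglock name = { .lock = &name ## _lock }

void lg_local_lock(struct lglock *lg);		/* this CPU's lock only */
void lg_local_unlock(struct lglock *lg);
void lg_global_lock(struct lglock *lg);		/* every CPU's lock */
void lg_global_unlock(struct lglock *lg);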
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently remounting a superblock read-only is racy in a major way.
With the per mount read-only infrastructure it is now possible to
prevent most races, which this patch attempts.
Before starting the remount read-only, iterate through all mounts
belonging to the superblock and if none of them have any pending
writes, set sb->s_readonly_remount. This indicates that remount is in
progress and no further write requests are allowed. If the remount
succeeds set MS_RDONLY and reset s_readonly_remount.
If the remounting is unsuccessful just reset s_readonly_remount.
This can result in transient EROFS errors, despite the fact that the
remount failed. Unfortunately, holding off writes is difficult as
remount itself may touch the filesystem (e.g. through load_nls())
which would deadlock.
A later patch deals with delayed writes due to nlink going to zero.
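The intended flow, as a simplified sketch (sb_has_pending_writes() and
do_fs_remount() are hypothetical stand-ins for the per-mount write counters
and the filesystem's own ->remount_fs work; locking elided):

static int remount_ro_sketch(struct super_block *sb)
{
	int err;

	/* Phase 1: only start if no mount of this sb has pending writes. */
	if (sb_has_pending_writes(sb))
		return -EBUSY;
	sb->s_readonly_remount = 1;	/* new write requests now get -EROFS */

	/* Phase 2: the filesystem's own remount work (may touch the fs). */
	err = do_fs_remount(sb);

	/* Phase 3: commit on success, roll back on failure. */
	if (!err)
		sb->s_flags |= MS_RDONLY;
	sb->s_readonly_remount = 0;
	return err;
}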
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
some stuff in there can actually become static; some belongs to pnode.h
as it's a private interface between namespace.c and pnode.c...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.
pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
rename it to grab_super_passive() and export it via fs/internal.h
so that all the VFS code can use it.
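A simplified rendition of what this amounts to (the s_instances check and a
few other details are omitted; put_super() is the fs/super.c-internal helper
that drops s_count again):

bool grab_super_passive(struct super_block *sb)
{
	/* Passive reference: pins struct super_block, not the mount. */
	spin_lock(&sb_lock);
	sb->s_count++;
	spin_unlock(&sb_lock);

	/* Only proceed if no unmount is in progress. */
	if (down_read_trylock(&sb->s_umount)) {
		if (sb->s_root)
			return true;
		up_read(&sb->s_umount);
	}

	put_super(sb);
	return false;
}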
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for the parent != NULL case; it uses __d_alloc() and
inserts the result into the list of the parent's children.
Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
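Roughly how d_alloc() ends up looking on top of the new helper (simplified):

struct dentry *d_alloc(struct dentry *parent, const struct qstr *name)
{
	struct dentry *dentry = __d_alloc(parent->d_sb, name);
	if (!dentry)
		return NULL;

	spin_lock(&parent->d_lock);
	__dget_dlock(parent);		/* child holds a ref on its parent */
	dentry->d_parent = parent;
	list_add(&dentry->d_u.d_child, &parent->d_subdirs);
	spin_unlock(&parent->d_lock);

	return dentry;
}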
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.
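The resulting locking pattern for list walkers, as a hypothetical sketch:

static void walk_wb_list_sketch(struct bdi_writeback *wb)
{
	struct inode *inode;

	spin_lock(&inode_wb_list_lock);		/* protects the list itself */
	list_for_each_entry(inode, &wb->b_dirty, i_wb_list) {
		spin_lock(&inode->i_lock);	/* protects this inode's state */
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		/* ... operate on a validated inode here ... */
		spin_unlock(&inode->i_lock);
	}
	spin_unlock(&inode_wb_list_lock);
}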
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protect the per-sb inode list with a new global lock
inode_sb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lookup_mnt() is only used in the core fs routines now, so it doesn't need to
be globally declared anymore. It isn't exported to modules at the moment, so
nothing that can be modularised seems to be using it.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new function: mount_fs(). Does all work done by vfs_kern_mount()
except the allocation and filling of vfsmount; returns root dentry
or ERR_PTR().
vfs_kern_mount() switched to using it and taken to fs/namespace.c,
along with its wrappers.
alloc_vfsmnt()/free_vfsmnt() made static.
functions in namespace.c slightly reordered.
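Roughly how vfs_kern_mount() composes with the new helper (error handling and
flag details trimmed):

struct vfsmount *vfs_kern_mount(struct file_system_type *type, int flags,
				const char *name, void *data)
{
	struct vfsmount *mnt = alloc_vfsmnt(name);
	struct dentry *root;

	if (!mnt)
		return ERR_PTR(-ENOMEM);

	root = mount_fs(type, flags, name, data);  /* root dentry or ERR_PTR() */
	if (IS_ERR(root)) {
		free_vfsmnt(mnt);
		return ERR_CAST(root);
	}

	mnt->mnt_root = root;
	mnt->mnt_sb = root->d_sb;
	mnt->mnt_mountpoint = mnt->mnt_root;
	mnt->mnt_parent = mnt;
	return mnt;
}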
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new function: file_open_root(dentry, mnt, name, flags) opens the file
vfs_path_lookup would arrive at.
Note that name can be empty; in that case the usual requirement that
dentry should be a directory is lifted.
open-coded equivalents switched to it; may_open() got down to exactly
one caller and became static.
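A hypothetical caller, for illustration:

static struct file *open_below_root(struct path *root, const char *name)
{
	/*
	 * With an empty name this reopens the root object itself, which
	 * no longer has to be a directory.
	 */
	return file_open_root(root->dentry, root->mnt, name, O_RDONLY);
}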
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
take the calculation of open_flags from open(2) arguments into a new helper
in fs/open.c, move filp_open() over there, have it and do_sys_open()
use that helper, switch exec.c callers of do_filp_open() to explicit
(and constant) struct open_flags.
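Roughly how filp_open() looks on top of the helper (assuming the helper is
called build_open_flags() and fills a struct open_flags plus lookup flags):

struct file *filp_open(const char *filename, int flags, int mode)
{
	struct open_flags op;
	int lookup = build_open_flags(flags, mode, &op);

	return do_filp_open(AT_FDCWD, filename, &op, lookup);
}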
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change), so any
data we hold becomes irrelevant.
In the other, the device has changed size (check_disk_size_change),
so data we hold may be irrelevant.
In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.
In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it does *not* make sense to discard dirty buffers,
as that will lead to filesystem corruption when you simply enlarge
the containing device.
flush_disk calls __invalidate_device.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.
invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.
invalidate_bdev does *not* discard dirty pages, but I don't really care
about that at present.
So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.
flush_disk then passes true from check_disk_change and false from
check_disk_size_change.
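The resulting plumbing, sketched with illustrative prototypes:

int __invalidate_device2(struct block_device *bdev, bool kill_dirty);

static void flush_disk_sketch(struct block_device *bdev, bool media_gone)
{
	/*
	 * check_disk_change():      media_gone == true  -> kill dirty state
	 * check_disk_size_change(): media_gone == false -> keep dirty state
	 */
	if (__invalidate_device2(bdev, media_gone))
		pr_warn("busy inodes while invalidating changed device\n");
}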
dm avoids tripping over this problem by calling i_size_write directly
rather than using check_disk_size_change.
md does use check_disk_size_change and so is affected.
This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.
Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>