This allocation can be as large as 64k. As David points out, "falling
back to vmalloc here is much better solution than failing to retreive
the attribute - it will work no matter how fragmented memory gets. That
means we don't get incomplete backups occurring after days or months of
uptime and successful backups".
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
vfs_getxattr_alloc() and vfs_xattr_cmp() are two new kernel xattr helper
functions. vfs_getxattr_alloc() first allocates memory for the requested
xattr and then retrieves it. vfs_xattr_cmp() compares a given value with
the contents of an extended attribute.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.
Why xattr lookup on every write I hear you ask?
write wants to drop suid and security related xattrs that could set o
capabilities for executables. To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.
In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.
Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.
This patch adds a simple per inode to avoid this problem. We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.
I also used the same flag to avoid the suid check, although
that one is pretty cheap.
A file system can also set this flag when it creates the inode,
if it has a cheap way to do so. This is done for some common file systems
in followon patches.
With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.
v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Return -ENODATA when trying to read a user.* attribute which cannot
exist: user space otherwise does not have a reasonable way to
distinguish between non-existent and inaccessible attributes.
Likewise, return -ENODATA when an unprivileged process tries to read a
trusted.* attribute: to unprivileged processes, those attributes are
invisible (listxattr() won't include them).
Related to this bug report: https://bugzilla.redhat.com/660613
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For some reason generic_setxattr() did not pass flags (XATTR_CREATE,
XATTR_REPLACE) to the filesystem specific helper. This caused that
setxattr(2) syscall just ignored these flags.
Fix the bug by passing flags correctly.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And give it a kernel-doc comment.
[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The entries in xattr handler table should be immutable (ie const)
like other operation tables.
Later patches convert common filesystems. Uncoverted filesystems
will still work, but will generate a compiler warning.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods. This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute. With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.
Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.
[with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This factors out the part of the vfs_setxattr function that performs the
setting of the xattr and its notification. This is needed so the SELinux
implementation of inode_setsecctx can handle the setting of the xattr while
maintaining the proper separation of layers.
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.
Before:
avg = 462.286
std = 5.46106
After:
avg = 453.12
std = 9.58257
(50 runs of each, stddev gives a reasonable confidence)
It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.
After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).
[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
We used to have rather schizophrenic set of checks for NULL ->i_op even
though it had been eliminated years ago. You'd need to go out of your
way to set it to NULL explicitly _and_ a bunch of code would die on
such inodes anyway. After killing two remaining places that still
did that bogosity, all that crap can go away.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* do not pass nameidata; struct path is all the callers want.
* switch to new helpers:
user_path_at(dfd, pathname, flags, &path)
user_path(pathname, &path)
user_lpath(pathname, &path)
user_path_dir(pathname, &path) (fail if not a directory)
The last 3 are trivial macro wrappers for the first one.
* remove nameidata in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>