Commit Graph

37 Commits

Author SHA1 Message Date
Al Viro
bd2a31d522 get rid of fget_light()
instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:42 -04:00
Linus Torvalds
9c225f2655 vfs: atomic f_pos accesses as per POSIX
Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
Al Viro
2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro
a5b470ba06 new helpers: fdget()/fdput()
Signed-off-bs: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:16:32 -04:00
Al Viro
ad47bd7252 make expand_files() and alloc_fd() static
no callers outside of fs/file.c left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:58 -04:00
Al Viro
8280d16172 new helper: replace_fd()
analog of dup2(), except that it takes struct file * as source.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:57 -04:00
Al Viro
fe17f22d7f take purely descriptor-related stuff from fcntl.c to file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:57 -04:00
Al Viro
1a7bd2265f make get_unused_fd_flags() a function
... and get_unused_fd() a macro around it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:50 -04:00
Al Viro
4a9d4b024a switch fput to task_work_add
... and schedule_work() for interrupt/kernel_thread callers
(and yes, now it *is* OK to call from interrupt).

We are guaranteed that __fput() will be done before we return
to userland (or exit).  Note that for fput() from a kernel
thread we get an async behaviour; it's almost always OK, but
sometimes you might need to have __fput() completed before
you do anything else.  There are two mechanisms for that -
a general barrier (flush_delayed_fput()) and explicit
__fput_sync().  Both should be used with care (as was the
case for fput() from kernel threads all along).  See comments
in fs/file_table.c for details.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:58 +04:00
Al Viro
b57ce9694e vfs: drop_file_write_access() made static
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:32 -04:00
Al Viro
1abf0c718f New kind of open files - "location only".
New flag for open(2) - O_PATH.  Semantics:
	* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
	* almost all operations on the resulting descriptors shall
fail with -EBADF.  Exceptions are:
	1) operations on descriptors themselves (i.e.
		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
		fcntl(fd, F_SETFD, ...))
	2) fcntl(fd, F_GETFL), for a common non-destructive way to
		check if descriptor is open
	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
		points of pathname resolution
	* closing such descriptor does *NOT* affect dnotify or
posix locks.
	* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself.  Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).

fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them.  That protects
existing code from dealing with those things.

There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Steven Rostedt
c2b3e74b78 fs: Remove unlikely() from fput_light()
In fput_light(), there's an unlikely(fput_needed), which running on
my normal desktop doing firefox, xchat, evolution and part of my distcc farm,
and running the annotate branch profiler shows that the unlikely is not
very unlikely.

 correct incorrect  %        Function             File              Line
 ------- ---------  -        --------             ----              ----
       0       48 100 fput_light                file.h               26
115828710 897415279  88 fput_light              file.h               26
865271179 5286128445  85 fput_light             file.h               26
19568539  8923664  31 fput_light                file.h               26
12353677  3562279  22 fput_light                file.h               26
  267691    67062  20 fput_light                file.h               26
15014853   348172   2 fput_light                file.h               26
  209258      205   0 fput_light                file.h               26
 1364164        0   0 fput_light                file.h               26

Which gives 1032903812 times it was correct and 6203351846 times it was
incorrect, or 85% incorrect.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 03:26:26 -05:00
Al Viro
d7065da038 get rid of the magic around f_count in aio
__aio_put_req() plays sick games with file refcount.  What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one.  Current code decrements f_count and if it hasn't
hit 0, everything is fine.  Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test().  IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was.  And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic().  Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that.  aio.c converted to it, __fput()
use is gone.  req->ki_file *always* contributes to refcount
now.  And __fput() became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:03:07 -04:00
Al Viro
2c48b9c455 switch alloc_file() to passing struct path
... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16 12:16:42 -05:00
Al Viro
3d1e463158 get rid of init_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16 12:16:42 -05:00
Al Viro
aeb5d72706 [PATCH] introduce fmode_t, do annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:47:06 -04:00
Al Viro
1027abe882 [PATCH] merge locate_fd() and get_unused_fd()
New primitive: alloc_fd(start, flags).  get_unused_fd() and
get_unused_fd_flags() become wrappers on top of it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:23 -04:00
Al Viro
9f3acc3140 [PATCH] split linux/file.h
Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:16 -04:00
Al Viro
3b1253880b [PATCH] sanitize unshare_files/reset_files_struct
* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
  caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
  unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
  it will break the system if somebody ever calls it for anything
  else (we can't modify ->files of somebody else).  Lose the
  task_struct * argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-25 09:23:59 -04:00
Dave Hansen
aceaf78da9 [PATCH] r/o bind mounts: create helper to drop file write access
If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

AV: drop write access in __fput() only after we evict from file list.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:32 -04:00
Harvey Harrison
b3c9752868 include/linux: Remove all users of FASTCALL() macro
FASTCALL() is always expanded to empty, remove it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Dave Hansen
ce8d2cdf3d r/o bind mounts: filesystem helpers for custom 'struct file's
Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.  In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of filesystems
writable.  It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems.  This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated.  I've been using the following script to test that the feature is
working as desired.  It takes a directory and makes a regular bind and a r/o
bind mount of it.  It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Ulrich Drepper
4a19542e5f O_CLOEXEC for SCM_RIGHTS
Part two in the O_CLOEXEC saga: adding support for file descriptors received
through Unix domain sockets.

The patch is once again pretty minimal, it introduces a new flag for recvmsg
and passes it just like the existing MSG_CMSG_COMPAT flag.  I think this bit
is not used otherwise but the networking people will know better.

This new flag is not recognized by recvfrom and recv.  These functions cannot
be used for that purpose and the asymmetry this introduces is not worse than
the already existing MSG_CMSG_COMPAT situations.

The patch must be applied on the patch which introduced O_CLOEXEC.  It has to
remove static from the new get_unused_fd_flags function but since scm.c cannot
live in a module the function still hasn't to be exported.

Here's a test program to make sure the code works.  It's so much longer than
the actual patch...

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#ifndef O_CLOEXEC
# define O_CLOEXEC 02000000
#endif
#ifndef MSG_CMSG_CLOEXEC
# define MSG_CMSG_CLOEXEC 0x40000000
#endif

int
main (int argc, char *argv[])
{
  if (argc > 1)
    {
      int fd = atol (argv[1]);
      printf ("child: fd = %d\n", fd);
      if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
        {
          puts ("file descriptor valid in child");
          return 1;
        }
      return 0;

    }

  struct sockaddr_un sun;
  strcpy (sun.sun_path, "./testsocket");
  sun.sun_family = AF_UNIX;

  char databuf[] = "hello";
  struct iovec iov[1];
  iov[0].iov_base = databuf;
  iov[0].iov_len = sizeof (databuf);

  union
  {
    struct cmsghdr hdr;
    char bytes[CMSG_SPACE (sizeof (int))];
  } buf;
  struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1,
                        .msg_control = buf.bytes,
                        .msg_controllen = sizeof (buf) };
  struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);

  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));

  msg.msg_controllen = cmsg->cmsg_len;

  pid_t child = fork ();
  if (child == -1)
    error (1, errno, "fork");
  if (child == 0)
    {
      int sock = socket (PF_UNIX, SOCK_STREAM, 0);
      if (sock < 0)
        error (1, errno, "socket");

      if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
        error (1, errno, "bind");
      if (listen (sock, SOMAXCONN) < 0)
        error (1, errno, "listen");

      int conn = accept (sock, NULL, NULL);
      if (conn == -1)
        error (1, errno, "accept");

      *(int *) CMSG_DATA (cmsg) = sock;
      if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0)
        error (1, errno, "sendmsg");

      return 0;
    }

  /* For a test suite this should be more robust like a
     barrier in shared memory.  */
  sleep (1);

  int sock = socket (PF_UNIX, SOCK_STREAM, 0);
  if (sock < 0)
    error (1, errno, "socket");

  if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
    error (1, errno, "connect");
  unlink (sun.sun_path);

  *(int *) CMSG_DATA (cmsg) = -1;

  if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0)
    error (1, errno, "recvmsg");

  int fd = *(int *) CMSG_DATA (cmsg);
  if (fd == -1)
    error (1, 0, "no descriptor received");

  char fdname[20];
  snprintf (fdname, sizeof (fdname), "%d", fd);
  execl ("/proc/self/exe", argv[0], fdname, NULL);
  puts ("execl failed");
  return 1;
}

[akpm@linux-foundation.org: Fix fastcall inconsistency noted by Michael Buesch]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Michael Buesch <mb@bu3sch.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:45 -07:00
Vadim Lobanov
01b2d93ca4 [PATCH] fdtable: Provide free_fdtable() wrapper
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel.  The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage.  Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:50 -08:00
Vadim Lobanov
5466b456ed [PATCH] fdtable: Implement new pagesize-based fdtable allocator
This patch provides an improved fdtable allocation scheme, useful for
expanding fdtable file descriptor entries.  The main focus is on the fdarray,
as its memory usage grows 128 times faster than that of an fdset.

The allocation algorithm sizes the fdarray in such a way that its memory usage
increases in easy page-sized chunks. The overall algorithm expands the allowed
size in powers of two, in order to amortize the cost of invoking vmalloc() for
larger allocation sizes. Namely, the following sizes for the fdarray are
considered, and the smallest that accommodates the requested fd count is
chosen:

    pagesize / 4
    pagesize / 2
    pagesize      <- memory allocator switch point
    pagesize * 2
    pagesize * 4
    ...etc...

Unlike the current implementation, this allocation scheme does not require a
loop to compute the optimal fdarray size, and can be done in efficient
straightline code.

Furthermore, since the fdarray overflows the pagesize boundary long before any
of the fdsets do, it makes sense to optimize run-time by allocating both
fdsets in a single swoop.  Even together, they will still be, by far, smaller
than the fdarray.  The fdtable->open_fds is now used as the anchor for the
fdset memory allocation.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00