Commit Graph

41 Commits

Author SHA1 Message Date
Linus Torvalds
9883035ae7 pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org  # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-29 13:12:42 -07:00
Muthu Kumar
b502bd1152 magic.h: move some FS magic numbers into magic.h
- Move open-coded filesystem magic numbers into magic.h

- Rearrange magic.h so that the filesystem-related constants are grouped
  together.

Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:31 -07:00
Randy Dunlap
0dc1488527 pipe_fs_i.h: fix kernel-doc warning
Fix kernel-doc notation warnings in pipe_fs_i.h:

  Warning(include/linux/pipe_fs_i.h:58): No description found for parameter 'buffers'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10 07:38:54 -08:00
Linus Torvalds
7208364652 Un-inline get_pipe_info() helper function
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-28 16:27:19 -08:00
Linus Torvalds
c66fb34794 Export 'get_pipe_info()' to other users
And in particular, use it in 'pipe_fcntl()'.

The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.

The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called.  But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.

Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-28 14:09:57 -08:00
Jens Axboe
ff9da691c0 pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
This changes the interface to be based on bytes instead. The API
matches that of F_SETPIPE_SZ in that it rounds up the passed in
size so that the resulting page array is a power-of-2 in size.

The proc file is renamed to /proc/sys/fs/pipe-max-size to
reflect this change.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-03 14:54:39 +02:00
Jens Axboe
b492e95be0 pipe: set lower and upper limit on max pages in the pipe page array
We need at least two to guarantee proper POSIX behaviour, so
never allow a smaller limit than that.

Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows
root to define a sane upper limit. Make it default to 16 times the
default size, which is 16 pages.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:12:52 +02:00
Jens Axboe
35f3d14dbb pipe: add support for shrinking and growing pipes
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:12:40 +02:00
Miklos Szeredi
6818173bd6 splice: implement default splice_read method
If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems.  This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
61e0d47c33 splice: add helpers for locking pipe inode
There are lots of sequences like this, especially in splice code:

	if (pipe->inode)
		mutex_lock(&pipe->inode->i_mutex);
	/* do something */
	if (pipe->inode)
		mutex_unlock(&pipe->inode->i_mutex);

so introduce helpers which do the conditional locking and unlocking.
Also replace the inode_double_lock() call with a pipe_double_lock()
helper to avoid spreading the use of this functionality beyond the
pipe code.

This patch is just a cleanup, and should cause no behavioral changes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Jens Axboe
0845718daf pipe: add documentation and comments
As per Andrew Mortons request, here's a set of documentation for
the generic pipe_buf_operations hooks, the pipe, and pipe_buffer
structures.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:16 +02:00
Jens Axboe
cac36bb06e pipe: change the ->pin() operation to ->confirm()
The name 'pin' was badly chosen, it doesn't pin a pipe buffer
in the most commonly used sense in the kernel. So change the
name to 'confirm', after debating this issue with Hugh
Dickins a bit.

A good return from ->confirm() means that the buffer is really
there, and that the contents are good.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:15 +02:00
Jens Axboe
497f9625c2 pipe: allow passing around of ops private pointer
relay needs this for proper consumption handling, and the network
receive support needs it as well to lookup the sk_buff on pipe
release.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Jens Axboe
d6b29d7cee splice: divorce the splice structure/function definitions from the pipe header
We need to move even more stuff into the header so that folks can use
the splice_to_pipe() implementation instead of open-coding a lot of
pipe knowledge (see relay implementation), so move to our own header
file finally.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Jens Axboe
130610d6f6 splice: add void cookie to the actor data
We need that for passing driver private info.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Jens Axboe
6a14b90bb6 vmsplice: add vmsplice-to-user support
A bit of a cheat, it actually just copies the data to userspace. But
this makes the interface nice and symmetric and enables people to build
on splice, with room for future improvement in performance.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:12 +02:00
Jens Axboe
c66ab6fa70 splice: abstract out actor data
For direct splicing (or private splicing), the output may not be a file.
So abstract out the handling into a specified actor function and put
the data in the splice_desc structure earlier, so we can build on top
of that.

This is the first step in better splice handling for drivers, and also
for implementing vmsplice _to_ user memory.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:12 +02:00
Jens Axboe
17374ff1aa pipe: move pipe_inode_info structure decleration up before it's used
There's really no reason it's below the first use of the pointer
type, and it'll fail compilation for the network addition (for good
reason). So move it up a bit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-06-08 08:33:53 +02:00
Mark Fasheh
40bee44eae Export __splice_from_pipe()
Ocfs2 wants to implement it's own splice write actor so that it can better
manage cluster / page locks. This lets us re-use the rest of splice write
while only providing our own code where it's actually important.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:55:47 +02:00
Eric Dumazet
6a8ba9d121 [PATCH] reorder struct pipe_buf_operations
Fields of struct pipe_buf_operations have not a precise layout (ie not
optimized to fit cache lines nor reduce cache line ping pongs)

The bufs[] array is *large* and is placed near the beginning of the
structure, so all following fields have a large offset.  This is
unfortunate because many archs have smaller instructions when using small
offsets relative to a base register.  On x86 for example, 7 bits offsets
have smaller instruction lengths.

Moving bufs[] at the end of pipe_buf_operations permits all fields to have
small offsets, and reduce text size, and icache pressure.

# size vmlinux.pre vmlinux
    text    data     bss     dec     hex filename
3268989  664356  492196 4425541  438745 vmlinux.pre
3268765  664356  492196 4425317  438665 vmlinux

So this patch reduces text size by 224 bytes on my x86_64 machine. Similar
results on ia32.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Eric Dumazet
d4c3cca941 [PATCH] constify pipe_buf_operations
- pipe/splice should use const pipe_buf_operations and file_operations

- struct pipe_inode_info has an unused field "start" : get rid of it.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Jens Axboe
1432873af7 [PATCH] splice: LRU fixups
Nick says that the current construct isn't safe. This goes back to the
original, but sets PIPE_BUF_FLAG_LRU on user pages as well as they all
seem to be on the LRU in the first place.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-05-04 06:55:12 +02:00
Jens Axboe
330ab71619 [PATCH] vmsplice: restrict stealing a little more
Apply the same rules as the anon pipe pages, only allow stealing
if no one else is using the page.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-05-02 15:29:57 +02:00
Jens Axboe
a893b99be7 [PATCH] splice: fix page LRU accounting
Currently we rely on the PIPE_BUF_FLAG_LRU flag being set correctly
to know whether we need to fiddle with page LRU state after stealing it,
however for some origins we just don't know if the page is on the LRU
list or not.

So remove PIPE_BUF_FLAG_LRU and do this check/add manually in pipe_to_file()
instead.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-05-02 15:03:27 +02:00
Jens Axboe
7afa6fd037 [PATCH] vmsplice: allow user to pass in gift pages
If SPLICE_F_GIFT is set, the user is basically giving this pages away to
the kernel. That means we can steal them for eg page cache uses instead
of copying it.

The data must be properly page aligned and also a multiple of the page size
in length.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-05-01 20:02:33 +02:00