The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Move open-coded filesystem magic numbers into magic.h
- Rearrange magic.h so that the filesystem-related constants are grouped
together.
Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc notation warnings in pipe_fs_i.h:
Warning(include/linux/pipe_fs_i.h:58): No description found for parameter 'buffers'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This avoids some include-file hell, and the function isn't really
important enough to be inlined anyway.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And in particular, use it in 'pipe_fcntl()'.
The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.
The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called. But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the interface to be based on bytes instead. The API
matches that of F_SETPIPE_SZ in that it rounds up the passed in
size so that the resulting page array is a power-of-2 in size.
The proc file is renamed to /proc/sys/fs/pipe-max-size to
reflect this change.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We need at least two to guarantee proper POSIX behaviour, so
never allow a smaller limit than that.
Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows
root to define a sane upper limit. Make it default to 16 times the
default size, which is 16 pages.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.
This will allow splice and functions using splice, such as the loop
device, to work on all filesystems. This includes "direct_io" files
in fuse which bypass the page cache.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There are lots of sequences like this, especially in splice code:
if (pipe->inode)
mutex_lock(&pipe->inode->i_mutex);
/* do something */
if (pipe->inode)
mutex_unlock(&pipe->inode->i_mutex);
so introduce helpers which do the conditional locking and unlocking.
Also replace the inode_double_lock() call with a pipe_double_lock()
helper to avoid spreading the use of this functionality beyond the
pipe code.
This patch is just a cleanup, and should cause no behavioral changes.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As per Andrew Mortons request, here's a set of documentation for
the generic pipe_buf_operations hooks, the pipe, and pipe_buffer
structures.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The name 'pin' was badly chosen, it doesn't pin a pipe buffer
in the most commonly used sense in the kernel. So change the
name to 'confirm', after debating this issue with Hugh
Dickins a bit.
A good return from ->confirm() means that the buffer is really
there, and that the contents are good.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
relay needs this for proper consumption handling, and the network
receive support needs it as well to lookup the sk_buff on pipe
release.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We need to move even more stuff into the header so that folks can use
the splice_to_pipe() implementation instead of open-coding a lot of
pipe knowledge (see relay implementation), so move to our own header
file finally.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
A bit of a cheat, it actually just copies the data to userspace. But
this makes the interface nice and symmetric and enables people to build
on splice, with room for future improvement in performance.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For direct splicing (or private splicing), the output may not be a file.
So abstract out the handling into a specified actor function and put
the data in the splice_desc structure earlier, so we can build on top
of that.
This is the first step in better splice handling for drivers, and also
for implementing vmsplice _to_ user memory.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's really no reason it's below the first use of the pointer
type, and it'll fail compilation for the network addition (for good
reason). So move it up a bit.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Ocfs2 wants to implement it's own splice write actor so that it can better
manage cluster / page locks. This lets us re-use the rest of splice write
while only providing our own code where it's actually important.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fields of struct pipe_buf_operations have not a precise layout (ie not
optimized to fit cache lines nor reduce cache line ping pongs)
The bufs[] array is *large* and is placed near the beginning of the
structure, so all following fields have a large offset. This is
unfortunate because many archs have smaller instructions when using small
offsets relative to a base register. On x86 for example, 7 bits offsets
have smaller instruction lengths.
Moving bufs[] at the end of pipe_buf_operations permits all fields to have
small offsets, and reduce text size, and icache pressure.
# size vmlinux.pre vmlinux
text data bss dec hex filename
3268989 664356 492196 4425541 438745 vmlinux.pre
3268765 664356 492196 4425317 438665 vmlinux
So this patch reduces text size by 224 bytes on my x86_64 machine. Similar
results on ia32.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- pipe/splice should use const pipe_buf_operations and file_operations
- struct pipe_inode_info has an unused field "start" : get rid of it.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick says that the current construct isn't safe. This goes back to the
original, but sets PIPE_BUF_FLAG_LRU on user pages as well as they all
seem to be on the LRU in the first place.
Signed-off-by: Jens Axboe <axboe@suse.de>
Currently we rely on the PIPE_BUF_FLAG_LRU flag being set correctly
to know whether we need to fiddle with page LRU state after stealing it,
however for some origins we just don't know if the page is on the LRU
list or not.
So remove PIPE_BUF_FLAG_LRU and do this check/add manually in pipe_to_file()
instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
If SPLICE_F_GIFT is set, the user is basically giving this pages away to
the kernel. That means we can steal them for eg page cache uses instead
of copying it.
The data must be properly page aligned and also a multiple of the page size
in length.
Signed-off-by: Jens Axboe <axboe@suse.de>