Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer. Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.
Uninitialized flags appears to have been there from the introduction of
the splice syscall.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
caused a regression when there were no more readers left on a pipe that
was being spliced into: rather than the expected SIGPIPE and -EPIPE
return value, the writer would end up waiting forever for space to free
up (which obviously was not going to happen with no readers around).
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Debugged-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v4.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Botched calculation of number of pages. As the result,
we were dropping pieces when doing splice to pipe from
e.g. 9p.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
i_size check is a leftover from the horrors that used to play with
the page cache in that function. With the switch to ->read_iter(),
it's neither needed nor correct - for gfs2 it ends up being buggy,
since i_size is not guaranteed to be correct until later (inside
->read_iter()).
Spotted-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t.
Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some
ifdefs in blk_types.h.
Instead we'll need to add a few explicit includes that were implicit
before, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
by making sure we call iov_iter_advance() on original
iov_iter even if direct_IO (done on its copy) has returned 0.
It's a no-op for old iov_iter flavours and does the right thing
(== truncation of the stuff we'd allocated, but not filled) in
ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
by cleanup in generic_file_read_iter().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
pages are filled by kernel_readv() via a kvec array (as we used
to do all along), so iov_iter here is used only as a way of
arranging for those pages to be in pipe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
iov_iter variant for passing data into pipe. copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it. Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied. iov_iter_advance()
will truncate everything past the spot it has advanced to.
New primitive: iov_iter_pipe(), used for initializing those.
pipe should be locked all along.
Running out of space acts as fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.
In other words, ->read_iter() on those acts pretty much like
->splice_read(). Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances can be switched
to that scheme - that'll happen in the next commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
to that, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
arrange for waiting, looping, etc. themselves.
That should make pipe_lock the outermost one.
Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those. It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".
Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running the following command:
busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null
with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
kernel BUG at include/linux/mm.h:367!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89
(busybox's cat uses sendfile(2), unlike the coreutils version)
This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.
All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.
Cc: stable@vger.kernel.org
Signed-off-by: Rabin Vincent <rabin@rab.in>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations. For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During testing, I discovered that __generic_file_splice_read() returns
0 (EOF) when aops->readpage fails with AOP_TRUNCATED_PAGE on the first
page of a single/multi-page splice read operation. This EOF return code
causes the userspace test to (correctly) report a zero-length read error
when it was expecting otherwise.
The current strategy of returning a partial non-zero read when ->readpage
returns AOP_TRUNCATED_PAGE works only when the failed page is not the
first of the lot being processed.
This patch attempts to retry lookup and call ->readpage again on pages
that had previously failed with AOP_TRUNCATED_PAGE. With this patch, my
tests pass and I haven't noticed any unwanted side effects.
This version removes the thrice-retry loop and instead indefinitely
retries lookups on AOP_TRUNCATED_PAGE errors from ->readpage. This
behavior is now similar to do_generic_file_read().
Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following test program from Dmitry can cause softlockups or RCU
stalls as it copies 1GB from tmpfs into eventfd and we don't have any
scheduling point at that path in sendfile(2) implementation:
int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);
Add cond_resched() into __splice_from_pipe() to fix the problem.
CC: Dmitry Vyukov <dvyukov@google.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>