Introduce a new nofault flag to indicate to iov_iter_get_pages not to
fault in user pages.
This is implemented by passing the FOLL_NOFAULT flag to get_user_pages,
which causes get_user_pages to fail when it would otherwise fault in a
page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting
in pages when page faults are not allowed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing. Uses get_user_pages() to fault in the pages
without actually writing to them, which would be destructive.
We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the iterator passed to .read_iter isn't in memory.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.
Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This reverts commit 2112ff5ce0.
We no longer need to track the truncation count, the one user that did
need it has been converted to using iov_iter_restore() instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().
This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.
Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remember how many bytes were truncated and reverted back. Because
not reexpanded iterators don't always work well with reverting, we may
need to know that to reexpand ourselves when needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
callers do *not* need to do iov_iter_advance() after it. In case when they end
up consuming less than they'd been given they need to do iov_iter_revert() on
everything they had not consumed. That, however, needs to be done only on slow
paths.
All in-tree callers converted. And that kills the last user of iterate_all_kinds()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) constify iov_iter argument; we are not advancing it in this primitive.
2) cap the amount requested by the amount of data in iov_iter. All
existing callers should've been safe, but the check is really cheap and
doing it here makes for easier analysis, as well as more consistent
semantics among the primitives.
3) don't bother with iterate_iovec(). Explicit loop is not any harder
to follow, and we get rid of standalone iterate_iovec() users - it's
only used by iterate_and_advance() and (soon to be gone) iterate_all_kinds().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of having them mixed in iter->type, use separate ->iter_type
and ->data_source (u8 and bool resp.) And don't bother with (pseudo-)
bitmap for the former - microoptimizations from being able to check
if the flavour is one of two values are not worth the confusion for
optimizer. It can't prove that we never get e.g. ITER_IOVEC | ITER_PIPE,
so we end up with extra headache.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use corresponding plain variants, revert on short copy. That's the way it
should've been done from the very beginning, except that we didn't have
iov_iter_revert() back then...
[fixed another braino caught by Qian Cai <quic_qiancai@quicinc.com>]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove iov_iter_for_each_range() as it's no longer used with the removal of
lustre.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix four things[1] in the patch that adds ITER_XARRAY[2]:
(1) Remove the address_space struct predeclaration. This is a holdover
from when it was ITER_MAPPING.
(2) Fix _copy_mc_to_iter() so that the xarray segment updates count and
iov_offset in the iterator before returning.
(3) Fix iov_iter_alignment() to not loop in the xarray case. Because the
middle pages are all whole pages, only the end pages need be
considered - and this can be reduced to just looking at the start
position in the xarray and the iteration size.
(4) Fix iov_iter_advance() to limit the size of the advance to no more
than the remaining iteration size.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Link: https://lore.kernel.org/r/YIVrJT8GwLI0Wlgx@zeniv-ca.linux.org.uk [1]
Link: https://lore.kernel.org/r/161918448151.3145707.11541538916600921083.stgit@warthog.procyon.org.uk [2]
Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in <linux/compat.h>
In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.
Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:
On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > However now I see that copy_user_generic() works for the wrong reason.
> > It works because the exception on the source address due to poison
> > looks no different than a write fault on the user address to the
> > caller, it's still just a short copy. So it makes copy_to_user() work
> > for the wrong reason relative to the name.
>
> Right.
>
> And it won't work that way on other architectures. On x86, we have a
> generic function that can take faults on either side, and we use it
> for both cases (and for the "in_user" case too), but that's an
> artifact of the architecture oddity.
>
> In fact, it's probably wrong even on x86 - because it can hide bugs -
> but writing those things is painful enough that everybody prefers
> having just one function.
Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().
Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.
One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.
[ bp: Massage a bit. ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split rw_copy_check_uvector into two new helpers with more sensible
calling conventions:
- iovec_from_user copies a iovec from userspace either into the provided
stack buffer if it fits, or allocates a new buffer for it. Returns
the actually used iovec. It also verifies that iov_len does fit a
signed type, and handles compat iovecs if the compat flag is set.
- __import_iovec consolidates the native and compat versions of
import_iovec. It calls iovec_from_user, then validates each iovec
actually points to user addresses, and ensures the total length
doesn't overflow.
This has two major implications:
- the access_process_vm case loses the total lenght checking, which
wasn't required anyway, given that each call receives two iovecs
for the local and remote side of the operation, and it verifies
the total length on the local side already.
- instead of a single loop there now are two loops over the iovecs.
Given that the iovecs are cache hot this doesn't make a major
difference
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The header file linux/uio.h includes crypto/hash.h which pulls in
most of the Crypto API. Since linux/uio.h is used throughout the
kernel this means that every tiny bit of change to the Crypto API
causes the entire kernel to get rebuilt.
This patch fixes this by moving it into lib/iov_iter.c instead
where it is actually used.
This patch also fixes the ifdef to use CRYPTO_HASH instead of just
CRYPTO which does not guarantee the existence of ahash.
Unfortunately a number of drivers were relying on linux/uio.h to
provide access to linux/slab.h. This patch adds inclusions of
linux/slab.h as detected by build failures.
Also skbuff.h was relying on this to provide a declaration for
ahash_request. This patch adds a forward declaration instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert pipes to use head and tail pointers for the buffer ring rather than
pointer and length as the latter requires two atomic ops to update (or a
combined op) whereas the former only requires one.
(1) The head pointer is the point at which production occurs and points to
the slot in which the next buffer will be placed. This is equivalent
to pipe->curbuf + pipe->nrbufs.
The head pointer belongs to the write-side.
(2) The tail pointer is the point at which consumption occurs. It points
to the next slot to be consumed. This is equivalent to pipe->curbuf.
The tail pointer belongs to the read-side.
(3) head and tail are allowed to run to UINT_MAX and wrap naturally. They
are only masked off when the array is being accessed, e.g.:
pipe->bufs[head & mask]
This means that it is not necessary to have a dead slot in the ring as
head == tail isn't ambiguous.
(4) The ring is empty if "head == tail".
A helper, pipe_empty(), is provided for this.
(5) The occupancy of the ring is "head - tail".
A helper, pipe_occupancy(), is provided for this.
(6) The number of free slots in the ring is "pipe->ring_size - occupancy".
A helper, pipe_space_for_user() is provided to indicate how many slots
userspace may use.
(7) The ring is full if "head - tail >= pipe->ring_size".
A helper, pipe_full(), is provided for this.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull io_uring updates from Jens Axboe:
"This contains:
- Support for recvmsg/sendmsg as first class opcodes.
I don't envision going much further down this path, as there are
plans in progress to support potentially any system call in an
async fashion through io_uring. But I think it does make sense to
have certain core ops available directly, especially those that can
support a "try this non-blocking" flag/mode. (me)
- Handle generic short reads automatically.
This can happen fairly easily if parts of the buffered read is
cached. Since the application needs to issue another request for
the remainder, just do this internally and save kernel/user
roundtrip while providing a nicer more robust API. (me)
- Support for linked SQEs.
This allows SQEs to depend on each other, enabling an application
to eg queue a read-from-this-file,write-to-that-file pair. (me)
- Fix race in stopping SQ thread (Jackie)"
* tag 'for-5.3/io_uring-20190711' of git://git.kernel.dk/linux-block:
io_uring: fix io_sq_thread_stop running in front of io_sq_thread
io_uring: add support for recvmsg()
io_uring: add support for sendmsg()
io_uring: add support for sqe links
io_uring: punt short reads to async context
uio: make import_iovec()/compat_import_iovec() return bytes on success
If we pass pages through an iov_iter we always already have a reference
in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
reference to pages by default for bvec backed iov_iters.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.
Some callers already treat the return value that way, others need a
slight tweak.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 875f1d0769 ("iov_iter: add ITER_BVEC_FLAG_NO_REF flag")
introduces one extra flag of ITER_BVEC_FLAG_NO_REF, and this flag
is stored into iter->type.
However, iov_iter_type() doesn't consider the new added flag, fix
it by masking this flag in iov_iter_type().
Fixes: 875f1d0769 ("iov_iter: add ITER_BVEC_FLAG_NO_REF flag")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>