Fixes #9932.
When Go is able to detect `io.Copy()` from a TCP socket or `AF_UNIX` stream
socket to a TCP socket, it attempts to implement the copy as a `splice(2)` from
the source to a pipe, followed by a `splice(2)` from the pipe to the
destination [1] (since `splice(2)` requires that one of the endpoints be a
pipe); the size of the pipe is set to 1 MB [2] (from a default of 64 KB [3]) to
reduce the number of splice syscalls required. In gVisor, a bug causes each
splice syscall from pipe to TCP socket to repeatedly read the *first* 64 KB [4]
of the pipe's data (when it contains more than 64 KB of data) rather than
*successive* chunks of 64 KB.
To fix this, advance pipe state by calling `Pipe.consumeLocked()` immediately
after `Pipe.peekLocked()`. Also defensively check that such FDs call
`Pipe.(usermem.IO)` methods on sequential addresses, and change
`fuse.deviceFD.Write()` to have this property.
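A minimal sketch of the failure mode, assuming a toy pipe rather than the real `pipe.Pipe`: the names `pipe`, `peek`, `consume`, `drain` and the 4-byte `chunkSize` are hypothetical stand-ins for illustration only. A destination that reads in several chunk-sized calls sees the same front chunk each time unless the pipe is advanced after every peek.

```go
package main

import "fmt"

// chunkSize stands in for the per-call limit (64 KB in the description above).
const chunkSize = 4

// pipe is a toy model of a byte pipe; it is NOT gVisor's pipe.Pipe.
type pipe struct {
	buf []byte
}

// peek returns up to n bytes from the front without removing them.
func (p *pipe) peek(n int) []byte {
	if n > len(p.buf) {
		n = len(p.buf)
	}
	return p.buf[:n]
}

// consume drops up to n bytes from the front of the pipe.
func (p *pipe) consume(n int) {
	if n > len(p.buf) {
		n = len(p.buf)
	}
	p.buf = p.buf[n:]
}

// drain models a destination that pulls data in several chunk-sized calls.
// With advance == false every call peeks the same front chunk (the bug);
// with advance == true each peek is followed by a consume (the fix).
func drain(p *pipe, calls int, advance bool) string {
	var out []byte
	for i := 0; i < calls; i++ {
		chunk := p.peek(chunkSize)
		out = append(out, chunk...)
		if advance {
			p.consume(len(chunk))
		}
	}
	return string(out)
}

func main() {
	fmt.Println(drain(&pipe{buf: []byte("abcdefgh")}, 2, false)) // abcdabcd
	fmt.Println(drain(&pipe{buf: []byte("abcdefgh")}, 2, true))  // abcdefgh
}
```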
[1] Go: `net/tcpsock_posix.go:TCPConn.readFrom()` =>
`net/splice_linux.go:splice()` => `internal/poll/splice_linux.go:Splice()`
[2] Go: `internal/poll/splice_linux.go:newPipe()` => `maxSpliceSize`
[3] `pkg/kernel/pipe/pipe.go:DefaultPipeSize`
[4] `pkg/tcpip/transport/tcp/endpoint.go:endpoint.Write()` =>
`endpoint.queueSegment()` => `endpoint.readFromPayloader()` =>
`pkg/buffer/buffer.go:Buffer.WriteFromReader()` =>
`pkg/buffer/chunk.go:MaxChunkSize`
PiperOrigin-RevId: 603151951
When grep's stdout is /dev/null (so printed matches are discarded), its outcome
is only observable in its exit code, which is binary (0 for matches, 1 for no
matches). When grep's stdin is additionally a pipe, GNU grep optimizes for this
specific case by switching from reading input to splicing it directly to stdout
after the first match:
```
  if (exit_on_match | dev_null_output)
    list_files = LISTFILES_NONE;
...
  if (list_files == LISTFILES_NONE)
    finalize_input (desc, &st, ineof);
...
static bool
drain_input (int fd, struct stat const *st)
{
  ssize_t nbytes;
  if (S_ISFIFO (st->st_mode) && dev_null_output)
    {
#ifdef SPLICE_F_MOVE
      /* Should be faster, since it need not copy data to user space.  */
      nbytes = splice (fd, NULL, STDOUT_FILENO, NULL,
                       INITIAL_BUFSIZE, SPLICE_F_MOVE);
```
This triggers a bug in our splice implementation: since memdev.nullFD.Write()
never calls back into pipe.Pipe.peekLocked(), it never gets ErrWouldBlock, so
that error is never propagated up to syscalls/linux.Splice(). Consequently,
splice() returns 0 instead of blocking; grep interprets this as EOF from the
pipe and exits.
We can't fix this by calling src.CopyInTo() in memdev.nullFD.Write() because
this would have the wrong behavior for `write(/dev/null, unmapped addr)`, which
should succeed because `drivers/char/mem.c:null_write()` also ignores the
application-provided pointer. Instead, handle this in
VFSPipeFD.SpliceToNonPipe(). (Linux instead avoids this problem by
distinguishing file_operations::write and file_operations::splice_write, which
we would prefer to avoid if possible.)
Fixes #9736
PiperOrigin-RevId: 584091971
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This catches up the interface to the `EmitUnimplementedEvent` method signature
on `kernel.Kernel`.
Also add a build-time test to verify that `kernel.Kernel` implements this
interface, in order to catch such breakages at build time in the future.
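For illustration, a build-time check of this kind is commonly written in Go as a blank-identifier assignment. The sketch below assumes a hypothetical `eventEmitter` interface and `toyKernel` type with a simplified method signature; it does not reproduce the real interface or `kernel.Kernel`.

```go
package main

import "fmt"

// eventEmitter is a hypothetical interface with a simplified signature.
type eventEmitter interface {
	EmitUnimplementedEvent(sysno uintptr)
}

// toyKernel is a hypothetical implementer standing in for kernel.Kernel.
type toyKernel struct {
	events []uintptr
}

// EmitUnimplementedEvent records the syscall number it was given.
func (k *toyKernel) EmitUnimplementedEvent(sysno uintptr) {
	k.events = append(k.events, sysno)
}

// Compile-time assertion: the build fails here if *toyKernel ever stops
// satisfying eventEmitter (for example, after a signature change).
var _ eventEmitter = (*toyKernel)(nil)

func main() {
	k := &toyKernel{}
	var e eventEmitter = k
	e.EmitUnimplementedEvent(42)
	fmt.Println(len(k.events)) // 1
}
```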
PiperOrigin-RevId: 519000411
This CL does the following:
- Add the ability for nested locks to have names.
- Give names to all current uses of nested locks in the codebase.
- Truncate `lockdep` debug stack traces to avoid the clutter from the
`lockdep` code itself.
- Simplify `lockdep` to no longer require `classMap`.
PiperOrigin-RevId: 491486620
All atomic 64 bit ints are changed to atomicbitops.(Ui|I)nt64. A nogo checker
enforces that sync/atomic 64 bit functions are not called.
For reviewers: the interesting changes are in the atomicbitops and checkaligned
packages.
Why do this?
- It is very easy to accidentally use atomic values without sync/atomic funcs.
- We have checkatomics, but this is optional and is forgotten in several places.
- Using a type+checker to enforce this seems less error prone and simpler.
- We get NoCopy protection.
- Use of 64 bit atomics can break 32 bit builds. We have types to handle this
without any runtime cost, so we might as well use them.
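A minimal sketch of the type-based approach, assuming a hypothetical `counter64` wrapper; this is not the atomicbitops implementation, and it leaves out the no-copy and 32-bit alignment handling. Keeping the raw field unexported means every access outside the defining package has to go through the atomic methods.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter64 is a hypothetical wrapper type. The raw value is unexported, so
// callers in other packages cannot read or write it non-atomically.
type counter64 struct {
	value int64
}

// Load atomically reads the value.
func (c *counter64) Load() int64 { return atomic.LoadInt64(&c.value) }

// Store atomically writes the value.
func (c *counter64) Store(v int64) { atomic.StoreInt64(&c.value, v) }

// Add atomically adds delta and returns the new value.
func (c *counter64) Add(delta int64) int64 { return atomic.AddInt64(&c.value, delta) }

// countTo increments one counter from n goroutines and returns the total.
func countTo(n int) int64 {
	var c counter64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Add(1)
		}()
	}
	wg.Wait()
	return c.Load()
}

func main() {
	fmt.Println(countTo(100)) // 100
}
```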
PiperOrigin-RevId: 440473398
Linux behaves differently for regular files and dirs for poll(2)/select(2)
compared to epoll_ctl(2). The latter returns EPERM for regular files and dirs.
I've also changed host FDs to behave like the underlying FD with regard to
epoll, to keep it compatible with Docker.
Fixes #7134
PiperOrigin-RevId: 429412692
This change fixes a busy loop in the pipe code. VFSPipe.Open calls ctx.BlockOn
to wait for the opposite side, but waitQueue.EventRegister always triggers
EventInternal, so we never block.
Reported-by: syzbot+773e19ca2574516c9e00@syzkaller.appspotmail.com
PiperOrigin-RevId: 415428542
This change adapts the existing context to use more suitable non-channel-based
methods. This is a requisite for migrating the kernel internals to a
sleeper-based notification mechanism.
The last uses of amutex outside those migrated as part of this change were
dropped in a previous change. Since amutex depends on the channel-based
implementation, this package is also deleted as part of this change.
PiperOrigin-RevId: 415189675
Docker maps stdin to `/dev/null`, which doesn't support epoll. Host FD
was ignoring the error and letting the epoll_ctl call from the container
succeed, giving the false impression that epoll would be notified.
This required plumbing failure to all waiter.Waitable.EventRegister
callers and implementers.
Closes #6795
PiperOrigin-RevId: 414797621
Instead of passing the event mask at registration time, pass the mask as part
of the waiter. This makes the mask immutable and simplifies the architecture of
waiters. This is also necessary for a future fix that will allow the fdnotifier
to keep persistent entries, as opposed to requiring constant updates.
This change is intended to be a no-op in terms of function. The only exception
is signalfd, where this mask was abused. To handle this case, the operation of
signalfd changed to allow one layer of indirection.
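A rough sketch of the resulting shape, assuming hypothetical names (`eventMask`, `entry`, `newEntry`, `queue`) that only loosely mirror the waiter package: the mask is fixed when the entry is constructed, and registration takes the entry alone. Immutability here is by convention (the field is set once in the constructor), not enforced by the type system.

```go
package main

import "fmt"

// eventMask is a hypothetical bitmask of event types.
type eventMask uint64

const (
	readable eventMask = 1 << iota
	writable
)

// entry carries its mask from construction onward.
type entry struct {
	mask   eventMask
	notify func(eventMask)
}

// newEntry fixes the mask when the entry is created.
func newEntry(mask eventMask, notify func(eventMask)) *entry {
	return &entry{mask: mask, notify: notify}
}

// queue holds registered entries.
type queue struct {
	entries []*entry
}

// register takes no mask argument; the entry already knows its mask.
func (q *queue) register(e *entry) { q.entries = append(q.entries, e) }

// notifyAll delivers to each entry only the events its mask selects.
func (q *queue) notifyAll(events eventMask) {
	for _, e := range q.entries {
		if hit := events & e.mask; hit != 0 {
			e.notify(hit)
		}
	}
}

func main() {
	var q queue
	var got []eventMask
	q.register(newEntry(readable, func(ev eventMask) { got = append(got, ev) }))
	q.notifyAll(writable)            // filtered out by the entry's mask
	q.notifyAll(readable | writable) // delivered as readable only
	fmt.Println(got)                 // [1]
}
```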
PiperOrigin-RevId: 409702998
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields,
which meant that it couldn't handle intrusive linked lists, which meant that it
couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a
result, VFS1 unregisters all epoll waiters before saving and re-registers them
after loading, and waitable VFS1 file implementations tag their waiter.Queues
state:"nosave" (causing them to be skipped by the save/restore machinery) or
state:"zerovalue" (causing them to only be checked for zero-value-equality on
save).
VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance
pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2
epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2
file implementations do not tag waiter.Queues.
Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used
by both VFS1 and VFS2 (the latter via signalfd); as a result of the above,
tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll.
Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2),
and remove these tags from all waiter.Queues.
Also clean up after the epoll test added by cl/402323053, which implied this
issue (by instantiating DisableSave in the new test) without reporting it.
PiperOrigin-RevId: 402596216