92 Commits

Author SHA1 Message Date
Andrei Vagin 52321c7e00 kernel/pipe: trigger EPOLLERR on a write end if all readers have been closed
Fixes #10066
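
As a rough illustration of the behavior in the title, the sketch below computes a readiness mask for the write end of a pipe. The helper `writeEndReadiness` and its parameters are hypothetical and are not taken from the gVisor sources; only the `EPOLLOUT` and `EPOLLERR` values come from the Linux headers.

```go
package main

// Values of EPOLLOUT and EPOLLERR from the Linux UAPI headers.
const (
	epollOut uint32 = 0x004
	epollErr uint32 = 0x008
)

// writeEndReadiness is a hypothetical helper, not gVisor code. It reports
// EPOLLOUT while the pipe has free space and adds EPOLLERR once no readers
// remain, which is the condition named in the commit title.
func writeEndReadiness(readers, freeBytes int) uint32 {
	var mask uint32
	if freeBytes > 0 {
		mask |= epollOut
	}
	if readers == 0 {
		mask |= epollErr
	}
	return mask
}
```

With these assumptions, a writer polling a pipe whose readers have all gone away sees `EPOLLERR` in addition to any writability bit.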

PiperOrigin-RevId: 612838688
2024-03-05 07:48:01 -08:00
Fabricio Voznika c087777e37 Plumb restore context to afterLoad()
This allows for external information to be passed to restore code, like
host FDs to be remapped.

Updates #1956

PiperOrigin-RevId: 612540749
2024-03-04 12:21:50 -08:00
Jamie Liu da3eb80271 Fix #10046
See updated comment in sentry/kernel/pipe/vfs.go.

PiperOrigin-RevId: 610516821
2024-02-26 13:56:42 -08:00
Jamie Liu 94f3d8a792 Fix splices to FDs that call usermem.IO.CopyIn/CopyInTo more than once.
Fixes #9932.

When Go is able to detect `io.Copy()` from a TCP socket or `AF_UNIX` stream
socket to a TCP socket, it attempts to implement the copy as a `splice(2)` from
the source to a pipe, followed by a `splice(2)` from the pipe to the
destination [1] (since `splice(2)` requires that one of the endpoints be a
pipe); the size of the pipe is set to 1 MB [2] (from a default of 64 KB [3]) to
reduce the number of splice syscalls required. In gVisor, a bug causes each
splice syscall from pipe to TCP socket to repeatedly read the *first* 64 KB [4]
of the pipe's data (when it contains more than 64 KB of data) rather than
*successive* chunks of 64 KB.

To fix this, advance pipe state by calling `Pipe.consumeLocked()` immediately
after `Pipe.peekLocked()`. Also defensively check that such FDs call
`Pipe.(usermem.IO)` methods on sequential addresses, and change
`fuse.deviceFD.Write()` to have this property.
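
The difference between peeking repeatedly and consuming after each peek can be sketched with a toy buffer. Everything here (`toyPipe`, `copyChunks`, the tiny `chunkSize`) is hypothetical and only models the behavior described above; it is not the code in `pipe.go`.

```go
package main

// chunkSize stands in for the 64 KB per-call limit; it is tiny here so the
// effect is easy to see.
const chunkSize = 4

// toyPipe is a hypothetical stand-in for a pipe buffer.
type toyPipe struct {
	buf []byte
}

// peek returns up to n bytes from the front without removing them.
func (p *toyPipe) peek(n int) []byte {
	if n > len(p.buf) {
		n = len(p.buf)
	}
	return p.buf[:n]
}

// consume drops n bytes from the front of the buffer.
func (p *toyPipe) consume(n int) {
	p.buf = p.buf[n:]
}

// copyChunks models a destination that pulls data from the pipe in several
// calls. With consumeEachCall=false the pipe state only advances after all
// calls, so every call sees the same leading chunk (the bug). With
// consumeEachCall=true each call sees the next chunk (the fix).
func copyChunks(p *toyPipe, calls int, consumeEachCall bool) string {
	var out []byte
	deferred := 0
	for i := 0; i < calls; i++ {
		c := p.peek(chunkSize)
		out = append(out, c...)
		if consumeEachCall {
			p.consume(len(c))
		} else {
			deferred += len(c)
		}
	}
	if deferred > len(p.buf) {
		deferred = len(p.buf)
	}
	p.consume(deferred)
	return string(out)
}
```

For a pipe holding `abcdefgh` and two calls, the deferred variant yields `abcdabcd`, while consuming after each peek yields `abcdefgh`.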

[1] Go: `net/tcpsock_posix.go:TCPConn.readFrom()` =>
`net/splice_linux.go:splice()` => `internal/poll/splice_linux.go:Splice()`

[2] Go: `internal/poll/splice_linux.go:newPipe()` => `maxSpliceSize`

[3] `pkg/kernel/pipe/pipe.go:DefaultPipeSize`

[4] `pkg/tcpip/transport/tcp/endpoint.go:endpoint.Write()` =>
`endpoint.queueSegment()` => `endpoint.readFromPayloader()` =>
`pkg/buffer/buffer.go:Buffer.WriteFromReader()` =>
`pkg/buffer/chunk.go:MaxChunkSize`

PiperOrigin-RevId: 603151951
2024-01-31 14:02:18 -08:00
Jing Chen be48200c0e Re-order loads in BUILD files to make transformations reversible in Copybara.
PiperOrigin-RevId: 598898756
2024-01-16 11:21:40 -08:00
Jamie Liu 704543b8f3 Ensure empty VFSPipeFD.SpliceToNonPipe(/dev/null) returns ErrWouldBlock.
When grep's stdout is /dev/null (so printed matches are discarded), its outcome
is only observable in its exit code, which is binary (0 for matches, 1 for no
matches). When grep's stdin is additionally a pipe, GNU grep optimizes for this
specific case by switching from reading input to splicing it directly to stdout
after the first match:

```
  if (exit_on_match | dev_null_output)
    list_files = LISTFILES_NONE;
...
  if (list_files == LISTFILES_NONE)
    finalize_input (desc, &st, ineof);
...
static bool
drain_input (int fd, struct stat const *st)
{
  ssize_t nbytes;
  if (S_ISFIFO (st->st_mode) && dev_null_output)
    {
#ifdef SPLICE_F_MOVE
      /* Should be faster, since it need not copy data to user space.  */
      nbytes = splice (fd, NULL, STDOUT_FILENO, NULL,
                       INITIAL_BUFSIZE, SPLICE_F_MOVE);
```

This triggers a bug in our splice implementation: since memdev.nullFD.Write()
never calls back into pipe.Pipe.peekLocked() to get ErrWouldBlock, this is
never propagated up to syscalls/linux.Splice(). Consequently, splice() returns
0 instead of blocking; grep interprets this as EOF from the pipe and exits.

We can't fix this by calling src.CopyInTo() in memdev.nullFD.Write() because
this would have the wrong behavior for `write(/dev/null, unmapped addr)`, which
should succeed because `drivers/char/mem.c:null_write()` also ignores the
application-provided pointer. Instead, handle this in
VFSPipeFD.SpliceToNonPipe(). (Linux instead avoids this problem by
distinguishing file_operations::write and file_operations::splice_write, which
we would prefer to avoid if possible.)
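
A minimal sketch of the idea, assuming a toy pipe and a sink that never reads its source: the emptiness check lives in the splice path itself rather than in the sink. The names `toyPipe`, `spliceToNull` and `errWouldBlock` are hypothetical, and the rule that an empty pipe with no writers reports EOF is the usual pipe convention rather than something stated in this message.

```go
package main

import "errors"

// errWouldBlock is a hypothetical stand-in for the sentry's ErrWouldBlock.
var errWouldBlock = errors.New("operation would block")

// toyPipe is a hypothetical pipe with a byte buffer and a writer count.
type toyPipe struct {
	data    []byte
	writers int
}

// spliceToNull models splicing from a pipe to a sink that discards data
// without ever calling back into the pipe. Because the sink cannot report
// that the pipe is empty, the splice path checks for that case itself.
func spliceToNull(p *toyPipe, count int) (int, error) {
	if len(p.data) == 0 {
		if p.writers == 0 {
			// Assumed convention: no data and no writers means EOF.
			return 0, nil
		}
		// Empty but still writable: the caller should block and retry.
		return 0, errWouldBlock
	}
	n := count
	if n > len(p.data) {
		n = len(p.data)
	}
	p.data = p.data[n:] // The discarded bytes are consumed from the pipe.
	return n, nil
}
```

Under these assumptions a caller can tell "nothing yet" (`errWouldBlock`) apart from a real end of input (0 bytes, nil error), which is the distinction grep depends on.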

Fixes #9736

PiperOrigin-RevId: 584091971
2023-11-20 12:04:52 -08:00
Andrei Vagin 5f4abad306 Fix a few typos
The plan is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-25 12:13:42 -07:00
Jamie Liu ff81c0c639 Remove //pkg/sentry/device.
This package was used for VFS1 device number assignment.

PiperOrigin-RevId: 538918926
2023-06-08 16:21:04 -07:00
Etienne Perot f8b9824813 Update unimpl.EmitUnimplementedEvent interface to add the syscall number.
This catches up the interface to the `EmitUnimplementedEvent` method signature
on `kernel.Kernel`.

Also add build-time test to verify that `kernel.Kernel` implements this
interface, in order to catch such breakages at build time in the future.
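
A build-time check of this kind is usually written in Go as a blank-identifier assignment. The sketch below uses hypothetical `Events` and `Kernel` types with a simplified method signature; the real interface and its parameters are not shown in this message.

```go
package main

// Events is a hypothetical interface standing in for the one in unimpl.
// The real method signature may differ.
type Events interface {
	EmitUnimplementedEvent(sysno uintptr)
}

// Kernel is a hypothetical implementer that records the last syscall number.
type Kernel struct {
	last uintptr
}

// EmitUnimplementedEvent records the syscall number it was given.
func (k *Kernel) EmitUnimplementedEvent(sysno uintptr) {
	k.last = sysno
}

// Compile-time assertion: the build fails here if *Kernel stops
// satisfying Events, for example after a signature change.
var _ Events = (*Kernel)(nil)
```

The assignment has no runtime cost; its only purpose is to make a mismatch between the interface and the method a compile error.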

PiperOrigin-RevId: 519000411
2023-03-23 17:01:37 -07:00
Adin Scannell 1ceb814544 Add default_applicable_licenses rules to packages.
PiperOrigin-RevId: 513581243
2023-03-02 10:50:04 -08:00
Etienne Perot 445fa6f40c Lockdep: Print more info in the "unbalanced unlock" case.
This CL does the following:

- Add the ability for nested locks to have names.
- Give names to all current uses of nested locks in the codebase.
- Truncate `lockdep` debug stack traces to avoid the clutter from the
  `lockdep` code itself
- Simplify `lockdep` to no longer require `classMap`.

PiperOrigin-RevId: 491486620
2022-11-28 17:53:09 -08:00
Ayush Ranjan 1fa3c06f1e Delete VFS1 completely.
- Delete pkg/sentry/fs/*.
- Move pkg/sentry/fs/fsutil out of VFS1 directory and remove VFS1 components.
- Remove remaining unused references to VFS1 from remaining codebase.
- Rename/refactor code to avoid even referencing VFS2, unless necessary.
- Rewrite VFS1-only tests to VFS2.

Updates #1624

PiperOrigin-RevId: 490064269
2022-11-21 13:57:52 -08:00
Andrei Vagin 604233c9f6 kernel: use lockdep mutexes
PiperOrigin-RevId: 449877248
2022-05-19 18:33:59 -07:00
Ayush Ranjan f6ed4523dc Reformat codebase.
PiperOrigin-RevId: 449358041
2022-05-17 17:48:35 -07:00
Kevin Krakauer 9050184c20 switch fsimpl/ from sync/atomic to atomicbitops for 32 bit values
PiperOrigin-RevId: 443535714
2022-04-21 18:32:04 -07:00
Kevin Krakauer 370672e989 prohibit direct use of sync/atomic (u)int64 functions
All atomic 64 bit ints are changed to atomicbitops.(Ui|I)nt64. A nogo checker
enforces that sync/atomic 64 bit functions are not called.

For reviewers: the interesting changes are in the atomicbitops and checkaligned
packages.

Why do this?
- It is very easy to accidentally use atomic values without sync/atomic funcs.
- We have checkatomics, but this is optional and is forgotten in several places.
  - Using a type+checker to enforce this seems less error prone and simpler.
- We get NoCopy protection.
- Use of 64 bit atomics can break 32 bit builds. We have types to handle this
  without any runtime cost, so we might as well use them.
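
To illustrate the type-based approach, here is a minimal sketch of a wrapper whose value can only be reached through atomic methods. `counter64` and `noCopy` are hypothetical and are not the `atomicbitops` implementation; in particular, this sketch does not address 32 bit alignment.

```go
package main

import "sync/atomic"

// noCopy lets `go vet` (copylocks) flag accidental copies of any struct
// that embeds it, because it has Lock and Unlock methods.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// counter64 is a hypothetical wrapper. The value field is unexported, so
// code outside this package cannot read or write it non-atomically.
type counter64 struct {
	_     noCopy
	value int64
}

// Load atomically reads the value.
func (c *counter64) Load() int64 { return atomic.LoadInt64(&c.value) }

// Store atomically writes the value.
func (c *counter64) Store(v int64) { atomic.StoreInt64(&c.value, v) }

// Add atomically adds d and returns the new value.
func (c *counter64) Add(d int64) int64 { return atomic.AddInt64(&c.value, d) }
```

Because callers outside the defining package never see the raw field, a checker only has to forbid direct `sync/atomic` calls instead of tracking which fields are meant to be atomic.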

PiperOrigin-RevId: 440473398
2022-04-08 16:06:26 -07:00
Fabricio Voznika dfcf798425 Fix epoll_ctl(2) regular files and dirs
Linux behaves differently for regular files and dirs for poll(2)/select(2)
compared to epoll_ctl(2). The latter returns EPERM for files and dirs.
I've also changed host FDs to behave like the underlying FD with regard
to epoll to keep it compatible with docker.

Fixes #7134

PiperOrigin-RevId: 429412692
2022-02-17 15:12:36 -08:00
Jamie Liu 8e22ce5019 Consistently order Pipe.mu before other file mutexes and MM.activeMu.
PiperOrigin-RevId: 422894869
2022-01-19 13:47:38 -08:00
Andrei Vagin 271e4f4ae6 kernel/pipe: clean up unused fields from the Pipe structure
They have been added by mistake.

PiperOrigin-RevId: 417716586
2021-12-21 17:14:04 -08:00
Andrei Vagin b76119a1e7 pipe: a reader has to wait until all writers have been notified
Otherwise, we can have a race when a reader closes a pipe before
a writer detects this reader.

PiperOrigin-RevId: 417645683
2021-12-21 10:19:23 -08:00
Andrei Vagin 4d29819e13 pipe: have separate notifiers for readers and writers
This change fixes a busy loop in the pipe code. VFSPipe.Open calls ctx.BlockOn
to wait for the opposite side, but waitQueue.EventRegister always triggers
EventInternal, so we never block.

Reported-by: syzbot+773e19ca2574516c9e00@syzkaller.appspotmail.com
PiperOrigin-RevId: 415428542
2021-12-09 21:51:32 -08:00
Adin Scannell dedb7e6ca1 Align Context API with kernel internals.
This change adapts the existing context to use more suitable non-channel-based
methods. This is a requisite for migrating the kernel internals to a
sleeper-based notification mechanism.

The last uses of amutex outside those migrated as part of this change were
dropped in a previous change. Since amutex depends on the channel-based
implementation, this package is also deleted as part of this change.

PiperOrigin-RevId: 415189675
2021-12-08 23:51:37 -08:00
Fabricio Voznika 9768009a79 Don't eat error from epoll_ctl EPOLL_CTL_ADD
Docker maps stdin to `/dev/null` which doesn't support epoll. Host FD
was ignoring the error and succeeding the epoll_ctl call from the
container, giving the false impression that epoll would be notified.

This required plumbing failure to all waiter.Waitable.EventRegister
callers and implementers.

Closes #6795

PiperOrigin-RevId: 414797621
2021-12-07 12:36:00 -08:00
Adin Scannell 91f58d2cc8 Update Waitable API.
Instead of passing the event mask at registration time, pass the mask as part
of the waiter. This makes the mask immutable and simplifies the architecture of
waiters. This is also necessary for a future fix that will allow the fdnotifier
to keep persistent entries, as opposed to requiring constant updates.

This change is intended to be a no-op in terms of function. The only exception
is signalfd, where this mask was abused. To handle this case, the operation of
signalfd changed to allow one layer of indirection.
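
The shape of the new API can be sketched as follows, with the mask fixed when the waiter is created and registration taking only the waiter. `EventMask`, `Entry`, `Queue` and their methods are hypothetical simplifications here, without locking, and do not reproduce the real `waiter` package.

```go
package main

// EventMask is a bit set of events; the values below are illustrative only.
type EventMask uint64

const (
	ReadableEvents EventMask = 1 << iota
	WritableEvents
)

// Entry is a hypothetical waiter that carries its own mask.
type Entry struct {
	mask     EventMask
	received []EventMask
}

// NewEntry fixes the mask at creation time, not at registration time.
func NewEntry(mask EventMask) *Entry {
	return &Entry{mask: mask}
}

// Queue is a hypothetical, unsynchronized wait queue.
type Queue struct {
	entries []*Entry
}

// EventRegister takes only the entry; no mask is passed here.
func (q *Queue) EventRegister(e *Entry) {
	q.entries = append(q.entries, e)
}

// Notify delivers events to every entry whose mask intersects them.
func (q *Queue) Notify(events EventMask) {
	for _, e := range q.entries {
		if m := e.mask & events; m != 0 {
			e.received = append(e.received, m)
		}
	}
}
```

In this sketch the mask is unexported and never reassigned after `NewEntry`, which mirrors the immutability that the change describes.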

PiperOrigin-RevId: 409702998
2021-11-13 12:54:39 -08:00
Jamie Liu 8682ce689e Remove state:"nosave"/"zerovalue" annotations from all waiter.Queues.
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields,
which meant that it couldn't handle intrusive linked lists, which meant that it
couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a
result, VFS1 unregisters all epoll waiters before saving and re-registers them
after loading, and waitable VFS1 file implementations tag their waiter.Queues
state:"nosave" (causing them to be skipped by the save/restore machinery) or
state:"zerovalue" (causing them to only be checked for zero-value-equality on
save).

VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance
pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2
epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2
file implementations do not tag waiter.Queues.

Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used
by both VFS1 and VFS2 (the latter via signalfd); as a result of the above,
tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll.
Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2),
and remove these tags from all waiter.Queues.

Also clean up after the epoll test added by cl/402323053, which implied this
issue (by instantiating DisableSave in the new test) without reporting it.

PiperOrigin-RevId: 402596216
2021-10-12 10:25:30 -07:00