82 Commits

Author SHA1 Message Date
Jamie Liu 0b59173cff kernel: only lock TaskSet.mu in Task.unstopVforkParent() if necessary
PiperOrigin-RevId: 687421394
2024-10-18 14:17:05 -07:00
Jamie Liu 03bebc4402 kernel: add ThreadGroup.signalLock()
This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().

This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.

PiperOrigin-RevId: 681128063
2024-10-01 12:48:10 -07:00
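To illustrate the fix in 03bebc4402, here is a minimal sketch of reading an exit state atomically so that the reader does not depend on a lock held by writers. The `task` type, the state constants, and the method names are hypothetical stand-ins, not gVisor's actual definitions.

```go
package main

import "sync/atomic"

// Hypothetical exit states; gVisor's real Task.exitState values may differ.
const (
	stateRunning int32 = iota
	stateZombie
	stateDead
)

// task is an illustrative stand-in for kernel.Task.
type task struct {
	// exitState is accessed with atomic operations so that readers
	// do not need the lock that serializes writers.
	exitState atomic.Int32
}

// setExitState is called by the writer, which may also hold other locks.
func (t *task) setExitState(s int32) { t.exitState.Store(s) }

// canReceiveSignal reads exitState atomically; no lock is required.
func (t *task) canReceiveSignal() bool {
	return t.exitState.Load() == stateRunning
}
```

In this sketch a signal sender would call `canReceiveSignal()` while holding only the signal-handlers lock.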
Lucas Manning f229b3e772 Fix small logic bug with CLONE_NEWUSER|CLONE_NEWNS in clone.
The new mount namespace was being created with the old user namespace,
not the new one. This led to permission errors when creating new mounts.

PiperOrigin-RevId: 676489580
2024-09-19 11:24:15 -07:00
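The ordering issue fixed in f229b3e772 can be sketched as follows. All types and the `cloneNamespaces` function are hypothetical and only show the idea: when both flags are set, the new mount namespace should be owned by the new user namespace.

```go
package main

// userNamespace and mountNamespace are illustrative stand-ins.
type userNamespace struct{ name string }

type mountNamespace struct{ owner *userNamespace }

// cloneArgs models the two clone flags relevant here.
type cloneArgs struct {
	newUser  bool // CLONE_NEWUSER
	newMount bool // CLONE_NEWNS
}

// cloneNamespaces returns the user and mount namespaces for the child.
func cloneNamespaces(parentUser *userNamespace, parentMount *mountNamespace, args cloneArgs) (*userNamespace, *mountNamespace) {
	userns := parentUser
	if args.newUser {
		userns = &userNamespace{name: "child"}
	}
	mntns := parentMount
	if args.newMount {
		// Use userns (which may be the new namespace), not parentUser.
		mntns = &mountNamespace{owner: userns}
	}
	return userns, mntns
}
```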
Fabricio Voznika d514dc4424 Track exec'ed processes and kill them after restore
Processes that are exec'ed into a container cannot be properly
restored because the caller is no longer present. This change
tracks processes that are exec'ed and kills them upon restore.

Updates #1956

PiperOrigin-RevId: 623644184
2024-04-10 16:53:49 -07:00
NymanRobin f481172b53 Convert atomic.Value to atomic.Pointer[T]
2024-03-05 11:09:23 +02:00
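For context on f481172b53, the sketch below contrasts the two standard-library types. The `config` type and the holder types are hypothetical; only `atomic.Value` and `atomic.Pointer[T]` (available since Go 1.19) come from `sync/atomic`.

```go
package main

import "sync/atomic"

// config is a hypothetical payload type.
type config struct{ name string }

// holderValue uses atomic.Value, which stores an interface value,
// so every load needs a type assertion.
type holderValue struct{ v atomic.Value }

func (h *holderValue) set(c *config) { h.v.Store(c) }

func (h *holderValue) get() *config {
	c, _ := h.v.Load().(*config) // yields nil if nothing was stored
	return c
}

// holderPointer uses atomic.Pointer[T], which is statically typed.
type holderPointer struct{ p atomic.Pointer[config] }

func (h *holderPointer) set(c *config) { h.p.Store(c) }

func (h *holderPointer) get() *config { return h.p.Load() }
```

The typed version moves the type check to compile time and avoids the assertion on each load.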
Etienne Perot dec37ea4ed gVisor seccomp: Implement in-Sentry seccomp cache.
This adds a per-task cache of seccomp actions to take for syscall numbers
where the filters return an action without depending on anything other than
the syscall number and the architecture code of the seccomp program input.

This avoids evaluating seccomp-bpf programs in the syscall hot path, for
programs that use seccomp *within* gVisor (aka on themselves).

Benchmarks show that this removes about 50ns from the syscall hot path
for a trivial filter like the one in the benchmark.
Real-world filters are much longer, and the benefit is magnified the more
complex the filter is.

```
                    │ not_cached  │              cached               │
                    │   sec/op    │   sec/op     vs base              │
SyscallUnderSeccomp   1.282µ ± 3%   1.230µ ± 1%  -4.06% (p=0.002 n=6)
```

PiperOrigin-RevId: 586522068
2023-11-29 20:00:16 -08:00
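The caching idea in dec37ea4ed can be sketched as a per-task table indexed by syscall number. Everything here is hypothetical (type names, the table size, and how entries are marked cacheable); the sketch assumes the caller has already determined which syscall numbers yield an action that depends on nothing else, and it ignores the architecture code.

```go
package main

// action is a stand-in for a seccomp return action.
type action uint32

const (
	actionAllow action = iota
	actionErrno
	actionKill
)

// maxCachedSyscall is a hypothetical upper bound on cached syscall numbers.
const maxCachedSyscall = 512

// filter stands in for evaluating a seccomp-bpf program.
type filter func(sysno int, args [6]uint64) action

// taskSeccomp holds a filter plus a per-task action cache.
type taskSeccomp struct {
	filter    filter
	cacheable [maxCachedSyscall]bool   // true if cached[sysno] is valid
	cached    [maxCachedSyscall]action // action for cacheable syscalls
}

// markCacheable records that sysno always yields a, regardless of arguments.
func (s *taskSeccomp) markCacheable(sysno int, a action) {
	if sysno >= 0 && sysno < maxCachedSyscall {
		s.cacheable[sysno] = true
		s.cached[sysno] = a
	}
}

// check returns the action for a syscall, consulting the cache first and
// falling back to full filter evaluation on a miss.
func (s *taskSeccomp) check(sysno int, args [6]uint64) action {
	if sysno >= 0 && sysno < maxCachedSyscall && s.cacheable[sysno] {
		return s.cached[sysno]
	}
	return s.filter(sysno, args)
}
```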
Andrei Vagin f3b0a527c2 inet: allow to create abstract unix sockets in non-root namespaces
PiperOrigin-RevId: 573253619
2023-10-13 10:20:56 -07:00
Etienne Perot 02f70b5df0 Implement a subset of keyctl(2) and keyrings(7) for better Docker support.
The intention of this change is to cover a sufficient surface to accommodate
the use of running Docker within gVisor, rather than a full implementation.

This implements the following features:

  - Keys as a first-class concept in the kernel.
  - Tracking keys in user namespaces.
  - Task session keyrings: possession, inheritance.
  - Key permission enforcement.
  - The following `keyctl(2)` operations:
    - `KEYCTL_GET_KEYRING_ID`
    - `KEYCTL_DESCRIBE`
    - `KEYCTL_JOIN_SESSION_KEYRING`
    - `KEYCTL_SETPERM`

Notably, this does not implement:

  - The ability to actually add any keys other than the session keyring
    (which does not hold any cryptographic key data).
  - Other special keyrings (thread keyring, process keyring, user session
    keyring, etc.).
  - Lots of `keyctl(2)` operations.
  - Key expiration.
  - Key garbage collection. Keys live until their user namespace is destroyed.
    However, each user namespace is limited to 200 keys, so memory growth is
    bounded.
  - `add_key(2)`
  - `request_key(2)`

... However, this makes design choices that seem odd given the limited scope
of this change, but make sense when taking into account the desire to
eventually accommodate them. For example, there are many
`switch` statements with only one option for session keyrings, which would get
more options when adding support for other special keyrings. Similarly, the
signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but
currently only ever gets the session keyring as non-nil.

PiperOrigin-RevId: 567047896
2023-09-20 12:38:39 -07:00
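The bounded-growth argument in 02f70b5df0 (no garbage collection, but at most 200 keys per user namespace) can be sketched as below. The `keySet` type, its methods, and the error value are hypothetical; locking is omitted for brevity, and the error gVisor actually returns is not stated in the commit message.

```go
package main

import "errors"

// maxKeysPerNamespace is the per-user-namespace limit from the commit message.
const maxKeysPerNamespace = 200

// errKeyQuota is a hypothetical error for an exhausted key quota.
var errKeyQuota = errors.New("key quota exceeded")

// key is an illustrative stand-in for a kernel key.
type key struct {
	id          int32
	description string
}

// keySet tracks the keys owned by one user namespace. Keys are never
// removed individually; the whole set goes away with the namespace.
type keySet struct {
	keys   map[int32]*key
	nextID int32
}

// add creates a key, refusing once the namespace holds the maximum number.
func (s *keySet) add(description string) (*key, error) {
	if len(s.keys) >= maxKeysPerNamespace {
		return nil, errKeyQuota
	}
	if s.keys == nil {
		s.keys = make(map[int32]*key)
	}
	s.nextID++
	k := &key{id: s.nextID, description: description}
	s.keys[k.id] = k
	return k, nil
}
```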
Shambhavi Srivastava 8623c872ce Automated rollback of changelist 557871250
PiperOrigin-RevId: 560158129
2023-08-25 11:58:29 -07:00
Andrei Vagin eb6b3ac00b vfs: MountNamespace.Root() has to return a top mount of /
A few mounts can be mounted on top of `/`.

PiperOrigin-RevId: 558264274
2023-08-18 15:35:47 -07:00
Nicolas Lacasse 9be6f98612 Automated rollback of changelist 554554034
PiperOrigin-RevId: 557871250
2023-08-17 10:50:10 -07:00
Shambhavi Srivastava 21d66119b7 Implementing clone3
Updates #8585

PiperOrigin-RevId: 554554034
2023-08-07 12:19:32 -07:00
Jing Chen e89e40fded Implement setns CLONE_NEWUTS namespace type.
PiperOrigin-RevId: 554306089
2023-08-06 15:33:25 -07:00
Andrei Vagin abe7cee096 kernel: don't use atomic pointers for task.netns
task.netns is always changed from a task goroutine under task.mu.

It means that we can access it from a task goroutine without any locks, and
we don't need to increment a reference counter in such cases.

In all other cases, we need to take task.mu.

PiperOrigin-RevId: 552913323
2023-08-01 14:04:53 -07:00
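The access rule described in abe7cee096 can be sketched like this. The types and method names are hypothetical, and reference counting is left out; the point is only that the owning goroutine may read without the lock, because it is the sole writer, while every other reader takes the mutex.

```go
package main

import "sync"

// netNamespace is an illustrative stand-in for a network namespace.
type netNamespace struct{ name string }

// task is an illustrative stand-in for kernel.Task.
type task struct {
	mu sync.Mutex
	// netns is written only by the task's own goroutine, with mu held.
	netns *netNamespace
}

// netNSLocal must only be called from the task's own goroutine. No lock is
// needed because that goroutine is the only writer.
func (t *task) netNSLocal() *netNamespace { return t.netns }

// netNSRemote may be called from any goroutine and takes mu.
func (t *task) netNSRemote() *netNamespace {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.netns
}

// setNetNS must only be called from the task's own goroutine. It takes mu
// so that remote readers observe a consistent value.
func (t *task) setNetNS(ns *netNamespace) {
	t.mu.Lock()
	t.netns = ns
	t.mu.Unlock()
}
```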
Andrei Vagin aa2c8c33c6 Implement setns for mount namespaces
PiperOrigin-RevId: 552859231
2023-08-01 11:12:29 -07:00
Andrei Vagin 41bb04c149 Implement mount namespaces
This change implements only the basic functions of mount namespaces.
All features that depend on user namespaces will be implemented separately.

PiperOrigin-RevId: 552673896
2023-07-31 21:12:21 -07:00
Jing Chen 7f067c7e1d Implement setns CLONE_NEWIPC namespace type.
PiperOrigin-RevId: 552619565
2023-07-31 16:12:45 -07:00
Andrei Vagin 46115504ec Implement the setns syscall
This change introduces the nsfs file system. Each new namespace allocates
a new nsfs inode.

Here are reasons why we need these inodes:
* each namespace has to have a unique id.
* proc/pid/ns/ contains one entry for each namespace. Bind mounting one of
  the files in this directory to somewhere else in the filesystem keeps the
  corresponding namespace alive even if all processes currently in
  the namespace terminate.
* setns() allows the calling process to join an existing namespace specified
  by a file descriptor.

PiperOrigin-RevId: 550694515
2023-07-24 15:45:08 -07:00
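Two of the reasons listed in 46115504ec, unique IDs and keeping a namespace alive through a bind mount, can be sketched with an ID counter and a reference count. The `namespace` type and its methods are hypothetical and do not reflect the real nsfs implementation.

```go
package main

import "sync/atomic"

// lastInodeID hands out unique inode numbers for namespaces.
var lastInodeID atomic.Uint64

// namespace is an illustrative stand-in for any namespace type that is
// backed by an nsfs-style inode.
type namespace struct {
	inodeID   uint64       // unique per namespace
	refs      atomic.Int64 // held by member processes, open files, and mounts
	destroyed atomic.Bool
}

// newNamespace allocates a namespace with one reference (its first process).
func newNamespace() *namespace {
	ns := &namespace{inodeID: lastInodeID.Add(1)}
	ns.refs.Store(1)
	return ns
}

// incRef is taken by anything that must keep the namespace alive, such as
// a bind mount of its /proc/[pid]/ns entry.
func (ns *namespace) incRef() { ns.refs.Add(1) }

// decRef drops a reference and destroys the namespace when none remain.
func (ns *namespace) decRef() {
	if ns.refs.Add(-1) == 0 {
		ns.destroyed.Store(true)
	}
}
```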
Jamie Liu f517b70ded Pass context to kernel.TaskImage.release().
PiperOrigin-RevId: 543541608
2023-06-26 14:28:58 -07:00
Andrei Vagin fedbf08401 kernel: unshare a network namespace without taking Task.mu
t.netns is an atomic pointer so it should not be a problem for readers.
As for writers, only the task itself can change its network namespace.

Reported-by: syzbot+8e29d377b851dcdfbca2@syzkaller.appspotmail.com
Reported-by: syzbot+9ffa998047fce0c57473@syzkaller.appspotmail.com
Reported-by: syzbot+c1c75367b97f5e31a12f@syzkaller.appspotmail.com
PiperOrigin-RevId: 542384765
2023-06-21 15:51:28 -07:00
Nicolas Lacasse 028cf757bb Clarify comment about copying Task.image in Task.Clone().
PiperOrigin-RevId: 510829046
2023-02-19 10:43:46 -08:00
Nicolas Lacasse ff32cb8b25 Hold t.mu while accessing t.image.
We copy t.image with t.mu held, to avoid holding the lock
while calling t.image.Fork.

PiperOrigin-RevId: 510698814
2023-02-18 13:12:02 -08:00
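The locking pattern described in ff32cb8b25 (copy under the lock, then do the expensive call without it) can be sketched as follows. The `image` and `task` types and the `fork` method are hypothetical stand-ins for `Task.image` and its `Fork` method.

```go
package main

import "sync"

// image is an illustrative stand-in for a task image.
type image struct{ name string }

// fork stands in for a potentially slow operation that should not run
// with the task mutex held.
func (i image) fork() image { return image{name: i.name + "-child"} }

// task is an illustrative stand-in for kernel.Task.
type task struct {
	mu    sync.Mutex
	image image
}

// forkImage copies t.image while holding t.mu, releases the lock, and only
// then forks the copy.
func (t *task) forkImage() image {
	t.mu.Lock()
	img := t.image // copy made under the lock
	t.mu.Unlock()
	return img.fork() // lock is not held here
}
```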
Andrei Vagin aeabb78527 Allow to return an error from PullFullState.
PiperOrigin-RevId: 504415294
2023-01-24 17:15:17 -08:00
Lucas Manning a248c63cd5 Fix circular lock between filesystemRWMutex and taskSetRWMutex.
PiperOrigin-RevId: 500747195
2023-01-09 10:24:35 -08:00
Ayush Ranjan 1fa3c06f1e Delete VFS1 completely.
- Delete pkg/sentry/fs/*.
- Move pkg/sentry/fs/fsutil out of VFS1 directory and remove VFS1 components.
- Remove remaining unused references to VFS1 from the rest of the codebase.
- Rename/refactor code to avoid even referencing VFS2, unless necessary.
- Rewrite VFS1-only tests to VFS2.

Updates #1624

PiperOrigin-RevId: 490064269
2022-11-21 13:57:52 -08:00