This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing
to lock TaskSet.mu, analogously to Linux's lock_task_sighand().
This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally
requires TaskSet.mu to be locked since it reads Task.exitState. To fix this,
use atomic memory operations on Task.exitState when required.
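
A minimal sketch of the pattern, not the actual kernel.Task code: the
`task` type, `newTask`, `setExitState`, and `canReceiveSignal` are
hypothetical names. Writers still hold the stand-in for TaskSet.mu, but
they store the state atomically, so a reader that holds only a different
lock (or none) can load it safely.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// exitState is a hypothetical stand-in for the task exit state.
type exitState uint32

const (
	stateRunning exitState = iota
	stateZombie
	stateDead
)

// task is an illustrative type, not gVisor's kernel.Task.
type task struct {
	// tasksMu stands in for TaskSet.mu. Writers of state hold it.
	tasksMu *sync.RWMutex
	// state holds an exitState. It is written with tasksMu held and
	// may be read without tasksMu by using an atomic load.
	state atomic.Uint32
}

func newTask() *task {
	return &task{tasksMu: &sync.RWMutex{}}
}

// setExitState is the writer side: lock held, atomic store.
func (t *task) setExitState(s exitState) {
	t.tasksMu.Lock()
	defer t.tasksMu.Unlock()
	t.state.Store(uint32(s))
}

// canReceiveSignal is the reader side. It does not take tasksMu, so it
// must use an atomic load to avoid a data race with setExitState.
func (t *task) canReceiveSignal() bool {
	return exitState(t.state.Load()) == stateRunning
}

func main() {
	t := newTask()
	fmt.Println(t.canReceiveSignal()) // true
	t.setExitState(stateZombie)
	fmt.Println(t.canReceiveSignal()) // false
}
```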
PiperOrigin-RevId: 681128063
The new mount namespace was being created with the old user namespace,
not the new one. This led to permission errors when creating new mounts.
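
The ordering issue can be illustrated with a simplified model. All types
and functions below are hypothetical, and `mayMount` is a deliberately
reduced permission check (ownership equality only), not the real
capability check.

```go
package main

import "fmt"

// userNamespace, mountNamespace and credentials are hypothetical
// stand-ins, not gVisor types.
type userNamespace struct{ name string }

type mountNamespace struct {
	// owner is the user namespace that mount permission checks are
	// made against.
	owner *userNamespace
}

type credentials struct{ userNS *userNamespace }

// unshareUserAndMount models creating a new user namespace and a new
// mount namespace in a single call.
func unshareUserAndMount(old credentials) (credentials, *mountNamespace) {
	newCreds := credentials{userNS: &userNamespace{name: "child of " + old.userNS.name}}
	// The buggy ordering corresponds to passing old.userNS here.
	mntns := &mountNamespace{owner: newCreds.userNS}
	return newCreds, mntns
}

// mayMount is a simplified check: only credentials in the owning user
// namespace may create mounts.
func mayMount(c credentials, m *mountNamespace) bool {
	return c.userNS == m.owner
}

func main() {
	root := credentials{userNS: &userNamespace{name: "root"}}
	creds, mntns := unshareUserAndMount(root)
	fmt.Println(mayMount(creds, mntns)) // true
	fmt.Println(mayMount(root, mntns))  // false
}
```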
PiperOrigin-RevId: 676489580
Processes that are exec'ed into a container cannot be properly
restored because the caller is no longer present. This change
tracks processes that are exec'ed and kills them upon restore.
Updates #1956
PiperOrigin-RevId: 623644184
This adds a per-task cache of seccomp actions to take for syscall numbers
where the filters return an action without depending on anything other than
the syscall number and the architecture code of the seccomp program input.
This avoids evaluating seccomp-bpf programs in the syscall hot path, for
programs that use seccomp *within* gVisor (aka on themselves).
Benchmarks show that this removes about 50ns from the syscall hot path
for a trivial filter like the one in the benchmark.
Real-world filters are much longer, and the benefit is magnified the more
complex the filter is.
```
                    │ not_cached  │               cached               │
                    │   sec/op    │   sec/op     vs base               │
SyscallUnderSeccomp   1.282µ ± 3%   1.230µ ± 1%  -4.06% (p=0.002 n=6)
```
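
A rough sketch of the caching idea, under stated assumptions: `filter`,
`argIndependent`, `taskSeccomp`, `install`, and `check` are hypothetical
names, the architecture code is ignored, and the set of cacheable syscall
numbers is supplied by hand rather than derived from the filter program.

```go
package main

import "fmt"

type action uint32

const (
	actionAllow action = iota
	actionErrno
	actionKill
)

const maxSysno = 512 // illustrative bound on cached syscall numbers

// filter is a hypothetical stand-in for an installed seccomp-bpf program.
type filter struct {
	// evaluate runs the full program.
	evaluate func(sysno uintptr, args [6]uint64) action
	// argIndependent reports whether the result for sysno is known not
	// to depend on the syscall arguments.
	argIndependent func(sysno uintptr) bool
}

type cacheEntry struct {
	valid  bool
	action action
}

// taskSeccomp is the per-task state: the filter plus its action cache.
type taskSeccomp struct {
	f     filter
	cache [maxSysno]cacheEntry
	evals int // full evaluations performed by check, for demonstration
}

// install fills the cache once, when the filter is installed.
func install(f filter) *taskSeccomp {
	ts := &taskSeccomp{f: f}
	for sysno := uintptr(0); sysno < maxSysno; sysno++ {
		if f.argIndependent(sysno) {
			ts.cache[sysno] = cacheEntry{valid: true, action: f.evaluate(sysno, [6]uint64{})}
		}
	}
	return ts
}

// check is the hot path: a table lookup when possible, otherwise a
// full evaluation of the filter.
func (ts *taskSeccomp) check(sysno uintptr, args [6]uint64) action {
	if sysno < maxSysno {
		if e := ts.cache[sysno]; e.valid {
			return e.action
		}
	}
	ts.evals++
	return ts.f.evaluate(sysno, args)
}

// demoFilter kills on syscall 2, returns errno on syscall 1 when the
// first argument exceeds 2, and allows everything else.
func demoFilter() filter {
	return filter{
		evaluate: func(sysno uintptr, args [6]uint64) action {
			switch {
			case sysno == 2:
				return actionKill
			case sysno == 1 && args[0] > 2:
				return actionErrno
			}
			return actionAllow
		},
		argIndependent: func(sysno uintptr) bool { return sysno != 1 },
	}
}

func main() {
	ts := install(demoFilter())
	fmt.Println(ts.check(2, [6]uint64{}) == actionKill, ts.evals)   // true 0
	fmt.Println(ts.check(1, [6]uint64{3}) == actionErrno, ts.evals) // true 1
}
```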
PiperOrigin-RevId: 586522068
The intention of this change is to cover a sufficient surface to accommodate
the use of running Docker within gVisor, rather than a full implementation.
This implements the following features:
- Keys as a first-class concept in the kernel.
- Tracking keys in user namespaces.
- Task session keyrings: possession, inheritance.
- Key permission enforcement.
- The following `keyctl(2)` operations:
  - `KEYCTL_GET_KEYRING_ID`
  - `KEYCTL_DESCRIBE`
  - `KEYCTL_JOIN_SESSION_KEYRING`
  - `KEYCTL_SETPERM`
Notably, this does not implement:
- The ability to actually add any keys other than the session keyring
  (which does not hold any cryptographic key data).
- Other special keyrings (thread keyring, process keyring, user session
  keyring, etc.).
- Lots of `keyctl(2)` operations.
- Key expiration.
- Key garbage collection. Keys live until their user namespace is destroyed.
  However, each user namespace is limited to 200 keys, so memory growth is
  bounded.
- `add_key(2)`
- `request_key(2)`
... However, this makes design choices that seem odd given the limited scope
of this change, but make sense when taking into account the desire to
eventually accommodate them in the future. For example, there are many
`switch` statements with only one option for session keyrings, which would get
more options when adding support for other special keyrings. Similarly, the
signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but
currently only ever gets the session keyring as non-nil.
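
The bounded-memory argument for keys (no garbage collection, but at most
200 keys per user namespace) can be sketched as follows. `keySet`, `add`,
and `errKeyQuota` are hypothetical names chosen for illustration. Only the
limit of 200 comes from this change.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// maxKeysPerNamespace is the per-user-namespace limit.
const maxKeysPerNamespace = 200

// errKeyQuota is a hypothetical error value for this sketch.
var errKeyQuota = errors.New("key quota exceeded for user namespace")

type key struct {
	id          int32
	description string
}

// keySet models the keys tracked by one user namespace. Keys are never
// removed, so the quota is what bounds memory growth.
type keySet struct {
	mu     sync.Mutex
	nextID int32
	keys   map[int32]*key
}

func newKeySet() *keySet {
	return &keySet{nextID: 1, keys: make(map[int32]*key)}
}

// add creates a key, or fails once the namespace holds the maximum.
func (s *keySet) add(description string) (*key, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.keys) >= maxKeysPerNamespace {
		return nil, errKeyQuota
	}
	k := &key{id: s.nextID, description: description}
	s.nextID++
	s.keys[k.id] = k
	return k, nil
}

func main() {
	s := newKeySet()
	for i := 0; i < maxKeysPerNamespace; i++ {
		if _, err := s.add(fmt.Sprintf("keyring-%d", i)); err != nil {
			panic(err)
		}
	}
	_, err := s.add("one too many")
	fmt.Println(err) // key quota exceeded for user namespace
}
```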
PiperOrigin-RevId: 567047896
task.netns is only ever changed from the task goroutine, with task.mu held.
This means that the task goroutine can read it without taking any locks, and
does not need to increment a reference counter in such cases.
In all other cases, we need to take task.mu.
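
A sketch of the resulting access rules, with hypothetical names
(`netNamespace`, `netNSFromTaskGoroutine`, `getNetNS`, `setNetNS`) rather
than the real Task methods. The task's own reference keeps the namespace
alive for as long as the task goroutine does not replace it, which is why
the lock-free path needs no extra reference.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// netNamespace is a hypothetical reference-counted network namespace.
type netNamespace struct {
	name string
	refs atomic.Int64
}

func (ns *netNamespace) incRef() { ns.refs.Add(1) }
func (ns *netNamespace) decRef() { ns.refs.Add(-1) }

// task is an illustrative type, not gVisor's kernel.Task.
type task struct {
	mu sync.Mutex
	// netns is written only by the task goroutine, with mu held.
	netns *netNamespace
}

// netNSFromTaskGoroutine must only be called from the task's own
// goroutine. The only writer is that same goroutine, so no lock and no
// extra reference are needed.
func (t *task) netNSFromTaskGoroutine() *netNamespace {
	return t.netns
}

// getNetNS may be called from any goroutine. It takes mu and returns
// the namespace with a reference that the caller must drop.
func (t *task) getNetNS() *netNamespace {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.netns.incRef()
	return t.netns
}

// setNetNS must only be called from the task's own goroutine.
func (t *task) setNetNS(ns *netNamespace) {
	ns.incRef()
	t.mu.Lock()
	old := t.netns
	t.netns = ns
	t.mu.Unlock()
	if old != nil {
		old.decRef()
	}
}

func main() {
	a := &netNamespace{name: "a"}
	t := &task{}
	t.setNetNS(a)
	fmt.Println(t.netNSFromTaskGoroutine().name, a.refs.Load()) // a 1
	other := t.getNetNS()
	fmt.Println(a.refs.Load()) // 2
	other.decRef()
}
```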
PiperOrigin-RevId: 552913323
This change implements only the basic functions of mount namespaces.
All features that depend on user namespaces will be implemented separately.
PiperOrigin-RevId: 552673896
This change introduces the nsfs file system. Each new namespace allocates
a new nsfs inode.
Here are reasons why we need these inodes:
* each namespace has to have a unique ID.
* /proc/[pid]/ns/ contains one entry for each namespace. Bind mounting one of
the files in this directory to somewhere else in the filesystem keeps the
corresponding namespace alive even if all processes currently in
the namespace terminate.
* setns() allows the calling process to join an existing namespace specified
by a file descriptor.
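
The first two reasons above can be illustrated with a small model. The
types and methods below are hypothetical, not the nsfs implementation:
each namespace gets an inode with a fresh ID, and the namespace stays
alive while anything (a member process, a bind mount, an open file
descriptor) holds a reference to that inode.

```go
package main

import (
	"fmt"
	"sync"
)

// namespaceInode is a hypothetical stand-in for an nsfs inode.
type namespaceInode struct {
	ino  uint64 // unique ID of the namespace
	kind string // for example "net" or "mnt"
	refs int    // holders: member processes, bind mounts, open FDs
}

// nsfs hands out inode numbers and tracks live namespaces.
type nsfs struct {
	mu      sync.Mutex
	nextIno uint64
	live    map[uint64]*namespaceInode
}

func newNSFS() *nsfs {
	return &nsfs{nextIno: 1, live: make(map[uint64]*namespaceInode)}
}

// newNamespace allocates an inode with a fresh ID. The returned inode
// carries one reference, owned by the creating process.
func (fs *nsfs) newNamespace(kind string) *namespaceInode {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	n := &namespaceInode{ino: fs.nextIno, kind: kind, refs: 1}
	fs.nextIno++
	fs.live[n.ino] = n
	return n
}

func (fs *nsfs) incRef(n *namespaceInode) {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	n.refs++
}

// decRef drops a reference and destroys the namespace on the last one.
func (fs *nsfs) decRef(n *namespaceInode) {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	n.refs--
	if n.refs == 0 {
		delete(fs.live, n.ino)
	}
}

func (fs *nsfs) alive(ino uint64) bool {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	_, ok := fs.live[ino]
	return ok
}

func main() {
	fs := newNSFS()
	ns := fs.newNamespace("net")
	fs.incRef(ns)              // bind mount of the namespace file
	fs.decRef(ns)              // last member process exits
	fmt.Println(fs.alive(ns.ino)) // true
	fs.decRef(ns)              // bind mount removed
	fmt.Println(fs.alive(ns.ino)) // false
}
```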
PiperOrigin-RevId: 550694515