Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
PiperOrigin-RevId: 658891833
Processes that are exec'ed into a container cannot be properly
restored because the caller is no longer present. This change
tracks processes that are exec'ed and kills them upon restore.
Updates #1956
PiperOrigin-RevId: 623644184
This removes the compare-and-branch involved in casting
`ptraceTracer.Load()` to `*Task`.
This is on the hot syscall path, and in the overwhelmingly-likely case that
the task is not being traced, we should be as fast as possible.
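
As a rough illustration of where such a compare-and-branch comes from, the sketch below contrasts an `atomic.Value` load, which needs a type assertion, with a typed `atomic.Pointer[Task]` load, which does not. The `Task`, `untypedTracer`, and `typedTracer` types are hypothetical stand-ins, and the use of `atomic.Pointer` is an assumption about the approach, not a copy of the actual change.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Task is a hypothetical stand-in for kernel.Task.
type Task struct{ tid int }

// untypedTracer keeps the tracer in an atomic.Value. Every load returns an
// interface value, so callers pay for a runtime type check.
type untypedTracer struct{ v atomic.Value }

func (u *untypedTracer) Tracer() *Task {
	t, _ := u.v.Load().(*Task) // type assertion: compare and branch
	return t
}

// typedTracer keeps the tracer in a typed atomic pointer (Go 1.19+). The
// load already has type *Task, so no assertion is needed.
type typedTracer struct{ p atomic.Pointer[Task] }

func (t *typedTracer) Tracer() *Task { return t.p.Load() }

func main() {
	var u untypedTracer
	var t typedTracer
	fmt.Println(u.Tracer() == nil, t.Tracer() == nil) // untraced case

	tracer := &Task{tid: 7}
	u.v.Store(tracer)
	t.p.Store(tracer)
	fmt.Println(u.Tracer().tid, t.Tracer().tid)
}
```

Both variants return nil for an untraced task; the typed variant just gets there without inspecting a dynamic type.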
Shaves off about 20 nanoseconds from the syscall hot path:
```
                                 │   before    │               after                │
                                 │   sec/op    │   sec/op     vs base               │
Syscallbench/syscall.getpidopt-8   1.355µ ± 2%   1.334µ ± 2%  -1.59% (p=0.030 n=64)
```
PiperOrigin-RevId: 611277514
This adds a per-task cache of seccomp actions to take for syscall numbers
where the filters return an action without depending on anything other than
the syscall number and the architecture code of the seccomp program input.
This avoids evaluating seccomp-bpf programs in the syscall hot path, for
programs that use seccomp *within* gVisor (aka on themselves).
Benchmarks show that this removes about 50ns from the syscall hot path
for a trivial filter like the one in the benchmark.
Real-world filters are much longer, and the benefit is magnified the more
complex the filter is.
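
The caching idea can be sketched roughly as follows. This is a simplified illustration, not the gVisor implementation: the `filter` type, `taskCache`, and `maxCachedSysno` are hypothetical, and the filter reports for itself whether its result depends only on the syscall number, whereas a real implementation would have to derive that from the BPF program.

```go
package main

import "fmt"

// action is a hypothetical stand-in for a seccomp return action.
type action uint32

const (
	actionAllow action = iota
	actionErrno
)

// maxCachedSysno is an assumed upper bound on cacheable syscall numbers.
const maxCachedSysno = 512

// filter is a hypothetical filter. It returns the action for a syscall and
// whether that action depends only on the syscall number.
type filter func(sysno uintptr, args [6]uint64) (act action, sysnoOnly bool)

// taskCache holds the per-task precomputed actions.
type taskCache struct {
	known  [maxCachedSysno]bool
	action [maxCachedSysno]action
}

// populate evaluates the filter once per syscall number and records the
// results that do not depend on the arguments.
func (c *taskCache) populate(f filter) {
	for n := uintptr(0); n < maxCachedSysno; n++ {
		act, sysnoOnly := f(n, [6]uint64{})
		c.known[n] = sysnoOnly
		c.action[n] = act
	}
}

// check returns the action and whether it came from the cache. On a miss it
// falls back to evaluating the filter.
func (c *taskCache) check(f filter, sysno uintptr, args [6]uint64) (action, bool) {
	if sysno < maxCachedSysno && c.known[sysno] {
		return c.action[sysno], true
	}
	act, _ := f(sysno, args)
	return act, false
}

// exampleFilter allows syscall 39 unconditionally, allows syscall 1 only
// when its first argument is 1, and rejects everything else.
func exampleFilter(sysno uintptr, args [6]uint64) (action, bool) {
	switch sysno {
	case 39:
		return actionAllow, true
	case 1:
		if args[0] == 1 {
			return actionAllow, false
		}
		return actionErrno, false
	}
	return actionErrno, true
}

func main() {
	var c taskCache
	c.populate(exampleFilter)
	act, cached := c.check(exampleFilter, 39, [6]uint64{})
	fmt.Println(act == actionAllow, cached)
	act, cached = c.check(exampleFilter, 1, [6]uint64{5})
	fmt.Println(act == actionErrno, cached)
}
```

Syscalls whose result depends on arguments still go through the filter on every call; only the argument-independent ones are served from the array.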
```
                    │ not_cached  │               cached               │
                    │   sec/op    │   sec/op     vs base               │
SyscallUnderSeccomp   1.282µ ± 3%   1.230µ ± 1%  -4.06% (p=0.002 n=6)
```
PiperOrigin-RevId: 586522068
- Added docstrings for exported functions.
- Exported kernel.userCounters since it is used by another package (testutil)
and is an exported field of an exported type `kernel.TaskConfig`.
PiperOrigin-RevId: 580615070
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The intention of this change is to cover enough surface to support running
Docker within gVisor, rather than to provide a full implementation.
This implements the following features:
- Keys as a first-class concept in the kernel.
- Tracking keys in user namespaces.
- Task session keyrings: possession, inheritance.
- Key permission enforcement.
- The following `keyctl(2)` operations:
  - `KEYCTL_GET_KEYRING_ID`
  - `KEYCTL_DESCRIBE`
  - `KEYCTL_JOIN_SESSION_KEYRING`
  - `KEYCTL_SETPERM`
Notably, this does not implement:
- The ability to actually add any keys other than the session keyring
  (which does not hold any cryptographic key data).
- Other special keyrings (thread keyring, process keyring, user session
  keyring, etc.).
- Lots of `keyctl(2)` operations.
- Key expiration.
- Key garbage collection. Keys live until their user namespace is destroyed.
  However, each user namespace is limited to 200 keys, so memory growth is
  bounded.
- `add_key(2)`
- `request_key(2)`
... However, this makes design choices that seem odd given the limited scope
of this change, but make sense given the desire to eventually accommodate
these features. For example, there are many
`switch` statements with only one option for session keyrings, which would get
more options when adding support for other special keyrings. Similarly, the
signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but
currently only ever gets the session keyring as non-nil.
PiperOrigin-RevId: 567047896
task.netns is always changed from a task goroutine under task.mu.
This means that a task goroutine can access it without taking any locks, and
it does not need to increment a reference counter in such cases.
In all other cases, we need to take task.mu.
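
A minimal sketch of this access pattern, under stated assumptions: `Namespace`, `Task`, and the method names `borrowNetNS`, `GetNetNS`, and `setNetNS` are hypothetical and chosen for illustration only. The point is that the single writer (the task goroutine) can read its own writes without a lock, while every other reader must take `mu` and a reference.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Namespace is a hypothetical reference-counted network namespace.
type Namespace struct {
	name string
	refs atomic.Int64
}

func newNamespace(name string) *Namespace {
	ns := &Namespace{name: name}
	ns.refs.Store(1) // the creator holds one reference
	return ns
}

func (n *Namespace) IncRef() { n.refs.Add(1) }
func (n *Namespace) DecRef() { n.refs.Add(-1) }

// Task is a hypothetical stand-in for kernel.Task.
type Task struct {
	mu sync.Mutex
	// netns is written only by the task goroutine, and only with mu held.
	netns *Namespace
}

// borrowNetNS must be called from the task goroutine. Because that goroutine
// is the only writer, it can read without mu and without taking a reference.
func (t *Task) borrowNetNS() *Namespace { return t.netns }

// GetNetNS may be called from any goroutine. It takes mu and returns the
// namespace with an extra reference that the caller must drop.
func (t *Task) GetNetNS() *Namespace {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.netns.IncRef()
	return t.netns
}

// setNetNS must be called from the task goroutine. It takes ownership of the
// caller's reference on ns and drops the task's reference on the old one.
func (t *Task) setNetNS(ns *Namespace) {
	t.mu.Lock()
	old := t.netns
	t.netns = ns
	t.mu.Unlock()
	old.DecRef()
}

func main() {
	t := &Task{netns: newNamespace("initial")}
	fmt.Println(t.borrowNetNS().name) // no lock, no reference taken

	ns := t.GetNetNS() // locked path, reference taken
	fmt.Println(ns.name, ns.refs.Load())
	ns.DecRef()

	t.setNetNS(newNamespace("new"))
	fmt.Println(t.borrowNetNS().name)
}
```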
PiperOrigin-RevId: 552913323
This change introduces the nsfs file system. Each new namespace allocates
a new nsfs inode.
Here are reasons why we need these inodes:
* each namespace has to have a unique ID.
* /proc/[pid]/ns/ contains one entry for each namespace. Bind mounting one of
the files in this directory to somewhere else in the filesystem keeps the
corresponding namespace alive even if all processes currently in
the namespace terminate.
* setns() allows the calling process to join an existing namespace specified
by a file descriptor.
PiperOrigin-RevId: 550694515
Currently there is no way to initialize a container into its own cgroups, so
`EnterInitialCgroups` inherits all of the root's cgroups when the container
starts.
With this change, when the container starts it enters the cgroups that are
passed down to it from the sandbox.
PiperOrigin-RevId: 540650868
When charging a pids cgroup during thread creation, it is possible for
the hierarchy containing the pids controller to be destroyed and
recreated. If thread creation fails and the charge has to be rolled
back, an intervening hierarchy change previously caused a charge
underflow during the rollback.
If the hierarchy changes between the charge and uncharge, the uncharge
is unnecessary.
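
One way to picture the fix is sketched below. It is an assumption-laden illustration, not the actual gVisor code: `pidsHierarchy`, `registry`, and the idea of returning a hierarchy ID from `charge` are all hypothetical. The rollback compares the ID it charged against the current hierarchy and skips the uncharge when they differ.

```go
package main

import (
	"fmt"
	"sync"
)

// pidsHierarchy is a hypothetical stand-in for a cgroup hierarchy that
// contains the pids controller.
type pidsHierarchy struct {
	id      uint32 // unique per hierarchy instance
	charged int64  // number of threads currently charged
}

// registry tracks the currently mounted hierarchy.
type registry struct {
	mu     sync.Mutex
	nextID uint32
	cur    *pidsHierarchy
}

func newRegistry() *registry {
	r := &registry{}
	r.recreate()
	return r
}

// recreate destroys the current hierarchy and installs a fresh one with a
// new ID and a zero charge count.
func (r *registry) recreate() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	r.cur = &pidsHierarchy{id: r.nextID}
}

// charge accounts for one new thread and returns the ID of the hierarchy
// that was charged.
func (r *registry) charge() uint32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cur.charged++
	return r.cur.id
}

// uncharge rolls back a charge made against hierarchy chargedID. If the
// hierarchy has changed since then, the charge no longer exists and the
// rollback is skipped. It reports whether anything was uncharged.
func (r *registry) uncharge(chargedID uint32) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cur.id != chargedID {
		return false
	}
	r.cur.charged--
	return true
}

// count returns the current charge count.
func (r *registry) count() int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.cur.charged
}

func main() {
	r := newRegistry()
	id := r.charge() // thread creation starts
	r.recreate()     // hierarchy destroyed and recreated meanwhile
	// Thread creation fails; the rollback must not underflow.
	fmt.Println(r.uncharge(id), r.count()) // prints "false 0"
}
```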
Reported-by: syzbot+b72cc8d190b428e43a03@syzkaller.appspotmail.com
PiperOrigin-RevId: 471112484
This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the
following lock cycle:
`kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex`
-> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex`
(Also, less locking within `createVMALocked` is probably for the better in
general.)
PiperOrigin-RevId: 449588573