65 Commits

Author SHA1 Message Date
Andrei Vagin 9fcf0b5b53 proc: invalidate task inodes when tasks are destroyed
PiperOrigin-RevId: 705785809
2024-12-13 00:58:08 -08:00
Etienne Perot 2b55090a58 Do not crash when creating thread group with already-exceeded soft CPU limit.
Reported-by: syzbot+da9595a72d0762aaa48d@syzkaller.appspotmail.com
PiperOrigin-RevId: 699425946
2024-11-23 01:28:50 -08:00
Jamie Liu 4298980325 Disallow task creation after Kernel.WaitExited() returns.
Otherwise tasks can be created via the control server between when
Kernel.WaitExited() returns and when the control server is stopped, resulting
in task goroutines running when Kernel.Release() is called.
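
The race described above can be closed by setting a "no new tasks" flag in the same critical section that observes zero live tasks. The following is a minimal sketch of that idea, not the actual gVisor code: the `kernel` type, `startTask`, `waitExited`, and `errKernelExited` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"sync"
)

// errKernelExited is a hypothetical error returned once task creation is
// disabled.
var errKernelExited = errors.New("kernel exited: task creation is disabled")

// kernel is a simplified stand-in for the real Kernel type.
type kernel struct {
	mu         sync.Mutex
	cond       *sync.Cond
	liveTasks  int  // number of running task goroutines
	noNewTasks bool // set by waitExited; never cleared
}

func newKernel() *kernel {
	k := &kernel{}
	k.cond = sync.NewCond(&k.mu)
	return k
}

// startTask runs fn on a new task goroutine unless the kernel has exited.
func (k *kernel) startTask(fn func()) error {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.noNewTasks {
		return errKernelExited
	}
	k.liveTasks++
	go func() {
		fn()
		k.mu.Lock()
		k.liveTasks--
		if k.liveTasks == 0 {
			k.cond.Broadcast()
		}
		k.mu.Unlock()
	}()
	return nil
}

// waitExited blocks until no task goroutines remain, then disables task
// creation in the same critical section, so no task can be created between
// the last exit and the flag being set.
func (k *kernel) waitExited() {
	k.mu.Lock()
	defer k.mu.Unlock()
	for k.liveTasks > 0 {
		k.cond.Wait()
	}
	k.noNewTasks = true
}
```

With this shape, a `startTask` call that arrives after `waitExited` returns gets `errKernelExited` instead of starting a goroutine that would still be running at release time.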

PiperOrigin-RevId: 658891833
2024-08-02 13:41:17 -07:00
Fabricio Voznika d514dc4424 Track exec'ed processes and kill them after restore
Processes that are exec'ed into a container cannot be properly
restored because the caller is no longer present. This change
tracks processes that are exec'ed and kills them upon restore.

Updates #1956

PiperOrigin-RevId: 623644184
2024-04-10 16:53:49 -07:00
dongjinlong ba02461e12 chore: remove repetitive words in comments
Signed-off-by: dongjinlong <dongjinlong@outlook.com>
2024-03-26 19:57:40 +08:00
NymanRobin f481172b53 Convert atomic.Value to atomic.Pointer[T]
2024-03-05 11:09:23 +02:00
Etienne Perot 7480450936 Replace Task.ptraceTracer with atomic.Pointer.
This removes the compare-and-branch involved in casting
`ptraceTracer.Load()` to `*Task`.

This is on the hot syscall path, and in the overwhelmingly-likely case that
the task is not being traced, we should be as fast as possible.

Shaves off about 20 nanoseconds from the syscall hot path:

```
                                 │   before    │               after                │
                                 │   sec/op    │   sec/op     vs base               │
Syscallbench/syscall.getpidopt-8   1.355µ ± 2%   1.334µ ± 2%  -1.59% (p=0.030 n=64)
```
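
To illustrate where the removed compare-and-branch comes from, here is a small sketch contrasting the two storage styles. The `Task`, `valueTracer`, and `pointerTracer` types are simplified stand-ins invented for this example; only `atomic.Value` and `atomic.Pointer[T]` (Go 1.19+) are real standard-library types.

```go
package main

import "sync/atomic"

// Task is a minimal stand-in for the kernel's task type.
type Task struct {
	id int
}

// valueTracer stores the tracer in an atomic.Value. Load returns `any`, so
// every read needs a type assertion, which compiles to a type comparison
// and a branch.
type valueTracer struct {
	tracer atomic.Value // holds *Task
}

func (v *valueTracer) load() *Task {
	// The comma-ok form also covers the case where nothing was stored yet.
	t, _ := v.tracer.Load().(*Task)
	return t
}

// pointerTracer stores the tracer in an atomic.Pointer[Task]. Load already
// returns *Task, so no assertion is needed on the read path.
type pointerTracer struct {
	tracer atomic.Pointer[Task]
}

func (p *pointerTracer) load() *Task {
	return p.tracer.Load()
}
```

Both variants return `nil` when no tracer is set, but the typed pointer also lets callers store `nil` directly, which `atomic.Value` does not allow.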

PiperOrigin-RevId: 611277514
2024-02-28 17:06:55 -08:00
Etienne Perot dec37ea4ed gVisor seccomp: Implement in-Sentry seccomp cache.
This adds a per-task cache of seccomp actions to take for syscall numbers
where the filters return an action without depending on anything other than
the syscall number and the architecture code of the seccomp program input.

This avoids evaluating seccomp-bpf programs in the syscall hot path, for
programs that use seccomp *within* gVisor (aka on themselves).

Benchmarks show that this removes about 50ns from the syscall hot path
for a trivial filter like the one in the benchmark.
Real-world filters are much longer, and the benefit is magnified the more
complex the filter is.

```
                    │ not_cached  │              cached               │
                    │   sec/op    │   sec/op     vs base              │
SyscallUnderSeccomp   1.282µ ± 3%   1.230µ ± 1%  -4.06% (p=0.002 n=6)
```
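
The caching idea can be sketched as a per-task table indexed by syscall number. This is an illustration under stated assumptions rather than the Sentry's implementation: the filter is modeled as a plain Go function that reports whether its decision depended only on the syscall number, the cache is filled lazily, and the architecture code is omitted. The real change may determine cacheability differently, for example by analyzing the seccomp-bpf program ahead of time. All names here are hypothetical.

```go
package main

// maxSyscallNr is a hypothetical upper bound on cacheable syscall numbers.
const maxSyscallNr = 512

type action int

const (
	actionAllow action = iota
	actionErrno
)

// filterFunc stands in for a seccomp-bpf program. nrOnly reports whether the
// returned action depended on nothing but the syscall number.
type filterFunc func(nr int, args [6]uint64) (act action, nrOnly bool)

// task holds a filter plus a per-task cache of actions.
type task struct {
	filter      filterFunc
	cached      [maxSyscallNr]bool
	actions     [maxSyscallNr]action
	evaluations int // number of times the filter actually ran
}

// checkSeccomp returns the action for a syscall, consulting the cache first.
func (t *task) checkSeccomp(nr int, args [6]uint64) action {
	inRange := nr >= 0 && nr < maxSyscallNr
	if inRange && t.cached[nr] {
		return t.actions[nr]
	}
	t.evaluations++
	act, nrOnly := t.filter(nr, args)
	if inRange && nrOnly {
		t.cached[nr] = true
		t.actions[nr] = act
	}
	return act
}

// exampleFilter allows syscall 39 unconditionally and allows syscall 1 only
// when its first argument is 1. Everything else is rejected.
func exampleFilter(nr int, args [6]uint64) (action, bool) {
	switch nr {
	case 39:
		return actionAllow, true
	case 1:
		if args[0] == 1 {
			return actionAllow, false
		}
		return actionErrno, false
	default:
		return actionErrno, true
	}
}
```

Decisions that inspect arguments are never cached, so repeated calls to syscall 1 in this sketch always re-run the filter, while repeated calls to syscall 39 run it once.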

PiperOrigin-RevId: 586522068
2023-11-29 20:00:16 -08:00
Ayush Ranjan 0358995f0f Fix linter warning in kernel.go.
- Added docstrings for exported functions.
- Exported kernel.userCounters since it is used by another package (testutil)
  and is an exported field of an exported type `kernel.TaskConfig`.

PiperOrigin-RevId: 580615070
2023-11-08 12:20:17 -08:00
Andrei Vagin 5f4abad306 Fix a few typos
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-25 12:13:42 -07:00
Andrei Vagin f3b0a527c2 inet: allow to create abstract unix sockets in non-root namespaces
PiperOrigin-RevId: 573253619
2023-10-13 10:20:56 -07:00
Etienne Perot 02f70b5df0 Implement a subset of keyctl(2) and keyrings(7) for better Docker support.
The intention of this change is to cover enough surface to support
running Docker within gVisor, rather than to provide a full implementation.

This implements the following features:

  - Keys as a first-class concept in the kernel.
  - Tracking keys in user namespaces.
  - Task session keyrings: possession, inheritance.
  - Key permission enforcement.
  - The following `keyctl(2)` operations:
    - `KEYCTL_GET_KEYRING_ID`
    - `KEYCTL_DESCRIBE`
    - `KEYCTL_JOIN_SESSION_KEYRING`
    - `KEYCTL_SETPERM`

Notably, this does not implement:

  - The ability to actually add any keys other than the session keyring
    (which does not hold any cryptographic key data).
  - Other special keyrings (thread keyring, process keyring, user session
    keyring, etc.).
  - Lots of `keyctl(2)` operations.
  - Key expiration.
  - Key garbage collection. Keys live until their user namespace is destroyed.
    However, each user namespace is limited to 200 keys, so memory growth is
    bounded.
  - `add_key(2)`
  - `request_key(2)`

However, this change makes design choices that seem odd given its limited
scope, but that make sense given the desire to eventually support the
features listed above. For example, there are many
`switch` statements with only one option for session keyrings, which would get
more options when adding support for other special keyrings. Similarly, the
signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but
currently only ever gets the session keyring as non-nil.

PiperOrigin-RevId: 567047896
2023-09-20 12:38:39 -07:00
Jing Chen e89e40fded Implement setns CLONE_NEWUTS namespace type.
PiperOrigin-RevId: 554306089
2023-08-06 15:33:25 -07:00
Andrei Vagin abe7cee096 kernel: don't use atomic pointers for task.netns
task.netns is always changed from a task goroutine under task.mu.

This means that a task goroutine can access it without taking any locks, and
does not need to increment a reference counter in such cases.

In all other cases, we need to take task.mu.
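
The access rule above can be sketched as follows. This is a simplified illustration, not the gVisor code: `netNamespace`, `setNetNS`, `netNSFromTaskGoroutine`, and `getNetNS` are hypothetical names, and the reference count is modeled as a bare atomic counter.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// netNamespace is a stand-in for a reference-counted network namespace.
type netNamespace struct {
	name string
	refs atomic.Int64 // extra references handed out to other goroutines
}

type task struct {
	mu sync.Mutex
	// netns is written only by the task goroutine, with mu held.
	netns *netNamespace
}

// setNetNS must be called from the task goroutine. It takes mu so that
// readers on other goroutines observe a consistent value.
func (t *task) setNetNS(ns *netNamespace) {
	t.mu.Lock()
	t.netns = ns
	t.mu.Unlock()
}

// netNSFromTaskGoroutine must be called from the task goroutine. Since that
// goroutine is the only writer, it can read the field without locking, and
// the task's own reference keeps the namespace alive.
func (t *task) netNSFromTaskGoroutine() *netNamespace {
	return t.netns
}

// getNetNS may be called from any goroutine. It takes mu and returns the
// namespace with an extra reference that the caller must drop.
func (t *task) getNetNS() *netNamespace {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.netns.refs.Add(1)
	return t.netns
}
```

The lock-free read is only sound because of the single-writer rule; any new writer outside the task goroutine would invalidate it.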

PiperOrigin-RevId: 552913323
2023-08-01 14:04:53 -07:00
Andrei Vagin 46115504ec Implement the setns syscall
This change introduces the nsfs file system. Each new namespace allocates
a new nsfs inode.

Here are reasons why we need these inodes:
* each namespace has to have a unique ID.
* proc/pid/ns/ contains one entry for each namespace. Bind mounting one of
  the files in this directory to somewhere else in the filesystem keeps the
  corresponding namespace alive even if all processes currently in
  the namespace terminate.
* setns() allows the calling process to join an existing namespace specified
  by a file descriptor.

PiperOrigin-RevId: 550694515
2023-07-24 15:45:08 -07:00
Nicolas Lacasse e7bd1b4c9c Implement PR_{S,G}ET_CHILD_SUBREAPER.
Closes #2323

PiperOrigin-RevId: 548205854
2023-07-14 13:19:25 -07:00
Jamie Liu f517b70ded Pass context to kernel.TaskImage.release().
PiperOrigin-RevId: 543541608
2023-06-26 14:28:58 -07:00
Shambhavi Srivastava 90bf8f22fc Enabling container to be initialized to its initial cgroups
Currently there's no implementation to enable the container to be
initialized to its cgroups, and hence `EnterInitialCgroups` would
inherit all of the root's cgroups when the container starts.

With this implementation we make sure that when the container starts,
it enters the cgroups that are passed down to it from the sandbox.

PiperOrigin-RevId: 540650868
2023-06-15 12:07:05 -07:00
Rahat Mahmood d0ae59368d cgroupfs: Fix lock ordering between kernfs.Filesystem.mu and TaskSet.mu.
We can't DecRef a cgroup with TaskSet.mu held as it leads to circular
locking. Restructure task creation to drop cgroup refs outside the
TaskSet.mu critical section.
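
The restructuring described above follows a common pattern: collect the references to drop while holding the lock, then drop them after unlocking. The sketch below shows that pattern with hypothetical `filesystem`, `cgroup`, and `taskSet` types; it assumes, as the message implies, that dropping a cgroup reference may need the filesystem lock.

```go
package main

import (
	"errors"
	"sync"
)

// filesystem stands in for kernfs.Filesystem; released counts dropped refs.
type filesystem struct {
	mu       sync.Mutex
	released int
}

type cgroup struct {
	fs *filesystem
}

// decRef takes the filesystem lock, so callers must not hold taskSet.mu.
func (c *cgroup) decRef() {
	c.fs.mu.Lock()
	c.fs.released++
	c.fs.mu.Unlock()
}

type taskSet struct {
	mu    sync.Mutex
	limit int
	tasks int
}

var errTooManyTasks = errors.New("task limit reached")

// newTask consumes the caller's cgroup references. On failure, the refs are
// only recorded inside the critical section and dropped after unlocking.
func (ts *taskSet) newTask(cgroups []*cgroup) error {
	var toRelease []*cgroup
	var err error

	ts.mu.Lock()
	if ts.tasks >= ts.limit {
		toRelease = cgroups
		err = errTooManyTasks
	} else {
		ts.tasks++
	}
	ts.mu.Unlock()

	// Safe to take filesystem.mu here: taskSet.mu is no longer held.
	for _, cg := range toRelease {
		cg.decRef()
	}
	return err
}
```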

Reported-by: syzbot+16a334ab1d6873db18f2@syzkaller.appspotmail.com
Reported-by: syzbot+fe1b962d430d1170e671@syzkaller.appspotmail.com
PiperOrigin-RevId: 492589548
2022-12-02 16:45:40 -08:00
Rahat Mahmood 62ddad6119 cgroupfs: Fix several races with task migration.
Reported-by: syzbot+0be09ce607731f085f73@syzkaller.appspotmail.com
PiperOrigin-RevId: 491920581
2022-11-30 08:14:01 -08:00
Ayush Ranjan 020df37be7 Start cleaning up VFS1.
PiperOrigin-RevId: 486586072
2022-11-07 00:39:54 -08:00
Rahat Mahmood 46e08207b5 cgroupfs: Handle hierarchy changes across charge/uncharge.
When charging a pids cgroup during thread creation, it is possible for
the hierarchy containing the pids controller to be destroyed and
recreated. If thread creation fails and the charge has to be rolled
back, an intervening hierarchy change previously caused a charge
underflow during the rollback.

If the hierarchy changes between the charge and uncharge, the uncharge
is unnecessary.
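
One way to express this rule is to have the charge return an identifier for the hierarchy it was made against, and to skip the rollback when that identifier no longer matches. The sketch below is an assumption-laden illustration: `pidsController`, `hierarchyID`, and `recreateHierarchy` are hypothetical names, not the cgroupfs API.

```go
package main

import "sync"

// pidsController is a simplified stand-in for the pids cgroup controller.
type pidsController struct {
	mu          sync.Mutex
	hierarchyID uint32 // bumped whenever the hierarchy is recreated
	charged     int64
}

// charge adds n to the current hierarchy and returns the ID of the hierarchy
// that received the charge.
func (c *pidsController) charge(n int64) uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.charged += n
	return c.hierarchyID
}

// uncharge rolls back a charge made against hierarchy chargedIn. If the
// hierarchy has changed since, the charge died with the old hierarchy and
// rolling it back would underflow the new one, so nothing is done.
func (c *pidsController) uncharge(chargedIn uint32, n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.hierarchyID != chargedIn {
		return
	}
	c.charged -= n
}

// recreateHierarchy models destroying and recreating the hierarchy, which
// resets all accounting.
func (c *pidsController) recreateHierarchy() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.hierarchyID++
	c.charged = 0
}
```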

Reported-by: syzbot+b72cc8d190b428e43a03@syzkaller.appspotmail.com
PiperOrigin-RevId: 471112484
2022-08-30 16:02:58 -07:00
Andrei Vagin 5ffcc1f799 Don't leak network namespaces
PiperOrigin-RevId: 454707336
2022-06-13 15:05:21 -07:00
Andrei Vagin dfa1c7a1c2 Automated rollback of changelist 445452190
PiperOrigin-RevId: 454236327
2022-06-10 14:01:33 -07:00
Etienne Perot c9f8b165cf Cache each thread group's TID within their own namespace.
This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the
following lock cycle:
`kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex`
-> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex`

(Also, less locking within `createVMALocked` is probably for the better in
general.)
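
As a rough sketch of the caching idea, the ID can be written once into an atomic field when the thread group is registered, so that later reads never touch the lock that protects the namespace's tables. The `pidNamespace` type, its `mu` field, and the `register` and `lockedID` helpers are hypothetical; they stand in for the locks named in the cycle above.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// pidNamespace is a simplified stand-in for a PID namespace.
type pidNamespace struct {
	mu    sync.RWMutex
	tgids map[*threadGroup]int32
}

type threadGroup struct {
	pidns    *pidNamespace
	cachedID atomic.Int32 // ID within pidns, set once at registration
}

// register assigns id to tg within ns and caches it on the thread group.
func (ns *pidNamespace) register(tg *threadGroup, id int32) {
	ns.mu.Lock()
	defer ns.mu.Unlock()
	ns.tgids[tg] = id
	tg.pidns = ns
	tg.cachedID.Store(id)
}

// lockedID shows the lock-taking lookup that the cache replaces. Calling it
// while holding another lock ordered after ns.mu risks a lock cycle.
func (tg *threadGroup) lockedID() int32 {
	tg.pidns.mu.RLock()
	defer tg.pidns.mu.RUnlock()
	return tg.pidns.tgids[tg]
}

// ID returns the cached value without taking any lock.
func (tg *threadGroup) ID() int32 {
	return tg.cachedID.Load()
}
```

Because `ID` takes no lock, it can be called from code that already holds the namespace lock or any lock ordered after it.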

PiperOrigin-RevId: 449588573
2022-05-18 15:14:14 -07:00