gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Andrei Vagin	9fcf0b5b53	proc: invalidate task inodes when tasks are destroyed PiperOrigin-RevId: 705785809	2024-12-13 00:58:08 -08:00
Etienne Perot	2b55090a58	Do not crash when creating thread group with already-exceeded soft CPU limit. Reported-by: syzbot+da9595a72d0762aaa48d@syzkaller.appspotmail.com PiperOrigin-RevId: 699425946	2024-11-23 01:28:50 -08:00
Jamie Liu	4298980325	Disallow task creation after Kernel.WaitExited() returns. Otherwise tasks can be created via the control server between when Kernel.WaitExited() returns and when the control server is stopped, resulting in task goroutines running when Kernel.Release() is called. PiperOrigin-RevId: 658891833	2024-08-02 13:41:17 -07:00
Fabricio Voznika	d514dc4424	Track exec'ed processes and kill them after restore. Processes that are exec'ed into a container cannot be properly restored because the caller is no longer present. This change tracks processes that are exec'ed and kills them upon restore. Updates #1956 PiperOrigin-RevId: 623644184	2024-04-10 16:53:49 -07:00
dongjinlong	ba02461e12	chore: remove repetitive words in comments Signed-off-by: dongjinlong <dongjinlong@outlook.com>	2024-03-26 19:57:40 +08:00
NymanRobin	f481172b53	Convert atomic.Value to atomic.Pointer[T]	2024-03-05 11:09:23 +02:00
Etienne Perot	7480450936	Replace `Task.ptraceTracer` with `atomic.Pointer`. This removes the compare-and-branch involved in casting `ptraceTracer.Load()` to `*Task`. This is on the hot syscall path, and in the overwhelmingly-likely case that the task is not being traced, we should be as fast as possible. Shaves off about 20 nanoseconds from the syscall hot path: ``` │ before │ after │ │ sec/op │ sec/op vs base │ Syscallbench/syscall.getpidopt-8 1.355µ ± 2% 1.334µ ± 2% -1.59% (p=0.030 n=64) ``` PiperOrigin-RevId: 611277514	2024-02-28 17:06:55 -08:00
Etienne Perot	dec37ea4ed	gVisor `seccomp`: Implement in-Sentry seccomp cache. This adds a per-task cache of seccomp actions to take for syscall numbers where the filters return an action without depending on anything other than the syscall number and the architecture code of the seccomp program input. This avoids evaluating seccomp-bpf programs in the syscall hot path, for programs that use seccomp within gVisor (aka on themselves). Benchmarks show that this removes about 50ns from the syscall hot path for a trivial filter like the one in the benchmark. Real-world filters are much longer, and the benefit is magnified the more complex the filter is. ``` │ not_cached │ cached │ │ sec/op │ sec/op vs base │ SyscallUnderSeccomp 1.282µ ± 3% 1.230µ ± 1% -4.06% (p=0.002 n=6) ``` PiperOrigin-RevId: 586522068	2023-11-29 20:00:16 -08:00
Ayush Ranjan	0358995f0f	Fix linter warning in kernel.go. - Added docstrings for exported functions. - Exported kernel.userCounters since it is used by another package (testutil) and is an exported field of an exported type `kernel.TaskConfig`. PiperOrigin-RevId: 580615070	2023-11-08 12:20:17 -08:00
Andrei Vagin	5f4abad306	Fix a few typos. The idea is to run codespell as part of our presubmit checks. Before enabling it for new changes, let's fix what it has found. Signed-off-by: Andrei Vagin <avagin@gmail.com>	2023-10-25 12:13:42 -07:00
Andrei Vagin	f3b0a527c2	inet: allow to create abstract unix sockets in non-root namespaces PiperOrigin-RevId: 573253619	2023-10-13 10:20:56 -07:00
Etienne Perot	02f70b5df0	Implement a subset of `keyctl(2)` and `keyrings(7)` for better Docker support. The intention of this change is to cover a sufficient surface to accommodate the use of running Docker within gVisor, rather than a full implementation. This implements the following features: - Keys as a first-class concept in the kernel. - Tracking keys in user namespaces. - Task session keyrings: possession, inheritance. - Key permission enforcement. - The following `keyctl(2)` operations: - `KEYCTL_GET_KEYRING_ID` - `KEYCTL_DESCRIBE` - `KEYCTL_JOIN_SESSION_KEYRING` - `KEYCTL_SETPERM` Notably, this does not implement: - The ability to actually add any keys other than the session keyring (which does not hold any cryptographic key data). - Other special keyrings (thread keyring, process keyring, user session keyring, etc.). - Lots of `keyctl(2)` operations. - Key expiration. - Key garbage collection. Keys live until their user namespace is destroyed. However, each user namespace is limited to 200 keys, so memory growth is bounded. - `add_key(2)` - `request_key(2)` ... However, this makes design choices that seem odd given the limited scope of this change, but make sense when taking into account the desire to eventually accommodate them in the future. For example, there are many `switch` statements with only one option for session keyrings, which would get more options when adding support for other special keyrings. Similarly, the signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but currently only ever gets the session keyring as non-nil. PiperOrigin-RevId: 567047896	2023-09-20 12:38:39 -07:00
Jing Chen	e89e40fded	Implement setns CLONE_NEWUTS namespace type. PiperOrigin-RevId: 554306089	2023-08-06 15:33:25 -07:00
Andrei Vagin	abe7cee096	kernel: don't use atomic pointers for task.netns. task.netns is always changed from a task goroutine under task.mu. This means that we can access it without any locks from a task goroutine, and we don't need to increment a reference counter in such cases. In all other cases, we need to take task.mu. PiperOrigin-RevId: 552913323	2023-08-01 14:04:53 -07:00
Andrei Vagin	46115504ec	Implement the setns syscall. This change introduces the nsfs file system. Each new namespace allocates a new nsfs inode. Here are the reasons why we need these inodes: * each namespace has to have a unique ID. * proc/pid/ns/ contains one entry for each namespace. Bind mounting one of the files in this directory to somewhere else in the filesystem keeps the corresponding namespace alive even if all processes currently in the namespace terminate. * setns() allows the calling process to join an existing namespace specified by a file descriptor. PiperOrigin-RevId: 550694515	2023-07-24 15:45:08 -07:00
Nicolas Lacasse	e7bd1b4c9c	Implement PR_{S,G}ET_CHILD_SUBREAPER. Closes #2323 PiperOrigin-RevId: 548205854	2023-07-14 13:19:25 -07:00
Jamie Liu	f517b70ded	Pass context to kernel.TaskImage.release(). PiperOrigin-RevId: 543541608	2023-06-26 14:28:58 -07:00
Shambhavi Srivastava	90bf8f22fc	Enabling container to be initialized to its initial cgroups. Currently there is no implementation that lets the container be initialized to its own cgroups, so `EnterInitialCgroups` inherits all of the root's cgroups when the container starts. With this implementation we make sure that when the container starts it enters the cgroups that are passed down to it from the sandbox. PiperOrigin-RevId: 540650868	2023-06-15 12:07:05 -07:00
Rahat Mahmood	d0ae59368d	cgroupfs: Fix lock ordering between kernfs.Filesystem.mu and TaskSet.mu. We can't DecRef a cgroup with TaskSet.mu held as it leads to circular locking. Restructure task creation to drop cgroup refs outside the TaskSet.mu critical section. Reported-by: syzbot+16a334ab1d6873db18f2@syzkaller.appspotmail.com Reported-by: syzbot+fe1b962d430d1170e671@syzkaller.appspotmail.com PiperOrigin-RevId: 492589548	2022-12-02 16:45:40 -08:00
Rahat Mahmood	62ddad6119	cgroupfs: Fix several races with task migration. Reported-by: syzbot+0be09ce607731f085f73@syzkaller.appspotmail.com PiperOrigin-RevId: 491920581	2022-11-30 08:14:01 -08:00
Ayush Ranjan	020df37be7	Start cleaning up VFS1. PiperOrigin-RevId: 486586072	2022-11-07 00:39:54 -08:00
Rahat Mahmood	46e08207b5	cgroupfs: Handle hierarchy changes across charge/uncharge. When charging a pids cgroup during thread creation, it is possible for the hierarchy containing the pids controller to be destroyed and recreated. If thread creation fails and the charge has to be rolled back, an intervening hierarchy change previously caused a charge underflow during the rollback. If the hierarchy changes between the charge and uncharge, the uncharge is unnecessary. Reported-by: syzbot+b72cc8d190b428e43a03@syzkaller.appspotmail.com PiperOrigin-RevId: 471112484	2022-08-30 16:02:58 -07:00
Andrei Vagin	5ffcc1f799	Don't leak network namespaces PiperOrigin-RevId: 454707336	2022-06-13 15:05:21 -07:00
Andrei Vagin	dfa1c7a1c2	Automated rollback of changelist 445452190 PiperOrigin-RevId: 454236327	2022-06-10 14:01:33 -07:00
Etienne Perot	c9f8b165cf	Cache each thread group's TID within their own namespace. This avoids requiring a lock in `ThreadGroup.ID`, which in turn breaks the following lock cycle: `kernel.taskSetRWMutex` -> `kernel.taskMutex` -> `mm.metadataMutex` -> `mm.mappingRWMutex` -> `kernel.taskSetRWMutex` (Also, less locking within `createVMALocked` is probably for the better in general.) PiperOrigin-RevId: 449588573	2022-05-18 15:14:14 -07:00
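
Commits f481172b53 and 7480450936 above replace `atomic.Value` with `atomic.Pointer[T]`. The following is a minimal sketch of the difference, not the gVisor code: the `Task` type, its fields, and both accessor methods are hypothetical names chosen for illustration, and it assumes Go 1.19 or later (where `atomic.Pointer` was introduced). With `atomic.Value`, every load yields an interface that must be type-asserted; with `atomic.Pointer[Task]`, the load is already typed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Task is a hypothetical stand-in for a kernel task; it is not gVisor's type.
type Task struct {
	name string

	// tracerValue models the older style: atomic.Value holds an interface,
	// so each Load must be followed by a type assertion.
	tracerValue atomic.Value

	// tracerPtr models the newer style: atomic.Pointer[Task] is typed,
	// so Load returns *Task directly.
	tracerPtr atomic.Pointer[Task]
}

// tracerViaValue loads the tracer through atomic.Value. The comma-ok
// assertion yields nil when nothing has been stored yet.
func (t *Task) tracerViaValue() *Task {
	tracer, _ := t.tracerValue.Load().(*Task)
	return tracer
}

// tracerViaPointer loads the tracer through atomic.Pointer; no assertion
// is needed, and the zero value loads as nil.
func (t *Task) tracerViaPointer() *Task {
	return t.tracerPtr.Load()
}

func main() {
	tracer := &Task{name: "tracer"}
	tracee := &Task{name: "tracee"}

	// Untraced case: both accessors report no tracer.
	fmt.Println(tracee.tracerViaValue() == nil, tracee.tracerViaPointer() == nil)

	tracee.tracerValue.Store(tracer)
	tracee.tracerPtr.Store(tracer)
	fmt.Println(tracee.tracerViaValue().name, tracee.tracerViaPointer().name)

	// Clearing the tracer: atomic.Pointer accepts a plain nil, whereas
	// atomic.Value panics on Store(nil) and needs a typed nil instead.
	tracee.tracerValue.Store((*Task)(nil))
	tracee.tracerPtr.Store(nil)
	fmt.Println(tracee.tracerViaValue() == nil, tracee.tracerViaPointer() == nil)
}
```

The typed pointer also moves a class of mistakes to compile time: storing a value of the wrong type into `tracerPtr` does not compile, while `atomic.Value` only detects inconsistent types at run time.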


65 Commits