gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Jamie Liu	0b59173cff	kernel: only lock TaskSet.mu in Task.unstopVforkParent() if necessary PiperOrigin-RevId: 687421394	2024-10-18 14:17:05 -07:00
Jamie Liu	03bebc4402	kernel: add ThreadGroup.signalLock() This allows "remote" locking of ThreadGroup.signalHandlers.mu without needing to lock TaskSet.mu, analogously to Linux's lock_task_sighand(). This reveals a bug: kernel.Task.sendSignal[Timer]Locked() unintentionally requires TaskSet.mu to be locked since it reads Task.exitState. To fix this, use atomic memory operations on Task.exitState when required. PiperOrigin-RevId: 681128063	2024-10-01 12:48:10 -07:00
Lucas Manning	f229b3e772	Fix small logic bug with CLONE_NEWUSER|CLONE_NEWNS in clone. The new mount namespace was being created with the old user namespace, not the new one. This led to permission errors when creating new mounts. PiperOrigin-RevId: 676489580	2024-09-19 11:24:15 -07:00
Fabricio Voznika	d514dc4424	Track exec'ed processes and kill them after restore Processes that are exec'ed into a container cannot be properly restored because the caller is no longer present. This change tracks processes that are exec'ed and kills them upon restore. Updates #1956 PiperOrigin-RevId: 623644184	2024-04-10 16:53:49 -07:00
NymanRobin	f481172b53	Convert atomic.Value to atomic.Pointer[T]	2024-03-05 11:09:23 +02:00
Etienne Perot	dec37ea4ed	gVisor `seccomp`: Implement in-Sentry seccomp cache. This adds a per-task cache of seccomp actions to take for syscall numbers where the filters return an action without depending on anything other than the syscall number and the architecture code of the seccomp program input. This avoids evaluating seccomp-bpf programs in the syscall hot path, for programs that use seccomp within gVisor (aka on themselves). Benchmarks show that this removes about 50ns from the syscall hot path for a trivial filter like the one in the benchmark. Real-world filters are much longer, and the benefit is magnified the more complex the filter is. ``` │ not_cached │ cached │ │ sec/op │ sec/op vs base │ SyscallUnderSeccomp 1.282µ ± 3% 1.230µ ± 1% -4.06% (p=0.002 n=6) ``` PiperOrigin-RevId: 586522068	2023-11-29 20:00:16 -08:00
Andrei Vagin	f3b0a527c2	inet: allow to create abstract unix sockets in non-root namespaces PiperOrigin-RevId: 573253619	2023-10-13 10:20:56 -07:00
Etienne Perot	02f70b5df0	Implement a subset of `keyctl(2)` and `keyrings(7)` for better Docker support. The intention of this change is to cover a sufficient surface to accommodate the use of running Docker within gVisor, rather than a full implementation. This implements the following features: - Keys as a first-class concept in the kernel. - Tracking keys in user namespaces. - Task session keyrings: possession, inheritance. - Key permission enforcement. - The following `keyctl(2)` operations: - `KEYCTL_GET_KEYRING_ID` - `KEYCTL_DESCRIBE` - `KEYCTL_JOIN_SESSION_KEYRING` - `KEYCTL_SETPERM` Notably, this does not implement: - The ability to actually add any keys other than the session keyring (which does not hold any cryptographic key data). - Other special keyrings (thread keyring, process keyring, user session keyring, etc.). - Lots of `keyctl(2)` operations. - Key expiration. - Key garbage collection. Keys live until their user namespace is destroyed. However, each user namespace is limited to 200 keys, so memory growth is bounded. - `add_key(2)` - `request_key(2)` ... However, this makes design choices that seem odd given the limited scope of this change, but make sense when taking into account the desire to eventually accommodate them in the future. For example, there are many `switch` statements with only one option for session keyrings, which would get more options when adding support for other special keyrings. Similarly, the signature of `PossessedKeys` takes in all 3 special "possessed" keyrings, but currently only ever gets the session keyring as non-nil. PiperOrigin-RevId: 567047896	2023-09-20 12:38:39 -07:00
Shambhavi Srivastava	8623c872ce	Automated rollback of changelist 557871250 PiperOrigin-RevId: 560158129	2023-08-25 11:58:29 -07:00
Andrei Vagin	eb6b3ac00b	vfs: MountNamespace.Root() has to return a top mount of / A few mounts can be mounted on top of `/`. PiperOrigin-RevId: 558264274	2023-08-18 15:35:47 -07:00
Nicolas Lacasse	9be6f98612	Automated rollback of changelist 554554034 PiperOrigin-RevId: 557871250	2023-08-17 10:50:10 -07:00
Shambhavi Srivastava	21d66119b7	Implementing clone3 Updates #8585 PiperOrigin-RevId: 554554034	2023-08-07 12:19:32 -07:00
Jing Chen	e89e40fded	Implement setns CLONE_NEWUTS namespace type. PiperOrigin-RevId: 554306089	2023-08-06 15:33:25 -07:00
Andrei Vagin	abe7cee096	kernel: don't use atomic pointers for task.netns task.netns is always changed from a task goroutine under task.mu. This means that we can access it without any locks from a task goroutine, and we don't need to increment a reference counter in such cases. In all other cases, we need to take task.mu. PiperOrigin-RevId: 552913323	2023-08-01 14:04:53 -07:00
Andrei Vagin	aa2c8c33c6	Implement setns for mount namespaces PiperOrigin-RevId: 552859231	2023-08-01 11:12:29 -07:00
Andrei Vagin	41bb04c149	Implement mount namespaces This change implements only the basic functions of mount namespaces. All features that depend on user namespaces will be implemented separately. PiperOrigin-RevId: 552673896	2023-07-31 21:12:21 -07:00
Jing Chen	7f067c7e1d	Implement setns CLONE_NEWIPC namespace type. PiperOrigin-RevId: 552619565	2023-07-31 16:12:45 -07:00
Andrei Vagin	46115504ec	Implement the setns syscall This change introduces the nsfs file system. Each new namespace allocates a new nsfs inode. Here are the reasons why we need these inodes: * each namespace has to have a unique ID. * proc/pid/ns/ contains one entry for each namespace. Bind mounting one of the files in this directory to somewhere else in the filesystem keeps the corresponding namespace alive even if all processes currently in the namespace terminate. * setns() allows the calling process to join an existing namespace specified by a file descriptor. PiperOrigin-RevId: 550694515	2023-07-24 15:45:08 -07:00
Jamie Liu	f517b70ded	Pass context to kernel.TaskImage.release(). PiperOrigin-RevId: 543541608	2023-06-26 14:28:58 -07:00
Andrei Vagin	fedbf08401	kernel: unshare a network namespace without taking Task.mu t.netns is an atomic pointer so it should not be a problem for readers. As for writers, only task can change its network namespace. Reported-by: syzbot+8e29d377b851dcdfbca2@syzkaller.appspotmail.com Reported-by: syzbot+9ffa998047fce0c57473@syzkaller.appspotmail.com Reported-by: syzbot+c1c75367b97f5e31a12f@syzkaller.appspotmail.com PiperOrigin-RevId: 542384765	2023-06-21 15:51:28 -07:00
Nicolas Lacasse	028cf757bb	Clarify comment about copying Task.image in Task.Clone(). PiperOrigin-RevId: 510829046	2023-02-19 10:43:46 -08:00
Nicolas Lacasse	ff32cb8b25	Hold t.mu while accessing t.image. We copy t.image with t.mu held, to avoid holding the lock while calling t.image.Fork. PiperOrigin-RevId: 510698814	2023-02-18 13:12:02 -08:00
Andrei Vagin	aeabb78527	Allow to return an error from PullFullState. PiperOrigin-RevId: 504415294	2023-01-24 17:15:17 -08:00
Lucas Manning	a248c63cd5	Fix circular lock between filesystemRWMutex and taskSetRWMutex. PiperOrigin-RevId: 500747195	2023-01-09 10:24:35 -08:00
Ayush Ranjan	1fa3c06f1e	Delete VFS1 completely. - Delete pkg/sentry/fs/*. - Move pkg/sentry/fs/fsutil out of VFS1 directory and remove VFS1 components. - Remove remaining unused references to VFS1 from remaining codebase. - Rename/refactor code to avoid even referencing VFS2, unless necessary. - Rewrite VFS1-only tests to VFS2. Updates #1624 PiperOrigin-RevId: 490064269	2022-11-21 13:57:52 -08:00

82 Commits