gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Jimmy Tran	17563a8af9	Return EACCES when calling setpgid() after execve(). From the setpgid man page: "EACCES - An attempt was made to change the process group ID of one of the children of the calling process and the child had already performed an execve(2) (setpgid(), setpgrp())." This CL makes gVisor implement this rule and updates the exec test suite accordingly. TESTED: http://sponge2/7f364e8a-4f82-463e-ba62-79234c4d054d PiperOrigin-RevId: 727095560	2025-02-14 16:14:14 -08:00
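The rule quoted in commit 17563a8af9 can be sketched as follows. This is only an illustration under stated assumptions: the `proc` type, the `hasExeced` flag, and the `setpgid` helper are hypothetical names, not gVisor's actual kernel types, and every other setpgid check is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errAccess stands in for EACCES in this sketch.
var errAccess = errors.New("EACCES")

// proc is a hypothetical, minimal process record; it is not gVisor's Task type.
type proc struct {
	pid       int
	parent    *proc
	pgid      int
	hasExeced bool // set once the process has called execve(2)
}

// setpgid applies only the rule quoted from the man page: a caller may not
// change the process group of one of its children once that child has
// performed execve(2). Session and group-existence checks are omitted.
func setpgid(caller, target *proc, pgid int) error {
	if target.parent == caller && target.hasExeced {
		return errAccess
	}
	target.pgid = pgid
	return nil
}

func main() {
	parent := &proc{pid: 1, pgid: 1}
	child := &proc{pid: 2, parent: parent, pgid: 1}

	fmt.Println(setpgid(parent, child, 2)) // child has not exec'd yet
	child.hasExeced = true
	fmt.Println(setpgid(parent, child, 1)) // rejected after execve
}
```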
Nicolas Lacasse	0daeb1c40b	pread/writev: Copy in iovecs outside of task.mu. Copying in the iovecs requires acquiring mm.mappingRWMutex, which is above task.mu in the lock ordering. Instead of copying with task.CopyContext, we perform the copy with MemoryManager.Copy{In,Out}. The MemoryManagers are 'pinned' with IncUser() for the duration of the copy operations. PiperOrigin-RevId: 725325325	2025-02-10 13:25:06 -08:00
Yhinner	191b53da2a	Fix EXEC permission of the volume mount when calling mmap with PROT_EXEC	2025-01-27 18:59:59 +00:00
Nicolas Lacasse	c238e15234	Fix validation of close_range `last` fd argument. The `last` fd argument can be up to max uint32, and some applications call it with this maximum: https://github.com/GNOME/glib/blob/26bc1d08ec574b387ff4bcd919a020a586727bbf/glib/glib-unix.c#L890 PiperOrigin-RevId: 718526878	2025-01-22 14:25:24 -08:00
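One way to read the close_range fix in c238e15234 is that a very large `last` should be clamped rather than rejected. The sketch below assumes that reading; `closeRangeBounds` and its `highestFD` parameter are hypothetical names, and the real validation in gVisor may differ.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errInvalid stands in for EINVAL in this sketch.
var errInvalid = errors.New("EINVAL")

// closeRangeBounds is a hypothetical helper. It rejects first > last, which
// close_range(2) documents as EINVAL, and otherwise clamps last to the
// highest descriptor that could be open, so that last == math.MaxUint32 is
// accepted. nonEmpty reports whether any descriptor falls in the range.
func closeRangeBounds(first, last, highestFD uint32) (lo, hi uint32, nonEmpty bool, err error) {
	if first > last {
		return 0, 0, false, errInvalid
	}
	if last > highestFD {
		last = highestFD // clamp instead of rejecting a large value
	}
	if first > last {
		return 0, 0, false, nil // valid call, but nothing to close
	}
	return first, last, true, nil
}

func main() {
	lo, hi, ok, err := closeRangeBounds(3, math.MaxUint32, 1023)
	fmt.Println(lo, hi, ok, err)
}
```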
Andrei Vagin	1864d9d091	Untag user addresses before handling them in the Sentry. Top-Byte-Ignore (TBI) is a feature on all ARMv8.0 CPUs that causes the top byte of virtual addresses to be ignored on loads and stores. Instead, bit 55 is extended over bits 56-63 before address translation. This feature allows use of the (ignored) top byte as a tag or for other in-band metadata. In Linux, brk()/mmap()/mremap() syscalls don't untag addresses. More details are in dcde237319e6 ("mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()"). PiperOrigin-RevId: 715885990	2025-01-15 11:52:40 -08:00
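The untagging step described in 1864d9d091 (bit 55 extended over bits 56-63) amounts to a sign extension from bit 55. A minimal sketch follows; the function name `untag` is an assumption for illustration and is not taken from the gVisor source.

```go
package main

import "fmt"

// untag clears a Top-Byte-Ignore tag by sign-extending bit 55 over bits
// 56-63. Shifting left by 8 moves bit 55 into the sign position; the
// arithmetic right shift on int64 then copies it back over the top byte.
func untag(addr uint64) uint64 {
	return uint64(int64(addr<<8) >> 8)
}

func main() {
	// Tagged user address: bit 55 is 0, so the top byte becomes 0x00.
	fmt.Printf("%#x\n", untag(0xAB00_7FFF_1234_5678)) // 0x7fff12345678
	// Bit 55 is 1, so the top byte becomes 0xff.
	fmt.Printf("%#x\n", untag(0x12FF_FFFF_0000_0000)) // 0xffffffff00000000
}
```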
gVisor bot	4971756d8d	Internal change. PiperOrigin-RevId: 702177156	2024-12-02 20:23:18 -08:00
Jamie Liu	cf5841ba66	mm: implement prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) PiperOrigin-RevId: 696727156	2024-11-14 19:06:07 -08:00
Jamie Liu	e23347e5b5	Move //pkg/sentry/kernel/time to //pkg/sentry/ktime. This avoids needing to rename it everywhere it's imported. PiperOrigin-RevId: 693930089	2024-11-06 18:13:51 -08:00
Andrei Vagin	3b28deddf4	sentry/syscalls: update docs for the unshare syscall. Mount and network namespaces are supported. PiperOrigin-RevId: 663442921	2024-08-15 14:04:15 -07:00
Kevin Krakauer	e39ed91daa	sentry: support NULL mount source. NULL mount sources can be valid, e.g. when mounting "proc". PiperOrigin-RevId: 653769241	2024-07-18 15:12:40 -07:00
Jamie Liu	a5573312e0	Add explicit huge page and memory recycling support to pgalloc.MemoryFile.

This CL addresses the following major issues:

- When an application releases memory to the sentry, the sentry unconditionally releases that memory to the host, rather than allowing it to be reused for future allocations, in order to ensure that new allocations are uniformly decommitted (use no memory): cl/145016083. In most cases, this should have relatively little performance impact; since releasing memory from the application to the OS is expensive even outside of gVisor, application memory allocators optimizing for performance already limit the rate at which they release memory to the OS. However, in applications that involve frequent process creation and exit (e.g. build systems), this practice prevents reuse of memory deallocated by exiting processes for memory allocated by new processes, resulting in both performance degradation and a spike in memory usage (since the sentry may not have released all deallocated memory to the host by the time new allocations occur).
- gVisor's historical approach to application THP is based on THP being enabled on a per-memfd basis, using the MFD_HUGEPAGE flag not merged into the upstream Linux kernel (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/). Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory without requiring the system to enable THP for all tmpfs files and memfds (by setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or "force").
- Both MM and the application page allocator (pgalloc) are agnostic as to whether the underlying memory file will be THP-backed. Instead, both attempt to align hugepage-sized and larger allocations to hugepage boundaries, such that if the memory file happens to support THP then such allocations will be appropriately aligned to use THP. This is suboptimal since many allocations do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware virtualization, where acquiring memory from the host is significantly more expensive due to EPT/NPT fault overhead; when effective, THP reduces the frequency with which said cost is incurred by a factor of 512, and page reuse avoids incurring it at all.

Thus:

- Instead of inferring whether THP use is desired from allocation size, indicate this explicitly as AllocOpts.Huge, and only set it to true for allocations for non-stack private anonymous mappings.
- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode that indicates that the caller will commit all pages in the allocation. In such cases, pgalloc can reuse deallocated pages without risking increased memory usage, internally referred to as "recycling". AllocateCallerIndirectCommit is used primarily for page faults on a THP-backed region. (It is also used for single-page allocations on non-THP backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned ranges, this is relatively uncommon.)
- Allow different chunks in pgalloc.MemoryFile's backing file to have varying THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.
- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking deallocated pages for small/huge-page-backed regions respectively; two sets tracking in-use pages for small/huge-page-backed regions respectively; and a fifth set tracking memory accounting state.
- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for pgalloc tests, but may also be applicable to disk-backed MemoryFiles.

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled, described in updateUsageLocked(), was based on the condition that MemoryFile.mu would be locked throughout the call to updateUsageLocked(), which was invalidated by cl/337865250.
- Remove MemoryFileOpts.ManualZeroing, which is unused.
- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim" in Linux has a significantly different meaning (essentially "eviction" in pgalloc), and the latter seems to be conventional in user-mode memory allocators.

Using THP for application memory requires setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to allow runsc to request THP from the kernel.

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as possible, limiting the effectiveness of page recycling. A following CL adds optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below, "baseline" is without this CL, "expt" is with this CL, and "expt2" is with this CL + reclaim throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP without forcing it on all host shmem users. In conjunction with recycling (which has a relatively small effect on systrap since it does not use hardware virtualization), THP use slightly improves performance, although whether this is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │                expt                │               expt2                │
                                                │   sec/op   │   sec/op    vs base                │   sec/op    vs base                │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%        ~ (p=0.947 n=30)   38.84 ± 3%        ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%        ~ (p=0.057 n=30)   36.83 ± 5%        ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%        ~ (p=0.350 n=30)   38.58 ± 4%        ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%   -4.64% (p=0.026 n=30)   36.58 ± 4%        ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%        ~ (p=0.485 n=30)   40.09 ± 5%        ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%        ~ (p=0.307 n=30)   38.08 ± 1%        ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%        ~ (p=0.752 n=30)   38.59 ± 3%        ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%        ~ (p=0.708 n=30)   36.08 ± 6%        ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%        ~ (p=0.138 n=30)   211.2 ± 1%        ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%        ~ (p=0.542 n=30)   209.7 ± 1%        ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%        ~ (p=0.423 n=30)   210.0 ± 1%        ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%        ~ (p=0.219 n=30)   209.5 ± 1%        ~ (p=0.230 n=30)
geomean                                           67.62        66.97        -0.96%                  67.12        -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```

PiperOrigin-RevId: 647771821	2024-06-28 12:56:46 -07:00
Nicolas Lacasse	03b5480f8c	getdents size argument must fit in int32. PiperOrigin-RevId: 633251720	2024-05-13 10:20:13 -07:00
Jing Chen	e7b59aa1b6	Implement Getxattr for directfs and lisafs. It allows us to read the executable binary's security.capability. PiperOrigin-RevId: 611364676	2024-02-29 00:01:31 -08:00
Jamie Liu	162da5f040	Add kernel/time.Timer.SetClock() and use it on kernel.Task.blockingTimer. PiperOrigin-RevId: 611185267	2024-02-28 12:21:12 -08:00
Ayush Ranjan	fa0adadccb	Refactor BlockWithTimer() to reduce duplicated code. The users of Task.BlockWithTimer() are doing the same work. PiperOrigin-RevId: 610358579	2024-02-26 03:55:27 -08:00
Andrei Vagin	aa1a66353a	process_vm_{read,write}v returns EFAULT if iov-s describe inaccessible memory. PiperOrigin-RevId: 607482542	2024-02-15 15:58:29 -08:00
Andrei Vagin	28a8d3d3f3	Don't return EINTR from process_vm_{write,read}v syscalls. This just matches the Linux behavior. PiperOrigin-RevId: 607463895	2024-02-15 14:54:13 -08:00
Etienne Perot	9defeeaf09	`seccomp`: Check that programs that are too large are rejected. This is good for three reasons: - Linux also rejects programs larger than 4,096 (`BPF_MAXINSNS`) instructions. - This avoids making an allocation of unbounded length on the next line. - This avoids the pitfall where gVisor may spend CPU doing BPF bytecode optimizations, which can be worse than O(n), on a program which is unboundedly large. By checking the size of the BPF program before applying bytecode optimizations, this DoS vector is nullified. PiperOrigin-RevId: 604523997	2024-02-05 21:09:42 -08:00
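The ordering argued for in 9defeeaf09 (check the program length before allocating or optimizing) might look like the sketch below. The names `insn`, `loadProgram`, and `fetch` are hypothetical, and the use of EINVAL as the rejection error is an assumption; the commit message only says that oversized programs are rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// bpfMaxInsns mirrors Linux's BPF_MAXINSNS limit of 4,096 instructions.
const bpfMaxInsns = 4096

// errInvalid stands in for the rejection error; EINVAL is an assumption here.
var errInvalid = errors.New("EINVAL")

// insn follows the layout of a classic BPF instruction (struct sock_filter).
type insn struct {
	Code uint16
	JT   uint8
	JF   uint8
	K    uint32
}

// loadProgram is a hypothetical loader. The length check comes first, so the
// allocation below is bounded and no later pass ever sees an oversized program.
func loadProgram(declaredLen uint32, fetch func(i uint32) insn) ([]insn, error) {
	if declaredLen > bpfMaxInsns {
		return nil, errInvalid
	}
	prog := make([]insn, declaredLen) // at most bpfMaxInsns entries
	for i := range prog {
		prog[i] = fetch(uint32(i))
	}
	return prog, nil
}

func main() {
	// BPF_RET|BPF_K returning SECCOMP_RET_ALLOW.
	allow := func(uint32) insn { return insn{Code: 0x06, K: 0x7fff0000} }

	prog, err := loadProgram(1, allow)
	fmt.Println(len(prog), err)

	_, err = loadProgram(bpfMaxInsns+1, allow)
	fmt.Println(err)
}
```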
Jamie Liu	d307bff77a	Allow process_vm_{readv,writev} to target non-group-leader threads. Also allow it to target exiting threads, which is consistently observable via e.g. PTRACE_EVENT_EXIT; check remoteTask.MemoryManager() with remoteTask.mu locked instead, which is consistent with Linux's mm/process_vm_access.c:process_vm_rw_core() => kernel/fork.c:mm_access() and avoids racing with remote task exit. PiperOrigin-RevId: 599943619	2024-01-19 14:51:42 -08:00
Andrei Vagin	5b33e4a3d8	Enable leak checkers for runsc tests. Updates #4572. PiperOrigin-RevId: 597307765	2024-01-10 11:30:58 -08:00
prof awk	4d30f2c9ef	use new clear builtin to clear bufs	2023-11-27 19:43:25 +02:00
Ayush Ranjan	980de72deb	Call FileDescription.OnClose() for newfd being replaced in dup2 and dup3. The dup(2) man page specifies: "If the file descriptor newfd was previously open, it is closed before being reused; the close is performed silently (i.e., any errors during the close are not reported by dup2())." Even though we were DecRef-ing and hence releasing the replaced FD, we were not calling OnClose(). Compare fs/file.c:do_dup2() -> filp_close(tofree), which in turn calls filp_flush(). In gVisor, FileDescription.OnClose() analogously does such flush operations. PiperOrigin-RevId: 583147682	2023-11-16 13:38:17 -08:00
Andrei Vagin	68cdc88378	Implement the fs.nr_open sysctl. fs/nr_open limits the maximum size of fdtable-s. PiperOrigin-RevId: 580795874	2023-11-08 23:41:32 -08:00
Andrei Vagin	9bfd408753	syscall: process_vm_* copies data by chunks. First, it avoids allocating a large buffer, which can be costly. Second, it allows the system call to be interrupted when a signal arrives. PiperOrigin-RevId: 580721720	2023-11-08 17:59:08 -08:00
Nicolas Lacasse	aeaee71669	setsid() should return the session id. PiperOrigin-RevId: 579011508	2023-11-02 16:24:12 -07:00


514 Commits