gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Ayush Ranjan	f06d4e7ebe	goferfs: Add S/R support for open FDs to deleted files. This support is only needed when the gofer mount in question is writable. By default, the rootfs has an overlayfs applied, so the gofer lower layer is not writable. But if you are using --overlay2=none, then this change should allow you to save a sandbox with open FDs to deleted files in rootfs. Updates #11425 PiperOrigin-RevId: 733021267	2025-03-03 12:38:10 -08:00
Andrei Vagin	f010ae01ac	Fix a few typos	2025-01-29 21:16:51 -08:00
Jamie Liu	cf5841ba66	mm: implement prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) PiperOrigin-RevId: 696727156	2024-11-14 19:06:07 -08:00
Jamie Liu	a5573312e0	Add explicit huge page and memory recycling support to pgalloc.MemoryFile.

This CL addresses the following major issues:

- When an application releases memory to the sentry, the sentry unconditionally releases that memory to the host, rather than allowing it to be reused for future allocations, in order to ensure that new allocations are uniformly decommitted (use no memory): cl/145016083. In most cases, this should have relatively little performance impact; since releasing memory from the application to the OS is expensive even outside of gVisor, application memory allocators optimizing for performance already limit the rate at which they release memory to the OS. However, in applications that involve frequent process creation and exit (e.g. build systems), this practice prevents reuse of memory deallocated by exiting processes for memory allocated by new processes, resulting in both performance degradation and a spike in memory usage (since the sentry may not have released all deallocated memory to the host by the time new allocations occur).
- gVisor's historical approach to application THP is based on THP being enabled on a per-memfd basis, using the MFD_HUGEPAGE flag not merged into the upstream Linux kernel (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/). Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory without requiring the system to enable THP for all tmpfs files and memfds (by setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or "force").
- Both MM and the application page allocator (pgalloc) are agnostic as to whether the underlying memory file will be THP-backed. Instead, both attempt to align hugepage-sized and larger allocations to hugepage boundaries, such that if the memory file happens to support THP then such allocations will be appropriately aligned to use THP. This is suboptimal since many allocations do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware virtualization, where acquiring memory from the host is significantly more expensive due to EPT/NPT fault overhead; when effective, THP reduces the frequency with which said cost is incurred by a factor of 512, and page reuse avoids incurring it at all.

Thus:

- Instead of inferring whether THP use is desired from allocation size, indicate this explicitly as AllocOpts.Huge, and only set it to true for allocations for non-stack private anonymous mappings.
- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode that indicates that the caller will commit all pages in the allocation. In such cases, pgalloc can reuse deallocated pages without risking increased memory usage, internally referred to as "recycling". AllocateCallerIndirectCommit is used primarily for page faults on a THP-backed region. (It is also used for single-page allocations on non-THP backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned ranges, this is relatively uncommon.)
- Allow different chunks in pgalloc.MemoryFile's backing file to have varying THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.
- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking deallocated pages for small/huge-page-backed regions respectively; two sets tracking in-use pages for small/huge-page-backed regions respectively; and a fifth set tracking memory accounting state.
- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for pgalloc tests, but may also be applicable to disk-backed MemoryFiles.

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled, described in updateUsageLocked(), was based on the condition that MemoryFile.mu would be locked throughout the call to updateUsageLocked(), which was invalidated by cl/337865250.
- Remove MemoryFileOpts.ManualZeroing, which is unused.
- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim" in Linux has a significantly different meaning (essentially "eviction" in pgalloc), and the latter seems to be conventional in user-mode memory allocators.

Using THP for application memory requires setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to allow runsc to request THP from the kernel.

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as possible, limiting the effectiveness of page recycling. A following CL adds optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below, "baseline" is without this CL, "expt" is with this CL, and "expt2" is with this CL + reclaim throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP without forcing it on all host shmem users. In conjunction with recycling (which has a relatively small effect on systrap since it does not use hardware virtualization), THP use slightly improves performance, although whether this is measurable is case-dependent.
On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │                expt                │               expt2                │
                                                │   sec/op   │   sec/op    vs base                │   sec/op    vs base                │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%        ~ (p=0.947 n=30)   38.84 ± 3%        ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%        ~ (p=0.057 n=30)   36.83 ± 5%        ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%        ~ (p=0.350 n=30)   38.58 ± 4%        ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%   -4.64% (p=0.026 n=30)   36.58 ± 4%        ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%        ~ (p=0.485 n=30)   40.09 ± 5%        ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%        ~ (p=0.307 n=30)   38.08 ± 1%        ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%        ~ (p=0.752 n=30)   38.59 ± 3%        ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%        ~ (p=0.708 n=30)   36.08 ± 6%        ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%        ~ (p=0.138 n=30)   211.2 ± 1%        ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%        ~ (p=0.542 n=30)   209.7 ± 1%        ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%        ~ (p=0.423 n=30)   210.0 ± 1%        ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%        ~ (p=0.219 n=30)   209.5 ± 1%        ~ (p=0.230 n=30)
geomean                                           67.62        66.97        -0.96%                  67.12        -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```

PiperOrigin-RevId: 647771821	2024-06-28 12:56:46 -07:00
Jamie Liu	2069e8643b	test: add PageTableLeak test PiperOrigin-RevId: 643437188	2024-06-14 13:27:13 -07:00
Ayush Ranjan	4bdb7e9fa0	Allow kernel versions with a + suffix in ParseKernelVersion() This is needed to support kernel versions like 6.1.90+. PiperOrigin-RevId: 640936154	2024-06-06 10:10:40 -07:00
gVisor bot	edbc2af9f7	Override operator new and delete in tests This is necessary to ensure errno is not updated while allocating. Allocators are allowed to update errno, even in case of success. As gvisor uses matchers to check the value of errno, the tests might fail if errno is overridden by an allocation done while building the matcher. Using a custom implementation of new and delete ensures this is not the case. PiperOrigin-RevId: 619238390	2024-03-26 10:39:52 -07:00
gVisor bot	e1ffb14778	Restore errno around allocation in test matchers Allocations are not guaranteed to preserve errno, even in case of success. Because the test matchers test against errno, preserve errno when allocating new matchers. PiperOrigin-RevId: 618437007	2024-03-23 05:58:23 -07:00
Jamie Liu	075a9df798	Fix async-signal-unsafety in mount test. - JoinPath returns a std::string and can therefore heap-allocate. - exit(3) is async-signal-unsafe since it executes arbitrary functions registered by atexit(3) / on_exit(3). - Before this CL, TempPath::path() returns a std::string (by value) and can therefore heap-allocate. PiperOrigin-RevId: 615143819	2024-03-12 13:05:16 -07:00
gVisor bot	c6b06ab1a5	Internal change. PiperOrigin-RevId: 612510308	2024-03-04 11:04:01 -08:00
Jamie Liu	a88e82fa4a	Deflake cpuacct cgroup test. - Cgroup::PollControlFileForChange() is only used by cpuacct tests to check that CPU usage for the test's containing cgroups increases over time. In this context, sleeping between checks is counterproductive because the test uses no CPU while sleeping. Remove the sleep. - Don't assume that the root cgroup's usage is initially non-zero, due to granularity issues. PiperOrigin-RevId: 611594418	2024-02-29 14:24:07 -08:00
gVisor bot	53d2b511e7	Change remaining test targets to use select_gtest() to choose the gtest target Should be a no-op as select_gtest() returns the same target as before. PiperOrigin-RevId: 607788473	2024-02-16 13:45:58 -08:00
Lucas Manning	8053cd8f0b	Add mount locking. Mounts that come from a more privileged namespace must be locked so that they cannot be unmounted from a less privileged namespace. This is an important consequence of having mount namespaces + mount propagation. See https://man7.org/linux/man-pages/man7/mount_namespaces.7.html for full detail. PiperOrigin-RevId: 597322843	2024-01-10 12:25:38 -08:00
Bruno Dal Bo	5c41ffabdb	Fix select call in socket utils From the `select` manpage: ``` nfds This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1... ``` PiperOrigin-RevId: 595697728	2024-01-04 07:24:31 -08:00
Bruno Dal Bo	9ea77eb8bc	Allow infinite timeout in helpers PiperOrigin-RevId: 595462226	2024-01-03 11:52:03 -08:00
Andrei Vagin	d039776e16	test: fix SetupTimeWaitClose to create a time wait bucket reliably Linux can close a tcp connection without a time wait bucket if fin packets come from both sides close to each other. Let's send some data after the first fin packet to be sure that it is acked before sending fin from another side. PiperOrigin-RevId: 575002422	2023-10-19 14:29:06 -07:00
Ayush Ranjan	6a25f2ebb2	Remove duplicate RandomizeBuffer implementation from socket util. Instead use the one from test_util, which uses a more random seed for rand_r(). Suggested-by: Jamie Liu <jamieliu@google.com> Suggested-by: Andrei Vagin <avagin@google.com> PiperOrigin-RevId: 574243850	2023-10-17 13:15:22 -07:00
gVisor bot	ae1294b435	Internal change. PiperOrigin-RevId: 567370524	2023-09-21 11:42:16 -07:00
Lucas Manning	c2a7efe6a2	Clean up mount tests. This change makes it simpler to write mount tests that examine the optional field of mountinfo. It also cleans up the mounts in existing tests so they don't pollute mountinfo after exit. PiperOrigin-RevId: 561439839	2023-08-30 13:56:03 -07:00
Jeff Martin	8de4ec70bd	Internal change. PiperOrigin-RevId: 556842126	2023-08-14 10:45:30 -07:00
Nicolas Lacasse	e7bd1b4c9c	Implement PR_{S,G}ET_CHILD_SUBREAPER. Closes #2323 PiperOrigin-RevId: 548205854	2023-07-14 13:19:25 -07:00
Nick Brown	a435ed7c09	Set kOLargeFile for RISC-V architecture This is required to get these tests building on a Fuchsia + RISC-V architecture. The constant's value was extracted from: https://github.com/riscvarchive/riscv-musl/blob/3fe7e2c75df78eef42dcdc352a55757729f451e2/arch/riscv64/bits/fcntl.h#L16 PiperOrigin-RevId: 540340091	2023-06-14 12:03:54 -07:00
Alex Konradi	9a2f1d041a	Add a test for receiving UDP packet with src port 0 Add a test that validates that when a UDP packet is received with a source port value of 0, it is delivered to a listening socket. PiperOrigin-RevId: 536773375	2023-05-31 11:31:05 -07:00
gVisor bot	d860782ada	Merge pull request #8679 from avagin:cpp-warnings PiperOrigin-RevId: 523157922	2023-04-10 10:59:51 -07:00
Andrei Vagin	a699bc8a39	test: use the init_module syscall to trigger save/restore create_module isn't defined on arm64 PiperOrigin-RevId: 521866613	2023-04-04 14:32:09 -07:00
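
Commit 4bdb7e9fa0 in the list above makes ParseKernelVersion() accept release strings with a trailing +, such as 6.1.90+. The log does not show the actual implementation, so the following is only a minimal Python sketch of the idea under stated assumptions: the names parse_kernel_version and KernelVersion are hypothetical, and the rule that any suffix beginning with + or - is ignored is an assumption for illustration, not the behaviour confirmed by the commit.

```python
import re
from typing import NamedTuple


class KernelVersion(NamedTuple):
    """Hypothetical container for a parsed kernel release."""
    major: int
    minor: int
    micro: int


# Assumption: "major.minor[.micro]" optionally followed by a suffix that
# starts with '+' or '-' (for example "6.1.90+" or "5.15.0-91-generic").
_RELEASE_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:[+-].*)?")


def parse_kernel_version(release: str) -> KernelVersion:
    """Parse a uname-style release string, ignoring a '+'/'-' suffix."""
    match = _RELEASE_RE.fullmatch(release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, micro = match.groups()
    # A missing micro component is treated as 0 in this sketch.
    return KernelVersion(int(major), int(minor), int(micro or 0))
```

Under these assumptions, "6.1.90+" parses to the same version as "6.1.90", while a string with unexpected trailing characters such as "6.1.90abc" is rejected rather than silently truncated.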


190 Commits
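
The message of commit a5573312e0 states that THP reduces the frequency of EPT/NPT fault cost "by a factor of 512". The arithmetic behind that figure can be sketched as follows, assuming the common x86-64 sizes of 4 KiB for a small page and 2 MiB for a huge page (the commit message does not spell these out). The helper faults_needed is hypothetical and assumes one fault per page when a region is first touched.

```python
SMALL_PAGE_BYTES = 4 << 10  # 4 KiB, assumed small page size
HUGE_PAGE_BYTES = 2 << 20   # 2 MiB, assumed huge page size


def faults_needed(region_bytes: int, page_bytes: int) -> int:
    """Pages (and so first-touch faults) needed to back a region.

    Uses ceiling division so a partially used page still counts.
    """
    return -(-region_bytes // page_bytes)


region = 1 << 30  # a 1 GiB region, chosen only for illustration
small_faults = faults_needed(region, SMALL_PAGE_BYTES)
huge_faults = faults_needed(region, HUGE_PAGE_BYTES)
reduction = small_faults // huge_faults
```

With these sizes, one huge page covers the same range as 512 small pages, which is where the factor comes from.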