gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Jamie Liu	2e6cfa72f2	ktime: support varying Timer implementations

- Rename Timer to SampledTimer.
- Move all Clock methods except Now to new interface SampledClock.
- Move SampledTimer's exported methods (except SetClock) to new interface Timer. Combine Swap and SwapAnd into Set to reduce the number of redundant methods that must be implemented.
- Add interface method Clock.NewTimer.

This is in preparation for cl/693856539, which adds a second Timer implementation.

PiperOrigin-RevId: 694299679	2024-11-07 17:19:25 -08:00
Jamie Liu	e23347e5b5	Move //pkg/sentry/kernel/time to //pkg/sentry/ktime. This avoids needing to rename it everywhere it's imported. PiperOrigin-RevId: 693930089	2024-11-06 18:13:51 -08:00
Jamie Liu	a5573312e0	Add explicit huge page and memory recycling support to pgalloc.MemoryFile.

This CL addresses the following major issues:

- When an application releases memory to the sentry, the sentry unconditionally releases that memory to the host, rather than allowing it to be reused for future allocations, in order to ensure that new allocations are uniformly decommitted (use no memory): cl/145016083. In most cases, this should have relatively little performance impact; since releasing memory from the application to the OS is expensive even outside of gVisor, application memory allocators optimizing for performance already limit the rate at which they release memory to the OS. However, in applications that involve frequent process creation and exit (e.g. build systems), this practice prevents reuse of memory deallocated by exiting processes for memory allocated by new processes, resulting in both performance degradation and a spike in memory usage (since the sentry may not have released all deallocated memory to the host by the time new allocations occur).
- gVisor's historical approach to application THP is based on THP being enabled on a per-memfd basis, using the MFD_HUGEPAGE flag not merged into the upstream Linux kernel (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/). Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory without requiring the system to enable THP for all tmpfs files and memfds (by setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or "force").
- Both MM and the application page allocator (pgalloc) are agnostic as to whether the underlying memory file will be THP-backed. Instead, both attempt to align hugepage-sized and larger allocations to hugepage boundaries, such that if the memory file happens to support THP then such allocations will be appropriately aligned to use THP. This is suboptimal since many allocations do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware virtualization, where acquiring memory from the host is significantly more expensive due to EPT/NPT fault overhead; when effective, THP reduces the frequency with which said cost is incurred by a factor of 512, and page reuse avoids incurring it at all.

Thus:

- Instead of inferring whether THP use is desired from allocation size, indicate this explicitly as AllocOpts.Huge, and only set it to true for allocations for non-stack private anonymous mappings.
- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode that indicates that the caller will commit all pages in the allocation. In such cases, pgalloc can reuse deallocated pages without risking increased memory usage, internally referred to as "recycling". AllocateCallerIndirectCommit is used primarily for page faults on a THP-backed region. (It is also used for single-page allocations on non-THP backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned ranges, this is relatively uncommon.)
- Allow different chunks in pgalloc.MemoryFile's backing file to have varying THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.
- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking deallocated pages for small/huge-page-backed regions respectively; two sets tracking in-use pages for small/huge-page-backed regions respectively; and a fifth set tracking memory accounting state.
- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for pgalloc tests, but may also be applicable to disk-backed MemoryFiles.

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled, described in updateUsageLocked(), was based on the condition that MemoryFile.mu would be locked throughout the call to updateUsageLocked(), which was invalidated by cl/337865250.
- Remove MemoryFileOpts.ManualZeroing, which is unused.
- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim" in Linux has a significantly different meaning (essentially "eviction" in pgalloc), and the latter seems to be conventional in user-mode memory allocators.

Using THP for application memory requires setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to allow runsc to request THP from the kernel.

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as possible, limiting the effectiveness of page recycling. A following CL adds optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below, "baseline" is without this CL, "expt" is with this CL, and "expt2" is with this CL + reclaim throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP without forcing it on all host shmem users. In conjunction with recycling (which has a relatively small effect on systrap since it does not use hardware virtualization), THP use slightly improves performance, although whether this is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │                expt                │               expt2                │
                                                │   sec/op   │   sec/op    vs base                │   sec/op    vs base                │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%        ~ (p=0.947 n=30)   38.84 ± 3%        ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%        ~ (p=0.057 n=30)   36.83 ± 5%        ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%        ~ (p=0.350 n=30)   38.58 ± 4%        ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%   -4.64% (p=0.026 n=30)   36.58 ± 4%        ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%        ~ (p=0.485 n=30)   40.09 ± 5%        ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%        ~ (p=0.307 n=30)   38.08 ± 1%        ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%        ~ (p=0.752 n=30)   38.59 ± 3%        ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%        ~ (p=0.708 n=30)   36.08 ± 6%        ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%        ~ (p=0.138 n=30)   211.2 ± 1%        ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%        ~ (p=0.542 n=30)   209.7 ± 1%        ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%        ~ (p=0.423 n=30)   210.0 ± 1%        ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%        ~ (p=0.219 n=30)   209.5 ± 1%        ~ (p=0.230 n=30)
geomean                                           67.62        66.97        -0.96%                  67.12        -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```

PiperOrigin-RevId: 647771821	2024-06-28 12:56:46 -07:00
Ayush Ranjan	ed9678b679	Delete pgalloc.MemoryFileProvider. The work done in `c087777e37` ("Plumb restore context to afterLoad()") makes pgalloc.MemoryFileProvider redundant as structs can now easily restore pgalloc.MemoryFile in stateify's afterLoad() method. This allows structs to have a pgalloc.MemoryFile field and use that directly, instead of going through the provided interface. This cleans up a lot of code and also should be more performant (avoids an interface method call on many hot paths). PiperOrigin-RevId: 615258927	2024-03-12 20:06:58 -07:00
Adin Scannell	1ceb814544	Add `default_applicable_licenses` rules to packages. PiperOrigin-RevId: 513581243	2023-03-02 10:50:04 -08:00
Kevin Krakauer	d8aa09e04c	convert uses of interface{} to any Done via: find . -name "*.go" | xargs sed -i -E 's/interface\{\}/any/g' PiperOrigin-RevId: 487033228	2022-11-08 13:14:06 -08:00
Kevin Krakauer	39790bd3a1	switch remaining sync/atomic to atomicbitops for 32 bit values PiperOrigin-RevId: 443571047	2022-04-21 22:27:05 -07:00
Kevin Krakauer	370672e989	prohibit direct use of sync/atomic (u)int64 functions

All atomic 64 bit ints are changed to atomicbitops.(Ui|I)nt64. A nogo checker enforces that sync/atomic 64 bit functions are not called.

For reviewers: the interesting changes are in the atomicbitops and checkaligned packages.

Why do this?

- It is very easy to accidentally use atomic values without sync/atomic funcs.
- We have checkatomics, but this is optional and is forgotten in several places.
- Using a type+checker to enforce this seems less error prone and simpler.
- We get NoCopy protection.
- Use of 64 bit atomics can break 32 bit builds. We have types to handle this without any runtime cost, so we might as well use them.

PiperOrigin-RevId: 440473398	2022-04-08 16:06:26 -07:00
Nicolas Lacasse	5f33fdf37e	Pass overlay credentials via context in copy up. Some VFS operations (those which operate on FDs) get their credentials via the context instead of via an explicit creds param. For these cases, we must pass the overlay credentials on the context. PiperOrigin-RevId: 327881259	2020-08-21 15:06:09 -07:00
Adin Scannell	94b793262d	Fix all copy locks violations. This required minor restructuring of how system call tables were saved and restored, but it makes way more sense this way. Updates #2243	2020-04-08 10:00:14 -07:00
Adin Scannell	0e2f1b7abd	Update package locations. Because the abi will depend on the core types for marshalling (usermem, context, safemem, safecopy), these need to be flattened from the sentry directory. These packages contain no sentry-specific details. PiperOrigin-RevId: 291811289	2020-01-27 15:31:32 -08:00

11 Commits