gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Jamie Liu	a5573312e0	Add explicit huge page and memory recycling support to pgalloc.MemoryFile. This CL addresses the following major issues: - When an application releases memory to the sentry, the sentry unconditionally releases that memory to the host, rather than allowing it to be reused for future allocations, in order to ensure that new allocations are uniformly decommitted (use no memory): cl/145016083. In most cases, this should have relatively little performance impact; since releasing memory from the application to the OS is expensive even outside of gVisor, application memory allocators optimizing for performance already limit the rate at which they release memory to the OS. However, in applications that involve frequent process creation and exit (e.g. build systems), this practice prevents reuse of memory deallocated by exiting processes for memory allocated by new processes, resulting in both performance degradation and a spike in memory usage (since the sentry may not have released all deallocated memory to the host by the time new allocations occur). - gVisor's historical approach to application THP is based on THP being enabled on a per-memfd basis, using the MFD_HUGEPAGE flag not merged into the upstream Linux kernel (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/). Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory without requiring the system to enable THP for all tmpfs files and memfds (by setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or "force"). - Both MM and the application page allocator (pgalloc) are agnostic as to whether the underlying memory file will be THP-backed. Instead, both attempt to align hugepage-sized and larger allocations to hugepage boundaries, such that if the memory file happens to support THP then such allocations will be appropriately aligned to use THP. This is suboptimal since many allocations do not benefit from THP, resulting in memory underutilization. These issues are especially relevant to platforms based on hardware virtualization, where acquiring memory from the host is significantly more expensive due to EPT/NPT fault overhead; when effective, THP reduces the frequency with which said cost is incurred by a factor of 512, and page reuse avoids incurring it at all. Thus: - Instead of inferring whether THP use is desired from allocation size, indicate this explicitly as AllocOpts.Huge, and only set it to true for allocations for non-stack private anonymous mappings. - Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode that indicates that the caller will commit all pages in the allocation. In such cases, pgalloc can reuse deallocated pages without risking increased memory usage, internally referred to as "recycling". AllocateCallerIndirectCommit is used primarily for page faults on a THP-backed region. (It is also used for single-page allocations on non-THP backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned ranges, this is relatively uncommon.) - Allow different chunks in pgalloc.MemoryFile's backing file to have varying THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE. - Split pgalloc.MemoryFile's existing page metadata set into two sets tracking deallocated pages for small/huge-page-backed regions respectively; two sets tracking in-use pages for small/huge-page-backed regions respectively; and a fifth set tracking memory accounting state. - Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for pgalloc tests, but may also be applicable to disk-backed MemoryFiles. Cleanup: - Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled, described in updateUsageLocked(), was based on the condition that MemoryFile.mu would be locked throughout the call to updateUsageLocked(), which was invalidated by cl/337865250. - Remove MemoryFileOpts.ManualZeroing, which is unused. - Rename "reclaiming" to "releasing"; the former is confusing since "reclaim" in Linux has a significantly different meaning (essentially "eviction" in pgalloc), and the latter seems to be conventional in user-mode memory allocators. Using THP for application memory requires setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to allow runsc to request THP from the kernel. After this CL, pgalloc.MemoryFile still releases memory to the host as fast as possible, limiting the effectiveness of page recycling. A following CL adds optional memory release throttling to improve this. Performance outcomes vary by workload and platform. (In all of the below, "baseline" is without this CL, "expt" is with this CL, and "expt2" is with this CL + reclaim throttling (cl/575046398).) For systrap in GKE: As noted, this change is required to enable application THP without forcing it on all host shmem users. In conjunction with recycling (which has a relatively small effect on systrap since it does not use hardware virtualization), THP use slightly improves performance, although whether this is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise": ``` goos: linux goarch: amd64 cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ baseline │ expt │ expt2 │ │ sec/op │ sec/op vs base │ sec/op vs base │ BuildABSL/page_cache.clean/filesystem.bindfs-16 39.09 ± 4% 38.84 ± 5% ~ (p=0.947 n=30) 38.84 ± 3% ~ (p=0.854 n=30) BuildABSL/page_cache.dirty/filesystem.bindfs-16 37.83 ± 3% 36.58 ± 4% ~ (p=0.057 n=30) 36.83 ± 5% ~ (p=0.314 n=30) BuildABSL/page_cache.clean/filesystem.tmpfs-16 39.34 ± 3% 38.59 ± 4% ~ (p=0.350 n=30) 38.58 ± 4% ~ (p=0.300 n=30) BuildABSL/page_cache.dirty/filesystem.tmpfs-16 37.83 ± 3% 36.08 ± 4% -4.64% (p=0.026 n=30) 36.58 ± 4% ~ (p=0.123 n=30) BuildABSL/page_cache.clean/filesystem.rootfs-16 39.59 ± 4% 38.83 ± 3% ~ (p=0.485 n=30) 40.09 ± 5% ~ (p=0.971 n=30) BuildABSL/page_cache.dirty/filesystem.rootfs-16 36.83 ± 3% 38.08 ± 5% ~ (p=0.307 n=30) 38.08 ± 1% ~ (p=0.242 n=30) BuildABSL/page_cache.clean/filesystem.fusefs-16 38.34 ± 3% 37.59 ± 5% ~ (p=0.752 n=30) 38.59 ± 3% ~ (p=0.982 n=30) BuildABSL/page_cache.dirty/filesystem.fusefs-16 37.58 ± 4% 38.08 ± 5% ~ (p=0.708 n=30) 36.08 ± 6% ~ (p=0.127 n=30) BuildGRPC/page_cache.clean/filesystem.bindfs-16 212.7 ± 2% 211.0 ± 1% ~ (p=0.138 n=30) 211.2 ± 1% ~ (p=0.458 n=30) BuildGRPC/page_cache.dirty/filesystem.bindfs-16 210.0 ± 1% 210.0 ± 1% ~ (p=0.542 n=30) 209.7 ± 1% ~ (p=0.665 n=30) BuildGRPC/page_cache.clean/filesystem.rootfs-16 210.5 ± 1% 210.0 ± 1% ~ (p=0.423 n=30) 210.0 ± 1% ~ (p=0.142 n=30) BuildGRPC/page_cache.dirty/filesystem.rootfs-16 210.2 ± 1% 209.0 ± 1% ~ (p=0.219 n=30) 209.5 ± 1% ~ (p=0.230 n=30) geomean 67.62 66.97 -0.96% 67.12 -0.74% ``` The KVM platform benefits significantly from reduced nested page faults due to huge pages, and to a lesser extent due to recycling: ``` goos: linux goarch: amd64 cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz │ baseline │ expt │ expt2 │ │ sec/op │ sec/op vs base │ sec/op vs base │ BuildABSL/page_cache.clean/filesystem.bindfs-12 43.11 ± 2% 39.35 ± 3% -8.71% (p=0.000 n=20) 38.10 ± 4% -11.63% (p=0.000 n=20+19) BuildABSL/page_cache.dirty/filesystem.bindfs-12 42.35 ± 3% 39.09 ± 4% -7.69% (p=0.000 n=20+19) 39.09 ± 5% -7.69% (p=0.000 n=20+19) BuildABSL/page_cache.clean/filesystem.tmpfs-12 42.35 ± 3% 38.34 ± 5% -9.46% (p=0.000 n=20) 38.59 ± 3% -8.87% (p=0.000 n=20+19) BuildABSL/page_cache.dirty/filesystem.tmpfs-12 42.09 ± 1% 37.59 ± 4% -10.70% (p=0.000 n=20) 38.09 ± 4% -9.51% (p=0.000 n=20+19) BuildABSL/page_cache.clean/filesystem.rootfs-12 42.85 ± 3% 38.84 ± 3% -9.35% (p=0.000 n=20) 39.09 ± 3% -8.77% (p=0.000 n=20+17) BuildABSL/page_cache.dirty/filesystem.rootfs-12 41.85 ± 2% 39.59 ± 6% -5.40% (p=0.000 n=20+19) 38.09 ± 3% -9.00% (p=0.000 n=20+19) BuildABSL/page_cache.clean/filesystem.fusefs-12 42.60 ± 2% 38.34 ± 2% -10.00% (p=0.000 n=20) 39.59 ± 3% -7.06% (p=0.000 n=20+19) BuildABSL/page_cache.dirty/filesystem.fusefs-12 42.09 ± 4% 39.09 ± 3% -7.13% (p=0.000 n=20) 38.09 ± 3% -9.52% (p=0.000 n=20+19) BuildGRPC/page_cache.clean/filesystem.bindfs-12 207.7 ± 1% 206.4 ± 0% -0.60% (p=0.018 n=20) 205.9 ± 1% -0.85% (p=0.001 n=20+19) BuildGRPC/page_cache.dirty/filesystem.bindfs-12 206.9 ± 1% 206.9 ± 1% ~ (p=0.121 n=20) 204.4 ± 1% -1.22% (p=0.004 n=20+19) BuildGRPC/page_cache.clean/filesystem.rootfs-12 207.7 ± 1% 204.9 ± 1% -1.33% (p=0.004 n=20) 203.9 ± 0% -1.81% (p=0.000 n=20+19) BuildGRPC/page_cache.dirty/filesystem.rootfs-12 206.9 ± 1% 204.9 ± 0% -0.97% (p=0.004 n=20+19) 203.9 ± 0% -1.45% (p=0.000 n=20+19) geomean 71.97 67.63 -6.03% 67.28 -6.52% ``` PiperOrigin-RevId: 647771821	2024-06-28 12:56:46 -07:00
Jamie Liu	56ab580ccb	Automated rollback of changelist 633961720 PiperOrigin-RevId: 636602151	2024-05-23 10:50:10 -07:00
Konstantin Bogomolov	0bf4e9f6e5	Set limit on how big MemoryFile.Allocate calls can be. Either TotalHostMem or TotalMem are good candidates for limits because in case either of these is set we should not be going over them. The motivations of this is to help catch syscalls causing allocations with size values that are blatantly bad. PiperOrigin-RevId: 633961720	2024-05-15 08:23:50 -07:00
Jing Chen	be48200c0e	Re-order loads in BUILD files to make transformations reversible in Copybara. PiperOrigin-RevId: 598898756	2024-01-16 11:21:40 -08:00
Andrei Vagin	5f4abad306	Fix a few typos It is an idea of running codespell as part of our presubmit checks. Before enabling it for new changes, let's fix what it has found. Signed-off-by: Andrei Vagin <avagin@gmail.com>	2023-10-25 12:13:42 -07:00
Nayana Bidari	a6c3a61c29	Make changes for CgroupsReadControlFile method to get memory usage per cgroup. - Adds methods to Enter, Leave and Migrate the tasks for memory controller. - Update the memory cgroup id when the task enters and leaves the cgroups. PiperOrigin-RevId: 555568875	2023-08-10 11:09:28 -07:00
Nayana Bidari	a87aa73698	Increment/decrement memory accounted per cgroup. - Adds a new field in the usageInfo to store the memory cgroup id. - Creates a map of cgroup ids and memory stats to track the memory per cgroup in MemoryLocked struct. - Introduces new methods to increment, decrement, move, copy and get the total memory usage per cgroup. PiperOrigin-RevId: 549148091	2023-07-18 16:50:09 -07:00
Andrei Vagin	49c05d0f11	Enable lockdep for more mutexes PiperOrigin-RevId: 533162054	2023-05-18 09:59:15 -07:00
Adin Scannell	1ceb814544	Add `default_applicable_licenses` rules to packages. PiperOrigin-RevId: 513581243	2023-03-02 10:50:04 -08:00
Kevin Krakauer	370672e989	prohibit direct use of sync/atomic (u)int64 functions All atomic 64 bit ints are changed to atomicbitops.(Ui\|I)nt64. A nogo checker enforces that sync/atomic 64 bit functions are not called. For reviewers: the interesting changes are in the atomicbitops and checkaligned packages. Why do this? - It is very easy to accidentally use atomic values without sync/atomic funcs. - We have checkatomics, but this is optional and is forgotten in several places. - Using a type+checker to enforce this seems less error prone and simpler. - We get NoCopy protection. - Use of 64 bit atomics can break 32 bit builds. We have types to handle this without any runtime cost, so we might as well use them. PiperOrigin-RevId: 440473398	2022-04-08 16:06:26 -07:00
Rahat Mahmood	5bb1f5086e	cgroupfs: Implement hierarchical accounting for cpuacct controller. PiperOrigin-RevId: 438193226	2022-03-29 20:02:09 -07:00
Fabricio Voznika	33b41d8fe9	Report total memory based on limit or host gVisor was previously reporting the lower of cgroup limit or 2GB as total memory. This may cause applications to make bad decisions based on amount of memory available to them when more than 2GB is required. This change makes the lower of cgroup limit or the host total memory to be reported inside the sandbox. This also is more inline with docker which always reports host total memory. Note that reporting cgroup limit is strictly better than host total memory when there is a limit set. Fixes #5608 PiperOrigin-RevId: 403241608	2021-10-14 18:42:07 -07:00
Jamie Liu	7e0c1d9f1e	Use memutil.MapFile for the memory accounting page. PiperOrigin-RevId: 381145216	2021-06-23 17:03:58 -07:00
Ayush Ranjan	a9441aea27	[op] Replace syscall package usage with golang.org/x/sys/unix in pkg/. The syscall package has been deprecated in favor of golang.org/x/sys. Note that syscall is still used in the following places: - pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities are not yet available in golang.org/x/sys. - syscall.Stat_t is still used in some places because os.FileInfo.Sys() still returns it and not unix.Stat_t. Updates #214 PiperOrigin-RevId: 360701387	2021-03-03 10:25:58 -08:00
Ayush Ranjan	c206fcbfc2	pgalloc: Do not hold MemoryFile.mu while calling mincore. This change makes the following changes: - Unlocks MemoryFile.mu while calling mincore (checkCommitted) because mincore can take a really long time. Accordingly looks up the segment in the tree tree again and handles changes to the segment. - MemoryFile.UpdateUsage() can now only be called at frequency at most 100Hz. 100 Hz = linux.CLOCKS_PER_SEC. Co-authored-by: Jamie Liu <jamieliu@google.com> PiperOrigin-RevId: 337865250	2020-10-19 09:02:19 -07:00
Nicolas Lacasse	591ff0e424	Add maximum memory limit. PiperOrigin-RevId: 310179277	2020-05-06 10:30:18 -07:00
Adin Scannell	c9a18b16ad	Document MinimumTotalMemoryBytes. PiperOrigin-RevId: 294273559	2020-02-10 12:08:32 -08:00
Adin Scannell	d29e59af9f	Standardize on tools directory. PiperOrigin-RevId: 291745021	2020-01-27 12:21:00 -08:00
Ian Gudger	27500d529f	New sync package. * Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387	2020-01-09 22:02:24 -08:00
Kevin Krakauer	2a82d5ad68	Reorder BUILD license and load functions in gvisor. PiperOrigin-RevId: 275139066	2019-10-16 16:40:30 -07:00
Jamie Liu	0352cf5866	Remove support for non-incremental mapped accounting. PiperOrigin-RevId: 266496644	2019-08-30 19:06:55 -07:00
Adin Scannell	add40fd6ad	Update canonical repository. This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620	2019-06-13 16:50:15 -07:00
Jamie Liu	48961d27a8	Move //pkg/sentry/memutil to //pkg/memutil. PiperOrigin-RevId: 252124156	2019-06-07 14:52:27 -07:00
Michael Pratt	4d52a55201	Change copyright notice to "The gVisor Authors" Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" \| xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9	2019-04-29 14:26:23 -07:00
Jamie Liu	8f4634997b	Decouple filemem from platform and move it to pgalloc.MemoryFile. This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4	2019-03-14 08:12:48 -07:00

1 2

35 Commits