This CL addresses the following major issues:
- When an application releases memory to the sentry, the sentry unconditionally
releases that memory to the host, rather than allowing it to be reused for
future allocations, in order to ensure that new allocations are uniformly
decommitted (use no memory): cl/145016083. In most cases, this should have
relatively little performance impact; since releasing memory from the
application to the OS is expensive even outside of gVisor, application memory
allocators optimizing for performance already limit the rate at which they
release memory to the OS. However, in applications that involve frequent
process creation and exit (e.g. build systems), this practice prevents reuse
of memory deallocated by exiting processes for memory allocated by new
processes, resulting in both performance degradation and a spike in memory
usage (since the sentry may not have released all deallocated memory to the
host by the time new allocations occur).
- gVisor's historical approach to application THP relies on THP being enabled
on a per-memfd basis, using the MFD_HUGEPAGE flag, which is not merged into
the upstream Linux kernel
(https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/).
Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory
without requiring the system to enable THP for all tmpfs files and memfds (by
setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or
"force").
- Both MM and the application page allocator (pgalloc) are agnostic as to
whether the underlying memory file will be THP-backed. Instead, both attempt
to align hugepage-sized and larger allocations to hugepage boundaries, such
that if the memory file happens to support THP then such allocations will be
appropriately aligned to use THP. This is suboptimal since many allocations
do not benefit from THP, resulting in memory underutilization.
These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512, and page reuse
avoids incurring it at all.
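The factor of 512 follows from the page sizes involved. The sketch below assumes x86-64 sizes (4 KiB small pages, 2 MiB huge pages); the constant and function names are illustrative and are not taken from gVisor.

```go
package main

import "fmt"

const (
	pageSize     = 4 << 10 // 4 KiB small page (x86-64 assumption)
	hugePageSize = 2 << 20 // 2 MiB huge page (x86-64 assumption)
)

// hugePageRoundDown returns addr rounded down to a huge page boundary.
func hugePageRoundDown(addr uint64) uint64 {
	return addr &^ (hugePageSize - 1)
}

// faultsToPopulate returns how many faults are needed to populate length
// bytes if each fault maps exactly one page of pageSz bytes.
func faultsToPopulate(length, pageSz uint64) uint64 {
	return (length + pageSz - 1) / pageSz
}

func main() {
	const length = 1 << 30 // 1 GiB of application memory
	small := faultsToPopulate(length, pageSize)
	huge := faultsToPopulate(length, hugePageSize)
	fmt.Println(small, huge, small/huge) // prints: 262144 512 512
}
```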
Thus:
- Instead of inferring whether THP use is desired from allocation size,
indicate this explicitly as AllocOpts.Huge, and only set it to true for
allocations for non-stack private anonymous mappings.
- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode
that indicates that the caller will commit all pages in the allocation. In
such cases, pgalloc can reuse deallocated pages without risking increased
memory usage, internally referred to as "recycling".
AllocateCallerIndirectCommit is used primarily for page faults on a
THP-backed region. (It is also used for single-page allocations on non-THP
backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned
ranges, this is relatively uncommon.)
- Allow different chunks in pgalloc.MemoryFile's backing file to have varying
THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.
- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
deallocated pages for small/huge-page-backed regions respectively; two sets
tracking in-use pages for small/huge-page-backed regions respectively; and a
fifth set tracking memory accounting state.
- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
pgalloc tests, but may also be applicable to disk-backed MemoryFiles.
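The recycling rule described above can be illustrated with a toy model. This is not pgalloc's implementation: the types `allocOpts` and `toyFile`, the `CallerCommits` field (standing in for Mode == AllocateCallerIndirectCommit), and slot-number offsets are all hypothetical. It only shows why reuse is restricted to allocations whose caller will commit every page, and why freed pages are tracked separately per page size.

```go
package main

import "fmt"

// allocOpts is a hypothetical stand-in for the options described above.
type allocOpts struct {
	Huge          bool // allocate from a huge-page-backed region
	CallerCommits bool // caller promises to commit every page
}

// toyFile hands out slot numbers instead of real file offsets.
type toyFile struct {
	next       uint64            // next never-used slot
	recyclable map[bool][]uint64 // freed, still-committed slots, keyed by hugeness
}

func newToyFile() *toyFile {
	return &toyFile{recyclable: make(map[bool][]uint64)}
}

// allocate reuses a freed slot only when the caller will commit it anyway;
// otherwise it returns a fresh (decommitted) slot so memory usage cannot grow.
func (f *toyFile) allocate(opts allocOpts) (slot uint64, recycled bool) {
	if free := f.recyclable[opts.Huge]; opts.CallerCommits && len(free) > 0 {
		slot = free[len(free)-1]
		f.recyclable[opts.Huge] = free[:len(free)-1]
		return slot, true
	}
	slot = f.next
	f.next++
	return slot, false
}

// free records a slot as recyclable until it is released to the host
// (release is not modeled here).
func (f *toyFile) free(slot uint64, huge bool) {
	f.recyclable[huge] = append(f.recyclable[huge], slot)
}

func main() {
	f := newToyFile()
	a, _ := f.allocate(allocOpts{Huge: true, CallerCommits: true})
	f.free(a, true)
	b, recycled := f.allocate(allocOpts{Huge: true, CallerCommits: true})
	fmt.Println(a == b, recycled) // prints: true true
}
```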
Cleanup:
- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
described in updateUsageLocked(), was based on the condition that
MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
which was invalidated by cl/337865250.
- Remove MemoryFileOpts.ManualZeroing, which is unused.
- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim"
in Linux has a significantly different meaning (essentially "eviction" in
pgalloc), and the latter seems to be conventional in user-mode memory
allocators.
Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.
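A host can be checked for this setting before enabling application THP. The sketch below is not part of runsc; `selectedTHPMode` is a hypothetical helper. It assumes the usual sysfs convention in which the file lists all modes and wraps the active one in square brackets.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const shmemEnabledPath = "/sys/kernel/mm/transparent_hugepage/shmem_enabled"

// selectedTHPMode returns the bracketed (active) mode from the contents of
// a transparent_hugepage sysfs file, e.g. "advise" for
// "always within_size [advise] never deny force".
func selectedTHPMode(contents string) (string, bool) {
	for _, field := range strings.Fields(contents) {
		if len(field) > 2 && strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return field[1 : len(field)-1], true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile(shmemEnabledPath)
	if err != nil {
		fmt.Println("cannot read shmem_enabled:", err)
		return
	}
	mode, ok := selectedTHPMode(string(data))
	switch {
	case !ok:
		fmt.Println("could not determine the active shmem THP mode")
	case mode == "advise":
		fmt.Println("shmem THP is available on request (MADV_HUGEPAGE)")
	default:
		fmt.Printf("shmem THP mode is %q, not \"advise\"\n", mode)
	}
}
```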
After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.
Performance outcomes vary by workload and platform. (In all of the below,
"baseline" is without this CL, "expt" is with this CL, and "expt2" is with this
CL + release throttling (cl/575046398).)
For systrap in GKE: As noted, this change is required to enable application THP
without forcing it on all host shmem users. In conjunction with recycling
(which has a relatively small effect on systrap since it does not use hardware
virtualization), THP use slightly improves performance, although whether this
is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":
```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
│ baseline │ expt │ expt2 │
│ sec/op │ sec/op vs base │ sec/op vs base │
BuildABSL/page_cache.clean/filesystem.bindfs-16 39.09 ± 4% 38.84 ± 5% ~ (p=0.947 n=30) 38.84 ± 3% ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16 37.83 ± 3% 36.58 ± 4% ~ (p=0.057 n=30) 36.83 ± 5% ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16 39.34 ± 3% 38.59 ± 4% ~ (p=0.350 n=30) 38.58 ± 4% ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16 37.83 ± 3% 36.08 ± 4% -4.64% (p=0.026 n=30) 36.58 ± 4% ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16 39.59 ± 4% 38.83 ± 3% ~ (p=0.485 n=30) 40.09 ± 5% ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16 36.83 ± 3% 38.08 ± 5% ~ (p=0.307 n=30) 38.08 ± 1% ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16 38.34 ± 3% 37.59 ± 5% ~ (p=0.752 n=30) 38.59 ± 3% ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16 37.58 ± 4% 38.08 ± 5% ~ (p=0.708 n=30) 36.08 ± 6% ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16 212.7 ± 2% 211.0 ± 1% ~ (p=0.138 n=30) 211.2 ± 1% ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16 210.0 ± 1% 210.0 ± 1% ~ (p=0.542 n=30) 209.7 ± 1% ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16 210.5 ± 1% 210.0 ± 1% ~ (p=0.423 n=30) 210.0 ± 1% ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16 210.2 ± 1% 209.0 ± 1% ~ (p=0.219 n=30) 209.5 ± 1% ~ (p=0.230 n=30)
geomean 67.62 66.97 -0.96% 67.12 -0.74%
```
The KVM platform benefits significantly from reduced nested page faults due to
huge pages, and to a lesser extent due to recycling:
```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
│ baseline │ expt │ expt2 │
│ sec/op │ sec/op vs base │ sec/op vs base │
BuildABSL/page_cache.clean/filesystem.bindfs-12 43.11 ± 2% 39.35 ± 3% -8.71% (p=0.000 n=20) 38.10 ± 4% -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12 42.35 ± 3% 39.09 ± 4% -7.69% (p=0.000 n=20+19) 39.09 ± 5% -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12 42.35 ± 3% 38.34 ± 5% -9.46% (p=0.000 n=20) 38.59 ± 3% -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12 42.09 ± 1% 37.59 ± 4% -10.70% (p=0.000 n=20) 38.09 ± 4% -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12 42.85 ± 3% 38.84 ± 3% -9.35% (p=0.000 n=20) 39.09 ± 3% -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12 41.85 ± 2% 39.59 ± 6% -5.40% (p=0.000 n=20+19) 38.09 ± 3% -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12 42.60 ± 2% 38.34 ± 2% -10.00% (p=0.000 n=20) 39.59 ± 3% -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12 42.09 ± 4% 39.09 ± 3% -7.13% (p=0.000 n=20) 38.09 ± 3% -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12 207.7 ± 1% 206.4 ± 0% -0.60% (p=0.018 n=20) 205.9 ± 1% -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12 206.9 ± 1% 206.9 ± 1% ~ (p=0.121 n=20) 204.4 ± 1% -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12 207.7 ± 1% 204.9 ± 1% -1.33% (p=0.004 n=20) 203.9 ± 0% -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12 206.9 ± 1% 204.9 ± 0% -0.97% (p=0.004 n=20+19) 203.9 ± 0% -1.45% (p=0.000 n=20+19)
geomean 71.97 67.63 -6.03% 67.28 -6.52%
```
PiperOrigin-RevId: 647771821
- Replace Add with TryInsertRange; for symmetry with RemoveRange, to establish
the convention that *Range methods perform an implicit search in the set, and
so that we can fork InsertRange which has Insert-like semantics (panics on
conflict), which the majority of callers want.
- Rename MergeRange and MergeAdjacent to MergeInsideRange and MergeOutsideRange
respectively; for the same convention, and to more clearly describe the
difference between these functions.
- Add MergePrev and MergeNext. These solve the longstanding problem of
requiring a separate call to Merge{Inside,Outside}Range (which will perform
additional searches) after mutating a set in a relatively simple manner.
- Add SplitBefore and SplitAfter, which are halves of Isolate. These are
slightly preferable to Isolate in many use cases for the latter (when
iterating segments within a range, only the first segment can include a key
before the start, so this saves some useless comparisons in almost every
iteration of such loops), and are useful in some more complex algorithms.
Also add LowerBoundSegmentSplitBefore and UpperBoundSegmentSplitAfter as
ergonomic aids for the former use case.
- Add {Visit,Mutate}[Full]Range, which are convenience wrappers around the
iterator API (including new functions) for simple use cases (and hence also
serve to demonstrate how the new iterator functions are used).
MutateFullRange in particular replaces ApplyContiguous and adds merging
during iteration.
- Add RemoveFullRange, which (analogous to {Visit,Mutate}FullRange) is a
variant of RemoveRange that checks that the range is fully covered by
segments.
- Add Unisolate, which combines MergePrev and MergeNext in the same way that
Isolate combines SplitBefore and SplitAfter. This is useful for merging after
mutation of a single segment.
- Add {First,Last,LowerBound,UpperBound}LargeEnoughGap, which are convenient
loop starters when using gap tracking.
- Replace SegmentDataSlices with FlatSegment, which is easier to use when
specifying "set literals" (as in tests).
- Make {prev,next}LargeEnoughGapHelper iterative rather than tail-recursive.
- Slightly optimize Iterator.{Prev,Next}NonEmpty: GapIterator.{Start,End} needs
to find the corresponding Iterator, so call Iterator.{Prev,Next}Segment
directly rather than doing so twice.
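The relationship between the split and merge operations above can be shown on a toy segment set. This sketch uses a sorted slice rather than the B-tree-based generated set, and its method names and signatures (`splitAt`, `isolate`, `mergePrev`, `mergeNext`, `unisolate`) are simplified stand-ins, not the real API.

```go
package main

import "fmt"

// seg is a half-open range [start, end) carrying a value.
type seg struct {
	start, end uint64
	val        string
}

// set is a sorted slice of non-overlapping segments.
type set []seg

// splitAt splits the segment strictly containing key into two segments.
func (s set) splitAt(key uint64) set {
	for i, sg := range s {
		if sg.start < key && key < sg.end {
			out := make(set, 0, len(s)+1)
			out = append(out, s[:i]...)
			out = append(out, seg{sg.start, key, sg.val}, seg{key, sg.end, sg.val})
			return append(out, s[i+1:]...)
		}
	}
	return s
}

// isolate is a split before start followed by a split after end, so that no
// segment straddles either bound of [start, end).
func (s set) isolate(start, end uint64) set {
	return s.splitAt(start).splitAt(end)
}

// mergePrev merges segment i into its predecessor when they are adjacent and
// hold equal values. It returns the new index of the merged segment.
func (s set) mergePrev(i int) (set, int) {
	if i > 0 && s[i-1].end == s[i].start && s[i-1].val == s[i].val {
		s[i-1].end = s[i].end
		return append(s[:i], s[i+1:]...), i - 1
	}
	return s, i
}

// mergeNext merges segment i with its successor when possible.
func (s set) mergeNext(i int) set {
	if i+1 < len(s) {
		s, _ = s.mergePrev(i + 1)
	}
	return s
}

// unisolate undoes isolate around segment i: mergePrev, then mergeNext.
func (s set) unisolate(i int) set {
	s, i = s.mergePrev(i)
	return s.mergeNext(i)
}

func main() {
	s := set{{0, 10, "rw"}}
	s = s.isolate(2, 6) // three segments: [0,2) [2,6) [6,10)
	s[1].val = "ro"     // mutate only the isolated segment
	s = s.unisolate(1)  // values differ, so nothing merges
	fmt.Println(len(s)) // prints: 3
}
```

Restoring the middle value to "rw" and calling `unisolate(1)` again would collapse the set back to a single segment.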
PiperOrigin-RevId: 583506148
Previously, MemoryFile always allocated offsets from the top down. This can
cause fragmentation when the application's allocations proceed in the
opposite direction.
Given that applications control the order of allocations, use a simple
heuristic that tracks the last faulted address to determine whether
MemoryFile should allocate top-down or bottom-up.
Here is the number of PMAs used across all processes in some common
workloads:
Workload | Before | After | After/Before
---------|---------|---------|-------------
rustc | 135,269 | 1,090 | 0.8%
mysql | 16,466 | 4,576 | 28%
nginx | 1,738 | 1,609 | 93%
jenkins | 13,830 | 6,380 | 46%
redis | 334 | 308 | 92%
absl | 297,419 | 292,121 | 98%
Here is the direction of PMA allocations:
Workload | BottomUp | TopDown |
---------|-----------|---------|
mysql | 13,849 | 1,990 |
nginx | 783 | 628 |
jenkins | 13,922 | 2,333 |
redis | 88 | 109 |
absl | 212,403 | 44,019 |
Tests were done on Intel x64.
PiperOrigin-RevId: 409014690
Split usermem package to help remove syserror dependency in go_marshal.
New hostarch package contains code not dependent on syserror.
PiperOrigin-RevId: 365651233
This change has multiple small components.
First, the chunk size is bumped to 1GB in order to avoid creating excessive
VMAs in the Sentry, which can lead to VMA exhaustion (and hitting limits).
Second, gap-tracking is added to the usage set in order to efficiently scan
for available regions.
Third, reclaim is moved to a simple segment set. This is done to allow the
order of reclaim to align with the Allocate order (which becomes much more
complex when trying to track a "max page" as opposed to "min page", so we
just track explicit segments instead, which should make reclaim scanning
faster anyway).
Finally, the findAvailable function attempts to scan from the top-down, in
order to maximize opportunities for VMA merging in applications (hopefully
preventing the same VMA exhaustion that can affect the Sentry).
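A top-down search over the gaps between used ranges might look like the following sketch. It scans a sorted slice linearly, whereas the change described above relies on gap tracking in the usage set to skip gaps that are too small; the `rng` type and the function signature are assumptions made for illustration.

```go
package main

import "fmt"

// rng is a half-open range [start, end) of used file offsets.
type rng struct{ start, end uint64 }

// findAvailableTopDown returns the highest align-aligned offset in
// [0, fileSize) at which length bytes fit without overlapping a used range.
// used must be sorted by start and non-overlapping; align must be a power
// of two.
func findAvailableTopDown(used []rng, fileSize, length, align uint64) (uint64, bool) {
	gapEnd := fileSize
	// Walk the gaps from the highest to the lowest. Index -1 stands for
	// the gap below the first used range.
	for i := len(used) - 1; i >= -1; i-- {
		var gapStart uint64
		if i >= 0 {
			gapStart = used[i].end
		}
		if gapEnd >= length {
			off := (gapEnd - length) &^ (align - 1)
			if off >= gapStart {
				return off, true
			}
		}
		if i >= 0 {
			gapEnd = used[i].start
		}
	}
	return 0, false
}

func main() {
	used := []rng{{0, 0x1000}, {0x3000, 0x4000}}
	off, ok := findAvailableTopDown(used, 0x8000, 0x2000, 0x1000)
	fmt.Printf("%#x %v\n", off, ok) // prints: 0x6000 true
}
```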
PiperOrigin-RevId: 315009249
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.
PiperOrigin-RevId: 291811289
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes #209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.
PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4