During checkpointing, MemoryFile.SaveTo() narrows the set of pages known by
memory accounting to be "committed" to those containing non-zero bytes, in
order to avoid saving zero pages and therefore bloating the checkpoint. In the
process of doing so, it needs to touch pages in order to determine whether they
contain non-zero bytes, and does so without holding MemoryFile.mu (in a
MemoryFile.updateUsageLocked() callback). Thus, a concurrent call to
MemoryFile.UpdateUsage() => MemoryFile.updateUsageLocked() can racily observe
that touched zero pages are committed (via mincore()) and mark them
known-committed accordingly, causing them to be unintentionally saved in the
checkpoint.
When SaveOpts.ExcludeCommittedZeroPages is set, MemoryFile.SaveTo() does not
decommit previously-known-committed zero pages, since doing so would cost time;
the motivation for decommitting zero pages is to avoid increasing (real, host)
memory usage during checkpointing, but previously-known-committed pages must
have been using memory even before being touched. However, this significantly
widens the race window described above, since any future call to
MemoryFile.UpdateUsage() will observe said pages to be committed and mark them
known-committed again, effectively negating SaveOpts.ExcludeCommittedZeroPages.
To fix this, inhibit MemoryFile.UpdateUsage() during MemoryFile.SaveTo(); in
essence, when MemoryFile.SaveTo() is in progress, it exclusively defines what
pages are known-committed.
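
A minimal sketch of the inhibition, for illustration only: the type, the
`saving` flag, and the method names below are hypothetical stand-ins and are
not taken from pgalloc.

```go
package main

import (
	"fmt"
	"sync"
)

// memoryFile is a hypothetical stand-in for pgalloc.MemoryFile; only the
// state needed to illustrate the inhibition is modeled.
type memoryFile struct {
	mu         sync.Mutex
	saving     bool // true while saveTo is in progress (assumed name)
	usageScans int  // number of usage scans actually performed
}

// updateUsage performs a usage scan unless a save is in progress, in which
// case it does nothing and reports false.
func (f *memoryFile) updateUsage() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.saving {
		return false
	}
	f.usageScans++
	return true
}

// saveTo marks the file as being saved, runs scan without holding f.mu (as
// the page-touching callback does), and then clears the mark.
func (f *memoryFile) saveTo(scan func()) {
	f.mu.Lock()
	f.saving = true
	f.mu.Unlock()
	defer func() {
		f.mu.Lock()
		f.saving = false
		f.mu.Unlock()
	}()
	scan()
}

func main() {
	f := &memoryFile{}
	f.saveTo(func() {
		fmt.Println("during save:", f.updateUsage())
	})
	fmt.Println("after save:", f.updateUsage())
}
```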
PiperOrigin-RevId: 733876220
On platforms that do not create page table entries in
platform.AddressSpace.MapFile(precommit=false), i.e. platforms for which
platform.Platform.MapUnit() == 0, platform.AddressSpace.MapFile() is generally
implemented as some form of host mmap(), which only synchronously creates host
kernel VMAs (virtual memory areas) and creates page table entries lazily in
response to application faults. On such platforms, MM.mapAsLocked() creates the
largest possible host VMAs since doing so reduces future sentry-handled page
faults and has effectively no additional cost. However, when async page loading
is active, creating such a mapping must wait for all mapped pages to be loaded, which may result
in the faulting application blocking for significantly longer than expected (in
experiments, a single page fault could result in waiting for up to 64GB of data
to be loaded). In such cases, additionally constrain mapped sizes to limit wait
times.
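
As a rough illustration of constraining a mapping around a faulting address
(the function name, the power-of-two limit, and the example addresses are all
assumptions, not the values used by MM):

```go
package main

import "fmt"

// addrRange is a half-open address range [start, end).
type addrRange struct{ start, end uint64 }

// constrainAround returns r unchanged if it spans at most maxLen bytes.
// Otherwise it returns the intersection of r with the maxLen-aligned window
// containing addr. maxLen must be a power of two and addr must lie within r.
func constrainAround(r addrRange, addr, maxLen uint64) addrRange {
	if r.end-r.start <= maxLen {
		return r
	}
	start := addr &^ (maxLen - 1)
	end := start + maxLen
	if start < r.start {
		start = r.start
	}
	if end > r.end {
		end = r.end
	}
	return addrRange{start, end}
}

func main() {
	// A hypothetical 64 GiB mapping with a fault at address 0x40001234,
	// constrained to a 2 MiB window.
	r := addrRange{0x1000, 0x1000 + 64<<30}
	got := constrainAround(r, 0x40001234, 2<<20)
	fmt.Printf("%#x-%#x\n", got.start, got.end)
}
```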
PiperOrigin-RevId: 680699946
When a pages file is provided to `runsc restore`, reads from that file are
asynchronous (via statefile.AsyncReader) in order to maximize throughput.
However, all such reads must complete before Kernel.LoadFrom() returns, so
applications cannot execute before MemoryFile loading is complete. The main
objective of this CL is to allow reads to continue after Kernel.LoadFrom()
returns, allowing applications to execute while MemoryFile loading is still in
progress. This behavior is user-visible: it affects whether deleting the pages
file frees disk space immediately on POSIX filesystems, may affect whether
deletion is possible on non-POSIX filesystems, and prevents unmounting
regardless. Thus it is flag-guarded as `runsc restore --background`.
MemoryFile ranges that have yet to be loaded, but that are being waited-for by
applications, should be prioritized over ranges for which no application is
waiting. This requires that application requests for data (calls to
MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which
ranges have not yet been loaded, request reads for such ranges with elevated
priority, and wait for only those reads to be completed; none of these are
supported by the existing statefile.AsyncReader.
Thus:
- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is
designed to be easily implementable using a goroutine pool, Linux native AIO,
or io_uring, though it only includes a goroutine pool implementation. (io_uring
is widely disabled due to security vulnerabilities. In my testing, Linux
native AIO is slower than the goroutine pool, but this may change with lower
GOMAXPROCS, which needs further testing.)
- Move I/O scheduling into pgalloc: introduce an async page loader goroutine
that is started by MemoryFile.LoadFrom() when async page loading is requested
(implicitly, via the existence of a pages file), which is responsible for
driving submission of read requests and handling their completions.
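
The prioritization requirement can be sketched with two queues. This is a
simplified illustration under assumed names (`loadQueue`, `prioritize`,
`next`); it does not split ranges or track completions as a real loader would.

```go
package main

import "fmt"

// fileRange is a half-open range [start, end) of file offsets.
type fileRange struct{ start, end uint64 }

// loadQueue holds ranges that have not been read yet.
type loadQueue struct {
	urgent []fileRange // ranges that some application is waiting for
	normal []fileRange // ranges that nobody is waiting for yet
}

// prioritize moves every pending range that overlaps fr to the urgent queue.
func (q *loadQueue) prioritize(fr fileRange) {
	kept := q.normal[:0]
	for _, r := range q.normal {
		if r.start < fr.end && fr.start < r.end {
			q.urgent = append(q.urgent, r)
		} else {
			kept = append(kept, r)
		}
	}
	q.normal = kept
}

// next returns the next range to read, preferring urgent ranges.
func (q *loadQueue) next() (fileRange, bool) {
	if len(q.urgent) > 0 {
		r := q.urgent[0]
		q.urgent = q.urgent[1:]
		return r, true
	}
	if len(q.normal) > 0 {
		r := q.normal[0]
		q.normal = q.normal[1:]
		return r, true
	}
	return fileRange{}, false
}

func main() {
	q := &loadQueue{normal: []fileRange{{0, 4096}, {4096, 8192}, {8192, 12288}}}
	q.prioritize(fileRange{9000, 9001}) // an application faults in the third range
	for r, ok := q.next(); ok; r, ok = q.next() {
		fmt.Printf("read [%d, %d)\n", r.start, r.end)
	}
}
```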
PiperOrigin-RevId: 679321884
This package critically depends on reading/writing single bytes. Since
arguments to interface methods io.Reader.Read / io.Writer.Write escape, a naive
implementation would heap-allocate a one-byte array per read/write.
Prior to cl/625167495, the wire package provided custom Reader/Writer
interfaces, implementations of which were required to provide their own
ReadByte/WriteByte methods that did not take any escaping arguments.
cl/625167495 eliminated these interfaces and made the wire package use
sync.Pool to allocate one-byte arrays instead, simplifying the pipelining of
readers and writers but introducing non-trivial overhead. This CL re-introduces
wire.Reader/Writer, but as structs combining an io.Reader/Writer and a one-byte
array; this preserves the relative ease of using arbitrary io.Readers/Writers,
while eliminating sync.Pool overhead by essentially having callers of the wire
package provide the persistent buffer.
PiperOrigin-RevId: 666946511
This CL addresses the following major issues:
- When an application releases memory to the sentry, the sentry unconditionally
releases that memory to the host, rather than allowing it to be reused for
future allocations, in order to ensure that new allocations are uniformly
decommitted (use no memory): cl/145016083. In most cases, this should have
relatively little performance impact; since releasing memory from the
application to the OS is expensive even outside of gVisor, application memory
allocators optimizing for performance already limit the rate at which they
release memory to the OS. However, in applications that involve frequent
process creation and exit (e.g. build systems), this practice prevents reuse
of memory deallocated by exiting processes for memory allocated by new
processes, resulting in both performance degradation and a spike in memory
usage (since the sentry may not have released all deallocated memory to the
host by the time new allocations occur).
- gVisor's historical approach to application THP is based on THP being enabled
on a per-memfd basis, using the MFD_HUGEPAGE flag, which was not merged into the
upstream Linux kernel
(https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/).
Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory
without requiring the system to enable THP for all tmpfs files and memfds (by
setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or
"force").
- Both MM and the application page allocator (pgalloc) are agnostic as to
whether the underlying memory file will be THP-backed. Instead, both attempt
to align hugepage-sized and larger allocations to hugepage boundaries, such
that if the memory file happens to support THP then such allocations will be
appropriately aligned to use THP. This is suboptimal since many allocations
do not benefit from THP, resulting in memory underutilization.
These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512, and page reuse
avoids incurring it at all.
Thus:
- Instead of inferring whether THP use is desired from allocation size,
indicate this explicitly as AllocOpts.Huge, and only set it to true for
allocations for non-stack private anonymous mappings.
- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode
that indicates that the caller will commit all pages in the allocation. In
such cases, pgalloc can reuse deallocated pages without risking increased
memory usage, internally referred to as "recycling".
AllocateCallerIndirectCommit is used primarily for page faults on a
THP-backed region. (It is also used for single-page allocations on non-THP
backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned
ranges, this is relatively uncommon.)
- Allow different chunks in pgalloc.MemoryFile's backing file to have varying
THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.
- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
deallocated pages for small/huge-page-backed regions respectively; two sets
tracking in-use pages for small/huge-page-backed regions respectively; and a
fifth set tracking memory accounting state.
- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
pgalloc tests, but may also be applicable to disk-backed MemoryFiles.
Cleanup:
- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
described in updateUsageLocked(), was based on the condition that
MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
which was invalidated by cl/337865250.
- Remove MemoryFileOpts.ManualZeroing, which is unused.
- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim"
in Linux has a significantly different meaning (essentially "eviction" in
pgalloc), and the latter seems to be conventional in user-mode memory
allocators.
Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.
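
For illustration, the current setting can be read from the sysfs file, which
marks the selected mode with square brackets. The helper below is a
hypothetical sketch, and the sample mode list is illustrative since the
available modes vary by kernel version.

```go
package main

import (
	"fmt"
	"strings"
)

// selectedTHPMode returns the bracketed entry from the contents of a
// transparent_hugepage sysfs file, e.g. "advise" from
// "always within_size [advise] never deny force".
func selectedTHPMode(contents string) (string, bool) {
	for _, f := range strings.Fields(contents) {
		if len(f) > 2 && strings.HasPrefix(f, "[") && strings.HasSuffix(f, "]") {
			return f[1 : len(f)-1], true
		}
	}
	return "", false
}

func main() {
	// Sample contents; on a real host this would be read from
	// /sys/kernel/mm/transparent_hugepage/shmem_enabled.
	mode, ok := selectedTHPMode("always within_size [advise] never deny force\n")
	fmt.Println(mode, ok)
}
```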
After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.
Performance outcomes vary by workload and platform. (In all of the below,
"baseline" is without this CL, "expt" is with this CL, and "expt2" is with this
CL + release throttling (cl/575046398).)
For systrap in GKE: As noted, this change is required to enable application THP
without forcing it on all host shmem users. In conjunction with recycling
(which has a relatively small effect on systrap since it does not use hardware
virtualization), THP use slightly improves performance, although whether this
is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":
```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                 │  baseline  │               expt                │               expt2               │
                                                 │   sec/op   │   sec/op   vs base                │   sec/op   vs base                │
BuildABSL/page_cache.clean/filesystem.bindfs-16    39.09 ± 4%   38.84 ± 5%       ~ (p=0.947 n=30)   38.84 ± 3%       ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16    37.83 ± 3%   36.58 ± 4%       ~ (p=0.057 n=30)   36.83 ± 5%       ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16     39.34 ± 3%   38.59 ± 4%       ~ (p=0.350 n=30)   38.58 ± 4%       ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16     37.83 ± 3%   36.08 ± 4%  -4.64% (p=0.026 n=30)   36.58 ± 4%       ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16    39.59 ± 4%   38.83 ± 3%       ~ (p=0.485 n=30)   40.09 ± 5%       ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16    36.83 ± 3%   38.08 ± 5%       ~ (p=0.307 n=30)   38.08 ± 1%       ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16    38.34 ± 3%   37.59 ± 5%       ~ (p=0.752 n=30)   38.59 ± 3%       ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16    37.58 ± 4%   38.08 ± 5%       ~ (p=0.708 n=30)   36.08 ± 6%       ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16    212.7 ± 2%   211.0 ± 1%       ~ (p=0.138 n=30)   211.2 ± 1%       ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16    210.0 ± 1%   210.0 ± 1%       ~ (p=0.542 n=30)   209.7 ± 1%       ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16    210.5 ± 1%   210.0 ± 1%       ~ (p=0.423 n=30)   210.0 ± 1%       ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16    210.2 ± 1%   209.0 ± 1%       ~ (p=0.219 n=30)   209.5 ± 1%       ~ (p=0.230 n=30)
geomean                                            67.62        66.97       -0.96%                  67.12       -0.74%
```
The KVM platform benefits significantly from reduced nested page faults due to
huge pages, and to a lesser extent due to recycling:
```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                 │  baseline  │                 expt                 │                expt2                 │
                                                 │   sec/op   │   sec/op   vs base                   │   sec/op   vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12    43.11 ± 2%   39.35 ± 3%  -8.71% (p=0.000 n=20)      38.10 ± 4% -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12    42.35 ± 3%   39.09 ± 4%  -7.69% (p=0.000 n=20+19)   39.09 ± 5%  -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12     42.35 ± 3%   38.34 ± 5%  -9.46% (p=0.000 n=20)      38.59 ± 3%  -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12     42.09 ± 1%   37.59 ± 4% -10.70% (p=0.000 n=20)      38.09 ± 4%  -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12    42.85 ± 3%   38.84 ± 3%  -9.35% (p=0.000 n=20)      39.09 ± 3%  -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12    41.85 ± 2%   39.59 ± 6%  -5.40% (p=0.000 n=20+19)   38.09 ± 3%  -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12    42.60 ± 2%   38.34 ± 2% -10.00% (p=0.000 n=20)      39.59 ± 3%  -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12    42.09 ± 4%   39.09 ± 3%  -7.13% (p=0.000 n=20)      38.09 ± 3%  -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12    207.7 ± 1%   206.4 ± 0%  -0.60% (p=0.018 n=20)      205.9 ± 1%  -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12    206.9 ± 1%   206.9 ± 1%       ~ (p=0.121 n=20)      204.4 ± 1%  -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12    207.7 ± 1%   204.9 ± 1%  -1.33% (p=0.004 n=20)      203.9 ± 0%  -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12    206.9 ± 1%   204.9 ± 0%  -0.97% (p=0.004 n=20+19)   203.9 ± 0%  -1.45% (p=0.000 n=20+19)
geomean                                            71.97        67.63       -6.03%                     67.28       -6.52%
```
PiperOrigin-RevId: 647771821
Either TotalHostMem or TotalMem is a good candidate for a limit, because
if either of these is set we should not be going over it.
The motivation for this is to help catch syscalls causing allocations
with size values that are blatantly bad.
PiperOrigin-RevId: 633961720
This option, enabled via `runsc checkpoint --exclude-committed-zero-pages`,
instructs `pgalloc.MemoryFile.SaveTo()` to also exclude definitely-committed
zero pages from checkpointing (in addition to possibly-committed zero pages,
which are always scanned for and excluded). This is useful when the application
being checkpointed is known to have a large number of committed zero pages:
pages that (1) have been touched by application memory accesses, a syscall such
as read(), or page pinning by e.g. a driver, and (2) have not been subsequently
released by the application to the operating system by e.g. munmap() or
madvise(MADV_DONTNEED) (+ page unpinning if necessary), and (3) are filled with
zero bytes.
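
Condition (3) amounts to a byte scan of each page. A minimal sketch, assuming
a 4 KiB page size and using hypothetical helper names:

```go
package main

import "fmt"

// pageSize is assumed to be 4 KiB for this sketch.
const pageSize = 4096

// isZeroPage reports whether every byte in page is zero.
func isZeroPage(page []byte) bool {
	for _, b := range page {
		if b != 0 {
			return false
		}
	}
	return true
}

// nonZeroPages returns the indices of the whole pages in mem that contain at
// least one non-zero byte, i.e. the pages that would need to be saved.
func nonZeroPages(mem []byte) []int {
	var idx []int
	for off := 0; off+pageSize <= len(mem); off += pageSize {
		if !isZeroPage(mem[off : off+pageSize]) {
			idx = append(idx, off/pageSize)
		}
	}
	return idx
}

func main() {
	mem := make([]byte, 4*pageSize)
	mem[2*pageSize+17] = 1 // dirty a single byte in the third page
	fmt.Println(nonZeroPages(mem))
}
```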
Minor changes:
- In `MemoryFile.updateUsageLocked()`, pass file offset to `checkCommitted` so
that `MemoryFile.SaveTo()`'s `checkCommitted` can use `FALLOC_FL_PUNCH_HOLE`
to decommit pages rather than `MADV_REMOVE` (which translates addresses to
file offsets and then invokes `FALLOC_FL_PUNCH_HOLE`).
- In `MemoryFile.SaveTo()`, buffer up to a hugepage worth of pages to decommit
rather than decommitting one page per syscall.
- Increment `MemoryFile.usageExpected` in `MemoryFile.LoadFrom()`, such that
the first following call to `MemoryFile.UpdateUsage()` might skip the call
to `MemoryFile.updateUsageLocked()` (if memory usage hasn't changed since
loading).
PiperOrigin-RevId: 632370455
MapInternal() returns a coherent memory mapping of the host file descriptor
represented by a memmap.File, in the sentry's address space. This is
principally used when the sentry needs to access the contents of application
memory (for e.g. syscall arguments passed by pointer, or the source/destination
of a write()/read() syscall); it usually looks up the memmap.Files backing
application addresses and obtains mappings via MapInternal().
/dev/nvidia-uvm cannot generally be mapped into the sentry's address space, for
reasons described by
https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md#unified-virtual-memory-uvm
(in short, nvidia-uvm requires that a given page at file offset X can only be
mapped at address X). To allow the sentry to access the contents of such
mappings, make it possible for memmap.File.MapInternal() to indicate that a
fallback to buffered I/O is required, add interface methods
memmap.File.Buffer{Read,Write}At() to perform this buffered I/O, and implement
this fallback in the MM I/O path.
This CL does not use the new buffered I/O fallback anywhere; a following CL
adds it to nvproxy's nvidia-uvm.
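
The shape of the fallback can be sketched as follows. The sentinel error, the
interface, and the method signatures are assumptions made for illustration;
they are not the actual memmap.File definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferedIORequired is a hypothetical sentinel by which mapInternal
// signals that the caller must fall back to buffered I/O.
var errBufferedIORequired = errors.New("buffered I/O required")

// file is a simplified stand-in for the relevant parts of memmap.File.
type file interface {
	mapInternal(off, length uint64) ([]byte, error)
	bufferReadAt(off uint64, dst []byte) (int, error)
}

// readFrom copies from f at off into dst, preferring an internal mapping and
// falling back to buffered I/O when the file cannot be mapped.
func readFrom(f file, off uint64, dst []byte) (int, error) {
	m, err := f.mapInternal(off, uint64(len(dst)))
	if errors.Is(err, errBufferedIORequired) {
		return f.bufferReadAt(off, dst)
	}
	if err != nil {
		return 0, err
	}
	return copy(dst, m), nil
}

// mappable can always be mapped.
type mappable struct{ data []byte }

func (m *mappable) mapInternal(off, length uint64) ([]byte, error) {
	return m.data[off : off+length], nil
}

func (m *mappable) bufferReadAt(off uint64, dst []byte) (int, error) {
	return 0, errors.New("unexpected buffered read")
}

// unmappable can only be accessed through buffered I/O.
type unmappable struct{ data []byte }

func (u *unmappable) mapInternal(off, length uint64) ([]byte, error) {
	return nil, errBufferedIORequired
}

func (u *unmappable) bufferReadAt(off uint64, dst []byte) (int, error) {
	return copy(dst, u.data[off:]), nil
}

func main() {
	dst := make([]byte, 3)
	n, err := readFrom(&mappable{data: []byte("abcdef")}, 1, dst)
	fmt.Println(n, string(dst), err)
	n, err = readFrom(&unmappable{data: []byte("uvwxyz")}, 2, dst)
	fmt.Println(n, string(dst), err)
}
```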
Updates #10331
PiperOrigin-RevId: 629830825
These interfaces only existed to add ReadByte() and WriteByte() methods. There
were only 4 implementors of these methods: compressio.{Simple}{Reader/Writer}.
And there were only 2 users of these interfaces.
Using io.{Reader/Writer} is more extensible. For instance, it allows using
*os.File with the `wire` package without any wrappers.
Updated the 2 users to implement their own {read/write}Byte(). To avoid heap
allocation of the [1]byte storage during calls to io.Reader.Read or
io.Writer.Write due to the interface call, used sync.Pool. Earlier, calls to
compressio.Simple{Reader/Writer}'s implementation of {Read/Write}Byte would
cause a heap allocation.
PiperOrigin-RevId: 625167495
The work done in c087777e37 ("Plumb restore context to afterLoad()") makes
pgalloc.MemoryFileProvider redundant as structs can now easily restore
pgalloc.MemoryFile in stateify's afterLoad() method. This allows structs to
have a pgalloc.MemoryFile field and use that directly, instead of going through
the provided interface.
This cleans up a lot of code and also should be more performant (avoids an
interface method call on many hot paths).
PiperOrigin-RevId: 615258927
Earlier we were always restoring pma.file to mm.mfp.MemoryFile(). However,
d8eb29ed6f ("Add support for saving PMAs referencing tmpfs filestore files.")
added support for saving PMAs that reference "private" pgalloc.MemoryFiles that
are different from mm.mfp.MemoryFile().
We achieve the correct restore by:
- Adding a "RestoreID" field to pgalloc.MemoryFile. Private MemoryFiles set
this with a vfs.RestoreID.String(). The non-private MemoryFile does not set it.
- The MemoryFile struct is not savable by itself, but the pma.file field is saved
as a string. We store the RestoreID string there.
- On restore, if RestoreID is "", then restore using CtxMemoryFile. If it has a
non-empty RestoreID, then restore using CtxMemoryFileMap.
- Cleanup: vfs.CtxFilesystemMemoryFileMap was moved to pgalloc.CtxMemoryFileMap
so we can now provide a pgalloc.MemoryFileMapFromContext() method which
cleans up some code. Also the key to this map (MemoryFileOpts.RestoreID)
belongs to pgalloc, so it seems like the right place to have this context.
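
The restore-time lookup can be sketched as below. The types and the function
are hypothetical simplifications: the two arguments stand in for what
CtxMemoryFile and CtxMemoryFileMap would provide from the context.

```go
package main

import "fmt"

// memoryFile is a simplified stand-in for pgalloc.MemoryFile.
type memoryFile struct{ restoreID string }

// restoreFile resolves the string saved for pma.file: an empty string selects
// the main MemoryFile, and any other value is looked up among the private
// MemoryFiles by RestoreID.
func restoreFile(saved string, mainMF *memoryFile, private map[string]*memoryFile) (*memoryFile, error) {
	if saved == "" {
		return mainMF, nil
	}
	mf, ok := private[saved]
	if !ok {
		return nil, fmt.Errorf("no MemoryFile with RestoreID %q", saved)
	}
	return mf, nil
}

func main() {
	mainMF := &memoryFile{}
	tmpfsMF := &memoryFile{restoreID: "example-id"} // hypothetical RestoreID
	private := map[string]*memoryFile{tmpfsMF.restoreID: tmpfsMF}

	mf, err := restoreFile("", mainMF, private)
	fmt.Println(mf == mainMF, err)
	mf, err = restoreFile("example-id", mainMF, private)
	fmt.Println(mf == tmpfsMF, err)
	_, err = restoreFile("missing", mainMF, private)
	fmt.Println(err)
}
```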
PiperOrigin-RevId: 614903073
The safemem.Reader interface receiver was causing the implementation to escape
to the heap (as at compile time, the compiler cannot prove anything about the
implementation).
Using generics for io.ReadFullToBlocks() does not help in avoiding the heap
allocation.
With this change, we avoid at least these allocations:
- rw variable in kcov.Kcov.TaskWork().
- safemem.BlockSeqReader struct in mm.MemoryManager.getPMAsInternalLocked().
- gr variable in fsutil.FileRangeSet.Fill().
- h variable in gofer.dentry.Translate().
- h variable in gofer.dentryReadWriter.ReadToBlocks().
Some of these are on hot paths for IO workloads.
PiperOrigin-RevId: 609805848
The rootfs is such a mount (with default overlay2=root:self configuration).
It should be possible to checkpoint this mount when the application has active
VMAs that have allocated PMAs from the backing filestore file.
2d90b66af1 ("Add checkpoint/restore support for tmpfs with file backend.")
added support for saving such a tmpfs filestore file during checkpoint and
restoring it correctly. Hence, after restore, the VMAs and PMAs should be
valid.
PiperOrigin-RevId: 605766485
- Replace Add with TryInsertRange; for symmetry with RemoveRange, to establish
the convention that *Range methods perform an implicit search in the set, and
so that we can fork InsertRange which has Insert-like semantics (panics on
conflict), which the majority of callers want.
- Rename MergeRange and MergeAdjacent to MergeInsideRange and MergeOutsideRange
respectively; for the same convention, and to more clearly describe the
difference between these functions.
- Add MergePrev and MergeNext. These solve the longstanding problem of
requiring a separate call to Merge{Inside,Outside}Range (which will perform
additional searches) after mutating a set in a relatively simple manner.
- Add SplitBefore and SplitAfter, which are halves of Isolate. These are
slightly preferable to Isolate in many use cases for the latter (when
iterating segments within a range, only the first segment can include a key
before the start, so this saves some useless comparisons in almost every
iteration of such loops), and are useful in some more complex algorithms.
Also add LowerBoundSegmentSplitBefore and UpperBoundSegmentSplitAfter as
ergonomic aids for the former use case.
- Add {Visit,Mutate}[Full]Range, which are convenience wrappers around the
iterator API (including new functions) for simple use cases (and hence also
serve to demonstrate how the new iterator functions are used).
MutateFullRange in particular replaces ApplyContiguous and adds merging
during iteration.
- Add RemoveFullRange, which (analogous to {Visit,Mutate}FullRange) is a
variant of RemoveRange that checks that the range is fully covered by
segments.
- Add Unisolate, which combines MergePrev and MergeNext in the same way that
Isolate combines SplitBefore and SplitAfter. This is useful for merging after
mutation of a single segment.
- Add {First,Last,LowerBound,UpperBound}LargeEnoughGap, which are convenient
loop starters when using gap tracking.
- Replace SegmentDataSlices with FlatSegment, which is easier to use when
specifying "set literals" (as in tests).
- Make {prev,next}LargeEnoughGapHelper iterative rather than tail-recursive.
- Slightly optimize Iterator.{Prev,Next}NonEmpty: GapIterator.{Start,End} needs
to find the corresponding Iterator, so call Iterator.{Prev,Next}Segment
directly rather than doing so twice.
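
To illustrate how SplitBefore and SplitAfter compose into Isolate, here is a
simplified sketch over a sorted slice of segments. The real set is a generated
B-tree with an iterator API; every name and signature below is hypothetical.

```go
package main

import "fmt"

// segment maps the half-open key range [start, end) to a value.
type segment struct {
	start, end uint64
	val        string
}

// splitAt splits segs[i] at key if key lies strictly inside it, and reports
// whether a split happened.
func splitAt(segs []segment, i int, key uint64) ([]segment, bool) {
	s := segs[i]
	if key <= s.start || key >= s.end {
		return segs, false
	}
	segs = append(segs, segment{})
	copy(segs[i+2:], segs[i+1:])
	segs[i] = segment{s.start, key, s.val}
	segs[i+1] = segment{key, s.end, s.val}
	return segs, true
}

// splitBefore ensures that the segment containing start does not extend
// before it, returning the index of the segment that begins at or after start.
func splitBefore(segs []segment, i int, start uint64) ([]segment, int) {
	segs, split := splitAt(segs, i, start)
	if split {
		i++
	}
	return segs, i
}

// splitAfter ensures that segs[i] does not extend beyond end.
func splitAfter(segs []segment, i int, end uint64) ([]segment, int) {
	segs, _ = splitAt(segs, i, end)
	return segs, i
}

// isolate combines splitBefore and splitAfter so that segs[i] lies entirely
// within [start, end).
func isolate(segs []segment, i int, start, end uint64) ([]segment, int) {
	segs, i = splitBefore(segs, i, start)
	return splitAfter(segs, i, end)
}

func main() {
	segs := []segment{{0, 100, "a"}}
	segs, i := isolate(segs, 0, 20, 30)
	fmt.Println(i, segs)
}
```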
PiperOrigin-RevId: 583506148