As per https://pkg.go.dev/math/rand#Seed:
"If Seed is not called, the generator is seeded randomly at program startup."
"Prior to Go 1.20, the generator was seeded like Seed(1) at program startup. To
force the old behavior, call Seed(1) at program startup."
"As of Go 1.20 there is no reason to call Seed with a random value."
rand.Seed() is deprecated. Followup to #11015.
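A minimal sketch of what this means for callers, assuming Go 1.20 or later. The helper names `pickIndex` and `deterministicIndex` are hypothetical and only illustrate the two cases: relying on the randomly seeded top-level generator, and using a private generator when a reproducible sequence is wanted.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickIndex (hypothetical helper) returns a pseudo-random index in [0, n)
// from the top-level generator. No rand.Seed call is needed: as of Go 1.20
// the top-level generator is seeded randomly at program startup.
func pickIndex(n int) int {
	return rand.Intn(n)
}

// deterministicIndex (hypothetical helper) uses a private generator with a
// fixed seed, for callers that want a reproducible sequence without calling
// the deprecated rand.Seed.
func deterministicIndex(seed int64, n int) int {
	r := rand.New(rand.NewSource(seed))
	return r.Intn(n)
}

func main() {
	fmt.Println("random index:", pickIndex(10))
	fmt.Println("reproducible index:", deterministicIndex(1, 10))
}
```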
PiperOrigin-RevId: 685052229
When a pages file is provided to `runsc restore`, reads from that file are
asynchronous (via statefile.AsyncReader) in order to maximize throughput.
However, all such reads must complete before Kernel.LoadFrom() returns, so
applications cannot execute before MemoryFile loading is complete. The main
objective of this CL is to allow reads to continue after Kernel.LoadFrom()
returns, allowing applications to execute while MemoryFile loading is still in
progress. This behavior is user-visible: it affects whether deleting the pages
file frees disk space immediately on POSIX filesystems, may affect whether
deletion is possible on non-POSIX filesystems, and prevents unmounting
regardless. Thus it is flag-guarded as `runsc restore --background`.
MemoryFile ranges that have yet to be loaded, but that are being waited-for by
applications, should be prioritized over ranges for which no application is
waiting. This requires that application requests for data (calls to
MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which
ranges have not yet been loaded, request reads for such ranges with elevated
priority, and wait for only those reads to be completed; none of these are
supported by the existing statefile.AsyncReader.
Thus:
- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is
designed to be easily implementable using a goroutine pool, Linux native AIO,
or io_uring, though it only includes a goroutine pool implementation. (io_uring
is widely disabled due to security vulnerabilities. In my testing, Linux
native AIO is slower than the goroutine pool, but this may change with lower
GOMAXPROCS which needs further testing.)
- Move I/O scheduling into pgalloc: introduce an async page loader goroutine
that is started by MemoryFile.LoadFrom() when async page loading is requested
(implicitly, via the existence of a pages file), which is responsible for
driving submission of read requests and handling their completions.
PiperOrigin-RevId: 679321884
Previously, wire.Uint.save() could make up to 10 Write() calls, depending on how
large the Uint being marshaled was. Write() is an interface method call, so
marshaling into a scratch buffer and issuing a single Write() avoids most of the
dynamic dispatch overhead. Furthermore, compressio's Write implementations
themselves do a fixed amount of work per call and invoke more interface
functions.
Increased the scratch buffer in the wire.Writer to accommodate 10 bytes. The
wire.Uint can be marshaled into at most 10 bytes.
Before:
goos: linux
goarch: amd64
pkg: pkg/state/wire/wire
cpu: AMD EPYC 7B12
BenchmarkUintSave
BenchmarkUintSave-24 37557860 31.13 ns/op 0 B/op 0 allocs/op
After:
goos: linux
goarch: amd64
pkg: pkg/state/wire/wire
cpu: AMD EPYC 7B12
BenchmarkUintSave
BenchmarkUintSave-24 129611625 9.274 ns/op 0 B/op 0 allocs/op
PiperOrigin-RevId: 672679703
This change optimizes compressio.SimpleWriter by buffering the output manually.
When a key is provided, SimpleWriter was adding a 4-byte header for each chunk
being written. But in practice, the wire package always calls
SimpleWriter.Write with 1-byte slices, so each chunk carries only 1 byte of
data. As a result, 80% of the statefile ends up being chunk headers and only
20% is data.
This does the same optimization as 68f0b41bf9 ("compressio: Remove chunk size
from the wire format for SimpleRW when key=nil.") for the key!=nil case.
This change additionally optimizes the calls to hash.Hash.Sum() to use existing
scratch buffers and hence avoids a byte-slice allocation.
Before:
BenchmarkTinyIO
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24 100000000 11.76 ns/op 2 B/op 0 allocs/op
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24 442598 2738 ns/op 211 B/op 4 allocs/op
After:
BenchmarkTinyIO
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24 100000000 11.22 ns/op 2 B/op 0 allocs/op
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24 89841070 16.09 ns/op 3 B/op 0 allocs/op
Co-authored-by: Jamie Liu <jamieliu@google.com>
PiperOrigin-RevId: 672619452
This change optimizes checkpoint/restore when --compression=none is being used.
Note that runsc never uses a key in SimpleRW.
In practice, the SimpleRW structs are only used for checkpoint/restore.
The caller of the Read/Write methods is the wire package. All the types defined
in the wire package (except String and Ref) translate their save()/load()
implementations to wire.Uint.save()/load().
wire.Uint attempts to be smart and compress the uint64 by reading or writing it
out byte by byte using a particular format (where the MSB indicates whether
more bits are needed to construct this uint64).
So what ends up happening is, the entire kernel is serialized byte by byte
and compressio mostly receives one byte slices to read/write.
For each call to Read/Write in SimpleRW, it adds a 4 byte header representing
"chunk size". So we have 4 bytes for chunk size, followed by 1 byte of data.
This is atrociously wasteful. 80% of the checkpoint file is such chunk sizes.
We only require such chunk size headers when a key is provided to compressio
and there is a hash appended after each chunk. In that case, the chunk size is
needed to figure out where the data ends and the hash begins. But in runsc, a
key is never used. So this change gets rid of the chunk size from the wire
format of SimpleRW when key=nil.
This should reduce the checkpoint.img file size by 80% and speed up kernel
save and load.
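The 80% figure follows from the 4-byte header per 1-byte payload described above. A small sketch of the arithmetic, where `headerFraction` is a hypothetical helper and the larger payload sizes are shown only for comparison:

```go
package main

import "fmt"

// chunkHeaderSize is the 4-byte "chunk size" header written per chunk,
// as described in the commit message.
const chunkHeaderSize = 4

// headerFraction returns the fraction of output bytes that are chunk headers
// when every chunk carries the given number of payload bytes.
func headerFraction(payload int) float64 {
	return float64(chunkHeaderSize) / float64(chunkHeaderSize+payload)
}

func main() {
	for _, p := range []int{1, 4096, 1 << 20} {
		fmt.Printf("payload=%d byte(s): %.4f%% of output is headers\n", p, 100*headerFraction(p))
	}
}
```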
PiperOrigin-RevId: 669404610
This package critically depends on reading/writing single bytes. Since
arguments to interface methods io.Reader.Read / io.Writer.Write escape, a naive
implementation would heap-allocate a one-byte array per read/write.
Prior to cl/625167495, the wire package provided custom Reader/Writer
interfaces, implementations of which were required to provide their own
ReadByte/WriteByte methods that did not take any escaping arguments.
cl/625167495 eliminated these interfaces and made the wire package use
sync.Pool to allocate one-byte arrays instead, simplifying the pipelining of
readers and writers but introducing non-trivial overhead. This CL re-introduces
wire.Reader/Writer, but as structs combining an io.Reader/Writer and a one-byte
array; this preserves the relative ease of using arbitrary io.Readers/Writers,
while eliminating sync.Pool overhead by essentially having callers of the wire
package provide the persistent buffer.
PiperOrigin-RevId: 666946511
This prevents a very large object string from pushing an error message from
saving (which is usually something informative like "type Foo does not
implement SaverLoader") off the edge of the line. cl/605534648 already did this
for loading.
PiperOrigin-RevId: 631622344
These interfaces only existed to add ReadByte() and WriteByte() methods. There
were only 4 implementors of these methods: compressio.{Simple}{Reader/Writer}.
And there were only 2 users of these interfaces.
Using io.{Reader/Writer} is more extensible. For instance, it allows using
*os.File with the `wire` package without any wrappers.
Updated the 2 users to implement their own {read/write}Byte(). To avoid heap
allocation of the [1]byte storage during call to io.Reader.Read or
io.Writer.Write due to interface call, used sync.Pool. Earlier, calls to
compressio.Simple{Reader/Writer}'s implementation of {Read/Write}Byte would
cause a heap allocation.
PiperOrigin-RevId: 625167495
Enables save/resume with the checkpoint command. Previously, when
--leave-running was set, the sandbox was destroyed after the checkpoint and
restored with the same ID. With this change, the sandbox is not destroyed and
resumes running after the checkpoint.
PiperOrigin-RevId: 623282685
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").
Updates #1956.
PiperOrigin-RevId: 614125262
- Use the correct instruction in safecopy.checkXstate(). Before this CL:
```
TEXT pkg/sentry/arch/fpu/fpu.initX86FPState.abi0(SB)
...
fpu_amd64.s:78 0x7661b2 480fae2f XRSTOR64 0(DI)
TEXT pkg/safecopy/safecopy.checkXstate.abi0(SB)
...
xrstor_amd64.s:54 0x7648a0 0fae2f XRSTOR 0(DI)
```
I'm not sure what the actual difference between XRSTOR and XRSTOR64 is, but
Linux is careful to use XRSTOR64 (arch/x86/kernel/fpu/xstate.h:XRSTOR,
REX_PREFIX) so it probably matters.
- When an AfterLoad callback fails, log the error message before the failing
object, since the latter can be huge and prevent the error message from being
logged.
- Include additional information in the error message emitted by
fpu.State.AfterLoad().
PiperOrigin-RevId: 605534648
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Checkpoint image compression (compressio) adds overhead to its operations.
When gVisor restores the kernel state, the inflate() algorithm requires:
- CPU to decompress the data
- Memory blocks to store and decompress the data
The memory blocks originate from the bytes.Buffers and the sync.Pool that
tries to reuse them. They are released only when the system decides it
is a good moment:
pool.go: runtime_registerPoolCleanup(poolCleanup)
On my system (and in production) it takes around 240s for the related
memory region to be freed (unmapped).
During that period, from the time the image state is read and the kernel is
loaded until `poolCleanup` is called and the GC releases the buffers, the
gVisor kernel (sandbox) process keeps tens to hundreds of megabytes of
anonymous memory pages (RAM) busy (allocated+reserved).
Quite often, using compression results in 2x memory overhead in production
systems that restore from checkpoints, plus 100+ ms (hundreds of milliseconds)
of startup latency just to decompress the image.
Our use case does not suffer from having uncompressed images on disk, but it
does suffer from the wasted memory during startup and the CPU overhead.
This patch adds a flag to disable compression for container checkpoints.
Signed-off-by: Ivan Prisyazhnyy <john.koepi@gmail.com>
This is an effort to reduce its memory usage so that it is a well-behaved
background process.
With 110 sandboxes running, at rest, this goes from
```
VmRSS: 72376 kB
RssAnon: 51944 kB
```
to:
```
VmRSS: 45864 kB
RssAnon: 25788 kB
```
This GCs much more aggressively, including after every single request, which
means we do spend disproportionately more CPU in order to get that low memory
usage. From my testing, serving requests takes about 12% more CPU, and it's
all spent in GC.
The optimizations that went into this are:
- Add a method in `state` to discard the global type maps.
- Add a custom "packed" number type in `prometheus` library that encodes small
integers and floating-point numbers in 32 bits whenever possible without
loss of precision, otherwise they are encoded in their full 64-bit glory and
the 32-bit representation is used as a pointer to the 64-bit representation.
These are stored either per-sandbox (for static-after-sandbox-creation
numbers like distribution bucket boundaries), or per-metric-retrieval
attempt otherwise.
- Use string interning for commonly-seen strings across sandboxes, like metric
names and label names. Label values are also interned, but only at a
per-sandbox granularity.
- Rework allocation-heavy functions like `OrderedLabels` and some string
rendering functions to be (almost) allocation-free. This doesn't reduce
memory usage at rest, and does increase their CPU cost, but in return it
cuts the percentage of CPU time spent in GC (from >50% to 25%), which is
enough to justify spending the extra CPU in these functions.
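As an illustration of the string interning item above, here is a minimal sketch. The `interner` type is hypothetical and is not the `prometheus` library's implementation; it only shows the general technique of keeping one canonical copy of each distinct string so that repeated metric and label names can share it.

```go
package main

import (
	"fmt"
	"sync"
)

// interner (hypothetical) keeps one canonical copy of each distinct string.
type interner struct {
	mu sync.Mutex
	m  map[string]string
}

func newInterner() *interner {
	return &interner{m: make(map[string]string)}
}

// Intern returns the canonical copy of s, storing s as the canonical copy
// the first time it is seen.
func (in *interner) Intern(s string) string {
	in.mu.Lock()
	defer in.mu.Unlock()
	if c, ok := in.m[s]; ok {
		return c
	}
	in.m[s] = s
	return s
}

// Len reports how many distinct strings are currently stored.
func (in *interner) Len() int {
	in.mu.Lock()
	defer in.mu.Unlock()
	return len(in.m)
}

func main() {
	in := newInterner()
	// The metric names below are made up for the example.
	names := []string{"requests_total", "errors_total", "requests_total"}
	for _, n := range names {
		_ = in.Intern(n)
	}
	fmt.Printf("%d names seen, %d distinct strings stored\n", len(names), in.Len())
}
```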
PiperOrigin-RevId: 515181387
This removes the need for ongoing tags.
This change requires some minor updates to remove dependency cycles, since
the goid package is a base library used by many internals (log, sync, etc.).
PiperOrigin-RevId: 504066914