65 Commits

Author SHA1 Message Date
Ayush Ranjan a81ec225dc Remove unnecessary calls to rand.Seed(time.Now().Unix()).
As per https://pkg.go.dev/math/rand#Seed:
"If Seed is not called, the generator is seeded randomly at program startup."
"Prior to Go 1.20, the generator was seeded like Seed(1) at program startup. To
force the old behavior, call Seed(1) at program startup."
"As of Go 1.20 there is no reason to call Seed with a random value."

rand.Seed() is deprecated. Followup to #11015.
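
A minimal sketch of what calling code looks like once the explicit seeding is dropped, assuming Go 1.20 or newer. The helper name `pickBackoffMillis` is hypothetical and is not taken from this change.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickBackoffMillis returns a pseudo-random delay in [0, max) milliseconds.
// Hypothetical helper: there is no rand.Seed call because, as of Go 1.20,
// the global generator is seeded randomly at program startup.
func pickBackoffMillis(max int) int {
	return rand.Intn(max)
}

func main() {
	fmt.Println(pickBackoffMillis(100))
}
```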

PiperOrigin-RevId: 685052229
2024-10-11 20:19:04 -07:00
Koichi Shiraishi 0cf77c02f8 all: remove use of the deprecated io/ioutil package & fix some other deprecated usages
Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>
2024-10-10 20:36:24 +09:00
Jamie Liu 41f01d8f9c pgalloc: integrate async page loading
When a pages file is provided to `runsc restore`, reads from that file are
asynchronous (via statefile.AsyncReader) in order to maximize throughput.
However, all such reads must complete before Kernel.LoadFrom() returns, so
applications cannot execute before MemoryFile loading is complete. The main
objective of this CL is to allow reads to continue after Kernel.LoadFrom()
returns, allowing applications to execute while MemoryFile loading is still in
progress. This behavior is user-visible: it affects whether deleting the pages
file frees disk space immediately on POSIX filesystems, may affect whether
deletion is possible on non-POSIX filesystems, and prevents unmounting
regardless. Thus it is flag-guarded as `runsc restore --background`.

MemoryFile ranges that have yet to be loaded, but that are being waited-for by
applications, should be prioritized over ranges for which no application is
waiting. This requires that application requests for data (calls to
MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which
ranges have not yet been loaded, request reads for such ranges with elevated
priority, and wait for only those reads to be completed; none of these are
supported by the existing statefile.AsyncReader.

Thus:

- Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is
  designed to be easily implementable using a goroutine pool, Linux native AIO,
  or io_uring, though only includes a goroutine pool implementation. (io_uring
  is widely disabled due to security vulnerabilities. In my testing, Linux
  native AIO is slower than the goroutine pool, but this may change with lower
  GOMAXPROCS which needs further testing.)

- Move I/O scheduling into pgalloc: introduce an async page loader goroutine
  that is started by MemoryFile.LoadFrom() when async page loading is requested
  (implicitly, via the existence of a pages file), which is responsible for
  driving submission of read requests and handling their completions.
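
The prioritization described above can be sketched as a two-level queue in which ranges that an application is waiting for are always submitted first. This illustrates the scheduling idea only; the types `pageRange` and `loadQueue` are hypothetical and do not reflect the actual pgalloc or aio data structures.

```go
package main

import "fmt"

// pageRange is a hypothetical half-open byte range [start, end) in the
// memory file that still has to be read from the pages file.
type pageRange struct{ start, end uint64 }

// loadQueue keeps waited-for ranges separate from background ranges.
type loadQueue struct {
	urgent     []pageRange // ranges some application is blocked on
	background []pageRange // ranges nobody is waiting for yet
}

// push enqueues r; waitedFor selects the elevated-priority queue.
func (q *loadQueue) push(r pageRange, waitedFor bool) {
	if waitedFor {
		q.urgent = append(q.urgent, r)
		return
	}
	q.background = append(q.background, r)
}

// pop returns the next range to submit, preferring urgent ranges.
func (q *loadQueue) pop() (pageRange, bool) {
	if len(q.urgent) > 0 {
		r := q.urgent[0]
		q.urgent = q.urgent[1:]
		return r, true
	}
	if len(q.background) > 0 {
		r := q.background[0]
		q.background = q.background[1:]
		return r, true
	}
	return pageRange{}, false
}

func main() {
	var q loadQueue
	q.push(pageRange{0, 4096}, false)
	q.push(pageRange{4096, 8192}, false)
	q.push(pageRange{8192, 12288}, true) // an application faulted on this range
	for r, ok := q.pop(); ok; r, ok = q.pop() {
		fmt.Println(r.start, r.end)
	}
}
```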

PiperOrigin-RevId: 679321884
2024-09-26 15:51:13 -07:00
Ayush Ranjan 905d769f6f Optimize wire.Uint.save() to make only 1 Write() call.
Earlier wire.Uint.save() could make up to 10 Write() calls depending on how
large the Uint being marshaled was. Write() is an interface method call, so
this avoids the dynamic dispatch overhead. Furthermore, compressio's Write
implementations themselves do a bunch of fixed work per call and invoke more
interface functions.

Increased the scratch buffer in the wire.Writer to accommodate 10 bytes. The
wire.Uint can be marshaled into at most 10 bytes.

Before:

goos: linux
goarch: amd64
pkg: pkg/state/wire/wire
cpu: AMD EPYC 7B12
BenchmarkUintSave
BenchmarkUintSave-24    	37557860	        31.13 ns/op	       0 B/op	       0 allocs/op

After:

goos: linux
goarch: amd64
pkg: pkg/state/wire/wire
cpu: AMD EPYC 7B12
BenchmarkUintSave
BenchmarkUintSave-24    	129611625	         9.274 ns/op	       0 B/op	       0 allocs/op
PiperOrigin-RevId: 672679703
2024-09-09 14:45:45 -07:00
Ayush Ranjan 7cf7cffd4f Optimize compressio.SimpleWriter with non-nil key using manual buffering.
This change optimizes compressio.SimpleWriter by buffering the output manually.
When a key is provided, SimpleWriter was adding a 4 byte header for each chunk
being written. But in practice, the wire package always calls
SimpleWriter.Write with 1-size byte slices. So each chunk is only 1 in length.
So 80% of the statefile ends up being just chunk headers and 20% is data.

This does the same optimization as 68f0b41bf9 ("compressio: Remove chunk size
from the wire format for SimpleRW when key=nil.") for key!=nil case.

This change additionally optimizes the calls to hash.Hash.Sum() to use existing
scratch buffers and hence avoids a byte-slice allocation.
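
To make the overhead concrete, here is a simplified sketch of manual buffering. It assumes a 4-byte big-endian length header per chunk and omits the per-chunk hash entirely; `chunkWriter` and `encodedSize` are hypothetical names, not the compressio implementation.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// chunkWriter buffers small writes and emits each chunk as a 4-byte length
// header followed by the data. The hash that follows each chunk when a key
// is set is left out of this sketch.
type chunkWriter struct {
	out io.Writer
	buf []byte
	max int
}

func newChunkWriter(out io.Writer, max int) *chunkWriter {
	return &chunkWriter{out: out, buf: make([]byte, 0, max), max: max}
}

// Write appends p to the buffer, flushing whenever a chunk fills up.
func (w *chunkWriter) Write(p []byte) (int, error) {
	total := len(p)
	for len(p) > 0 {
		n := w.max - len(w.buf)
		if n > len(p) {
			n = len(p)
		}
		w.buf = append(w.buf, p[:n]...)
		p = p[n:]
		if len(w.buf) == w.max {
			if err := w.Flush(); err != nil {
				return total - len(p), err
			}
		}
	}
	return total, nil
}

// Flush writes the pending chunk, if any, with its header.
func (w *chunkWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(w.buf)))
	if _, err := w.out.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := w.out.Write(w.buf); err != nil {
		return err
	}
	w.buf = w.buf[:0]
	return nil
}

// encodedSize performs n one-byte writes and returns the output size.
func encodedSize(n, max int) int {
	var out bytes.Buffer
	w := newChunkWriter(&out, max)
	for i := 0; i < n; i++ {
		w.Write([]byte{byte(i)})
	}
	w.Flush()
	return out.Len()
}

func main() {
	fmt.Println(encodedSize(1000, 1))    // one header per data byte
	fmt.Println(encodedSize(1000, 1024)) // one header for the whole run
}
```

With a chunk size of 1, every data byte carries its own header, so 4 of every 5 output bytes are headers, which matches the 80% figure above.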

Before:

BenchmarkTinyIO
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24         	100000000	        11.76 ns/op	       2 B/op	       0 allocs/op
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24           	  442598	      2738 ns/op	     211 B/op	       4 allocs/op

After:

BenchmarkTinyIO
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24         	100000000	        11.22 ns/op	       2 B/op	       0 allocs/op
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock
BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24           	89841070	        16.09 ns/op	       3 B/op	       0 allocs/op

Co-authored-by: Jamie Liu <jamieliu@google.com>
PiperOrigin-RevId: 672619452
2024-09-09 11:51:52 -07:00
Ayush Ranjan 68f0b41bf9 compressio: Remove chunk size from the wire format for SimpleRW when key=nil.
This change optimizes checkpoint/restore when --compression=none is being used.
Note that runsc never uses a key in SimpleRW.

In practice, the SimpleRW structs are only used for checkpoint/restore.
The caller of the Read/Write methods is the wire package. All the types defined
in the wire package (except String and Ref) translate their save()/load()
implementations to wire.Uint.save()/load().

wire.Uint attempts to be smart and compress the uint64 by reading or writing it
out byte by byte using a particular format (where the MSB indicates whether
more bits are needed to construct this uint64).

So what ends up happening is, the entire kernel is serialized byte by byte
and compressio mostly receives one byte slices to read/write.

For each call to Read/Write in SimpleRW, it adds a 4 byte header representing
"chunk size". So we have 4 bytes for chunk size, followed by 1 byte of data.
This is atrociously wasteful. 80% of the checkpoint file is such chunk sizes.

We only require such chunk size headers when a key is provided to compressio
and there is a hash appended after each chunk. So the chunk size would be
needed to figure out where the data ends and the hash begins. But in runsc, key
is never used. So this change gets rid of chunk size from the wire format of
SimpleRW when key=nil.

This should reduce the checkpoint.img file size by 80% and speed up kernel
save and load.

PiperOrigin-RevId: 669404610
2024-08-30 12:14:33 -07:00
Jamie Liu 56521670ef state/wire: do not use sync.Pool for single-byte buffers
This package critically depends on reading/writing single bytes. Since
arguments to interface methods io.Reader.Read / io.Writer.Write escape, a naive
implementation would heap-allocate a one-byte array per read/write.

Prior to cl/625167495, the wire package provided custom Reader/Writer
interfaces, implementations of which were required to provide their own
ReadByte/WriteByte methods that did not take any escaping arguments.
cl/625167495 eliminated these interfaces and made the wire package use
sync.Pool to allocate one-byte arrays instead, simplifying the pipelining of
readers and writers but introducing non-trivial overhead. This CL re-introduces
wire.Reader/Writer, but as structs combining an io.Reader/Writer and a one-byte
array; this preserves the relative ease of using arbitrary io.Readers/Writers,
while eliminating sync.Pool overhead by essentially having callers of the wire
package provide the persistent buffer.

PiperOrigin-RevId: 666946511
2024-08-23 15:45:22 -07:00
gVisor bot 7a57658d7c Internal change.
PiperOrigin-RevId: 666929095
2024-08-23 14:51:29 -07:00
Jamie Liu 04b2c4631d state: fix redundant reconciliation in typeDecodeDatabase.Lookup()
Simple microbenchmark for save+load of a pgalloc.memAcctSet:
Before this CL:

```
goos: linux
goarch: amd64
pkg: pkg/sentry/pgalloc/pgalloc
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
BenchmarkMemAcctSetSaveLoad
BenchmarkMemAcctSetSaveLoad/1000
BenchmarkMemAcctSetSaveLoad/1000-48                  362           3313192 ns/op
BenchmarkMemAcctSetSaveLoad/10000
BenchmarkMemAcctSetSaveLoad/10000-48                  33          35237918 ns/op
BenchmarkMemAcctSetSaveLoad/100000
BenchmarkMemAcctSetSaveLoad/100000-48                  3         360797304 ns/op
PASS
```

After this CL:

```
goos: linux
goarch: amd64
pkg: pkg/sentry/pgalloc/pgalloc
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
BenchmarkMemAcctSetSaveLoad
BenchmarkMemAcctSetSaveLoad/1000
BenchmarkMemAcctSetSaveLoad/1000-48                  417           2837200 ns/op
BenchmarkMemAcctSetSaveLoad/10000
BenchmarkMemAcctSetSaveLoad/10000-48                  38          30813388 ns/op
BenchmarkMemAcctSetSaveLoad/100000
BenchmarkMemAcctSetSaveLoad/100000-48                  4         311275972 ns/op
PASS
```
PiperOrigin-RevId: 663494805
2024-08-15 16:31:39 -07:00
Fabricio Voznika d4e733ac17 Add a few extension points
PiperOrigin-RevId: 644476039
2024-06-18 12:31:58 -07:00
Ayush Ranjan 5ab3eb46f4 Close statefile.AsyncReader on error paths.
PiperOrigin-RevId: 635854957
2024-05-21 10:41:36 -07:00
Jamie Liu 2e4177ed2d state: print encoding error before object
This prevents a very large object string from pushing an error message from
saving (which is usually something informative like "type Foo does not
implement SaverLoader") off the edge of the line. cl/605534648 already did this
for loading.

PiperOrigin-RevId: 631622344
2024-05-07 19:17:26 -07:00
Ayush Ranjan 06c085fae5 Add AsyncReader implementation in statefile package.
This type allows reading asynchronously and provides a Wait() method as a
barrier operation.

PiperOrigin-RevId: 627520473
2024-04-23 15:20:56 -07:00
Ayush Ranjan 43c2c00c50 Delete wire.Reader and wire.Writer.
These interfaces only existed to add ReadByte() and WriteByte() methods. There
were only 4 implementors of these methods: compressio.{Simple}{Reader/Writer}.
And there were only 2 users of this.

Using io.{Reader/Writer} is more extensible. For instance, it allows using
*os.File with the `wire` package without any wrappers.

Updated the 2 users to implement their own {read/write}Byte(). To avoid heap
allocation of the [1]byte storage during call to io.Reader.Read or
io.Writer.Write due to interface call, used sync.Pool. Earlier, calls to
compressio.Simple{Reader/Writer}'s implementation of {Read/Write}Byte would
cause a heap allocation.

PiperOrigin-RevId: 625167495
2024-04-15 19:40:27 -07:00
Fabricio Voznika dd51b97d9d Add compression variant for checkpoint tests
PiperOrigin-RevId: 625104713
2024-04-15 15:38:12 -07:00
Nayana Bidari 87d8df37c7 Enable save/checkpoint resume with runsc checkpoint command.
Enables save resume with checkpoint command. Previously when --leave-running
was set, the sandbox was destroyed after the checkpoint and restored with the
same id. With this change the sandbox will not be destroyed and resumes running
after the checkpoint.

PiperOrigin-RevId: 623282685
2024-04-09 14:34:50 -07:00
Ayush Ranjan 7e395bbbd4 Plumb restore context to load*() methods.
This allows for external information to be passed to restore code.
Similar to c087777e37 ("Plumb restore context to afterLoad()").

Updates #1956.

PiperOrigin-RevId: 614125262
2024-03-08 20:28:02 -08:00
Fabricio Voznika c087777e37 Plumb restore context to afterLoad()
This allows for external information to be passed to restore code, like
host FDs to be remapped.

Updates #1956

PiperOrigin-RevId: 612540749
2024-03-04 12:21:50 -08:00
Jamie Liu 59a057980d Minor FPU save/restore fixes.
- Use the correct instruction in safecopy.checkXstate(). Before this CL:

```
TEXT pkg/sentry/arch/fpu/fpu.initX86FPState.abi0(SB)
  ...
  fpu_amd64.s:78        0x7661b2                480fae2f                XRSTOR64 0(DI)

TEXT pkg/safecopy/safecopy.checkXstate.abi0(SB)
  ...
  xrstor_amd64.s:54     0x7648a0                0fae2f                  XRSTOR 0(DI)
```

I'm not sure what the actual difference between XRSTOR and XRSTOR64 is, but
Linux is careful to use XRSTOR64 (arch/x86/kernel/fpu/xstate.h:XRSTOR,
REX_PREFIX) so it probably matters.

- When an AfterLoad callback fails, log the error message before the failing
  object, since the latter can be huge and prevent the error message from being
  logged.

- Include additional information in the error message emitted by
  fpu.State.AfterLoad().

PiperOrigin-RevId: 605534648
2024-02-08 23:11:09 -08:00
Andrei Vagin 5f4abad306 Fix a few typos
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-25 12:13:42 -07:00
Ivan Prisyazhnyy 669726877e state: compressio: don't use flate for some workloads
checkpoint image compression (compressio) adds overhead to checkpoint and
restore. when gvisor restores the kernel state, the inflate() algorithm
requires:

- CPU to un/compress the data
- Memory blocks to store and un/compress data

memory blocks originate from the bytes.Buffers and the sync.Pool that
tries to reuse them. they are released only when the system decides it
is a good moment:

    pool.go: runtime_registerPoolCleanup(poolCleanup)

in my system (and in production) it takes around 240s to get the related
memory region freed (unmapped()).

during that period, from the moment the image state is read and the kernel is
loaded until `poolCleanup` is called and the GC releases the buffers, the
gVisor kernel (sandbox) process keeps tens to hundreds of megabytes of
anonymous memory pages (RAM) busy (allocated+reserved).

quite often, using compression doubles (x2) the memory overhead in production
systems that restore checkpoints, and adds hundreds of milliseconds of
startup latency just to uncompress the image.

our use case does not suffer from having uncompressed images on disk, but it
does suffer from the wasted memory during startup and from the CPU overhead.

this patch adds a flag to disable compression for container checkpoints.

Signed-off-by: Ivan Prisyazhnyy <john.koepi@gmail.com>
2023-08-10 17:55:40 +02:00
Etienne Perot 064faf80a4 runsc metric-server: Optimize memory usage and allocation-heavy functions.
This is an effort to make it a well-behaved background process.

With 110 sandboxes running, at rest, this goes from

```
VmRSS:   72376 kB
RssAnon: 51944 kB
```

to:

```
VmRSS:   45864 kB
RssAnon: 25788 kB
```

This GCs much more aggressively, including after every single request, which
means we do spend disproportionately more CPU in order to get that low memory
usage. From my testing, serving requests takes about 12% more CPU, and it's
all spent in GC.

The optimizations that went into this are:

- Add a method in `state` to discard the global type maps.
- Add a custom "packed" number type in `prometheus` library that encodes small
  integers and floating-point numbers in 32 bits whenever possible without
  loss of precision, otherwise they are encoded in their full 64-bit glory and
  the 32-bit representation is used as a pointer to the 64-bit representation.
  These are stored either per-sandbox (for static-after-sandbox-creation
  numbers like distribution bucket boundaries), or per-metric-retrieval
  attempt otherwise.
- Use string interning for commonly-seen strings across sandboxes, like metric
  names and label names. Label values are also interned, but only at a
  per-sandbox granularity.
- Reworked allocation-heavy functions like `OrderedLabels` and some string
  rendering functions to be (almost) allocation-free. This doesn't reduce
  memory usage at rest, and does increase their CPU cost, but in return it
  significantly cuts down on the percentage of CPU time spent in GC
  (>50% -> 25%) enough to justify spending the extra CPU in these functions.
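
The string interning mentioned in the list above can be sketched like this. It is a generic illustration; the type `interner` is hypothetical and is not the metric server's implementation.

```go
package main

import "fmt"

// interner returns one canonical copy of each distinct string, so that many
// sandboxes reporting the same metric or label names share a single backing
// allocation instead of holding one copy each.
type interner struct {
	seen map[string]string
}

func newInterner() *interner {
	return &interner{seen: make(map[string]string)}
}

// intern returns the canonical copy of s, storing s if it is new.
func (in *interner) intern(s string) string {
	if canonical, ok := in.seen[s]; ok {
		return canonical
	}
	in.seen[s] = s
	return s
}

// size reports how many distinct strings are held.
func (in *interner) size() int { return len(in.seen) }

func main() {
	in := newInterner()
	// Build the strings at run time so they are separate allocations.
	a := in.intern(string([]byte("metric_name")))
	b := in.intern(string([]byte("metric_name")))
	c := in.intern(string([]byte("label_name")))
	fmt.Println(a == b, c, in.size())
}
```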

PiperOrigin-RevId: 515181387
2023-03-08 17:15:52 -08:00
Adin Scannell 1ceb814544 Add default_applicable_licenses rules to packages.
PiperOrigin-RevId: 513581243
2023-03-02 10:50:04 -08:00
Adin Scannell 12a930a63e Move goid to dynamic facts render.
This removes the need for ongoing tags.

This change requires some minor updates to remove dependency cycles, since
the goid package is a base library used by many internals (log, sync, etc.).

PiperOrigin-RevId: 504066914
2023-01-23 13:29:37 -08:00
Rahat Mahmood a17ad261d6 stateify: Handle multi-name fields in struct declarations.
PiperOrigin-RevId: 500268939
2023-01-06 15:32:10 -08:00