gvisor

mirror of https://github.com/netbirdio/gvisor.git synced 2026-05-22 17:12:49 -07:00

Author	SHA1	Message	Date
Ayush Ranjan	a81ec225dc	Remove unnecessary calls to rand.Seed(time.Now().Unix()). As per https://pkg.go.dev/math/rand#Seed: "If Seed is not called, the generator is seeded randomly at program startup." "Prior to Go 1.20, the generator was seeded like Seed(1) at program startup. To force the old behavior, call Seed(1) at program startup." "As of Go 1.20 there is no reason to call Seed with a random value." rand.Seed() is deprecated. Followup to #11015. PiperOrigin-RevId: 685052229	2024-10-11 20:19:04 -07:00
Koichi Shiraishi	0cf77c02f8	all: remove use of the deprecated io/ioutil package & fix some other deprecated usages Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>	2024-10-10 20:36:24 +09:00
Jamie Liu	41f01d8f9c	pgalloc: integrate async page loading When a pages file is provided to `runsc restore`, reads from that file are asynchronous (via statefile.AsyncReader) in order to maximize throughput. However, all such reads must complete before Kernel.LoadFrom() returns, so applications cannot execute before MemoryFile loading is complete. The main objective of this CL is to allow reads to continue after Kernel.LoadFrom() returns, allowing applications to execute while MemoryFile loading is still in progress. This behavior is user-visible: it affects whether deleting the pages file frees disk space immediately on POSIX filesystems, may affect whether deletion is possible on non-POSIX filesystems, and prevents unmounting regardless. Thus it is flag-guarded as `runsc restore --background`. MemoryFile ranges that have yet to be loaded, but that are being waited-for by applications, should be prioritized over ranges for which no application is waiting. This requires that application requests for data (calls to MemoryFile.(memmap.File).DataFD/MapInternal()) are able to determine which ranges have not yet been loaded, request reads for such ranges with elevated priority, and wait for only those reads to be completed; none of these are supported by the existing statefile.AsyncReader. Thus: - Add //pkg/sentry/pgalloc/aio, which provides an async I/O API that is designed to be easily implementable using a goroutine pool, Linux native AIO, or io_uring, though only includes a goroutine pool implementation. (io_uring is widely disabled due to security vulnerabilities. In my testing, Linux native AIO is slower than the goroutine pool, but this may change with lower GOMAXPROCS which needs further testing.) - Move I/O scheduling into pgalloc: introduce an async page loader goroutine that is started by MemoryFile.LoadFrom() when async page loading is requested (implicitly, via the existence of a pages file), which is responsible for driving submission of read requests and handling their completions. PiperOrigin-RevId: 679321884	2024-09-26 15:51:13 -07:00
Ayush Ranjan	905d769f6f	Optimize wire.Uint.save() to make only 1 Write() call. Earlier wire.Uint.save() could make up to 10 Write() calls depending on how large the Uint being marshaled was. Write() is an interface method call, so this avoids the dynamic dispatch overhead. Furthermore, compressio's Write implementations themselves do a bunch of fixed work per call and invoke more interface functions. Increased the scratch buffer in the wire.Writer to accommodate 10 bytes. The wire.Uint can be marshaled into at most 10 bytes. Before: goos: linux goarch: amd64 pkg: pkg/state/wire/wire cpu: AMD EPYC 7B12 BenchmarkUintSave BenchmarkUintSave-24 37557860 31.13 ns/op 0 B/op 0 allocs/op After: goos: linux goarch: amd64 pkg: pkg/state/wire/wire cpu: AMD EPYC 7B12 BenchmarkUintSave BenchmarkUintSave-24 129611625 9.274 ns/op 0 B/op 0 allocs/op PiperOrigin-RevId: 672679703	2024-09-09 14:45:45 -07:00
Ayush Ranjan	7cf7cffd4f	Optimize compressio.SimpleWriter with non-nil key using manual buffering. This change optimizes compressio.SimpleWriter by buffering the output manually. When a key is provided, SimpleWriter was adding a 4 byte header for each chunk being written. But in practice, the wire package always calls SimpleWriter.Write with 1-size byte slices. So each chunk is only 1 in length. So 80% of the statefile ends up being just chunk headers and 20% is data. This does the same optimization as `68f0b41bf9` ("compressio: Remove chunk size from the wire format for SimpleRW when key=nil.") for key!=nil case. This change additionally optimizes the calls to hash.Hash.Sum() to use existing scratch buffers and hence avoids a byte-slice allocation. Before: BenchmarkTinyIO BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24 100000000 11.76 ns/op 2 B/op 0 allocs/op BenchmarkTinyIO/NoCompressWriteHash1024KbBlock BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24 442598 2738 ns/op 211 B/op 4 allocs/op After: BenchmarkTinyIO BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock BenchmarkTinyIO/NoCompressWriteNoHash1024KbBlock-24 100000000 11.22 ns/op 2 B/op 0 allocs/op BenchmarkTinyIO/NoCompressWriteHash1024KbBlock BenchmarkTinyIO/NoCompressWriteHash1024KbBlock-24 89841070 16.09 ns/op 3 B/op 0 allocs/op Co-authored-by: Jamie Liu <jamieliu@google.com> PiperOrigin-RevId: 672619452	2024-09-09 11:51:52 -07:00
Ayush Ranjan	68f0b41bf9	compressio: Remove chunk size from the wire format for SimpleRW when key=nil. This change optimizes checkpoint/restore when --compression=none is being used. Note that runsc never uses a key in SimpleRW. In practice, the SimpleRW structs are only used for checkpoint/restore. The caller of the Read/Write methods is the wire package. All the types defined in the wire package (except String and Ref) translate their save()/load() implementations to wire.Uint.save()/load(). wire.Uint attempts to be smart and compress the uint64 by reading or writing it out byte by byte using a particular format (where the MSB indicates whether more bits are needed to construct this uint64). So what ends up happening is, the entire kernel is serialized byte by byte and compressio mostly receives one-byte slices to read/write. For each call to Read/Write in SimpleRW, it adds a 4 byte header representing "chunk size". So we have 4 bytes for chunk size, followed by 1 byte of data. This is atrociously wasteful. 80% of the checkpoint file is such chunk sizes. We only require such chunk size headers when a key is provided to compressio and there is a hash appended after each chunk. So the chunk size would be needed to figure out where the data ends and the hash begins. But in runsc, key is never used. So this change gets rid of chunk size from the wire format of SimpleRW when key=nil. This should reduce the checkpoint.img file size by 80% and speed up kernel save and load. PiperOrigin-RevId: 669404610	2024-08-30 12:14:33 -07:00
Jamie Liu	56521670ef	state/wire: do not use sync.Pool for single-byte buffers This package critically depends on reading/writing single bytes. Since arguments to interface methods io.Reader.Read / io.Writer.Write escape, a naive implementation would heap-allocate a one-byte array per read/write. Prior to cl/625167495, the wire package provided custom Reader/Writer interfaces, implementations of which were required to provide their own ReadByte/WriteByte methods that did not take any escaping arguments. cl/625167495 eliminated these interfaces and made the wire package use sync.Pool to allocate one-byte arrays instead, simplifying the pipelining of readers and writers but introducing non-trivial overhead. This CL re-introduces wire.Reader/Writer, but as structs combining an io.Reader/Writer and a one-byte array; this preserves the relative ease of using arbitrary io.Readers/Writers, while eliminating sync.Pool overhead by essentially having callers of the wire package provide the persistent buffer. PiperOrigin-RevId: 666946511	2024-08-23 15:45:22 -07:00
gVisor bot	7a57658d7c	Internal change. PiperOrigin-RevId: 666929095	2024-08-23 14:51:29 -07:00
Jamie Liu	04b2c4631d	state: fix redundant reconciliation in typeDecodeDatabase.Lookup() Simple microbenchmark for save+load of a pgalloc.memAcctSet: Before this CL: ``` goos: linux goarch: amd64 pkg: pkg/sentry/pgalloc/pgalloc cpu: Intel(R) Xeon(R) CPU @ 2.60GHz BenchmarkMemAcctSetSaveLoad BenchmarkMemAcctSetSaveLoad/1000 BenchmarkMemAcctSetSaveLoad/1000-48 362 3313192 ns/op BenchmarkMemAcctSetSaveLoad/10000 BenchmarkMemAcctSetSaveLoad/10000-48 33 35237918 ns/op BenchmarkMemAcctSetSaveLoad/100000 BenchmarkMemAcctSetSaveLoad/100000-48 3 360797304 ns/op PASS ``` After this CL: ``` goos: linux goarch: amd64 pkg: pkg/sentry/pgalloc/pgalloc cpu: Intel(R) Xeon(R) CPU @ 2.60GHz BenchmarkMemAcctSetSaveLoad BenchmarkMemAcctSetSaveLoad/1000 BenchmarkMemAcctSetSaveLoad/1000-48 417 2837200 ns/op BenchmarkMemAcctSetSaveLoad/10000 BenchmarkMemAcctSetSaveLoad/10000-48 38 30813388 ns/op BenchmarkMemAcctSetSaveLoad/100000 BenchmarkMemAcctSetSaveLoad/100000-48 4 311275972 ns/op PASS ``` PiperOrigin-RevId: 663494805	2024-08-15 16:31:39 -07:00
Fabricio Voznika	d4e733ac17	Add a few extension points PiperOrigin-RevId: 644476039	2024-06-18 12:31:58 -07:00
Ayush Ranjan	5ab3eb46f4	Close statefile.AsyncReader on error paths. PiperOrigin-RevId: 635854957	2024-05-21 10:41:36 -07:00
Jamie Liu	2e4177ed2d	state: print encoding error before object This prevents a very large object string from pushing an error message from saving (which is usually something informative like "type Foo does not implement SaverLoader") off the edge of the line. cl/605534648 already did this for loading. PiperOrigin-RevId: 631622344	2024-05-07 19:17:26 -07:00
Ayush Ranjan	06c085fae5	Add AsyncReader implementation in statefile package. This type allows reading asynchronously and provides a Wait() method as a barrier operation. PiperOrigin-RevId: 627520473	2024-04-23 15:20:56 -07:00
Ayush Ranjan	43c2c00c50	Delete wire.Reader and wire.Writer. These interfaces only existed to add ReadByte() and WriteByte() methods. There were only 4 implementors of these methods: compressio.{Simple}{Reader/Writer}. And there were only 2 users of this. Using io.{Reader/Writer} is more extendible. For instance, it allows using *os.File with the `wire` package without any wrappers. Updated the 2 users to implement their own {read/write}Byte(). To avoid heap allocation of the [1]byte storage during call to io.Reader.Read or io.Writer.Write due to interface call, used sync.Pool. Earlier, calls to compressio.Simple{Reader/Writer}'s implementation of {Read/Write}Byte would cause a heap allocation. PiperOrigin-RevId: 625167495	2024-04-15 19:40:27 -07:00
Fabricio Voznika	dd51b97d9d	Add compression variant for checkpoint tests PiperOrigin-RevId: 625104713	2024-04-15 15:38:12 -07:00
Nayana Bidari	87d8df37c7	Enable save/checkpoint resume with runsc checkpoint command. Enables save resume with checkpoint command. Previously when --leave-running was set, the sandbox was destroyed after the checkpoint and restored with the same id. With this change the sandbox will not be destroyed and resumes running after the checkpoint. PiperOrigin-RevId: 623282685	2024-04-09 14:34:50 -07:00
Ayush Ranjan	7e395bbbd4	Plumb restore context to load*() methods. This allows for external information to be passed to restore code. Similar to `c087777e37` ("Plumb restore context to afterLoad()"). Updates #1956. PiperOrigin-RevId: 614125262	2024-03-08 20:28:02 -08:00
Fabricio Voznika	c087777e37	Plumb restore context to afterLoad() This allows for external information to be passed to restore code, like host FDs to be remapped. Updates #1956 PiperOrigin-RevId: 612540749	2024-03-04 12:21:50 -08:00
Jamie Liu	59a057980d	Minor FPU save/restore fixes. - Use the correct instruction in safecopy.checkXstate(). Before this CL: ``` TEXT pkg/sentry/arch/fpu/fpu.initX86FPState.abi0(SB) ... fpu_amd64.s:78 0x7661b2 480fae2f XRSTOR64 0(DI) TEXT pkg/safecopy/safecopy.checkXstate.abi0(SB) ... xrstor_amd64.s:54 0x7648a0 0fae2f XRSTOR 0(DI) ``` I'm not sure what the actual difference between XRSTOR and XRSTOR64 is, but Linux is careful to use XRSTOR64 (arch/x86/kernel/fpu/xstate.h:XRSTOR, REX_PREFIX) so it probably matters. - When an AfterLoad callback fails, log the error message before the failing object, since the latter can be huge and prevent the error message from being logged. - Include additional information in the error message emitted by fpu.State.AfterLoad(). PiperOrigin-RevId: 605534648	2024-02-08 23:11:09 -08:00
Andrei Vagin	5f4abad306	Fix a few typos There is an idea to run codespell as part of our presubmit checks. Before enabling it for new changes, let's fix what it has found. Signed-off-by: Andrei Vagin <avagin@gmail.com>	2023-10-25 12:13:42 -07:00
Ivan Prisyazhnyy	669726877e	state: compressio: don't use flate for some workloads Checkpoint image compression (compressio) adds overhead to its operations. When gVisor restores the kernel state, the inflate() algorithm requires: - CPU to decompress the data - memory blocks to store and decompress the data. The memory blocks originate from bytes.Buffers and the sync.Pool that tries to reuse them. They are released only when the system decides it is a good moment: pool.go: runtime_registerPoolCleanup(poolCleanup). On my system (and in production) it takes around 240s for the related memory region to be freed (unmapped()). From the time the image state is read and the kernel is loaded until `poolCleanup` is called and GC() releases the buffers, the gVisor kernel (sandbox) process holds tens to hundreds of megabytes of anonymous memory pages (RAM) busy (allocated+reserved). Quite often, using compression results in 2x memory overhead in a production system that restores checkpoints, plus hundreds of milliseconds (+100ms) of startup latency just to decompress the image. Our use case does not suffer from having uncompressed images on disk, but it does suffer from the wasted memory and CPU overhead during startup. This patch adds a flag to disable compression for container checkpoints. Signed-off-by: Ivan Prisyazhnyy <john.koepi@gmail.com>	2023-08-10 17:55:40 +02:00
Etienne Perot	064faf80a4	`runsc metric-server`: Optimize memory usage and allocation-heavy functions. This is an effort to make it a well-behaved background process. With 110 sandboxes running, at rest, this goes from ``` VmRSS: 72376 kB RssAnon: 51944 kB ``` to: ``` VmRSS: 45864 kB RssAnon: 25788 kB ``` This GCs much more aggressively, including after every single request, which means we do spend disproportionately more CPU in order to get that low memory usage. From my testing, serving requests takes about 12% more CPU, and it's all spent in GC. The optimizations that went into this are: - Add a method in `state` to discard the global type maps. - Add a custom "packed" number type in `prometheus` library that encodes small integers and floating-point numbers in 32 bits whenever possible without loss of precision, otherwise they are encoded in their full 64-bit glory and the 32-bit representation is used as a pointer to the 64-bit representation. These are stored either per-sandbox (for static-after-sandbox-creation numbers like distribution bucket boundaries), or per-metric-retrieval attempt otherwise. - Use string interning for commonly-seen strings across sandboxes, like metric names and label names. Label values are also interned, but only at a per-sandbox granularity. - Rework allocation-heavy functions like `OrderedLabels` and some string rendering functions to be (almost) allocation-free. This doesn't reduce memory usage at rest, and does increase their CPU cost, but in return it significantly cuts down on the percentage of CPU time spent in GC (>50% -> 25%) enough to justify spending the extra CPU in these functions. PiperOrigin-RevId: 515181387	2023-03-08 17:15:52 -08:00
Adin Scannell	1ceb814544	Add `default_applicable_licenses` rules to packages. PiperOrigin-RevId: 513581243	2023-03-02 10:50:04 -08:00
Adin Scannell	12a930a63e	Move goid to dynamic facts render. This removes the need for ongoing tags. This change requires some minor updates to remove dependency cycles, since the goid package is a base library used by many internals (log, sync, etc.). PiperOrigin-RevId: 504066914	2023-01-23 13:29:37 -08:00
Rahat Mahmood	a17ad261d6	stateify: Handle multi-name fields in struct declarations. PiperOrigin-RevId: 500268939	2023-01-06 15:32:10 -08:00


65 Commits