35 Commits

Author SHA1 Message Date
Jamie Liu a5573312e0 Add explicit huge page and memory recycling support to pgalloc.MemoryFile.
This CL addresses the following major issues:

- When an application releases memory to the sentry, the sentry unconditionally
  releases that memory to the host, rather than allowing it to be reused for
  future allocations, in order to ensure that new allocations are uniformly
  decommitted (use no memory): cl/145016083. In most cases, this should have
  relatively little performance impact; since releasing memory from the
  application to the OS is expensive even outside of gVisor, application memory
  allocators optimizing for performance already limit the rate at which they
  release memory to the OS. However, in applications that involve frequent
  process creation and exit (e.g. build systems), this practice prevents reuse
  of memory deallocated by exiting processes for memory allocated by new
  processes, resulting in both performance degradation and a spike in memory
  usage (since the sentry may not have released all deallocated memory to the
  host by the time new allocations occur).

- gVisor's historical approach to application THP is based on THP being enabled
  on a per-memfd basis, using the MFD_HUGEPAGE flag, which was not merged into
  the upstream Linux kernel
  (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/).
  Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory
  without requiring the system to enable THP for all tmpfs files and memfds (by
  setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or
  "force").

- Both MM and the application page allocator (pgalloc) are agnostic as to
  whether the underlying memory file will be THP-backed. Instead, both attempt
  to align hugepage-sized and larger allocations to hugepage boundaries, such
  that if the memory file happens to support THP then such allocations will be
  appropriately aligned to use THP. This is suboptimal since many allocations
  do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512, and page reuse
avoids incurring it at all.
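
The factor of 512 follows from the ratio of huge page size to base page size. A minimal sketch of that arithmetic, assuming the x86-64 sizes of 4 KiB base pages and 2 MiB huge pages (the constant and function names here are illustrative, not gVisor identifiers):

```go
package main

import "fmt"

const (
	smallPageSize = 4 << 10 // assumed: 4 KiB base page (x86-64)
	hugePageSize  = 2 << 20 // assumed: 2 MiB huge page (x86-64)
)

// faultsToPopulate returns how many page faults are needed to populate
// length bytes when each fault maps exactly one page of size pageSize.
func faultsToPopulate(length, pageSize uint64) uint64 {
	return (length + pageSize - 1) / pageSize
}

func main() {
	const length = 1 << 30 // 1 GiB of application memory
	small := faultsToPopulate(length, smallPageSize)
	huge := faultsToPopulate(length, hugePageSize)
	fmt.Printf("small-page faults: %d, huge-page faults: %d, ratio: %d\n",
		small, huge, small/huge)
}
```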

Thus:

- Instead of inferring whether THP use is desired from allocation size,
  indicate this explicitly as AllocOpts.Huge, and only set it to true for
  allocations for non-stack private anonymous mappings.

- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode
  that indicates that the caller will commit all pages in the allocation. In
  such cases, pgalloc can reuse deallocated pages without risking increased
  memory usage, internally referred to as "recycling".
  AllocateCallerIndirectCommit is used primarily for page faults on a
  THP-backed region. (It is also used for single-page allocations on non-THP
  backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned
  ranges, this is relatively uncommon.)

- Allow different chunks in pgalloc.MemoryFile's backing file to have varying
  THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.

- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
  deallocated pages for small/huge-page-backed regions respectively; two sets
  tracking in-use pages for small/huge-page-backed regions respectively; and a
  fifth set tracking memory accounting state.

- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
  pgalloc tests, but may also be applicable to disk-backed MemoryFiles.
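
The interaction between explicit huge-page selection and recycling can be sketched with a toy allocator. This is a simplified model under stated assumptions, not pgalloc's implementation: all type and function names below are hypothetical, pages are represented only by index, and releasing memory to the host is not modeled. It shows only the rule described above: deallocated pages are reused when the caller promises to commit the whole allocation, and small and huge regions are tracked separately.

```go
package main

import "fmt"

// allocMode is a toy stand-in for AllocOpts.Mode.
type allocMode int

const (
	modeFresh        allocMode = iota // hypothetical: pages must start decommitted
	modeCallerCommit                  // models AllocateCallerIndirectCommit
)

// allocOpts is a toy stand-in for AllocOpts.
type allocOpts struct {
	Huge bool
	Mode allocMode
}

// toyFile keeps separate state for small-page-backed (index 0) and
// huge-page-backed (index 1) regions.
type toyFile struct {
	next     [2]uint64   // next never-used page index per class
	recycled [2][]uint64 // deallocated pages that may still be committed
}

func classOf(huge bool) int {
	if huge {
		return 1
	}
	return 0
}

// allocate returns a page index and whether it was recycled. Recycling is
// only allowed when the caller will commit the page anyway, so reuse cannot
// increase memory usage.
func (f *toyFile) allocate(opts allocOpts) (page uint64, reused bool) {
	c := classOf(opts.Huge)
	if opts.Mode == modeCallerCommit {
		if n := len(f.recycled[c]); n > 0 {
			page = f.recycled[c][n-1]
			f.recycled[c] = f.recycled[c][:n-1]
			return page, true
		}
	}
	page = f.next[c]
	f.next[c]++
	return page, false
}

// free records a deallocated page as a recycling candidate.
func (f *toyFile) free(page uint64, huge bool) {
	c := classOf(huge)
	f.recycled[c] = append(f.recycled[c], page)
}

func main() {
	f := &toyFile{}
	p, _ := f.allocate(allocOpts{Huge: true, Mode: modeCallerCommit})
	f.free(p, true)
	_, reused := f.allocate(allocOpts{Huge: true, Mode: modeFresh})
	fmt.Println("fresh-mode allocation reused a page:", reused)
	_, reused = f.allocate(allocOpts{Huge: true, Mode: modeCallerCommit})
	fmt.Println("caller-commit allocation reused a page:", reused)
}
```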

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
  described in updateUsageLocked(), was based on the condition that
  MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
  which was invalidated by cl/337865250.

- Remove MemoryFileOpts.ManualZeroing, which is unused.

- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim"
  in Linux has a significantly different meaning (essentially "eviction" in
  pgalloc), and the latter seems to be conventional in user-mode memory
  allocators.

Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below,
"baseline" is without this CL, "expt" is with this CL, and "expt2" is with this
CL + release throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP
without forcing it on all host shmem users. In conjunction with recycling
(which has a relatively small effect on systrap since it does not use hardware
virtualization), THP use slightly improves performance, although whether this
is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │               expt                │               expt2               │
                                                │   sec/op   │   sec/op    vs base               │   sec/op    vs base               │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%       ~ (p=0.947 n=30)   38.84 ± 3%       ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%       ~ (p=0.057 n=30)   36.83 ± 5%       ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%       ~ (p=0.350 n=30)   38.58 ± 4%       ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%  -4.64% (p=0.026 n=30)   36.58 ± 4%       ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%       ~ (p=0.485 n=30)   40.09 ± 5%       ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%       ~ (p=0.307 n=30)   38.08 ± 1%       ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%       ~ (p=0.752 n=30)   38.59 ± 3%       ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%       ~ (p=0.708 n=30)   36.08 ± 6%       ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%       ~ (p=0.138 n=30)   211.2 ± 1%       ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%       ~ (p=0.542 n=30)   209.7 ± 1%       ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%       ~ (p=0.423 n=30)   210.0 ± 1%       ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%       ~ (p=0.219 n=30)   209.5 ± 1%       ~ (p=0.230 n=30)
geomean                                           67.62        66.97       -0.96%                  67.12       -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to
huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```
PiperOrigin-RevId: 647771821
2024-06-28 12:56:46 -07:00
Jamie Liu 56ab580ccb Automated rollback of changelist 633961720
PiperOrigin-RevId: 636602151
2024-05-23 10:50:10 -07:00
Konstantin Bogomolov 0bf4e9f6e5 Set limit on how big MemoryFile.Allocate calls can be.
Either TotalHostMem or TotalMem is a good candidate for the limit, because
if either of them is set we should not be going over it.

The motivation for this is to help catch syscalls that cause allocations
with size values that are blatantly bad.
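
A minimal sketch of such a check, assuming a value of zero means that the corresponding limit is unset. The function and error names are hypothetical and are not taken from the change itself:

```go
package main

import (
	"errors"
	"fmt"
)

// errAllocTooLarge is a hypothetical error for oversized requests.
var errAllocTooLarge = errors.New("allocation larger than total memory")

// checkAllocSize rejects a request of length bytes if it exceeds any limit
// that is set. Assumption: a limit of 0 means "not set".
func checkAllocSize(length, totalHostMem, totalMem uint64) error {
	for _, limit := range []uint64{totalHostMem, totalMem} {
		if limit != 0 && length > limit {
			return errAllocTooLarge
		}
	}
	return nil
}

func main() {
	const gib = 1 << 30
	fmt.Println(checkAllocSize(4096, 8*gib, 2*gib))  // plausible request
	fmt.Println(checkAllocSize(1<<62, 8*gib, 2*gib)) // blatantly bad size
}
```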

PiperOrigin-RevId: 633961720
2024-05-15 08:23:50 -07:00
Jing Chen be48200c0e Re-order loads in BUILD files to make transformations reversible in Copybara.
PiperOrigin-RevId: 598898756
2024-01-16 11:21:40 -08:00
Andrei Vagin 5f4abad306 Fix a few typos
The idea is to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has already found.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2023-10-25 12:13:42 -07:00
Nayana Bidari a6c3a61c29 Make changes for CgroupsReadControlFile method to get memory usage per cgroup.
- Adds methods to Enter, Leave and Migrate tasks for the memory controller.
- Updates the memory cgroup id when a task enters or leaves a cgroup.

PiperOrigin-RevId: 555568875
2023-08-10 11:09:28 -07:00
Nayana Bidari a87aa73698 Increment/decrement memory accounted per cgroup.
- Adds a new field in usageInfo to store the memory cgroup id.
- Creates a map of cgroup ids and memory stats to track the memory per cgroup
  in the MemoryLocked struct.
- Introduces new methods to increment, decrement, move, copy and get the total
  memory usage per cgroup.
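
The shape of this per-cgroup accounting can be sketched as a map keyed by cgroup id. This is only an illustration: the type and method names are hypothetical, locking is omitted, and callers are assumed never to decrement more than was previously accounted.

```go
package main

import "fmt"

// cgroupUsage is a hypothetical stand-in for the per-cgroup memory stats map.
type cgroupUsage struct {
	bytes map[uint32]uint64 // cgroup id -> accounted bytes
}

func newCgroupUsage() *cgroupUsage {
	return &cgroupUsage{bytes: make(map[uint32]uint64)}
}

// inc accounts n more bytes to cgroup cg.
func (u *cgroupUsage) inc(cg uint32, n uint64) { u.bytes[cg] += n }

// dec removes n bytes from cgroup cg. Assumes n <= current usage.
func (u *cgroupUsage) dec(cg uint32, n uint64) { u.bytes[cg] -= n }

// move transfers n accounted bytes between cgroups, e.g. on task migration.
func (u *cgroupUsage) move(from, to uint32, n uint64) {
	u.dec(from, n)
	u.inc(to, n)
}

// total returns the bytes currently accounted to cgroup cg.
func (u *cgroupUsage) total(cg uint32) uint64 { return u.bytes[cg] }

func main() {
	u := newCgroupUsage()
	u.inc(1, 8192)
	u.move(1, 2, 4096)
	fmt.Println(u.total(1), u.total(2))
}
```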

PiperOrigin-RevId: 549148091
2023-07-18 16:50:09 -07:00
Andrei Vagin 49c05d0f11 Enable lockdep for more mutexes
PiperOrigin-RevId: 533162054
2023-05-18 09:59:15 -07:00
Adin Scannell 1ceb814544 Add default_applicable_licenses rules to packages.
PiperOrigin-RevId: 513581243
2023-03-02 10:50:04 -08:00
Kevin Krakauer 370672e989 prohibit direct use of sync/atomic (u)int64 functions
All atomic 64 bit ints are changed to atomicbitops.(Ui|I)nt64. A nogo checker
enforces that sync/atomic 64 bit functions are not called.

For reviewers: the interesting changes are in the atomicbitops and checkaligned
packages.

Why do this?
- It is very easy to accidentally use atomic values without sync/atomic funcs.
- We have checkatomics, but this is optional and is forgotten in several places.
  - Using a type+checker to enforce this seems less error prone and simpler.
- We get NoCopy protection.
- Use of 64 bit atomics can break 32 bit builds. We have types to handle this
  without any runtime cost, so we might as well use them.
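
The type-based approach can be sketched as a wrapper whose value field is unexported, so code outside the defining package can only reach it through atomic methods. This is a simplified illustration and not the atomicbitops implementation; it does not model NoCopy protection or the 32-bit alignment handling mentioned above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Int64 wraps an int64 so that, from other packages, it can only be
// accessed through the atomic methods below.
type Int64 struct {
	value int64
}

// Load atomically reads the value.
func (i *Int64) Load() int64 { return atomic.LoadInt64(&i.value) }

// Store atomically writes v.
func (i *Int64) Store(v int64) { atomic.StoreInt64(&i.value, v) }

// Add atomically adds delta and returns the new value.
func (i *Int64) Add(delta int64) int64 { return atomic.AddInt64(&i.value, delta) }

// countConcurrently increments a shared counter from n goroutines.
func countConcurrently(n int) int64 {
	var counter Int64
	var wg sync.WaitGroup
	for j := 0; j < n; j++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Add(1)
		}()
	}
	wg.Wait()
	return counter.Load()
}

func main() {
	fmt.Println(countConcurrently(100))
}
```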

PiperOrigin-RevId: 440473398
2022-04-08 16:06:26 -07:00
Rahat Mahmood 5bb1f5086e cgroupfs: Implement hierarchical accounting for cpuacct controller.
PiperOrigin-RevId: 438193226
2022-03-29 20:02:09 -07:00
Fabricio Voznika 33b41d8fe9 Report total memory based on limit or host
gVisor was previously reporting the lower of the cgroup limit or 2GB as total
memory. This may cause applications to make bad decisions based on the amount
of memory available to them when more than 2GB is required.

This change reports the lower of the cgroup limit or the host total memory
inside the sandbox. This is also more in line with Docker, which always
reports host total memory. Note that reporting the cgroup limit is strictly
better than reporting host total memory when there is a limit set.
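
The selection rule can be sketched as follows, assuming a cgroup limit of zero means that no limit is set. The function name is hypothetical:

```go
package main

import "fmt"

// reportedTotalMemory returns the total memory to report inside the sandbox:
// the cgroup limit if one is set and is below the host total, otherwise the
// host total. Assumption: cgroupLimit == 0 means "no limit".
func reportedTotalMemory(cgroupLimit, hostTotal uint64) uint64 {
	if cgroupLimit != 0 && cgroupLimit < hostTotal {
		return cgroupLimit
	}
	return hostTotal
}

func main() {
	const gib = 1 << 30
	fmt.Println(reportedTotalMemory(4*gib, 64*gib) / gib) // limit set
	fmt.Println(reportedTotalMemory(0, 64*gib) / gib)     // no limit
}
```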

Fixes #5608

PiperOrigin-RevId: 403241608
2021-10-14 18:42:07 -07:00
Jamie Liu 7e0c1d9f1e Use memutil.MapFile for the memory accounting page.
PiperOrigin-RevId: 381145216
2021-06-23 17:03:58 -07:00
Ayush Ranjan a9441aea27 [op] Replace syscall package usage with golang.org/x/sys/unix in pkg/.
The syscall package has been deprecated in favor of golang.org/x/sys.

Note that syscall is still used in the following places:
- pkg/sentry/socket/hostinet/stack.go: some netlink-related functionality
  is not yet available in golang.org/x/sys.
- syscall.Stat_t is still used in some places because os.FileInfo.Sys() still
  returns it and not unix.Stat_t.

Updates #214

PiperOrigin-RevId: 360701387
2021-03-03 10:25:58 -08:00
Ayush Ranjan c206fcbfc2 pgalloc: Do not hold MemoryFile.mu while calling mincore.
This change does the following:
- Unlocks MemoryFile.mu while calling mincore (checkCommitted) because mincore
  can take a really long time. Accordingly, looks up the segment in the tree
  again afterwards and handles changes to the segment.
- MemoryFile.UpdateUsage() can now only be called at a frequency of at most
  100 Hz. 100 Hz = linux.CLOCKS_PER_SEC.

Co-authored-by: Jamie Liu <jamieliu@google.com>
PiperOrigin-RevId: 337865250
2020-10-19 09:02:19 -07:00
Nicolas Lacasse 591ff0e424 Add maximum memory limit.
PiperOrigin-RevId: 310179277
2020-05-06 10:30:18 -07:00
Adin Scannell c9a18b16ad Document MinimumTotalMemoryBytes.
PiperOrigin-RevId: 294273559
2020-02-10 12:08:32 -08:00
Adin Scannell d29e59af9f Standardize on tools directory.
PiperOrigin-RevId: 291745021
2020-01-27 12:21:00 -08:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
Kevin Krakauer 2a82d5ad68 Reorder BUILD license and load functions in gvisor.
PiperOrigin-RevId: 275139066
2019-10-16 16:40:30 -07:00
Jamie Liu 0352cf5866 Remove support for non-incremental mapped accounting.
PiperOrigin-RevId: 266496644
2019-08-30 19:06:55 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Jamie Liu 48961d27a8 Move //pkg/sentry/memutil to //pkg/memutil.
PiperOrigin-RevId: 252124156
2019-06-07 14:52:27 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00