190 Commits

Author SHA1 Message Date
Ayush Ranjan f06d4e7ebe goferfs: Add S/R support for open FDs to deleted files.
This support is only needed when the gofer mount in question is writable.
By default, the rootfs has an overlayfs applied, so the gofer lower layer is
not writable. But if you are using --overlay2=none, then this change should
allow you to save a sandbox with open FDs to deleted files in rootfs.
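
The host-level behavior that has to survive save/restore can be seen in a few lines: a file descriptor stays usable after the file's last name is unlinked. This is only an illustrative sketch of plain POSIX semantics, not gVisor's gofer code; `WriteReadAfterUnlink` and the temp-file template are hypothetical names.

```cpp
#include <cassert>
#include <stdlib.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Creates a temp file, unlinks it, then writes and reads back through the
// still-open FD. Returns the bytes read back, or "" on any failure.
std::string WriteReadAfterUnlink(const std::string& payload) {
  char path[] = "/tmp/deleted_fd_XXXXXX";  // hypothetical template
  const int fd = mkstemp(path);
  if (fd < 0) return "";
  // Remove the only name; the inode lives on while the FD stays open.
  if (unlink(path) != 0) {
    close(fd);
    return "";
  }
  assert(access(path, F_OK) != 0);  // the name is gone
  std::string out(payload.size(), '\0');
  const ssize_t want = static_cast<ssize_t>(payload.size());
  const bool ok = write(fd, payload.data(), payload.size()) == want &&
                  lseek(fd, 0, SEEK_SET) == 0 &&
                  read(fd, &out[0], out.size()) == want;
  close(fd);
  return ok ? out : "";
}
```

A checkpoint taken between the `unlink` and the `close` has no path to reopen on restore, which is the case this commit covers for writable gofer mounts.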

Updates #11425

PiperOrigin-RevId: 733021267
2025-03-03 12:38:10 -08:00
Andrei Vagin f010ae01ac Fix a few typos 2025-01-29 21:16:51 -08:00
Jamie Liu cf5841ba66 mm: implement prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME)
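
For context, this is roughly how an application uses the call. The sketch assumes Linux 5.17 or later built with CONFIG_ANON_VMA_NAME on the host side; `MapNamedAnon` is a hypothetical helper, and the fallback constants are only there for older userspace headers.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/prctl.h>

// Fallbacks for older headers; the values match <linux/prctl.h>.
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

// Maps `len` bytes of private anonymous memory and tries to name the mapping.
// Returns the mapping, or nullptr if mmap fails. *named reports whether the
// kernel accepted the name (it may refuse with EINVAL if unsupported).
void* MapNamedAnon(std::size_t len, const char* name, bool* named) {
  assert(len > 0 && name != nullptr && named != nullptr);
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  *named = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                 reinterpret_cast<unsigned long>(p),
                 static_cast<unsigned long>(len),
                 reinterpret_cast<unsigned long>(name)) == 0;
  return p;
}
```

When the call succeeds, the region is listed as `[anon:<name>]` in `/proc/<pid>/maps`.
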
PiperOrigin-RevId: 696727156
2024-11-14 19:06:07 -08:00
Jamie Liu a5573312e0 Add explicit huge page and memory recycling support to pgalloc.MemoryFile.
This CL addresses the following major issues:

- When an application releases memory to the sentry, the sentry unconditionally
  releases that memory to the host, rather than allowing it to be reused for
  future allocations, in order to ensure that new allocations are uniformly
  decommitted (use no memory): cl/145016083. In most cases, this should have
  relatively little performance impact; since releasing memory from the
  application to the OS is expensive even outside of gVisor, application memory
  allocators optimizing for performance already limit the rate at which they
  release memory to the OS. However, in applications that involve frequent
  process creation and exit (e.g. build systems), this practice prevents reuse
  of memory deallocated by exiting processes for memory allocated by new
  processes, resulting in both performance degradation and a spike in memory
  usage (since the sentry may not have released all deallocated memory to the
  host by the time new allocations occur).

- gVisor's historical approach to application THP is based on THP being enabled
  on a per-memfd basis, using an MFD_HUGEPAGE flag that was never merged into
  the upstream Linux kernel
  (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/).
  Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory
  without requiring the system to enable THP for all tmpfs files and memfds (by
  setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or
  "force").

- Both MM and the application page allocator (pgalloc) are agnostic as to
  whether the underlying memory file will be THP-backed. Instead, both attempt
  to align hugepage-sized and larger allocations to hugepage boundaries, such
  that if the memory file happens to support THP then such allocations will be
  appropriately aligned to use THP. This is suboptimal since many allocations
  do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512, and page reuse
avoids incurring it at all.
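
The factor of 512 follows from the x86-64 page sizes, assuming 4 KiB base pages and 2 MiB huge pages (the benchmarks in this message are amd64). The sketch below spells out the arithmetic; the names are hypothetical, and it assumes the best case of exactly one fault per page.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSmallPageSize = 4096;            // 4 KiB base page
constexpr std::size_t kHugePageSize = 2 * 1024 * 1024;  // 2 MiB huge page

static_assert(kHugePageSize / kSmallPageSize == 512,
              "one huge page covers 512 base pages");

// Number of page faults needed to populate `bytes` of memory when each
// fault maps exactly one page of `page_size` bytes.
std::size_t FaultsToPopulate(std::size_t bytes, std::size_t page_size) {
  assert(page_size > 0);
  return (bytes + page_size - 1) / page_size;
}
```

Under these assumptions, populating 1 GiB takes 262144 faults with base pages and 512 with huge pages.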

Thus:

- Instead of inferring whether THP use is desired from allocation size,
  indicate this explicitly as AllocOpts.Huge, and only set it to true for
  allocations for non-stack private anonymous mappings.

- Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode
  that indicates that the caller will commit all pages in the allocation. In
  such cases, pgalloc can reuse deallocated pages without risking increased
  memory usage, internally referred to as "recycling".
  AllocateCallerIndirectCommit is used primarily for page faults on a
  THP-backed region. (It is also used for single-page allocations on non-THP
  backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned
  ranges, this is relatively uncommon.)

- Allow different chunks in pgalloc.MemoryFile's backing file to have varying
  THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.

- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
  deallocated pages for small/huge-page-backed regions respectively; two sets
  tracking in-use pages for small/huge-page-backed regions respectively; and a
  fifth set tracking memory accounting state.

- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
  pgalloc tests, but may also be applicable to disk-backed MemoryFiles.

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
  described in updateUsageLocked(), was based on the condition that
  MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
  which was invalidated by cl/337865250.

- Remove MemoryFileOpts.ManualZeroing, which is unused.

- Rename "reclaiming" to "releasing"; the former is confusing since "reclaim"
  in Linux has a significantly different meaning (essentially "eviction" in
  pgalloc), and the latter seems to be conventional in user-mode memory
  allocators.

Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.
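
A quick way to check the host setting is to read the sysfs file and pick out the bracketed value, since the kernel marks the active mode that way (for example `always within_size advise [never] deny force`). This is a minimal sketch with hypothetical helper names, not code from runsc.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns the bracketed (active) token from sysfs THP-style content, or ""
// if no bracketed token is present.
std::string ActiveMode(const std::string& contents) {
  const std::string::size_type open = contents.find('[');
  if (open == std::string::npos) return "";
  const std::string::size_type close = contents.find(']', open + 1);
  if (close == std::string::npos) return "";
  return contents.substr(open + 1, close - open - 1);
}

// Reads the active shmem THP mode; returns "" if the file is missing or
// unreadable (e.g. THP is not compiled into the kernel).
std::string ReadShmemEnabled(
    const std::string& path =
        "/sys/kernel/mm/transparent_hugepage/shmem_enabled") {
  std::ifstream f(path);
  std::string line;
  if (!f || !std::getline(f, line)) return "";
  return ActiveMode(line);
}
```

If `ReadShmemEnabled()` returns `"never"` or `"deny"`, the per-chunk THP requests described above cannot take effect on that host.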

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below,
"baseline" is without this CL, "expt" is with this CL, and "expt2" is with this
CL + release throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP
without forcing it on all host shmem users. In conjunction with recycling
(which has a relatively small effect on systrap since it does not use hardware
virtualization), THP use slightly improves performance, although whether this
is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │               expt                │               expt2               │
                                                │   sec/op   │   sec/op    vs base               │   sec/op    vs base               │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%       ~ (p=0.947 n=30)   38.84 ± 3%       ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%       ~ (p=0.057 n=30)   36.83 ± 5%       ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%       ~ (p=0.350 n=30)   38.58 ± 4%       ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%  -4.64% (p=0.026 n=30)   36.58 ± 4%       ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%       ~ (p=0.485 n=30)   40.09 ± 5%       ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%       ~ (p=0.307 n=30)   38.08 ± 1%       ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%       ~ (p=0.752 n=30)   38.59 ± 3%       ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%       ~ (p=0.708 n=30)   36.08 ± 6%       ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%       ~ (p=0.138 n=30)   211.2 ± 1%       ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%       ~ (p=0.542 n=30)   209.7 ± 1%       ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%       ~ (p=0.423 n=30)   210.0 ± 1%       ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%       ~ (p=0.219 n=30)   209.5 ± 1%       ~ (p=0.230 n=30)
geomean                                           67.62        66.97       -0.96%                  67.12       -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to
huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```
PiperOrigin-RevId: 647771821
2024-06-28 12:56:46 -07:00
Jamie Liu 2069e8643b test: add PageTableLeak test
PiperOrigin-RevId: 643437188
2024-06-14 13:27:13 -07:00
Ayush Ranjan 4bdb7e9fa0 Allow kernel versions with a + suffix in ParseKernelVersion()
This is needed to support kernel versions like 6.1.90+.
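
One way to tolerate such suffixes is to parse only the leading `major.minor.micro` triple and ignore whatever follows. The sketch below is an assumption about the general approach, not the actual ParseKernelVersion() code; `KernelVersion` and `ParseRelease` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

struct KernelVersion {
  int major_version;
  int minor_version;
  int micro_version;
};

// Parses the leading "major.minor.micro" of a release string and ignores any
// trailing suffix such as "+" or "-91-generic". Returns false if the string
// does not start with three dot-separated non-negative integers.
bool ParseRelease(const std::string& release, KernelVersion* out) {
  assert(out != nullptr);
  KernelVersion v = {0, 0, 0};
  if (std::sscanf(release.c_str(), "%d.%d.%d", &v.major_version,
                  &v.minor_version, &v.micro_version) != 3) {
    return false;
  }
  if (v.major_version < 0 || v.minor_version < 0 || v.micro_version < 0) {
    return false;
  }
  *out = v;
  return true;
}
```

With this approach `"6.1.90+"` and `"6.1.90"` parse to the same triple, while a two-component string such as `"6.1"` is rejected.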

PiperOrigin-RevId: 640936154
2024-06-06 10:10:40 -07:00
gVisor bot edbc2af9f7 Override operator new and delete in tests
This is necessary to ensure errno is not updated while allocating.

Allocators are allowed to update errno, even in case of success. As gVisor
uses matchers to check the value of errno, the tests might fail if errno is
overridden by an allocation done while building the matcher. Using a custom
implementation of new and delete ensures this is not the case.
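
A minimal sketch of such an override, assuming a malloc-backed implementation that saves and restores errno around each call. The actual replacement used by the tests may differ, and this version does not cover the aligned overloads.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <new>

// Replacement global operator new: allocate with malloc, but leave errno
// exactly as the caller had it.
void* operator new(std::size_t size) {
  const int saved_errno = errno;
  void* p = std::malloc(size == 0 ? 1 : size);
  errno = saved_errno;
  if (p == nullptr) throw std::bad_alloc();
  return p;
}

// Replacement global operator delete with the same errno guarantee.
void operator delete(void* p) noexcept {
  const int saved_errno = errno;
  std::free(p);
  errno = saved_errno;
}

// Sized delete forwards to the unsized version.
void operator delete(void* p, std::size_t) noexcept { operator delete(p); }
```

The default array and nothrow forms forward to these replacements, so `new[]` and `delete[]` get the same behavior without separate definitions.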

PiperOrigin-RevId: 619238390
2024-03-26 10:39:52 -07:00
gVisor bot e1ffb14778 Restore errno around allocation in test matchers
Allocations are not guaranteed to preserve errno, even in case of success.
Because the test matchers test against errno, preserve errno when allocating
new matchers.
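
The usual way to do this is a small RAII guard that records errno on entry and writes it back on scope exit. The sketch below is illustrative; `ErrnoSaver` and `MakePreservingErrno` are hypothetical names, not the helpers used in the test matchers.

```cpp
#include <cassert>
#include <cerrno>
#include <memory>
#include <utility>

// Records errno at construction and restores it at destruction.
class ErrnoSaver {
 public:
  ErrnoSaver() : saved_(errno) {}
  ~ErrnoSaver() { errno = saved_; }
  ErrnoSaver(const ErrnoSaver&) = delete;
  ErrnoSaver& operator=(const ErrnoSaver&) = delete;

 private:
  const int saved_;
};

// Heap-allocates a T while guaranteeing that errno is unchanged afterwards,
// even if the allocator or T's constructor modified it.
template <typename T, typename... Args>
std::unique_ptr<T> MakePreservingErrno(Args&&... args) {
  ErrnoSaver saver;
  return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
```

The guard's destructor runs after the return value has been constructed, so the allocation is complete before errno is restored.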

PiperOrigin-RevId: 618437007
2024-03-23 05:58:23 -07:00
Jamie Liu 075a9df798 Fix async-signal-unsafety in mount test.
- JoinPath returns a std::string and can therefore heap-allocate.

- exit(3) is async-signal-unsafe since it executes arbitrary functions
  registered by atexit(3) / on_exit(3).

- Before this CL, TempPath::path() returns a std::string (by value) and can
  therefore heap-allocate.
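
The pattern behind these fixes can be sketched as follows: build paths in a caller-provided buffer and leave the child with `_exit(2)`. `JoinPathNoAlloc` and `ChildExit` are hypothetical helpers written for illustration, not the ones in the mount test. The sketch relies on `strlen` and `memcpy` being on the async-signal-safe list in recent POSIX editions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Joins `dir` and `name` with a single '/' into `out` without touching the
// heap. Returns false if the result (plus NUL) does not fit in `out_size`.
bool JoinPathNoAlloc(const char* dir, const char* name, char* out,
                     std::size_t out_size) {
  const std::size_t dlen = std::strlen(dir);
  const std::size_t nlen = std::strlen(name);
  const std::size_t sep = (dlen > 0 && dir[dlen - 1] != '/') ? 1 : 0;
  const std::size_t total = dlen + sep + nlen;
  if (total + 1 > out_size) return false;
  std::memcpy(out, dir, dlen);
  if (sep) out[dlen] = '/';
  std::memcpy(out + dlen + sep, name, nlen);
  out[total] = '\0';
  return true;
}

// Terminates a forked child immediately, without running handlers
// registered via atexit(3) or on_exit(3).
[[noreturn]] void ChildExit(int status) { _exit(status); }
```

In a forked child of a multithreaded test, the buffer would typically be a stack array sized with `PATH_MAX`.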

PiperOrigin-RevId: 615143819
2024-03-12 13:05:16 -07:00
gVisor bot c6b06ab1a5 Internal change.
PiperOrigin-RevId: 612510308
2024-03-04 11:04:01 -08:00
Jamie Liu a88e82fa4a Deflake cpuacct cgroup test.
- Cgroup::PollControlFileForChange() is only used by cpuacct tests to check
  that CPU usage for the test's containing cgroups increases over time. In this
  context, sleeping between checks is counterproductive because the test uses
  no CPU while sleeping. Remove the sleep.

- Don't assume that the root cgroup's usage is initially non-zero, due to
  granularity issues.

PiperOrigin-RevId: 611594418
2024-02-29 14:24:07 -08:00
gVisor bot 53d2b511e7 Change remaining test targets to use select_gtest() to choose the gtest target
Should be a no-op as select_gtest() returns the same target as before.

PiperOrigin-RevId: 607788473
2024-02-16 13:45:58 -08:00
Lucas Manning 8053cd8f0b Add mount locking.
Mounts that come from a more privileged namespace must be locked so that they
cannot be unmounted from a less privileged namespace. This is an important
consequence of having mount namespaces + mount propagation. See
https://man7.org/linux/man-pages/man7/mount_namespaces.7.html for full
detail.

PiperOrigin-RevId: 597322843
2024-01-10 12:25:38 -08:00
Bruno Dal Bo 5c41ffabdb Fix select call in socket utils
From the `select` manpage:
```
      nfds   This argument should be set to the highest-numbered file
              descriptor in any of the three sets, plus 1...
```
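
A minimal sketch of a helper that follows this rule for a single descriptor. `WaitReadable` is a hypothetical name and not the socket utility changed here; it treats a negative timeout as "wait indefinitely".

```cpp
#include <cassert>
#include <sys/select.h>
#include <unistd.h>

// Returns 1 if `fd` becomes readable within `timeout_ms`, 0 on timeout, and
// -1 on error. A negative `timeout_ms` blocks until the FD is readable.
int WaitReadable(int fd, int timeout_ms) {
  assert(fd >= 0 && fd < FD_SETSIZE);
  fd_set rfds;
  FD_ZERO(&rfds);
  FD_SET(fd, &rfds);
  timeval tv;
  tv.tv_sec = timeout_ms / 1000;
  tv.tv_usec = (timeout_ms % 1000) * 1000;
  // nfds is the highest-numbered descriptor plus 1, not the number of FDs.
  const int n = select(fd + 1, &rfds, nullptr, nullptr,
                       timeout_ms < 0 ? nullptr : &tv);
  if (n < 0) return -1;
  return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

Passing `1` as nfds (the count of descriptors) would only work when `fd` happens to be 0, which is the kind of bug the quoted manpage text warns about.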

PiperOrigin-RevId: 595697728
2024-01-04 07:24:31 -08:00
Bruno Dal Bo 9ea77eb8bc Allow infinite timeout in helpers
PiperOrigin-RevId: 595462226
2024-01-03 11:52:03 -08:00
Andrei Vagin d039776e16 test: fix SetupTimeWaitClose to create a time wait bucket reliably
Linux can close a tcp connection without a time wait bucket if fin packets come
from both sides close to each other.

Let's send some data after the first fin packet to be sure that it is acked
before sending fin from another side.

PiperOrigin-RevId: 575002422
2023-10-19 14:29:06 -07:00
Ayush Ranjan 6a25f2ebb2 Remove duplicate RandomizeBuffer implementation from socket util.
Instead use the one from test_util, which uses a more random seed for rand_r().

Suggested-by: Jamie Liu <jamieliu@google.com>
Suggested-by: Andrei Vagin <avagin@google.com>
PiperOrigin-RevId: 574243850
2023-10-17 13:15:22 -07:00
gVisor bot ae1294b435 Internal change.
PiperOrigin-RevId: 567370524
2023-09-21 11:42:16 -07:00
Lucas Manning c2a7efe6a2 Clean up mount tests.
This change makes it simpler to write mount tests that examine the optional
field of mountinfo. It also cleans up the mounts in existing tests so they
don't pollute mountinfo after exit.

PiperOrigin-RevId: 561439839
2023-08-30 13:56:03 -07:00
Jeff Martin 8de4ec70bd Internal change.
PiperOrigin-RevId: 556842126
2023-08-14 10:45:30 -07:00
Nicolas Lacasse e7bd1b4c9c Implement PR_{S,G}ET_CHILD_SUBREAPER.
Closes #2323
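
From the application side, the two options are used as below. This is a small sketch with hypothetical wrapper names; the prctl(2) options themselves are standard Linux (3.4 and later).

```cpp
#include <cassert>
#include <sys/prctl.h>

// Marks (or unmarks) the calling process as a child subreaper, so that
// orphaned descendants are reparented to it instead of to init.
bool SetChildSubreaper(bool enable) {
  return prctl(PR_SET_CHILD_SUBREAPER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == 0;
}

// Returns 1 if the calling process is a child subreaper, 0 if it is not,
// and -1 if the query fails.
int GetChildSubreaper() {
  int value = 0;
  if (prctl(PR_GET_CHILD_SUBREAPER, reinterpret_cast<unsigned long>(&value),
            0UL, 0UL, 0UL) != 0) {
    return -1;
  }
  return value;
}
```

Process supervisors typically set this once at startup and then `wait` for reparented descendants.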

PiperOrigin-RevId: 548205854
2023-07-14 13:19:25 -07:00
Nick Brown a435ed7c09 Set kOLargeFile for RISC-V architecture
This is required to get these tests building on a Fuchsia + RISC-V
architecture. The constant's value was extracted from:
https://github.com/riscvarchive/riscv-musl/blob/3fe7e2c75df78eef42dcdc352a55757729f451e2/arch/riscv64/bits/fcntl.h#L16

PiperOrigin-RevId: 540340091
2023-06-14 12:03:54 -07:00
Alex Konradi 9a2f1d041a Add a test for receiving UDP packet with src port 0
Add a test that validates that when a UDP packet is received with a source port
value of 0, it is delivered to a listening socket.

PiperOrigin-RevId: 536773375
2023-05-31 11:31:05 -07:00
gVisor bot d860782ada Merge pull request #8679 from avagin:cpp-warnings
PiperOrigin-RevId: 523157922
2023-04-10 10:59:51 -07:00
Andrei Vagin a699bc8a39 test: use the init_module syscall to trigger save/restore
create_module isn't defined on arm64

PiperOrigin-RevId: 521866613
2023-04-04 14:32:09 -07:00