52 Commits

Author SHA1 Message Date
Koichi Shiraishi 0cf77c02f8 all: remove use of deprecated io/ioutil package & fix some deprecated things
Signed-off-by: Koichi Shiraishi <zchee.io@gmail.com>
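
For illustration, a minimal sketch of the kind of migration such a change involves, assuming typical call sites (the actual diff is not shown here). Since Go 1.16 the `io/ioutil` helpers are deprecated in favor of equivalents in `os` and `io`. The function and file names below are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip writes data to a file in a fresh temporary directory and reads
// it back, using the os/io replacements for the deprecated io/ioutil helpers.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "example") // was ioutil.TempDir
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "config.txt") // hypothetical file name
	if err := os.WriteFile(path, []byte(data), 0o600); err != nil { // was ioutil.WriteFile
		return "", err
	}
	b, err := os.ReadFile(path) // was ioutil.ReadFile
	if err != nil {
		return "", err
	}
	all, err := io.ReadAll(strings.NewReader(string(b))) // was ioutil.ReadAll
	return string(all), err
}

func main() {
	out, err := roundTrip("hello")
	fmt.Println(out, err)
}
```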
2024-10-10 20:36:24 +09:00
gVisor bot 3c4b246cf2 Fix printf violations inside of the gvisor code
Recently printf.Analyzer has become stricter
(https://github.com/golang/go/issues/60529)
which led to new findings.
gvisor nogo tests run this analyzer and fail if it produces findings.
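
As a hedged illustration, assuming the new findings are of the "non-constant format string with no arguments" kind discussed in the linked issue, a fix might look like the sketch below. The helper names are hypothetical and not taken from the gVisor code.

```go
package main

import (
	"errors"
	"fmt"
)

// wrap builds an error from a message that is not a compile-time constant.
func wrap(msg string) error {
	// Before (flagged by the analyzer): return fmt.Errorf(msg)
	return errors.New(msg)
}

// describe formats a message without interpreting it as a format string.
func describe(msg string) string {
	// Before (flagged by the analyzer): return fmt.Sprintf("warning: " + msg)
	return fmt.Sprintf("warning: %s", msg)
}

func main() {
	// A '%' in the message is preserved literally instead of being parsed
	// as a formatting verb.
	fmt.Println(describe("disk at 100%d"))
	fmt.Println(wrap("read failed at 50%s"))
}
```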

PiperOrigin-RevId: 671657227
2024-09-06 00:45:23 -07:00
Etienne Perot b4ca91450f Standardize timestamps in runsc log filenames.
Prior to this change, the log files each have their own timestamp computed
independently. For example, this means that the coverage log file, the panic
log file, the debug log file, the first Gofer's log file, and the profile
files for the same Sentry may all have different timestamps in their
filenames. Now they are the same.

This change introduces a central `runsc/starttime` package for which the sole
purpose is to hold the start time of the `runsc` process, for easy plumbing
in all places that need it.
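
A minimal sketch of the idea, assuming a package-level value that is fixed on first use. The names `Get` and `logName` and the filename layout are hypothetical, not the actual `runsc/starttime` API, and a real implementation would also need to hand the value to child processes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	once      sync.Once
	startTime time.Time
)

// Get returns the process start time. The value is captured on the first
// call and never changes afterwards.
func Get() time.Time {
	once.Do(func() { startTime = time.Now() })
	return startTime
}

// logName builds a log filename that embeds the shared timestamp.
func logName(kind string) string {
	return fmt.Sprintf("runsc.log.%s.%s", Get().Format("20060102-150405.000000"), kind)
}

func main() {
	// Every log file derived from Get() carries the same timestamp.
	fmt.Println(logName("boot"))
	fmt.Println(logName("gofer"))
}
```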

PiperOrigin-RevId: 646667986
2024-06-25 17:48:13 -07:00
Etienne Perot c8da73daaf Add option to dump profiling metrics within a container's stdout logs.
This is part of a series of changes to add metric charts in performance
benchmarks.

This change is useful to be able to extract profiling metric data easily
regardless of runtime configuration.

This change also changes where profiling metrics are initialized and
configured. They are now only part of `runsc boot`, rather than all `runsc`
invocations.

PiperOrigin-RevId: 629900287
2024-05-01 18:33:26 -07:00
Etienne Perot 9425d102e5 Make runsc log header more helpful.
Changes:

- Header is more compact (in non-debug mode).
- Added Go runtime version.
- Added number of CPU cores.
- In debug mode, log page size.
- Removed some non-important pieces of configuration to info-level logs.
- Made debug mode enable the entire configuration to be logged.
- Move initialization of some non-logging-related stuff be after log
  initialization. (Some of them *use* the log, so it makes no sense to do
  that before logging is actually initialized.)

PiperOrigin-RevId: 595309362
2024-01-02 23:52:26 -08:00
Etienne Perot 127262d21a Add annotation to send runsc debug logs to user logs.
On Kubernetes, these are logged at the pod level.

This makes it convenient to debug gVisor on Kubernetes clusters where
SSH access to nodes is difficult or prohibited by policy. With this and
other pod annotations, it is possible to do strace debugging with only
pod annotations and no SSH.

PiperOrigin-RevId: 595197543
2024-01-02 13:37:07 -08:00
Tiwei Bie da2b10e207 runsc: decouple GoferMountConf into two layers
This patch decouples GoferMountConf into two layers to allow us to
configure all combinations of a gofer mount in a succinct way:

- Upper layer config: none, memory, self, anon. The upper layer
  is always tmpfs. It describes the backend for tmpfs.
- Lower layer config: none, lisafs. It describes the backend for
  the filesystem which actually holds the image contents.

The old SelfTmpfs will be represented as "upper=self,lower=none",
MemoryOverlay will be "upper=memory,lower=lisafs", SelfOverlay
will be "upper=self,lower=lisafs", and so on. Thanks to @ayushr2
for the suggestion on how to better decouple this.
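
The two-layer representation can be sketched as follows. This is an illustrative model only: the type `MountConf`, its fields, and `Parse` are hypothetical names, while the allowed values and the `upper=...,lower=...` spelling come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// MountConf is a hypothetical two-layer mount configuration.
type MountConf struct {
	Upper string // backend for the tmpfs upper layer
	Lower string // backend for the filesystem holding the image contents
}

var (
	validUpper = map[string]bool{"none": true, "memory": true, "self": true, "anon": true}
	validLower = map[string]bool{"none": true, "lisafs": true}
)

// String renders the configuration as "upper=<u>,lower=<l>".
func (c MountConf) String() string {
	return "upper=" + c.Upper + ",lower=" + c.Lower
}

// Parse is the inverse of String and rejects unknown layer values.
func Parse(s string) (MountConf, error) {
	up, low, ok := strings.Cut(s, ",")
	if !ok {
		return MountConf{}, fmt.Errorf("invalid mount conf %q", s)
	}
	upper, okU := strings.CutPrefix(up, "upper=")
	lower, okL := strings.CutPrefix(low, "lower=")
	if !okU || !okL || !validUpper[upper] || !validLower[lower] {
		return MountConf{}, fmt.Errorf("invalid mount conf %q", s)
	}
	return MountConf{Upper: upper, Lower: lower}, nil
}

func main() {
	// The old SelfOverlay corresponds to this combination.
	c, err := Parse("upper=self,lower=lisafs")
	fmt.Println(c, err)
}
```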

This is a preparation for adding the EROFS rootfs support. There
is no functional change intended.

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2023-11-05 09:02:22 +08:00
Josh Seba 5f52ed57e5 Add subcommand to list nvproxy supported driver versions
Since the driver versions are hard-coded, determining the supported
version list requires code inspection, which is difficult to automate.
Add a sub-command (and category) to print the list of nvidia
driver versions supported by nvproxy for a given build of runsc.

Signed-off-by: Josh Seba <jseba@cloudflare.com>
2023-10-30 15:19:06 -07:00
Konstantin Bogomolov 440b37a5c1 Add profiling metric flags to output metric data to local TSV file.
The idea behind conditionally compiled metrics originally was to use them in
hotpaths for profiling purposes. This CL makes that possible by outputting
declared metrics in TSV format, which can be used to track custom events at
runtime in relatively high resolution.

Usage:
    1. Optionally enable compilation of runsc conditionally-compiled metrics
       by passing in condmetric_profiling to the Go tags.
    2. Add these flags to runsc:
    - [Required] --profiling-metrics-log=/tmp/some.csv
    - [Optional] --profiling-metrics=/task/syscalls,/task/faults
      - If this flag is not specified it will monitor all
        conditionally-compiled metrics by default.
    - [Optional] --profiling-metrics-rate-us=10000
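
To make the output format concrete, here is a minimal sketch of rendering accumulated metric values as TSV rows. The column layout (a time column followed by one column per metric) and the names `sample` and `renderTSV` are assumptions for illustration, not the actual runsc output format.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is one hypothetical snapshot of metric values at a point in time.
type sample struct {
	timeUS uint64   // microseconds since monitoring started
	values []uint64 // accumulated values, in the same order as the metric names
}

// renderTSV formats a header row followed by one tab-separated row per sample.
func renderTSV(names []string, samples []sample) string {
	var b strings.Builder
	b.WriteString("Time (us)\t" + strings.Join(names, "\t") + "\n")
	for _, s := range samples {
		fmt.Fprintf(&b, "%d", s.timeUS)
		for _, v := range s.values {
			fmt.Fprintf(&b, "\t%d", v)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	names := []string{"/task/syscalls", "/task/faults"}
	fmt.Print(renderTSV(names, []sample{
		{timeUS: 0, values: []uint64{0, 0}},
		{timeUS: 10000, values: []uint64{42, 3}},
	}))
}
```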

Some future improvements:
    - Flag to output a metric-difference between timestamps instead of
      constant accumulation.
    - Output a gnuplot command along with the data.
    - Current monitoring resolution is limited by what time.Sleep allows.
      This can be overcome by spinning/yielding when lower monitoring
      rates are requested.

PiperOrigin-RevId: 560849611
2023-08-28 16:28:09 -07:00
Ayush Ranjan 7981df85f3 Make all custom flag.Value implementations idempotent.
Set(String()) should be an idempotent operation. This is a useful property
which allows us to generate args while re-execing the same process. Setting
`--flag-name=val.String()` should work.

PiperOrigin-RevId: 552598313
2023-07-31 14:53:21 -07:00
Etienne Perot f815fa9079 runsc: Only register some base flags if they are not already defined.
PiperOrigin-RevId: 538323866
2023-06-06 16:38:18 -07:00
Ayush Ranjan a1006d486d Unexport fields of config.Overlay2.
Users/callers of config.Overlay2 relied on its internal functioning/layout.
The stable contract is at the --overlay2 flag level. So changed callers to use
that contract instead via Overlay2.Set().

Unexported fields to encapsulate the type correctly.

PiperOrigin-RevId: 530818635
2023-05-09 23:42:02 -07:00
Etienne Perot 8f991198b4 Move rlimit-setting code from runsc main to run when starting a sandbox.
This allows `runsc` subcommands that don't start sandboxes to run in
restricted contexts without printing a warning about not being able to set
`RLIMIT_MEMLOCK` when the ability to do so doesn't matter.

In particular, this helps with `runsc metric-server`, which can be locked down
to run with very few capabilities.

A previous version of this change had moved this to the beginning of the
`runsc boot` subcommand code. However, this doesn't work, because `runsc boot`
runs as an unprivileged user (`nobody`) and does not have `CAP_SYS_RESOURCE`.

Prior to that change, all `runsc` invocations tried to call `setrlimit`, so
what happened in practice is that `runsc create` (running as `root`) would
call `setrlimit`, and then `runsc boot` would inherit the `RLIM_INFINITY` and
would therefore never actually call `setrlimit` by itself. When moving the
`setrlimit` code to only run within `runsc boot`, suddenly the `runsc boot`
invocation found itself in a context where it started trying to call
`setrlimit`, which would silently fail.

This approach has the downside of having the side-effect of needlessly setting
`RLIM_INFINITY` on the calling `runsc` process. This was effectively what was
already happening prior to moving this code into `runsc boot` anyway, so this
should be OK. The alternative would be to add yet another intermediate
subcommand before `runsc boot` which runs with `CAP_SYS_RESOURCE`, then calls
`setrlimit`, then drops `CAP_SYS_RESOURCE`, then execs `runsc boot`, but that
seems like adding a lot more extra complexity to the boot process than is
warranted for this feature.

Thanks to Ayush Ranjan for bisecting the performance regression down to this
change.

Ran benchmarks and performance is comparable to before moving `setrlimit` code
within `runsc boot`.

PiperOrigin-RevId: 521916084
2023-04-04 18:08:14 -07:00
Etienne Perot 8820b3bb90 Automated rollback of changelist 508491474
PiperOrigin-RevId: 521829631
2023-04-04 12:10:26 -07:00
Zach Koopmans f92957314c Add portforward command to runsc
Add portforward command so that we can use runsc to forward connections
to container ports. This will eventually be supported in k8s.

PiperOrigin-RevId: 520739913
2023-03-30 14:16:19 -07:00
Etienne Perot d0326a67da runsc: Refactor how the version string is propagated in runsc.
This is helpful so that it can be imported from other packages without import
loops.

This will be used in a follow-up change to add the version string as a
per-sandbox metric metadata label.

PiperOrigin-RevId: 519002695
2023-03-23 17:12:09 -07:00
Ayush Ranjan f9638850c6 Add support for directfs in runsc.
directfs is a feature much like host networking. The gofer security boundary is
dropped for additional performance. The sentry is allowed to directly access the
host filesystem to access or mutate files in order to service application
syscalls. This mode bypasses the gofer completely.

With directfs, the sentry is run in the root user namespace. (Similar to how
hostinet works.) The sentry also runs with additional capabilities that allow
it to bypass file permissions checks, change file owner, chroot, etc. Seccomp
filters are loosened around the sentry to allow it to make filesystem syscalls
directly to the host.

The sandbox still runs with an empty pivot root. The gofer donates FDs for the
root mount and all bind mounts so that the sentry can perform filesystem
operations using them.

There is a dummy gofer process in directfs mode. The gofer sets up all bind
mounts and donates host FDs of the mount point to the sandbox. The gofer dies
with the sandbox. All gofer mount options are supported with directfs. directfs
mounts are set up similarly to how gofer mounts are set up.

PiperOrigin-RevId: 513634936
2023-03-02 14:10:58 -08:00
Adin Scannell 1ceb814544 Add default_applicable_licenses rules to packages.
PiperOrigin-RevId: 513581243
2023-03-02 10:50:04 -08:00
Konstantin Bogomolov 1832c38a95 Add TESTONLY sentry panic trigger through afs_syscall.
This flag is added for tests that need to trigger a panic in the sentry
kernel. Only done for x86_64 which does have a dedicated syscall number for
afs_syscall; ARM does not.

PiperOrigin-RevId: 512731631
2023-02-27 14:28:30 -08:00
Etienne Perot 81d6f80caf Add runsc metric-metadata subcommand.
Also group metric-related commands together in their own command group.

PiperOrigin-RevId: 511240778
2023-02-21 10:34:37 -08:00
Etienne Perot 5419f17710 Move rlimit-setting code from runsc main to run only as part of runsc boot.
This allows other `runsc` subcommands to run in restricted contexts without
printing a warning about not being able to set `RLIMIT_MEMLOCK` in contexts
where this doesn't matter.

In particular, this helps with `runsc metric-server`, which can be locked down
to run with very few capabilities.

PiperOrigin-RevId: 508491474
2023-02-09 15:29:05 -08:00
Ayush Ranjan c4fe64c5ef Add dir= prefix in overlay2 flag's medium.
This is to make it clear that the host file will be created inside this
directory. This also makes it look cleaner when other medium options are added
later.

PiperOrigin-RevId: 505033408
2023-01-26 22:56:52 -08:00
Etienne Perot 2ce059fcad gVisor: Implement runsc metric-server which serves Prometheus metrics.
The metrics server implements the following interface:

- GET `/metrics`: Serves Prometheus metrics.
- POST `/runsc-metrics`: Contains administrative endpoints.
  All of them require the `root` argument to be specified and to match the one
  that the server expects. This is used to avoid confusing multiple instances
  of the `runsc` metrics server.
  - POST `/runsc-metrics/healthcheck`: Used by clients to verify that the
    server is running and is the expected metric server.

This change has no tests, but coverage is provided in a later change that
adds end-to-end container tests verifying that the metric server works and
exports data faithfully.

This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:

- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
  goroutine.
  - Since we don't want to write our own HTTP server implementation, this
    means the HTTP endpoint has to be served by a separate process that
    remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
  - This means we cannot trust metrics data that comes out of the Sentry.
    Therefore, there needs to be an elaborate dance where we pre-register
    metric metadata before starting any untrusted workload. Then, the server
    relaying the metric data must verify the validity of metric values against
    this metric metadata. This avoids leaking metrics, cardinality blow-ups,
    and other such DoS vectors.
- This feature needs to be easy-to-use in a typical Docker setting.
  - This means having the ability to just say
    `--metrics-server=localhost:1337` in the `runsc` runtime entry in
    `/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
    containers are running.
  - Since only one process may listen on a port at a given time, this means
    the metric server needs to be able to multiplex requests out to multiple
    running sandboxes, and remain alive for the entire duration of any of
    these sandboxes.
  - For this reason, the metrics server runs *outside* of the usual
    per-container cgroups.
  - This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
  that its clients are trustworthy.
  - For this reason, a metrics server is bound to a runtime root directory,
    and double-checks that all the sandboxes it is asked to follow actually
    exist in this root directory.
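
The "pre-register, then verify" step described above can be sketched as a simple allow-list check. The names `registry`, `newRegistry`, and `validate` are hypothetical, and the real server presumably validates more than metric names (types, labels, cardinality); this only shows the core idea.

```go
package main

import (
	"fmt"
	"sort"
)

// registry holds the metric names that were registered before any untrusted
// workload started.
type registry struct {
	known map[string]bool
}

// newRegistry records the trusted set of metric names.
func newRegistry(names ...string) *registry {
	r := &registry{known: make(map[string]bool, len(names))}
	for _, n := range names {
		r.known[n] = true
	}
	return r
}

// validate rejects a snapshot from the sandbox if it contains any metric that
// was not registered up front.
func (r *registry) validate(snapshot map[string]uint64) error {
	var unknown []string
	for name := range snapshot {
		if !r.known[name] {
			unknown = append(unknown, name)
		}
	}
	if len(unknown) > 0 {
		sort.Strings(unknown) // deterministic error message
		return fmt.Errorf("unregistered metrics: %v", unknown)
	}
	return nil
}

func main() {
	r := newRegistry("/task/syscalls", "/task/faults")
	fmt.Println(r.validate(map[string]uint64{"/task/syscalls": 7}))
	fmt.Println(r.validate(map[string]uint64{"/made/up": 1}))
}
```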

PiperOrigin-RevId: 503254345
2023-01-19 13:45:46 -08:00
Rahat Mahmood ef96e9328e Disable io_uring syscalls by default.
The current io_uring support is very limited and experimental. Disable
it by default, and add a flag to enable it for testing.

PiperOrigin-RevId: 500760451
2023-01-09 11:11:57 -08:00
Etienne Perot 01061a8f20 gVisor: Add runsc metrics-export subcommand.
This subcommand prints a sandbox's instrumentation data in Prometheus format
to stdout.

This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:

- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
  goroutine.
  - Since we don't want to write our own HTTP server implementation, this
    means the HTTP endpoint has to be served by a separate process that
    remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
  - This means we cannot trust metrics data that comes out of the Sentry.
    Therefore, there needs to be an elaborate dance where we pre-register
    metric metadata before starting any untrusted workload. Then, the server
    relaying the metric data must verify the validity of metric values against
    this metric metadata. This avoids leaking metrics, cardinality blow-ups,
    and other such DoS vectors.
- This feature needs to be easy-to-use in a typical Docker setting.
  - This means having the ability to just say
    `--metrics-server=localhost:1337` in the `runsc` runtime entry in
    `/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
    containers are running.
  - Since only one process may listen on a port at a given time, this means
    the metric server needs to be able to multiplex requests out to multiple
    running sandboxes, and remain alive for the entire duration of any of
    these sandboxes. However, it should also die when there are no sandboxes,
    so that we don't end up with leftover metric servers lying around.
  - For this reason, the metrics server runs *outside* of the usual
    per-container cgroups.
  - This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
  that its clients are trustworthy.
  - For this reason, a metrics server is bound to a runtime root directory,
    and double-checks that all the sandboxes it is asked to follow actually
    exist in this root directory.

PiperOrigin-RevId: 498067941
2022-12-27 18:11:57 -08:00