Recently printf.Analyzer has become stricter
(https://github.com/golang/go/issues/60529)
which led to new findings.
gvisor nogo tests run this analyzer and fail if it produces findings.
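As an illustration of the kind of finding involved, here is a minimal sketch. It assumes the new findings are of the "non-constant format string with no other arguments" kind; `logf` and `report` are hypothetical names, not gVisor functions.

```go
package main

import "fmt"

// logf is a hypothetical printf-style wrapper, standing in for a logging helper.
func logf(format string, args ...any) string {
	return fmt.Sprintf(format, args...)
}

// report formats a message that is only known at run time.
func report(msg string) string {
	// The form a stricter analyzer is assumed to flag (left as a comment so
	// that this example itself stays clean under `go vet`):
	//
	//	return logf(msg)
	//
	// Passing msg as an argument keeps any '%' in it from being interpreted
	// as a formatting verb.
	return logf("%s", msg)
}
```

The usual fix is mechanical: add an explicit `"%s"` format, or switch to a non-formatting variant of the call.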
PiperOrigin-RevId: 671657227
Prior to this change, the log files each had their own timestamp computed
independently. For example, this means that the coverage log file, the panic
log file, the debug log file, the first Gofer's log file, and the profile
files for the same Sentry may all have different timestamps in their
filenames. Now they are the same.
This change introduces a central `runsc/starttime` package whose sole
purpose is to hold the start time of the `runsc` process, for easy plumbing
to all places that need it.
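A minimal sketch of the idea, assuming a lazily initialized package-level value. The names `Get` and `logFileName` and the timestamp layout are hypothetical; the real package's API is not shown here.

```go
package main

import (
	"sync"
	"time"
)

var (
	once      sync.Once
	startTime time.Time
)

// Get returns the process start time. The value is recorded on the first
// call and every later call returns the same instant.
func Get() time.Time {
	once.Do(func() { startTime = time.Now() })
	return startTime
}

// logFileName derives a file name from the shared timestamp, so that files
// of different kinds carry the same timestamp in their names.
func logFileName(kind string) string {
	return kind + "." + Get().Format("20060102-150405.000000") + ".log"
}
```

Because every caller goes through `Get`, the panic log, debug log, and profile files all end up with identical timestamps.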
PiperOrigin-RevId: 646667986
This is part of a series of changes to add metric charts in performance
benchmarks.
This change makes it easy to extract profiling metric data regardless of
runtime configuration.
It also changes where profiling metrics are initialized and
configured. They are now only part of `runsc boot`, rather than all `runsc`
invocations.
PiperOrigin-RevId: 629900287
Changes:
- Header is more compact (in non-debug mode).
- Added Go runtime version.
- Added number of CPU cores.
- In debug mode, log page size.
- Removed some non-important pieces of configuration from info-level logs.
- Made debug mode enable the entire configuration to be logged.
- Moved initialization of some non-logging-related components to after log
initialization. (Some of them *use* the log, so it makes no sense to
initialize them before logging is actually set up.)
PiperOrigin-RevId: 595309362
On Kubernetes, these are logged at the pod level.
This makes it convenient to debug gVisor on Kubernetes clusters where
SSH access to nodes is difficult or prohibited by policy. With this and
other pod annotations, it is possible to do strace debugging with only
pod annotations and no SSH.
PiperOrigin-RevId: 595197543
This patch decouples GoferMountConf into two layers to allow us to
configure all combinations of a gofer mount in a succinct way:
- Upper layer config: none, memory, self, anon. The upper layer
is always tmpfs. It describes the backend for tmpfs.
- Lower layer config: none, lisafs. It describes the backend for
the filesystem which actually holds the image contents.
The old SelfTmpfs will be represented as "upper=self,lower=none",
MemoryOverlay will be "upper=memory,lower=lisafs", SelfOverlay
will be "upper=self,lower=lisafs", and so on. Thanks to @ayushr2
for the suggestion on how to better decouple this.
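The two-layer description can be sketched as a pair of small enums. This is an illustration only: the type and constant names below are hypothetical and are not taken from the patch; only the layer values and the three legacy mappings come from the text above.

```go
package main

// UpperConf is a hypothetical upper-layer setting: the backend for tmpfs.
type UpperConf int

const (
	UpperNone UpperConf = iota
	UpperMemory
	UpperSelf
	UpperAnon
)

// LowerConf is a hypothetical lower-layer setting: the backend for the
// filesystem that holds the image contents.
type LowerConf int

const (
	LowerNone LowerConf = iota
	LowerLisafs
)

var upperNames = map[UpperConf]string{
	UpperNone: "none", UpperMemory: "memory", UpperSelf: "self", UpperAnon: "anon",
}

var lowerNames = map[LowerConf]string{LowerNone: "none", LowerLisafs: "lisafs"}

// MountConf combines one upper and one lower setting.
type MountConf struct {
	Upper UpperConf
	Lower LowerConf
}

// String renders the configuration in the "upper=...,lower=..." form.
func (m MountConf) String() string {
	return "upper=" + upperNames[m.Upper] + ",lower=" + lowerNames[m.Lower]
}

// legacy maps the old single-value names mentioned above to layer pairs.
var legacy = map[string]MountConf{
	"SelfTmpfs":     {Upper: UpperSelf, Lower: LowerNone},
	"MemoryOverlay": {Upper: UpperMemory, Lower: LowerLisafs},
	"SelfOverlay":   {Upper: UpperSelf, Lower: LowerLisafs},
}
```

Splitting the value this way means a new lower-layer backend can be added as one more constant instead of one new combined value per upper-layer setting.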
This is a preparation for adding the EROFS rootfs support. There
is no functional change intended.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Since the driver versions are hard-coded, determining the supported
version list requires code inspection, which is difficult to automate.
Add a sub-command (and category) to print the list of nvidia
driver versions supported by nvproxy for a given build of runsc.
Signed-off-by: Josh Seba <jseba@cloudflare.com>
The idea behind conditionally compiled metrics originally was to use them in
hotpaths for profiling purposes. This CL makes that possible by outputting
declared metrics in TSV format, which can be used to track custom events at
runtime in relatively high resolution.
Usage:
1. Optionally enable compilation of runsc conditionally-compiled metrics
   by adding condmetric_profiling to the Go build tags.
2. Add these flags to runsc:
   - [Required] --profiling-metrics-log=/tmp/some.csv
   - [Optional] --profiling-metrics=/task/syscalls,/task/faults
     - If this flag is not specified, all conditionally-compiled metrics
       are monitored by default.
   - [Optional] --profiling-metrics-rate-us=10000
Some future improvements:
- Flag to output a metric-difference between timestamps instead of
constant accumulation.
- Output a gnuplot command along with the data.
- Current monitoring resolution is limited by what time.Sleep allows.
This can be overcome by spinning/yielding when lower monitoring
rates are requested.
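The sampling loop described above can be sketched roughly as follows. This is not the runsc implementation: `formatRow`, `collect`, the `read` callback, and the `Time (us)` header are all assumptions made for illustration.

```go
package main

import (
	"strconv"
	"strings"
	"time"
)

// formatRow renders one TSV row: elapsed microseconds, then one accumulated
// value per metric, separated by tabs.
func formatRow(elapsedUs int64, values []uint64) string {
	cols := make([]string, 0, len(values)+1)
	cols = append(cols, strconv.FormatInt(elapsedUs, 10))
	for _, v := range values {
		cols = append(cols, strconv.FormatUint(v, 10))
	}
	return strings.Join(cols, "\t")
}

// collect reads every named metric `samples` times, sleeping `rate` between
// samples, and returns a header line followed by one row per sample.
// The resolution is bounded by what time.Sleep allows.
func collect(names []string, read func(name string) uint64, rate time.Duration, samples int) []string {
	rows := []string{"Time (us)\t" + strings.Join(names, "\t")}
	start := time.Now()
	values := make([]uint64, len(names))
	for i := 0; i < samples; i++ {
		for j, name := range names {
			values[j] = read(name)
		}
		rows = append(rows, formatRow(time.Since(start).Microseconds(), values))
		time.Sleep(rate)
	}
	return rows
}
```

Each row holds the accumulated value at sampling time, which matches the "constant accumulation" behavior noted in the list of future improvements.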
PiperOrigin-RevId: 560849611
Set(String()) should be an idempotent operation. This is a useful property
which allows us to generate args while re-execing the same process. Setting
`--flag-name=val.String()` should work.
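A minimal sketch of the round-trip property, using a hypothetical `layerFlag` type that implements the standard `flag.Value` interface. The `scope:medium` syntax and the `reexecArg` helper are assumptions for illustration, not the actual runsc flag types.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// layerFlag is a hypothetical two-part flag value written as "scope:medium",
// or "none" when unset.
type layerFlag struct {
	Scope  string
	Medium string
}

// Compile-time check that *layerFlag satisfies flag.Value.
var _ flag.Value = (*layerFlag)(nil)

// String renders the value in exactly the syntax that Set accepts.
func (l *layerFlag) String() string {
	if l.Scope == "" {
		return "none"
	}
	return l.Scope + ":" + l.Medium
}

// Set parses the syntax produced by String, so Set(String()) is a no-op.
func (l *layerFlag) Set(v string) error {
	if v == "none" {
		*l = layerFlag{}
		return nil
	}
	scope, medium, ok := strings.Cut(v, ":")
	if !ok || scope == "" || medium == "" {
		return fmt.Errorf("invalid value %q, want scope:medium or none", v)
	}
	*l = layerFlag{Scope: scope, Medium: medium}
	return nil
}

// reexecArg renders a flag the way a parent process could pass it to a
// re-exec'd copy of itself.
func reexecArg(name string, v flag.Value) string {
	return "--" + name + "=" + v.String()
}
```

The property to preserve is that every string `String` can produce, including the zero value's, is accepted by `Set` and yields an equal value.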
PiperOrigin-RevId: 552598313
Users/callers of config.Overlay2 relied on its internal functioning/layout.
The stable contract is at the --overlay2 flag level. So changed callers to use
that contract instead via Overlay2.Set().
Also unexported the fields to encapsulate the type correctly.
PiperOrigin-RevId: 530818635
This allows `runsc` subcommands that don't start sandboxes to run in
restricted contexts without printing a warning about not being able to set
`RLIMIT_MEMLOCK` when the ability to do so doesn't matter.
In particular, this helps with `runsc metric-server`, which can be locked down
to run with very few capabilities.
A previous version of this change had moved this to the beginning of the
`runsc boot` subcommand code. However, this doesn't work, because `runsc boot`
runs as an unprivileged user (`nobody`) and does not have `CAP_SYS_RESOURCE`.
Prior to that change, all `runsc` invocations tried to call `setrlimit`, so
what happened in practice is that `runsc create` (running as `root`) would
call `setrlimit`, and then `runsc boot` would inherit the `RLIM_INFINITY` and
would therefore never actually call `setrlimit` by itself. When moving the
`setrlimit` code to only run within `runsc boot`, suddenly the `runsc boot`
invocation found itself in a context where it started trying to call
`setrlimit`, which would silently fail.
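The inheritance behavior described above can be sketched as follows. This is a Linux-only illustration built on the standard `syscall` package, not the runsc code: `rlimitMemlock` hard-codes the Linux resource number because `syscall` does not export it, and the function names are hypothetical.

```go
package main

import "syscall"

// rlimitMemlock is the Linux resource number for RLIMIT_MEMLOCK. The standard
// syscall package does not export it, so it is hard-coded here (assumption:
// Linux only).
const rlimitMemlock = 8

// rlimInfinity is RLIM_INFINITY expressed as a uint64.
const rlimInfinity = ^uint64(0)

// needsRaise reports whether the current soft limit still has to be raised.
// A process that inherited RLIM_INFINITY from its parent has nothing to do.
func needsRaise(cur uint64) bool {
	return cur != rlimInfinity
}

// ensureUnlimitedMemlock raises RLIMIT_MEMLOCK to infinity unless the limit
// was already inherited that way. Raising the hard limit requires
// CAP_SYS_RESOURCE, so an unprivileged caller gets an error here.
func ensureUnlimitedMemlock() error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(rlimitMemlock, &lim); err != nil {
		return err
	}
	if !needsRaise(lim.Cur) {
		return nil // Inherited from a privileged parent; no syscall needed.
	}
	lim.Cur, lim.Max = rlimInfinity, rlimInfinity
	return syscall.Setrlimit(rlimitMemlock, &lim)
}
```

When a privileged parent has already set the limit, the child takes the early-return path and never needs `CAP_SYS_RESOURCE` itself.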
This approach has the downside of needlessly setting
`RLIM_INFINITY` on the calling `runsc` process. This was effectively what was
already happening prior to moving this code into `runsc boot` anyway, so this
should be OK. The alternative would be to add yet another intermediate
subcommand before `runsc boot` which runs with `CAP_SYS_RESOURCE`, then calls
`setrlimit`, then drops `CAP_SYS_RESOURCE`, then execs `runsc boot`, but that
seems to add more complexity to the boot process than is
warranted for this feature.
Thanks to Ayush Ranjan for bisecting the performance regression down to this
change.
Ran benchmarks and performance is comparable to before moving `setrlimit` code
within `runsc boot`.
PiperOrigin-RevId: 521916084
Add portforward command so that we can use runsc to forward connections
to container ports. This will eventually be supported in k8s.
PiperOrigin-RevId: 520739913
This is helpful so that it can be imported from other packages without import
loops.
This will be used in a follow-up change to add the version string as a
per-sandbox metric metadata label.
PiperOrigin-RevId: 519002695
directfs is a feature much like host networking. The gofer security boundary is
dropped for additional performance. The sentry is allowed to directly access the
host filesystem to access or mutate files in order to service application
syscalls. This mode bypasses the gofer completely.
With directfs, the sentry is run in the root user namespace. (Similar to how
hostinet works.) The sentry also runs with additional capabilities that allow
it to bypass file permissions checks, change file owner, chroot, etc. Seccomp
filters are loosened around the sentry to allow it to make filesystem syscalls
directly to the host.
The sandbox still runs with an empty pivot root. The gofer donates FDs for the
root mount and all bind mounts so that the sentry can perform filesystem
operations using them.
There is a dummy gofer process in directfs mode. The gofer sets up all bind
mounts and donates host FDs of the mount point to the sandbox. The gofer dies
with the sandbox. All gofer mount options are supported with directfs. directfs
mounts are set up similarly to how gofer mounts are set up.
PiperOrigin-RevId: 513634936
This flag is added for tests that need to trigger a panic in the sentry
kernel. This is only done for x86_64, which has a dedicated syscall number for
afs_syscall; ARM does not.
PiperOrigin-RevId: 512731631
This allows other `runsc` subcommands to run in restricted contexts without
printing a warning about not being able to set `RLIMIT_MEMLOCK` in contexts
where this doesn't matter.
In particular, this helps with `runsc metric-server`, which can be locked down
to run with very few capabilities.
PiperOrigin-RevId: 508491474
This is to make it clear that the host file will be created inside this
directory. This also makes it look cleaner when other medium options are added
later.
PiperOrigin-RevId: 505033408
The metrics server implements the following interface:
- GET `/metrics`: Serves Prometheus metrics.
- POST `/runsc-metrics`: Contains administrative endpoints.
  All of them require the `root` argument to be specified and to match the
  one that the server expects. This is used to avoid confusing multiple
  instances of the `runsc` metrics server.
  - POST `/runsc-metrics/healthcheck`: Used by clients to verify that the
    server is running and is the expected metric server.
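A rough sketch of the routing rules listed above. How the `root` argument is transmitted (a form value here) and which status codes are returned are assumptions; the `server` type is hypothetical and does not serve real metric data.

```go
package main

import "net/http"

// server is a hypothetical metrics server bound to one runtime root directory.
type server struct {
	root string
}

// status decides the HTTP status code for a request, given its method, its
// path, and the `root` argument supplied by the client.
func (s *server) status(method, path, root string) int {
	switch path {
	case "/metrics":
		if method != http.MethodGet {
			return http.StatusMethodNotAllowed
		}
		return http.StatusOK
	case "/runsc-metrics/healthcheck":
		if method != http.MethodPost {
			return http.StatusMethodNotAllowed
		}
		// Administrative endpoints require `root` to match the server's own
		// root, so a client cannot mistake one server instance for another.
		if root != s.root {
			return http.StatusBadRequest
		}
		return http.StatusOK
	default:
		return http.StatusNotFound
	}
}

// ServeHTTP adapts status to the http.Handler interface. Writing the actual
// metric data is omitted from this sketch.
func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(s.status(r.Method, r.URL.Path, r.FormValue("root")))
}
```

Keeping the decision in a pure function makes the `root` check easy to test without starting a listener.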
This change has no tests, but coverage is provided in a later change that
adds an end-to-end container test verifying that the metric server works and
exports data faithfully.
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:
- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
  goroutine.
  - Since we don't want to write our own HTTP server implementation, this
    means the HTTP endpoint has to be served by a separate process that
    remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
  - This means we cannot trust metrics data that comes out of the Sentry.
    Therefore, there needs to be an elaborate dance where we pre-register
    metric metadata before starting any untrusted workload. Then, the server
    relaying the metric data must verify the validity of metric values against
    this metric metadata. This avoids leaking metrics, cardinality blow-ups,
    and other such DoS vectors.
- This feature needs to be easy to use in a typical Docker setting.
  - This means having the ability to just say
    `--metrics-server=localhost:1337` in the `runsc` runtime entry in
    `/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
    containers are running.
  - Since only one process may listen on a port at a given time, this means
    the metric server needs to be able to multiplex requests out to multiple
    running sandboxes, and remain alive for as long as any of these
    sandboxes is running.
  - For this reason, the metrics server runs *outside* of the usual
    per-container cgroups.
  - This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
  that its clients are trustworthy.
  - For this reason, a metrics server is bound to a runtime root directory,
    and double-checks that all the sandboxes it is asked to follow actually
    exist in this root directory.
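The pre-registration check described in the list above can be sketched as follows. The `metadata`, `registry`, and `sample` types are hypothetical and only illustrate the idea of validating untrusted metric data against metadata recorded before the workload starts.

```go
package main

import "fmt"

// metadata describes one pre-registered metric: the label names it may carry
// and the allowed values for each label.
type metadata struct {
	allowed map[string][]string
}

// registry holds metadata recorded before any untrusted workload starts,
// keyed by metric name.
type registry map[string]metadata

// sample is one metric value as reported by the untrusted sandbox.
type sample struct {
	name   string
	labels map[string]string
	value  uint64
}

// verify rejects samples for unregistered metrics, samples with unknown or
// missing labels, and label values outside the pre-registered set. Bounding
// label values is what prevents cardinality blow-ups.
func (r registry) verify(s sample) error {
	md, ok := r[s.name]
	if !ok {
		return fmt.Errorf("metric %q was not pre-registered", s.name)
	}
	if len(s.labels) != len(md.allowed) {
		return fmt.Errorf("metric %q: got %d labels, want %d", s.name, len(s.labels), len(md.allowed))
	}
	for label, value := range s.labels {
		values, ok := md.allowed[label]
		if !ok {
			return fmt.Errorf("metric %q: unknown label %q", s.name, label)
		}
		if !contains(values, value) {
			return fmt.Errorf("metric %q: label %q has unregistered value %q", s.name, label, value)
		}
	}
	return nil
}

// contains reports whether v is one of values.
func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}
```

A relaying server built this way only forwards data whose shape was fixed while the process was still trusted.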
PiperOrigin-RevId: 503254345
The current io_uring support is very limited and experimental. Disable
it by default, and add a flag to enable it for testing.
PiperOrigin-RevId: 500760451
This subcommand prints a sandbox's instrumentation data in Prometheus format
to stdout.
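For reference, a small sketch of what one counter looks like in the Prometheus text exposition format. `renderCounter` and the metric name in the usage note are hypothetical; the subcommand's actual output is not reproduced here.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes label values as the Prometheus text format requires:
// backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderCounter renders a single counter sample, preceded by its HELP and
// TYPE lines. Labels are sorted by name so the output is deterministic.
func renderCounter(name, help string, labels map[string]string, value uint64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	var b strings.Builder
	b.WriteString("# HELP " + name + " " + help + "\n")
	b.WriteString("# TYPE " + name + " counter\n")
	b.WriteString(name)
	if len(pairs) > 0 {
		b.WriteString("{" + strings.Join(pairs, ",") + "}")
	}
	b.WriteString(" " + strconv.FormatUint(value, 10) + "\n")
	return b.String()
}
```

Calling `renderCounter("runsc_example_total", "Example counter.", map[string]string{"sandbox": "abc"}, 42)` yields a HELP line, a TYPE line, and the sample line `runsc_example_total{sandbox="abc"} 42`.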
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:
- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
  goroutine.
  - Since we don't want to write our own HTTP server implementation, this
    means the HTTP endpoint has to be served by a separate process that
    remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
  - This means we cannot trust metrics data that comes out of the Sentry.
    Therefore, there needs to be an elaborate dance where we pre-register
    metric metadata before starting any untrusted workload. Then, the server
    relaying the metric data must verify the validity of metric values against
    this metric metadata. This avoids leaking metrics, cardinality blow-ups,
    and other such DoS vectors.
- This feature needs to be easy to use in a typical Docker setting.
  - This means having the ability to just say
    `--metrics-server=localhost:1337` in the `runsc` runtime entry in
    `/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
    containers are running.
  - Since only one process may listen on a port at a given time, this means
    the metric server needs to be able to multiplex requests out to multiple
    running sandboxes, and remain alive for as long as any of these
    sandboxes is running. However, it should also die when there are no
    sandboxes, so that we don't end up with leftover metric servers lying
    around.
  - For this reason, the metrics server runs *outside* of the usual
    per-container cgroups.
  - This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
  that its clients are trustworthy.
  - For this reason, a metrics server is bound to a runtime root directory,
    and double-checks that all the sandboxes it is asked to follow actually
    exist in this root directory.
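The lifetime rule in the list above (stay alive while any sandbox is running, exit once none remain) can be sketched with a small tracker. The `tracker` type and its methods are hypothetical; how the real server discovers sandboxes is not shown here.

```go
package main

import "sync"

// tracker is a hypothetical record of the sandboxes a shared metrics server
// is currently following.
type tracker struct {
	mu     sync.Mutex
	active map[string]struct{}
	seen   bool // Whether any sandbox has ever been registered.
}

// add registers a running sandbox.
func (t *tracker) add(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active == nil {
		t.active = make(map[string]struct{})
	}
	t.active[id] = struct{}{}
	t.seen = true
}

// remove forgets a sandbox that has exited.
func (t *tracker) remove(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, id)
}

// shouldExit reports whether the server has served at least one sandbox and
// none remain, which is the point at which it should shut itself down.
func (t *tracker) shouldExit() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.seen && len(t.active) == 0
}
```

The `seen` flag is a design choice in this sketch: it keeps a freshly started server from exiting before its first sandbox has registered.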
PiperOrigin-RevId: 498067941