The Prometheus parser doesn't guarantee that it returns metric data in
timestamp-sorted order. So sort it in the test manually.
Before: Fails 7 out of 2048 times
After: Fails 0 out of 2048 times
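A minimal sketch of the kind of sort the test can apply before comparing
data. The `sample` type and `sortByTimestamp` helper are hypothetical
stand-ins, not the parser's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical stand-in for one parsed metric data point.
type sample struct {
	TimestampNanos int64
	Value          float64
}

// sortByTimestamp orders samples chronologically. The stable sort keeps the
// parser's relative order for samples that share a timestamp.
func sortByTimestamp(samples []sample) {
	sort.SliceStable(samples, func(i, j int) bool {
		return samples[i].TimestampNanos < samples[j].TimestampNanos
	})
}

func main() {
	// Data as the parser might return it: not in timestamp order.
	samples := []sample{{30, 3}, {10, 1}, {20, 2}}
	sortByTimestamp(samples)
	for _, s := range samples {
		fmt.Println(s.TimestampNanos, s.Value)
	}
}
```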
PiperOrigin-RevId: 688224325
This allows detecting data corruption more finely.
A future change will skip over these errors, allowing data to
still be visualized even if partially corrupt.
PiperOrigin-RevId: 634038986
This is part of a series of changes to add metric charts in performance
benchmarks.
This change is meant to do three things:
- Remove the interface indirection from the Prometheus library, which is
performance-critical due to its use in writing out profiling metrics
(although the runsc metric server benefits from this too).
- Use a `StringWriter`-like writer contract, to avoid needless casting
between strings and bytes within the Prometheus library. The library only
ever needs to deal with strings, so it is up to callers to do the
conversion to bytes if they need to (which the runsc metric-server does).
- Avoid buffer allocations in the metric server when each snapshot is larger
than the buffer size. Instead, buffers are saved and reused.
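A rough sketch of what a `StringWriter`-like contract with a reused buffer
could look like, using the standard `io.StringWriter` interface. The
`writeSample` function and `reusableBuffer` type are illustrative
assumptions, not the library's actual API:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// writeSample renders one metric line using only string writes, so the
// rendering code never converts between string and []byte.
// The name and output layout are illustrative.
func writeSample(w io.StringWriter, name, value string) error {
	for _, s := range [...]string{name, " ", value, "\n"} {
		if _, err := w.WriteString(s); err != nil {
			return err
		}
	}
	return nil
}

// reusableBuffer is a hypothetical caller-side adapter: it converts strings
// to bytes and keeps its backing array across snapshots.
type reusableBuffer struct{ buf []byte }

func (b *reusableBuffer) WriteString(s string) (int, error) {
	b.buf = append(b.buf, s...)
	return len(s), nil
}

// Reset empties the buffer but retains its capacity for the next snapshot.
func (b *reusableBuffer) Reset() { b.buf = b.buf[:0] }

func main() {
	// strings.Builder already satisfies io.StringWriter.
	var sb strings.Builder
	if err := writeSample(&sb, "task_syscalls", "42"); err != nil {
		panic(err)
	}
	fmt.Print(sb.String())

	// The same buffer is reused for successive snapshots.
	buf := &reusableBuffer{}
	for i := 0; i < 2; i++ {
		buf.Reset()
		if err := writeSample(buf, "task_faults", "7"); err != nil {
			panic(err)
		}
		fmt.Printf("%d bytes, capacity %d\n", len(buf.buf), cap(buf.buf))
	}
}
```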
PiperOrigin-RevId: 630500004
There is an idea to run codespell as part of our presubmit checks.
Before enabling it for new changes, let's fix what it has found.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The idea behind conditionally compiled metrics originally was to use them in
hotpaths for profiling purposes. This CL makes that possible by outputting
declared metrics in TSV format, which can be used to track custom events at
runtime in relatively high resolution.
Usage:
1. Optionally enable compilation of runsc conditionally-compiled metrics
by passing in condmetric_profiling to the Go tags.
2. Add these flags to runsc:
- [Required] --profiling-metrics-log=/tmp/some.csv
- [Optional] --profiling-metrics=/task/syscalls,/task/faults
- If this flag is not specified it will monitor all
conditionally-compiled metrics by default.
- [Optional] --profiling-metrics-rate-us=10000
Some future improvements:
- Flag to output a metric-difference between timestamps instead of
constant accumulation.
- Output a gnuplot command along with the data.
- Current monitoring resolution is limited by what time.Sleep allows.
This can be overcome by spinning/yielding when lower monitoring
rates are requested.
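As a rough illustration of periodic sampling into tab-separated rows, here
is a self-contained sketch. The column layout, the `tsvRow` helper, and the
fake counters are assumptions for illustration only; the actual output
format is whatever runsc writes:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// tsvRow formats one snapshot: a timestamp column followed by one column per
// metric value. The layout is hypothetical.
func tsvRow(timestampNanos int64, values []uint64) string {
	var sb strings.Builder
	sb.WriteString(strconv.FormatInt(timestampNanos, 10))
	for _, v := range values {
		sb.WriteByte('\t')
		sb.WriteString(strconv.FormatUint(v, 10))
	}
	sb.WriteByte('\n')
	return sb.String()
}

func main() {
	// Stand-ins for two cumulative metric counters.
	var syscalls, faults uint64

	// Sample every 10000us, mirroring the rate in the flag example above.
	ticker := time.NewTicker(10000 * time.Microsecond)
	defer ticker.Stop()

	start := time.Now()
	fmt.Print("Time (ns)\t/task/syscalls\t/task/faults\n")
	for i := 0; i < 3; i++ {
		<-ticker.C
		syscalls += 5
		faults++
		fmt.Print(tsvRow(time.Since(start).Nanoseconds(), []uint64{syscalls, faults}))
	}
}
```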
PiperOrigin-RevId: 560849611
This adds additional checks that the distribution statistics (minimum,
maximum, sum-of-squared-deviations) trend in the correct direction across
snapshots.
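A sketch of what such trend checks might look like, assuming the minimum may
only decrease, the maximum may only increase, and the
sum-of-squared-deviations may only increase as samples are added. The
`distStats` type and `checkTrend` function are hypothetical, and a real
check may need to tolerate floating-point rounding:

```go
package main

import "fmt"

// distStats is a hypothetical holder for a distribution's running statistics.
type distStats struct {
	Min, Max               int64
	SumOfSquaredDeviations float64
}

// checkTrend verifies that statistics moved in a plausible direction between
// two successive snapshots of the same distribution.
func checkTrend(prev, next distStats) error {
	if next.Min > prev.Min {
		return fmt.Errorf("minimum increased: %d -> %d", prev.Min, next.Min)
	}
	if next.Max < prev.Max {
		return fmt.Errorf("maximum decreased: %d -> %d", prev.Max, next.Max)
	}
	if next.SumOfSquaredDeviations < prev.SumOfSquaredDeviations {
		return fmt.Errorf("sum of squared deviations decreased: %g -> %g",
			prev.SumOfSquaredDeviations, next.SumOfSquaredDeviations)
	}
	return nil
}

func main() {
	prev := distStats{Min: 1, Max: 5, SumOfSquaredDeviations: 2.0}
	fmt.Println(checkTrend(prev, distStats{Min: 0, Max: 6, SumOfSquaredDeviations: 2.5}))
	fmt.Println(checkTrend(prev, distStats{Min: 2, Max: 5, SumOfSquaredDeviations: 2.0}))
}
```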
PiperOrigin-RevId: 538028640
The Prometheus metric verification library uses a `numberPacker` to pack the
numbers it must retain from snapshot to snapshot into a tiny amount of space.
Numbers that fit in a few bits are stored as-is. Larger numbers are stored
in the `numberPacker` struct itself, and the packed number instead
represents an offset within that struct.
When the library verifies a new snapshot, it instantiates a new `numberPacker`
so that numbers that are no longer referenced are not kept around forever.
This works well in most cases, but in the case where a new snapshot does *not*
contain a particular metric for whatever reason, the library will still report
it as existing, but the indirectly-referenced numbers will no longer exist.
This CL reworks how `numberPacker` is used such that this usage of
indirectly-stored numbers is tracked more precisely across snapshots, and
ensures that all such numbers are ported over from a snapshot to the next even
if the next snapshot only contains a partial result.
It also changes the `numberPacker` semantics to have its storage be explicitly
allocated, and will `panic` if asked to store more than that. This means the
`numberPacker` never needs to allocate memory, so (as a bonus) it can use
`go:nosplit`. Additionally, the Verifier checks that it uses exactly all the
storage slots it thinks it will need, which ensures that it has correctly
tracked the expected usage.
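The explicitly-allocated storage semantics can be pictured with a small
sketch. The `fixedStore` type below is a hypothetical simplification, not
the real `numberPacker`: it preallocates its slots, panics when asked to
store more, and lets the caller confirm that every slot was used:

```go
package main

import "fmt"

// fixedStore is a hypothetical stand-in for indirect number storage whose
// capacity is decided up front.
type fixedStore struct{ data []uint64 }

// newFixedStore allocates all storage slots once.
func newFixedStore(slots int) *fixedStore {
	return &fixedStore{data: make([]uint64, 0, slots)}
}

// store saves v in the next slot and returns its index. It panics rather
// than growing, so it never allocates after construction.
func (s *fixedStore) store(v uint64) uint32 {
	if len(s.data) == cap(s.data) {
		panic("fixedStore: capacity exceeded")
	}
	s.data = append(s.data, v)
	return uint32(len(s.data) - 1)
}

// load returns the number stored at index i.
func (s *fixedStore) load(i uint32) uint64 { return s.data[i] }

// full reports whether every preallocated slot has been used, which is how a
// caller can check that its usage estimate was exact.
func (s *fixedStore) full() bool { return len(s.data) == cap(s.data) }

func main() {
	s := newFixedStore(2)
	a := s.store(1 << 40)
	b := s.store(1 << 50)
	fmt.Println(s.load(a), s.load(b), s.full())
}
```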
The tests are minimal but will be further reinforced in a future CL which adds
additional distribution statistics into distribution metrics. One of these
(the sum-of-squared-deviations statistic) is a floating-point number which
almost always requires indirect storage, and thus provides more consistent
coverage. Previous unit tests almost never actually required indirect storage,
hence this bug not having been found until now.
PiperOrigin-RevId: 538002697
This plumbs the distribution statistics out to the data consumed from the
gVisor sandbox process by the `runsc metric-server` process.
PiperOrigin-RevId: 537990236
Prior to this CL, attempting to write data for a data point with no labels
other than `d.ExternalLabels` set would not actually print these labels.
This CL adds `d.ExternalLabels` to the check for whether there are any
labels to write, and simplifies that check to not test for nil-ness (as the
`len` of a `nil` map is 0).
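For illustration, a check of this shape needs no nil test because `len`
works on `nil` maps. The `data` struct and `hasLabels` helper below are
simplified, hypothetical stand-ins for the real types:

```go
package main

import "fmt"

// data is a simplified stand-in with per-data and shared (external) labels.
type data struct {
	Labels         map[string]string
	ExternalLabels map[string]string
}

// hasLabels reports whether there is any label to write. No nil checks are
// needed: len of a nil map is 0.
func hasLabels(d *data) bool {
	return len(d.Labels) != 0 || len(d.ExternalLabels) != 0
}

func main() {
	fmt.Println(hasLabels(&data{}))
	fmt.Println(hasLabels(&data{ExternalLabels: map[string]string{"sandbox": "abc"}}))
}
```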
PiperOrigin-RevId: 526715338
This is useful for metrics where some labels are specific to the `Data` struct
in question, while others are shared. It is wasteful to create unique maps
for each `*Data` when they contain mostly the same labels.
There used to only be one such metric that requires data labels, but an
upcoming change will change that, making this change worthwhile to avoid
wasting memory.
PiperOrigin-RevId: 521596697
This prefix is used by the metric server to synthesize its own metrics.
If the sandbox were to define metrics with the same name, they would conflict.
This prefix check prevents a malicious sandbox from defining metrics that
conflict with those that the metric server is trying to export.
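A minimal sketch of such a check. The prefix value `meta_` and the
`validateName` function are placeholders chosen for illustration; the real
prefix is defined by the library:

```go
package main

import (
	"fmt"
	"strings"
)

// reservedPrefix is a hypothetical value standing in for the prefix that the
// metric server reserves for its own synthetic metrics.
const reservedPrefix = "meta_"

// validateName rejects sandbox-defined metric names that use the reserved
// prefix.
func validateName(name string) error {
	if strings.HasPrefix(name, reservedPrefix) {
		return fmt.Errorf("metric name %q uses reserved prefix %q", name, reservedPrefix)
	}
	return nil
}

func main() {
	fmt.Println(validateName("task_syscalls"))
	fmt.Println(validateName("meta_sandbox_running"))
}
```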
PiperOrigin-RevId: 518922970
This is an effort to reduce the metric server's memory usage so that it is a
well-behaved background process.
With 110 sandboxes running, at rest, this goes from
```
VmRSS: 72376 kB
RssAnon: 51944 kB
```
to:
```
VmRSS: 45864 kB
RssAnon: 25788 kB
```
This GCs much more aggressively, including after every single request, which
means we do spend disproportionately more CPU in order to get that low memory
usage. From my testing, serving requests takes about 12% more CPU, and it's
all spent in GC.
The optimizations that went into this are:
- Add a method in `state` to discard the global type maps.
- Add a custom "packed" number type in `prometheus` library that encodes small
integers and floating-point numbers in 32 bits whenever possible without
loss of precision, otherwise they are encoded in their full 64-bit glory and
the 32-bit representation is used as a pointer to the 64-bit representation.
These are stored either per-sandbox (for static-after-sandbox-creation
numbers like distribution bucket boundaries), or per-metric-retrieval
attempt otherwise.
- Use string interning for commonly-seen strings across sandboxes, like metric
names and label names. Label values are also interned, but only at a
per-sandbox granularity.
- Rework allocation-heavy functions like `OrderedLabels` and some string
rendering functions to be (almost) allocation-free. This doesn't reduce
memory usage at rest, and does increase their CPU cost, but in return it
significantly cuts down on the percentage of CPU time spent in GC
(>50% -> 25%) enough to justify spending the extra CPU in these functions.
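To make the "packed" number idea concrete, here is a simplified sketch that
handles unsigned integers only. The tag bit, the `packedStore` type, and its
methods are assumptions for illustration and do not reflect the library's
actual encoding:

```go
package main

import "fmt"

// indirectBit marks a packed value as an index into 64-bit storage rather
// than a directly-encoded number. This layout is hypothetical.
const indirectBit = uint32(1) << 31

// packedStore holds the full 64-bit numbers that do not fit in 31 bits.
type packedStore struct{ large []uint64 }

// pack returns a 32-bit representation of v. Small values are encoded
// directly; larger ones are stored in full and referenced by index.
func (s *packedStore) pack(v uint64) uint32 {
	if v < uint64(indirectBit) {
		return uint32(v)
	}
	s.large = append(s.large, v)
	return indirectBit | uint32(len(s.large)-1)
}

// unpack recovers the original number without loss of precision.
func (s *packedStore) unpack(p uint32) uint64 {
	if p&indirectBit == 0 {
		return uint64(p)
	}
	return s.large[p&^indirectBit]
}

func main() {
	s := &packedStore{}
	for _, v := range []uint64{42, 1 << 40} {
		p := s.pack(v)
		fmt.Printf("value=%d packed=%#x unpacked=%d\n", v, p, s.unpack(p))
	}
}
```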
PiperOrigin-RevId: 515181387
This synthetic metric contains the Unix timestamp that each sandbox was
started at.
This is useful for counter metrics, as it allows rates of change over time to
be properly computed on a per-sandbox basis.
PiperOrigin-RevId: 505228878
Prior to this CL, the metric server generated its own ID when it discovered
a new sandbox. This means existing sandboxes get new iteration IDs, which
breaks the continuity of counter metrics across process restarts.
This CL instead reads the container creation time from the sandbox state file
and uses it (together with the sandbox ID) to generate a unique ID for each
instantiation of a sandbox with a given ID.
Added test to verify this behavior.
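One way to picture this scheme: derive the ID deterministically from data
that survives a metric server restart. The `iterationID` function and its
output format are hypothetical, not the server's actual ID format:

```go
package main

import (
	"fmt"
	"strconv"
)

// iterationID derives an ID from the sandbox ID and its creation time (as
// Unix nanoseconds). Because both inputs come from persisted state, the same
// sandbox instance maps to the same ID every time it is discovered.
func iterationID(sandboxID string, createdUnixNanos int64) string {
	return sandboxID + "@" + strconv.FormatInt(createdUnixNanos, 10)
}

func main() {
	// Rediscovering the same sandbox instance yields the same ID.
	fmt.Println(iterationID("abc", 1700000000000000000))
	fmt.Println(iterationID("abc", 1700000000000000000))
	// A later sandbox reusing the ID "abc" gets a distinct iteration ID.
	fmt.Println(iterationID("abc", 1700000500000000000))
}
```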
PiperOrigin-RevId: 505208685
This removes `prometheus.Snapshot.WriteTo` and replaces it with a `Write`
function that handles writing multiple `Snapshot` objects to the same writer.
This is necessary for following the OpenMetrics spec more closely, which e.g.
requires same-name metrics to be grouped together in the output. When rendering
data from multiple `Snapshot`s from multiple sandboxes, this grouping was not
respected.
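A simplified sketch of the grouping requirement: samples from all snapshots
are regrouped by metric name before rendering. The `sample` and `snapshot`
types and the `render` function are hypothetical stand-ins, and the output
omits preambles and other details of the real format:

```go
package main

import (
	"fmt"
	"strings"
)

// sample and snapshot are hypothetical, simplified data types.
type sample struct {
	Metric string
	Labels string
	Value  float64
}

type snapshot struct{ Samples []sample }

// render writes samples from all snapshots so that samples sharing a metric
// name are adjacent, in order of each name's first appearance.
func render(snapshots ...snapshot) string {
	var names []string
	grouped := make(map[string][]sample)
	for _, snap := range snapshots {
		for _, s := range snap.Samples {
			if _, seen := grouped[s.Metric]; !seen {
				names = append(names, s.Metric)
			}
			grouped[s.Metric] = append(grouped[s.Metric], s)
		}
	}
	var sb strings.Builder
	for _, name := range names {
		for _, s := range grouped[name] {
			fmt.Fprintf(&sb, "%s{%s} %g\n", s.Metric, s.Labels, s.Value)
		}
	}
	return sb.String()
}

func main() {
	first := snapshot{Samples: []sample{
		{"syscalls", `sandbox="1"`, 1},
		{"faults", `sandbox="1"`, 2},
	}}
	second := snapshot{Samples: []sample{
		{"syscalls", `sandbox="2"`, 3},
		{"faults", `sandbox="2"`, 4},
	}}
	fmt.Print(render(first, second))
}
```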
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`.
PiperOrigin-RevId: 499551401
This library implements metric verifier functionality. Given metric
registration information extracted from the sandbox at boot time (before
any untrusted container is started), it accepts successive data snapshots
and verifies that they meet all checks: metrics exist, metadata matches,
cardinality is within bounds, etc.
A metric verifier is stateful, as it verifies that counters count only
upwards, and snapshots in time are only taken with time advancing ever
forward.
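The stateful part can be sketched as follows. The `verifier` type here is a
heavily simplified, hypothetical stand-in that only tracks counters and
snapshot time, and it assumes timestamps must strictly increase:

```go
package main

import "fmt"

// verifier remembers the last accepted snapshot so the next one can be
// compared against it.
type verifier struct {
	lastTimestampNanos int64
	counters           map[string]uint64
}

// verify accepts a snapshot only if time moved forward and no counter went
// down. State is updated only when every check passes.
func (v *verifier) verify(timestampNanos int64, counters map[string]uint64) error {
	if timestampNanos <= v.lastTimestampNanos {
		return fmt.Errorf("snapshot time %d is not after previous time %d",
			timestampNanos, v.lastTimestampNanos)
	}
	for name, value := range counters {
		if prev, ok := v.counters[name]; ok && value < prev {
			return fmt.Errorf("counter %q went backwards: %d -> %d", name, prev, value)
		}
	}
	v.lastTimestampNanos = timestampNanos
	if v.counters == nil {
		v.counters = make(map[string]uint64, len(counters))
	}
	for name, value := range counters {
		v.counters[name] = value
	}
	return nil
}

func main() {
	v := &verifier{}
	fmt.Println(v.verify(10, map[string]uint64{"syscalls": 5}))
	fmt.Println(v.verify(20, map[string]uint64{"syscalls": 4}))
	fmt.Println(v.verify(20, map[string]uint64{"syscalls": 6}))
}
```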
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:
- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
goroutine.
- Since we don't want to write our own HTTP server implementation, this
means the HTTP endpoint has to be served by a separate process that
remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
- This means we cannot trust metrics data that comes out of the Sentry.
Therefore, there needs to be an elaborate dance where we pre-register
metric metadata before starting any untrusted workload. Then, the server
relaying the metric data must verify the validity of metric values against
this metric metadata. This avoids leaking metrics, cardinality blow-ups,
and other such DoS vectors.
- This feature needs to be easy-to-use in a typical Docker setting.
- This means having the ability to just say
`--metrics-server=localhost:1337` in the `runsc` runtime entry in
`/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
containers are running.
- Since only one process may listen on a port at a given time, this means
the metric server needs to be able to multiplex requests out to multiple
running sandboxes, and remain alive for as long as any of these
sandboxes is running. However, it should also die when there are no sandboxes,
so that we don't end up with leftover metric servers lying around.
- For this reason, the metrics server runs *outside* of the usual
per-container cgroups.
- This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
that its clients are trustworthy.
- For this reason, a metrics server is bound to a runtime root directory,
and double-checks that all the sandboxes it is asked to follow actually
exist in this root directory.
PiperOrigin-RevId: 499370269
When writing data from multiple Prometheus snapshots at once to the same
output stream, writing the same metric preamble (`HELP`/`TYPE` comment) more
than once is invalid.
This change moves the `metricsWritten` map (used to track which metrics have
had their preamble already written) into `ExportOptions`, which allows it to
be shared across `Snapshot`s.
This is useful in the metrics server, which has to write data from multiple
sandboxes in the same HTTP response. This way, metrics preamble from the
second sandbox is not written to the output.
Also print a newline before each new preamble, for aesthetics.
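A small sketch of sharing the written-preamble state across snapshots. The
`exportOptions` type and `preamble` method below are simplified assumptions
modeled on the description above, not the actual `ExportOptions` API:

```go
package main

import "fmt"

// exportOptions carries state shared by every snapshot written to the same
// output stream.
type exportOptions struct {
	metricsWritten map[string]bool
}

// preamble returns the HELP/TYPE comment for a metric the first time it is
// requested, preceded by a newline, and an empty string afterwards.
func (o *exportOptions) preamble(name, help, metricType string) string {
	if o.metricsWritten[name] {
		return ""
	}
	if o.metricsWritten == nil {
		o.metricsWritten = make(map[string]bool)
	}
	o.metricsWritten[name] = true
	return fmt.Sprintf("\n# HELP %s %s\n# TYPE %s %s\n", name, help, name, metricType)
}

func main() {
	opts := &exportOptions{}
	// Two sandboxes export the same metric in one response; only the first
	// one emits the preamble.
	for _, sandbox := range []string{"1", "2"} {
		fmt.Print(opts.preamble("syscalls", "Number of syscalls.", "counter"))
		fmt.Printf("syscalls{sandbox=%q} 10\n", sandbox)
	}
}
```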
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`.
PiperOrigin-RevId: 499351638
This adds a new library, `//pkg/prometheus`, which contains just enough data
structures to encode instrumentation information in the Prometheus
format. These data structures are JSON-encodable, such that they can be
used over the `runsc` control channel for export (implemented in a future CL).
The existing `metric.go` library gains new functionality to export its own
data using this new export format.
This change is part of a series of changes to support Prometheus-style metrics
in `runsc`. Doing so requires making several seemingly-odd design decisions,
due to the following architectural constraints:
- Prometheus requires an HTTP server serving the `/metrics` endpoint.
- For performance reasons, the `runsc boot` process cannot run the `netpoller`
goroutine.
- Since we don't want to write our own HTTP server implementation, this
means the HTTP endpoint has to be served by a separate process that
remains running during the lifetime of the container.
- The `runsc boot` process is untrusted.
- This means we cannot trust metrics data that comes out of the Sentry.
Therefore, there needs to be an elaborate dance where we pre-register
metric metadata before starting any untrusted workload. Then, the server
relaying the metric data must verify the validity of metric values against
this metric metadata. This avoids leaking metrics, cardinality blow-ups,
and other such DoS vectors.
- This feature needs to be easy-to-use in a typical Docker setting.
- This means having the ability to just say
`--metrics-server=localhost:1337` in the `runsc` runtime entry in
`/etc/docker/daemon.json` and have that Just Work(TM), even when multiple
containers are running.
- Since only one process may listen on a port at a given time, this means
the metric server needs to be able to multiplex requests out to multiple
running sandboxes, and remain alive for as long as any of these
sandboxes is running. However, it should also die when there are no sandboxes,
so that we don't end up with leftover metric servers lying around.
- For this reason, the metrics server runs *outside* of the usual
per-container cgroups.
- This also saves system resources by not running one server per sandbox.
- The metrics server must be exposed to the outside world, and cannot assume
that its clients are trustworthy.
- For this reason, a metrics server is bound to a runtime root directory,
and double-checks that all the sandboxes it is asked to follow actually
exist in this root directory.
PiperOrigin-RevId: 498039624