
(igor kisselev/Shutterstock)
Metrics promise a shared understanding across systems, but with evolving formats and complicated math, they often create more confusion than clarity. Here's what we're getting wrong and how we can fix it.
In 1887, an ophthalmologist named L.L. Zamenhof introduced Esperanto, a universal language designed to break down barriers and unite people around the world. It was ambitious, idealistic, and ultimately niche, with only about 100,000 speakers today.
Observability has its own version of Esperanto: metrics. They are the standardized, numerical representations of system health. In theory, metrics should simplify how we monitor and troubleshoot digital infrastructure. In practice, they are often misunderstood, misused, and maddeningly inconsistent.
Let's explore why metrics, our supposed universal language, remain so hard to get right.
Metrics, Decoded (and Re-Encoded)
A metric is a numeric measurement at a point in time. That sounds straightforward, until you dive into the nuance of how metrics are defined and used. Take redis.keyspace.hits, for example: a counter that tracks how often a Redis instance successfully finds data in the keyspace. Depending on the telemetry format (OpenTelemetry, Prometheus, or StatsD), it will be formatted differently, even with the same dimensions, aggregations, and metric value.
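To make that concrete, here is a simplified sketch (in Python) of how that single datapoint might be carried by each format. The instance tag, the value, and the trimmed-down OTLP structure are illustrative assumptions, not complete payloads for any of the three protocols.

# The same datapoint, redis.keyspace.hits = 1024 for one Redis instance,
# rendered roughly the way each format would carry it.

# Prometheus text exposition: a cumulative counter, renamed to underscores
# and conventionally suffixed with _total.
prometheus_line = 'redis_keyspace_hits_total{instance="redis-01"} 1024'

# StatsD line protocol (with DogStatsD-style tags): a delta increment of type "c".
statsd_line = 'redis.keyspace.hits:1024|c|#instance:redis-01'

# OTLP metrics (simplified): a Sum datapoint whose temporality and monotonicity
# travel as explicit metadata alongside the value.
otlp_datapoint = {
    "name": "redis.keyspace.hits",
    "sum": {
        "isMonotonic": True,
        "aggregationTemporality": "AGGREGATION_TEMPORALITY_CUMULATIVE",
        "dataPoints": [
            {"attributes": {"instance": "redis-01"}, "asInt": 1024},
        ],
    },
}

for representation in (prometheus_line, statsd_line, otlp_datapoint):
    print(representation)

One value, three shapes: a renamed cumulative counter, a fire-and-forget increment, and a structured datapoint that spells out its own temporality.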
We have competing standards such as StatsD, Prometheus, and OpenTelemetry (OTLP) metrics, each introducing its own approach to defining and transmitting datapoints and their associated metadata. These formats don't just differ in syntax; they differ in fundamental behavior and metadata structure. The result? Three tools may show you the same metric value, yet require entirely different logic to collect, store, and analyze it.
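That "different logic" is easiest to see on the read side. The sketch below, with invented sample values and hypothetical helper names, contrasts deriving a per-second rate from a Prometheus-style cumulative counter, where a drop in value has to be treated as a reset, with deriving it from StatsD- or delta-style increments, which can simply be summed.

def rate_from_cumulative(samples, interval_seconds):
    """Cumulative counter: the rate comes from the increase between scrapes,
    treating any drop in value as a counter reset (e.g. a process restart)."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += current if current < previous else current - previous
    return total / (interval_seconds * (len(samples) - 1))

def rate_from_deltas(deltas, interval_seconds):
    """Delta-style increments: each datapoint is already an increase,
    so the rate over the window is just the sum divided by elapsed time."""
    return sum(deltas) / (interval_seconds * len(deltas))

# The same underlying traffic, reported two different ways.
print(rate_from_cumulative([100, 160, 220, 40], interval_seconds=6))  # counter reset after 220
print(rate_from_deltas([60, 60, 40], interval_seconds=6))             # ~8.9 hits/second either way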
That fragmentation leads to operational confusion, inflated storage costs, and teams spending more time decoding telemetry than acting on it.
Format Conversion Does Not Equal Metric Understanding
Even when format translation is handled, aggregation still causes confusion. Imagine collecting redis.keyspace.hits every six seconds across 10 containers. If the container.id tag is dropped, the metric values must now be aggregated, and in OTLP, Prometheus, or StatsD that aggregation changes how the metric is interpreted. Prometheus might sum the values, OTLP can treat it as a delta counter, and StatsD may average them, which results in behavior more like a gauge than a counter. These subtle differences in interpretation can lead to inconsistent analysis. Without intentional handling of metrics, teams risk drawing incorrect conclusions from the data.
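As a toy illustration, with made-up per-container values, here is how far apart a counter-style roll-up and a gauge-style roll-up land on the same ten datapoints once container.id disappears.

# Ten per-container readings of redis.keyspace.hits for one interval; values are invented.
per_container_hits = {f"container-{i}": 120 + i for i in range(10)}

summed = sum(per_container_hits.values())     # counter-style roll-up: total hits across containers
averaged = summed / len(per_container_hits)   # gauge-style roll-up: mean hits per container

print(summed)    # 1245, what a counter-aware pipeline reports
print(averaged)  # 124.5, what an averaging pipeline reports for the same data

Neither number is wrong on its own; the trouble starts when one pipeline silently applies one rule while a dashboard or alert assumes the other.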
But even after format translation, the hardest part often comes next: deciding how to aggregate those metrics. The answer depends on the metric type. Summing gauges can produce incorrect results. Treating a delta as a cumulative counter can introduce risk. And aggregation math that is technically correct may still confuse downstream systems, especially if those systems expect monotonic behavior.
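That last failure mode is easy to reproduce. In the sketch below (the series are invented), summing cumulative counters from two containers is a defensible roll-up, yet a single container restart makes the combined series dip, and anything downstream that assumes monotonic growth now sees a reset that never happened at the workload level.

# Cumulative redis.keyspace.hits from two containers, sampled every six seconds.
container_a = [100, 160, 220, 280]
container_b = [90, 150, 5, 60]   # this container restarted, so its counter reset

# Drop container.id by summing the cumulative values at each timestamp.
rolled_up = [a + b for a, b in zip(container_a, container_b)]
print(rolled_up)   # [190, 310, 225, 340]: the "counter" goes down at the third point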
Metrics are math, and the math matters. This is why tools need metric-specific logic, just like the event-centric logic that already exists for logs and traces.
Why It Matters
If we can't rely on a shared understanding of metrics, observability suffers. Incidents take longer to resolve. Alerting becomes noisy. Teams lose faith in their data.
The path forward isn't about creating another standard. It's about developing better tooling that simplifies format handling, smarter ways to aggregate and interpret data, and education that helps teams use metrics effectively without needing a math degree.
By treating metrics as a unique form of telemetry with its own structure and challenges, we can remove the guesswork and empower teams to act with confidence. It's time to build with clarity in mind, not just for machines, but for the humans interpreting the data.
About the author: Josh Biggley is a staff product manager at Cribl. A 25-year veteran of the tech industry, Biggley loves to talk about monitoring, observability, OpenTelemetry, network telemetry, and all things nerdy. He has experience with Fortune 25 companies and pre-seed startups alike, across manufacturing, healthcare, government, and consulting verticals.
Related Items:
2025 Observability Predictions and Observations
Data Observability in the Age of AI: A Guide for Data Engineers
Cribl CEO Clint Sharp Discusses the Data Observability Deluge