
In the world of monitoring software, how you process telemetry data can significantly impact your ability to derive insights, troubleshoot issues, and manage costs.
There are two primary use cases for how telemetry data is leveraged:
- Radar (monitoring of systems) usually falls into the bucket of known knowns and known unknowns. This leads to scenarios where some data is almost "pre-determined" to behave, or be plotted, in a certain way, because we know what we're looking for.
- Blackbox (debugging, RCA, etc.), on the other hand, is more about unknown unknowns: the things we don't know and may need to hunt for to build an understanding of the system.
Understanding Telemetry Data Challenges
Before diving into processing approaches, it's important to understand the unique challenges of telemetry data:
- Volume: Modern systems generate enormous amounts of telemetry data
- Velocity: Data arrives in continuous, high-throughput streams
- Variety: Multiple formats across metrics, logs, traces, profiles, and events
- Time-sensitivity: Value often decreases with age
- Correlation needs: Data from different sources must be linked together
These characteristics create specific considerations when choosing between ETL and ELT approaches.
ETL for Telemetry: Transform-First Architecture
Technical Architecture
In an ETL approach, telemetry data undergoes transformation before reaching its final destination:
A typical implementation stack might include:
- Collection: OpenTelemetry, Prometheus, Fluent Bit
- Transport: Kafka, Kinesis, or in-memory buffering
- Transformation: Stream processing
- Storage: Time-series databases (Prometheus), specialized indices, or object storage (S3)
Key Technical Components
- Aggregation Strategies
Pre-aggregation significantly reduces data volume and query complexity. A typical pre-aggregation flow looks like this:
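A minimal sketch of the idea in Python (the record fields and the simple nearest-rank percentile math are illustrative, not any specific stream processor's API):

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute aggregation windows


def aggregate_latencies(raw_points):
    """Group raw latency points by (service, endpoint, window) and reduce
    each group to a count plus p50/p90/p99 summaries."""
    windows = defaultdict(list)
    for p in raw_points:  # p: {"ts": epoch_secs, "service": ..., "endpoint": ..., "latency_ms": ...}
        bucket = int(p["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[(p["service"], p["endpoint"], bucket)].append(p["latency_ms"])

    for (service, endpoint, bucket), values in windows.items():
        values.sort()
        last = len(values) - 1
        yield {
            "service": service,
            "endpoint": endpoint,
            "window_start": bucket,
            "count": len(values),
            "p50": values[int(0.50 * last)],
            "p90": values[int(0.90 * last)],
            "p99": values[int(0.99 * last)],
        }
```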
This transformation condenses raw data into 5-minute summaries, dramatically reducing storage requirements and improving query performance.
Example: For a gaming application handling millions of requests per day, raw request latency metrics (potentially billions of data points) can be grouped by service and endpoint, then aggregated into 5-minute (or 1-minute) windows. A single API endpoint generating 100 latency data points per second (8.64 million per day) is reduced to just 288 aggregated entries per day (one per 5-minute window), while still preserving the critical p50/p90/p99 percentiles needed for SLA monitoring.
- Cardinality Management
High-cardinality metrics can break time-series databases. The cardinality management process follows this pattern:
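A hedged sketch of such a label filter (the drop-list and path pattern are illustrative; real collectors express comparable relabeling and transform rules in configuration):

```python
import re

DROP_LABELS = {"user_id", "session_id"}  # configurable: unbounded cardinality
ID_SEGMENT = re.compile(r"/\d+")         # numeric path segments like /users/123


def reduce_cardinality(labels):
    """Drop unbounded labels and normalize dynamic URL segments to placeholders."""
    reduced = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    if "path" in reduced:
        reduced["path"] = ID_SEGMENT.sub("/{id}", reduced["path"])
    return reduced


# {"service": "api", "path": "/users/123/profile", "user_id": "u-42"}
# becomes {"service": "api", "path": "/users/{id}/profile"}
```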
Effective strategies include:
- Label filtering and normalization
- Strategic aggregation of specific dimensions
- Hashing techniques for high-cardinality values while preserving query patterns
Example: A microservice monitoring HTTP requests includes user IDs and request paths in its metrics. With 50,000 daily active users and thousands of unique URL paths, this creates millions of unique label combinations. The cardinality management system filters out user IDs entirely (configurable, too high cardinality), normalizes URL paths by replacing dynamic segments with placeholders (e.g., /users/123/profile becomes /users/{id}/profile), and applies consistent categorization to errors. This reduces unique time series from millions to hundreds, allowing the time-series database to function efficiently.
- Real-time Enrichment
Adding context to metrics during the transformation phase involves integrating external data sources:
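For illustration, a minimal enrichment step that joins metrics against a service-registry lookup might look like this (the registry fields are assumptions, not a specific product's schema):

```python
# Hypothetical in-memory registry; in practice this would be fed from an
# external source such as a service catalog or deployment metadata.
SERVICE_REGISTRY = {
    "payments-api": {"tier": "critical", "slo_availability": 99.99, "owner": "payments-team"},
}


def enrich(metric):
    """Attach ownership, tier, and SLO context to a raw metric record."""
    meta = SERVICE_REGISTRY.get(metric["service"], {})
    return {**metric, **meta}


enrich({"service": "payments-api", "error_rate": 0.02})
# -> {"service": "payments-api", "error_rate": 0.02, "tier": "critical",
#     "slo_availability": 99.99, "owner": "payments-team"}
```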
This process adds crucial business and operational context to raw telemetry data, enabling more meaningful analysis and alerting based on service importance, customer impact, and other factors beyond purely technical metrics.
Example: A payment processing service emits basic metrics like request counts, latencies, and error rates. The enrichment pipeline joins this telemetry with service registry data to add metadata about the service tier (critical), SLO targets (99.99% availability), and team ownership (payments-team). It then incorporates business context to tag transactions with their type (subscription renewal, one-time purchase, refund) and estimated revenue impact. When an incident occurs, alerts are automatically prioritized based on business impact rather than just technical severity, and routed to the right team with rich context.
Technical Advantages
- Query performance: Pre-calculated aggregates eliminate computation at query time
- Predictable resource utilization: Both storage and query compute are controlled
- Schema enforcement: Data conformity is guaranteed before storage
- Optimized storage formats: Data can be stored in formats optimized for specific access patterns
Technical Limitations
- Loss of granularity: Some detail is permanently lost
- Schema rigidity: Adapting to new requirements requires pipeline changes
- Processing overhead: Real-time transformation adds complexity and resource demands
- Transformation-time decisions: Analysis paths must be known upfront
ELT for Telemetry: Raw Storage with Flexible Transformation
Technical Architecture
ELT architecture prioritizes getting raw data into storage, with transformations performed at query time:
A typical implementation might include:
- Collection: OpenTelemetry, Prometheus, Fluent Bit
- Transport: Direct ingestion without complex processing
- Storage: Object storage (S3, GCS) or data lakes in Parquet format
- Transformation: SQL engines (Presto, Athena), Spark jobs, or specialized OLAP systems
Key Technical Components
- Efficient Raw Storage
Optimizing for long-term storage of raw telemetry requires careful consideration of file formats and storage organization:
This approach leverages columnar storage formats like Parquet with appropriate compression (ZSTD for traces, Snappy for metrics), dictionary encoding, and optimized column indexing based on common query patterns (trace_id, service, time ranges).
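As a rough sketch, writing a batch of spans to Parquet with ZSTD compression and dictionary encoding could look like this with pyarrow (the span fields and output path are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

spans = pa.table({
    "trace_id":    ["a1b2", "a1b2", "c3d4"],
    "service":     ["checkout", "payments", "checkout"],
    "start_time":  [1699351200000, 1699351200120, 1699351201000],  # epoch millis
    "duration_ms": [120, 45, 310],
})

# Columnar layout plus ZSTD and dictionary encoding keeps repeated values
# (service names, operation names) cheap and makes trace_id/service/time
# filters efficient to evaluate.
pq.write_table(spans, "traces/spans.parquet", compression="zstd", use_dictionary=True)
```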
Example: A cloud-native application generates 10TB of trace data daily across its distributed services. Instead of discarding or heavily sampling this data, the complete trace information is captured using OpenTelemetry collectors and converted to Parquet format with ZSTD compression. Key fields like trace_id, service name, and timestamp are indexed for efficient querying. This approach reduces the storage footprint by 85% compared to raw JSON while maintaining query performance. When a critical customer-impacting issue occurred, engineers were able to access full trace data from three months prior, identifying a subtle pattern of intermittent failures that would have been lost with traditional sampling.
- Partitioning Strategies
Effective partitioning is critical for query performance against raw telemetry. A well-designed partitioning strategy follows this hierarchy:
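A small sketch of how a partition prefix might be built under that hierarchy (the key order mirrors the example path further below; treating service as the innermost key is an assumption):

```python
from datetime import datetime, timezone


def partition_prefix(telemetry_type, ts, tenant, service):
    """Build an object-store prefix: telemetry type, then time, then tenant, then service."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (
        f"{telemetry_type}/year={t.year}/month={t.month:02d}/day={t.day:02d}/"
        f"hour={t.hour:02d}/tenant={tenant}/service={service}/"
    )


partition_prefix("traces", 1699365600, "enterprise-x", "checkout")
# 'traces/year=2023/month=11/day=07/hour=14/tenant=enterprise-x/service=checkout/'
```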
This partitioning approach enables efficient time-range queries while also allowing filtering by service and tenant, which are common query dimensions. The partitioning strategy is designed to:
- Optimize for time-based retrieval (the most common query pattern)
- Enable efficient tenant isolation for multi-tenant systems
- Allow service-specific queries without scanning all data
- Separate telemetry types for optimized storage formats per type
Example: A SaaS platform with 200+ enterprise customers uses this partitioning strategy for its observability data lake. When a high-priority customer reports an issue that occurred last Tuesday between 2-4pm, engineers can immediately query just those specific partitions: /year=2023/month=11/day=07/hour=1[4-5]/tenant=enterprise-x/*. This approach reduces the scan size from potentially petabytes to just a few gigabytes, enabling responses in seconds rather than hours. When comparing current performance against historical baselines, the time-based partitioning allows efficient month-over-month comparisons by scanning only the relevant time partitions.
- Query-time Transformations
SQL and analytical engines provide powerful query-time transformations. The query processing flow for on-the-fly analysis looks like this:
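For example, an on-the-fly analysis over raw Parquet files might look like the following DuckDB sketch (the paths, column names, and thresholds are illustrative):

```python
import duckdb

# Percentiles and error rates computed directly from raw span data. With
# hive_partitioning enabled, the directory keys (year/month/day/hour/tenant)
# become columns, so the tenant filter prunes partitions before scanning.
result = duckdb.sql("""
    SELECT
        service,
        approx_quantile(duration_ms, 0.95) AS p95_ms,
        avg(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS error_rate,
        count(*) AS requests
    FROM read_parquet('lake/traces/year=2023/month=11/day=07/**/*.parquet',
                      hive_partitioning = true)
    WHERE tenant = 'enterprise-x'
    GROUP BY service
    ORDER BY p95_ms DESC
""").df()
```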
This query flow demonstrates how complex analysis like calculating service latency percentiles, error rates, and usage patterns can be performed entirely at query time without pre-computation. The analytical engine applies optimizations like predicate pushdown, parallel execution, and columnar processing to achieve reasonable performance even against large raw datasets.
Example: A DevOps team investigating a performance regression discovered it only affected premium customers using a specific feature. Using query-time transformations against the ELT data lake, they wrote a single query that filtered to the affected time period, joined customer tier information, extracted relevant attributes about feature usage, calculated percentile response times grouped by customer segment, and identified that premium customers with high transaction volumes experienced degraded performance only when a specific optional feature flag was enabled. This analysis would have been impossible with pre-aggregated data, since the customer segment + feature flag dimension hadn't previously been identified as important to track.
Technical Advantages
- Schema flexibility: New dimensions can be analyzed without pipeline changes
- Cost-effective storage: Object storage is significantly cheaper than specialized databases
- Retroactive analysis: Historical data can be examined with new perspectives
Technical Limitations
- Query performance challenges: Interactive analysis may be slow on large datasets
- Resource-intensive analysis: Compute costs can be high for complex queries
- Implementation complexity: Requires more sophisticated query tooling
- Storage overhead: Raw data consumes significantly more space
Technical Implementation: The Hybrid Approach
Core Architecture Components
Implementation Strategy
- Dual-path processing
Example: A global ride-sharing platform implemented a dual-path telemetry system that routes service health metrics and customer experience indicators (ride wait times, ETA accuracy) through the ETL path for real-time dashboards and alerting. Meanwhile, all raw data, including detailed user journeys, driver movements, and application logs, flows through the ELT path to cost-effective storage. When a regional outage occurred, operations teams used the real-time dashboards to quickly identify and mitigate the immediate issue. Later, data scientists used the preserved raw data to perform a comprehensive root cause analysis, correlating multiple factors that wouldn't have been visible in pre-aggregated data alone.
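At its simplest, the dual-path idea is a fan-out: everything lands in the raw data lake, and operationally important signals are additionally pushed through the hot aggregation path. A minimal sketch (the event fields and sink objects are placeholders):

```python
def handle_event(event, realtime_sink, lake_sink):
    """Fan each event out: always keep the raw record (ELT path), and also
    feed operational signals to the real-time pipeline (ETL path)."""
    lake_sink.append(event)  # raw, cost-effective, long-retention storage
    if event.get("kind") in {"service_health", "customer_experience"}:
        realtime_sink.append(event)  # aggregation, dashboards, alerting
```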
- Smart data routing
Example: A financial services company deployed a smart routing system for its telemetry data. All data is preserved in the data lake, but critical metrics like transaction success rates, fraud detection signals, and authentication service health are immediately routed to the real-time processing pipeline. Additionally, any security-related events such as failed login attempts, permission changes, or unusual access patterns are immediately sent to a dedicated security analysis pipeline. During a recent security incident, this routing enabled the security team to detect and respond to an unusual pattern of authentication attempts within minutes, while the complete context of user journeys and application behavior was preserved in the data lake for subsequent forensic analysis.
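A hedged sketch of such routing rules (the metric and event names are illustrative; real systems typically express these as declarative pipeline configuration):

```python
CRITICAL_METRICS = {"transaction_success_rate", "fraud_signal", "auth_service_health"}
SECURITY_EVENTS = {"login_failed", "permission_change", "unusual_access"}


def route(event):
    """Return every destination an event should be sent to: all events are
    preserved in the lake; hot and security paths get extra copies."""
    destinations = ["data_lake"]
    if event.get("metric") in CRITICAL_METRICS:
        destinations.append("realtime_pipeline")
    if event.get("type") in SECURITY_EVENTS:
        destinations.append("security_pipeline")
    return destinations
```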
- Unified query interface
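One way to sketch the routing decision behind such an interface (the retention window and the set of pre-aggregated dimensions are assumptions):

```python
from datetime import timedelta

PREAGG_DIMENSIONS = {"service", "endpoint", "region"}  # dimensions kept by the ETL path
PREAGG_RETENTION = timedelta(days=30)                   # hot-store retention window


def choose_backend(query):
    """Serve a query from pre-aggregated storage when it stays within the hot
    retention window and known dimensions; otherwise fall back to the data lake."""
    recent = query["lookback"] <= PREAGG_RETENTION
    known_dims = set(query["group_by"]) <= PREAGG_DIMENSIONS
    return "preaggregated_store" if recent and known_dims else "data_lake"


choose_backend({"lookback": timedelta(hours=6), "group_by": ["service"]})       # hot store
choose_backend({"lookback": timedelta(days=90), "group_by": ["feature_flag"]})  # data lake
```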
Real-world Implementation Example
A specific engineering implementation at last9.io demonstrates how this hybrid approach works in practice:
For a large-scale Kubernetes platform with hundreds of clusters and thousands of services, we implemented a hybrid telemetry pipeline with:
- Critical-path metrics processed through a pipeline that:
  - Performs dimensional reduction (limiting label combinations)
  - Pre-calculates service-level aggregations
  - Computes derived metrics like success rates and latency percentiles
- Raw telemetry stored in a cost-effective data lake:
  - Partitioned by time, data type, and tenant
  - Optimized for typical query patterns
  - Compressed with appropriate formats (ZSTD for traces, Snappy for metrics)
- Unified query layer that:
  - Routes dashboard and alerting queries to pre-aggregated storage
  - Redirects exploratory and ad-hoc analysis to the data lake
  - Manages correlation queries across both systems
This approach delivered both the query performance needed for real-time operations and the analytical depth required for complex troubleshooting.
Decision Framework
When architecting telemetry pipelines, these technical considerations should guide your approach:
Decision Factor | Use ETL | Use ELT
--- | --- | ---
Query latency requirements | Sub-second | Can wait minutes
Data retention needs | Days/Weeks | Months/Years
Cardinality | Low/Medium | Very high
Analysis patterns | Well-defined | Exploratory
Budget priority | Compute | Storage
Conclusion
The technical realities of telemetry data processing demand thinking beyond simple ETL vs. ELT paradigms. Engineering teams should architect tiered systems that leverage the strengths of both approaches:
- ETL-processed data for operational use cases requiring immediate insights
- ELT-processed data for deeper analysis, troubleshooting, and historical patterns
- Metadata-driven routing to intelligently direct queries to the appropriate tier
This engineering-centric approach balances performance requirements with cost considerations while maintaining the flexibility required in modern observability systems.
About the author: Nishant Modak is the founder and CEO of Last9, a high cardinality observability platform company backed by Sequoia India (now PeakXV). He has been an entrepreneur and worked with large-scale companies for nearly two decades.
Related Items:
From ETL to ELT: The Next Generation of Data Integration Success
50 Years Of ETL: Can SQL For ETL Be Replaced?