This post is co-written with Ido Ziv from Kaltura.
As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.
At Kaltura, an AI-infused video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn't just about troubleshooting; it's about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.
In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service, achieving longer log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.
Observability challenges at scale
Kaltura ingests over 8 TB of logs and traces daily, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant observability challenges. Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to growing operational complexity and fixed costs that didn't scale well with usage.
Kaltura's DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management.
Solution overview
Kaltura created a new AWS account dedicated to observability, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2).
By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards for a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.
Ingestion
As seen in the following diagram, logs are shipped using log shippers, also known as collectors; in Kaltura's case, Fluent Bit. A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as a log analytics platform, a management system, or an aggregator. Fluent Bit was used for all sources and also provided light processing capabilities. It was deployed as a DaemonSet in Kubernetes, so application development teams didn't have to change their code, because the Fluent Bit pods read the stdout of the application pods.
The following code is an example of the Fluent Bit configuration for Amazon EKS:
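Kaltura's exact configuration isn't reproduced here; the following sketch shows the general shape of a Fluent Bit DaemonSet configuration that tails container stdout, enriches it with Kubernetes metadata, and ships it to an OpenSearch Ingestion HTTP endpoint with SigV4 signing. The paths, endpoint, Region, and ingest URI are placeholders.

```ini
# Illustrative only: tail container logs, add Kubernetes metadata,
# and POST them to an OpenSearch Ingestion pipeline using SigV4 auth.
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    multiline.parser  docker, cri
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On

[FILTER]
    Name      kubernetes
    Match     kube.*
    Merge_Log On
    Keep_Log  Off

[OUTPUT]
    Name        http
    Match       kube.*
    Host        logs-pipeline-abc123.us-east-1.osis.amazonaws.com
    Port        443
    URI         /log/ingest
    Format      json
    tls         On
    aws_auth    On
    aws_region  us-east-1
    aws_service osis
```

The aws_auth, aws_region, and aws_service options enable Fluent Bit's built-in SigV4 request signing on the HTTP output, which OpenSearch Ingestion requires.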
Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry (OTel) Collector using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.
Data from Fluent Bit and the OpenTelemetry Collector was sent to OpenSearch Ingestion, a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Each producer sent data to a dedicated pipeline, one for logs and one for traces, where data was transformed, aggregated, enriched, and normalized before being sent to OpenSearch Service. The trace pipeline used the otel_trace and service_map processors, following the OpenSearch Ingestion OpenTelemetry trace analytics blueprint.
The following code is an example of the OpenSearch Ingestion pipeline for logs:
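The original pipeline definition isn't reproduced here; the following is a sketch of a Data Prepper (OpenSearch Ingestion) log pipeline using the processors described next. The Grok pattern, endpoints, role ARNs, and bucket names are illustrative placeholders.

```yaml
version: "2"
log-pipeline:
  source:
    http:
      path: "/log/ingest"
  processor:
    # Tag every event with a default log_type, then refine it by filename
    - add_entries:
        entries:
          - key: "log_type"
            value: "default"
          - key: "log_type"
            value: "api"
            overwrite_if_key_exists: true
            add_when: 'contains(/filename, "api.log")'
    # Parse api logs with a custom Grok pattern (pattern is illustrative)
    - grok:
        grok_when: '/log_type == "api"'
        match:
          message:
            - '%{TIMESTAMP_ISO8601:timestamp} %{IP:logIp} %{HOSTNAME:host} %{WORD:priorityName} \(%{INT:priority}\) memory: %{DATA:memory} real: %{DATA:real} %{GREEDYDATA:message}'
    # Normalize timestamp strings into @timestamp
    - date:
        match:
          - key: "timestamp"
            patterns: ["ISO8601", "yyyy-MM-dd HH:mm:ss"]
        destination: "@timestamp"
    - rename_keys:
        entries:
          - from_key: "date"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false
    # Drop noisy, irrelevant logs
    - drop_events:
        drop_when: 'contains(/filename, "simplesamlphp.log")'
  sink:
    - opensearch:
        hosts: ["https://search-observability-xyz.us-east-1.es.amazonaws.com"]
        index: "logs-app"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline-role"
        dlq:
          s3:
            bucket: "osis-dlq-bucket"
            region: "us-east-1"
```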
The preceding example shows the use of processors such as grok, date, add_entries, rename_keys, and drop_events:
- add_entries:
  - Adds a new field log_type based on the filename
  - Defaults to "default"
  - If the filename contains specific substrings (such as api.log or stats.log), it assigns a more specific type
- grok:
  - Applies Grok parsing to logs of type "api"
  - Extracts fields such as timestamp, logIp, host, priorityName, priority, memory, real, and message using a custom pattern
- date:
  - Parses timestamp strings into a standard datetime format
  - Stores the result in a field called @timestamp in ISO8601 format
  - Handles multiple timestamp patterns
- rename_keys:
  - Renames timestamp or date to @timestamp
  - Doesn't overwrite @timestamp if it already exists
- drop_events:
  - Drops logs whose filename contains simplesamlphp.log
  - This is a filtering rule to ignore noisy or irrelevant logs
The following is an example of an input log line:
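The original sample isn't shown here; a made-up api.log line matching the Grok fields described above might look like this:

```
2024-05-12T08:30:15+00:00 10.0.3.17 api-prod-7f9c4 INFO (6) memory: 42MB real: 38MB partner 123: action list finished in 120ms
```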
After processing, we get the following output:
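The original output isn't shown here; after the add_entries, grok, and date steps, a processed document would look roughly like the following (all field values are illustrative):

```json
{
  "@timestamp": "2024-05-12T08:30:15.000Z",
  "log_type": "api",
  "filename": "/var/log/containers/api.log",
  "logIp": "10.0.3.17",
  "host": "api-prod-7f9c4",
  "priorityName": "INFO",
  "priority": "6",
  "memory": "42MB",
  "real": "38MB",
  "message": "partner 123: action list finished in 120ms"
}
```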
Kaltura followed several OpenSearch Ingestion best practices, such as:
- Including a dead-letter queue (DLQ) in the pipeline configuration, which can significantly help with troubleshooting pipeline issues.
- Starting and stopping pipelines to optimize cost-efficiency, when possible.
- During the proof of concept stage:
  - Installing Data Prepper locally for faster development iterations.
  - Disabling persistent buffering to expedite blue/green deployments.
Achieving operational excellence with efficient log and trace management
Logs and traces play a vital role in identifying operational issues, but they come with unique challenges. First, they represent time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management crucial. Third, they're append-only in nature. With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.
Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Multiple Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. The ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting them. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.
The following example ISM policy makes sure indexes are managed efficiently: rolled over and moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it's retried with an exponential backoff strategy. In case of failures, notifications are sent to the relevant teams to keep them informed.
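The policy below is a sketch of that lifecycle, not Kaltura's actual policy; the rollover thresholds, transition ages, notification channel ID, and index pattern are placeholders:

```json
{
  "policy": {
    "description": "Illustrative hot -> UltraWarm -> delete lifecycle",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "rollover": { "min_size": "50gb", "min_index_age": "1d" }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "warm_migration": {}
          }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "error_notification": {
      "channel": { "id": "notification-channel-id" },
      "message_template": { "source": "ISM action failed for index {{ctx.index}}" }
    },
    "ism_template": [
      { "index_patterns": ["logs-*"], "priority": 100 }
    ]
  }
}
```

The warm_migration action moves the index to UltraWarm, and the ism_template block attaches the policy automatically to new indexes matching the pattern.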
To create a data stream in OpenSearch, an index template definition is required, which configures how the data stream and its backing indexes will behave. In the following example, the index template specifies key index settings such as the number of shards, replication, and refresh interval, controlling how data is distributed, replicated, and refreshed across the cluster. It also defines the mappings, which describe the structure of the data: what fields exist, their types, and how they should be indexed. These mappings make sure the data stream knows how to interpret and store incoming log data efficiently. Finally, the template enables the @timestamp field as the time-based field required for a data stream.
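The template below is an illustrative Dev Tools request, not Kaltura's actual template; the template name, index pattern, shard counts, and field list are placeholders:

```
PUT _index_template/logs-app-template
{
  "index_patterns": ["logs-app*"],
  "data_stream": {
    "timestamp_field": { "name": "@timestamp" }
  },
  "priority": 200,
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "log_type":   { "type": "keyword" },
        "host":       { "type": "keyword" },
        "priority":   { "type": "integer" },
        "message":    { "type": "text" }
      }
    }
  }
}
```

With this template in place, writing a document to logs-app automatically creates the data stream and its first backing index.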
Implementing role-based access control and user access
The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow.
Each user accesses the dashboards to view the observability objects relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC). When users log in to the OpenSearch domain, they're automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging in development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration alleviates the need to manage internal OpenSearch users entirely.
For each role at Kaltura, a corresponding OpenSearch role was created with only the required permissions. For instance, support engineers are granted access to the alerting plugin to create alerts based on logs, whereas QA engineers, who don't require this functionality, aren't granted that access.
The following screenshot shows the DevOps engineer role defined with cluster permissions.
These users are routed to their own dedicated DevOps tenant, to which only they have write access. This makes it possible for users in different roles at Kaltura to create the dashboard objects that focus on their priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped each Okta group to a role, so when a user logs in from Okta, they're automatically assigned the appropriate role.
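A backend role mapping of this kind can be expressed through the Security plugin's REST API; the role name and Okta group below are illustrative:

```
PUT _plugins/_security/api/rolesmapping/devops_role
{
  "backend_roles": ["okta-devops-group"],
  "hosts": [],
  "users": []
}
```

Any user whose SAML assertion carries the okta-devops-group backend role is then granted the devops_role permissions on login, with no internal OpenSearch user required.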
This also works with IAM roles to facilitate automation in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot.
Using observability features and service mapping for enhanced trace and log correlation
After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can investigate traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The service_map processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards.
Using the combined signals of traces and spans, OpenSearch discovers application connectivity and maps it into a service map.
After OpenSearch ingests the traces and spans from OTel, they're aggregated into groups according to paths and trends. Durations are also calculated and presented to the user over time.
With a trace ID, it's possible to filter all the related spans by service and see how long each took, identifying issues with external services such as MongoDB and Redis.
From the spans, users can discover the related logs.
Post-migration enhancements
After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.
One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.
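A replay pipeline of this kind might look like the following sketch, using the Data Prepper S3 scan source; the bucket, prefix, role ARNs, and endpoint are placeholders:

```yaml
version: "2"
s3-replay-pipeline:
  source:
    s3:
      compression: "gzip"
      codec:
        newline: {}
      # Scan archived objects rather than reacting to new-object events
      scan:
        buckets:
          - bucket:
              name: "archived-logs-bucket"
              filter:
                include_prefix: ["2024/"]
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/s3-replay-role"
  sink:
    - opensearch:
        hosts: ["https://search-observability-xyz.us-east-1.es.amazonaws.com"]
        index: "logs-replay"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline-role"
```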
In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating more AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.
Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code tools like Terraform, developers can define index mappings, alerts, and dashboards as code, versioned in Git and deployed consistently across environments.
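As one possible shape for this workflow, the community opensearch-project Terraform provider exposes ISM policies and index templates as resources; the provider source, resource names, and file paths below are assumptions, not Kaltura's actual setup:

```hcl
# Illustrative: an ISM policy and index template managed as code,
# reviewed in Git and applied per environment.
terraform {
  required_providers {
    opensearch = {
      source = "opensearch-project/opensearch"
    }
  }
}

resource "opensearch_ism_policy" "app_logs" {
  policy_id = "app-logs-retention"
  body      = file("${path.module}/policies/app-logs-retention.json")
}

resource "opensearch_index_template" "app_logs" {
  name = "logs-app-template"
  body = file("${path.module}/templates/logs-app-template.json")
}
```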
Conclusion
Kaltura successfully implemented a smart log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements. If Kaltura's journey has inspired you and you're interested in implementing a similar solution in your organization, consider these steps:
- Start by understanding the requirements and setting expectations with the engineering teams in your organization
- Start with a quick proof of concept to get hands-on experience
- Refer to the following resources to help you get started:
About the authors
Ido Ziv is a DevOps team leader at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).
Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.
Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to use data, gain insights, and derive value.