This is a guest post co-written with Shashidhar Soppin, Manochandra Menni, and Anchal Kansal from Zeta.
Zeta is a core banking technology provider that enables banks to rapidly launch extensible banking asset and liability products. Zeta’s main products are Olympus and Tachyon. Olympus is a platform as a service (PaaS) that simplifies building and operating cloud-native, secure, and distributed multi-tenant software as a service (SaaS) products. It blends infrastructure as code and GitOps methodologies for efficient and consistent deployment of SaaS products. Its architecture prioritizes strong tenant isolation, real-time event processing, and comprehensive observability, supporting robust API integrations and seamless deployment. Zeta’s Tachyon is a full-stack, cloud-native, API-first digital banking SaaS service delivered through Olympus. The banking services of Tachyon include payment engines (for UPI, credit, debit, and prepaid cards), savings and checking account management, and more. Tachyon is a modern debit processing product with personal finance management and card controls. It’s designed to increase usage, upsell credit, reduce fraud, and improve customer satisfaction. The Tachyon product offers comprehensive provisioning, payments, and account management APIs and SDKs, enabling seamless integration of financial products into third-party apps without compromising privacy and security. Zeta operates Tachyon as a multi-tenant SaaS product, serving customers who are configured as individual tenants within the system. Zeta’s technology stack is monitored by their Customer Service Navigator (CSN) product, which is part of Olympus.
As a global SaaS provider, Zeta needed a solution capable of monitoring tenants, measuring SLAs, meeting local regulatory requirements, and scaling efficiently with both new tenant onboarding and seasonal usage spikes. Zeta sought a cost-effective, scalable system that would provide a unified “single pane of glass” to monitor the application services, cloud infrastructure, open source components, and third-party products.
Zeta faced a formidable challenge in orchestrating a cohesive monitoring system across a rapidly expanding multi-tenant environment, diverse domains, and numerous tools. As more tenants joined their system, the complexity grew exponentially, making Zeta’s monitoring solution increasingly difficult to maintain. The primary challenge stemmed from fragmented monitoring tools that made it difficult to quickly identify root causes across interconnected systems, leading to prolonged troubleshooting times and potential service degradation. When users reported issues, such as credit card payment problems, the Site Reliability Engineering (SRE) team had to navigate multiple disparate monitoring tools and siloed data, and the lack of integrated observability resulted in time-consuming manual correlation efforts. This multi-tenant, multi-solution landscape significantly complicated the ability to maintain consistent monitoring standards and service levels. The challenge was further complicated by the complex regulatory landscape, where global expansion required adherence to diverse local regulations, necessitating a flexible architecture capable of accommodating varying data retention policies and access controls across different jurisdictions. Each new tenant addition multiplied the complexity of balancing the monitoring needs of internal SRE teams and customers, requiring sophisticated data segregation and access management. Additionally, Zeta required comprehensive anomaly detection capabilities across systems, components, infrastructure, and operations, which called for a solution that could scale dynamically while establishing dynamic baselines and identifying subtle patterns that might indicate emerging issues.
As the tenant base continued to grow, the need for a unified, scalable monitoring solution that could streamline these processes, enhance operational visibility, and maintain system integrity became critical.
Zeta’s goal was to streamline their processes and enhance operational visibility across the entire technology landscape. By addressing these challenges, Zeta aimed to create a unified observability solution that would significantly improve incident response times, enhance regulatory compliance posture, and ultimately deliver a more reliable and performant service to their global customer base.
In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from 30+ minutes to under 5 minutes.
Solution overview
Zeta designed and built an observability system, CSN, to deliver comprehensive visibility across the service environment. CSN is part of the Olympus suite of products. CSN serves as the primary interface for the SRE team, offering real-time service health dashboards, infrastructure monitoring, SLA performance analytics, and an admin panel for user management. The system is equipped with single sign-on (SSO) integration and enforces role-based access control (RBAC) to enable secure, granular access. With CSN, SREs can efficiently monitor system health, receive actionable alerts and warnings, and manage operational workflows across critical services.
CSN is powered by OpenSearch Service to provide an integrated solution that helps DevOps and site reliability engineers identify critical events and issues. Zeta chose OpenSearch Service because it offers a fully managed, open source search and analytics engine that scales to handle the growing number of tenants, associated data growth, and analytics needs. Its seamless integration with AWS services, robust security features, and support for real-time data ingestion and querying make it ideal for powering the CSN dashboards and analytics workloads. The following diagram illustrates the CSN deployment architecture.
The OpenSearch Service domain uses the Multi-AZ with Standby deployment model, following AWS best practices for high availability and fault tolerance. Nodes (including dedicated cluster manager nodes, data nodes, and UltraWarm nodes) are distributed evenly across three Availability Zones in the same AWS Region. Availability Zones 1 and 2 handle active indexing and search traffic, and Availability Zone 3 contains standby nodes that remain passive during normal operations. If an Availability Zone failure occurs, OpenSearch Service automatically promotes standby nodes to active status, maintaining cluster operations with minimal disruption and no need for data redistribution.
The OpenSearch cluster consists of three dedicated cluster manager nodes and a data node count that is a multiple of three, to maintain quorum and balanced shard allocation. Each index uses at least two replicas, providing redundant copies of data across the Availability Zones. This Multi-AZ with Standby configuration delivers high resilience and rapid failover, supporting continuous service availability and robust disaster recovery for the observability workloads.
Data collection and ingestion
The observability strategy centers on a data collection and ingestion pipeline designed to handle the complexity and scale of Zeta’s environment. The architecture, as shown in the following diagram, addresses three critical data types: AWS resource logs, application logs, and distributed traces, with each data type using tailored collection and processing methods optimized for its workloads.
AWS resource logs collection
The infrastructure spans multiple AWS services, including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Application Load Balancer, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Elastic Compute Cloud (Amazon EC2), and more. Zeta uses Amazon CloudWatch Logs as the primary collection point for AWS service logs, which provides native integration with these services.
AWS services deliver their logs directly to CloudWatch Logs, from which Fluentd running on the Amazon EKS cluster pulls them for centralized processing. This approach natively captures operational data from the AWS resources, including:
- Database operational logs and audit trails from Amazon RDS instances
- Data warehouse query execution logs from Amazon Redshift
- Application Load Balancer access logs capturing traffic patterns and performance metrics
- Kafka cluster operational logs from Amazon MSK
- AWS API invocation audit trails from AWS CloudTrail
- Container runtime and operating system logs from Amazon EC2

During log collection, personally identifiable information (PII) is filtered out. The solution adheres strictly to PCI DSS guidelines throughout this process.
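To make the PII filtering step concrete, the following is a minimal Python sketch of redacting sensitive values from a log line before it is forwarded. It is an illustration only: the post does not describe how Zeta implements filtering in its Fluentd pipeline, and the two patterns and the `redact` function name are assumptions chosen for the example.

```python
import re

# Hypothetical patterns for illustration; a real deployment would derive its
# rules from its own PCI DSS data classification, not from these two regexes.
CARD_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(line: str) -> str:
    """Replace card-number-like and email-like substrings with placeholders."""
    line = CARD_NUMBER.sub("[REDACTED_PAN]", line)
    return EMAIL.sub("[REDACTED_EMAIL]", line)


if __name__ == "__main__":
    print(redact("card 4111 1111 1111 1111 declined"))
    print(redact("user=jane.doe@example.com status=ok"))
```

Pattern-based redaction of this kind is deliberately conservative: any long digit run is masked, which can also hide harmless numeric identifiers, so the patterns would need tuning against real log formats.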
Zeta used Amazon MSK as a scalable and reliable backbone for collecting and streaming logs from various sources across the AWS resources. Logs are ingested into Amazon MSK, providing a durable and fault-tolerant buffer that decouples log producers from consumers. This architecture enables real-time log streaming and supports advanced processing pipelines before the logs are routed to OpenSearch Service. Integrating Amazon MSK into the logging workflow improves scalability, resilience, and flexibility, so that high log volumes are efficiently managed without impacting downstream systems. This approach, combined with native AWS integrations, minimizes operational complexity and maintains comprehensive, centralized log visibility across the cloud environment.
Fluentd processes these logs and routes them directly to OpenSearch Service, maintaining the benefits of AWS integration while providing centralized accessibility. This centralized logging approach with built-in buffering capabilities reduces the direct load on OpenSearch Service by batching and optimizing log delivery, helping to prevent potential ingestion bottlenecks during high-volume periods. The approach removes the need for custom log shipping agents on AWS resources, reducing operational overhead while maintaining comprehensive coverage of the cloud infrastructure.
Application logs processing
For application-level observability, a pipeline using Fluentd is deployed as a Kubernetes DaemonSet. Application microservices running on Amazon EKS generate logs that the Fluentd DaemonSets collect, parse, and enrich with metadata such as pod names, namespaces, and service identifiers. The processed logs then flow through Amazon MSK for reliable, high-throughput message streaming before final processing by Fluentd and indexing in OpenSearch Service.
This Kafka-based approach provides several advantages:
- Decoupling – Producers and consumers operate independently, so that Zeta can scale ingestion and processing separately based on demand.
- Backpressure handling – Kafka’s buffering capabilities absorb sudden increases in log volume during peak banking hours and seasonal usage surges, maintaining system stability.
- Durability of logs – Message persistence keeps logs durable, so that no log data is lost during system maintenance or unexpected failures.
The logs then pass through a second Fluentd layer for final processing and routing to OpenSearch Service, where they’re indexed across service-specific indexes (app-index, falco-index, kong-index).
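The enrichment and routing steps described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Zeta’s Fluentd configuration: the index names come from the post, while the `enrich` and `target_index` functions, the source labels (`app`, `falco`, `kong`), and the metadata field names are hypothetical.

```python
# Index names are taken from the post; the source labels used as keys are
# assumptions made for this sketch.
INDEX_BY_SOURCE = {
    "app": "app-index",
    "falco": "falco-index",
    "kong": "kong-index",
}


def enrich(record: dict, pod: str, namespace: str, service: str) -> dict:
    """Return a copy of the log record with Kubernetes metadata attached."""
    return {
        **record,
        "kubernetes": {"pod_name": pod, "namespace": namespace},
        "service": service,
    }


def target_index(source: str) -> str:
    """Pick the service-specific index for a log source, failing loudly if unknown."""
    try:
        return INDEX_BY_SOURCE[source]
    except KeyError:
        raise ValueError(f"no index configured for source {source!r}") from None


if __name__ == "__main__":
    doc = enrich({"message": "payment authorized"}, "payments-7c9f", "tachyon", "payments")
    print(target_index("app"), doc)
```

Rejecting unknown sources, rather than falling back to a default index, is a design choice in this sketch: it keeps misrouted logs from silently landing in the wrong index.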
Distributed trace collection
To address the challenge of correlating issues across Zeta’s microservices architecture, the system uses distributed tracing with Jaeger, an open source, end-to-end distributed tracing system. Jaeger enables monitoring and troubleshooting of transactions in complex distributed systems by tracking requests as they flow through multiple services. The application services and Kong API Gateway are instrumented with Jaeger client libraries that generate trace data, including spans, which represent individual operations within a trace. Each span contains metadata such as operation names, start and finish timestamps, tags, and logs that provide context about the operation being performed. The Jaeger Collector aggregates these spans from multiple services, performing validation, indexing, and transformation before forwarding the data.
The traces flow through Amazon MSK for the same reliability benefits as the logging pipeline: durability, decoupling, and backpressure handling during high-volume periods. Jaeger Ingester then consumes traces from Amazon MSK and processes them for storage in the jaeger-index within OpenSearch Service.
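As a rough illustration of the span metadata described above, the following Python sketch models a span and finds the slowest operation in a trace. It is a simplified stand-in, not Jaeger’s actual data model or storage schema; the `Span` class, its field names, and the `slowest_span` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified span: one operation within a trace (not Jaeger's real schema)."""
    trace_id: str
    span_id: str
    operation_name: str
    start_us: int   # start timestamp in microseconds
    finish_us: int  # finish timestamp in microseconds
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    @property
    def duration_us(self) -> int:
        return self.finish_us - self.start_us


def slowest_span(spans: list) -> Span:
    """Return the span with the longest duration, a common first triage step."""
    return max(spans, key=lambda s: s.duration_us)


if __name__ == "__main__":
    trace = [
        Span("t1", "a", "kong.route", 0, 1_200, tags={"http.status_code": 200}),
        Span("t1", "b", "payments.authorize", 100, 1_100),
        Span("t1", "c", "db.query", 200, 400),
    ]
    print(slowest_span(trace).operation_name)
```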
This data collection and ingestion strategy provides full end-to-end visibility and builds an observability system that enables SRE teams to monitor, troubleshoot, and optimize the services across the entire technology stack.
Storage tiering
To manage the log, metric, and trace data at scale (about 3 TB generated daily), the solution implemented OpenSearch Service storage tiering to balance performance, retention, and cost. Zeta requires near real-time search and retrieval for at least a week, while retaining logs and traces for up to 10 years. Retaining this data in active clusters would impact search performance and significantly increase costs, so the solution uses the OpenSearch Service hot, UltraWarm, and cold storage tiers to optimize the data lifecycle. The following diagram illustrates storage tiering in OpenSearch Service.
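A back-of-the-envelope calculation shows why tiering matters at this volume. The Python sketch below uses figures stated in the post (3 TB per day, 7 days in hot storage, 15 days in UltraWarm, up to 10 years of retention, two replicas per index). It rests on simplifying assumptions that are not from the post: on-disk index size equals raw daily volume (no compression or indexing overhead), replicas apply only to the hot tier, and a year is 365 days.

```python
DAILY_TB = 3          # daily data volume, from the post
HOT_DAYS = 7          # hot tier retention, from the post
WARM_DAYS = 15        # UltraWarm retention, from the post
RETENTION_YEARS = 10  # maximum retention, from the post
REPLICAS = 2          # replica count per index, from the post


def tier_footprint_tb() -> dict:
    """Estimate steady-state data per tier in TB under the stated assumptions."""
    hot_primary = DAILY_TB * HOT_DAYS
    hot_with_replicas = hot_primary * (1 + REPLICAS)
    warm = DAILY_TB * WARM_DAYS  # assumed to be a single S3-backed copy
    total_retained = DAILY_TB * 365 * RETENTION_YEARS
    cold = total_retained - hot_primary - warm
    return {
        "hot_primary": hot_primary,
        "hot_with_replicas": hot_with_replicas,
        "warm": warm,
        "cold": cold,
        "total_retained": total_retained,
    }


if __name__ == "__main__":
    for tier, size_tb in tier_footprint_tb().items():
        print(f"{tier}: {size_tb} TB")
```

Under these assumptions, nearly all retained data ends up in the cold tier, with only a small fraction needing hot or UltraWarm capacity at any time.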
Hot storage is used for the most recent and frequently accessed data, supporting real-time indexing and low-latency queries. This tier relies on high-performance storage attached to standard data nodes, making it ideal for powering live dashboards and analytics where speed is critical. The solution uses AWS Graviton2 powered m6g.4xlarge.search instance types to run the OpenSearch Service domain, which provides up to 40% lower cost compared to x86-based instances. Each hot data node has an attached gp3 EBS volume to store indexes. Zeta maintains data in hot storage for 1 week.
UltraWarm storage serves as a cost-effective layer for older, read-only data that’s queried less frequently but still needs to remain searchable. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as the backing store with an integrated caching mechanism, retaining large volumes of data at a fraction of the cost of hot storage while still supporting interactive queries for historical analysis. Zeta uses ultrawarm1.large.search instance types in the UltraWarm storage tier and maintains data in UltraWarm storage for 15 days.
Cold storage is designed for long-term archival of infrequently accessed or compliance-driven data. Data in cold storage is detached from active compute resources and resides in Amazon S3, incurring minimal cost. When historical data needs to be queried, the indexes are attached to the UltraWarm nodes using OpenSearch API calls. This supports extracting historical data for audits, periodic analysis, or forensic investigations without maintaining active compute for the entire retention period, thereby reducing storage cost.
OpenSearch Service automates index transitions between hot, UltraWarm, and cold storage tiers using Index State Management (ISM) policies. ISM policies specify the conditions and actions for each state, such as transitioning based on index age, size, or document count. ISM jobs, which run every 5 to 8 minutes, evaluate the policy and move an index to the next tier when it qualifies for a transition. When indexes reach the UltraWarm threshold, they’re migrated to UltraWarm nodes backed by Amazon S3, which reduces storage costs while keeping data accessible for queries. After the UltraWarm retention period, ISM archives the indexes to cold storage, detaching them from compute resources but allowing reattachment for future queries or compliance needs. This automated lifecycle management reduces operational overhead, optimizes storage costs, and maintains performance for both recent and historical data.
For observability data, new indexes are created in the hot tier, where they remain for 7 days to support fast ingestion and low-latency queries. After this period, ISM transitions these indexes to UltraWarm storage, where they’re retained for an additional 15 days as read-only data, balancing cost with searchability.
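A policy along these lines could look like the following minimal sketch, written as a Python function that builds the ISM policy document. The post does not show Zeta’s actual policy, so treat this as an assumption-laden example: the 7-day and 15-day values come from the text, the cold transition at an index age of 22 days is derived from them (index age is counted from index creation), and the `@timestamp` field name and the `build_tiering_policy` function are hypothetical.

```python
import json


def build_tiering_policy(hot_days: int = 7, warm_days: int = 15,
                         timestamp_field: str = "@timestamp") -> dict:
    """Build an ISM policy body that moves indexes hot -> UltraWarm -> cold."""
    return {
        "policy": {
            "description": "Hot to UltraWarm to cold tiering (illustrative sketch)",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm",
                         "conditions": {"min_index_age": f"{hot_days}d"}}
                    ],
                },
                {
                    "name": "warm",
                    # warm_migration moves the index to UltraWarm storage
                    "actions": [{"warm_migration": {}}],
                    "transitions": [
                        {"state_name": "cold",
                         "conditions": {"min_index_age": f"{hot_days + warm_days}d"}}
                    ],
                },
                {
                    "name": "cold",
                    # cold_migration needs the date field used to track the index's time range
                    "actions": [{"cold_migration": {"timestamp_field": timestamp_field}}],
                    "transitions": [],
                },
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tiering_policy(), indent=2))
```

The resulting JSON is the kind of body that would be submitted to the domain’s ISM policy API; attaching it to specific index patterns is left out of this sketch.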
Security
Security is the most critical part of the architecture. Zeta’s observability system implements multiple layers of security for data confidentiality, integrity, and compliance with banking regulations, and is built using a zero-trust approach following the AWS shared responsibility model for OpenSearch Service:
- Infrastructure security: The OpenSearch Service domain is deployed within a virtual private cloud (VPC) with private subnets, isolating it from direct internet access. Security groups enforce restrictive ingress rules, allowing access only from authorized sources. The OpenSearch Service domain uses encryption at rest through AWS Key Management Service (AWS KMS). Data in transit is secured using TLS 1.3 encryption, so that log data, traces, and search queries remain protected during transmission. Service-to-service communication uses AWS Identity and Access Management (IAM) roles and encrypted connections, removing the need for hardcoded credentials.
- Access control and authentication: The solution uses Amazon OpenSearch Service fine-grained access control (FGAC) integrated with IAM, where IAM serves as the authentication provider and FGAC handles authorization by mapping IAM roles to OpenSearch backend roles. This approach helps Zeta control access permissions at the index and document level based on tenant requirements and user responsibilities. The data ingestion pipeline implements end-to-end security, with Fluentd authenticating to Amazon MSK using IAM roles over encrypted connections. Amazon MSK clusters use encryption in transit and at rest, protecting log data throughout the streaming pipeline. Kubernetes RBAC policies restrict pod-to-pod communication and limit service account permissions.
- Data privacy and tenant isolation: Each tenant’s data is kept logically separate in OpenSearch Service using the tenant ID. CSN implements tenant-aware authentication and authorization with FGAC, restricting users to their authorized tenants’ dashboards and data. Every API endpoint validates tenant context, so that users can only access data within their authorized scope. Importantly, no customer data is captured in the logs; only system metrics are used to build the monitoring system, adhering to banking security standards and best practices. User actions are audited and logged for compliance purposes, with audit trails maintained according to regulatory requirements.
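To illustrate how document-level tenant isolation can be expressed with FGAC, the following Python sketch builds a role body with a document-level security (DLS) query and a role mapping that ties an IAM role to it. It is a sketch under stated assumptions, not Zeta’s configuration: the `tenant_id` field name, the index pattern, the IAM role ARN, and both function names are hypothetical, and the `term` query assumes the tenant field is indexed as a keyword.

```python
import json


def tenant_read_role(tenant_id: str, index_patterns: list) -> dict:
    """Role body granting read access only to documents belonging to one tenant."""
    dls_query = {"term": {"tenant_id": tenant_id}}  # hypothetical field name
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": list(index_patterns),
                # The security plugin expects the DLS query as a JSON string
                "dls": json.dumps(dls_query),
                "allowed_actions": ["read"],
            }
        ],
    }


def role_mapping(iam_role_arn: str) -> dict:
    """Role mapping body that maps an IAM role to the role as a backend role."""
    return {"backend_roles": [iam_role_arn]}


if __name__ == "__main__":
    role = tenant_read_role("tenant-a", ["app-index*"])
    mapping = role_mapping("arn:aws:iam::123456789012:role/tenant-a-sre")  # placeholder ARN
    print(json.dumps(role, indent=2))
    print(json.dumps(mapping, indent=2))
```

Bodies like these are what the security plugin’s roles and role-mapping REST endpoints accept; one role per tenant keeps the isolation rule explicit and auditable.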
This security framework enables the observability system to meet the security requirements of core banking operations while maintaining operational efficiency and regulatory compliance across global industries.
Customer Service Navigator
CSN gives SREs a powerful diagnostics interface engineered for high-efficiency monitoring, deep analysis, and rapid troubleshooting of system performance across distributed environments. The system ingests and processes telemetry data at sub-minute intervals, providing near-real-time metrics, traces, and logs from critical infrastructure components. Actionable, interactive visualizations (such as heatmaps, anomaly graphs, and dependency maps) help SREs quickly detect SLO breaches and drill down to granular root causes, often within a few minutes of an incident.
The following screenshot shows an example service health dashboard in CSN for an Olympus tenant.
The following screenshot shows an example of the API performance insights dashboard in CSN.
Business and technical benefits
The OpenSearch Service-based CSN system provides the following business and technical benefits:
- Manual effort is reduced through automated Index State Management (ISM) and lifecycle policies, so that Zeta’s teams can focus on innovation
- Automated lifecycle policies facilitate seamless retention and archiving of compliance data, reducing the risk of non-compliance
- The system supports log retention for over 10 years to meet regulatory requirements for Zeta’s banking and financial services customers
- Multiple layers of security, including encryption at rest and in transit, FGAC, and tenant isolation, protect customer data and support Zeta’s zero-trust architecture
- By consolidating logs, traces, and metrics from disparate systems into OpenSearch, SRE teams can correlate events more effectively, thereby reducing troubleshooting efforts and achieving an 80% improvement in MTTR
- Archived logs are stored in Amazon S3, which is designed for 99.999999999% data durability, providing long-term data integrity
- Zstandard compression is being implemented to optimize long-term storage costs
Conclusion
CSN’s advanced correlation engine automatically associates related events across microservices, databases, network layers, and infrastructure, significantly streamlining root cause analysis. Integrated alerting and automated runbooks further reduce response times. Since implementing CSN, Zeta has achieved over an 80% reduction in MTTR, with incident response times decreasing from 30+ minutes to under 5 minutes. The service supports seamless multi-tenant monitoring, processes 3 TB of machine-generated data daily, and is architected for petabyte-scale growth. Additionally, CSN helps Zeta meet regulatory requirements for retaining historical logs over multiple years while keeping storage costs under control. This has significantly improved operational resilience, increased service availability, and empowered teams to proactively resolve issues before they affect end users.
Ready to take your organization’s observability capabilities to the next level? Dive into the technical details of OpenSearch Service in the Amazon OpenSearch Service Developer Guide. Visit our new migration hub page for more prescriptive guidance on moving your workloads to OpenSearch Service.
About the authors
Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies, including generative AI, intelligent agents, and the Model Context Protocol (MCP), to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to enhance operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.
Shashidhar (Shashi) Soppin is an accomplished Enterprise Architect and cloud transformation leader with over 24 years of experience spanning regulated industries and high-growth technology environments. Currently steering strategic initiatives as Lead Architect at Zeta’s CTO office, Shashidhar has built and led world-class engineering teams, driving innovation in cloud, security, and fintech domains. He has architected secure, scalable platforms, scaling user bases by 10x, enabling complex integrations for a major bank’s migration to Zeta’s platforms, and pioneering Zero Trust frameworks that achieved excellent regulatory compliance. A results-driven executive and former DMTS at Wipro, Shashidhar holds 25+ granted patents and has delivered multi-million dollar enterprise deals across domains including AI/ML. Renowned as a published author (“Essentials of Deep Learning”), frequent industry speaker, and hands-on innovator, he combines technical expertise with business acumen, propelling organizations toward robust, future-ready cloud ecosystems and operational excellence. Prior to Wipro, he also worked at IBM-ISL.
Anchal Kansal is a Lead Site Reliability Engineer at Zeta, where she has spent the past 4 years building and scaling reliable, high-performance systems. With deep expertise in OpenSearch, observability platforms, and large-scale infrastructure, she focuses on ensuring uptime, performance, and operational efficiency. Anchal is passionate about solving complex reliability challenges and sharing practical insights with the engineering community.
Manochandra (Mano) is a Site Reliability Engineering (SRE) expert at Zeta, specializing in data management-oriented systems. With a deep understanding of large-scale distributed architectures, he has extensive experience designing, deploying, and maintaining resilient, production-grade OpenSearch systems. Mano is known for his proactive approach to optimizing infrastructure reliability and performance, as well as his ability to troubleshoot complex operational challenges. His expertise spans implementing automation, monitoring, and incident management best practices, making him a go-to resource for ensuring service availability and scalability at Zeta.
Hitesh Subnani is an FSI Solutions Architect at AWS India, where he works with customers to design and build architectures that deliver business value. He specializes in comprehensive observability and analytics systems, enabling organizations to gain deep insights from operational data. With expertise in search and analytics technologies, Hitesh focuses on scalable monitoring systems, real-time dashboards, and compliance-driven architectures for AWS customers in the financial sector.
Tarun Chakraborty is a Sr. Technical Account Manager (TAM) at AWS India, where he partners with leading banks and fintech organizations to accelerate their cloud transformation journeys. With over 15 years of experience in technology and financial services, he serves as a trusted advisor helping customers use AWS’s comprehensive suite of services to drive innovation and achieve their business objectives.