
From Lag to Agility: Reinventing Freshworks' Data Ingestion Architecture


As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks relies on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is critical. To meet this need, Freshworks has built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute with a 30-minute SLA, while guaranteeing tenant-level data isolation in a multi-tenant setup.

Achieving this requires a robust, flexible, and optimized data pipeline, which is exactly what we set out to build.

Legacy Architecture and the Case for Change

Freshworks' legacy pipeline was built around Python consumers: each user action triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed these events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth but quickly hit limits as event volume surged.

Rapid growth exposed core challenges:

  • Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
  • Operational Complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
  • Cost Inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and a lack of optimization.
  • Responsiveness: The legacy setup couldn't meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays impaired data freshness and impacted customer insights.

As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support the company's growth and analytics needs.

New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake

The solution: a foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.

We designed a single, streamlined architecture in which Spark Structured Streaming directly consumes from Kafka, transforms the data, and writes it into Delta Lake, all in one job running entirely inside Databricks.

This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.

The key components of the new architecture:

The Streaming Component: Spark Structured Streaming

Every incoming event from Kafka passes through a carefully orchestrated sequence of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency:

  1. Efficient Deduplication:
    Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter duplicates across streaming batches.
  2. Data Validation:
    Schema and business rules filter out malformed records, enforce required fields, and handle nulls.
  3. Custom Transformations with JSON-e:
    The JSON-e templating engine supports advanced constructs such as conditionals, loops, and Python UDFs, letting product teams define dynamic, reusable transformation logic tailored to each product.
  4. Flattening to Tabular Form:
    Transformed JSON events are flattened into thousands of structured tables. A separate internal schema management tool (managing 20,000+ tables and 5M+ columns) lets product teams manage schema changes and automatically promote them to production, where they are registered in Delta Lake and seamlessly picked up by Spark streaming.
  5. Flattened Data Deduplication:
    A hash of the stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs. (A sketch of both deduplication steps follows this list.)
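
As a rough illustration of steps 1 and 5, here is a minimal sketch of how the two deduplication layers could be combined inside a foreachBatch handler. The pipeline.processed_uuids table, the Redis endpoint, and the column names are assumptions for illustration, not the production implementation.

```python
# Minimal sketch of the two deduplication steps (1 and 5) inside a
# foreachBatch handler; table name, Redis endpoint, and column names
# ("uuid" and the flattened columns) are illustrative assumptions.
from pyspark.sql import functions as F
import redis

REDIS_TTL_SECONDS = 4 * 60 * 60  # matches the 4-hour lookback window

def dedupe_batch(batch_df, batch_id):
    spark = batch_df.sparkSession

    # Step 1: drop events whose UUID has already been processed.
    seen = spark.read.table("pipeline.processed_uuids").select("uuid")
    fresh = batch_df.join(seen, on="uuid", how="left_anti")

    # Remember the UUIDs of this batch for future checks.
    fresh.select("uuid").write.format("delta").mode("append") \
        .saveAsTable("pipeline.processed_uuids")

    # Step 5: compare a hash of the stored columns against the last
    # 4 hours of processed rows kept in Redis.
    hashed = fresh.withColumn(
        "row_hash", F.sha2(F.concat_ws("||", *fresh.columns), 256)
    )

    def keep_unseen(rows):
        r = redis.Redis(host="redis-host", port=6379)  # illustrative endpoint
        for row in rows:
            # SET NX succeeds only if the hash was not seen in the window.
            if r.set(row["row_hash"], 1, nx=True, ex=REDIS_TTL_SECONDS):
                yield row

    deduped = hashed.rdd.mapPartitions(keep_unseen).toDF(hashed.schema)
    return deduped  # continues to validation, JSON-e transforms, and merges
```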

The Storage Component: Lakehouse

Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:

  • Parallel Writes with Multiprocessing:
    A single Spark job typically writes to ~250 Delta tables, applying varying transformation logic. This is done using Python multiprocessing, which performs Delta merges in parallel, maximizing cluster utilization and reducing latency (see the sketch after this list).
  • Efficient Updates with Deletion Vectors:
    Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage Deletion Vectors to enable soft deletes. This improves update performance by 3x, making real-time updates practical even at terabyte scale.
  • Accelerated Merges with Disk Caching:
    Disk Caching ensures that frequently accessed (hot) data stays in the cache. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Today, 95% of merge reads are served directly from the cache.
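
To illustrate the parallel-write pattern referenced above, here is a minimal sketch using multiprocessing.pool.ThreadPool, so all merges share one SparkSession while Spark executes them concurrently on the cluster. The table names, the id merge key, and the pool size are illustrative assumptions.

```python
# Minimal sketch of parallel Delta merges from one Spark job, assuming the
# transformed micro-batch has already been split per target table.
from multiprocessing.pool import ThreadPool
from delta.tables import DeltaTable

MAX_PARALLEL_MERGES = 16  # illustrative pool size

def merge_into(args):
    target_table, source_df = args
    target = DeltaTable.forName(source_df.sparkSession, target_table)
    (
        target.alias("t")
        .merge(source_df.alias("s"), "t.id = s.id")  # 'id' is an assumed key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
    return target_table

def write_batch(per_table_frames):
    """per_table_frames: dict of {table_name: DataFrame} for one micro-batch."""
    # Threads (not separate processes) so every merge shares the SparkSession;
    # Spark then schedules the ~250 merges concurrently on the cluster.
    with ThreadPool(MAX_PARALLEL_MERGES) as pool:
        for table in pool.imap_unordered(merge_into, per_table_frames.items()):
            print(f"merged into {table}")
```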

Autoscaling and Adapting in Real Time

Autoscaling is built into the pipeline to ensure that the system scales up or down dynamically, handling volume and cost as efficiently as possible without impacting performance.

Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered via the jobs APIs from Spark's query listener (the onQueryProgress callback after each batch), ensuring in-flight processing isn't disrupted. This keeps the system responsive, resilient, and efficient without manual intervention. A sketch of this listener-driven resizing is shown below.
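
For illustration, here is a minimal sketch of such a listener, using batch-duration thresholds and the Databricks clusters resize endpoint as a stand-in for the job APIs mentioned above; the workspace URL, token handling, thresholds, and scaling policy are assumptions, not the production logic.

```python
# Minimal sketch of listener-driven autoscaling; endpoint, token, and
# thresholds are illustrative assumptions.
import requests
from pyspark.sql.streaming import StreamingQueryListener

RESIZE_URL = "https://<workspace>/api/2.0/clusters/resize"  # placeholder host
SCALE_UP_MS = 90_000    # batch slower than this -> add workers
SCALE_DOWN_MS = 20_000  # batch faster than this -> remove workers

class AutoscaleListener(StreamingQueryListener):
    def __init__(self, cluster_id, token, min_workers=4, max_workers=40):
        self.cluster_id = cluster_id
        self.token = token
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.current = min_workers

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Called after every micro-batch completes, so a resize never
        # interrupts in-flight processing.
        duration = event.progress.durationMs.get("triggerExecution", 0)
        if duration > SCALE_UP_MS:
            self._resize(min(self.current * 2, self.max_workers))
        elif duration < SCALE_DOWN_MS:
            self._resize(max(self.current // 2, self.min_workers))

    def onQueryTerminated(self, event):
        pass

    def _resize(self, num_workers):
        if num_workers == self.current:
            return
        requests.post(
            RESIZE_URL,
            headers={"Authorization": f"Bearer {self.token}"},
            json={"cluster_id": self.cluster_id, "num_workers": num_workers},
            timeout=30,
        )
        self.current = num_workers

# spark.streams.addListener(AutoscaleListener("<cluster-id>", "<token>"))
```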

Built-In Resilience: Handling Failures Gracefully

To maintain data integrity and availability, the architecture includes robust fault tolerance:

  • Events that fail transformation are retried via Kafka with backoff logic.
  • Permanently failed records are stored in a Delta table for offline analysis and reprocessing, ensuring no data is lost.
  • This design preserves data integrity without human intervention, even during peak loads or schema changes, and allows the failed records to be republished later. (A sketch of this dead-letter handling follows.)
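
A minimal sketch of the retry and dead-letter flow is shown below. It assumes the transformation stage has already tagged each row with an error column (null on success) and a retry_count column; the broker address, retry topic, and dead-letter table name are illustrative.

```python
# Minimal sketch of the retry / dead-letter routing; all names are assumptions.
from pyspark.sql import DataFrame, functions as F

MAX_RETRIES = 3
RETRY_TOPIC = "events.retry"             # assumed retry topic
FAILED_TABLE = "pipeline.failed_events"  # assumed dead-letter Delta table

def route_failures(transformed: DataFrame) -> DataFrame:
    failed = transformed.filter(F.col("error").isNotNull())

    # Retryable events go back to Kafka; consumers apply backoff before re-reading.
    retryable = (
        failed.filter(F.col("retry_count") < MAX_RETRIES)
        .withColumn("retry_count", F.col("retry_count") + 1)
    )
    (
        retryable
        .select(F.to_json(F.struct(*retryable.columns)).alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("topic", RETRY_TOPIC)
        .save()
    )

    # Permanently failed records land in a Delta table for offline analysis
    # and later republishing, so nothing is lost.
    (
        failed.filter(F.col("retry_count") >= MAX_RETRIES)
        .write.format("delta").mode("append").saveAsTable(FAILED_TABLE)
    )

    # Only clean rows continue to the flattening and merge stages.
    return transformed.filter(F.col("error").isNull())
```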

Observability and Monitoring at Every Step

A strong monitoring stack, built with Prometheus, Grafana, and Elasticsearch and integrated with Databricks, gives us end-to-end visibility:

  • Metrics Collection:
    Every batch in Databricks logs key metrics, such as input record count, transformed records, and error rates, which are pushed to Prometheus with real-time alerts to the support team (a minimal sketch follows this list).
  • Event Tracking:
    Event statuses are logged in Elasticsearch, enabling fine-grained debugging and allowing both product (producer) and analytics (consumer) teams to trace issues.
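
The sketch below shows one way to push per-batch metrics to Prometheus via a Pushgateway using the prometheus_client library; the gateway address, metric names, and job label are assumptions rather than the actual setup.

```python
# Minimal sketch of per-batch metrics export to a Prometheus Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway:9091"  # placeholder address

def report_batch_metrics(batch_id, input_count, transformed_count, error_count):
    registry = CollectorRegistry()
    Gauge("batch_input_records", "Records read from Kafka",
          registry=registry).set(input_count)
    Gauge("batch_transformed_records", "Records successfully transformed",
          registry=registry).set(transformed_count)
    Gauge("batch_error_records", "Records that failed transformation",
          registry=registry).set(error_count)
    # One push per completed micro-batch, keyed by an illustrative job label.
    push_to_gateway(PUSHGATEWAY, job=f"ingestion_batch_{batch_id}",
                    registry=registry)
```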

Transformation & Batch Execution Metrics

Transformation health is tracked using the above metrics to identify issues and trigger alerts for quick investigation.

From Complexity to Confidence

Perhaps the most transformative shift has been in simplicity.

What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely inside Databricks. We've eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Fewer moving parts meant fewer surprises and more confidence.

By reimagining the data stack around streaming and Delta Lake, we've built a system that not only meets today's scale but is ready for tomorrow's growth.

Why Databricks?

As we reimagined the data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that met the evolving needs of Freshworks.

A Unified Ecosystem for Data Processing

Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development.

  • Unity Catalog acts as the single source of truth for data governance. With granular access control, lineage tracking, and centralized schema management, it ensures that
    • our team can secure all data assets with well-organized data access for each tenant, preserving strict access boundaries, and
    • we remain compliant with regulatory needs, with all events and actions audited in the audit tables along with information on who has access to which assets.
  • Databricks Jobs provide built-in orchestration and replaced our reliance on external orchestrators like Airflow. Native scheduling and pipeline execution reduced operational friction and improved reliability.
  • CI/CD and REST APIs helped Freshworks' teams automate everything, from job creation and cluster scaling to schema updates. This automation has accelerated releases, improved consistency, and minimized manual errors, allowing us to experiment fast and learn fast (a sketch of job creation through the REST API follows this list).
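
As an illustration of that automation, here is a minimal sketch of creating a streaming ingestion job through the Databricks Jobs API (api/2.1/jobs/create). The workspace URL, token, notebook path, and cluster settings are placeholders, not Freshworks' actual configuration.

```python
# Minimal sketch of job creation via the Databricks Jobs 2.1 REST API.
import requests

WORKSPACE = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

def create_ingestion_job(product: str) -> int:
    payload = {
        "name": f"{product}-ingestion-stream",
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {
                    "notebook_path": f"/Repos/data-platform/ingestion/{product}"
                },
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",
                    "node_type_id": "i3.2xlarge",
                    "autoscale": {"min_workers": 4, "max_workers": 40},
                },
            }
        ],
        "max_concurrent_runs": 1,
    }
    resp = requests.post(
        f"{WORKSPACE}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]
```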

Optimized Spark Platform

  • Key capabilities such as automated resource allocation, a unified batch and streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allowed us to maintain consistent throughput, even during traffic spikes or infrastructure hiccups.

High-Performance Caching

  • Databricks Disk Caching proved to be the key factor in meeting the required data latency, as most merges were served from hot data kept in the disk cache.
  • Its ability to automatically detect changes in underlying data files and keep the cache updated ensured that batch processing intervals consistently met the required SLA. (A small configuration sketch follows.)
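
For reference, a minimal sketch of enabling disk caching for a job; spark here is assumed to be the active SparkSession in a Databricks notebook or job, and on some instance types the cache is already on by default.

```python
# Minimal sketch: enabling Databricks disk caching for the job's cluster.
# The cache itself is managed by the platform; this only turns it on.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```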

Delta Lake: Foundation for Real-Time and Reliable Ingestion

Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale.

Delta Lake features and their benefits for the SaaS pipeline:

  • ACID Transactions: Freshworks writes high-frequency streams from multiple sources with concurrent writes on the same data. Delta Lake's ACID compliance ensures data consistency across reads and writes.
  • Schema Evolution: Because of the fast-growing and inherently changing nature of the products, their schemas keep evolving. Delta Lake's schema evolution adapts to changing requirements; changes are seamlessly applied to Delta tables and automatically picked up by the Spark streaming applications.
  • Time Travel: With millions of transactions and audit needs, the ability to go back to a snapshot of data in Delta Lake supports auditing and point-in-time rollback.
  • Scalable Change Handling & Deletion Vectors: Delta Lake enables efficient insert/update/delete operations through transaction logs without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to a few minutes in our pipelines.
  • Open Format: Freshworks is a multi-tenant SaaS system, and the open Delta format provides broad compatibility with analytics tools on top of the Lakehouse, supporting multi-tenant read operations.

By combining Spark's speed, Delta Lake's reliability, and Databricks' integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks' real-time analytics.

What We Learned: Key Insights

No transformation is without its challenges. Along the way, we encountered a few surprises that taught us valuable lessons:

1. State Store Overhead: High Memory Footprint and Stability Issues

Using Spark's dropDuplicatesWithinWatermark caused high memory use and instability, especially during autoscaling, and led to increased S3 listing costs due to many small files.

Fix: Switching to Delta-based caching for deduplication (as sketched in the deduplication example earlier) dramatically improved memory efficiency and stability. The overall S3 listing cost and memory footprint dropped sharply, reducing both the time and the cost of data deduplication.

2. Liquid Clustering: Common Challenges

Clustering on multiple columns resulted in sparse data distribution and increased data scans, reducing query performance.

Our queries had one primary predicate with several secondary predicates; clustering on multiple columns led to a sparse distribution of data on the primary predicate column.

Fix: Clustering on a single primary column led to better file organization and significantly faster queries by optimizing data scans. A minimal example is shown below.
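
A minimal sketch of that change, assuming a hypothetical analytics.tickets table clustered on its primary predicate column account_id; spark is the active SparkSession.

```python
# Minimal sketch of narrowing liquid clustering to the single primary
# predicate column; table and column names are illustrative.
spark.sql("""
    ALTER TABLE analytics.tickets
    CLUSTER BY (account_id)  -- previously clustered on several columns
""")

# OPTIMIZE re-clusters existing files according to the new clustering column.
spark.sql("OPTIMIZE analytics.tickets")
```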

3. Garbage Collection (GC) Issues: Job Restarts Needed

Long-running jobs (7+ days) started experiencing performance slowness and more frequent garbage collection cycles.

Fix: We introduced weekly job restarts to mitigate prolonged GC cycles and performance degradation.

4. Data Skew: Handling Kafka Topic Imbalance

Data skew was observed because different Kafka topics had disproportionately varying data volumes. This led to uneven data distribution across processing nodes, causing skewed task workloads and non-uniform resource utilization.

Fix: Repartitioning before transformations ensured an even and balanced data distribution, balancing the processing load and improving throughput. A brief example follows.
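
A minimal sketch of the rebalancing step, with a placeholder broker address, topic pattern, and partition count.

```python
# Minimal sketch of rebalancing a skewed Kafka read before the heavy
# transformation stages; option values are illustrative.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribePattern", "product-.*-events")     # assumed topics
    .load()
)

# Without this, partitions mirror the (imbalanced) Kafka topic partitions.
balanced = raw.repartition(400)

# ...validation, JSON-e transformation, and flattening run on `balanced`.
```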

5. Conditional Merge: Optimizing Merge Performance

Even when only a few columns were needed, the merge operations were loading all columns from the target table, which led to high merge times and I/O costs.

Fix: We implemented an anti-join before the merge and an early discard of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded. A sketch is shown below.
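
A minimal sketch of the idea, assuming an id merge key and an updated_at column used to discard late or irrelevant rows; it approximates the anti-join with a left join against the target's key and timestamp columns followed by a filter. All names are illustrative.

```python
# Minimal sketch of pre-merge filtering to keep the Delta merge small.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def conditional_merge(spark, source_df, target_name="analytics.tickets"):
    # Early discard: drop records older than what the target already holds.
    latest = (
        spark.read.table(target_name)
        .select("id", F.col("updated_at").alias("target_updated_at"))
    )
    candidates = (
        source_df.join(latest, on="id", how="left")
        .filter(
            F.col("target_updated_at").isNull()
            | (F.col("updated_at") > F.col("target_updated_at"))
        )
        .drop("target_updated_at")
    )

    # Merge only the surviving rows, so the target scan stays small.
    (
        DeltaTable.forName(spark, target_name).alias("t")
        .merge(candidates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```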

Conclusion

By using Databricks and Delta Lake, Freshworks has redefined its data architecture, moving from fragmented, manual workflows to a modern, unified, real-time platform.

The impact?

  • 4x improvement in data sync time during traffic surges
  • ~25% cost savings thanks to scalable, cost-efficient operations with zero downtime
  • 50% reduction in maintenance effort
  • High availability and SLA-compliant performance, even during peak loads
  • Improved customer experience via real-time insights

This transformation empowers every Freshworks customer, from IT to Support, to make faster, data-driven decisions without worrying about whether the data volumes behind their business needs can be served and processed.
