
How UiPath Built a Scalable Real-Time ETL Pipeline on Databricks


Delivering on the promise of real-time agentic automation requires a fast, reliable, and scalable data foundation. At UiPath, we needed a modern streaming architecture to underpin products like Maestro and Insights, enabling near real-time visibility into agentic automation metrics as they unfold. That journey led us to unify batch and streaming on Azure Databricks using Apache Spark™ Structured Streaming, enabling cost-efficient, low-latency analytics that support agentic decision-making across the enterprise.

This blog details the technical approach, trade-offs, and impact of these improvements.

With Databricks-based streaming, we have achieved sub-minute event-to-warehouse latency while delivering a simplified architecture and future-proof scalability, setting the new standard for event-driven data processing across UiPath.

Why Streaming Matters for UiPath Maestro and UiPath Insights

At UiPath, products like Maestro and Insights rely heavily on timely, reliable data. Maestro acts as the orchestration layer for our agentic automation platform, coordinating AI agents, robots, and humans based on real-time events. Whether it is reacting to a system trigger, executing a long-running workflow, or including a human-in-the-loop step, Maestro depends on fast, accurate signal processing to make the right decisions.

UiPath Insights, which powers monitoring and analytics across these automations, adds another layer of demand: capturing key metrics and behavioral signals in near real time to surface trends, calculate ROI, and support issue detection.

Delivering both of these outcomes, reactive orchestration and real-time observability, requires a data pipeline architecture that is not only low-latency but also scalable, reliable, and maintainable. That need is what led us to rethink our streaming architecture on Azure Databricks.

Building the Streaming Data Foundation

Delivering on the promise of powerful analytics and real-time monitoring requires a foundation of scalable, reliable data pipelines. Over the past few years, we have developed and expanded a number of pipelines to support new product features and respond to evolving business requirements. Now, we have the opportunity to assess how we can optimize these pipelines to not only save costs, but also improve scalability and provide an at-least-once delivery guarantee to support data from new services like Maestro.

Earlier architecture

While our earlier setup (shown above) worked well for our customers, it also revealed areas for improvement:

  1. The batch pipeline introduced up to 30 minutes of latency and relied on a complex infrastructure.
  2. The real-time pipeline delivered faster data but came with higher cost.
  3. For Robotlogs, our largest dataset, we maintained separate ingestion and storage paths for historical and real-time processing, resulting in duplication and inefficiency.
  4. To support the new ETL pipeline for UiPath Maestro, a new UiPath product, we would need to achieve an at-least-once delivery guarantee.

To address these challenges, we undertook a major architectural overhaul. We merged the batch and real-time ingestion processes for Robotlogs into a single pipeline, and re-architected the real-time ingestion pipeline to be more cost-efficient and scalable.

Why Spark Structured Streaming on Databricks?

As we set out to simplify and modernize our pipeline architecture, we needed a framework that could handle both high-throughput batch workloads and low-latency real-time data without introducing operational overhead. Spark Structured Streaming (SSS) on Azure Databricks was a natural fit.

Built on top of Spark SQL and Spark Core, Structured Streaming treats real-time data as an unbounded table, allowing us to reuse familiar Spark batch constructs while gaining the benefits of a fault-tolerant, scalable streaming engine. This unified programming model reduced complexity and accelerated development.
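As a minimal sketch of that unified model (the paths, table, and column names below are invented for illustration, not taken from our production pipeline), the same shaping function can be applied to a batch read and a streaming read of the same source:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Batch view: a bounded read of historical events (path is illustrative).
batch_events = spark.read.format("delta").load("/mnt/events/robotlogs")

# Streaming view: the same data treated as an unbounded table.
stream_events = spark.readStream.format("delta").load("/mnt/events/robotlogs")

# One shaping function shared by both modes.
def shape(df):
    return (df
            .where(F.col("eventType").isNotNull())
            .select("eventId", "tenantId", "eventType", "timestamp"))

shaped_batch = shape(batch_events)     # regular DataFrame
shaped_stream = shape(stream_events)   # streaming DataFrame, same API
```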

We had already leveraged Spark Structured Streaming to build our Real-time Alerts feature, which uses stateful stream processing in Databricks. Now, we are expanding its role to build our next generation of real-time ingestion pipelines, enabling us to achieve low latency, scalability, cost efficiency, and at-least-once delivery guarantees.

The Next Generation of Real-time Ingestion

Our new architecture, shown below, dramatically simplifies the data ingestion process by consolidating previously separate components into a unified, scalable pipeline using Spark Structured Streaming on Databricks:

At the core of this new design is a set of streaming jobs that read directly from event sources. These jobs perform parsing, filtering, and flattening, and, most critically, join each event with reference data to enrich it before writing to our data warehouse. A simplified sketch of this pattern follows.
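The sketch below assumes a Kafka-compatible endpoint on the event bus; the broker address, topic, schema, checkpoint path, and table names are placeholders rather than our actual production configuration:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Illustrative event schema; real payloads and field names will differ.
event_schema = T.StructType([
    T.StructField("eventId", T.StringType()),
    T.StructField("tenantId", T.StringType()),
    T.StructField("eventType", T.StringType()),
    T.StructField("timestamp", T.TimestampType()),
])

# Read directly from the event source via its Kafka-compatible endpoint.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "eventbus.example.com:9093")
       .option("subscribe", "robot-logs")
       .load())

# Parse, filter, and flatten each event.
events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .where(F.col("eventType").isNotNull()))

# Enrich with reference data via a stream-static join (table name assumed).
reference = spark.read.table("ref.tenants")
enriched = events.join(F.broadcast(reference), on="tenantId", how="left")

# Write to the warehouse; the checkpoint provides fault-tolerant progress tracking.
query = (enriched.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/robot-logs")
         .toTable("analytics.robot_logs"))
```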

We orchestrate these jobs using Databricks Lakeflow Jobs, which helps manage retries and job recovery in case of transient failures. This streamlined setup improves both developer productivity and system reliability.

The benefits of this new architecture include:

  • Cost efficiency: Saves COGS by reducing infrastructure complexity and compute usage
  • Low latency: Ingestion latency averages around one minute, with the flexibility to reduce this further
  • Future-proof scalability: Throughput is proportional to the number of cores, and we can scale out horizontally as traffic grows
  • No data loss: Spark does the heavy lifting of failure recovery, supporting at-least-once delivery.
    • With downstream sink deduplication in future development, we will be able to achieve exactly-once delivery
  • Fast development cycle thanks to the Spark DataFrame API
  • Simple and unified architecture

Low Latency

Our streaming job currently runs in micro-batch mode with a one-minute trigger interval. This means that from the moment an event is published to our Event Bus, it typically lands in our data warehouse with a median latency of around 27 seconds, with 95% of records arriving within 51 seconds and 99% within 72 seconds.

Structured Streaming offers configurable trigger settings, which could bring the latency down to a few seconds. For now, we have chosen the one-minute trigger as the right balance between cost and performance, with the flexibility to lower it in the future if requirements change.
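A minimal sketch of that trigger configuration, where the source, checkpoint path, and target table are placeholders rather than our actual job definition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source; any Structured Streaming source would do.
events = spark.readStream.format("rate").load()

# Micro-batches fire once per minute; shrinking the interval lowers latency
# at the cost of more frequent, smaller batches.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/latency-example")
         .trigger(processingTime="1 minute")
         .toTable("analytics.latency_example"))
```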

Scalability

Spark divides large data workloads into partitions, which fully utilize the worker/executor CPU cores. Each Structured Streaming job is split into stages, which are further divided into tasks, each of which runs on a single core. This level of parallelization allows us to fully utilize our Spark cluster and scale efficiently with growing data volumes.

Thanks to optimizations like in-memory processing, Catalyst query planning, whole-stage code generation, and vectorized execution, we process around 40,000 events per second in scalability validation. If traffic increases, we can scale out simply by increasing partition counts on the source Event Bus and adding more worker nodes, ensuring future-proof scalability with minimal engineering effort; the source-side knobs involved are sketched below.
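For illustration, these are the kinds of source options that pair with adding partitions and worker cores when reading from a Kafka-compatible source; the broker, topic, and values are assumptions, not our production settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "eventbus.example.com:9093")
       .option("subscribe", "robot-logs")
       # Cap the records pulled per micro-batch so bursts stay predictable.
       .option("maxOffsetsPerTrigger", "2000000")
       # Request at least this many input partitions (one task per partition).
       .option("minPartitions", "64")
       .load())
```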

Delivery Guarantee

Spark Structured Streaming provides exactly-once delivery by default, thanks to its checkpointing system. After each micro-batch, Spark persists the progress (or "epoch") of each source partition as write-ahead logs, along with the job's application state in the state store. In the event of a failure, the job resumes from the last checkpoint, ensuring no data is lost or skipped.

This is discussed in the original Spark Structured Streaming research paper, which states that achieving exactly-once delivery requires:

  1. The input source to be replayable
  2. The output sink to support idempotent writes

But there is also an implicit third requirement that often goes unstated: the system must be able to detect and handle failures gracefully.

This is where Spark excels: its robust failure recovery mechanisms can detect task failures, executor crashes, and driver issues, and automatically take corrective actions such as retries or restarts.

Note that we currently operate with at-least-once delivery, as our output sink is not yet idempotent. If we need exactly-once delivery in the future, we should be able to achieve it by putting further engineering effort into sink idempotency, for example along the lines sketched below.
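One possible way to make the sink idempotent (not something we have shipped; the table name, key column, and source are placeholders) is a per-micro-batch merge through foreachBatch, so a replayed batch updates existing rows instead of inserting duplicates:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source with an eventId-like key column.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("value", "eventId"))

def upsert_batch(batch_df, batch_id):
    # Merge on the event key: reprocessing the same batch after a failure
    # overwrites the same rows rather than appending duplicates.
    # Assumes the target Delta table already exists.
    target = DeltaTable.forName(spark, "analytics.example_events")
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.eventId = s.eventId")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

query = (events.writeStream
         .foreachBatch(upsert_batch)
         .option("checkpointLocation", "/chk/idempotent-example")
         .start())
```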

Raw Data is Better

We have also made other improvements. We now include and persist a common rawMessage field across all tables. This column stores the original event payload as a raw string. To borrow the sushi principle (although we mean something slightly different here): raw data is better.

Raw data significantly simplifies troubleshooting. When something goes wrong, such as a missing field or an unexpected value, we can refer directly to the original message and trace the issue without chasing down logs or upstream systems. Without this raw payload, diagnosing data issues becomes much harder and slower.
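A small sketch of carrying this column alongside the parsed fields; the schema, broker, and topic here are illustrative:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Illustrative schema; the real events carry many more fields.
event_schema = T.StructType([
    T.StructField("eventId", T.StringType()),
    T.StructField("eventType", T.StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "eventbus.example.com:9093")
       .option("subscribe", "robot-logs")
       .load())

# Persist the untouched payload as rawMessage next to the parsed columns,
# so any parsing gap can be traced back to the original event.
events = (raw
          .withColumn("rawMessage", F.col("value").cast("string"))
          .withColumn("e", F.from_json("rawMessage", event_schema))
          .select("rawMessage", "e.*"))
```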

The downside is a small increase in storage. But thanks to cheap cloud storage and the columnar format of our warehouse, this has minimal cost and no impact on query performance.

Simple and Powerful API

The new implementation takes less development time. This is largely thanks to the DataFrame API in Spark, which provides a high-level, declarative abstraction over distributed data processing. In the past, using RDDs meant manually reasoning about execution plans, understanding DAGs, and optimizing the order of operations like joins and filters. DataFrames allow us to focus on the logic of what we want to compute, rather than how to compute it. This significantly simplifies the development process.
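As a small, self-contained illustration of that declarative style (the data is made up for the example), a filter-join-aggregate reads as a description of the result rather than a plan of execution:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [("e1", "tenant-a", 12), ("e2", "tenant-b", 7), ("e3", "tenant-a", 45)],
    ["eventId", "tenantId", "durationSec"],
)
tenants = spark.createDataFrame(
    [("tenant-a", "EMEA"), ("tenant-b", "AMER")],
    ["tenantId", "region"],
)

# Declare *what* to compute; Catalyst picks the join strategy, pushes down
# the filter, and generates the physical execution plan.
long_running_by_region = (events
                          .where(F.col("durationSec") > 10)
                          .join(tenants, "tenantId")
                          .groupBy("region")
                          .agg(F.count("*").alias("longRunningEvents")))

long_running_by_region.show()
```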

This has also improved operations. We no longer need to manually rerun failed jobs or trace errors across multiple pipeline components. With a simplified architecture and fewer moving parts, both development and debugging are significantly easier.

Driving Real-Time Analytics Across UiPath

The success of this new architecture has not gone unnoticed. It has quickly become the new standard for real-time event ingestion across UiPath. Beyond its initial implementation for UiPath Maestro and Insights, the pattern has been broadly adopted by several new teams and projects for their real-time analytics needs, including those working on cutting-edge initiatives. This widespread adoption is a testament to the architecture's scalability, efficiency, and extensibility, making it easy for new teams to onboard and enabling a new generation of products with powerful real-time analytics capabilities.

If you are looking to scale your real-time analytics workloads without the operational burden, the architecture outlined here offers a proven path, powered by Databricks and Spark Structured Streaming and ready to support the next generation of AI and agentic systems.

About UiPath
UiPath (NYSE: PATH) is a global leader in agentic automation, empowering enterprises to harness the full potential of AI agents to autonomously execute and optimize complex business processes. The UiPath Platform™ uniquely combines controlled agency, developer flexibility, and seamless integration to help organizations scale agentic automation safely and confidently. Committed to security, governance, and interoperability, UiPath supports enterprises as they transition into a future where automation delivers on the full potential of AI to transform industries.
