Apache Spark™ has become the de facto engine for big data processing, powering workloads at some of the largest organizations in the world. Over the past decade, we’ve seen Apache Spark evolve from a powerful general-purpose compute engine into a critical layer of the Open Lakehouse Architecture – with Spark SQL, Structured Streaming, open table formats, and unified governance serving as pillars for modern data platforms.
With the recent release of Apache Spark 4.0, that evolution continues with major advances in streaming, Python, SQL, and semi-structured data. You can read more about the release here.
Building on the strong foundation of Apache Spark, we’re excited to announce a new addition to open source:
We’re donating Declarative Pipelines – a proven standard for building reliable, scalable data pipelines – to Apache Spark.
This contribution extends Apache Spark’s declarative power from individual queries to full pipelines – allowing users to define what their pipeline should do, and letting Apache Spark figure out how to do it. The design draws on years of observing real-world Apache Spark workloads, codifying what we’ve learned into a declarative API that covers the most common patterns – including both batch and streaming flows.

Declarative APIs make ETL simpler and more maintainable
Through years of working with real-world Spark users, we’ve seen common challenges emerge when building production pipelines:
- Too much time spent wiring pipelines together with “glue code” to handle incremental ingestion or to decide when to materialize datasets. This is undifferentiated heavy lifting that every team ends up maintaining instead of focusing on core business logic
- Reimplementing the same patterns across teams, leading to inconsistency and operational overhead
- Lacking a standardized framework for testing, lineage, CI/CD, and monitoring at scale
At Databricks, we began addressing these challenges by codifying common engineering best practices into a product called DLT. DLT took a declarative approach: instead of wiring up all the logic yourself, you specify the final state of your tables, and the engine takes care of things like dependency mapping, error handling, checkpointing, failures, and retries for you.
The result was a big leap forward in productivity, reliability, and maintainability – especially for teams managing complex production pipelines.
Since launching DLT, we’ve learned a lot.
We’ve seen where the declarative approach can make an outsized impact, and where teams needed more flexibility and control. We’ve seen the value of automating complex logic and streaming orchestration, and the importance of building on open Spark APIs to ensure portability and developer freedom.
That experience informed a new direction: a first-class, open-source, Spark-native framework for declarative pipeline development.
From Queries to End-to-End Pipelines: The Next Step in Spark’s Declarative Evolution
Apache Spark SQL made query execution declarative: instead of implementing joins and aggregations with low-level RDD code, developers could simply write SQL to describe the result they wanted, and Spark handled the rest.
Spark Declarative Pipelines builds on that foundation and takes it a step further – extending the declarative model beyond individual queries to full pipelines spanning multiple tables. Now, developers can define what datasets should exist and how they are derived, while Spark determines the optimal execution plan, manages dependencies, and handles incremental processing automatically.
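To make that concrete, here is a minimal sketch of what a pipeline definition could look like in Python. The module name `pyspark.pipelines`, its alias `dp`, and the `materialized_view` decorator are assumptions based on the proposal rather than a finalized Spark API, so treat this as illustrative only.

```python
# Minimal sketch of a declarative pipeline in Python (assumed API surface).
from pyspark import pipelines as dp  # assumed module name from the proposal
from pyspark.sql import SparkSession, functions as F

# In a pipeline source file, the runtime supplies the active session.
spark = SparkSession.getActiveSession()


@dp.materialized_view  # assumed decorator: declares a dataset, not a job step
def clean_orders():
    """Orders with malformed rows filtered out."""
    return spark.read.table("raw_orders").where(F.col("order_id").isNotNull())


@dp.materialized_view
def daily_revenue():
    """Revenue per day, derived from clean_orders."""
    # Referencing clean_orders by name is enough for the engine to infer the
    # dependency and run clean_orders first; no orchestration code is needed.
    return (
        spark.read.table("clean_orders")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )
```

The point is not the specific decorator names but the shape of the program: you declare the datasets and their derivations, and the engine works out ordering and execution.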

Built with openness and composability in mind, Spark Declarative Pipelines offers:
- Declarative APIs for defining tables and transformations
- Native support for both batch and streaming data flows (see the sketch below)
- Data-aware orchestration with automatic dependency tracking, execution ordering, and backfill handling
- Automatic checkpointing, retries, and incremental processing for streaming data
- Support for both SQL and Python
- Execution transparency with full access to the underlying Spark plans
And most importantly, it’s Apache Spark all the way down – no wrappers or black boxes.
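Because batch and streaming are both first-class, a single pipeline can mix incremental ingestion with derived aggregates. The sketch below is illustrative and reuses the same assumed `pyspark.pipelines` module as the earlier example; in particular, the convention that a function returning a streaming DataFrame declares a streaming table is an assumption about the final API, and the Kafka connection details are placeholders.

```python
# Illustrative sketch: streaming ingestion plus a derived aggregate in one pipeline.
# Assumes the same hypothetical `pyspark.pipelines` module as the earlier example.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.getActiveSession()


@dp.table  # assumption: returning a streaming DataFrame declares a streaming table
def events_bronze():
    """Events ingested incrementally; checkpointing and retries are handled by the engine."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "events")                      # placeholder topic
        .load()
    )


@dp.materialized_view
def events_per_hour():
    """Hourly event counts derived from the streaming table above."""
    # The Kafka source exposes a `timestamp` column; the engine tracks the
    # dependency on events_bronze and keeps this view up to date.
    return (
        spark.read.table("events_bronze")
        .groupBy(F.window(F.col("timestamp"), "1 hour"))
        .count()
    )
```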
A New Standard, Now in the Open
This contribution represents years of work across Apache Spark, Delta Lake, and the broader open data community. It’s inspired by what we’ve learned from building DLT – but designed to be more flexible, more extensible, and fully open source.
And we’re just getting started. We’re contributing this as a common layer that the entire Apache Spark ecosystem can build upon – whether you’re orchestrating pipelines on your own platform, building domain-specific abstractions, or contributing directly to Spark itself. This framework is here to support you.
“Declarative pipelines hide the complexity of modern data engineering under a simple, intuitive programming model. As an engineering manager, I love the fact that my engineers can focus on what matters most to the business. It’s exciting to see this level of innovation now being open sourced, making it accessible to even more teams.”
— Jian (Miracle) Zhou, Senior Engineering Manager, Navy Federal Credit Union
“At 84.51° we’re always looking for ways to make our data pipelines easier to build and maintain, especially as we move toward more open and flexible tools. The declarative approach has been a big help in reducing the amount of code we have to manage, and it’s made it easier to support both batch and streaming without stitching together separate systems. Open-sourcing this framework as Spark Declarative Pipelines is a great step for the Spark community.”
— Brad Turnbaugh, Sr. Data Engineer, 84.51°
What’s Next
Stay tuned for more details in the Apache Spark documentation. In the meantime, you can review the Jira and community discussion for the proposal.
If you’re building pipelines with Apache Spark today, we invite you to explore the declarative model. Our goal is to make pipeline development simpler, more reliable, and more collaborative for everyone.
The Lakehouse is about more than just open storage. It’s about open formats, open engines – and now, open patterns for building on top of them.
We believe declarative pipelines will become a new standard for Apache Spark development. And we’re excited to build that future together, with the community, in the open.