Recursive Frequent Desk Expressions (CTEs) are actually supported in Databricks. This brings a local technique to specific loops and traversals in SQL, helpful for working with hierarchical and graph-structured information. These capabilities are aligned with the SQL normal and comply with acquainted patterns utilized in platforms like Teradata. Recursive CTEs have lengthy been a part of the SQL normal, so they are going to be acquainted to prospects migrating from legacy information warehouses. Databricks has additionally contributed Recursive CTE assist to Apache Spark™, making it absolutely open supply.
Databricks makes use of normal ANSI SQL syntax for recursive CTEs, together with the RECURSIVE key phrase.
This seemingly small function considerably enhances SQL’s expressive capabilities, theoretically making it Turing full—which means it may carry out any computation a pc can. Recursive CTEs allow composable options that beforehand required procedural code, reminiscent of Python or exterior instruments.
Recursive CTEs are actually out there in Public Preview DBSQL 2025.20 and Databricks Runtime 17.0 (coming quickly to Lakeflow Declarative Pipelines). On this weblog, we’ll discover how recursive CTEs work—and the way they may also help you resolve real-world issues utilizing pure SQL.
Key options of recursive CTE assist
Databricks’ recursive CTE assist consists of:
- Traversal of tree- and graph-like constructions, reminiscent of org charts, folders, and routing networks
- Absolutely open supply and built-in into Apache Spark™
- Constructed-in safeguards for infinite recursion (100 steps, 1M rows)
- Customizable safeguards utilizing MAX RECURSION LEVEL
- Help for managed infinite recursion utilizing LIMIT
Recursive CTEs work nicely with each conventional techniques that retailer hierarchical information in normalized tables in addition to information coming from trendy purposes that generate versatile JSON/XML hierarchies. See examples under of every together with RCTEs leveraging the Variant information kind for JSON hierarchies.
Plus, assist for recursive CTEs simplifies migrations from legacy database techniques. Teradata and Postgres are two examples of techniques whose syntax is equivalent, whereas techniques like Oracle, which use CONNECT BY syntax, are simply transformed.
How recursive CTEs work
Recursive CTEs are widespread desk expressions outlined with the RECURSIVE key phrase. They encompass two elements mixed by utilizing UNION ALL:
- A base case subquery — this runs as soon as and seeds the recursion
- A recursive step subquery — this refers back to the CTE itself and is repeatedly utilized to construct new rows.
Execution begins with the bottom question. Then, on every iteration, the recursive step is run utilizing the output of the earlier step. This continues till no new rows are produced.
To forestall infinite recursion from consuming extreme sources, Databricks enforces two security limits: a most recursion depth of 100 steps and a row restrict of 1 million. If both threshold is exceeded, the question fails with an error.
When you’re assured in your recursion requiring greater than 100 steps to provide all the outcomes, you might override the max degree by utilizing the MAX RECURSION LEVEL trace:
For extra particulars, confer with the CTE documentation.
“At bp Provide Buying and selling and Transport – Market Threat, understanding portfolio hierarchy reporting throughout enterprise items is important for our enterprise to function effectively. By changing our legacy code with recursive CTEs in Databricks SQL, we lowered a hierarchical information preparation step from ~6 minutes to ~30 seconds, which is a 12× enchancment.” — Dharmik Prajapati, bp Employees Software program Engineer
Examples of fixing iterative duties utilizing Recursive CTEs
Navigate Tree and Hierarchy information: Discovering required supplies utilizing a Invoice of Supplies
Within the manufacturing trade, each manufactured half requires a set of elements to construct. Every part could possibly be damaged down right into a smaller set of particular person elements. The whole set of all elements is known as a Invoice of Supplies (BOM).
A BOM typically varieties a tree-like construction—or extra usually, a directed acyclic graph (DAG). On this instance, we take a look at the elements of a bicycle, which we’ll simplify by assuming a tree construction, the place every part is utilized in precisely one dad or mum.
Suppose we wish to compute what number of uncooked supplies are wanted to construct a bicycle. Take into account the next BOM:
Every row describes a part, the bigger half it belongs to, and what number of elements are wanted to assemble one unit of the dad or mum.
The recursive CTE begins with one goal: constructing one bike. That’s the bottom case. In every recursive step, we break down elements into their subcomponents. As an illustration, a bicycle features a body, a drivetrain, and two wheels. Every wheel, in flip, consists of a tire and 32 spokes. The recursive construction turns into clear as we break down elements into smaller items.
As soon as we have absolutely expanded the hierarchy, we filter out intermediate elements (dad and mom) to retain solely the uncooked supplies required for meeting.
This question computes the entire amount of every base materials wanted to construct one bike:
Pathfinding based mostly on the graph traversal: Discovering all flight paths from one metropolis
Let’s look at an issue utilizing a graph information construction. A graph consists of a set of nodes related by edges. It’s used to symbolize relationships or connections between pairs of components. Fixing a graph drawback used to require Python, difficult scripting logic, or an exterior library. Now, recursive queries make it easy.
A typical graph construction drawback is airplane journey: which airports can I attain utilizing a collection of flights? Suppose we’ve the next set of flights that exist on a day:
Every flight is given the IATA codes of its supply and vacation spot, together with the departure and arrival occasions.
Suppose an individual arrives on the airport BEG at 8 AM, and needs to search out all of the doable journey routes they’ll make on that day.
That is naturally posed as an iterative drawback. Every time we uncover a brand new metropolis we will attain, we discover all flights that depart from there after our arrival time. Because of this, within the recursive CTE, we preserve observe of the arrival time at each airport.
This produces the set of all reachable airports, together with the required quantity and set of flights.
This question may also help customers discover all reachable locations given schedule constraints, supporting purposes like journey planning, package deal routing, or transportation logistics.
Within the earlier instance, we outlined column names within the WITH RECURSIVE … AS (…) clause. Right here, we outline them within the anchor question as an alternative. Each approaches are legitimate in recursive CTEs on Databricks.
Traverse semi-structured and unstructured information: Discovering all staff saved in a JSON file
Conventional techniques typically retailer hierarchical information in inflexible, normalized tables. In the meantime, trendy purposes often generate versatile JSON/XML hierarchies. Databricks’ mixture of recursive CTEs with the VARIANT kind enables you to migrate these information patterns seamlessly, permitting you to question each conventional normalized information and versatile JSON/XML constructions in a single system.
On this instance, we’re given a (comparatively small) firm hierarchy. However as an alternative of a completely structured desk, we’re given it within the type of a JSON:
Suppose we wish an inventory of all staff and their titles in a desk. The fields of the individuals within the firm don’t comply with the identical schema: some have direct subordinates, whereas others don’t; some have their location, some don’t! With using the VARIANT datatype in Databricks, all of their wanted commonalities can be utilized inside a recursive CTE to completely discover the nested construction of the JSON, whereas their variations could be ignored.
The base case of the recursion is the total JSON information of the basis worker, which features a listing of their subordinates. In every recursive step, the question processes every subordinate’s information and repeats the method till it reaches an worker with no subordinates.
Right here’s the recursive question for this instance:
Regardless that all CTEs listed here are beneath a WITH RECURSIVE block, solely the one with precise recursion is handled as recursive. Databricks is sensible sufficient to detect which of them want recursion—even in case you mark all of them!
The output of the question:
Methods to get began
Get began with recursive CTEs by studying the Databricks documentation.
To study extra about Databricks SQL, go to our web site or learn the documentation. You too can take a look at the product tour for Databricks SQL. If you wish to migrate your present warehouse to a high-performance, serverless information warehouse with an important person expertise and decrease complete price, then Databricks SQL is the answer — strive it free of charge.