
Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker


The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. From Amazon SageMaker Unified Studio, a single interface, you can access your data and use a set of powerful tools for data processing, SQL analytics, model development, training, and inference, as well as generative AI development. This unified experience is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which delivers an embedded generative AI and governance experience at every step.

With data lineage, now part of SageMaker Catalog, domain administrators and data producers can centralize lineage metadata for their data assets in one place. You can track the flow of data over time, giving you a clear understanding of where it originated, how it has changed, and its ultimate use across the enterprise. By providing this level of transparency around the origin of data, data lineage helps data consumers gain trust that the data is fit for their use case. Because data lineage is captured at the table, column, and job level, data producers can also conduct impact analysis and respond to data issues when needed.

Capture of data lineage in SageMaker begins after connections and data sources are configured, and lineage events are generated when data is transformed in AWS Glue or Amazon Redshift. This capability is also fully compatible with OpenLineage, so you can further expand data lineage capture to other data processing tools. This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools that support the OpenLineage standard, like dbt, Apache Airflow, and Apache Spark.

Solution overview

Many third-party and open source tools that are used today to orchestrate and run data pipelines, like dbt, Airflow, and Spark, actively support the OpenLineage standard to provide interoperability across environments. With this capability, you only need to include and configure the right library for your environment to be able to stream lineage events from jobs running on the tool directly to their corresponding output logs or to a target HTTP endpoint that you specify.

With the target HTTP endpoint option, you can introduce a pattern to post lineage events from these tools into SageMaker or Amazon DataZone to further help you centralize governance of your data assets and processes in one place. This pattern takes the form of a proxy, and its simplified architecture is illustrated in the following figure.

The way the proxy for OpenLineage works is simple:

  • Amazon API Gateway exposes an HTTP endpoint and path. Jobs running with the OpenLineage package on top of the supported data processing tools can be set up with the HTTP transport option pointing to this endpoint and path. If connectivity allows, lineage events will be streamed into this endpoint as the job runs.
  • An Amazon Simple Queue Service (Amazon SQS) queue buffers the events as they arrive. By storing them in a queue, you have the option to implement strategies for retries and errors when needed. For cases where event order is required, we recommend using first-in-first-out (FIFO) queues; however, SageMaker and Amazon DataZone are able to map incoming OpenLineage events even when they arrive out of order.
  • An AWS Lambda function retrieves events from the queue in batches. For every event in a batch, the function can perform transformations when needed and post the resulting event to the target SageMaker or Amazon DataZone domain (see the sketch after this list).
  • Even though it's not shown in the architecture, AWS Identity and Access Management (IAM) and Amazon CloudWatch are key capabilities that allow secure interaction between resources with minimal permissions, plus logging for troubleshooting and observability.
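To make the Lambda step more concrete, the following is a minimal sketch of what such a handler could look like. This is not the implementation from the sample repository; the environment variable name and the pass-through logic (no transformations) are assumptions for illustration. It relies on the Amazon DataZone PostLineageEvent API, which accepts raw OpenLineage events.

import os

import boto3

datazone = boto3.client("datazone")

# Target domain for lineage events; assumed here to be passed in as an
# environment variable when the function is deployed.
DOMAIN_ID = os.environ["DATAZONE_DOMAIN_ID"]


def handler(event, context):
    # Lambda receives SQS messages in batches; each record body is the raw
    # OpenLineage event that a job streamed to the API Gateway endpoint.
    failures = []
    for record in event["Records"]:
        try:
            openlineage_event = record["body"].encode("utf-8")
            # Any transformations (for example, adjusting dataset names or
            # namespaces) would go here before posting the event.
            datazone.post_lineage_event(
                domainIdentifier=DOMAIN_ID,
                event=openlineage_event,
            )
        except Exception:
            # Report the message as failed so SQS can redeliver it.
            failures.append({"itemIdentifier": record["messageId"]})
    # Partial batch responses require the ReportBatchItemFailures setting
    # on the SQS event source mapping.
    return {"batchItemFailures": failures}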

The AWS sample OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone provides a working implementation of this simplified architecture that you can test and customize as needed. To deploy it in a test environment, follow the steps described in the repository. We use an AWS CloudFormation template to deploy the solution resources.

After you have deployed the OpenLineage HTTP Proxy solution, you can use it to post lineage events from data processing tools like dbt, Airflow, and Spark into a SageMaker or Amazon DataZone domain, as shown in the following examples.

Set up the OpenLineage package for Spark in AWS Glue 4.0

AWS Glue added built-in support for OpenLineage with AWS Glue 5.0 (to learn more, see Introducing AWS Glue 5.0 for Apache Spark). For jobs that are still running on AWS Glue 4.0, you can still stream OpenLineage events into SageMaker or Amazon DataZone by using the OpenLineage HTTP Proxy solution. This serves as an example that can be applied to other platforms running Spark, like Amazon EMR, third-party solutions, or self-managed clusters.

To add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage events into the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Download the official OpenLineage package for Spark. For our example, we used the JAR package version 2.12, release 1.9.1.
  2. Store the JAR file in an Amazon Simple Storage Service (Amazon S3) bucket that can be accessed by your AWS Glue job.
  3. On the AWS Glue console, open your job.
  4. Under Libraries, for Dependent JARs path, enter the path of the JAR package stored in your S3 bucket.

  5. In the Job parameters section, add the following parameters:
    1. Enable the OpenLineage package:
      1. Key: --user-jars-first
      2. Value: true
    2. Configure how the OpenLineage package will be used to stream lineage events. Replace <api-endpoint> and <resource-path> with the corresponding values of the OpenLineage HTTP Proxy solution; these values can be found as outputs of the deployed CloudFormation stack. Replace <account-id> with your AWS account ID.
      1. Key: --conf
      2. Value:
        spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
        --conf spark.openlineage.transport.type=http 
        --conf spark.openlineage.transport.url=<api-endpoint>
        --conf spark.openlineage.transport.endpoint=/<resource-path>
        --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
        --conf spark.glue.accountId=<account-id>

With this setup, the AWS Glue 4.0 job will use the HTTP transport option of the OpenLineage package to stream lineage events into the OpenLineage proxy, which will post the events to the SageMaker or Amazon DataZone domain.

  6. Run the AWS Glue 4.0 job. You can start the run from the console or programmatically, as in the sketch following this step.
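For reference, the following is a minimal sketch of how you could pass the same parameters when starting the job run with the AWS SDK for Python (Boto3) instead of setting them on the console. The job name and the angle-bracket placeholders are assumptions for illustration.

import boto3

glue = boto3.client("glue")

# The same configuration shown above, passed as job arguments at run time.
# Replace the placeholders with the outputs of the deployed CloudFormation
# stack and your AWS account ID; the job name is hypothetical.
spark_conf = (
    "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
    " --conf spark.openlineage.transport.type=http"
    " --conf spark.openlineage.transport.url=<api-endpoint>"
    " --conf spark.openlineage.transport.endpoint=/<resource-path>"
    " --conf spark.openlineage.facets.custom_environment_variables="
    "[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]"
    " --conf spark.glue.accountId=<account-id>"
)

response = glue.start_job_run(
    JobName="my-glue-4-job",
    Arguments={
        "--user-jars-first": "true",
        "--conf": spark_conf,
    },
)
print(response["JobRunId"])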

The job's resulting datasets should be sourced into SageMaker or Amazon DataZone so that the OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you'll get the same result.

The origin path in this example is extensive and maps the resulting dataset all the way down to its origin, in this case a few tables hosted in a relational database and transformed through a data pipeline with two AWS Glue 4.0 (Spark) jobs.

Set up the OpenLineage package for dbt

dbt has rapidly become a popular framework to build data pipelines on top of data processing and data warehouse tools like Amazon Redshift, Amazon EMR, and AWS Glue, as well as other traditional and third-party solutions. This framework supports OpenLineage as a way to standardize the generation of lineage events and integrate with the growing data governance ecosystem. dbt deployments might vary per environment, which is why we don't dive into the specifics in this post. However, to simply configure your dbt project to leverage the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Install the OpenLineage package for dbt. You can learn more in the OpenLineage documentation.
  2. In the root folder of your dbt project, create an openlineage.yml file where you can specify the transport configuration. Replace <api-endpoint> and <resource-path> with the values of the OpenLineage HTTP Proxy solution; these values can be found as outputs of the deployed CloudFormation stack.
transport:
  type: http
  url: <api-endpoint>
  endpoint: <resource-path>
  timeout: 5

  3. Run your dbt pipeline. As explained in the OpenLineage documentation, instead of running the standard dbt run command, you run the dbt-ol run command. The latter is just a wrapper on top of the standard dbt run command so that lineage events are captured and streamed as configured; a typical invocation is shown after this list.
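As a reference, assuming you invoke dbt from the project root where openlineage.yml lives, the commands could look like the following (the openlineage-dbt package provides the dbt-ol wrapper):

# Install the OpenLineage integration alongside dbt (one time)
pip install openlineage-dbt

# Run the pipeline through the wrapper; it picks up openlineage.yml from
# the current folder and streams lineage events as the models run
dbt-ol run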

The job's resulting datasets should be sourced into SageMaker or Amazon DataZone so that the OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its lineage path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you'll get the same result.

In this example, the dbt project is running on top of Amazon Redshift, which is a common use case among customers. Amazon Redshift is integrated with SageMaker and Amazon DataZone for automatic lineage capture, but those capabilities were not used as part of this example, to illustrate how you can still integrate OpenLineage events from dbt using the pattern implemented in the OpenLineage HTTP Proxy solution. The dbt pipeline is made up of two stages running sequentially, which are illustrated in the origin path as the nodes with the dbt type.

Set up the OpenLineage package for Airflow

Airflow is a well-positioned tool to orchestrate data pipelines at any scale. AWS provides Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed alternative for customers that want to reduce management overhead and accelerate the development of their data strategy with Airflow in a cost-effective way. Airflow also supports OpenLineage, so you can centralize lineage with tools like SageMaker and Amazon DataZone.

The following steps are specific to Amazon MWAA, but they can be extrapolated to other forms of Airflow deployment:

  1. Install the OpenLineage package for Airflow. You can learn more in the OpenLineage documentation. For versions 2.7 and later, it's recommended to use the native Airflow OpenLineage package (apache-airflow-providers-openlineage), which is the case for this example.
  2. To install the package, add it to the requirements.txt file that you're storing in Amazon S3 and pointing to when provisioning your Amazon MWAA environment. To learn more, refer to Managing Python dependencies in requirements.txt.
  3. As you install the OpenLineage package, or afterward, configure it to send lineage events to the OpenLineage proxy:
    1. When filling out the form to create a new Amazon MWAA environment or edit an existing one, in the Airflow configuration options section, add the following. Replace <api-endpoint> and <resource-path> with the values of the OpenLineage HTTP Proxy solution; these values can be found as outputs of the deployed CloudFormation stack:
      1. Configuration option: openlineage.transport
      2. Custom value: {"type": "http", "url": "<api-endpoint>", "endpoint": "<resource-path>"}

  4. Run your pipeline. A minimal example DAG is sketched after this list.
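For reference, the following is a minimal sketch of a DAG whose lineage the OpenLineage provider can extract automatically. The DAG name, connection ID, SQL statement, and table names are assumptions for illustration; note that no lineage-specific code is needed in the DAG itself, because the provider picks up the transport configuration set in the previous step.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# No OpenLineage-specific code is required here; the provider listens to
# task runs and emits lineage events using the configured HTTP transport.
with DAG(
    dag_id="load_sales_summary",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # SQLExecuteQueryOperator is one of the operators the OpenLineage
    # provider knows how to extract input and output datasets from.
    load_summary = SQLExecuteQueryOperator(
        task_id="load_summary",
        conn_id="redshift_default",  # hypothetical Airflow connection
        sql="""
            INSERT INTO analytics.sales_summary
            SELECT region, SUM(amount) FROM analytics.sales GROUP BY region;
        """,
    )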

The Airflow tasks will automatically use the transport configuration to stream lineage events into the OpenLineage proxy as they run. The tasks' resulting datasets should be sourced into SageMaker or Amazon DataZone so that the OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you'll get the same result.

In this example, the Amazon MWAA Directed Acyclic Graph (DAG) is running on top of Amazon Redshift, similar to the dbt example before. However, it's still not using the native integration for automatic lineage capture between Amazon Redshift and SageMaker or Amazon DataZone. This way, we can illustrate how you can still integrate OpenLineage events from Airflow using the pattern implemented in the OpenLineage HTTP Proxy solution. The Airflow DAG is made up of a single task that outputs the resulting dataset by using datasets that were created as part of the dbt pipeline in the previous example. This is illustrated in the origin path, which includes nodes with the dbt type and a node with the AIRFLOW type. With this final example, note how SageMaker and Amazon DataZone map all datasets and jobs to reflect the reality of your data pipelines.

Additional considerations when implementing the OpenLineage proxy pattern

The OpenLineage proxy pattern implemented in the sample OpenLineage HTTP Proxy solution and presented in this post has proven to be a practical approach to integrate a growing set of data processing tools into a centralized data governance strategy on top of SageMaker. We encourage you to dive into it and use it in your test environments to learn how it can best serve your specific setup. If you're interested in taking this pattern to production, we advise you to first review it thoroughly and customize it for your particular needs. The following are some items worth reviewing as you evaluate this pattern implementation:

  • The solution used in the examples of this post exposes a public API endpoint with no authentication or authorization mechanism. For a production workload, we recommend limiting access to the endpoint to a minimum, so only authorized sources are able to stream messages into it. To learn more, refer to Control and manage access to HTTP APIs in API Gateway.
  • The logic implemented in the Lambda function is intended to be customized depending on your use case. You might need to implement transformation logic, depending on how OpenLineage events are created by the tool you are using. As a reference, for the Amazon MWAA example of this post, some minor transformations were required on the name and namespace fields of the inputs and outputs elements of the event for full compatibility with the format expected for Amazon Redshift datasets, as described in the dataset naming conventions of OpenLineage (see the sketch after this list). You might also want to change how the function logs execution details, include retry and error logic, and more.
  • The SQS queue used in the OpenLineage HTTP Proxy solution is a standard queue, which means that events aren't delivered in order. If ordering is a requirement, you could use FIFO queues instead.
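To illustrate the kind of transformation the Lambda function might apply, the following is a hypothetical sketch that rewrites dataset name fields before an event is posted. The exact target format is not reproduced here; refer to the OpenLineage dataset naming conventions for the values Amazon Redshift datasets require.

import json


def normalize_redshift_datasets(raw_event: str) -> str:
    """Hypothetical transformation: adjust dataset name fields of an
    OpenLineage event before posting it to SageMaker or Amazon DataZone."""
    event = json.loads(raw_event)
    for key in ("inputs", "outputs"):
        for dataset in event.get(key, []):
            # Illustrative adjustment only: make names follow a
            # database.schema.table convention (see the OpenLineage
            # dataset naming conventions for the authoritative format).
            name = dataset.get("name", "")
            if name.count(".") == 1:  # schema.table -> database.schema.table
                dataset["name"] = f"dev.{name}"  # 'dev' is a placeholder database
    return json.dumps(event)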

For cases where you want to post OpenLineage events directly into SageMaker or Amazon DataZone, without using the proxy pattern explained in this post, a custom transport is now available as an extension of the OpenLineage project as of version 1.33.0. Leverage this feature in cases where you don't need additional controls on your OpenLineage event stream, for example, when you don't need any custom transformation logic.

Summary

In this post, we showed how to use the OpenLineage-compatible APIs of SageMaker to capture data lineage from any tool supporting this standard, by following an architectural pattern introduced as the OpenLineage proxy. We provided some examples of how you can set up tools like dbt, Airflow, and Spark to stream lineage events to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone domain. Finally, we introduced a working implementation of this pattern that you can test, and discussed some considerations for taking this same pattern to production.

The SageMaker compatibility with OpenLineage can help simplify governance of your data assets and improve trust in your data. This capability is one of the many features that are now available to build a comprehensive governance strategy powered by data lineage, data quality, business metadata, data discovery, access automation, and more. By bundling data governance capabilities with the growing set of tools available for data and AI development, you can derive value from your data faster and get closer to consolidating a data-driven culture. Try out this solution and get started with SageMaker to join the growing set of customers that are modernizing their data platform.


About the authors

Jose Romero is a Senior Solutions Architect for Startups at AWS, based in Austin, Texas. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on building products and their capabilities in data analytics and governance. She is passionate about building innovative products to address and simplify customers' challenges in their end-to-end data journey. Outside of work, she enjoys being outdoors to hike and capture nature's beauty. Connect with her on LinkedIn.
