Visualize data lineage using Amazon SageMaker Catalog for Amazon EMR, AWS Glue, and Amazon Redshift


Amazon SageMaker offers a comprehensive hub that integrates data, analytics, and AI capabilities, providing a unified experience for users to access and work with their data. Through Amazon SageMaker Unified Studio, a single and unified environment, you can use a wide range of tools and features to support your data and AI development needs, including data processing, SQL analytics, model development, training, inference, and generative AI development. This offering is further enhanced by the integration of Amazon Q and Amazon SageMaker Catalog, which provide an embedded generative AI and governance experience, helping users work efficiently and effectively across the entire data and AI lifecycle, from data preparation to model deployment and monitoring.

With the SageMaker Catalog data lineage feature, you can visually track and understand the flow of your data across different systems and teams, gaining a complete picture of your data assets and how they are connected. As an OpenLineage-compatible feature, it helps you trace data origins, track transformations, and view cross-organizational data consumption, giving you insights into cataloged assets, subscribers, and external activities. By capturing lineage events from OpenLineage-enabled systems or through APIs, you can gain a deeper understanding of your data's journey, including activities within SageMaker Catalog and beyond, ultimately driving better data governance, quality, and collaboration across your organization.

Additionally, the SageMaker Catalog data lineage feature versions each event, so you can track changes, visualize historical lineage, and compare transformations over time. This provides valuable insights into data evolution, facilitating troubleshooting, auditing, and data integrity by showing exactly how data assets have evolved, and it builds trust in the data.

In this post, we discuss the visualization of data lineage in SageMaker Catalog, how to capture lineage automatically from AWS analytics services such as AWS Glue, Amazon Redshift, and Amazon EMR Serverless, and how to visualize it in SageMaker Unified Studio.

Solution overview

Data lineage generation in SageMaker Catalog operates through an automated system that captures metadata and relationships between different data artifacts for AWS Glue, Amazon EMR, and Amazon Redshift. When data moves through various AWS services, SageMaker automatically tracks these movements, transformations, and dependencies, creating a detailed map of the data's journey. This tracking includes information about data sources, transformations, processing steps, and final outputs, providing a complete audit trail of data movement and transformation.

The implementation of data lineage in SageMaker Catalog provides several key benefits:

  • Compliance and audit support – Organizations can demonstrate compliance with regulatory requirements by showing complete data provenance and transformation history
  • Impact analysis – Teams can assess the potential impact of changes to data sources or transformations by understanding dependencies and relationships in the data pipeline
  • Troubleshooting and debugging – When issues arise, the lineage system helps identify the root cause by showing the complete path of data transformation and processing
  • Data quality management – By tracking transformations and dependencies, organizations can better maintain data quality and understand how data quality issues might propagate through their systems

Lineage capture is automated using several tools in SageMaker Unified Studio. To learn more, refer to the Data lineage support matrix.
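In addition to the automated capture, you can publish lineage events yourself through the API mentioned earlier. The following is a minimal sketch, assuming the DataZone PostLineageEvent API is available through boto3 in your Region; the domain ID, job namespace, and job name are placeholders you would replace with your own values.

import datetime
import json
import uuid

import boto3

DOMAIN_ID = "dzd_xxxxxxxx"  # placeholder: your SageMaker Catalog (DataZone) domain ID

datazone = boto3.client("datazone", region_name="us-west-2")

# A bare-bones OpenLineage RunEvent; real events typically carry input/output
# datasets and facets emitted by the producing engine.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "custom-pipeline", "name": "manual-lineage-example"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/custom-producer",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

# Send the event to the domain so it appears alongside automatically captured lineage.
response = datazone.post_lineage_event(
    domainIdentifier=DOMAIN_ID,
    event=json.dumps(run_event).encode("utf-8"),
    clientToken=str(uuid.uuid4()),  # idempotency token
)
print(response)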

In the following sections, we show you how to configure your resources and implement the solution. For this post, we create the solution resources in the us-west-2 AWS Region using an AWS CloudFormation template.

Prerequisites

Before getting started, make sure you have the following:

Configure SageMaker Unified Studio with AWS CloudFormation

The vpc-analytics-lineage-sus.yaml stack creates a VPC, subnets, security groups, IAM roles, a NAT gateway, an internet gateway, an Amazon Elastic Compute Cloud (Amazon EC2) client instance, S3 buckets, a SageMaker Unified Studio domain, and a SageMaker Unified Studio project. To create the solution resources, complete the following steps:

  1. Launch the stack vpc-analytics-lineage-sus using the CloudFormation template.
  2. Provide the parameter values as listed in the following table.
    Parameters Sample value
    DatazoneS3Bucket s3://datazone-{account_id}/
    DomainName dz-studio
    EnvironmentName sm-unifiedstudio
    PrivateSubnet1CIDR 10.192.20.0/24
    PrivateSubnet2CIDR 10.192.21.0/24
    PrivateSubnet3CIDR 10.192.22.0/24
    ProjectName sidproject
    PublicSubnet1CIDR 10.192.10.0/24
    PublicSubnet2CIDR 10.192.11.0/24
    PublicSubnet3CIDR 10.192.12.0/24
    UsersList analyst
    VpcCIDR 10.192.0.0/16

The stack creation process can take approximately 20 minutes to complete. You can check the Outputs tab for the stack after the stack is created.
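If you prefer to launch the stack programmatically instead of through the CloudFormation console, the following is a minimal boto3 sketch; it assumes the template has been downloaded locally as vpc-analytics-lineage-sus.yaml and reuses the sample parameter values from the preceding table (the account ID is a placeholder for your own).

import boto3

ACCOUNT_ID = "123456789012"  # placeholder: your AWS account ID

cfn = boto3.client("cloudformation", region_name="us-west-2")

# Assumes the template was saved locally as vpc-analytics-lineage-sus.yaml.
with open("vpc-analytics-lineage-sus.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="vpc-analytics-lineage-sus",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
    Parameters=[
        {"ParameterKey": "DatazoneS3Bucket", "ParameterValue": f"s3://datazone-{ACCOUNT_ID}/"},
        {"ParameterKey": "DomainName", "ParameterValue": "dz-studio"},
        {"ParameterKey": "EnvironmentName", "ParameterValue": "sm-unifiedstudio"},
        {"ParameterKey": "ProjectName", "ParameterValue": "sidproject"},
        {"ParameterKey": "UsersList", "ParameterValue": "analyst"},
        {"ParameterKey": "VpcCIDR", "ParameterValue": "10.192.0.0/16"},
        {"ParameterKey": "PublicSubnet1CIDR", "ParameterValue": "10.192.10.0/24"},
        {"ParameterKey": "PrivateSubnet1CIDR", "ParameterValue": "10.192.20.0/24"},
        # ...the remaining subnet CIDR parameters follow the same pattern
    ],
)

# Stack creation takes roughly 20 minutes; wait for it to finish.
cfn.get_waiter("stack_create_complete").wait(StackName="vpc-analytics-lineage-sus")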

Next, we prepare the source data and set up the AWS Glue ETL job, the Amazon EMR Serverless Spark job, and the Amazon Redshift job to generate lineage, and then capture and visualize that lineage in Amazon SageMaker Unified Studio.

Prepare data

The following is example data from our CSV files:

attendance.csv

EmployeeID,Date,ShiftStart,ShiftEnd,Absent,OvertimeHours
E1000,2024-01-01,2024-01-01 08:00:00,2024-01-01 16:22:00,False,3
E1001,2024-01-08,2024-01-08 08:00:00,2024-01-08 16:38:00,False,2
E1002,2024-01-23,2024-01-23 08:00:00,2024-01-23 16:24:00,False,3
E1003,2024-01-09,2024-01-09 10:00:00,2024-01-09 18:31:00,False,0
E1004,2024-01-15,2024-01-15 09:00:00,2024-01-15 17:48:00,False,1

employees.csv

EmployeeID,Name,Department,Role,HireDate,Salary,PerformanceRating,Shift,Location
E1000,Employee_0,Quality Control,Operator,2021-08-08,33002.0,1,Night,Plant C
E1001,Employee_1,Maintenance,Supervisor,2015-12-31,69813.76,5,Evening,Plant B
E1002,Employee_2,Production,Technician,2015-06-18,46753.32,1,Evening,Plant A
E1003,Employee_3,Admin,Supervisor,2020-10-13,52853.4,5,Night,Plant A
E1004,Employee_4,Quality Control,Supervisor,2023-09-21,55645.27,5,Evening,Plant A

Upload the sample data from attendance.csv and employees.csv to the S3 bucket specified in the previous CloudFormation stack (s3://datazone-{account_id}/csv/).
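As a convenience, the following is a minimal boto3 sketch for that upload; the account ID is a placeholder for your own, and the local file names assume you saved the sample data as attendance.csv and employees.csv.

import boto3

ACCOUNT_ID = "123456789012"  # placeholder: your AWS account ID
BUCKET = f"datazone-{ACCOUNT_ID}"

s3 = boto3.client("s3")

# Upload both sample files under the csv/ prefix used throughout this post.
for local_file in ("attendance.csv", "employees.csv"):
    s3.upload_file(local_file, BUCKET, f"csv/{local_file}")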

Ingest employee data into an Amazon Relational Database Service (Amazon RDS) for MySQL table

On the AWS CloudFormation console, open the stack vpc-analytics-lineage-sus and collect the Amazon RDS for MySQL database endpoint to use in the following commands against the default employeedb database.

  1. Connect to the Amazon EC2 instance that has the mysql package installed.
  2. Run the following command to connect to the database:
    mysql -u admin -h database-1.cuqd06l5efvw.us-west-2.rds.amazonaws.com -p

  3. Run the following commands to create an employee table:
    USE employeedb;
    
    CREATE TABLE employee (
      EmployeeID longtext,
      Name longtext,
      Department longtext,
      Role longtext,
      HireDate longtext,
      Salary longtext,
      PerformanceRating longtext,
      Shift longtext,
      Location longtext
    );

  4. Run the following command to insert rows (a quick verification sketch follows this procedure):
    INSERT INTO employee (EmployeeID, Name, Department, Role, HireDate, Salary, PerformanceRating, Shift, Location) VALUES ('E1000', 'Employee_0', 'Quality Control', 'Operator', '2021-08-08', 33002.00, 1, 'Night', 'Plant C'), ('E1001', 'Employee_1', 'Maintenance', 'Supervisor', '2015-12-31', 69813.76, 5, 'Evening', 'Plant B'), ('E1002', 'Employee_2', 'Production', 'Technician', '2015-06-18', 46753.32, 1, 'Evening', 'Plant A'), ('E1003', 'Employee_3', 'Admin', 'Supervisor', '2020-10-13', 52853.40, 5, 'Night', 'Plant A'), ('E1004', 'Employee_4', 'Quality Control', 'Supervisor', '2023-09-21', 55645.27, 5, 'Evening', 'Plant A');
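To confirm the rows landed before moving on, you can run a quick count from Python. This is a minimal sketch assuming the PyMySQL client library is installed; the endpoint and password are placeholders for your own RDS instance.

import pymysql  # assumption: pip install pymysql (any MySQL client works equally well)

# Placeholders: the RDS endpoint and credentials from the CloudFormation stack.
conn = pymysql.connect(
    host="database-1.cuqd06l5efvw.us-west-2.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="employeedb",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM employee")
        print("employee rows:", cur.fetchone()[0])  # expect 5
finally:
    conn.close()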

Capture lineage from an AWS Glue ETL job and notebook

To demonstrate the lineage, we set up an AWS Glue extract, transform, and load (ETL) job that reads the employee data from an Amazon RDS for MySQL table and the employee attendance data from Amazon S3, and joins both datasets. Finally, we write the data to Amazon S3 and create the attendance_with_emp1 table in the AWS Glue Data Catalog.

Create and configure an AWS Glue job for lineage generation

Complete the following steps to create your AWS Glue ETL job:

  1. On the AWS Glue console, create a new ETL job with AWS Glue version 5.0.
  2. Enable Generate lineage events and provide the domain ID (retrieve it from the CloudFormation stack output DataZoneDomainid; it has the format dzd_xxxxxxxx).
  3. Use the following code snippet in the AWS Glue ETL job script. Provide the S3 bucket (bucketname-{account_id}) used in the preceding CloudFormation stack.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from awsglue.context import GlueContext
    import sys
    import logging
    
    
    spark = SparkSession.builder.appName("lineageglue").enableHiveSupport().getOrCreate()
    # GlueContext is needed to read the JDBC connection details stored in the AWS Glue connection
    glueContext = GlueContext(spark.sparkContext)
    
    connection_details = glueContext.extract_jdbc_conf(connection_name="connectionname")
    
    # Read the employee table from Amazon RDS for MySQL over JDBC
    employee_df = spark.read.format("jdbc").option("url", "jdbc:mysql://dbhost:3306/database_name").option("dbtable", "employee").option("user", connection_details['user']).option("password", connection_details['password']).load()
    
    # Read the attendance CSV data from Amazon S3
    s3_paths = {
    'absent_data': 's3://bucketname-{account_id}/csv/attendance.csv'
    }
    absent_df = spark.read.csv(s3_paths['absent_data'], header=True, inferSchema=True)
    
    # Join both datasets on EmployeeID and write the result as a Data Catalog table
    joined_df = employee_df.join(absent_df, on="EmployeeID", how="inner")
    
    joined_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/attendanceparquet/").saveAsTable("gluedbname.tablename")

  4. Choose Run to start the job.
  5. On the Runs tab, confirm the job ran without failure.
  6. After the job has run successfully, navigate to the SageMaker Unified Studio domain.
  7. Choose Project and under Overview, choose Data Sources.
  8. Select the Data Catalog source (accountid-AwsDataCatalog-glue_db_suffix-default-datasource).
  9. On the Actions dropdown menu, choose Edit.
  10. Under Connection, enable Import data lineage.
  11. In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
  12. Update the data source and choose Run to create an asset called attendance_with_emp1 in SageMaker Catalog.
  13. Navigate to Assets, choose the attendance_with_emp1 asset, and navigate to the LINEAGE section.
The following lineage diagram shows an AWS Glue job that integrates data from two sources: employee information stored in Amazon RDS for MySQL and employee absence records stored in Amazon S3. The AWS Glue job combines these datasets through a join operation, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog, making the unified data available for further analysis or machine learning applications.
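You can also inspect the same lineage programmatically. The following is a minimal sketch assuming the DataZone lineage APIs (such as GetLineageNode) exposed through boto3; the node identifier is a hypothetical placeholder you would copy from the LINEAGE view or from ListLineageNodeHistory, and the response fields shown may differ slightly from what your node type returns.

import boto3

DOMAIN_ID = "dzd_xxxxxxxx"   # placeholder: your domain ID
NODE_ID = "lineage-node-id"  # hypothetical placeholder: a lineage node identifier

datazone = boto3.client("datazone", region_name="us-west-2")

# Fetch a single lineage node; the exact response shape can vary by node type.
node = datazone.get_lineage_node(
    domainIdentifier=DOMAIN_ID,
    identifier=NODE_ID,
)
print(node.get("name"), node.get("typeName"))
print("upstream:", node.get("upstreamNodes"))
print("downstream:", node.get("downstreamNodes"))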

Create and configure an AWS Glue notebook for lineage generation

Complete the following steps to create the AWS Glue notebook:

  1. On the AWS Glue console, choose Author using an interactive code notebook.
  2. Under Options, choose Start fresh and choose Create notebook.
  3. In the notebook, use the following code to generate lineage.

    In the following code, we add the required Spark configuration to generate lineage and then read CSV data from Amazon S3 and write it in Parquet format to a Data Catalog table. The Spark configuration includes the following parameters:

    • spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener – Registers the OpenLineage listener to capture Spark job execution events and metadata for lineage tracking
    • spark.openlineage.transport.type=amazon_datazone_api – Specifies Amazon DataZone as the destination service where the lineage data will be sent and stored
    • spark.openlineage.transport.domainId=dzd_xxxxxxx – Defines the unique identifier of your Amazon DataZone domain that the lineage data will be associated with
    • spark.glue.accountId={account_id} – Specifies the AWS account ID where the AWS Glue job is running for correct resource identification and access
    • spark.openlineage.facets.custom_environment_variables – Lists the specific environment variables to capture in the lineage data for context about the AWS and AWS Glue environment
    • spark.glue.JOB_NAME=lineagenotebook – Sets a unique identifier name for the AWS Glue job that will appear in lineage tracking and logs

    See the following code:

    %%configure --name project.spark -f
    {
    "--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
    --conf spark.openlineage.transport.type=amazon_datazone_api 
    --conf spark.openlineage.transport.domainId=dzd_xxxxxxxx 
    --conf spark.glue.accountId={account_id} 
    --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
    --conf spark.glue.JOB_NAME=lineagenotebook"
    }
    
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    import sys
    import logging
    
    
    spark = SparkSession.builder.appName("lineagegluenotebook").enableHiveSupport().getOrCreate()
    
    # Read the attendance CSV data from Amazon S3
    s3_paths = {
    'absent_data': 's3://datazone-{account_id}/csv/attendance.csv'
    }
    absent_df = spark.read.csv(s3_paths['absent_data'], header=True, inferSchema=True)
    
    # Write the data in Parquet format and register it as a Data Catalog table
    absent_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/attendanceparquet2/").saveAsTable("gluedbname.tablename")

  4. After the notebook has run successfully, navigate to the SageMaker Unified Studio domain.
  5. Choose Project and under Overview, choose Data Sources.
  6. Choose the Data Catalog source ({account_id}-AwsDataCatalog-glue_db_suffix-default-datasource).
  7. Choose Run to create the asset attendance_with_empnote in SageMaker Catalog.
  8. Navigate to Assets, choose the attendance_with_empnote asset, and navigate to the LINEAGE section.

The following lineage diagram shows an AWS Glue job that reads data from the employee absence records stored in Amazon S3. The AWS Glue job transforms the CSV data into Parquet format, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog.

Capture lineage from Amazon Redshift

To demonstrate the lineage, we create an employee table and an attendance table and join both datasets. Finally, we create a new table called employeewithabsent in Amazon Redshift. Complete the following steps to create and configure lineage for Amazon Redshift tables:

  1. In SageMaker Unified Studio, open your domain.
  2. Under Compute, choose Data warehouse.
  3. Open project.redshift and copy the endpoint name (redshift-serverless-workgroup-xxxxxxx).
  4. On the Amazon Redshift console, open Query Editor v2 and connect to the Redshift Serverless workgroup with a secret. Use the AWS Secrets Manager option and choose the secret redshift-serverless-namespace-xxxxxxxx.
  5. Use the following code to create tables in Amazon Redshift and load data from Amazon S3 using the COPY command. Make sure the IAM role has GetObject permission on the S3 files attendance.csv and employees.csv.

    Create the Redshift table absent:

    CREATE TABLE public.absent (
        employeeid character varying(65535),
        date date,
        shiftstart timestamp without time zone,
        shiftend timestamp without time zone,
        absent boolean,
        overtimehours integer
    );

    Load data into the absent table:

    COPY absent
    FROM 's3://datazone-{account_id}/csv/attendance.csv' 
    IAM_ROLE 'arn:aws:iam::accountid:role/RedshiftAdmin'
    csv
    IGNOREHEADER 1;

    Create the Redshift table employee:

    CREATE TABLE public.employee (
        employeeid character varying(65535),
        name character varying(65535),
        department character varying(65535),
        role character varying(65535),
        hiredate date,
        salary double precision,
        performancerating integer,
        shift character varying(65535),
        location character varying(65535)
    );

    Load data into the employee table:

    COPY employee
    FROM 's3://datazone-{account_id}/csv/employees.csv' 
    IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftAdmin'
    csv
    IGNOREHEADER 1;

  6. After the tables are created and the data is loaded, perform the join between the tables and create a new table with a CTAS query (you can also run this statement through the Redshift Data API, as shown in the sketch at the end of this section):
    CREATE TABLE public.employeewithabsent AS
    SELECT 
      e.*,
      a.absent,
      a.overtimehours
    FROM public.employee e
    INNER JOIN public.absent a
    ON e.employeeid = a.employeeid;

  7. Navigate to the SageMaker Unified Studio domain.
  8. Choose Project and under Overview, choose Data Sources.
  9. Select the Amazon Redshift source (RedshiftServerless-default-redshift-datasource).
  10. On the Actions dropdown menu, choose Edit.
  11. Under Connection, enable Import data lineage.
  12. In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
  13. Update the data source and choose Run to create an asset called employeewithabsent in SageMaker Catalog.
  14. Navigate to Assets, choose the employeewithabsent asset, and navigate to the LINEAGE section.

The following lineage diagram shows the join between two Redshift tables and the creation of a new Redshift table, which is registered as an asset in SageMaker Catalog.
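If you prefer to run these statements outside Query Editor v2, the following is a minimal sketch using the Redshift Data API through boto3; the workgroup name is a placeholder for the one you copied earlier, and the call assumes your IAM identity (or a Secrets Manager secret) is authorized for the workgroup.

import time

import boto3

WORKGROUP = "redshift-serverless-workgroup-xxxxxxx"  # placeholder: your workgroup name

rsd = boto3.client("redshift-data", region_name="us-west-2")

sql = """
CREATE TABLE public.employeewithabsent AS
SELECT e.*, a.absent, a.overtimehours
FROM public.employee e
INNER JOIN public.absent a ON e.employeeid = a.employeeid;
"""

# Submit the CTAS statement against the Redshift Serverless workgroup.
resp = rsd.execute_statement(WorkgroupName=WORKGROUP, Database="dev", Sql=sql)

# Poll until the statement reaches a terminal state.
while True:
    status = rsd.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        print("Statement status:", status)
        break
    time.sleep(2)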

Capture lineage from an EMR Serverless job

To demonstrate the lineage, we read employee data from an RDS for MySQL table and the attendance dataset from Amazon Redshift, and join both datasets. Finally, we write the data to Amazon S3 and create the attendance_with_employee table in the Data Catalog. Complete the following steps:

  1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  2. To create or manage EMR Serverless applications, you need the EMR Studio UI.
    1. If you already have an EMR Studio in the Region where you want to create an application, choose Manage applications to navigate to your EMR Studio, or select the EMR Studio that you want to use.
    2. If you don’t have an EMR Studio in the Region where you want to create an application, choose Get started and then choose Create and launch Studio. EMR Serverless creates an EMR Studio for you so you can create and manage applications.
  3. In the Create studio UI that opens in a new tab, enter the name, type, and release version for your application.
  4. Choose Create application.
  5. Create an EMR Serverless Spark application with the following configuration:
    1. For Type, choose Spark.
    2. For Release version, choose emr-7.8.0.
    3. For Architecture, choose x86_64.
    4. For Application setup options, select Use custom settings.
    5. For Interactive endpoint, enable the endpoint for EMR Studio.
    6. For Application configuration, use the following configuration:
      [{
          "Classification": "iceberg-defaults",
          "Properties": {
              "iceberg.enabled": "true"
          }
      }]

  6. Choose Create and start application.
  7. After the application has started, submit the Spark application to generate lineage events. Copy the following script and upload it to the S3 bucket (s3://datazone-{account_id}/script/). Upload the mysql-connector-java JAR file to the S3 bucket (s3://datazone-{account_id}/jars/) so the job can read the data from MySQL.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    import sys
    import logging
    
    
    spark = SparkSession.builder.appName("lineageglue").enableHiveSupport().getOrCreate()
    
    # Read the employee table from Amazon RDS for MySQL over JDBC
    employee_df = spark.read.format("jdbc").option("driver", "com.mysql.cj.jdbc.Driver").option("url", "jdbc:mysql://dbhostname:3306/databasename").option("dbtable", "employee").option("user", "admin").option("password", "xxxxxxx").load()
    
    # Read the absent table from Amazon Redshift over JDBC
    absent_df = spark.read.format("jdbc").option("url", "jdbc:redshift://redshiftserverlessendpoint:5439/dev").option("dbtable", "public.absent").option("user", "admin").option("password", "xxxxxxxxxx").load()
    
    # Join both datasets on EmployeeID and write the result as a Data Catalog table
    joined_df = employee_df.join(absent_df, on="EmployeeID", how="inner")
    
    joined_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/emrparquetnew/").saveAsTable("gluedbname.tablename")

  8. After you upload the script, use the following command to submit the Spark application (for a boto3 alternative, see the sketch at the end of this section). Change the following parameters according to your environment details:
    1. application-id: Provide the Spark application ID you generated.
    2. execution-role-arn: Provide the EMR execution role.
    3. entryPoint: Provide the Spark script S3 path.
    4. domainID: Provide the domain ID (from the CloudFormation stack output DataZoneDomainid: dzd_xxxxxxxx).
    5. accountID: Provide your AWS account ID.
      aws emr-serverless start-job-run --application-id 00frv81tsqe0ok0l --execution-role-arn arn:aws:iam::{account_id}:role/service-role/AmazonEMR-ExecutionRole-1717662744320 --name "Spark-Lineage" --job-driver '{
              "sparkSubmit": {
                  "entryPoint": "s3://datazone-{account_id}/script/emrspark2.py",
                  "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=2 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.jars=/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar,s3://datazone-{account_id}/jars/mysql-connector-java-8.0.20.jar --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=dzd_xxxxxxxx --conf spark.glue.accountId={account_id}"
              }
          }'

  9. After the job has run successfully, navigate to the SageMaker Unified Studio domain.
  10. Choose Project and under Overview, choose Data Sources.
  11. Select the Data Catalog source ({account_id}-AwsDataCatalog-glue_db_xxxxxxxxxx-default-datasource).
  12. On the Actions dropdown menu, choose Edit.
  13. Under Connection, enable Import data lineage.
  14. In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
  15. Update the data source and choose Run to create an asset called attendancewithempnew in SageMaker Catalog.
  16. Navigate to Assets, choose the attendancewithempnew asset, and navigate to the LINEAGE section.

The following lineage diagram shows the EMR Serverless Spark job integrating employee information stored in Amazon RDS for MySQL with employee absence records stored in Amazon Redshift. The job combines these datasets through a join operation, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog.
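As an alternative to the CLI command in step 8, the following is a minimal boto3 sketch that submits the same job and polls it until it finishes; only the lineage-related Spark settings are shown, and the application ID, execution role ARN, bucket, and domain ID are placeholders for your own values.

import time

import boto3

ACCOUNT_ID = "123456789012"          # placeholder: your AWS account ID
APPLICATION_ID = "00frv81tsqe0ok0l"  # placeholder: your EMR Serverless application ID
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/service-role/AmazonEMR-ExecutionRole-1717662744320"
DOMAIN_ID = "dzd_xxxxxxxx"           # placeholder: your domain ID

emr = boto3.client("emr-serverless", region_name="us-west-2")

# Only the lineage-related Spark configuration is included in this sketch.
spark_params = (
    "--conf spark.jars=/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar,"
    f"s3://datazone-{ACCOUNT_ID}/jars/mysql-connector-java-8.0.20.jar "
    "--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener "
    "--conf spark.openlineage.transport.type=amazon_datazone_api "
    f"--conf spark.openlineage.transport.domainId={DOMAIN_ID} "
    f"--conf spark.glue.accountId={ACCOUNT_ID}"
)

run = emr.start_job_run(
    applicationId=APPLICATION_ID,
    executionRoleArn=ROLE_ARN,
    name="Spark-Lineage",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": f"s3://datazone-{ACCOUNT_ID}/script/emrspark2.py",
            "sparkSubmitParameters": spark_params,
        }
    },
)

# Poll the run until it reaches a terminal state.
while True:
    state = emr.get_job_run(applicationId=APPLICATION_ID, jobRunId=run["jobRunId"])["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        print("Job run state:", state)
        break
    time.sleep(30)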

Clean up

To clean up your resources, complete the following steps:

  1. On the AWS Glue console, delete the AWS Glue job.
  2. On the Amazon EMR console, delete the EMR Serverless Spark application and the EMR Studio.
  3. On the AWS CloudFormation console, delete the CloudFormation stack vpc-analytics-lineage-sus.

Conclusion

In this post, we showed how data lineage in SageMaker Catalog helps you track and understand the complete lifecycle of your data across various AWS analytics services. This comprehensive tracking provides visibility into how data flows through different processing stages, transformations, and analytical workflows, making it an essential tool for data governance, compliance, and operational efficiency.

Try out these lineage visualization techniques for your own use cases, and share your questions and feedback in the comments section.


About the Authors

Shubham Purwar

Shubham is an AWS Analytics Specialist Solutions Architect. He helps organizations unlock the full potential of their data by designing and implementing scalable, secure, and high-performance analytics solutions on the AWS platform. With deep expertise in AWS analytics services, he collaborates with customers to uncover their distinct business requirements and create customized solutions that deliver actionable insights and drive business growth. In his free time, Shubham loves to spend time with his family and travel around the world.

Nitin Kumar

Nitin is a Cloud Engineer (ETL) at Amazon Web Services, specializing in AWS Glue. With a decade of experience, he excels at assisting customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.

Prashanthi Chinthala

Prashanthi is a Cloud Engineer (DIST) at AWS. She helps customers overcome EMR challenges and develop scalable data processing and analytics pipelines on AWS.
