
Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0


The AWS Glue Data Catalog has expanded its Data Catalog views feature, and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. Multi-dialect views empower data teams to create SQL views once and query them through supported engines, whether it's Athena for ad hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Lake Formation features such as fine-grained access control (FGAC) using Lake Formation tag-based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated creating Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the Athena SQL dialect to the view, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for enterprise solutions:

  • Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that provide sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
  • Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing, all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
  • Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful options for businesses to enhance data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.

Solution overview

An example company has several datasets containing details of their customers’ purchases mixed with personally identifiable information (PII). They categorize their datasets based on the sensitivity of the information. The data steward wants to share a subset of their preferred customers’ data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter by preferred customers. We then use LF-Tags to share restricted columns of this view with the downstream engineering team. The solution is represented in the following diagram.

arch diagram

Prerequisites

To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use this role to run the provided AWS CloudFormation templates, and the same IAM role is also added as a Lake Formation administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

  • Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
  • Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as a Lake Formation administrator. Cross-account sharing version is set to 4. Default permissions for newly created databases and tables are set to use Lake Formation permissions only.
    data catalog settings
  • An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
    data lake locations
  • An AWS Glue database for the data lake.
  • Lake Formation tags. These tags are attached to the database.
    lf-tags
  • CSV and Iceberg format tables in the AWS Glue database. The CSV tables point to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the user account’s data lake bucket.
  • An Athena workgroup.
  • An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from the CSV tables into the Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
    lf-tables
  • An EMR Studio and associated virtual private cloud (VPC), subnet, route table, security groups, and EMR Studio service IAM role.
  • An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is the IAM role with the necessary permissions to access the referenced tables, and it runs the SQL statement that defines the view.

Complete the following steps in your producer AWS account:

  1. Sign in to the AWS Management Console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  1. If you’re using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to the runtime role GlueViewBlog-EMRStudio-RuntimeRole.
    data permissions

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

  1. On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
  2. Choose GlueViewBlog-emrstudio and choose the URL link of the Studio to open it.
    glueviewblog-emrstudio
  3. On the EMR Studio dashboard, choose Create application.
    emr-studio-dashboard

You’ll be directed to the Create application page in EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application.

  1. Under Application settings, provide the following information:
    1. For Name, enter a name (for example, emr-glueview-application).
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.8.0.
    4. For Architecture, choose x86_64.
  2. Under Application setup options, select Use custom settings.
  3. Under Interactive endpoint, select Enable endpoint for EMR Studio.
  4. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  5. Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  6. Choose Create and start application.

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

  1. On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a Workspace name (for example, emrs-glueviewblog-workspace).
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might ask you to allow pop-ups the first time you launch the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

  1. Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog__db.
  2. In your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
  3. Upload glueviewblog_producer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
  5. Update the database_name, table1_name, and table2_name to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.
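The notebook’s core step can be sketched as follows. This is a minimal PySpark sketch, not the notebook’s exact contents: it assumes the database, table, and column names used in this walkthrough, and the `CREATE PROTECTED MULTI DIALECT VIEW ... SECURITY DEFINER` DDL shown here should be treated as illustrative of the multi-dialect view syntax rather than definitive for your EMR release.

```python
def build_create_view_sql(database: str, view: str, t1: str, t2: str) -> str:
    """Return DDL for a multi-dialect Data Catalog view (illustrative syntax).

    The EMR runtime role that runs this statement acts as the view's
    definer role, so it must hold Lake Formation permissions on t1 and t2.
    """
    return f"""CREATE PROTECTED MULTI DIALECT VIEW {database}.{view}
SECURITY DEFINER AS
SELECT c_customer_id, c_customer_sk, c_last_review_date,
       ca_country, ca_location_type
FROM {database}.{t1}, {database}.{t2}
WHERE c_current_addr_sk = ca_address_sk
  AND c_preferred_cust_flag = 'Y'"""

create_view_sql = build_create_view_sql(
    "glueviewblog__db", "customer_nonpii_view",
    "customer_iceberg", "customer_address_iceberg")

# In the EMR Studio PySpark notebook, with a Lake Formation enabled session:
# spark.sql(create_view_sql)
# spark.sql("SELECT * FROM glueviewblog__db.customer_nonpii_view LIMIT 5").show()
```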

  1. In the navigation pane on the Lake Formation console, under Data Catalog, choose Views.
  2. Choose the new view customer_nonpii_view.
  3. On the SQL definitions tab, verify that EMR with Apache Spark shows up for Engine name.
  4. Choose the LF-Tags tab. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
  5. Choose Edit LF-Tags.
  6. On the Values dropdown menu, choose confidential to overwrite the Data Catalog view’s value for the sensitivity key from pii-confidential.
  7. Choose Save.

With this, we have created a non-PII view to share with the data engineering team from datasets that contain PII information about customers.

Add the Athena SQL dialect to the view

Because the view customer_nonpii_view was created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin has only describe permissions on it as the database creator and Lake Formation administrator. In this step, the Admin grants itself alter permissions on the view in order to add the Athena SQL dialect.

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, enter Admin.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential and pii-confidential.
    5. Under Database permissions, select Super for Database permissions and Grantable permissions.
    6. Under Table permissions, select Super for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the LF-Tags based permissions for the Admin.
  4. Open the Athena query editor, choose the workgroup GlueViewBlogWorkgroup, and choose the AWS Glue database glueviewblog__db.
  5. Run the following query. Replace with your account ID.
    ALTER VIEW glueviewblog__db.customer_nonpii_view ADD DIALECT
    AS
    SELECT c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
    FROM glueviewblog__db.customer_iceberg, glueviewblog__db.customer_address_iceberg
    WHERE c_current_addr_sk = ca_address_sk AND c_preferred_cust_flag='Y';

  6. Verify the Athena dialect by running a preview on the view.
  7. On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.
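You can also check the registered dialects programmatically. The following sketch assumes the Glue `get_table` response for a multi-dialect view carries a `ViewDefinition` with a per-engine `Representations` list; the helper below parses that shape, and the `sample` dictionary is a hypothetical illustration of it, not captured service output.

```python
def view_dialects(table: dict) -> list:
    """Extract SQL dialect names from a Glue get_table 'Table' payload."""
    reps = table.get("ViewDefinition", {}).get("Representations", [])
    return [r["Dialect"] for r in reps]

def fetch_view_dialects(database: str, view: str, region: str):
    """Fetch and parse the dialects for a Data Catalog view (needs AWS credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS
    glue = boto3.client("glue", region_name=region)
    resp = glue.get_table(DatabaseName=database, Name=view)
    return view_dialects(resp["Table"])

# Hypothetical response shape this parses, with both dialects registered:
sample = {"ViewDefinition": {"Representations": [
    {"Dialect": "SPARK", "ViewOriginalText": "SELECT ..."},
    {"Dialect": "ATHENA", "ViewOriginalText": "SELECT ..."},
]}}
```

After the ALTER VIEW statement succeeds, you would expect `fetch_view_dialects("glueviewblog__db", "customer_nonpii_view", region)` to return both the Spark and Athena dialects.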

Share the view with the consumer account

Complete the following steps to share the Data Catalog view with the consumer account:

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the consumer account ID.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential.
    5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
    6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the granted permissions on the Data permissions page.

With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it with their analytics account using LF-Tags to run further processing of the data downstream.

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:

  • An S3 bucket for Amazon EMR and AWS Glue logs
  • Lake Formation administrator and catalog settings similar to the producer account setup
  • An AWS Glue database for the data lake
  • An EMR Studio and associated VPC, subnet, route table, security groups, and EMR Studio service IAM role
  • An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

  1. Sign in to the console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  1. If you’re using the consumer account Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to the runtime role GlueViewBlog-EMRStudio-Client-RuntimeRole.

Accept AWS RAM shares in the consumer account

You can now log in to the consumer AWS account and accept the AWS RAM invitations:

  1. Open the AWS RAM console with the IAM role that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

  1. Accept both invitations.

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name anywhere you would use the view name. Additionally, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Client-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Tables.
  3. Choose Create and Resource link.
  4. For Resource link name, enter the name of the resource link (for example, customer_nonpii_view_rl).
  5. For Database, choose the glueviewblog_customer__db database.
  6. For Shared table region, choose the Region of the shared table.
  7. For Shared table, choose customer_nonpii_view.
  8. Choose Create.
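The same resource link can be created programmatically with the Glue `create_table` API, passing a `TargetTable` that points at the shared view in the producer account. A minimal sketch, assuming the names used in this walkthrough and a placeholder producer account ID:

```python
def resource_link_input(link_name: str, target_catalog_id: str,
                        target_database: str, target_name: str) -> dict:
    """Build the Glue TableInput for a Lake Formation resource link."""
    return {
        "Name": link_name,
        "TargetTable": {
            "CatalogId": target_catalog_id,   # producer account ID
            "DatabaseName": target_database,  # database in the producer catalog
            "Name": target_name,              # the shared view
        },
    }

table_input = resource_link_input(
    "customer_nonpii_view_rl",
    "111122223333",                # placeholder producer account ID
    "glueviewblog__db",
    "customer_nonpii_view")

def create_resource_link(consumer_database: str, table_input: dict, region: str):
    """Create the resource link (needs consumer-account credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS
    glue = boto3.client("glue", region_name=region)
    glue.create_table(DatabaseName=consumer_database, TableInput=table_input)

# Usage: create_resource_link("glueviewblog_customer__db", table_input, "us-east-1")
```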

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer__db to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Select the database glueviewblog_customer__db and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the Database permissions section, select Describe.
  5. Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the Resource link permissions section, select Describe for Resource link permissions.
  5. Choose Grant.

This allows the EMR Serverless job runtime role to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.

Grant permissions on the target of the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target of the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant on target.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the View permissions section, select Select and Describe.
  5. Choose Grant.
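The three console grants above can also be scripted with the Lake Formation `grant_permissions` API. This is a hedged sketch under stated assumptions: the account IDs are placeholders, and the database and table names follow this walkthrough. The third request targets the view in the producer catalog, which is what "Grant on target" does on the console.

```python
def grant_requests(consumer_account: str, producer_account: str,
                   role_arn: str) -> list:
    """Build the three grant_permissions requests: Describe on the database,
    Describe on the resource link, and Select/Describe on the link's target."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {"Principal": principal,
         "Resource": {"Database": {"CatalogId": consumer_account,
                                   "Name": "glueviewblog_customer__db"}},
         "Permissions": ["DESCRIBE"]},
        {"Principal": principal,
         "Resource": {"Table": {"CatalogId": consumer_account,
                                "DatabaseName": "glueviewblog_customer__db",
                                "Name": "customer_nonpii_view_rl"}},
         "Permissions": ["DESCRIBE"]},
        {"Principal": principal,
         "Resource": {"Table": {"CatalogId": producer_account,
                                "DatabaseName": "glueviewblog__db",
                                "Name": "customer_nonpii_view"}},
         "Permissions": ["SELECT", "DESCRIBE"]},
    ]

def apply_grants(requests: list, region: str):
    """Apply the grants (needs Lake Formation admin credentials in the consumer account)."""
    import boto3  # imported lazily; only needed when actually calling AWS
    lf = boto3.client("lakeformation", region_name=region)
    for req in requests:
        lf.grant_permissions(**req)

reqs = grant_requests(
    "111122223333",  # placeholder consumer account ID
    "444455556666",  # placeholder producer account ID
    "arn:aws:iam::111122223333:role/GlueViewBlog-EMRStudio-Client-RuntimeRole")
```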

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Client-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

  1. Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
  2. In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
  3. Upload glueviewblog_emr_consumer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  5. Update the database to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel with the PySpark kernel and rerun the notebook.
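The consumer notebook’s query boils down to a select on the resource link, which stands in for the shared view. A minimal sketch, assuming the consumer database and resource link names used above (the selected columns are the non-PII columns the view exposes):

```python
consumer_database = "glueviewblog_customer__db"
resource_link = "customer_nonpii_view_rl"

# The resource link name is usable anywhere the view name would be.
query = f"""SELECT c_customer_id, ca_country, ca_location_type
FROM {consumer_database}.{resource_link}
LIMIT 10"""

# In the EMR Studio PySpark notebook, with the Client runtime role attached:
# spark.sql(query).show()
```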

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

  1. Download the notebook glueviewblog_glue_consumer.ipynb.
  2. Open the AWS Glue Studio console.
  3. Choose Notebook and then choose Upload notebook.
  4. Upload the notebook glueviewblog_glue_consumer.ipynb.
  5. For IAM role, choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  6. Choose Create notebook.
  7. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  8. Update the database to match your resources.
  9. Save the notebook.
  10. Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared with the consumer account, the Lake Formation administrator has access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it also has access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.

Clean up

To avoid incurring ongoing costs, clean up your resources:

  1. In your consumer account, delete the AWS Glue notebook, stop and delete the EMR application, and then delete the EMR Workspace.
  2. In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
  3. In your producer account, log in to the Lake Formation console and revoke the LF-Tags based permissions you had granted to the consumer account.
  4. In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
  5. In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
  6. Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through the following key steps:

  • Create a Data Catalog view using Spark in EMR Serverless within one AWS account
  • Securely share this view with another account using LF-TBAC
  • Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
  • Implement this solution with Iceberg tables (it’s also compatible with other open table formats like Apache Hudi and Delta Lake)

The multi-dialect Data Catalog view approach presented in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while strengthening data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lakehouse solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance solutions for EMR Spark.
