
Introducing Apache Iceberg materialized views in AWS Glue Data Catalog


Hundreds of thousands of customers build artificial intelligence and machine learning (AI/ML) and analytics applications on AWS, frequently transforming data through multiple stages for improved query performance: from raw data to processed datasets to final analytical tables. Data engineers must solve complex problems, including detecting what data has changed in base tables, writing and maintaining transformation logic, scheduling and orchestrating workflows across dependencies, provisioning and managing compute infrastructure, and troubleshooting failures while monitoring pipeline health. Consider an ecommerce company where data engineers need to continuously merge clickstream logs with orders data for analytics. Each transformation requires building robust change detection mechanisms, writing complex joins and aggregations, coordinating multiple workflow steps, scaling compute resources appropriately, and maintaining operational oversight, all while supporting data quality and pipeline reliability. This complexity demands months of dedicated engineering effort and ongoing maintenance, making data transformation costly and time-intensive for organizations seeking to unlock insights from their data.

To address these challenges, AWS announced a new materialized view capability for Apache Iceberg tables in the AWS Glue Data Catalog. The new materialized view capability simplifies data pipelines and accelerates data lake query performance. A materialized view is a managed table in the AWS Glue Data Catalog that stores pre-computed results of a query in Iceberg format and is incrementally updated to reflect changes to the underlying datasets. This alleviates the need to build and maintain complex data pipelines to generate transformed datasets and accelerates query performance. Apache Spark engines across Amazon Athena, Amazon EMR, and AWS Glue support the new materialized views and intelligently rewrite queries to use materialized views, which speeds up performance while reducing compute costs.

In this post, we show you how Iceberg materialized views work and how you can get started.

How Iceberg materialized views work

Iceberg materialized views offer a simple, managed solution built on familiar SQL syntax. Instead of building complex pipelines, you can create materialized views using standard SQL queries from Spark, transforming data with aggregates, filters, and joins without writing custom data pipelines. Change detection, incremental updates, and monitoring of source tables are automatically handled in the AWS Glue Data Catalog, which refreshes materialized views as new data arrives, alleviating the need for manual pipeline orchestration. Data transformations run on fully managed compute infrastructure, removing the burden of provisioning, scaling, or maintaining servers.
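
For instance, the ecommerce scenario described earlier could be expressed as a single materialized view definition. The following is a minimal sketch issued through spark.sql: the clickstream_logs and orders tables, their columns, and the analytics database are hypothetical, and the walkthrough later in this post uses a simpler aggregation on a sample table.

# Minimal sketch of a join-based materialized view. Table, column, and database
# names are illustrative only. Assumes a SparkSession configured as shown in the
# prerequisites section below.
spark.sql("""
    CREATE MATERIALIZED VIEW daily_order_clicks
    REFRESH EVERY 12 HOURS
    AS SELECT
        o.order_date,
        COUNT(DISTINCT c.session_id) AS sessions,
        SUM(o.amount) AS total_amount
    FROM glue_catalog.analytics.clickstream_logs c
    JOIN glue_catalog.analytics.orders o
        ON c.order_id = o.order_id
    GROUP BY o.order_date
""")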

The resulting pre-computed data is stored as Iceberg tables in an Amazon Simple Storage Service (Amazon S3) general purpose bucket or an Amazon S3 Tables bucket within your account, making transformed data immediately accessible to multiple query engines, including Athena, Amazon Redshift, and the AWS optimized Spark runtime. Spark engines across Athena, Amazon EMR, and AWS Glue support an automatic query rewrite capability that intelligently uses materialized views, delivering automatic performance improvements for data processing jobs and interactive notebook queries.

In the following sections, we walk through the steps to create, query, and refresh materialized views.

Prerequisites

To follow along with this post, you must have an AWS account.

To run the instructions on Amazon EMR, complete the following steps to configure the cluster:

  1. Launch an Amazon EMR cluster with version 7.12.0 or higher.
  2. Log in to the primary node of your Amazon EMR cluster using SSH, and run the following command to start a Spark application with the required configurations:
    spark-sql \
      --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
      --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.glue_catalog.type=glue \
      --conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse \
      --conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 \
      --conf spark.sql.catalog.glue_catalog.glue.id=123456789012 \
      --conf spark.sql.catalog.glue_catalog.glue.account-id=123456789012 \
      --conf spark.sql.catalog.glue_catalog.client.region=us-east-1 \
      --conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true \
      --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
      --conf spark.sql.defaultCatalog=glue_catalog
      

To run the instructions on AWS Glue for Spark, complete the following steps to configure the job:

  1. Create an AWS Glue version 5.1 job or higher.
  2. Configure a job parameter:
    1. Key: --conf
    2. Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  3. Configure your job with the following script:
    from pyspark.sql import SparkSession


    # Create a Spark session configured for Iceberg materialized views in the
    # AWS Glue Data Catalog. Replace the bucket, account ID, and Region with your own.
    spark = (
        SparkSession.builder
            .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.glue_catalog.type", "glue")
            .config("spark.sql.catalog.glue_catalog.warehouse", "s3://amzn-s3-demo-bucket/warehouse")
            .config("spark.sql.catalog.glue_catalog.glue.region", "us-east-1")
            .config("spark.sql.catalog.glue_catalog.glue.id", "123456789012")
            .config("spark.sql.catalog.glue_catalog.glue.account-id", "123456789012")
            .config("spark.sql.catalog.glue_catalog.client.region", "us-east-1")
            .config("spark.sql.catalog.glue_catalog.glue.lakeformation-enabled", "true")
            .config("spark.sql.optimizer.answerQueriesWithMVs.enabled", "true")
            .config("spark.sql.defaultCatalog", "glue_catalog")
            .getOrCreate()
    )

  4. Run the following queries using Spark SQL to set up a base table. In AWS Glue, you can run them through spark.sql("QUERY STATEMENT").
    CREATE DATABASE IF NOT EXISTS iceberg_mv;
    
    USE iceberg_mv;
    
    CREATE TABLE IF NOT EXISTS base_tbl (
        id INT,
        customer_name STRING,
        amount INT,
        order_date DATE);
        
    INSERT INTO base_tbl VALUES (1, 'John Doe', 150, DATE('2025-12-01')), (2, 'Jane Smith', 200, DATE('2025-12-02')), (3, 'Bob Johnson', 75, DATE('2025-12-03'));
    
    SELECT * FROM base_tbl;

In the next sections, we create a materialized view with this base table.

If you want to store your materialized views in Amazon S3 Tables instead of a general purpose Amazon S3 bucket, refer to Appendix 1 at the end of this post for the configuration details.

Create a materialized view

To create a materialized view, run the following command:

CREATE MATERIALIZED VIEW mv
AS SELECT
    customer_name, 
    COUNT(*) as mv_order_count, 
    SUM(amount) as mv_total_amount 
FROM glue_catalog.iceberg_mv.base_tbl
GROUP BY customer_name;

After you create a materialized view, Spark's in-memory metadata cache needs time to populate with information about the new materialized view. During this cache population period, queries against the base table run normally without using the materialized view. After the cache is fully populated (typically within tens of seconds), Spark automatically detects that the materialized view can satisfy the query and rewrites it to use the pre-computed materialized view instead, improving performance.

To see this behavior, run the following EXPLAIN command immediately after creating the materialized view:

EXPLAIN EXTENDED
SELECT customer_name, COUNT(*) as mv_order_count, SUM(amount) as mv_total_amount 
FROM base_tbl
GROUP BY customer_name;

The following output shows the initial result before cache population:

== Parsed Logical Plan ==
'Aggregate ['customer_name], ['customer_name, 'COUNT(1) AS mv_order_count#0, 'SUM('amount) AS mv_total_amount#1]
+- 'UnresolvedRelation [base_tbl], [], false

== Analyzed Logical Plan ==
customer_name: string, mv_order_count: bigint, mv_total_amount: bigint
Aggregate [customer_name#8], [customer_name#8, count(1) AS mv_order_count#0L, sum(amount#9) AS mv_total_amount#1L]
+- SubqueryAlias glue_catalog.iceberg_mv.base_tbl
   +- RelationV2[id#7, customer_name#8, amount#9, order_date#10] glue_catalog.iceberg_mv.base_tbl glue_catalog.iceberg_mv.base_tbl

== Optimized Logical Plan ==
Aggregate [customer_name#8], [customer_name#8, count(1) AS mv_order_count#0L, sum(amount#9) AS mv_total_amount#1L]
+- RelationV2[customer_name#8, amount#9] glue_catalog.iceberg_mv.base_tbl

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[customer_name#8], functions=[count(1), sum(amount#9)], output=[customer_name#8, mv_order_count#0L, mv_total_amount#1L], schema specialized)
   +- Exchange hashpartitioning(customer_name#8, 1000), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[customer_name#8], functions=[partial_count(1), partial_sum(amount#9)], output=[customer_name#8, count#27L, sum#29L], schema specialized)
         +- BatchScan glue_catalog.iceberg_mv.base_tbl[customer_name#8, amount#9] glue_catalog.iceberg_mv.base_tbl (branch=null) [filters=, groupedBy=, pushedLimit=None] RuntimeFilters: []

In this initial execution plan, Spark scans base_tbl directly (BatchScan glue_catalog.iceberg_mv.base_tbl) and runs aggregations (COUNT and SUM) on the raw data. This is the behavior before the materialized view metadata cache is populated.

After waiting roughly tens of seconds for the metadata cache to populate, run the same EXPLAIN command again. The following output shows the primary differences in the query optimization plan after cache population:

== Optimized Logical Plan ==
Aggregate [customer_name#97], [customer_name#97, coalesce(sum(mv_order_count#98L), 0) AS mv_order_count#72L, sum(mv_total_amount#99L) AS mv_total_amount#73L]
+- RelationV2[customer_name#97, mv_order_count#98L, mv_total_amount#99L] glue_catalog.iceberg_mv.mv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[customer_name#97], functions=[sum(mv_order_count#98L), sum(mv_total_amount#99L)], output=[customer_name#97, mv_order_count#72L, mv_total_amount#73L], schema specialized)
   +- Exchange hashpartitioning(customer_name#97, 1000), ENSURE_REQUIREMENTS, [plan_id=51]
      +- HashAggregate(keys=[customer_name#97], functions=[partial_sum(mv_order_count#98L), partial_sum(mv_total_amount#99L)], output=[customer_name#97, sum#113L, sum#115L], schema specialized)
         +- BatchScan glue_catalog.iceberg_mv.mv[customer_name#97, mv_order_count#98L, mv_total_amount#99L] glue_catalog.iceberg_mv.mv (branch=null) [filters=, groupedBy=, pushedLimit=None] RuntimeFilters: []

After the cache is populated, Spark scans the materialized view (BatchScan glue_catalog.iceberg_mv.mv) instead of the base table. The query has been automatically rewritten to read from the pre-computed aggregated data in the materialized view. The output specifically shows that the aggregation functions now simply sum the pre-computed values (sum(mv_order_count) and sum(mv_total_amount)) rather than recalculating COUNT and SUM from raw data.

Create a materialized view with a scheduled automatic refresh

By default, a newly created materialized view contains the initial query results. It is not automatically updated when the underlying base table data changes. To keep your materialized view synchronized with the base table data, you can configure automatic refresh schedules. To enable automatic refresh, use the REFRESH EVERY clause when creating the materialized view. This clause accepts a time interval and unit, so you can specify how frequently the materialized view is updated.

The following example creates a materialized view that automatically refreshes every 24 hours:

CREATE MATERIALIZED VIEW mv
REFRESH EVERY 24 HOURS
AS SELECT
    customer_name, 
    COUNT(*) as mv_order_count, 
    SUM(amount) as mv_total_amount 
FROM glue_catalog.iceberg_mv.base_tbl
GROUP BY customer_name;

You can configure the refresh interval using any of the following time units: SECONDS, MINUTES, HOURS, or DAYS. Choose an appropriate interval based on your data freshness requirements and query patterns.

If you need more control over when your materialized view updates, or need to refresh it outside of the scheduled intervals, you can trigger manual refreshes at any time. We provide detailed instructions on manual refresh options, including full and incremental refresh, later in this post.

Query a materialized view

To query a materialized view in your Amazon EMR cluster and retrieve its aggregated data, you can use a standard SELECT statement:
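
SELECT * FROM mv;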

This query retrieves all rows from the materialized view. The output shows the aggregated customer order counts and total amounts. The result displays three customers with their respective metrics:

-- Result
Jane Smith    1    200
Bob Johnson    1    75
John Doe    1    150

Additionally, you can query the same materialized view from Athena SQL. The following screenshot shows the same query run on Athena and the resulting output.
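
If you prefer to run the Athena query programmatically instead of from the console, the following is a minimal sketch using boto3. The Region, database, and results location are assumptions; adjust them for your account.

import boto3
import time

# Minimal sketch: run the same query against the materialized view from Athena.
# The Region, database, and results bucket below are assumed placeholders.
athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT * FROM mv",
    QueryExecutionContext={"Database": "iceberg_mv"},
    ResultConfiguration={"OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then print the aggregated rows.
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(2)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])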

Refresh a materialized view

You can refresh materialized views using two refresh types: full refresh or incremental refresh. Full refresh re-computes the entire materialized view from all base table data. Incremental refresh processes only the changes since the last refresh. Full refresh is ideal when you need consistency or after significant data changes. Incremental refresh is preferred when you need fast updates. The following examples show both refresh types.

To use full refresh, complete the following steps:

  1. Insert three new records into the base table to simulate new data arriving:
    INSERT INTO base_tbl VALUES 
    (4, 'Jane Smith', 350, DATE('2025-11-29')), 
    (5, 'Bob Johnson', 100, DATE('2025-11-30')), 
    (6, 'Kwaku Mensah', 40, DATE('2025-12-01'));

  2. Query the materialized view to verify it still shows the old aggregated values:
    SELECT * FROM mv;
    
    -- Result
    Jane Smith    1    200
    Bob Johnson    1    75
    John Doe    1    150

  3. Run a full refresh of the materialized view using the following command:
    REFRESH MATERIALIZED VIEW mv FULL;

  4. Query the materialized view again to verify the aggregated values now include the new records:
    SELECT * FROM mv;
    
    -- Result
    Jane Smith    2    550 // Updated
    Bob Johnson    2    175  // Updated
    John Doe    1    150
    Kwaku Mensah    1    40 // Added

To use incremental refresh, complete the following steps:

  1. Enable incremental refresh by setting the following Spark configuration property:
    SET spark.sql.optimizer.incrementalMVRefresh.enabled=true;

  2. Insert two more records into the base table:
    INSERT INTO base_tbl VALUES 
    (7, 'Jane Smith', 120, DATE('2025-11-28')), 
    (8, 'Kwaku Mensah', 90, DATE('2025-12-02'));

  3. Run an incremental refresh using the REFRESH command without the FULL clause. To verify whether incremental refresh was used, refer to Appendix 2 at the end of this post.
    REFRESH MATERIALIZED VIEW mv;

  4. Query the materialized view to confirm the incremental changes are reflected in the aggregated results:
    SELECT * FROM mv;
    
    -- Result
    Jane Smith    3    670    3    3 // Updated
    Bob Johnson    2    175    2    2 
    John Doe    1    150    1    1
    Kwaku Mensah    2    130    2    2 // Updated
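
If you are following along in an AWS Glue job rather than the spark-sql shell on Amazon EMR, the same refresh sequence can be issued through spark.sql calls, as in the following sketch (it assumes the SparkSession created in the prerequisites section).

# Enable incremental refresh, refresh the materialized view, and inspect the result
# from within a Glue job. Assumes the SparkSession from the prerequisites section.
spark.sql("SET spark.sql.optimizer.incrementalMVRefresh.enabled=true")
spark.sql("REFRESH MATERIALIZED VIEW iceberg_mv.mv")
spark.sql("SELECT * FROM iceberg_mv.mv").show()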

In addition to using Spark SQL, you can also trigger manual refreshes through AWS Glue APIs when you need updates outside your scheduled intervals. Run the following AWS CLI command:

$ aws glue start-materialized-view-refresh-task-run \
    --catalog-id <catalog-id> \
    --database-name <database-name> \
    --table-name <table-name>
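
You can also call the same operation from Python with boto3. The method name below is assumed to mirror the start-materialized-view-refresh-task-run CLI command; verify it against your installed boto3 version, and replace the placeholder identifiers with your own.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Assumed boto3 counterpart of the start-materialized-view-refresh-task-run CLI
# command; the catalog ID, database name, and table name are placeholders.
response = glue.start_materialized_view_refresh_task_run(
    CatalogId="123456789012",
    DatabaseName="iceberg_mv",
    TableName="mv",
)
print(response)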

The AWS Lake Formation console displays refresh history for API-triggered updates. Open your materialized view to see the refresh type (INCREMENTAL or FULL), start and end time, status, and so on.

You have learned how to use Iceberg materialized views to make your data processing and queries efficient. You created a materialized view using Spark on Amazon EMR, queried it from both Amazon EMR and Athena, and used two refresh mechanisms: full refresh and incremental refresh. Iceberg materialized views help you transform and optimize your data pipelines effortlessly.

Considerations

There are important aspects to consider for optimal use of the capability:

  • The new SQL syntax to manage materialized views is available in the AWS optimized Spark runtime engine only. These new SQL commands are available in Spark version 3.5.6 and above across Athena, Amazon EMR, and AWS Glue. Open source Spark is not supported.
  • Materialized views are eventually consistent with base tables. When source tables change, the materialized views are updated through background refresh processes according to the refresh schedule defined at creation. During the refresh window, queries directly accessing materialized views might see stale data. However, customers who need immediate access to the most up-to-date datasets can run a manual refresh with a simple REFRESH MATERIALIZED VIEW SQL command.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

  1. Run the following commands to delete the materialized view and tables:
    DROP TABLE mv PURGE;
    -- Or, DROP MATERIALIZED VIEW mv;
    
    DROP TABLE base_tbl PURGE;
    -- If necessary, delete the database with DROP DATABASE iceberg_mv;

  2. For Amazon EMR, terminate the Amazon EMR cluster.
  3. For AWS Glue, delete the AWS Glue job.

Conclusion

This post demonstrated how Iceberg materialized views facilitate efficient data lake operations on AWS. The new materialized view capability simplifies data pipelines and improves query performance by storing pre-computed results that are automatically updated as base tables change. You can create materialized views using familiar SQL syntax, using both full and incremental refresh mechanisms to maintain data consistency. This solution alleviates the need for complex pipeline maintenance while providing seamless integration with AWS services like Athena, Amazon EMR, and AWS Glue. The automatic query rewrite capability further optimizes performance by intelligently using materialized views when applicable, making it a powerful tool for organizations looking to streamline their data transformation workflows and accelerate query performance.

Appendix 1: Spark configuration to use Amazon S3 Tables for storing Apache Iceberg materialized views

As mentioned earlier in this post, materialized views can be stored as Iceberg tables in Amazon S3 Tables buckets within your account. When you want to use Amazon S3 Tables as the storage location for your materialized views instead of a general purpose Amazon S3 bucket, you must configure Spark with the Amazon S3 Tables catalog.

The difference from the standard AWS Glue Data Catalog configuration shown in the prerequisites section is the glue.id parameter format. For Amazon S3 Tables, use the format <account-id>:s3tablescatalog/<table-bucket-name> instead of just the account ID:

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.s3t_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3t_catalog.type=glue \
  --conf spark.sql.catalog.s3t_catalog.warehouse="s3://amzn-s3-demo-bucket/warehouse" \
  --conf spark.sql.catalog.s3t_catalog.glue.region="us-east-1" \
  --conf spark.sql.catalog.s3t_catalog.glue.id="123456789012:s3tablescatalog/amzn-s3-demo-table-bucket" \
  --conf spark.sql.catalog.s3t_catalog.glue.account-id=123456789012 \
  --conf spark.sql.catalog.s3t_catalog.client.region="us-east-1" \
  --conf spark.sql.catalog.s3t_catalog.glue.lakeformation-enabled=true \
  --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
  --conf spark.sql.defaultCatalog=s3t_catalog

After you configure Spark with these settings, you can create and manage materialized views using the same SQL commands shown in this post, and the materialized views are stored in your Amazon S3 Tables bucket.

Appendix 2: Verify how a materialized view was refreshed with Spark SQL

Run SHOW TBLPROPERTIES in Spark SQL to check which refresh method was used:
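
SHOW TBLPROPERTIES mv;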

+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|key                            |value                                                                                                                             |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
|IMV_ansiEnabled                |false                                                                                                                             |
|IMV_catalogInfo                |[{"catalogId":"123456789012","catalogName":"glue_catalog"}]                                                                       |
|IMV_mvCatalogID                |123456789012                                                                                                                      |
|IMV_mvNamespace                |iceberg_mv                                                                                                                        |
|IMV_region                     |us-east-1                                                                                                                         |
|IMV_sparkVersion               |3.5.6-amzn-1                                                                                                                      |
|current-snapshot-id            |5750703934418352571                                                                                                               |
|format                         |iceberg/parquet                                                                                                                   |
|format-version                 |2                                                                                                                                 |
|isMaterializedView             |true                                                                                                                              |
|lastRefreshType                |INCREMENTAL                                                                                                                       |
|subObjects                     |[{"Version":"4887707562550190856","DatabaseName":"iceberg_mv","Region":"us-east-1","CatalogId":"123456789012","Name":"base_tbl"}] |
|tableVersionToken              |*********(redacted)                                                                                                               |
|viewOriginalText               |SELECT\ncustomer_name, \nCOUNT(*) as mv_order_count, \nSUM(amount) as mv_total_amount \nFROM base_tbl\nGROUP BY customer_name    |
|viewVersionId                  |5750703934418352571                                                                                                               |
|viewVersionToken               |*********(redacted)                                                                                                               |
|write.parquet.compression-codec|zstd                                                                                                                              |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------+


About the authors

Tomohiro Tanaka

Tomohiro is a Senior Cloud Support Engineer at AWS. He is passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Leon Lin

Leon is a Software Development Engineer at AWS, where he focuses on Apache Iceberg and Apache Spark development within the Open Data Analytics Engines team. He is also an active contributor to the open source Apache Iceberg project.

Noritaka Sekiyama

Noritaka is a Principal Big Data Architect with AWS Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Mahesh Mishra

Mahesh is a Principal Product Manager with the AWS Analytics team. He works with many of AWS's largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including strong support for transactional data lakes.

Layth Yassin

Layth is a Software Development Engineer on the AWS Glue team. He is passionate about tackling challenging problems at a large scale, and building products that push the bounds of the field. Outside of work, he enjoys playing and watching basketball, and spending time with friends and family.
