Building Modern Data Lakehouses on Google Cloud with Apache Iceberg and Apache Spark

July 8, 2025

The Rise of Apache Iceberg: A Game-Changer for Data Lakes

For years, data lakes, typically built on cloud object storage like Google Cloud Storage (GCS), offered unparalleled scalability and cost efficiency. However, they often lacked the crucial features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.

Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (like Parquet, ORC, or Avro) in cloud storage, providing a layer of metadata that transforms a collection of files into a high-performance, SQL-like table. Here’s what makes Iceberg so powerful:

ACID Compliance: Iceberg brings Atomicity, Consistency, Isolation, and Durability (ACID) properties to your data lake. This means that data writes are transactional, ensuring data integrity even with concurrent operations. No more partial writes or inconsistent reads.
Schema Evolution: One of the biggest pain points in traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is critical for agile data development.
Hidden Partitioning: Iceberg intelligently manages partitioning, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without data migrations.
Time Travel and Rollback: Iceberg maintains a complete history of table snapshots. This enables “time travel” queries, allowing you to query data as it existed at any point in the past. It also provides rollback capabilities, letting you revert a table to a previous good state, invaluable for debugging and data recovery.
Performance Optimizations: Iceberg’s rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly accelerating query execution. It avoids costly file listing operations, directly jumping to the relevant data based on its metadata.
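
The snapshot idea behind time travel and rollback can be illustrated with a toy model. The sketch below is plain Python and is not Iceberg's actual metadata format or API: `ToyTable`, `Snapshot`, and the file names are hypothetical, invented only to show that each commit publishes a new immutable file list and that rollback simply moves the "current" pointer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of which data files make up the table at one commit."""
    snapshot_id: int
    files: frozenset


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot log (hypothetical, not Iceberg's real layout)."""
    snapshots: dict = field(default_factory=dict)  # snapshot_id -> Snapshot
    current_id: Optional[int] = None

    def scan(self, snapshot_id=None):
        """Return the files visible at a snapshot (the current one by default)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        if sid is None:
            return frozenset()
        return self.snapshots[sid].files

    def commit(self, added=(), removed=()):
        """Publish a new snapshot; earlier snapshots are never modified."""
        files = (self.scan() | frozenset(added)) - frozenset(removed)
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, files)
        self.current_id = new_id  # one pointer swap makes the commit visible
        return new_id

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; history is kept."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.commit(added=["a.parquet"])
s2 = table.commit(added=["b.parquet"])
s3 = table.commit(removed=["a.parquet"])

print(sorted(table.scan()))    # current state after three commits
print(sorted(table.scan(s1)))  # "time travel" to the first snapshot
table.rollback(s2)
print(sorted(table.scan()))    # state after rolling back to s2
```

Because readers always resolve a single snapshot, they never observe a half-applied write in this model, which is the intuition behind the transactional guarantees described above.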

By providing these data warehouse-like features on top of a data lake, Apache Iceberg enables the creation of a true “data lakehouse,” offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.

Google Cloud’s BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, but all of the data is stored in customer-owned storage buckets. Supported features include:

Table mutations via GoogleSQL data manipulation language (DML)
Unified batch and high-throughput streaming using the Storage Write API through BigLake connectors such as Spark
Iceberg V2 snapshot export and automatic refresh on each table mutation
Schema evolution to update column metadata
Automatic storage optimization
Time travel for historical data access
Column-level security and data masking
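
As a rough illustration of the time travel feature in the list above, the following sketch builds a historical query string in Python. It assumes that the standard GoogleSQL `FOR SYSTEM_TIME AS OF` clause is how time travel is expressed for these tables. The helper name `time_travel_query` and the table placeholder are hypothetical, and the sketch only produces the SQL text; it does not run it.

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a SELECT that reads `table` as of a past instant (hypothetical helper)."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Normalize to UTC and format as a GoogleSQL TIMESTAMP literal.
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return f"SELECT * FROM `{table}` FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}'"


# Placeholder table name; replace with your own project, dataset, and table.
query = time_travel_query(
    "PROJECT_ID.DATASET_ID.my_iceberg_table",
    datetime(2025, 7, 1, 12, 0, tzinfo=timezone.utc),
)
print(query)
```

The resulting string could then be submitted through whichever BigQuery client you already use. How far back you can query depends on the table's time travel retention settings.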

Here’s an example of how to create an empty BigLake Iceberg table using GoogleSQL:


SQL

CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  title STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://BUCKET/PATH');

You can then import data into the table using LOAD DATA INTO to load data from a file or INSERT INTO to copy it from another table.


SQL

# Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
  uris = ['gs://bucket/path/to/data'],
  format = 'PARQUET');

# Load from table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT title, id
FROM PROJECT_ID.DATASET_ID.source_table;

In addition to a fully managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point to an existing path with data files.


SQL

CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format="ICEBERG",
  uris =
    ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);

Apache Spark: The Engine for Data Lakehouse Analytics

While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful open-source, distributed processing system renowned for its speed, versatility, and ability to handle diverse big data workloads. Spark’s in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.

Apache Spark is deeply integrated into the Google Cloud ecosystem. Benefits of using Apache Spark on Google Cloud include:

Access to a true serverless Spark experience without cluster management using Google Cloud Serverless for Apache Spark.
Fully managed Spark experience with flexible cluster configuration and management via Dataproc.
Accelerate Spark jobs using the new Lightning Engine for Apache Spark preview feature.
Configure your runtime with GPUs and drivers preinstalled.
Run AI/ML jobs using a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch and Transformers.
Write PySpark code directly inside BigQuery Studio via Colab Enterprise notebooks along with Gemini-powered PySpark code generation.
Easily connect to your data in BigQuery native tables, BigLake Iceberg tables, external tables and GCS.
Integration with Vertex AI for end-to-end MLOps.

Iceberg + Spark: Better Together

Together, Iceberg and Spark form a potent combination for building performant and reliable data lakehouses. Spark can leverage Iceberg’s metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.
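
To make the idea of metadata-based pruning concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the `DataFile` class, the `prune` function, and the file names are hypothetical, and real engines consult Iceberg's manifest statistics rather than a Python list. The point is only that per-file min/max bounds let a planner skip files that cannot match a filter, without opening them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Per-file column bounds, as a planner might read them from table metadata."""
    path: str
    min_id: int  # smallest `id` value stored in the file
    max_id: int  # largest `id` value stored in the file


def prune(files, lo, hi):
    """Keep files whose [min_id, max_id] range can overlap the filter lo <= id <= hi."""
    return [f for f in files if f.max_id >= lo and f.min_id <= hi]


# Hypothetical file list for illustration.
manifest = [
    DataFile("part-0.parquet", 1, 100),
    DataFile("part-1.parquet", 101, 200),
    DataFile("part-2.parquet", 201, 300),
]

# Filter: WHERE id BETWEEN 150 AND 220
to_read = prune(manifest, 150, 220)
print([f.path for f in to_read])  # ['part-1.parquet', 'part-2.parquet']
```

Here `part-0.parquet` is skipped purely from its bounds, so the scan touches two files instead of three.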

Your Iceberg tables and BigQuery native tables are accessible via BigLake metastore. This exposes your tables to open source engines with BigQuery compatibility, including Spark.


Python

from pyspark.sql import SparkSession

# Create a Spark session with an Iceberg catalog backed by BigLake metastore
spark = (
    SparkSession.builder
    .appName("BigLake Metastore Iceberg")
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog")
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID")
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION")
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY")
    .getOrCreate()
)
spark.conf.set("viewsEnabled", "true")

# Use the blms_catalog
spark.sql("USE `CATALOG_NAME`")
spark.sql("USE NAMESPACE DATASET_NAME")

# Configure Spark for temp results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES")
df.show()

# Query a BigQuery native table
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a BigLake Iceberg table
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a read-only external Iceberg table
sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
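
The catalog settings used in the session above follow a regular `spark.sql.catalog.<name>.*` pattern, so they can be collected in one place. The sketch below is one possible way to do that. The helper names `biglake_metastore_conf` and `apply_conf` and the example values (`lake`, `my-project`, `us-central1`, the bucket path) are hypothetical, while the property keys and class names are the ones shown in the snippet above.

```python
def biglake_metastore_conf(catalog, project, location, warehouse):
    """Return the Spark properties for an Iceberg catalog on BigLake metastore."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
        f"{prefix}.gcp_project": project,
        f"{prefix}.gcp_location": location,
        f"{prefix}.warehouse": warehouse,
    }


def apply_conf(builder, conf):
    """Apply each property to any builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Hypothetical example values; substitute your own.
conf = biglake_metastore_conf(
    catalog="lake",
    project="my-project",
    location="us-central1",
    warehouse="gs://my-bucket/warehouse",
)
for key, value in conf.items():
    print(f"{key} = {value}")
```

With PySpark installed, `apply_conf(SparkSession.builder.appName("..."), conf).getOrCreate()` would produce a session configured the same way. Keeping the properties in a dictionary also makes them easy to log or compare across environments.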

Extending the functionality of BigLake metastore is the Iceberg REST catalog (in preview), which lets you access Iceberg data with any data processing engine. Here’s how to connect to it using Spark:


Python

import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# The empty strings below are placeholders to fill in before running:
# catalog name, app name, warehouse bucket path, access token, and user project.
catalog = ""
spark = (
    SparkSession.builder.appName("")
    .config("spark.sql.defaultCatalog", catalog)
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.type", "rest")
    .config(f"spark.sql.catalog.{catalog}.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog")
    .config(f"spark.sql.catalog.{catalog}.warehouse", "gs://")
    .config(f"spark.sql.catalog.{catalog}.token", "")
    .config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "https://oauth2.googleapis.com/token")
    .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO")
    .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false")
    .getOrCreate()
)
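
Because the REST catalog snippet above leaves several values blank, it can help to check the settings before starting a session. The following is a small sketch of such a check in plain Python. The function names `rest_catalog_conf` and `unfilled` and the example values are hypothetical; the property keys and the endpoint URL are taken from the snippet above.

```python
def rest_catalog_conf(catalog, warehouse, token, user_project):
    """Collect the main REST catalog properties in a dictionary."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.defaultCatalog": catalog,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1beta/restcatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.token": token,
        f"{prefix}.header.x-goog-user-project": user_project,
    }


def unfilled(conf):
    """Return the keys whose values are still empty or a bare 'gs://' prefix."""
    return sorted(k for k, v in conf.items() if v.strip() in ("", "gs://"))


# Hypothetical example: the token and bucket path have not been supplied yet.
conf = rest_catalog_conf(
    catalog="lake",
    warehouse="gs://",
    token="",
    user_project="my-project",
)
missing = unfilled(conf)
if missing:
    print("Fill in before creating the session:", missing)
```

Failing fast on blank values tends to give a clearer error than whatever the catalog client reports when it receives an empty token or warehouse path.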

Completing the lakehouse

Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to build, manage, and scale your data lakehouse with ease while leveraging many of the open-source technologies you already use:

Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric for managing, monitoring, and governing your data across data lakes, data warehouses, and data marts. It integrates with BigLake Metastore, ensuring that governance policies are consistently enforced across your Iceberg tables, and enabling capabilities like semantic search, data lineage, and data quality checks.
Google Cloud Managed Service for Apache Kafka: Run fully managed Kafka clusters on Google Cloud, including Kafka Connect. Data streams can be read directly to BigQuery, including to managed Iceberg tables with low latency reads.
Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
Vertex AI: Use Vertex AI to manage the full end-to-end MLOps experience. You can also use Vertex AI Workbench for a managed JupyterLab experience to connect to your serverless Spark and Dataproc instances.

Conclusion

The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark offers a versatile and scalable engine for processing these large datasets.

To learn more, check out our free webinar on July 8th at 11AM PST, where we’ll dive deeper into using Apache Spark and supporting tools on Google Cloud.

Author: Brad Miro, Senior Developer Advocate – Google
