This post is co-written with Haya Axelrod Stern, Zion Rubin, and Michal Urbanowicz from Natural Intelligence.
Many organizations turn to data lakes for the flexibility and scale needed to manage large volumes of structured and unstructured data. However, migrating an existing data lake to a new table format such as Apache Iceberg can bring significant technical and organizational challenges.
Natural Intelligence (NI) is a global leader in multi-category marketplaces. NI's leading brands, Top10.com and BestMoney.com, help millions of people worldwide make informed decisions every day. Recently, NI embarked on a journey to transition their legacy data lake from Apache Hive to Apache Iceberg.
In this blog post, NI shares their journey, the innovative solutions developed, and the key takeaways that can guide other organizations considering a similar path.
This article details NI's practical approach to this complex migration, focusing less on Apache Iceberg's technical specifications and more on the real-world challenges and solutions encountered during the transition to Apache Iceberg, a challenge that many organizations are grappling with.
Why Apache Iceberg?
The architecture at NI followed the commonly used medallion architecture, composed of a bronze-silver-gold layered framework, shown in the figure that follows:
- Bronze layer: Unprocessed data from various sources, stored in its raw format in Amazon Simple Storage Service (Amazon S3), ingested through Apache Kafka brokers.
- Silver layer: Contains cleaned and enriched data, processed using Apache Flink.
- Gold layer: Holds analytics-ready datasets designed for business intelligence (BI) and reporting, produced using Apache Spark pipelines, and consumed by services such as Snowflake, Amazon Athena, Tableau, and Apache Druid. The data is stored in Apache Parquet format, with the AWS Glue Data Catalog providing metadata management.
While this architecture supported NI's analytical needs, it lacked the flexibility required for a truly open and adaptable data platform. The gold layer was coupled exclusively with query engines that supported Hive and the AWS Glue Data Catalog. It was possible to use Amazon Athena, but Snowflake required maintaining another catalog in order to query these external tables. This situation made it difficult to evaluate or adopt alternative tools and engines without costly data duplication, query rewrites, or data catalog synchronization. As the business scaled, NI needed a data platform that could seamlessly support multiple query engines concurrently with a single data catalog while avoiding vendor lock-in.
The power of Apache Iceberg
Apache Iceberg emerged as the right solution: a flexible, open table format that aligns with NI's Data Lake First approach. Iceberg offers several significant advantages such as ACID transactions, schema evolution, time travel, performance improvements, and more. But the key strategic benefit lay in the ability to support multiple query engines concurrently. It also has the following advantages:
- Decoupling of storage and compute: The open table format lets you separate the storage layer from the query engine, allowing an easy switch and support for multiple engines concurrently without data duplication.
- Vendor independence: As an open table format, Apache Iceberg prevents vendor lock-in, giving you the flexibility to adapt to changing analytics needs.
- Vendor adoption: Apache Iceberg is widely supported by major platforms and tools, providing seamless integration and long-term ecosystem compatibility.
By transitioning to Iceberg, NI was able to embrace a truly open data platform, providing long-term flexibility, scalability, and interoperability while maintaining a unified source of truth for all analytics and reporting needs.
Challenges faced
Migrating a live production data lake to Iceberg was challenging because of operational complexities and legacy constraints. The data service at NI runs hundreds of Spark and machine learning pipelines, manages thousands of tables, and supports over 400 dashboards, all running 24/7. Any migration would need to be done without production interruptions, and coordinating such a migration while operations continue seamlessly was daunting.
NI needed to accommodate diverse users with varying requirements and timelines, from data engineers to data analysts all the way to data scientists and BI teams.
Adding to the challenge were legacy constraints. Some of the existing tools didn't fully support Iceberg, so there was a need to maintain Hive-backed tables for compatibility. NI realized that not all users could adopt Iceberg immediately, so a plan was required to allow for incremental transitions without downtime or disruption to ongoing operations.
Key pillars for migration
To help ensure a smooth and successful transition, six key pillars were defined:
- Support ongoing operations: Maintain uninterrupted compatibility with existing systems and workflows during the migration process.
- User transparency: Minimize disruption for users by preserving existing table names and access patterns.
- Gradual consumer migration: Allow users to adopt Iceberg at their own pace, avoiding a forced, simultaneous switchover.
- ETL flexibility: Migrate ETL pipelines to Iceberg without imposing constraints on development or deployment.
- Cost effectiveness: Minimize storage and compute duplication and overhead during the migration period.
- Lower maintenance: Reduce the operational burden of managing dual table formats (Hive and Iceberg) during the transition.
Evaluating traditional migration approaches
Apache Iceberg supports two main approaches for migration: in-place and rewrite-based migration.
In-place migration
How it works: Converts an existing dataset into an Iceberg table without duplicating data by creating Iceberg metadata on top of the existing files while preserving their format and layout.
Advantages:
- Cost-effective in terms of storage (no data duplication)
- Simplified implementation
- Maintains existing table names and locations
- No data movement and minimal compute requirements, translating into lower cost
Disadvantages:
- Downtime required: All write operations must be paused during conversion, which was unacceptable in NI's case because data and analytics are considered mission critical and run 24/7
- No gradual adoption: All users must switch to Iceberg simultaneously, increasing the risk of disruption
- Limited validation: No opportunity to validate data before cutover; rollback requires restoring from backups
- Technical constraints: Schema evolution during migration can be challenging; data type incompatibilities can halt the entire process
Rewrite-based migration
How it works: Rewrite-based migration in Apache Iceberg involves creating a new Iceberg table by rewriting and reorganizing existing dataset files into Iceberg's optimized format and structure for improved performance and data management.
Advantages:
- Zero downtime during migration
- Supports gradual consumer migration
- Enables thorough validation
- Simple rollback mechanism
Disadvantages:
- Resource overhead: Double storage and compute costs during migration
- Maintenance complexity: Managing two parallel data pipelines increases operational burden
- Consistency challenges: Maintaining perfect consistency between the two systems is difficult
- Performance impact: Increased latency because of dual writes; potential pipeline slowdowns
Why neither option alone was sufficient
NI determined that neither option could meet all critical requirements:
- In-place migration fell short because of unacceptable downtime and lack of support for gradual migration.
- Rewrite-based migration fell short because of prohibitive cost overhead and complex operational management.
This analysis led NI to develop a hybrid approach that combines the advantages of both methods while mitigating and minimizing their limitations.
The hybrid solution
The hybrid migration strategy was designed around five foundational elements, using AWS analytical services for orchestration, processing, and state management.
- Hive-to-Iceberg CDC: Automatically synchronize Hive tables with Iceberg using a custom change data capture (CDC) process to support existing users. Unlike traditional CDC, which focuses on row-level changes, the process was done at the partition level to preserve Hive's behavior of updating tables by overwriting partitions. This helps ensure that data consistency is maintained between Hive and Iceberg without logic changes during the migration phase, making sure that the same data exists in both tables.
- Continuous schema synchronization: Schema evolution during the migration introduced maintenance challenges. Automated schema sync processes compared Hive and Iceberg schemas, reconciling differences while maintaining type compatibility.
- Iceberg-to-Hive reverse CDC: To enable the data team to transition extract, transform, and load (ETL) jobs to write directly to Iceberg while maintaining compatibility with existing Hive-based processes, a reverse CDC from Iceberg to Hive was implemented. This allowed ETLs to write to Iceberg while maintaining Hive tables for downstream processes that had not yet migrated and still relied on them during the migration period.
- Alias management in Snowflake: Snowflake aliases made sure that Iceberg tables retained their original names, making the transition transparent to users. This approach minimized reconfiguration efforts across dependent teams and workflows.
- Table replacement: Swap production tables while retaining original names, completing the migration.
Technical deep dive
The migration from Hive to Iceberg was built from several steps:
1. Hive-to-Iceberg CDC pipeline
Goal: Keep Hive and Iceberg tables synchronized without duplicating effort.
The preceding figure demonstrates how each partition written to the Hive table is automatically and transparently copied to the Iceberg table using a CDC process. This process makes sure that both tables are synchronized, enabling a seamless and incremental migration without disrupting downstream systems. NI chose partition-level synchronization because the legacy Hive ETL jobs already wrote updates by overwriting entire partitions and updating the partition location. Adopting that same approach in the CDC pipeline helped ensure that it remained consistent with how data was originally managed, making the migration smoother and avoiding the need to rework row-level logic.
Implementation:
- To keep Hive and Iceberg tables synchronized without duplicating effort, a streamlined pipeline was implemented. Whenever partitions in Hive tables are updated, the AWS Glue Data Catalog emits events such as UpdatePartition. Amazon EventBridge captured these events, filtered them for the relevant databases and tables according to the EventBridge rule, and triggered an AWS Lambda function. This function parsed the event metadata and sent the partition updates to an Apache Kafka topic (a sketch of such a handler follows this list).
- A Spark job running on Amazon EMR consumed the messages from Kafka, which contained the updated partition details from the Data Catalog events. Using that event metadata, the Spark job queried the relevant Hive table and wrote it to the Iceberg table in Amazon S3 using the Spark Iceberg overwritePartitions API, as shown in the Spark sketch after this list.
- By targeting only changed partitions, the pipeline (shown in the following figure) significantly reduced the need for costly full-table rewrites. Iceberg's robust metadata layers, including snapshots and manifest files, were seamlessly updated to capture these changes, providing efficient and accurate synchronization between Hive and Iceberg tables.
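The original code isn't reproduced in this post extract, so the two sketches below are illustrative only. The first shows what a Lambda handler that forwards Glue Data Catalog partition events to Kafka could look like; the event field names, broker list, and topic are assumptions, not NI's actual configuration:

```python
import json
import os

from kafka import KafkaProducer  # kafka-python, packaged with the function as a dependency

# Broker list and topic are hypothetical and supplied through environment variables.
producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BROKERS"].split(","),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def handler(event, context):
    # EventBridge delivers the Glue Data Catalog change event in the "detail" field;
    # the exact field names below are illustrative.
    detail = event["detail"]
    message = {
        "database": detail["databaseName"],
        "table": detail["tableName"],
        "partitions": detail.get("changedPartitions", []),
    }
    producer.send(os.environ["KAFKA_TOPIC"], message)
    producer.flush()
    return {"forwarded_partitions": len(message["partitions"])}
```

The second sketch shows the Spark side under similar assumptions: a SparkSession configured with an Iceberg catalog named glue_catalog, a hypothetical table funnel.clickouts partitioned by dt, and a partition value parsed from the Kafka message. The overwritePartitions call replaces only the affected partition in the Iceberg table:

```python
from pyspark.sql import SparkSession

# Iceberg catalog backed by the AWS Glue Data Catalog; names and bucket are illustrative.
spark = (
    SparkSession.builder
    .appName("hive-to-iceberg-partition-cdc")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg/")
    .enableHiveSupport()
    .getOrCreate()
)

# Values parsed from the Kafka message produced by the Lambda function above.
database, table, partition_date = "funnel", "clickouts", "2024-05-01"

# Read only the changed partition from the Hive table.
changed_partition_df = spark.sql(
    f"SELECT * FROM {database}.{table} WHERE dt = '{partition_date}'"
)

# Overwrite just that partition in the Iceberg table; other partitions stay untouched.
changed_partition_df.writeTo(f"glue_catalog.{database}.{table}_iceberg").overwritePartitions()
```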
2. Iceberg-to-Hive reverse CDC pipeline
Goal: Support Hive users while allowing ETL pipelines to transition to Iceberg.
The preceding figure shows the reverse process, where each partition written to the Iceberg table is automatically and transparently copied to the Hive table using a CDC mechanism. This process helps ensure synchronization between the two systems, enabling seamless data updates for legacy systems that still rely on Hive while transitioning to Iceberg.
Implementation:
Synchronizing data from Iceberg tables back to Hive tables presented a different challenge. Unlike Hive tables, the Data Catalog doesn't track partition updates for Iceberg tables, because partitions in Iceberg are managed internally and not within the catalog. This meant NI couldn't rely on Glue Data Catalog events to detect partition changes.
To address this, NI implemented a solution similar to the previous flow but adapted to Iceberg's architecture. Apache Spark was used to query Iceberg's metadata tables, specifically the snapshots and entries tables, to identify the partitions modified since the last synchronization using a query over those metadata tables.
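The original query isn't reproduced in this extract; the following is a minimal sketch of that approach, assuming a SparkSession spark configured with an Iceberg catalog named glue_catalog, a hypothetical table funnel.clickouts_iceberg, and a last-synchronization timestamp loaded from a checkpoint store such as DynamoDB:

```python
# Checkpoint of the last successful sync; in practice this would be read from DynamoDB.
last_sync_ts = "2024-05-01 00:00:00"

changed_partitions = spark.sql(f"""
    SELECT DISTINCT e.data_file.partition AS partition
    FROM glue_catalog.funnel.clickouts_iceberg.entries AS e
    JOIN glue_catalog.funnel.clickouts_iceberg.snapshots AS s
      ON e.snapshot_id = s.snapshot_id
    WHERE s.committed_at > TIMESTAMP '{last_sync_ts}'
      AND e.status IN (1, 2)   -- 1 = added data files, 2 = deleted data files
""")
changed_partitions.show(truncate=False)
```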
This query returned only the partitions that had been updated since the last synchronization, enabling the process to focus solely on the changed data. Using this information, similar to the earlier process, a Spark job retrieved the updated partitions from Iceberg and wrote them back to the corresponding Hive table, providing seamless synchronization between both tables.
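Under the same assumptions (a configured SparkSession and the hypothetical funnel.clickouts tables), the write-back step could look like this sketch, which uses dynamic partition overwrite so that only the changed partitions in the Hive table are replaced:

```python
# One of the changed partitions returned by the metadata query (hypothetical value).
partition_date = "2024-05-01"

updated_df = spark.sql(f"""
    SELECT * FROM glue_catalog.funnel.clickouts_iceberg
    WHERE dt = '{partition_date}'
""")

# Dynamic mode replaces only the partitions present in the DataFrame being written.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
updated_df.write.mode("overwrite").insertInto("funnel.clickouts")
```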
3. Continuous schema synchronization
Goal: Automate schema updates to maintain consistency across Hive and Iceberg.
The preceding figure shows how the automated schema sync process helps ensure consistency between Hive and Iceberg table schemas by automatically synchronizing schema changes, in this example adding the Channel column, minimizing manual work and double maintenance during the extended migration period.
Implementation:
To handle schema changes between Hive and Iceberg, a process was implemented to detect and reconcile differences automatically. When a schema change happens in a Hive table, the Data Catalog emits an UpdateTable event. This event triggers a Lambda function (routed through EventBridge), which retrieves the updated schema from the Data Catalog for the Hive table and compares it to the Iceberg schema. It's important to call out that in NI's setup, schema changes originate from Hive because the Iceberg table is hidden behind aliases across the system. Because Iceberg is primarily used for Snowflake, a one-way sync from Hive to Iceberg is sufficient. Consequently, there is no mechanism to detect or handle schema changes made directly in Iceberg, because they aren't needed in the current workflow.
During the schema reconciliation (shown in the following figure), data types are normalized to help ensure compatibility, for example converting Hive's VARCHAR to Iceberg's STRING. Any new fields or type changes are validated and applied to the Iceberg schema using a Spark job running on Amazon EMR. Amazon DynamoDB stores schema synchronization checkpoints, which allow tracking changes over time and maintaining consistency between the Hive and Iceberg schemas.
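As a rough illustration of this reconciliation logic (not NI's actual code), the sketch below assumes column lists already fetched from the Glue Data Catalog for both tables and a SparkSession spark with the Iceberg catalog configured; it normalizes Hive types and adds any missing columns to the Iceberg table:

```python
# Hypothetical schemas fetched from the Glue Data Catalog (column name -> type).
hive_columns = {"platform": "varchar(64)", "clickouts": "bigint", "channel": "string"}
iceberg_columns = {"platform": "string", "clickouts": "bigint"}


def normalize(hive_type: str) -> str:
    """Map a Hive type to its Iceberg-compatible Spark SQL equivalent."""
    t = hive_type.lower()
    if t.startswith("varchar") or t.startswith("char"):
        return "string"
    if t in ("tinyint", "smallint"):
        return "int"
    return t


# Add columns that exist in Hive but are missing from the Iceberg table.
for name, hive_type in hive_columns.items():
    if name not in iceberg_columns:
        spark.sql(
            f"ALTER TABLE glue_catalog.funnel.clickouts_iceberg "
            f"ADD COLUMN {name} {normalize(hive_type)}"
        )
```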
By automating this schema synchronization, maintenance overhead was significantly reduced and developers were freed from manually keeping schemas in sync, making the long migration period considerably more manageable.
The preceding figure depicts an automated workflow to maintain schema consistency between Hive and Iceberg tables. AWS Glue captures table state change events from Hive, which trigger an EventBridge event. The event invokes a Lambda function that fetches metadata from DynamoDB and compares schemas fetched from AWS Glue for both the Hive and Iceberg tables. If a mismatch is detected, the schema in Iceberg is updated to help ensure alignment, minimizing manual intervention and supporting smooth operation during the migration.
4. Alias management in Snowflake
Goal: Enable Snowflake users to adopt Iceberg without changing query references.
The preceding figure shows how Snowflake aliases enable seamless migration by mapping queries like SELECT platform, COUNT(clickouts) FROM funnel.clickouts to Iceberg tables in the Glue Data Catalog. Even with suffixes added during the Iceberg migration, existing queries and workflows remain unchanged, minimizing disruption for BI tools and analysts.
Implementation:
To help ensure a seamless experience for BI tools and analysts during the migration, Snowflake aliases were used to map external tables to the Iceberg metadata stored in the Data Catalog. By assigning aliases that matched the original Hive table names, existing queries and reports were preserved without interruption. For example, an external table was created in Snowflake and aliased to the original table name.
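The original statement isn't shown in this extract. As one possible way to express such an alias (an assumption, not necessarily NI's exact mechanism), a view in Snowflake can expose an Iceberg-backed table under the original Hive table name; the connection parameters and table names below are placeholders:

```python
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="FUNNEL",
)

# Existing dashboards keep querying funnel.clickouts; the view forwards to the
# Iceberg-backed table created during the migration (hypothetical name).
conn.cursor().execute("""
    CREATE OR REPLACE VIEW FUNNEL.PUBLIC.CLICKOUTS AS
    SELECT * FROM FUNNEL.PUBLIC.CLICKOUTS_ICEBERG
""")
conn.close()
```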
When the migration was completed, a simple change to the alias was made to point to the new location or schema, making the transition seamless and minimizing any disruption to user workflows.
5. Table replacement
Goal: When all ETLs and related data workflows had been successfully transitioned to use Apache Iceberg's capabilities, and everything was functioning correctly with the synchronization flow, it was time to move on to the final phase of the migration. The primary objective was to maintain the original table names, avoiding the use of any prefixes like those employed in the earlier, intermediate migration steps. This helped ensure that the configuration remained tidy and free from unnecessary naming complications.
The preceding figure shows the table replacement that completed the migration, where Hive on Amazon EMR was used to register Parquet files as Iceberg tables while preserving original table names and avoiding data duplication, helping to ensure a seamless and tidy migration.
Implementation:
One of the challenges was that renaming tables isn't possible within AWS Glue, which prevented the use of a straightforward renaming approach for the existing synchronization flow tables. In addition, AWS Glue doesn't support the migrate procedure, which creates Iceberg metadata on top of the existing data files while preserving the original table name. The strategy to overcome this limitation was to use a Hive metastore on an Amazon EMR cluster. By using Hive on Amazon EMR, NI was able to create the final tables with their original names, because it operates in a separate metastore environment, giving the flexibility to define any required schema and table names without interference.
The add_files procedure was used to methodically register all the existing Parquet files, thus establishing all the necessary metadata within Hive. This was a crucial step, because it helped ensure that all data files were correctly cataloged and linked within the metastore.
The preceding figure shows the transition of a production table to Iceberg by using the add_files procedure to register existing Parquet files and create Iceberg metadata. This helped ensure a smooth migration while preserving the original data and avoiding duplication.
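A minimal sketch of that registration step in Spark SQL, assuming the Iceberg table has already been created in the EMR Hive metastore and using hypothetical table and path names:

```python
from pyspark.sql import SparkSession

# Iceberg session catalog backed by the EMR cluster's Hive metastore (names are illustrative).
spark = (
    SparkSession.builder
    .appName("register-parquet-as-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Register the existing Parquet files with the Iceberg table without rewriting them.
spark.sql("""
    CALL spark_catalog.system.add_files(
        table => 'funnel.clickouts',
        source_table => '`parquet`.`s3://my-bucket/funnel/clickouts/`'
    )
""")
```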
This setup allowed the use of existing Parquet files without duplicating data, thus saving resources. Although the sync flow used separate buckets for the final layout, NI chose to maintain the original buckets and cleaned up the intermediate files. This resulted in a different folder structure on Amazon S3: the historical data had subfolders for each partition under the root table directory, while the new Iceberg data organizes subfolders within a data folder. This difference was acceptable in order to avoid data duplication and preserve the original Amazon S3 buckets.
Technical recap
The AWS Glue Data Catalog served as the primary source of truth for schema and table updates, with Amazon EventBridge capturing Data Catalog events to trigger synchronization workflows. AWS Lambda parsed event metadata and managed schema synchronization, while Apache Kafka buffered events for real-time processing. Apache Spark on Amazon EMR handled data transformations and incremental updates, and Amazon DynamoDB maintained state, including synchronization checkpoints and table mappings. Finally, Snowflake seamlessly consumed Iceberg tables through aliases without disrupting existing workflows.
Migration outcome
The migration was completed with zero downtime; continuous operations were maintained throughout the migration, supporting hundreds of pipelines and dashboards without interruption. The migration was done with a cost-optimized mindset, with incremental updates and partition-level synchronization that minimized the use of compute and storage resources. Finally, NI established a modern, vendor-neutral platform that enables scaling for their evolving analytics and machine learning needs. It enables seamless integration with multiple compute and query engines, supporting flexibility and further innovation.
Conclusion
Natural Intelligence's migration to Apache Iceberg was a pivotal step in modernizing the company's data infrastructure. By adopting a hybrid strategy and using the power of event-driven architectures, NI ensured a seamless transition that balanced innovation with operational stability. The journey underscored the importance of careful planning, understanding the data ecosystem, and focusing on an organization-first approach.
Above all, the business was kept in focus and continuity prioritized the user experience. By doing so, NI unlocked the flexibility and scalability of their data lake while minimizing disruption, allowing teams to use cutting-edge analytics capabilities and positioning the company at the forefront of modern data management and readiness for the future.
If you're considering an Apache Iceberg migration or facing similar data infrastructure challenges, we encourage you to explore the possibilities. Embrace open formats, use automation, and design with your organization's unique needs in mind. The journey can be complex, but the rewards in scalability, flexibility, and innovation are well worth the effort. You can use AWS Prescriptive Guidance to learn more about how to best use Apache Iceberg in your organization.
About the Authors
Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist.
Haya Stern is a Senior Director of Data at Natural Intelligence. She leads the development of NI's large-scale data platform, with a focus on enabling analytics, streamlining data workflows, and improving developer efficiency. In the past year, she led the successful migration from the previous data architecture to a modern lakehouse based on Apache Iceberg and Snowflake.
Zion Rubin is a Data Architect at Natural Intelligence with ten years of experience architecting large-scale big data platforms, now focused on creating intelligent agent systems that turn complex data into real-time business insight.
Michał Urbanowicz is a Cloud Data Engineer at Natural Intelligence with expertise in migrating data warehouses and implementing robust retention, cleanup, and monitoring processes to ensure scalability and reliability. He also develops automations that streamline and support campaign management operations in cloud-based environments.