
Enterprise scale in-place migration to Apache Iceberg: Implementation guidance


Organizations managing large-scale analytical workloads increasingly face challenges with conventional Apache Parquet-based data lakes that use Hive-style partitioning, including slow queries, complex file management, and limited consistency guarantees. Apache Iceberg addresses these pain points by providing ACID transactions, seamless schema evolution, and point-in-time data recovery capabilities that transform how enterprises handle their data infrastructure.

In this post, we demonstrate how you can achieve migration at scale from existing Parquet tables to Apache Iceberg tables. Using Amazon DynamoDB as a central orchestration mechanism, we show how you can implement in-place migrations that are highly configurable, repeatable, and fault-tolerant, unlocking the full potential of modern data lake architectures without extensive data movement or duplication.

Solution overview

When performing in-place migration, Apache Iceberg uses its ability to directly reference existing data files. This capability is only supported for formats such as Parquet, ORC, and Avro, because these formats are self-describing and include consistent schema and metadata information. Unlike raw formats such as CSV or JSON, they enforce structure and support efficient columnar or row-based access, which allows Iceberg to integrate them without rewriting the data.

In this post, we demonstrate how you can migrate an existing Parquet-based data lake that isn't cataloged in AWS Glue by using two methodologies:

  • Apache Iceberg migrate and register_table approach. Ideal for converting existing Hive-registered Parquet tables into Iceberg-managed tables.
  • Iceberg add_files approach. Best suited for quickly onboarding raw Parquet data into Iceberg without rewriting files.

The solution also incorporates a DynamoDB table that acts as a scalable control plane, so you can perform in-place migration of your data lake from Parquet format to Iceberg format.

The following diagram shows the different methodologies that you can use to achieve this in-place migration of your Hive-style partitioned data lake:

AWS data pipeline architecture diagram showing data flow from Amazon DynamoDB through Amazon EMR and AWS Glue to a Data Lake and Apache Iceberg Lakehouse, both using Parquet format, within an AWS Region.

You use DynamoDB to track the migration state, handling retries and recording errors and outcomes. This provides the following benefits:

  • Centralized control over which Amazon Simple Storage Service (Amazon S3) paths need migration.
  • Lifecycle tracking of each dataset through migration phases.
  • Capture and audit errors on a per-path basis.
  • Enable reruns by updating stateful flags or clearing failure messages.
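
For illustration, the following minimal sketch shows how a migration driver could read pending mappings from the control table with boto3. It assumes the migration-control-table name used later in this post and treats the absence of the metadata_location attribute (written on success by the state-management snippet in the Considerations section) as the marker for a pending migration; both choices are assumptions you can adapt.

import boto3

# Minimal sketch: list control-table items that have not been migrated yet.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # adjust the Region

def pending_migrations(table_name="migration-control-table"):
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName=table_name):
        for item in page["Items"]:
            if "metadata_location" not in item:  # assumed "pending" signal
                yield {
                    "s3_path": item["s3_path"]["S"],
                    "target_db_name": item["target_db_name"]["S"],
                    "target_table_name": item["target_table_name"]["S"],
                }

for task in pending_migrations():
    print(task)  # hand each mapping to the migrate or add_files job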

Prerequisites

Before you begin, you need:

Create a sample Parquet dataset as a source

You can create the sample Parquet dataset for testing the different methodologies using the Athena query editor. Replace <bucket-name> with an available bucket in your account.

  1. Create an AWS Glue database (test_db), if not present.
    CREATE DATABASE IF NOT EXISTS test_db

  2. Create a sample Parquet table (table1) and add data to be used for testing the migrate and register_table approach.
    CREATE TABLE table1
    WITH (
      external_location = 's3://<bucket-name>/table1/',
      format = 'PARQUET',
      partitioned_by = ARRAY['date', 'hour']
    )
    AS
    SELECT 
      1 as id,
      'John Doe' as name,
      25 as age,
      'Engineer' as job_title,
      current_date as created_date,
      current_date as date,
      hour(current_timestamp) as hour
    UNION ALL
    SELECT 2, 'Jane Smith', 30, 'Manager', current_date, current_date, hour(current_timestamp)
    UNION ALL  
    SELECT 3, 'Bob Johnson', 35, 'Analyst', current_date, current_date, hour(current_timestamp);

  3. Create a sample Parquet table (table2) and add data to be used for testing the add_files approach. Replace <bucket-name> with your bucket name.
    CREATE TABLE table2
    WITH (
      external_location = 's3://<bucket-name>/table2/',
      format = 'PARQUET',
      partitioned_by = ARRAY['date', 'hour']
    )
    AS
    SELECT 
      1 as id,
      'John Doe' as name,
      25 as age,
      'Engineer' as job_title,
      current_date as created_date,
      current_date as date,
      hour(current_timestamp) as hour
    UNION ALL
    SELECT 2, 'Jane Smith', 30, 'Manager', current_date, current_date, hour(current_timestamp)
    UNION ALL  
    SELECT 3, 'Bob Johnson', 35, 'Analyst', current_date, current_date, hour(current_timestamp);

  4. Drop the tables from the Data Catalog because you only need the Parquet data with the Hive-style partitioning structure.
    DROP TABLE IF EXISTS test_db.table1
    DROP TABLE IF EXISTS test_db.table2

Create a DynamoDB control table

Before beginning the migration process, you must create a DynamoDB table that serves as the control plane. This table maps source Amazon S3 paths to their corresponding Iceberg database and table destinations, enabling systematic tracking of the migration process.

To implement this control mechanism, create a table with the following structure:

  • A primary key s3_path that stores the source Parquet data location
  • Two attributes that define the target Iceberg destination:
    • target_db_name
    • target_table_name

To create the DynamoDB control table

  1. Create the Amazon DynamoDB table using the following AWS CLI command:
    aws dynamodb create-table \
    --table-name migration-control-table \
    --attribute-definitions \
    AttributeName=s3_path,AttributeType=S \
    --key-schema \
    AttributeName=s3_path,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST \
    --region <region>

  2. Verify that the table was created successfully. Replace <region> with the AWS Region where your data is stored:
    aws dynamodb describe-table --table-name migration-control-table --region <region>

  3. Create a migration_data.json file with the following contents.

    In this example:

    • Replace <bucket-name> and <prefix> with the name of your S3 bucket and the prefix containing the Parquet data
    • Replace <database-name> with the name of your target Iceberg database
    • Replace <table-name> with the name of your target Iceberg table
    {
        "migration-control-table": [
            {
                "PutRequest": {
                    "Item": {
                        "s3_path": {"S": "s3://<bucket-name>/table1/"},
                        "target_db_name": {"S": "test_db"},
                        "target_table_name": {"S": "table1"}
                    }
                }
            },
            {
                "PutRequest": {
                    "Item": {
                        "s3_path": {"S": "s3://<bucket-name>/table2/"},
                        "target_db_name": {"S": "test_db"},
                        "target_table_name": {"S": "table2"}
                    }
                }
            },
            {
                "PutRequest": {
                    "Item": {
                        "s3_path": {"S": "s3://<bucket-name>/<prefix>/"},
                        "target_db_name": {"S": "<database-name>"},
                        "target_table_name": {"S": "<table-name>"}
                    }
                }
            }
        ]
    }

    This file defines the mapping between Amazon S3 paths and their corresponding Iceberg table destinations.

  4. Run the following CLI command to load the DynamoDB control table.
    aws dynamodb batch-write-item \
    --request-items file://migration_data.json \
    --region <region>
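
    Optionally, you can confirm that the three mappings were loaded by scanning the control table:

    aws dynamodb scan \
    --table-name migration-control-table \
    --region <region>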

Migration methodologies

In this section, you explore two methodologies for migrating your existing Parquet tables to Apache Iceberg format:

  • Apache Iceberg migrate and register_table approach – This approach first converts your Parquet table to Iceberg format using the native migrate procedure, followed by registering it in AWS Glue using the register_table procedure.
  • Apache Iceberg add_files approach – This method creates an empty Iceberg table and uses the add_files procedure to import existing Parquet data files without physically moving them.

Apache Iceberg migrate and register_table procedure

Use the Apache Iceberg migrate procedure for in-place conversion of an existing Hive Parquet table into an Iceberg-managed table. Thereafter, you can use the Apache Iceberg register_table procedure to register the respective table in AWS Glue.

AWS workflow diagram showing DynamoDB to Apache Iceberg migration using Amazon EMR with Hive Metastore for migration and Glue Metastore for registration, displaying configuration tables at each stage.

Migrate

  1. On your EMR cluster with Hive as the metastore, create a PySpark session with the following Iceberg packages:
    pyspark \
    --name "Iceberg Migration" \
    --conf "spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar" \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive

    This post uses Iceberg v1.9.1 (Amazon EMR build), which is native to Amazon EMR 7.11. Always verify the latest supported version and update the package coordinates accordingly.

  2. Next, create your corresponding table in your Hive catalog (you can skip this step if you already have tables created in your Hive catalog). Replace <bucket-name> with the name of your S3 bucket.

    In the following snippet, change or remove the PARTITIONED BY clause based on the partition strategy of your table. The MSCK REPAIR TABLE command should only be run if your respective table is partitioned.

    # You can automate this for production scaling with DynamoDB as the control table
    s3_path = "s3://<bucket-name>/table1/"
    target_db_name = "test_db"
    target_table_name = "table1"
    # Read the data into a DataFrame to infer the schema
    df = spark.read.parquet(s3_path)
    df.createOrReplaceTempView("temp_view")
    # Get the schema as a string
    schema = spark.table("temp_view").schema
    schema_string = ", ".join([f"{field.name} {field.dataType.simpleString()}" for field in schema])
    # Create the database if it does not exist
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {target_db_name}").show()
    # full_table_name = test_db.table1
    full_table_name = f"{target_db_name}.{target_table_name}"
    # Create the table
    spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {full_table_name} (
        {schema_string}
    )
    STORED AS PARQUET
    PARTITIONED BY (date, hour)
    LOCATION '{s3_path}'
    """)
    # Refresh, repair, and validate
    spark.sql(f"REFRESH TABLE {full_table_name}")
    spark.sql(f"MSCK REPAIR TABLE {full_table_name}")

  3. Convert the Parquet table to an Iceberg table in Hive.
    # Run the migrate procedure
    spark.sql(f"CALL spark_catalog.system.migrate('{full_table_name}')")
    # Validate that the table was successfully migrated
    spark.sql(f"DESCRIBE FORMATTED {full_table_name}").show(truncate=False)

Running the migrate command converts the Parquet-based table to an Iceberg table, creating the metadata folder and the metadata.json file within it.

You can stop at this point if you don't intend to migrate your existing Iceberg table from Hive to the Data Catalog.
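
As an optional check (a sketch, not part of the original procedure), you can query Iceberg's metadata tables through the Hive-backed session catalog to confirm that a snapshot was created and that the table still points at the original Parquet files:

# Optional validation sketch: inspect Iceberg metadata tables after migrate
spark.sql(f"SELECT snapshot_id, operation FROM spark_catalog.{full_table_name}.snapshots").show(truncate=False)
spark.sql(f"SELECT file_path FROM spark_catalog.{full_table_name}.files").show(truncate=False)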

Register

  1. Log in to the EMR cluster that has AWS Glue enabled as the Spark catalog.
  2. Register the Iceberg table in your Data Catalog.

    Create the session with the respective Iceberg packages. Replace <bucket-name> with your bucket name and <warehouse-path> with your warehouse directory.

    pyspark \
    --conf "spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar" \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.glue_catalog.warehouse=s3://<bucket-name>/<warehouse-path>/" \
    --conf "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog" \
    --conf "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO"

  3. Run the register_table command to make the Iceberg table visible in AWS Glue.
    • register_table registers an existing Iceberg table's metadata file (metadata.json) with a catalog (glue_catalog) so that Spark (and other engines) can query it.
    • The procedure creates a Data Catalog entry for the table, pointing it to the given metadata location.

    Replace <bucket-name> and <metadata-file> with the name of your S3 bucket and the metadata file name.

    Make sure that your EMR Spark cluster has been configured with the appropriate AWS Glue permissions.

    # You can automate this for production scaling with DynamoDB as the control table
    metadata_location = "s3://<bucket-name>/table1/metadata/<metadata-file>.metadata.json"
    target_db_name = "test_db"
    target_table_name = "table1"
    full_table_name = f"{target_db_name}.{target_table_name}"
    # Register the existing Iceberg table metadata in the Glue Data Catalog
    spark.sql(f"CALL glue_catalog.system.register_table('{full_table_name}', '{metadata_location}')")
    # Set table properties (example: Iceberg format version 2)
    spark.sql(f"ALTER TABLE glue_catalog.{full_table_name} SET TBLPROPERTIES('format-version'='2')")

  4. Validate that the Iceberg table is now visible in the Data Catalog.
    # Look for the format listed as iceberg/parquet
    spark.sql("SHOW TBLPROPERTIES glue_catalog.test_db.table1").show()

Apache Iceberg's add_files procedure

AWS workflow diagram showing DynamoDB to Apache Iceberg migration using AWS Glue Add_Files procedure, displaying input configuration and output status tables with metadata location and registration confirmation.

Here, you use Iceberg's add_files procedure to import raw data files (Parquet, ORC, Avro) into an existing Iceberg table by updating its metadata. This procedure works for both Hive and the Data Catalog. It doesn't physically move or rewrite the files; it only registers them so Iceberg can manage them.

This technique includes the following steps:

  1. Create an empty Iceberg table in AWS Glue.

    Because the add_files procedure expects the Iceberg table to already exist, you need to create an empty Iceberg table by inferring the table schema.
  2. Register the existing data locations to the Iceberg table.

Using the add_files procedure in a Glue-backed Iceberg catalog registers the target S3 path, together with all its subdirectories, to the empty Iceberg table created in the previous step.

You can consolidate both steps into a single Spark job. For the following AWS Glue job, you specify iceberg as the value for the --datalake-formats job parameter; a sample create-job command follows the script. See the AWS Glue job configuration documentation for more details.

Replace <bucket-name> with your S3 bucket name and <warehouse-path> with your warehouse directory.

from pyspark.sql import SparkSession
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
target_db_name = "test_db"
target_table_name = "table2"
s3_path = "s3://<bucket-name>/table2"
# Set to None or [] for an unpartitioned table
partitioned_cols = ["date", "hour"]
spark = SparkSession.builder \
    .appName("Iceberg Add Files") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<bucket-name>/<warehouse-path>/") \
    .getOrCreate()
full_table_name = f"glue_catalog.{target_db_name}.{target_table_name}"
# Read the schema from the source files (schema inference)
df = spark.read.parquet(s3_path)
schema = df.schema
# Create an empty Iceberg table
empty_df = spark.createDataFrame([], schema)
if partitioned_cols:
    empty_df.writeTo(full_table_name).using("iceberg").partitionedBy(*partitioned_cols).tableProperty("format-version", "2").create()
else:
    empty_df.writeTo(full_table_name).using("iceberg").tableProperty("format-version", "2").create()
logger.info(f"Created empty Iceberg table: {full_table_name}")
# Register the existing Parquet files with the Iceberg table (no data movement)
spark.sql(f"""
CALL glue_catalog.system.add_files(
  '{target_db_name}.{target_table_name}',
  'parquet.`{s3_path}`'
)
""")

When working with non-Hive partitioned datasets, a direct migration to Apache Iceberg using add_files might not behave as expected. See Appendix C for more information.

Considerations

Let's explore two key considerations that you should address when implementing your migration strategy.

State management using the DynamoDB control table

Use the following sample code snippet to update the state of the DynamoDB table:

def update_dynamodb_record(self, s3_path, metadata_loc=None, error_msg=None):
    # Get the current error message
    try:
        response = self.dynamodb.get_item(
            TableName="migration-control-table",
            Key={'s3_path': {'S': s3_path}}
        )
        current_error = response.get('Item', {}).get('error_message', {}).get('S', '')
    except Exception:
        current_error = ""
    if error_msg:
        # Error case
        error_msg = (error_msg or "Unknown error")[:1000]
        update_expr = "SET error_message = :err"
        attr_values = {':err': {'S': error_msg}}
        if current_error:
            update_expr += ", prev_error_message = :prev"
            attr_values[':prev'] = {'S': current_error}
        update_kwargs = {'TableName': 'migration-control-table', 'Key': {'s3_path': {'S': s3_path}}, 'UpdateExpression': update_expr, 'ExpressionAttributeValues': attr_values}
        self.logger.error(f"Set error for {s3_path}: {error_msg}")
    else:
        # Success case
        update_kwargs = {
            'TableName': 'migration-control-table',
            'Key': {'s3_path': {'S': s3_path}},
            'UpdateExpression': 'SET #s = :status, #m = :meta, #p = :prev, #e = :err',
            'ExpressionAttributeNames': {'#s': 'status', '#p': 'prev_error_message', '#e': 'error_message', '#m': 'metadata_location'
            },
            'ExpressionAttributeValues': {
                ':status': {'S': 'Iceberg_Metadata_Populated and Registered'},
                ':prev': {'S': current_error},
                ':err': {'S': ''},
                ':meta': {'S': metadata_loc}
            }
        }
        self.logger.info(f"Updated DynamoDB status for {s3_path}: {metadata_loc}")
    # Apply the update to the control table
    self.dynamodb.update_item(**update_kwargs)

This ensures that errors are logged and saved to DynamoDB as error_message. On successive retries, previous errors move to prev_error_message and new errors overwrite error_message. Successful operations clear error_message and archive the last error.

Protecting your data from accidental deletion

To protect your data from accidental deletion, never delete data or metadata files from Amazon S3 directly. Iceberg tables that are registered in AWS Glue or Athena are managed tables and should be deleted using the DROP TABLE command from Spark or Athena. The DROP TABLE command deletes both the table metadata and the underlying data files in S3. See Appendix D for more information.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the DynamoDB control table.
  2. Delete the database and tables.
  3. Delete the EMR clusters and the AWS Glue job used for testing.

Conclusion

In this post, we showed you how to modernize your Parquet-based data lake into an Apache Iceberg-powered lakehouse without rewriting or duplicating data. You learned two complementary approaches for this in-place migration:

  • Migrate and register – Ideal for converting existing Hive-registered Parquet tables into Iceberg-managed tables.
  • add_files – Best suited for quickly onboarding raw Parquet data into Iceberg without rewriting files.

Both approaches benefit from DynamoDB centralized state tracking, which enables retries, error auditing, and lifecycle management across multiple datasets.

By combining Apache Iceberg with Amazon EMR, AWS Glue, and Amazon DynamoDB, you can create a production-ready migration pipeline that is observable, automated, and easy to extend to future data format upgrades. This pattern forms a solid foundation for building an Iceberg-based lakehouse on AWS, helping you achieve faster analytics, better data governance, and long-term flexibility for evolving workloads.

To get started, try implementing this solution using the sample tables (table1 and table2) that you created using Athena queries. We encourage you to share your migration experiences and questions in the comments.


Appendix A — Creating an EMR cluster for a Hive metastore using the console and AWS CLI

Console steps:

  1. Open the AWS Management Console for Amazon EMR and choose Create cluster.
  2. Select Spark or Hive under applications.
  3. Under AWS Glue Data Catalog settings, make sure that the following options are not selected:
    • Use for Hive table metadata
    • Use for Spark table metadata
  4. Configure SSH access (KeyName).
  5. Configure the network (VPC, subnets, security groups) to allow access to S3.

AWS CLI steps:

aws emr create-cluster \
  --region us-east-1 \
  --name "IcebergHiveCluster711" \
  --release-label emr-7.11.0 \
  --applications Name=Hive Name=Spark Name=Hadoop \
  --ec2-attributes '{"KeyName":"<key-name>","SubnetId":"<subnet-id>"}' \
  --instance-groups '[
    {
      "Name":"Master",
      "InstanceGroupType":"MASTER",
      "InstanceType":"m5.xlarge",
      "InstanceCount":1
    },
    {
      "Name":"Workers",
      "InstanceGroupType":"CORE",
      "InstanceType":"m5.xlarge",
      "InstanceCount":2
    }
  ]' \
  --use-default-roles


Appendix B — EMR cluster with AWS Glue as the Spark metastore

Console steps:

  1. Open the Amazon EMR console, choose Create cluster, and then select EMR Serverless or provisioned EMR.
  2. Under Software Configuration, verify that Spark is installed.
  3. Under AWS Glue Data Catalog settings, select Use Glue Data Catalog for Spark metadata.
  4. Configure SSH access (KeyName).
  5. Configure network settings (VPC, subnets, and security groups) to allow access to Amazon S3 and AWS Glue.

AWS CLI (provisioned Amazon EMR):

aws emr create-cluster \
  --region us-east-1 \
  --name "IcebergGlueCluster711" \
  --release-label emr-7.11.0 \
  --applications Name=Spark Name=Hadoop \
  --ec2-attributes '{"KeyName":"<key-name>","SubnetId":"<subnet-id>"}' \
  --instance-groups '[
    {
      "Name":"Master",
      "InstanceGroupType":"MASTER",
      "InstanceType":"m5.xlarge",
      "InstanceCount":1
    },
    {
      "Name":"Workers",
      "InstanceGroupType":"CORE",
      "InstanceType":"m5.xlarge",
      "InstanceCount":2
    }
  ]' \
 --configurations '[{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
 --use-default-roles


Appendix C — Non-Hive partitioned datasets and Iceberg add_files

This appendix explains why a direct in-place migration using an add_files-style procedure might not behave as expected for datasets that are not Hive-partitioned, and shows recommended fixes and examples.

AWS Glue and Athena follow Hive-style partitioning, where partition column values are encoded in the S3 path rather than inside the data files. For example, following the Parquet dataset created in the Create a sample Parquet dataset as a source section of this post:

s3://amzn-s3-demo-bucket/events/event_date=2024-09-01/hour=5/part-0000.parquet
s3://amzn-s3-demo-bucket/events/event_date=2024-09-02/hour=5/part-0001.parquet

  • Partition columns (event_date, hour) are represented in the folder structure.
  • Non-partition columns (for example, id, name, age) remain inside the Parquet files.
  • Iceberg add_files can correctly map partitions based on the folder path, even when partition columns are missing from the Parquet file itself.

Partition column | Stored in path | Stored in file | Athena or AWS Glue and Iceberg behavior
event_date       | Yes            | Yes            | Partitions inferred correctly
hour             | Yes            | No             | Partitions still inferred from path

Non-Hive partitioning layout (problem case)

s3://amzn-s3-demo-bucket/events/date/part-0000.parquet
s3://amzn-s3-demo-bucket/events/date/part-0001.parquet

  • No partition columns in the path.
  • Files might not contain partition columns.

If you try to create an empty Iceberg table and directly load it using add_files on a non-Hive layout, the following happens:

  1. Iceberg cannot automatically map partitions; add_files operations fail or register files with incorrect or missing partition metadata.
  2. Queries in Athena or AWS Glue return unexpected NULLs or incomplete results.
  3. Successive incremental writes using add_files fail.

Recommended approaches:

Create an AWS Glue table and use the Iceberg snapshot procedure:

  1. Create a table in AWS Glue pointing to your existing Parquet dataset.

    You might need to manually provide the schema because the AWS Glue crawler might fail to automatically infer it for you.

  2. Use Iceberg's snapshot procedure to convert and move the AWS Glue table into your target Iceberg table (see the sketch at the end of this appendix).

This works because Iceberg relies on AWS Glue for schema inference, so this approach ensures correct mapping of columns and partitions without rewriting the data. For more information, see Snapshot procedure.
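
The following is a minimal sketch of that snapshot call, assuming a Glue table named test_db.events_raw already points at the non-Hive-partitioned Parquet data and that the Spark session is configured with glue_catalog as in the add_files example (both table names here are hypothetical):

# Sketch only: snapshot the Glue-cataloged source table into a new Iceberg table
spark.sql("""
CALL glue_catalog.system.snapshot(
  source_table => 'test_db.events_raw',
  table => 'test_db.events_iceberg'
)
""").show()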


Appendix D — Understanding table types: Managed compared to external

By default, all non-Iceberg tables created in AWS Glue or Athena are external tables; Athena doesn't manage the underlying data. If you use CREATE TABLE without the EXTERNAL keyword for non-Iceberg tables, Athena issues an error.

However, when dealing with Iceberg tables, AWS Glue and Athena also manage the underlying data for the respective tables, so these tables are treated as internal tables.

Running DROP TABLE on Iceberg tables deletes the table and the underlying data.

The following table describes the effect of DELETE and DROP TABLE actions on Iceberg tables in AWS Glue and Athena:

Operation | What it does | Effect on S3 data
DELETE FROM mydb.products_iceberg WHERE date = '2025-10-06'; | Creates a new snapshot, hides deleted rows | Data files stay until cleanup
DROP TABLE test_db.table1; | Deletes the table and all data | Files are permanently removed
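
The "data files stay until cleanup" behavior refers to Iceberg snapshot retention. As a hedged example (the retention values are illustrative and the table name is reused from this post), older snapshots and their unreferenced data files can be removed with Iceberg's expire_snapshots procedure:

# Illustrative cleanup: expire old snapshots so unreferenced data files can be removed
spark.sql("""
CALL glue_catalog.system.expire_snapshots(
  table => 'test_db.table1',
  older_than => TIMESTAMP '2025-10-06 00:00:00',
  retain_last => 1
)
""").show()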

About the authors

Mihir Borkar

Mihir is a seasoned AWS Data Architect with nearly a decade of experience designing and implementing enterprise-scale data solutions on AWS. He specializes in modernizing data architectures using AWS data analytics services, designing scalable data lakes and analytics platforms with a focus on efficient, cost-effective solutions. In his free time, Mihir likes to study emerging cloud technologies and explore the latest developments in AI/ML.

Amit Maindola

Amit is a Senior Data Architect with the AWS ProServe team focused on data engineering, analytics, and AI/ML at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Arghya Banerjee

Arghya is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He focuses on big data, data lakes, streaming and batch analytics services, and generative AI technologies.
