
Implement fine-grained access control for Iceberg tables using Amazon EMR on EKS integrated with AWS Lake Formation


The rise of distributed data processing frameworks such as Apache Spark has revolutionized the way organizations manage and analyze large-scale data. However, as the volume and complexity of data continue to grow, the need for fine-grained access control (FGAC) has become increasingly important. This is particularly true in scenarios where sensitive or proprietary data must be shared across multiple teams or organizations, such as in open data initiatives. Implementing robust access control mechanisms is crucial to maintain secure and controlled access to data stored in Open Table Format (OTF) within a modern data lake.

One approach to addressing this challenge is to use Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) and incorporate FGAC mechanisms. With Amazon EMR on EKS, you can run open source big data frameworks such as Spark on Amazon EKS. This integration provides the scalability and flexibility of Kubernetes, while also using the data processing capabilities of Amazon EMR.

On February 6, 2025, AWS launched fine-grained access control based on AWS Lake Formation for EMR on EKS, starting with Amazon EMR release 7.7. You can now significantly enhance your data governance and security frameworks using this feature.

In this post, we demonstrate how to implement FGAC on Apache Iceberg tables using EMR on EKS with Lake Formation.

Data mesh use case

With FGAC in a data mesh architecture, domain owners can manage access to their data products at a granular level. This decentralized approach allows for greater agility and control, making sure data is accessible only to authorized users and services within or across domains. Policies can be tailored to specific data products, considering factors like data sensitivity, user roles, and intended use. This localized control enhances security and compliance while supporting the self-service nature of the data mesh.

FGAC is especially helpful in business domains that deal with sensitive data, such as healthcare, finance, legal, human resources, and others. In this post, we focus on examples from the healthcare domain, showcasing how we can achieve the following:

  • Share patient data securely – Data mesh enables different departments within a hospital to manage their own patient data as independent domains. FGAC makes sure only authorized personnel can access specific patient records or data elements based on their roles and on a need-to-know basis.
  • Facilitate research and collaboration – Researchers can access de-identified patient data from various hospital domains through the data mesh architecture, enabling collaboration between multidisciplinary teams across different healthcare institutions, fostering knowledge sharing, and accelerating research and discovery. FGAC helps compliance with privacy regulations (such as HIPAA) by limiting access to sensitive data elements or allowing access only to aggregated, anonymized datasets.
  • Improve operational efficiency – Data mesh can streamline data sharing between hospitals and insurance companies, simplifying billing and claims processing. FGAC makes sure only authorized personnel within each organization can access the necessary data, protecting sensitive financial information.

Solution overview

In this post, we explore how to implement FGAC on Iceberg tables within an EMR on EKS application, using the capabilities of Lake Formation. For details on how to implement FGAC on Amazon EMR on EC2, refer to Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation.

The following components play critical roles in this solution design:

  • Apache Iceberg OTF:
    • High-performance table format for large-scale analytics
    • Supports schema evolution, ACID transactions, and time travel
    • Compatible with Spark, Trino, Presto, and Flink
    • Amazon S3 Tables offers fully managed Iceberg tables for analytics workloads
  • AWS Lake Formation:
    • FGAC for data lakes
    • Column-, row-, and cell-level security controls
  • Data mesh producers and consumers:
    • Producers: Create and serve domain-specific data products
    • Consumers: Access and integrate data products
    • Enables self-service data consumption

To demonstrate how you can use Lake Formation to implement cross-account FGAC within an EMR on EKS environment, we create tables in the AWS Glue Data Catalog in a central AWS account acting as the producer, and provision different user personas reflecting various roles and access levels in a separate AWS account acting as multiple consumers. Consumers can be spread across multiple accounts in real-world scenarios.

The following diagram illustrates the high-level solution architecture.

AWS Healthcare Data Architecture: FGAC using Lake Formation Integration with EMR on EKS

Figure 1: High-Level Solution Architecture

To demonstrate cross-account data sharing and data filtering with Lake Formation FGAC, the solution deploys two different Iceberg tables with different access levels for different consumers. Consumer permissions are mapped using cross-account table shares and data cell filters.

The solution has two teams with different levels of Lake Formation permissions to access the Patients and Claims Iceberg tables. The following table summarizes the solution's user personas.

Persona/Table Name | Patients | Claims
Patients Care Team (team1 job execution role) | Exclude the column ssn; include rows only from the Texas and New York states | Full table access
Claims Care Team (team2 job execution role) | No access | Full table access

Prerequisites

This solution requires an AWS account with an AWS Identity and Access Management (IAM) power user role that can create and interact with AWS services, including Amazon EMR, Amazon EKS, AWS Glue, Lake Formation, and Amazon Simple Storage Service (Amazon S3). Additional specific requirements for each account are detailed in the relevant sections.

Clone the project

To get started, download the project either to your computer or the AWS CloudShell console:

git clone https://github.com/aws-samples/sample-emr-on-eks-fgac-iceberg
cd sample-emr-on-eks-fgac-iceberg

Set up infrastructure in producer account

To set up the infrastructure in the producer account, you must have the following additional resources:

The setup script deploys the following infrastructure:

  • An S3 bucket to store sample data in Iceberg table format, registered as a data location in Lake Formation
  • An AWS Glue database named healthcare_db
  • Two AWS Glue tables: the Patients and Claims Iceberg tables
  • A Lake Formation data access IAM role
  • Cross-account permissions enabled for the consumer account:
    • Allow the consumer to describe the database healthcare_db in the producer account
    • Allow access to the Patients table using a data cell filter, based on row-level selected states, and excluding the column ssn
    • Allow full table access to the Claims table

Run the following producer_iceberg_datalake_setup.sh script to create a development environment in the producer account. Update its parameters according to your requirements:

export AWS_REGION=us-west-2
export PRODUCER_AWS_ACCOUNT= 
export CONSUMER_AWS_ACCOUNT= 
./producer_iceberg_datalake_setup.sh
# run the clean-up script before re-running the setup if needed
./producer_clean_up.sh
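For reference, the Iceberg table creation that the script performs is conceptually similar to the following PySpark sketch. This is an illustration only: the catalog name, warehouse path, and exact column list are assumptions, and the DDL in the repository's setup script is authoritative.

# Hypothetical sketch of the patients Iceberg table DDL; see the repository
# setup script for the real definition. The warehouse path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("producer-iceberg-setup")
    .config("spark.sql.catalog.dev", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dev.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.dev.warehouse", "s3://<producer-data-bucket>/warehouse/")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.healthcare_db.patients (
        patient_id   string,
        patient_name string,
        ssn          string,
        state        string
    )
    USING iceberg
""")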

Enable cross-account Lake Formation access in producer account

A consumer account ID and an EMR on EKS Engine session tag must be set in the producer's environment. This allows the consumer to access the producer's AWS Glue tables governed by Lake Formation. Complete the following steps to enable cross-account access:

  1. Open the Lake Formation console in the producer account.
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter EMR on EKS Engine.
  5. For AWS account IDs, enter your consumer account ID.
  6. Choose Save.
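If you prefer to script these console steps, the same setting can be applied with the Lake Formation API. The following boto3 sketch is a hedged example (the consumer account ID is a placeholder). Note that PutDataLakeSettings replaces the entire settings object, so read, modify, and write it back:

# Hedged sketch: enable external data filtering for the consumer account.
import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")

# PutDataLakeSettings overwrites all settings, so fetch the current ones first.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["AllowExternalDataFiltering"] = True
settings["AuthorizedSessionTagValueList"] = ["EMR on EKS Engine"]
settings["ExternalDataFilteringAllowList"] = [
    {"DataLakePrincipalIdentifier": "<consumer-account-id>"}
]

lf.put_data_lake_settings(DataLakeSettings=settings)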
Comprehensive AWS Lake Formation application integration settings interface for managing third-party data access.

Figure 2: Producer Account – Lake Formation third-party engine configuration screen with session tags, account IDs, and data access permissions.

Validate FGAC setup in producer environment

To validate the FGAC setup in the producer account, check the Iceberg tables, the data filter, and the FGAC permission settings.

Iceberg tables

Two AWS Glue tables in Iceberg format were created by producer_iceberg_datalake_setup.sh. On the Lake Formation console, choose Tables under Data Catalog in the navigation pane to see the tables listed.

AWS Lake Formation Tables interface showing a success message for updated external data filtering settings, with a table list displaying healthcare database tables in Apache Iceberg format.

Figure 3: Lake Formation interface showing the claims and patients tables from healthcare_db in Apache Iceberg format.

The following screenshot shows an example of the patients table data.

Patients table data

Figure 4: Patients table data

The following screenshot shows an example of the claims table data.

claims table data

Figure 5: Claims table data

Data cell filter on the patients table

After you successfully run the producer_iceberg_datalake_setup.sh script, a new data cell filter named patients_column_row_filter is created in Lake Formation. This filter performs two functions (see the API sketch after this list):

  • Excludes the ssn column from the patients table data
  • Includes only rows where the state is Texas or New York
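Under the hood, this corresponds to a Lake Formation data cells filter. The following boto3 sketch is a hedged illustration of how the setup script's filter could be created; the row filter expression and account ID are assumptions, and the script in the repository is authoritative:

# Hedged sketch: create the data cell filter via the Lake Formation API.
# The FilterExpression is assumed from the behavior described above.
import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "<producer-account-id>",  # placeholder
        "DatabaseName": "healthcare_db",
        "TableName": "patients",
        "Name": "patients_column_row_filter",
        "RowFilter": {"FilterExpression": "state IN ('Texas', 'New York')"},
        # Exclude ssn; every other column remains visible.
        "ColumnWildcard": {"ExcludedColumnNames": ["ssn"]},
    }
)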

To view the data cell filter, choose Data filters under Data Catalog in the navigation pane of the Lake Formation console, and open the filter. Choose View permissions to see the permission details.

Data cell filter

Figure 6: Column- and row-level filter configuration for the patients table

FGAC permissions allowing cross-account access

To view all the FGAC permissions, choose Data permissions under Permissions in the navigation pane of the Lake Formation console, and filter by the database name healthcare_db.

Make sure to revoke the data permissions associated with the IAMAllowedPrincipals principal on the healthcare_db tables, because they will cause cross-account data sharing to fail, particularly with AWS Resource Access Manager (AWS RAM).
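If you prefer to script that revocation, the following boto3 sketch shows one hedged way to do it (run it in the producer account, once per table in healthcare_db):

# Hedged sketch: remove the default IAMAllowedPrincipals grant so that
# Lake Formation permissions and AWS RAM sharing take effect.
import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")
for table_name in ["patients", "claims"]:
    lf.revoke_permissions(
        Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        Resource={"Table": {"DatabaseName": "healthcare_db", "Name": table_name}},
        Permissions=["ALL"],
    )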

Data permissions overview

Figure 7: Lake Formation data permissions interface showing filtered healthcare database resources with granular access controls

The following table summarizes the overall FGAC setup.

Resource Type | Resource | Permissions | Grant Permissions
Database | healthcare_db | Describe | Describe
Data cell filter | patients_column_row_filter | Select | Select
Table | claims | Select, Describe | Select, Describe
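These entries can also be granted programmatically. The following boto3 sketch is a hedged illustration of the data cell filter grant to the consumer account (account IDs are placeholders; the setup script performs the actual grants):

# Hedged sketch: grant the consumer account SELECT on the data cell filter,
# with grant option so the consumer's Lake Formation admin can re-grant it.
import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "<consumer-account-id>"},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "<producer-account-id>",
            "DatabaseName": "healthcare_db",
            "TableName": "patients",
            "Name": "patients_column_row_filter",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)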

Set up infrastructure in consumer account

To set up the infrastructure in the consumer account, you must have the following additional resources:

  • The eksctl and kubectl packages must be installed
  • An IAM role in the consumer account must be a Lake Formation administrator to run the consumer_emr_on_eks_setup.sh script
  • The Lake Formation admin must accept the AWS RAM resource share invitations using the AWS RAM console, if the consumer account is outside of the producer's organizational unit
RAM resource share screen

Figure 8: Consumer account – Cross-account RAM share for Lake Formation resources

The setup script deploys the following infrastructure:

  • An EKS cluster called fgac-blog with two namespaces:
    • User namespace: lf-fgac-user
    • System namespace: lf-fgac-secure
  • An EMR on EKS virtual cluster emr-on-eks-fgac-blog:
    • Set up with a security configuration emr-on-eks-fgac-sec-conifg
    • Two EMR on EKS job execution IAM roles:
      • Role for the Patients Care Team (team1): emr_on_eks_fgac_job_team1_execution_role
      • Role for the Claims Care Team (team2): emr_on_eks_fgac_job_team2_execution_role
    • A query engine IAM role used by the FGAC secure namespace: emr_on_eks_fgac_query_execution_role
  • An S3 bucket to store PySpark job scripts and logs
  • An AWS Glue local database named consumer_healthcare_db
  • Two resource links to the cross-account shared AWS Glue tables: rl_patients and rl_claims (see the sketch after this list)
  • Lake Formation permissions on the Amazon EMR IAM roles
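For reference, a resource link such as rl_patients is a Data Catalog table that points at the shared producer table. The following boto3 sketch is a hedged illustration of how the setup script might create it (account IDs are placeholders):

# Hedged sketch: create a Glue resource link to the cross-account shared table.
import boto3

glue = boto3.client("glue", region_name="us-west-2")
glue.create_table(
    DatabaseName="consumer_healthcare_db",
    TableInput={
        "Name": "rl_patients",
        # TargetTable makes this table a resource link to the producer's table.
        "TargetTable": {
            "CatalogId": "<producer-account-id>",
            "DatabaseName": "healthcare_db",
            "Name": "patients",
        },
    },
)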

Run the following consumer_emr_on_eks_setup.sh script to set up a development environment in the consumer account. Update the parameters according to your use case:

export AWS_REGION=us-west-2
export PRODUCER_AWS_ACCOUNT= 
export EKSCLUSTER_NAME=fgac-blog
./consumer_emr_on_eks_setup.sh
# run the clean-up script before re-running the setup if needed
./consumer_clean_up.sh

Enable cross-account Lake Formation access in consumer account

The consumer account must register its own account ID with an EMR on EKS Engine session tag in Lake Formation. This session tag will be used by the EMR on EKS job execution IAM roles to access Lake Formation tables. Complete the following steps:

  1. Open the Lake Formation console in the consumer account.
  2. Choose Application integration settings under Administration in the navigation pane.
  3. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  4. For Session tag values, enter EMR on EKS Engine.
  5. For AWS account IDs, enter your consumer account ID.
  6. Choose Save.

Figure 9: Consumer Account – Lake Formation third-party engine configuration screen with session tags, account IDs, and data access permissions

Validate FGAC setup in consumer environment

To validate the FGAC setup in the consumer account, check the EKS cluster, namespaces, and Spark job scripts to test data permissions.

EKS cluster

On the Amazon EKS console, choose Clusters in the navigation pane and confirm that the EKS cluster fgac-blog is listed.

EKS Cluster view page

Figure 10: Consumer Account – EKS Cluster console page

Namespaces in Amazon EKS

Kubernetes uses namespaces as a logical partitioning mechanism for organizing objects such as Pods and Deployments. Namespaces also function as a privilege boundary in the Kubernetes role-based access control (RBAC) system. Multi-tenant workloads in Amazon EKS can be secured using namespaces.

This solution creates two namespaces:

  • lf-fgac-user
  • lf-fgac-secure

The StartJobRun API uses backend workflows to submit a Spark job's user components (JobRunner, driver, executors) in the user namespace, and the corresponding system components in the system namespace, to achieve the desired FGAC behaviors.

You can verify the namespaces with the following command:

kubectl get namespace

The following screenshot shows an example of the expected output.

Namespace summary page

Figure 11: EKS Cluster namespaces

Spark job script to test the Patients Care Team's data permissions

Starting with Amazon EMR version 6.6.0, you can use Spark on EMR on EKS with the Iceberg table format. For more information on how Iceberg works in an immutable data lake, see Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

The following script is a snippet of the PySpark job that retrieves filtered data from the Patients and Claims tables:

    print("Affected person Care Group PySpark job working on EMR on EKS! to question Sufferers and Claims tables!")
    print("This job queries Sufferers and Claims tables!")
    df1 = spark.sql('SELECT * FROM dev.${CONSUMER_DATABASE}.${rl_patients}')
    print("Sufferers tables information:")
    print("Observe: Sufferers desk is filtered on SSN column and it exhibits data just for Texas and New York states")
    df1.present(20)
    df2 = spark.sql('SELECT p.state,
                            c.claim_id,
                            c.claim_date, 
                            p.patient_name, 
                            c.diagnosis_code, 
                            c.procedure_code, 
                            c.quantity, 
                            c.standing, 
                            c.provider_id 
                    FROM dev.${CONSUMER_DATABASE}.${rl_claims} c 
                    JOIN dev.${CONSUMER_DATABASE}.${rl_patients} p
                   ON c.patient_id = p.patient_id 
                   ORDER BY p.state, c.claim_date')
    print("Present solely related Claims information for Sufferers chosen from Texas and New York state:")
    df2.present(20)
    print("Job Full")
....	

Spark job script to test the Claims Care Team's data permissions

The following script is a snippet of the PySpark job that retrieves data from the Claims table:

    print("Claims Group PySpark job working on EMR on EKS to question Claims desk!")
    print("Observe: Claims Group has full entry to Claims desk!")
    df = spark.sql('SELECT * FROM     dev.${CONSUMER_DATABASE}.${rl_claims}')
    df.present(20)
....

Validate job execution roles for EMR on EKS

The Patients Care Team uses the emr_on_eks_fgac_job_team1_execution_role IAM role to run a PySpark job on EMR on EKS. This job execution role has permission to query both the Patients and Claims tables.

The Claims Care Team uses the emr_on_eks_fgac_job_team2_execution_role IAM role to run jobs on EMR on EKS. This job execution role only has permission to access Claims data.

Both IAM job execution roles have the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EmrGetCertificate",
            "Effect": "Allow",
            "Action": "emr-containers:CreateCertificate",
            "Resource": "*"
        },
        {
            "Sid": "LakeFormationManagedAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetTable",
                "glue:GetCatalog",
                "glue:Create*",
                "glue:Update*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "EmrSparkJobAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::${S3_BUCKET}*"
            ]
        }
    ]
}

The following code is the job execution IAM role trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TrustQueryEngineRoleToAssume",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::$CONSUMER_ACCOUNT:role/$query_engine_role"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Condition": {
                "StringLike": {
                    "aws:RequestTag/LakeFormationAuthorizedCaller": "EMR on EKS Engine"
                }
            }
        },
        {
            "Sid": "TrustQueryEngineRoleToAssumeRoleOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::$CONSUMER_ACCOUNT:role/$query_engine_role"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::$CONSUMER_ACCOUNT:oidc-provider/oidc.eks.$AWS_REGION.amazonaws.com/id/xxxxx"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.$AWS_REGION.amazonaws.com/id/xxxxx:sub": "system:serviceaccount:lf-fgac-user:emr-containers-sa-*-*-$CONSUMER_ACCOUNT-"
                }
            }
        }
    ]
}

The following code is the query engine IAM role policy (emr_on_eks_fgac_query_execution_role-policy):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AssumeJobExecutionRole",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Resource": [
                "arn:aws:iam::$CONSUMER_ACCOUNT:role/emr_on_eks_fgac_job_team1_execution_role",
                "arn:aws:iam::$CONSUMER_ACCOUNT:role/emr_on_eks_fgac_job_team2_execution_role"
            ],
            "Condition": {
                "StringLike": {
                    "aws:RequestTag/LakeFormationAuthorizedCaller": "EMR on EKS Engine"
                }
            }
        },
        {
            "Sid": "AssumeJobExecutionRoleOnly",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": [
                "arn:aws:iam::$CONSUMER_ACCOUNT:role/emr_on_eks_fgac_job_team1_execution_role",
                "arn:aws:iam::$CONSUMER_ACCOUNT:role/emr_on_eks_fgac_job_team2_execution_role"
            ]
        }
    ]
}

The following code is the query engine IAM role trust policy:

{
    "Model": "2012-10-17",
    "Assertion": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::$CONSUMER_ACCOUNT:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::$CONSUMER_ACCOUNT:oidc-provider/xxxxx"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "xxxxxx:sub": "system:serviceaccount:lf-fgac-secure:emr-containers-sa-*-*-$CONSUMER_ACCOUNT-"
                }
            }
        }
    ]
}

Run PySpark jobs on EMR on EKS with FGAC

For more details about how to work with Iceberg tables in EMR on EKS jobs, refer to Using Apache Iceberg with Amazon EMR on EKS. Complete the following steps to run the PySpark jobs on EMR on EKS with FGAC:

  1. Run the following commands to submit the patients and claims jobs:
bash /tmp/submit-patients-job.sh
bash /tmp/submit-claims-job.sh

  2. Watch the application logs from the Spark driver pod:

kubectl logs <driver-pod-name> -c spark-kubernetes-driver -n lf-fgac-user -f

Alternatively, you can navigate to the Amazon EMR console, open your virtual cluster, and choose the open icon next to the job to open the Spark UI and monitor the job progress.
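The submit scripts wrap the StartJobRun API. The following boto3 sketch is a hedged approximation of what /tmp/submit-patients-job.sh does; the virtual cluster ID, release label, and S3 paths are placeholders, and the scripts in the repository are authoritative:

# Hedged sketch: submit the Patients Care Team job with team1's execution role.
import boto3

emr = boto3.client("emr-containers", region_name="us-west-2")
response = emr.start_job_run(
    virtualClusterId="<virtual-cluster-id>",
    name="patients-care-team-fgac-job",
    executionRoleArn="arn:aws:iam::<consumer-account-id>:role/emr_on_eks_fgac_job_team1_execution_role",
    releaseLabel="emr-7.7.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://<scripts-bucket>/patients_job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)
print("Job run id:", response["id"])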

Spark UI navigation

Figure 12: EMR on EKS job runs

View PySpark job output on EMR on EKS with FGAC

In Amazon S3, navigate to the Spark output logs folder:

s3://blog-emr-eks-fgac-test--us-west-2-dev/spark-logs//jobs//containers/spark-xxxxxx/spark-xxxxx-driver/stdout.gz

S3 path to view logs

Figure 13: EMR on EKS job's stdout.gz location in the S3 bucket

The Patients Care Team PySpark job has query access to the Patients and Claims tables. The Patients table output has the SSN column filtered out and only shows Texas and New York records, as specified in our FGAC setup.

The following screenshot shows the Claims data for only Texas and New York.

Claims data in consumer view

Figure 14: EMR on EKS Spark job output

The following screenshot shows the Patients table without the SSN column.

Patients data in consumer view

Figure 15: EMR on EKS Spark job output

Similarly, navigate to the Spark output log folder for the Claims Care Team job:

s3://blog-emr-eks-fgac-test--us-west-2-dev/spark-logs//jobs//containers/spark-xxxxxx/spark-xxxxx-driver/stdout.gz

As shown in the following screenshot, the Claims Care Team only has access to the Claims table, so when the job tried to access the Patients table, it received an access denied error.

Access denied for Claims team

Figure 16: EMR on EKS Spark job output

Considerations and limitations

Although the approach discussed in this post provides helpful insights and practical implementation strategies, it's important to understand the key considerations and limitations before you start using this feature. To learn more about using EMR on EKS with Lake Formation, refer to How Amazon EMR on EKS works with AWS Lake Formation.

Clean up

To avoid incurring future charges, delete the generated resources if you no longer need the solution. Run the following cleanup scripts (change the AWS Region if necessary). Run the following script in the consumer account:

export AWS_REGION=us-west-2
export PRODUCER_AWS_ACCOUNT=
export EKSCLUSTER_NAME=fgac-blog
./consumer_clean_up.sh

Run the following script in the producer account:

export AWS_REGION=us-west-2
export PRODUCER_AWS_ACCOUNT=
export CONSUMER_AWS_ACCOUNT=
./producer_clean_up.sh

Conclusion

In this post, we demonstrated how to integrate Lake Formation with EMR on EKS to implement fine-grained access control on Iceberg tables. This integration offers organizations a modern approach to enforcing detailed data permissions within a multi-account open data lake environment. By centralizing data management in a primary account and carefully regulating user access in secondary accounts, this strategy can simplify governance and enhance security.

For more information about Amazon EMR 7.7 in relation to EMR on EKS, see Amazon EMR on EKS 7.7.0 releases. To learn more about using Lake Formation with EMR on EKS, see Enable Lake Formation with Amazon EMR on EKS.

We encourage you to explore this solution for your specific use cases and share your feedback and questions in the comments section.


About the authors

Janakiraman Shanmugam

Janakiraman is a Senior Data Architect at Amazon Web Services. He focuses on data and analytics and enjoys helping customers solve big data and machine learning problems. Outside of the office, he likes to be with his family and friends and spend time outdoors.

Tejal Patel

Tejal is a Senior Delivery Consultant on the AWS Professional Services team, specializing in data analytics and ML solutions. She helps customers design scalable and innovative solutions with the AWS Cloud. Outside of her professional life, Tejal enjoys spending time with her family and friends.

Prabhakaran Thatchinamoorthy

Prabhakaran is a Software Engineer at Amazon Web Services, working on the EMR on EKS service. He focuses on building and operating multi-tenant data processing platforms on Kubernetes at scale. His areas of interest include open source batch and streaming frameworks, data tooling, and DataOps.
