When processing data at scale, many organizations use Apache Spark on Amazon EMR to run shared clusters that handle workloads across tenants, business units, or classification levels. In such multi-tenant environments, different datasets often require distinct AWS Key Management Service (AWS KMS) keys to enforce strict access controls and meet compliance requirements. At the same time, operational efficiency might drive these organizations to consolidate their data pipelines. Instead of running separate Spark jobs for each dataset, it can be more efficient to run a single job on Amazon EMR that processes inputs once and writes multiple outputs to Amazon Simple Storage Service (Amazon S3), each encrypted with its own KMS key.
Although consolidating multiple datasets in a single Spark job reduces orchestration overhead and simplifies code maintenance, you might encounter challenges with encryption configurations. By default, the EMRFS and S3A file system clients cache their settings, which can cause encryption keys to persist incorrectly across writes. This means if you change the encryption key between writes to Amazon S3 in an Apache Hadoop environment, some output files can end up encrypted with unintended keys, leading to potential security and compliance issues.
In this post, we show how to securely write data to Amazon S3 from Spark jobs running on Amazon EMR, while dynamically managing different KMS keys for encryption. We discuss three approaches to solve this problem and how to choose the right solution for your use case.
Amazon S3 server-side encryption options with Amazon EMR
When writing data to Amazon S3 from Amazon EMR, you can choose from several server-side encryption options. The two most commonly used options are:
- Server-side encryption with Amazon S3 managed keys (SSE-S3) – Amazon S3 manages the encryption keys for you
- Server-side encryption with KMS keys (SSE-KMS) – AWS KMS manages the keys, and you can use custom KMS keys with fine-grained access control
When running Spark jobs on Amazon EMR, data writes to Amazon S3 occur through one of the following file system implementations:
- EMRFS – The default implementation for Amazon EMR versions below 7.10.0
- S3A – The default implementation starting from Amazon EMR 7.10.0
Both implementations provide configuration properties to control server-side encryption. The following tables show how to specify the KMS key for SSE-KMS encryption.
For EMRFS (default in Amazon EMR versions below 7.10.0), refer to the following table.
| Property | Description |
| --- | --- |
| `fs.s3.enableServerSideEncryption` | Enables server-side encryption. Defaults to SSE-S3 if no KMS key is provided. |
| `fs.s3.serverSideEncryption.kms.keyId` | Specifies the KMS key ID or ARN for SSE-KMS encryption. |
For S3A (default in Amazon EMR starting from 7.10.0), refer to the following table.
| Property | Description |
| --- | --- |
| `fs.s3a.encryption.algorithm` | Specifies the encryption algorithm. |
| `fs.s3a.encryption.key` | Specifies the KMS key ID or ARN for SSE-KMS encryption. |
Starting from the Amazon EMR 7.10.0 release, the S3A file system has replaced EMRFS as the default EMR S3 connector. For more information, refer to Migration Guide: EMRFS to S3A Filesystem.
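For example, you can pass these properties when submitting a job. The following spark-submit sketches use a placeholder key ARN and script name; adjust them to your environment.

```bash
# EMRFS (s3:// URIs, Amazon EMR versions below 7.10.0)
spark-submit \
  --conf spark.hadoop.fs.s3.enableServerSideEncryption=true \
  --conf spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab \
  my_spark_job.py

# S3A (s3a:// URIs, Amazon EMR 7.10.0 and later)
spark-submit \
  --conf spark.hadoop.fs.s3a.encryption.algorithm=SSE-KMS \
  --conf spark.hadoop.fs.s3a.encryption.key=arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab \
  my_spark_job.py
```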
Challenges in preventing encryption key reuse due to file system caching
In practice, a unified Spark job might write outputs for multiple tenants or classifications in a single run. In this scenario, applying the correct encryption key to each output is critical for maintaining compliance and enforcing isolation in multi-tenant S3 buckets without the complexity of managing separate Spark jobs for each dataset.
When Spark executors write to Amazon S3, they use a file system (EMRFS or S3A) client that's cached and reused for performance optimization. The problem is that each file system instance retains the encryption settings it was first created with. Each executor's Java Virtual Machine (JVM) creates and caches a file system (and its underlying S3 client) for a given S3 bucket. This cached instance, along with its encryption configuration, persists throughout the executor's lifecycle. If you change the encryption key in Spark after some data has been written, the existing cached client doesn't pick up the new key.
For example, the following PySpark code first creates a Spark session with S3 server-side encryption enabled. It then writes a DataFrame to two different folders within the same S3 bucket amzn-s3-demo-bucket1, but with different KMS keys. The first write operation writes the data to folder1 using kmsKey1 for encryption, and the second write operation writes to folder2 using kmsKey2.
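The pattern looks like the following minimal sketch; the key ARNs are placeholders.

```python
from pyspark.sql import SparkSession

# Placeholder KMS key ARNs; replace with your own
kmsKey1 = "arn:aws:kms:us-east-1:111122223333:key/1111aaaa-11aa-11aa-11aa-111111111111"
kmsKey2 = "arn:aws:kms:us-east-1:111122223333:key/2222bbbb-22bb-22bb-22bb-222222222222"

# Create a Spark session with SSE-KMS enabled and the first key configured
spark = (
    SparkSession.builder.appName("multi-key-write")
    .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
    .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kmsKey1)
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# First write: intended to be encrypted with kmsKey1
df.write.mode("overwrite").parquet("s3://amzn-s3-demo-bucket1/folder1/")

# Switch the key in the Hadoop configuration before the second write
spark._jsc.hadoopConfiguration().set(
    "fs.s3.serverSideEncryption.kms.keyId", kmsKey2
)

# Second write: intended to be encrypted with kmsKey2, but the cached
# file system client can still apply kmsKey1
df.write.mode("overwrite").parquet("s3://amzn-s3-demo-bucket1/folder2/")
```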
You might expect files in folder1/ to use kmsKey1 and files in folder2/ to use kmsKey2. But due to caching, the second write can still use the client configured with kmsKey1. This leads to mixed or incorrect encryption key usage across outputs.
Solution overview
Our objective is to make sure each output S3 object is encrypted with its intended KMS key, even when a single Spark job writes multiple outputs. To accomplish this, you can use one of the following approaches:
- Disable the file system cache – Turn off S3 client caching so a new client is created for each write, picking up the current key
- Separate Spark applications or sessions – Run a separate Spark application (or session) for each distinct encryption key, so each client is initialized fresh
- Use S3 bucket default encryption – Configure bucket-level SSE-KMS with the desired key so Amazon S3 automatically applies the correct encryption key
Each method offers a different balance of implementation complexity, performance, and flexibility. The following sections provide detailed implementation steps and considerations for each approach.
Method 1: Disable the file system cache
Disabling the file system cache forces Spark to create a new S3 client for each write, which applies the updated encryption settings. You can do this using a Spark configuration or EMR cluster settings.
The property name for disabling the cache depends on your URI scheme (s3:// or s3a://), not on your choice of file system (EMRFS or S3A). The following table summarizes which configuration property name you should use to disable the cache.
| Properties for s3:// URI scheme | Properties for s3a:// URI scheme |
| --- | --- |
| `fs.s3.impl.disable.cache` (Hadoop configuration) | `fs.s3a.impl.disable.cache` (Hadoop configuration) |
| `spark.hadoop.fs.s3.impl.disable.cache` (Spark configuration) | `spark.hadoop.fs.s3a.impl.disable.cache` (Spark configuration) |
To use this method, complete the following steps (configuration sketches follow these steps):
- Disable the file system cache in the Spark configuration. For the s3:// scheme, you can manually set `spark.hadoop.fs.s3.impl.disable.cache=true` in your Spark application (for example, in PySpark), or use a `spark-defaults` configuration classification. For the s3a:// scheme, set `spark.hadoop.fs.s3a.impl.disable.cache=true` in the same way.
- Instead of disabling the cache specifically for Spark applications, you can optionally configure the EMR cluster's `core-site.xml` to disable the file system cache globally at the cluster level. You must configure the `/etc/hadoop/conf/core-site.xml` file on the primary nodes of your EMR cluster, for example through a `core-site` configuration classification when creating or modifying the cluster. Set `fs.s3.impl.disable.cache=true` for the s3:// scheme or `fs.s3a.impl.disable.cache=true` for the s3a:// scheme.
- Enable SSE-KMS encryption. For EMRFS, set `spark.hadoop.fs.s3.enableServerSideEncryption=true` for Spark applications only, or enable encryption at the cluster level through the corresponding `core-site` property. For S3A, set `spark.hadoop.fs.s3a.encryption.algorithm=SSE-KMS` in the same way.
- When using EMRFS with `fs.s3.impl.disable.cache=true`, you must also disable the EMRFS S3-optimized committer to avoid errors. You can do this by either manually setting `spark.sql.parquet.fs.optimized.committer.optimization-enabled=false` or using a `spark-defaults` configuration classification.
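As a concrete illustration, the following PySpark sketch combines these settings for the s3:// (EMRFS) scheme; the key ARNs are placeholders, and the s3a:// variant uses the `fs.s3a.*` properties named above.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cache-disabled-writes")
    # Create a fresh S3 client for each write instead of reusing a cached one
    .config("spark.hadoop.fs.s3.impl.disable.cache", "true")
    # Enable SSE-KMS through EMRFS
    .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
    # Required when disabling the cache with EMRFS
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "false")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
hadoop_conf = spark._jsc.hadoopConfiguration()

# Placeholder key ARNs; with the cache disabled, each write picks up
# the key that is current at write time
hadoop_conf.set(
    "fs.s3.serverSideEncryption.kms.keyId",
    "arn:aws:kms:us-east-1:111122223333:key/1111aaaa-11aa-11aa-11aa-111111111111",
)
df.write.mode("overwrite").parquet("s3://amzn-s3-demo-bucket1/folder1/")

hadoop_conf.set(
    "fs.s3.serverSideEncryption.kms.keyId",
    "arn:aws:kms:us-east-1:111122223333:key/2222bbbb-22bb-22bb-22bb-222222222222",
)
df.write.mode("overwrite").parquet("s3://amzn-s3-demo-bucket1/folder2/")
```

The equivalent `spark-defaults` configuration classification might look like the following sketch; a `core-site` classification setting `fs.s3.impl.disable.cache` (or `fs.s3a.impl.disable.cache`) applies the cache setting cluster-wide instead.

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.hadoop.fs.s3.impl.disable.cache": "true",
      "spark.hadoop.fs.s3.enableServerSideEncryption": "true",
      "spark.sql.parquet.fs.optimized.committer.optimization-enabled": "false"
    }
  }
]
```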
For more information about configuring applications on EMR clusters, refer to Configure applications when you create a cluster and Reconfigure an instance group in a running cluster.
Considerations
Use this method when you need to write data to an S3 bucket using multiple KMS keys within a single Spark application. This is a quick, simple implementation that works well for the following use cases:
- Testing environments and debugging sessions
- Proof-of-concept demonstrations
- Low-volume or one-time jobs where write performance is not critical
- Workloads that frequently change encryption keys to write data to different S3 prefixes within the same bucket
Before implementation, consider the following performance impacts:
- Increased latency for each write
- Additional S3 API operations
- Extra connection overhead
Although this method provides a pragmatic solution when splitting work into separate Spark applications isn't feasible, we don't recommend it for high-throughput or latency-sensitive production workloads. The increased API traffic can lead to higher costs and potential throttling. For production implementations, consider Method 2 and Method 3.
Method 2: Use separate Spark applications or sessions
When writing data with multiple encryption keys, use a separate Spark application (or Spark session) for each distinct key. The file system needs to be initialized with the correct encryption key in a fresh JVM context when writing data with different keys. You can achieve this by either submitting a separate Spark application or starting a new Spark session. This allows the S3 client to be created with the intended encryption key.
Complete the following steps:
- Divide the write tasks by KMS key. For example, prepare separate DataFrames or filter logic for each key.
- Submit separate jobs, choosing either of the following options (see the sketches after these steps):
  - Use spark-submit commands, one per encryption key
  - Use separate Spark sessions in code (for example, in PySpark)
- Use your preferred workflow (such as AWS Step Functions, Apache Airflow, or a wrapper script) to launch the jobs in sequence or in parallel.
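The following sketches illustrate both options under the assumption that each application or session writes outputs for exactly one key; the script names, paths, and key ARNs are placeholders.

```bash
# Option 1: one spark-submit per KMS key
spark-submit \
  --conf spark.hadoop.fs.s3.enableServerSideEncryption=true \
  --conf spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=arn:aws:kms:us-east-1:111122223333:key/1111aaaa-11aa-11aa-11aa-111111111111 \
  write_tenant_a.py

spark-submit \
  --conf spark.hadoop.fs.s3.enableServerSideEncryption=true \
  --conf spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=arn:aws:kms:us-east-1:111122223333:key/2222bbbb-22bb-22bb-22bb-222222222222 \
  write_tenant_b.py
```

```python
# Option 2: one Spark session per KMS key within a driver program
from pyspark.sql import SparkSession

def write_with_key(kms_key_arn, source_path, dest_path):
    # Build a session configured for this key, write, then stop the session
    spark = (
        SparkSession.builder.appName("per-key-write")
        .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
        .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_arn)
        .getOrCreate()
    )
    spark.read.parquet(source_path).write.mode("overwrite").parquet(dest_path)
    spark.stop()

write_with_key(
    "arn:aws:kms:us-east-1:111122223333:key/1111aaaa-11aa-11aa-11aa-111111111111",
    "s3://amzn-s3-demo-bucket1/input/tenant_a/",
    "s3://amzn-s3-demo-bucket1/folder1/",
)
write_with_key(
    "arn:aws:kms:us-east-1:111122223333:key/2222bbbb-22bb-22bb-22bb-222222222222",
    "s3://amzn-s3-demo-bucket1/input/tenant_b/",
    "s3://amzn-s3-demo-bucket1/folder2/",
)
```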
Considerations
Use this method when you need a production-grade solution for applying different KMS keys at scale. This approach maintains the benefits of file system caching and works well for the following use cases:
- High-throughput or latency-sensitive workloads with frequent write operations
- Scenarios requiring strong isolation between different KMS keys
- Multi-tenant environments with separate compliance boundaries
This method creates fresh S3 clients with each new Spark application or session. Compared to Method 1, it offers several advantages:
- Avoids per-write connection and API overhead
- Maintains full compatibility with the EMRFS S3-optimized committer
- Enforces credential boundaries between workloads and improves operational and compliance isolation by assigning each Spark application or session a dedicated key
Before implementation, consider the following trade-offs:
- Requires orchestration of multiple Spark applications, sessions, or clusters
- Involves higher resource overhead
- Increases operational complexity
Choose this method when performance, cost predictability, and security isolation are more important than single-process simplicity.
Method 3: Use S3 bucket default encryption
Where possible, configure S3 bucket-level default encryption (SSE-KMS) with the desired KMS key to automatically encrypt objects written to that bucket.
Complete the following steps:
- On the Amazon S3 console or using the AWS Command Line Interface (AWS CLI), enable default SSE-KMS for the bucket with the desired key (see the sketch after these steps). For instructions on enabling SSE-KMS for S3 buckets, refer to Configuring default encryption.
- With default encryption enabled for the S3 bucket, you can write without specifying a KMS key in Spark. Amazon S3 encrypts each object with the bucket's key. Your Spark code only needs standard write operations.
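For example, the following AWS CLI sketch enables default SSE-KMS on a bucket; the bucket name and key ARN are placeholders.

```bash
aws s3api put-bucket-encryption \
  --bucket amzn-s3-demo-bucket1 \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
      }
    }]
  }'
```

After that, a plain `df.write.parquet("s3://amzn-s3-demo-bucket1/output/")` call needs no encryption-related configuration.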
Considerations
Use this method when your workloads can use a single KMS key per bucket. This approach works well for the following use cases:
- Production environments prioritizing operational simplicity
- Workloads where all data in a bucket shares the same security requirements
- Scenarios where encryption configuration should be managed at the bucket level
- Use cases that map naturally to per-bucket separation
This method provides several advantages:
- Removes the need to configure encryption in Spark applications
- Automatically applies the default KMS key for all writes
- Simplifies encryption management
Before implementation, consider the following limitations:
- All data in a bucket must use the same encryption key
- You can't apply different keys to different prefixes within the same bucket
Choose this method when you need a simple, reliable approach that provides strong security while simplifying operational management.
Choosing the right approach
Choose the approach based on your workload's security requirements, performance needs, and operational constraints:
- Method 1 – Use when you need to apply multiple KMS keys within a single Spark job and can accept some performance impact
- Method 2 – Use for production workloads that require different encryption keys within the same bucket and need optimal performance
- Method 3 – Use when a single KMS key per bucket meets your encryption requirements and you want simplified operations
Conclusion
In this post, we demonstrated how to handle multiple KMS keys when writing to Amazon S3 from Spark jobs on Amazon EMR. When encrypting multiple outputs with different encryption keys in a single Spark application, it's important to account for the file system caching behavior. We presented several practical solutions with their respective trade-offs. You can start implementing these solutions in your environment by first testing the file system cache-disable method, which provides a straightforward approach to handling multiple encryption keys. As your workload grows, consider evolving to separate Spark sessions or S3 bucket default encryption based on your specific requirements. After implementing a solution, verify that each S3 object's SSE-KMS key is the intended one (for example, by checking S3 object metadata). We also recommend measuring job performance and S3 API usage, especially for the cache-disable approach.
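For example, you can inspect an object's encryption metadata with the AWS CLI; the bucket and object key below are placeholders.

```bash
# The response's ServerSideEncryption and SSEKMSKeyId fields show the
# algorithm (aws:kms) and the KMS key that encrypted the object
aws s3api head-object \
  --bucket amzn-s3-demo-bucket1 \
  --key folder1/part-00000-example.snappy.parquet
```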