The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that’s 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.
For compliance and security requirements, many customers need to enable Apache Spark’s local storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.
Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, Payment Card Industry Data Security Standard (PCI-DSS) for financial services, General Data Protection Regulation (GDPR) for personal data, and Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark’s I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing. This data never reaches Amazon S3 but might contain sensitive information extracted from source datasets. Typically, encrypted operations require additional computational overhead that can impact overall job performance.
With the built-in encryption optimizations of Amazon EMR 7.9.0, customers might see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.
In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.
Results observed
To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:
- Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
- Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations
Both tests used data stored in Amazon S3. All data processing was configured identically except for the Spark runtime version.
To keep the comparison equal, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This approach eliminates variability from dynamic scaling so we can measure pure computational performance improvements.
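As a rough illustration of the two settings discussed so far, the following sketch builds the corresponding spark-submit --conf flags. The property names spark.io.encryption.enabled and spark.dynamicAllocation.enabled are standard Apache Spark configuration keys; the helper name spark_conf_args is hypothetical, and the exact flags used in the benchmark commands may differ.

```python
def spark_conf_args(encrypt_local_io=True, dynamic_allocation=False):
    """Build spark-submit --conf flags for the settings described above.

    Hypothetical helper for illustration only; it is not part of the
    benchmark scripts.
    """
    conf = {
        # Encrypt shuffle, spill, and cached blocks written to local disk.
        "spark.io.encryption.enabled": str(encrypt_local_io).lower(),
        # Disable DRA so both clusters use a fixed set of executors.
        "spark.dynamicAllocation.enabled": str(dynamic_allocation).lower(),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(spark_conf_args()))
```

Keeping both flags identical across the baseline and optimized clusters means any runtime difference comes from the Spark runtime itself.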
The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between the baseline and Amazon EMR 7.9 optimized configurations:
| Configuration | Total runtime (seconds) | Geometric mean (seconds) | Performance improvement |
| Baseline (Spark 3.5.5 without optimization) | 1,485 | 10.24 | |
| EMR 7.9 (with encryption optimization) | 1,176 | 8.15 | 20% faster |
We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on total runtime and 20% faster based on geometric mean compared to the baseline configuration.
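The percentages can be checked directly from the table values. The following is a minimal sketch of that arithmetic, assuming "faster" means the relative reduction in runtime; the function name percent_faster is illustrative.

```python
def percent_faster(baseline, optimized):
    """Relative reduction in runtime, expressed as a percentage."""
    return (baseline - optimized) / baseline * 100.0


# Values taken from the results table.
total_runtime_gain = percent_faster(1485, 1176)
geomean_gain = percent_faster(10.24, 8.15)

print(f"Total runtime:  {total_runtime_gain:.1f}% faster")
print(f"Geometric mean: {geomean_gain:.1f}% faster")
```

Both values come out slightly above 20%, which is consistent with the rounded figure reported in the table.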
The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:
- Improved shuffle and decryption operations, reducing overhead during data exchange without compromising security
- Better memory management for intermediate results
Cost analysis
The performance improvements of the Amazon EMR 7.9 optimized Spark runtime directly translate to lower costs. We realized approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, because of reduced hours of EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Block Store (Amazon EBS) using General Purpose SSD (gp2).
The following table summarizes the cost comparison in the us-east-1 AWS Region:
| Configuration | Runtime (hours) | Estimated cost | Total EC2 instances | Total vCPU | Total memory (GiB) | Root device (EBS) |
| Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes | 0.41 | $5.28 | 9 | 144 | 1152 | 64 GiB gp2 |
| Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes | 0.33 | $4.25 | 9 | 144 | 1152 | 64 GiB gp2 |
Cost breakdown
Formulas used:
- Amazon EMR cost – Number of instances × EMR hourly rate × Runtime hours
- Amazon EC2 cost – Number of instances × EC2 hourly rate × Runtime hours
- Amazon EBS cost – (EBS cost per GB per month ÷ hours in a month) × EBS volume size × Number of instances × Runtime hours
Note: EBS is priced monthly ($0.10 per GB per month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.
Baseline configuration (0.41 hours):
- Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
- Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
- Amazon EBS cost – $0.10/730 × 64 × 9 × 0.41 = $0.032
- Total cost – $5.28
EMR 7.9 optimized configuration (0.33 hours):
- Amazon EMR cost – 9 × $0.27 × 0.33 = $0.80
- Amazon EC2 cost – 9 × $1.152 × 0.33 = $3.42
- Amazon EBS cost – $0.10/730 × 64 × 9 × 0.33 = $0.026
- Total cost – $4.25
Total cost savings: approximately 20% per benchmark run, which scales linearly with your production workload frequency.
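The breakdown above can be reproduced with a short script. This is a sketch that applies the three formulas with the rates quoted in this post (us-east-1, r5d.4xlarge); the function name run_cost and its parameter names are illustrative, and current AWS pricing may differ.

```python
HOURS_PER_MONTH = 730  # used to convert monthly EBS pricing to hourly


def run_cost(runtime_hours, instances=9, emr_rate=0.27, ec2_rate=1.152,
             ebs_gb_month=0.10, ebs_volume_gb=64):
    """Estimate the cost of one benchmark run using the formulas above."""
    emr = instances * emr_rate * runtime_hours
    ec2 = instances * ec2_rate * runtime_hours
    ebs = (ebs_gb_month / HOURS_PER_MONTH) * ebs_volume_gb * instances * runtime_hours
    return {"emr": emr, "ec2": ec2, "ebs": ebs, "total": emr + ec2 + ebs}


baseline = run_cost(0.41)
optimized = run_cost(0.33)
savings_pct = (baseline["total"] - optimized["total"]) / baseline["total"] * 100

print(f"Baseline total:  ${baseline['total']:.2f}")
print(f"Optimized total: ${optimized['total']:.2f}")
print(f"Savings:         {savings_pct:.1f}%")
```

Because every cost term is proportional to runtime hours, the savings percentage equals the relative reduction in runtime, regardless of the hourly rates.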
Set up EMR benchmarking
For detailed instructions and scripts, see the companion GitHub repository.
Prerequisites
To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:
- Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account.
- Create an S3 bucket for test data and results.
- Copy the TPC-DS 3TB source data from a publicly available dataset to your S3 bucket using the following command:
  Replace the bucket placeholder with the name of the S3 bucket you created in the previous step.
- Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar).
- Ensure you have appropriate AWS Identity and Access Management (IAM) roles for EMR cluster creation and Amazon S3 access.
Deploy the baseline EMR cluster (without optimization)
Step 1: Launch EMR 7.9.0 cluster with bootstrap action
The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We’ve made the bootstrap script publicly available in an S3 bucket for your convenience.
Create the default Amazon EMR roles:
Now create the cluster:
Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.
Step 2: Submit the benchmark job to the baseline cluster
Next, submit the Spark job using the following commands:
Deploy the optimized EMR cluster (with encryption optimization)
Step 1: Launch EMR 7.9.0 cluster with Spark runtime
The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:
Example:
Step 2: Submit the benchmark job to the optimized cluster
Next, submit the Spark job using the following commands:
Benchmark command parameters explained
The Amazon EMR Spark step uses the following parameters:
- EMR step configuration:
  - Type=Spark: Specifies that this is a Spark application step
  - Name="EMR-7.9-Baseline-Spark-3.5.5": Human-readable name for the step
  - ActionOnFailure=CONTINUE: Continue with other steps if this one fails
- Spark submit arguments:
  - --deploy-mode client: Run the driver on the primary node (not cluster mode)
  - --class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
- Application parameters:
  - JAR file: s3:///jar/spark-benchmark-assembly-3.3.0.jar
  - Input data: s3:///blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
  - Output location: s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results)
  - TPC-DS tools path: /opt/tpcds-kit/tools (local path on EMR nodes)
  - Format: parquet (output format)
  - Scale factor: 3000 (3 TB dataset size)
  - Iterations: 3 (run each query 3 times for averaging)
  - Collect results: false (don’t collect results to driver)
  - Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (all 104 TPC-DS queries)
  - Final parameter: true (enable detailed logging and metrics)
- Query coverage:
  - All 104 standard TPC-DS benchmark queries (q1-v2.4 through q99-v2.4)
  - Plus the ss_max-v2.4 query for additional testing
  - Each query runs 3 times to calculate average performance
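To make the parameter list more concrete, the following sketch assembles the arguments into a single spark-submit argument list. It assumes the application parameters are positional and passed in the order listed above, which this post does not confirm. The function name build_spark_submit_args and the bucket and queries parameters are hypothetical; supply your own bucket name and the full comma-separated query list.

```python
def build_spark_submit_args(bucket, queries,
                            result_dir="BASELINE_TPCDS-TEST-3T-RESULT",
                            tools_path="/opt/tpcds-kit/tools"):
    """Assemble spark-submit arguments for the TPC-DS benchmark step.

    Assumption: application parameters are positional and follow the
    order in which they are described in this post.
    """
    jar = f"s3://{bucket}/jar/spark-benchmark-assembly-3.3.0.jar"
    app_args = [
        f"s3://{bucket}/blog/BLOG_TPCDS-TEST-3T-partitioned",  # input data
        f"s3://{bucket}/blog/{result_dir}",                    # output location
        tools_path,   # TPC-DS tools path on the EMR nodes
        "parquet",    # format
        "3000",       # scale factor (3 TB)
        "3",          # iterations per query
        "false",      # collect results to the driver
        queries,      # comma-separated query list
        "true",       # final parameter (detailed logging and metrics)
    ]
    return [
        "spark-submit",
        "--deploy-mode", "client",
        "--class", "com.amazonaws.eks.tpcds.BenchmarkSQL",
        jar,
    ] + app_args


# "example-bucket" and the two-query list are placeholders, not real values.
print(" ".join(build_spark_submit_args("example-bucket", "q1-v2.4,q10-v2.4")))
```

For the optimized cluster, only the output location needs to change so that the two result sets stay separate.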
Summarize the results
- Download the test result files from both output S3 locations:
- The CSV files contain 4 columns (without headers):
  - Query name
  - Median time (seconds)
  - Minimum time (seconds)
  - Maximum time (seconds)
- Calculate performance metrics for comparison:
  - Average time per query: AVERAGE(median, min, max) for each query
  - Total runtime: Sum of all median times
  - Geometric mean: GEOMEAN(average times) across all queries
  - Speedup: Calculate the ratio between baseline and optimized for each query
- Create comparison analysis:
  Speedup = (Baseline Time - Optimized Time) / Baseline Time * 100%
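The metric calculations can be sketched as follows, assuming each result file is a headerless CSV with the four columns described above. The function names are illustrative, and the sample rows are invented placeholder numbers, not benchmark results.

```python
import csv
import io
import statistics


def load_results(csv_text):
    """Parse headerless rows of: query name, median, minimum, maximum (seconds)."""
    results = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        name, median, minimum, maximum = row
        results[name] = (float(median), float(minimum), float(maximum))
    return results


def summarize(results):
    """Total runtime (sum of medians) and geometric mean of per-query averages."""
    averages = [sum(times) / 3 for times in results.values()]
    return {
        "total_runtime": sum(times[0] for times in results.values()),
        "geomean": statistics.geometric_mean(averages),
    }


def improvement_pct(baseline, optimized):
    """Per-query (baseline - optimized) / baseline * 100, based on median times."""
    return {
        query: (baseline[query][0] - optimized[query][0]) / baseline[query][0] * 100
        for query in baseline
        if query in optimized
    }


# Invented sample rows for illustration only.
baseline = load_results("q1-v2.4,10.0,9.0,11.0\nq2-v2.4,20.0,18.0,22.0\n")
optimized = load_results("q1-v2.4,8.0,7.0,9.0\nq2-v2.4,16.0,15.0,17.0\n")

print(summarize(baseline))
print(summarize(optimized))
print(improvement_pct(baseline, optimized))
```

In practice, you would read the downloaded CSV files from disk and pass their contents to load_results.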
Testing configuration details
The following table summarizes the test environment used for this post:
| Parameter | Value |
| EMR release | emr-7.9.0 (both configurations) |
| Baseline Spark version | 3.5.5 (installed through bootstrap action) |
| Baseline bootstrap script | s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public) |
| Optimized Spark version | Amazon EMR Spark runtime |
| Cluster size | 9 nodes (1 primary and 8 core) |
| Instance type | r5d.4xlarge |
| vCPUs per node | 16 |
| Memory per node | 128 GiB |
| Instance storage | 600 GB SSD |
| EBS volume | 64 GiB gp2 (2 volumes per instance) |
| Total vCPUs | 144 (9 × 16) |
| Total memory | 1152 GiB (9 × 128) |
| Dataset | TPC-DS 3TB (Parquet format) |
| Queries | 104 queries (TPC-DS v2.4) |
| Iterations | 3 runs per query |
| DRA | Disabled for consistent benchmarking |
Clean up
To avoid incurring future charges, delete the resources you created:
- Terminate each EMR clusters:
- Delete S3 test results if not needed:
- Remove IAM roles if created specifically for testing
Key findings
- Up to 20% performance improvement using the Amazon EMR 7.9 Spark runtime with no code changes required
- 20% cost savings because of reduced runtime
- Significant gains for shuffle-heavy, join-intensive workloads
- 100% API compatibility with open source Apache Spark
- Simple migration from custom Spark builds to the EMR runtime
- Easy benchmarking using publicly available bootstrap scripts
Conclusion
You can run your Apache Spark workloads up to 20% faster and at lower cost without making any changes to your applications by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.
To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.
If you’re running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.