
Petabyte-scale data migration made simple: AppsFlyer's best practice journey with Amazon EMR Serverless


This post is co-written with Roy Ninio from AppsFlyer.

Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights lets you respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. Data has transformed how organizations make decisions, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and the pace of innovation. This reliance on specialists and complex setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changing industry demands.

AppsFlyer is a leading analytics and attribution company designed to help businesses measure and optimize their marketing efforts across mobile, web, and connected devices. With a focus on privacy-first innovation, AppsFlyer empowers organizations to make data-driven decisions while respecting user privacy and compliance regulations. AppsFlyer provides tools for tracking user acquisition, engagement, and retention, delivering actionable insights to enhance ROI and streamline marketing strategies.

In this post, we share how AppsFlyer successfully migrated their big data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, the challenges they overcame, and lessons learned that can help guide other organizations through similar transformations.

Why AppsFlyer embraced a serverless approach for big data

AppsFlyer manages one of the largest-scale data infrastructures in the industry, processing 100 PB of data daily, handling millions of events per second, and running thousands of jobs across nearly 100 self-managed Hadoop clusters. The AppsFlyer architecture comprises many open source data engineering technologies, including but not limited to Apache Spark, Apache Kafka, Apache Iceberg, and Apache Airflow. Although this setup powered operations for years, the growing complexity of scaling resources to meet fluctuating demands, coupled with the operational overhead of maintaining clusters, prompted AppsFlyer to rethink its big data processing strategy.

EMR Serverless is a modern, scalable solution that removes the need for manual cluster management while dynamically adjusting resources to match real-time workload requirements. With EMR Serverless, scaling up or down happens within seconds, minimizing idle time and interruptions like Spot terminations.

This shift has freed engineering teams to focus on innovation, improved resilience and high availability, and future-proofed the architecture to support ever-increasing demands. By paying only for compute and memory resources used during runtime, AppsFlyer also optimized costs and minimized charges for idle resources, marking a significant step forward in efficiency and scalability.

Solution overview

AppsFlyer's previous architecture was built around self-managed Hadoop clusters running on Amazon Elastic Compute Cloud (Amazon EC2) and handled the scale and complexity of the data workflows. Although this setup supported operational needs, it required substantial manual effort to maintain, scale, and optimize.

AppsFlyer orchestrated over 100,000 daily workflows with Airflow, managing both streaming and batch operations. Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while concurrently loading them into BigQuery and Google Cloud Storage to build logical data layers. Batch jobs then processed this raw data, transforming it into actionable datasets for internal teams, dashboards, and analytics workflows. Additionally, some processed outputs were ingested into external data sources, enabling seamless delivery of AppsFlyer insights to customers across the web.

For analytics and fast queries, real-time data streams were ingested into ClickHouse and Druid to power dashboards. Additionally, Iceberg tables were created from Delta Lake raw data and made accessible through Amazon Athena for further data exploration and analytics.

With the migration to EMR Serverless, AppsFlyer replaced its self-managed Hadoop clusters, bringing significant improvements in scalability, cost-efficiency, and operational simplicity.

Spark-based workflows, including streaming and batch jobs, were migrated to EMR Serverless and take advantage of its elasticity, dynamically scaling to meet workload demands.

This transition has significantly reduced operational overhead, removing the need for manual cluster management, so teams can focus more on data processing and less on infrastructure.

The following diagram illustrates the solution architecture.

This post reviews the main challenges and lessons the AppsFlyer team learned from this migration.

Challenges and lessons learned

Migrating a large-scale organization like AppsFlyer, with dozens of teams, from Hadoop to EMR Serverless was a significant challenge, especially because many R&D teams had limited or no prior experience managing infrastructure. To provide a smooth transition, AppsFlyer's Data Infrastructure (DataInfra) team developed a comprehensive migration strategy that empowered the R&D teams to migrate their pipelines seamlessly.

In this section, we discuss how AppsFlyer approached the challenge and achieved success for the entire organization.

Centralized preparation by the DataInfra team

To provide a seamless transition to EMR Serverless, the DataInfra team took the lead in centralizing preparation efforts:

  • Clear ownership – Taking full responsibility for the migration, the team planned, guided, and supported R&D teams throughout the process.
  • Structured migration guide – A detailed, step-by-step guide was created to streamline the transition from Hadoop, breaking down the complexities and making it accessible to teams with limited infrastructure experience.

Building a strong support network

To make sure the R&D teams had the resources they needed, AppsFlyer established a robust support environment:

  • Data community – The primary resource for answering technical questions. It encouraged knowledge sharing across teams and was spearheaded by the DataInfra team.
  • Slack support channel – A dedicated channel where the DataInfra team actively responded to questions and guided teams through the migration process. This real-time support significantly reduced bottlenecks and helped teams resolve issues quickly.

Infrastructure templates with best practices

Recognizing the complexity of the migration, the DataInfra team provided standardized templates to help teams start quickly and efficiently:

  • Infrastructure as code (IaC) templates – They developed Terraform templates with best practices for building applications on EMR Serverless. These templates included code examples and real production workflows already migrated to EMR Serverless. Teams could quickly bootstrap their projects using these ready-made templates (a rough sketch of what such a template provisions follows this list).
  • Cross-account access solutions – Operating across multiple AWS accounts required managing secure access between EMR Serverless accounts (where jobs run) and data storage accounts (where datasets reside). To streamline this, a step-by-step module was developed for setting up cross-account access using AssumeRole permissions. Additionally, a dedicated repository was created so teams can define and automate role and policy creation, providing seamless and scalable access management.
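The Terraform templates themselves aren't reproduced in this post. As a rough illustration of what such a template provisions, the following boto3 sketch creates an EMR Serverless application with the kinds of settings a template would standardize; the name, region, and capacity values are hypothetical:

import boto3

# Hypothetical illustration of what an IaC template provisions: an EMR
# Serverless application with standardized release, capacity, and auto-stop settings.
emr = boto3.client("emr-serverless", region_name="us-east-1")

response = emr.create_application(
    name="team-example-spark-app",   # hypothetical application name
    releaseLabel="emr-7.0.0",        # EMR release the application runs
    type="SPARK",
    maximumCapacity={                # cap aggregate resources for the application
        "cpu": "400 vCPU",
        "memory": "3000 GB",
    },
    autoStopConfiguration={          # stop when idle to avoid charges
        "enabled": True,
        "idleTimeoutMinutes": 15,
    },
)
print(response["applicationId"])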

Airflow integration

As AppsFlyer's primary workflow scheduler, Airflow plays a critical role, making it essential to provide a seamless transition for its users.

AppsFlyer developed a dedicated Airflow operator for executing Spark jobs on EMR Serverless, carefully designed to replicate the functionality of the existing Hadoop-based Spark operator. In addition, a Python package with the relevant operators was made available across all Airflow clusters. This approach minimized code changes, allowing teams to transition seamlessly with minimal modifications.

Solving common permission challenges

To streamline permissions management, AppsFlyer developed targeted solutions for common use cases:

  • Comprehensive documentation – Provided detailed instructions for handling permissions for services like Athena, BigQuery, Vault, Git, Kafka, and many more.
  • Standardized Spark defaults configuration for teams to apply to their applications – Included built-in solutions for collecting lineage from Spark jobs running on EMR Serverless, providing accountability and traceability.

Continuous engagement with R&D teams

To promote progress and maintain alignment across teams, AppsFlyer introduced the following measures:

  • Weekly meetings – Weekly status meetings to review each team's migration efforts. Teams shared updates, challenges, and commitments, fostering transparency and collaboration.
  • Support – Proactive assistance was provided for issues raised during meetings to minimize delays. This made sure the teams stayed on track and had the support they needed to meet their commitments.

By implementing these strategies, AppsFlyer transformed the migration process from a daunting challenge into a structured and well-supported journey. Key outcomes included:

  • Empowered teams – R&D teams with minimal infrastructure experience were able to confidently migrate their pipelines.
  • Standardized practices – Infrastructure templates and predefined solutions provided consistency and best practices across the organization.
  • Reduced downtime – The custom Airflow operator and detailed documentation minimized disruptions to existing workflows.
  • Cross-account compatibility – With seamless cross-account access, teams could run jobs and access data efficiently.
  • Improved collaboration – The data community and Slack support channel fostered a sense of collaboration and shared responsibility across teams.

Migrating an entire organization's data workflows to EMR Serverless is a complex task, but by investing in preparation, templates, and support, AppsFlyer successfully streamlined the process for all R&D teams in the company.

This approach can serve as a model for organizations undertaking similar migrations.

Spark application code management and deployment

For AppsFlyer data engineers, developing and deploying Spark applications is a core daily responsibility. The Data Platform team focuses on identifying and implementing the right set of tools and safeguards to not only simplify the migration to EMR Serverless, but also streamline ongoing operations.

There are two different approaches to running Spark code on EMR Serverless: custom container images, or JARs and Python files. At the start of the exploration, custom images seemed promising because they allow greater customization than JARs, which promised the DataInfra team a smoother migration for existing workloads. After deeper evaluation, it became clear that custom images are powerful but carry a cost that needs to be evaluated at large scale. Custom images presented the following challenges:

  • Custom images are supported as of version 6.9.0, but some of AppsFlyer's workloads used earlier versions.
  • EMR Serverless resources run from the moment EMR Serverless starts downloading the image until workers are stopped. This means charges accrue for aggregate vCPU, memory, and storage resources during the image download phase.
  • They required a different continuous integration and delivery (CI/CD) approach than compiling a JAR or Python file, leading to operational work that should be minimized as much as possible.

AppsFlyer decided to go all in with JARs and to allow custom images only in exceptional cases where the required customization demanded them. In the end, using non-custom images proved suitable for AppsFlyer's use cases.

CI/CD perspective

From a CI/CD perspective, AppsFlyer's DataInfra team decided to align with AppsFlyer's GitOps vision, making sure that both infrastructure and application code are version-controlled, built, and deployed using Git operations.

The following diagram illustrates the GitOps approach AppsFlyer adopted.

JARs continuous integration

For CI, the process in charge of building the application artifacts, several options were explored. The following key considerations drove the exploration process:

  • Use Amazon S3 as the native JAR source for EMR Serverless
  • Support different versions of the same job
  • Support staging and production environments
  • Allow hotfixes, patches, and rollbacks

Using AppsFlyer's existing external package repository led to challenges, because it would require building a custom delivery mechanism into Amazon S3 or a complex runtime capability to fetch the code externally.

Using Amazon S3 directly also allowed several alternative approaches:

  • Buckets – Use a single bucket vs. separate buckets for staging and production
  • Versions – Use Amazon S3 native object versioning vs. uploading a new file
  • Hotfix – Overwrite the same job's JAR file vs. uploading a new one

Finally, the decision was to go with immutable builds for consistent deployment across the environments.

Each push to the main branch of a Spark job's Git repository triggers a CI process that validates the semantic versioning (semver) assignment, compiles the JAR artifact, and uploads it to Amazon S3. Each artifact is uploaded to three different paths according to the JAR's version, and also receives a version tag on the S3 object:

  • <bucket>/<job-name>/<major>.<minor>.<patch>/app.jar
  • <bucket>/<job-name>/<major>.<minor>/app.jar
  • <bucket>/<job-name>/<major>/app.jar

AppsFlyer can now work at fine granularity and pin each EMR Serverless job to a precise version. Some jobs can run with the latest major version, while other stability- and SLA-sensitive jobs require a lock to a specific patch version.
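As a hedged illustration of this layout, the following sketch uploads one immutable build to all three version-granularity paths; the bucket, job name, and version-tag key are hypothetical:

import boto3

def publish_jar(local_jar: str, bucket: str, job: str, version: str) -> None:
    """Upload one immutable build to three version-granularity S3 paths."""
    major, minor, patch = version.split(".")
    s3 = boto3.client("s3")
    for prefix in (f"{major}.{minor}.{patch}", f"{major}.{minor}", major):
        s3.upload_file(
            local_jar,
            bucket,
            f"{job}/{prefix}/app.jar",
            ExtraArgs={"Tagging": f"version={version}"},  # version tag on the object
        )

# Hypothetical usage: jobs pinned to 1.4 pick up 1.4.8 automatically,
# while jobs locked to 1.4.7 keep running that exact patch.
publish_jar("target/app.jar", "example-artifacts-bucket", "sessionizer", "1.4.8")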

EMR Serverless continuous deployment

Uploading the files to Amazon S3 is the final step in the CI process, which then hands off to a separate CD process.

CD is done by changing the infrastructure code, which is Terraform based, to point to the new JAR that was uploaded to Amazon S3. The staging or production application then starts using the newly uploaded code, and the process can be considered deployed.

Spark application rollbacks

If an application rollback is needed, AppsFlyer points the EMR Serverless job's IaC configuration from the current impaired JAR version to the previous stable JAR version in the relevant Amazon S3 path.

AppsFlyer believes that every automation impacting production, such as CD, requires a break-glass mechanism for emergencies. In such cases, AppsFlyer can manually override the needed S3 object (JAR file) while still using Amazon S3 versioning, in order to retain visibility and manual version control.
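A minimal sketch of such a break-glass override, assuming a versioned artifacts bucket (the bucket and key are hypothetical): overwrite the JAR in place and rely on S3 object versioning for visibility and manual rollback:

import boto3

s3 = boto3.client("s3")
bucket, key = "example-artifacts-bucket", "sessionizer/1.4/app.jar"  # hypothetical

# Break-glass: overwrite the object directly; bucket versioning keeps the history.
s3.upload_file("hotfix/app.jar", bucket, key)

# Inspect the version history (newest first) to confirm or revert the override.
versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
previous_version = versions[1]["VersionId"]  # assumes at least two versions exist

# Manual rollback: copy the previous version back on top of the key.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key, "VersionId": previous_version},
)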

Single-job vs. multi-job applications

When using EMR Serverless, one important architectural decision is whether to create a separate application for each Spark job or to share an automatically scaling application across multiple Spark jobs. The following table summarizes these considerations.

Aspect | Single-job application | Multi-job application
Logical nature | Dedicated application for each job. | Shared application for multiple jobs.
Shared configurations | Limited; each application is configured independently. | Allows shared configurations through spark-defaults, including executors, memory settings, and JARs.
Isolation | Maximum isolation; each job runs independently. | Maintains job-level isolation through distinct IAM roles despite sharing the application.
Flexibility | Flexible for unique configurations or resource requirements. | Reduces overhead by reusing configurations and using automatic scaling.
Overhead | Higher setup and management overhead due to multiple applications. | Lower administrative overhead, but requires careful management of resource contention.
Use cases | Suitable for jobs with unique requirements or strict isolation needs. | Ideal for related workloads that benefit from shared settings and dynamic scaling.

By balancing these considerations, AppsFlyer tailored its EMR Serverless usage to efficiently meet the demands of diverse Spark workloads across its teams.
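For example, with a shared (multi-job) application, each job still runs under its own execution role. The following boto3 sketch illustrates that isolation; the application ID, bucket, classes, and role ARNs are hypothetical:

import boto3

emr = boto3.client("emr-serverless")

def run_job(jar_s3_path: str, main_class: str, role_arn: str) -> str:
    """Submit a Spark job to a shared EMR Serverless application."""
    response = emr.start_job_run(
        applicationId="00example123",   # shared application, hypothetical ID
        executionRoleArn=role_arn,      # per-job IAM role preserves isolation
        jobDriver={
            "sparkSubmit": {
                "entryPoint": jar_s3_path,
                "sparkSubmitParameters": f"--class {main_class}",
            }
        },
    )
    return response["jobRunId"]

# Two teams share one application but keep distinct execution roles.
run_job("s3://example-artifacts/jobA/1.2/app.jar", "com.example.JobA",
        "arn:aws:iam::111111111111:role/team-a-execution-role")
run_job("s3://example-artifacts/jobB/2.0/app.jar", "com.example.JobB",
        "arn:aws:iam::111111111111:role/team-b-execution-role")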

Airflow operator: Simplifying the transition to EMR Serverless

Before the migration to EMR Serverless, AppsFlyer's teams relied on a custom Airflow Spark operator created by the DataInfra team.

This operator, packaged as a Python library, was integrated into the Airflow environment and became a key component of the data workflows.

It provided essential capabilities, including:

  • Retries and alerts – Built-in retry logic and PagerDuty alert integration
  • AWS role-based access – Automatic fetching of AWS permissions based on role names
  • Custom defaults – Spark configurations and package defaults tailored for each job
  • State management – Job state tracking

This operator streamlined running Spark jobs on Hadoop and was highly tailored to AppsFlyer's requirements.

When moving to EMR Serverless, the team chose to build a custom Airflow operator to align with their existing Spark-based workflows. They already had dozens of Directed Acyclic Graphs (DAGs) in production, so this approach let them keep their familiar interface, including custom handling for retries, alerting, and configurations, without requiring broad changes across the board.

This abstraction provided a smoother migration by preserving the same development patterns and minimizing the effort of adapting to the native operator's semantics.

The DataInfra team developed a dedicated, custom EMR Serverless operator to support the following goals:

  • Seamless migration – The operator was designed to closely mimic the interface of the existing Spark operator on Hadoop. This made sure that teams could migrate with minimal code changes.
  • Feature parity – They added the features missing from the native operator:
    • Built-in retry logic.
    • PagerDuty integration for alerts.
    • Automatic role-based permission fetching.
    • Default Spark configurations and package support for each job.
  • Simplified integration – It's packaged as a Python library available in Airflow clusters. Teams could use the operator just as they did the previous Spark operator.

The custom operator abstracts some of the underlying configuration required to submit jobs to EMR Serverless, aligning with AppsFlyer's internal best practices and adding essential features.

The following is from an example DAG using the operator:

return SparkBatchJobEmrServerlessOperator(
    task_id=task_id,                       # Unique task identifier in the DAG
    jar_file=jar_file,                     # Path to the Spark job JAR file on S3
    main_class="",                         # Main class of the Spark job
    spark_conf=spark_conf,
    app_id=default_args[""],               # EMR Serverless application ID
    execution_role=default_args[""],       # IAM role for job execution
    polling_interval_sec=120,              # How often to poll for job status
    execution_timeout=timedelta(hours=1),  # Max allowed runtime
    retries=5,                             # Retry attempts for failed jobs
    app_args=[],                           # Arguments to pass to the Spark job
    depends_on_past=True,                  # Ensure sequential task execution
    tags={'owner': ''},                    # Metadata for ownership
    aws_assume_role="",                    # Role for cross-account access
    alerting_policy=ALERT_POLICY_CRITICAL.with_slack_channel(sc),  # Alerting integration
    owner="",
    dag=dag,                               # DAG this task belongs to
)

Cross-account permissions on AWS: Simplifying EMR Serverless workflows

AppsFlyer operates across multiple AWS accounts, creating a need for secure and efficient cross-account access. EMR Serverless jobs are executed in the production account, while the data they process resides in a separate data account. To enable seamless operation, AssumeRole permissions are used to make sure that EMR Serverless jobs running in the production account can access the data and services in the data account. The following diagram illustrates the cross-account permissions architecture AppsFlyer adopted.

Role management strategy

To manage cross-account access efficiently, three distinct roles were created and maintained:

  • EMR role – Used for executing and managing EMR Serverless applications in the production account. Integrated directly into Airflow workers to make it accessible to the DAGs on each team's dedicated Airflow cluster.
  • Execution role – Assigned to the Spark job running on EMR Serverless. Passed by the EMR role in the DAG code to provide seamless integration.
  • Data role – Resides in the data account and is assumed by the execution role to access data stored in Amazon S3 and other AWS services.

To enforce access boundaries, each role and policy is tagged with team-specific identifiers. This makes sure that teams can only access their own data and roles, minimizing unauthorized access to other teams' resources.

Simplifying Airflow migration

A streamlined process was developed to make cross-account permissions transparent for teams migrating their workloads to EMR Serverless:

  1. The EMR role is embedded into Airflow workers, making it accessible to DAGs in each team's dedicated Airflow cluster:
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::account-id:role/execution-role",
         "Condition":{
            "StringEquals":{
               "iam:ResourceTag/Team":"team-tag"
            }
         }
      }
   ]
}

  2. The EMR role automatically passes the execution role to the job within the DAG code:
{
  "Model": "2012-10-17",
  "Assertion": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::data-account-id:role/data-role",
      "Condition": {
        "StringEquals": {
          "iam:ResourceTag/Team": "team-tag"
        }
      }
    }
  ]
}

  3. The execution role assumes the data role dynamically during job execution to access the required data and services in the data account. The following trust policy allows the execution role in the production account to assume the data role:

{
  "Model": "2012-10-17",
  "Assertion": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::production-account-id:role/execution-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

  4. Policies, trust relationships, and role definitions are managed in a dedicated GitLab repository. GitLab CI/CD pipelines automate the creation and integration of roles and policies, providing consistency and reducing manual overhead.
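Inside a job, the final hop of this chain looks roughly like the following sketch; the account ID, role name, and bucket are hypothetical:

import boto3

# Running under the execution role inside the EMR Serverless job, call STS
# to assume the data role that lives in the data account (hypothetical ARN).
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/data-role",
    RoleSessionName="emr-serverless-job",
)["Credentials"]

# Use the data role's temporary credentials to reach the data account.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.list_objects_v2(Bucket="example-data-bucket", Prefix="datasets/")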

Benefits of AppsFlyer's approach

This approach offered the following benefits:

  • Seamless access – Teams no longer have to handle cross-account permissions manually, because these are automated through preconfigured roles and policies, providing seamless and secure access to resources across accounts.
  • Scalable and secure – Role-based and tag-based permissions provide security and scalability across multiple teams and accounts. Using roles and tags avoids the need to create separate hardcoded policies for each team or account. Instead, generalized policies scale automatically as new resources, accounts, or teams are added.
  • Automated management – GitLab CI/CD streamlines the deployment and integration of policies and roles, reducing manual effort while improving consistency. It also minimizes human error, improves change transparency, and simplifies version management.
  • Flexibility for teams – Teams have the flexibility to use their own or the native EMR Serverless operators while maintaining secure access to data.

By implementing a robust, automated cross-account permissions system, AppsFlyer has enabled secure and efficient access to data and services across multiple AWS accounts. This makes sure that teams can focus on their workloads without worrying about infrastructure complexities, accelerating their migration to EMR Serverless.

Integrating lineage into EMR Serverless

AppsFlyer developed a robust solution for column-level lineage collection to provide comprehensive visibility into data transformations across pipelines. Lineage data is stored in Amazon S3 and subsequently ingested into DataHub, AppsFlyer's lineage and metadata management environment.

Today, AppsFlyer collects column-level lineage from a variety of sources, including Amazon Athena, BigQuery, Spark, and more.

This section focuses on how AppsFlyer collects Spark column-level lineage specifically within the EMR Serverless infrastructure.

Collecting Spark lineage with Spline

To capture lineage from Spark jobs, AppsFlyer uses Spline, an open source tool designed for automated tracking of data lineage and pipeline structures.

AppsFlyer modified Spline's default behavior to output a customized Spline object that aligns with AppsFlyer's specific requirements, and adapted the Spline integration to both legacy and modern environments. In the pre-migration phase, they injected the Spline agent into Spark jobs through their customized Airflow Spark operator. In the post-migration phase, they integrated Spline directly into EMR Serverless applications.

The lineage workflow consists of the following steps:

  1. As Spark jobs execute, Spline captures detailed metadata about the queries and transformations performed.
  2. The captured metadata is exported as Spline object files to a dedicated S3 bucket.
  3. These Spline objects are processed into column-level lineage objects customized to fit AppsFlyer's data architecture and requirements.
  4. The processed lineage data is ingested into DataHub, providing a centralized and interactive view of data dependencies.

The following figure shows an example lineage diagram from DataHub.

Challenges and how AppsFlyer addressed them

AppsFlyer encountered the following challenges:

  • Supporting different EMR Serverless applications – Each EMR Serverless application has its own Spark and Scala version requirements.
  • Varying operator usage – Teams sometimes use custom or native EMR Serverless operators, making uniform Spline integration challenging.
  • Confirming universal adoption – They need to make sure Spark jobs across multiple accounts use the Spline agent for lineage tracking.

AppsFlyer addressed these challenges with the following solutions:

  • Version-specific Spline agents – AppsFlyer created a dedicated Spline agent for each EMR Serverless application version to match its Spark and Scala versions; for example, EMR Serverless application version 7.0.1 pairs with Spline 7.0.1.
  • Spark defaults integration – They integrated the Spline agent into each EMR Serverless application's Spark defaults so that lineage is collected for every job executed on the application, with no job-specific changes needed.
  • Automation for compliance – This process consists of the following steps (a sketch of such a check follows this list):
    • Detect newly created EMR Serverless applications across accounts.
    • Verify that Spline is properly defined in the application's Spark defaults.
    • Send a PagerDuty alert to the responsible team if misconfigurations are detected.
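The following is a hedged sketch of what such a compliance check might look like with boto3, assuming the application's default Spark configuration is visible through get_application as runtimeConfiguration. The listener property value is Spline's real listener class, while the alerting step is a stand-in for the PagerDuty call:

import boto3

SPLINE_LISTENER = "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener"

def find_misconfigured_applications() -> list:
    """Return EMR Serverless applications whose spark-defaults lack the Spline agent."""
    emr = boto3.client("emr-serverless")
    missing = []
    for app in emr.list_applications()["applications"]:
        detail = emr.get_application(applicationId=app["id"])["application"]
        defaults = {}
        for conf in detail.get("runtimeConfiguration", []):
            if conf["classification"] == "spark-defaults":
                defaults = conf.get("properties", {})
        if defaults.get("spark.sql.queryExecutionListeners") != SPLINE_LISTENER:
            missing.append(app["id"])
    return missing

for app_id in find_misconfigured_applications():
    print(f"ALERT: Spline missing on application {app_id}")  # stand-in for PagerDuty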

Example integration with Terraform

To automate Spline integration, AppsFlyer used Terraform and local-exec to define Spark defaults for EMR Serverless applications. With Amazon EMR, you can set unified Spark configuration properties through spark-defaults, which are then applied to Spark jobs.

This configuration makes sure the Spline agent is automatically applied to every Spark job without requiring changes to the Airflow operator or the job itself.
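AppsFlyer's Terraform code isn't shown in the post. The following boto3 sketch approximates what the local-exec step applies, assuming the runtimeConfiguration parameter of update_application; the application ID and the agent JAR location are hypothetical, and the listener class is Spline's:

import boto3

emr = boto3.client("emr-serverless")

# Approximation of the Terraform local-exec step: set application-level
# spark-defaults so every job on the application picks up the Spline agent.
emr.update_application(
    applicationId="00example123",  # hypothetical application ID
    runtimeConfiguration=[
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.sql.queryExecutionListeners":
                    "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener",
                "spark.jars": "s3://example-plugins/spline-agent-7.0.1.jar",  # hypothetical path
            },
        }
    ],
)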

This robust lineage integration provides the following benefits:

  • Full visibility – Automatic lineage tracking provides detailed insights into data transformations
  • Seamless scalability – Version-specific Spline agents provide compatibility across EMR Serverless applications
  • Proactive monitoring – Automated compliance checks verify that lineage tracking is consistently enabled across accounts
  • Enhanced governance – Ingesting lineage data into DataHub provides traceability, supports audits, and fosters a deeper understanding of data dependencies

By integrating Spline with EMR Serverless applications, AppsFlyer has achieved comprehensive and automated lineage tracking, so teams can understand their data pipelines better while meeting compliance requirements. This scalable approach aligns with AppsFlyer's commitment to maintaining transparency and reliability throughout their data landscape.

Monitoring and observability

When embarking on a large migration, and as a day-to-day best practice, monitoring and observability are key to running workloads successfully in terms of stability, debugging, and cost.

AppsFlyer's DataInfra team set several KPIs for monitoring and observability in EMR Serverless:

  • Monitor infrastructure-level metrics and logs:
    • EMR Serverless resource usage, including cost
    • EMR Serverless API usage
  • Monitor Spark application-level metrics and logs:
    • stdout and stderr logs
    • Spark engine metrics
  • Centralized observability in the existing monitoring environment, Datadog

Metrics

Using EMR Serverless native metrics, AppsFlyer's DataInfra team set up several dashboards to help monitor both the migration and the day-to-day usage of EMR Serverless across the company. The following are the main metrics that were monitored (a sketch of querying them follows the list):

  • Service quota usage metrics:
    • vCPU usage tracking (ResourceCount with the vCPU dimension)
    • API usage tracking (actual API usage vs. API limits)
  • Application status metrics:
    • RunningJobs, SuccessJobs, FailedJobs, PendingJobs, CancelledJobs
  • Resource limits monitoring:
    • MaxCPUAllowed vs. CPUAllocated
    • MaxMemoryAllowed vs. MemoryAllocated
    • MaxStorageAllowed vs. StorageAllocated
  • Worker-level metrics:
    • WorkerCpuAllocated vs. WorkerCpuUsed
    • WorkerMemoryAllocated vs. WorkerMemoryUsed
    • WorkerEphemeralStorageAllocated vs. WorkerEphemeralStorageUsed
  • Capacity allocation monitoring:
    • Metrics filtered by CapacityAllocationType (PreInitCapacity vs. OnDemandCapacity)
    • ResourceCount
  • Worker type distribution:
    • Metrics filtered by WorkerType (SPARK_DRIVER vs. SPARK_EXECUTORS)
  • Job success rates over time:
    • SuccessJobs vs. FailedJobs ratio
    • SubmittedJobs vs. PendingJobs
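For instance, the following boto3 sketch pulls a daily job-success ratio from CloudWatch, assuming the AWS/EMRServerless namespace and ApplicationId dimension that EMR Serverless publishes to; the application ID is hypothetical:

from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")

def job_count(app_id: str, metric: str) -> float:
    """Sum a job-status metric for one application over the last day."""
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/EMRServerless",
        MetricName=metric,  # e.g., SuccessJobs, FailedJobs
        Dimensions=[{"Name": "ApplicationId", "Value": app_id}],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=86400,
        Statistics=["Sum"],
    )["Datapoints"]
    return sum(dp["Sum"] for dp in datapoints)

app = "00example123"  # hypothetical application ID
success, failed = job_count(app, "SuccessJobs"), job_count(app, "FailedJobs")
print(f"Success ratio: {success / max(success + failed, 1):.2%}")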

The following screenshot shows an example of the tracked metrics.

Logs

For log management, AppsFlyer's DataInfra team explored several options.

Streamlining EMR Serverless log shipping to Datadog

Because AppsFlyer decided to keep their logs in an external logging environment, the DataInfra team aimed to reduce the number of components involved in the shipping process and minimize maintenance overhead. Instead of managing a Lambda-based log shipper, they developed a custom Spark plugin that seamlessly exports logs from EMR Serverless to Datadog.

Companies already storing logs in Amazon S3 or Amazon CloudWatch Logs can take advantage of EMR Serverless native support for these destinations. However, for teams needing a direct, real-time integration with Datadog, this approach removes the need for extra infrastructure, providing a more efficient and maintainable logging solution.

The custom Spark plugin offers the following capabilities:

  • Automated log export – Streams logs from EMR Serverless to Datadog
  • Fewer extra components – Removes the need for Lambda-based log shippers
  • Secure API key management – Uses Vault instead of hardcoding credentials
  • Customizable logging – Supports custom Log4j settings and log levels
  • Full integration with Spark – Works on both driver and executor nodes

How the plugin works

In this section, we walk through the plugin's components and provide a pseudocode overview:

  • Driver plugin – LoggerDriverPlugin runs on the Spark driver to configure logging. The plugin fetches EMR job metadata, calls Vault to retrieve the Datadog API key, and configures logging settings:
initialize() {
  if (user provided log4j.xml) {
     use the custom log configuration
  } else {
     fetch EMR job metadata (application name, job ID, tags)
     retrieve the Datadog API key from Vault
     apply default logging settings
  }
}

  • Executor plugin – LoggerExecutorPlugin provides consistent logging across executor nodes. It inherits the driver's log configuration and makes sure the executors log consistently:
initialize() {
   fetch logging config from the driver
   apply log settings (log4j, log levels)
}

  • Main plugin – LoggerSparkPlugin registers the driver and executor plugins with Spark. It serves as the entry point for Spark and applies custom logging settings dynamically:
function registerPlugin() {
  return (driverPlugin, executorPlugin);
}

loginToVault(role, vaultAddress) {
    create AWS signed request
    authenticate with Vault
    return vault token
}

getDatadogApiKey(vaultToken, secretPath) {
    fetch API key from Vault
    return key
}

Set up the plugin

To set up the plugin, complete the following steps:

  1. Add the following dependency to your project:

<dependency>
  <groupId>com.AppsFlyer.datacom</groupId>
  <artifactId>emr-serverless-logger-plugin</artifactId>
  <version>...</version>
</dependency>

  2. Configure the Spark plugin. The following configuration enables the custom Spark plugin and assigns the Vault role used to access the Datadog API key:

--conf "spark.plugins=com.AppsFlyer.datacom.emr.plugin.LoggerSparkPlugin"

--conf "spark.datacom.emr.plugin.vaultAuthRole=your_vault_role"

  3. Use a custom or default Log4j configuration:

--conf "spark.datacom.emr.plugin.location=classpath:my_custom_log4j.xml"

  4. Set environment variables for different log levels. This adjusts the logging for specific packages:

--conf "spark.emr-serverless.driverEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.executorEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.emr-serverless.driverEnv.LOG_LEVEL=DEBUG"

--conf "spark.executorEnv.LOG_LEVEL=DEBUG"

  5. Configure the Vault and Datadog API key and verify secure Datadog API key retrieval.

By adopting this plugin, AppsFlyer was able to significantly simplify log shipping, reducing the number of moving parts while maintaining real-time log visibility in Datadog. This approach provides reliability, security, and ease of maintenance, making it a good fit for teams using EMR Serverless with Datadog.

Summary

Through their migration to EMR Serverless, AppsFlyer achieved a significant transformation in team autonomy and operational efficiency. Individual teams now have greater freedom to choose and build their own resources without relying on a central infrastructure team, and can work more independently and innovatively. Minimizing the Spot interruptions that were frequent in their previous self-managed Hadoop clusters has significantly improved the stability and agility of their operations. Thanks to this autonomy and reliability, combined with the automatic scaling capabilities of EMR Serverless, the AppsFlyer teams can focus more on data processing and innovation rather than infrastructure management. The result is a more efficient, flexible, and self-sufficient development environment where teams can better respond to their specific needs while maintaining high performance standards.

Ruli Weisbach, AppsFlyer EVP of R&D, says,

“EMR Serverless is a game changer for AppsFlyer; we are able to significantly reduce our costs with remarkably lower management overhead and maximal elasticity.”

If the AppsFlyer approach sparked your interest and you are considering a similar solution for your organization, refer to the Amazon EMR Serverless documentation.

Migrating to EMR Serverless can transform your organization's data processing capabilities, offering a fully managed, cloud-based experience that automatically scales resources and eases the operational complexity of traditional cluster management, while enabling advanced analytics and machine learning workloads with greater cost-efficiency.


About the authors

Roy Ninio is an AI Platform Lead with deep expertise in scalable data platforms and cloud-native architectures. At AppsFlyer, Roy led the design of a high-performance data lake handling petabytes of daily events, drove the adoption of EMR Serverless for dynamic big data processing, and architected lineage and governance systems across platforms.

Avichay Marciano is a Sr. Analytics Solutions Architect at Amazon Web Services. He has over a decade of experience in building large-scale data platforms using Apache Spark, modern data lake architectures, and OpenSearch. He is passionate about data-intensive systems, analytics at scale, and their intersection with machine learning.

Eitav Arditti is an AWS Senior Solutions Architect with 15 years in the AdTech industry, specializing in serverless, containers, platform engineering, and edge technologies. He designs cost-efficient, large-scale AWS architectures that use cloud-native and edge computing to deliver scalable, reliable solutions for business growth.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist, helping customers design scalable, open data lakehouse architectures and adopt modern analytics solutions across industries.
