
How BMW Group built a serverless terabyte-scale data transformation architecture with dbt and Amazon Athena


Companies increasingly require scalable, cost-efficient architectures to process and transform large datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.

This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.

Challenges: Starting from a rigid and costly setup

In our early stages, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant maintenance effort. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this method slowed us down when changing data models or dealing with schema changes, and therefore required high development effort.

As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity: seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn't optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing the query results of computationally intensive transformations. The absence of robust testing and lineage features made it challenging to identify the root causes of data inconsistencies when they occurred.

As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Moreover, during SPICE ingestion, our largest datasets required 4–5 hours of processing time because full scans were performed each time, often scanning over a terabyte of data. This architecture wasn't helping us be more agile and quick while scaling up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.

To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly improved our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.

Solution overview

Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns well with our team's SQL expertise, excels at querying Parquet data directly in our data lake, and removes the need for dedicated compute resources. This makes Athena an ideal fit for CLEA's architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. Athena's ability to efficiently query large-scale Parquet data, combined with its serverless nature, lets us focus on writing efficient transformations rather than managing infrastructure.

The following diagram illustrates the solution architecture.

Using this architecture, we've streamlined our data transformation process with dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view, essentially a building block of our data transformation pipeline. Our implementation consists of around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates, whether creating new models, updating schemas, or modifying views, triggered simply by opening a pull request in our source code repository, with the rest handled automatically.
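As an illustration, a minimal dbt model of this kind might look like the following sketch. The model and column names here are hypothetical, not taken from CLEA's actual project:

```sql
-- models/semantic/monthly_account_cost.sql
-- Hypothetical example of a single dbt model: one SQL transformation
-- that dbt materializes as a table in Athena.
{{ config(materialized='table') }}

select
    account_id,
    date_trunc('month', usage_date) as usage_month,
    sum(unblended_cost)             as total_cost
from {{ ref('prepared_cost_and_usage') }}
group by 1, 2
```

A pull request adding a file like this is all that is needed; the CI/CD workflow runs dbt, which creates the table and registers it in the catalog.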

Our workflow automation includes the following features:

  • Pull request – When we create a pull request, it's deployed to our testing environment first. After passing validation and being approved and merged, it's deployed to production using GitHub workflows. This setup enables seamless model creation, schema updates, or view changes, triggered just by opening a pull request, with the rest handled automatically.
  • Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data). This flexibility optimizes processing time and resource usage. We can target only specific folders, such as the source, prepared, or semantic layers, and run dbt test afterward to validate model quality.
  • On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters such as start and end dates. This makes sure changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
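A backfill of this kind can be sketched with dbt vars; the model, source, and variable names below are illustrative, not CLEA's actual workflow:

```sql
-- models/prepared/prepared_cost_and_usage.sql
-- Hypothetical incremental model that a backfill workflow could re-run
-- for a bounded date range, invoked roughly like:
--   dbt run --select +prepared_cost_and_usage \
--       --vars '{start_date: "2024-01-01", end_date: "2024-02-01"}'
{{ config(materialized='incremental', incremental_strategy='insert_overwrite') }}

select *
from {{ source('raw', 'cost_and_usage') }}
where usage_date >= date('{{ var("start_date") }}')
  and usage_date <  date('{{ var("end_date") }}')
```

Selecting the model with a leading `+` also runs its upstream dependencies, mirroring the "include upstream dependencies" option described above.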

Our pipeline is organized into three main stages, Source, Prepared, and Semantic, each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks such as deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.
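A Prepared-stage cleansing step of the kind described above might look like this sketch (table and column names are invented for illustration):

```sql
-- models/prepared/prepared_usage_events.sql
-- Illustrative Prepared-stage model: type conversions plus deduplication,
-- keeping only the most recently ingested record per business key.
{{ config(materialized='table') }}

with ranked as (
    select
        cast(account_id as varchar) as account_id,
        cast(usage_date as date)    as usage_date,
        cast(cost as double)        as cost,
        row_number() over (
            partition by account_id, usage_date
            order by ingestion_time desc
        ) as rn
    from {{ source('raw', 'usage_events') }}
)
select account_id, usage_date, cost
from ranked
where rn = 1
```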

The Semantic stage (not to be confused with dbt's semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW's Cloud Data Hub, internal CLEA dashboards, data APIs, or the In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only necessary columns, using a column-level lineage solution, and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
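A SPICE-facing view with such a sliding window could be sketched as follows (names and the 13-month window are assumptions for illustration):

```sql
-- models/quicksight/qs_monthly_cost.sql
-- Hypothetical QuickSight-layer view: only the columns a dashboard needs,
-- plus a sliding date window so SPICE ingests only recent "hot" data.
{{ config(materialized='view') }}

select
    account_id,
    usage_month,
    total_cost
from {{ ref('monthly_account_cost') }}
where usage_month >= date_add('month', -13, current_date)
```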

This approach aligns with the BMW Group's broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.

Overall, at a high level, we've fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.

Key benefits of the dbt-Athena architecture

To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.

SQL-based, developer-friendly environment

Our team already had strong SQL skills, so dbt's SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt's added functionality, has increased our team's productivity.

Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model's materialization type, for example from a view to a table, it's as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.
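In dbt, that switch is a single config change; the SQL body of the model stays untouched (sketch):

```sql
-- Before: the model is materialized as a view in Athena.
{{ config(materialized='view') }}

-- After: the same model materialized as a physical table; dbt takes care of
-- the corresponding Glue Data Catalog entries and S3 objects on the next run.
{{ config(materialized='table') }}
```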

Agility in modeling and deployment

Documentation is crucial for any data platform's success. We use dbt's built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.

We use dbt's built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt's dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, after they're approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days instead of weeks and reduced ongoing maintenance work by catching issues early in the development cycle.
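A custom SQL test in dbt is simply a query that returns the rows violating an expectation; the test fails if any rows come back. For example, with a hypothetical non-negativity rule:

```sql
-- tests/assert_no_negative_costs.sql
-- Singular dbt test (illustrative): fails if any returned row shows
-- a cost record violating the business rule that costs are non-negative.
select *
from {{ ref('prepared_cost_and_usage') }}
where unblended_cost < 0
```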

Athena workgroup separation

We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize each one separately. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.

Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
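In the dbt-athena adapter, the workgroup a connection uses is set in the profile, so routing dbt runs and test runs to separate workgroups can be sketched as two targets (all account-specific values below are placeholders):

```yaml
# profiles.yml (sketch; region, schema, and bucket names are placeholders)
clea:
  target: dbt
  outputs:
    dbt:
      type: athena
      region_name: eu-central-1
      database: awsdatacatalog
      schema: analytics
      s3_staging_dir: s3://example-athena-results/dbt/
      work_group: dbt        # routes transformation queries to the dbt workgroup
    dbt_test:
      type: athena
      region_name: eu-central-1
      database: awsdatacatalog
      schema: analytics
      s3_staging_dir: s3://example-athena-results/dbt-test/
      work_group: dbt-test   # isolates data quality checks for separate metering
```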

Using QuickSight SPICE

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we've optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering, so that only recent data is available in dashboards.

We've implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.

Scalability and cost-efficiency

The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use Athena's distributed query execution capabilities through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using the Parquet file format, we minimize the amount of data scanned while maximizing query performance.

Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt's incremental models and partitioning strategy, we process only new or changed data instead of entire datasets. This approach has proven highly effective: we've observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.
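Combining these ideas, an incremental, partitioned Parquet model in dbt-athena might be sketched as follows (table and column names are invented; the dbt-athena adapter expects partition columns last in the select list):

```sql
-- models/prepared/daily_cost_partitioned.sql
-- Illustrative incremental model: Parquet storage, partitioned by date,
-- processing only partitions newer than the current maximum on each run.
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    format='parquet',
    partitioned_by=['usage_date']
) }}

select
    account_id,
    service,
    cost,
    usage_date
from {{ source('raw', 'cost_and_usage') }}
{% if is_incremental() %}
where usage_date > (select max(usage_date) from {{ this }})
{% endif %}
```

Because Athena bills per byte scanned, the partition filter and columnar Parquet layout reduce both runtime and cost for every downstream query.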

The effectiveness of these optimizations, implemented at the end of 2023, is visible in the following diagram, which shows costs by Athena workgroup.

The workgroups are illustrated as follows:

  • Green (QuickSight): Shows reduced data scanning post-optimization.
  • Light blue (Ad-hoc): Varies based on analysis needs.
  • Dark blue (dbt): Maintains consistent processing patterns.
  • Orange (dbt-test): Shows regular, efficient test execution.

The increased dbt workload costs directly correlate with the reduced QuickSight costs, reflecting our architectural shift from using complex views in the QuickSight workgroup (which previously masked query complexity but led to repeated computations) to using dbt to materialize these transformations. Although this increased the dbt workload, overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully handle growing data volumes while achieving a net cost reduction through efficient data materialization patterns.

Conclusion

Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena's ability to query data directly in Amazon S3 removes the need to move or copy data into a separate data warehouse, and its serverless model together with dbt's incremental processing minimizes both operational overhead and processing costs. Given our team's strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt's automated documentation and lineage, troubleshooting and identifying data issues is simplified, and the system's modularity allows quick adjustments to meet evolving business needs.

Getting started with this architecture is quick and straightforward: all that's needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup because it's a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models while optimizing resource usage, accelerating deployment, and delivering high-quality, accurate data processing.

For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.


About the Authors

Philipp Karg is a Lead FinOps Engineer at BMW Group with a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to use the cloud sustainably.

Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He is deeply passionate about cloud-native and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.
