This guest post was co-authored with Kostas Diamantis from Skroutz.
At Skroutz, we're passionate about our product, and it's always our top priority. We're constantly working to improve and evolve it, supported by a large and talented team of software engineers. Our product's continuous innovation and evolution lead to frequent updates, often requiring changes and additions to the schemas of our operational databases.
When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the main challenge, and also a strict requirement, was to make sure it wouldn't block or delay our product development.
We chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data and enabling faster insights and more informed decision-making. This choice supports a culture of transparency and collaboration, because data becomes readily available for analysis and innovation across all departments.
However, keeping up with schema changes from our operational databases, while updating the data warehouse without constantly coordinating with development teams, delaying releases, or risking data loss, became a new challenge for us.
In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.
Solution overview
Most of our data resides in our operational databases, such as MariaDB and MongoDB. Our approach involves using the change data capture (CDC) technique, which automatically handles the schema evolution of the data stores being captured. For this, we used Debezium together with a Kafka cluster. This solution allows schema changes to be propagated without disrupting the Kafka consumers.
However, handling schema evolution in Amazon Redshift became a bottleneck, prompting us to develop a strategy to address this challenge. It's important to note that, in our case, changes in our operational databases primarily involve adding new columns rather than breaking changes such as altering data types. Therefore, we implemented a semi-manual process to resolve this issue, together with a mandatory alerting mechanism that notifies us of any schema changes. This two-step process consists of handling schema evolution in real time and handling data updates in an asynchronous manual step. The following architectural diagram illustrates a hybrid deployment model, integrating both on-premises and cloud-based components.
The data flow begins with data from MariaDB and MongoDB, captured using Debezium for CDC in near real-time mode. The captured data is streamed to a Kafka cluster, where Kafka consumers (built on the Ruby Karafka framework) read the changes and write them to the staging area, either in Amazon Redshift or Amazon Simple Storage Service (Amazon S3). From the staging area, DataLoaders promote the data to production tables in Amazon Redshift. At this stage, we apply the slowly changing dimension (SCD) concept to these tables, using Type 7 for most of them.
In data warehousing, an SCD is a dimension that stores data that, although generally stable, might change over time. Various methodologies address the complexities of SCD management. SCD Type 7 places both the surrogate key and the natural key into the fact table. This allows the user to select the appropriate dimension records based on:
- The primary effective date on the fact record
- The most recent or current information
- Other dates associated with the fact record
Afterwards, analytical jobs are run to create reporting tables, enabling BI and reporting processes. The following diagram provides an example of the data modeling process from a staging table to a production table.
The architecture depicted in the diagram shows only our CDC pipeline, which fetches data from our operational databases; it doesn't include other pipelines, such as those for fetching data through APIs, scheduled batch processes, and many more. Also note that, by our convention, dw_* columns are used to hold SCD metadata and other metadata in general. In the following sections, we discuss the key components of the solution in more detail.
Real-time workflow
For the schema evolution part, we focus on the column dw_md_missing_data, which captures, in near real time, schema evolution changes that occur in the source databases. When a new change is produced to the Kafka cluster, the Kafka consumer is responsible for writing this change to the staging table in Amazon Redshift. For example, a message produced by Debezium to the Kafka cluster will have the following structure when a new shop entity is created:
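As a rough illustration, a simplified Debezium event value for a new shop might look like the following sketch (envelope fields such as source and ts_ms are trimmed, and the columns id, name, and created_at are illustrative rather than our actual schema):

```json
{
  "op": "c",
  "before": null,
  "after": {
    "id": 1,
    "name": "My Shop",
    "created_at": "2024-01-01T10:00:00Z"
  }
}
```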
The Kafka consumer is responsible for preparing and executing the SQL INSERT statement:
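A minimal sketch of that statement, assuming a staging table named staging.shops (an assumed name) with the illustrative columns from the previous example:

```sql
INSERT INTO staging.shops (id, name, created_at, dw_md_missing_data)
VALUES (1, 'My Shop', '2024-01-01 10:00:00', NULL);
```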
After that, let's say a new column called new_column is added to the source table, with the value new_value. The new message produced to the Kafka cluster will have the following format:
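Continuing the illustrative sketch, the after section of the event now carries the extra field:

```json
{
  "op": "c",
  "before": null,
  "after": {
    "id": 2,
    "name": "Another Shop",
    "created_at": "2024-01-05T09:30:00Z",
    "new_column": "new_value"
  }
}
```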
Now the SQL INSERT statement executed by the Kafka consumer will be as follows:
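Sketched against the same assumed staging.shops table, the unknown field lands in dw_md_missing_data as JSON, using Redshift's JSON_PARSE to produce a SUPER value:

```sql
INSERT INTO staging.shops (id, name, created_at, dw_md_missing_data)
VALUES (2, 'Another Shop', '2024-01-05 09:30:00',
        JSON_PARSE('{"new_column": "new_value"}'));
```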
The consumer performs an INSERT as it would for the known schema, and anything new is added to the dw_md_missing_data column as key-value JSON. After the data is promoted from the staging table to the production table, it will have the following structure.
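With the assumed columns from the earlier sketches, querying the production table at this point would show something like the following:

```sql
SELECT id, name, dw_md_missing_data
FROM production.shops
ORDER BY id;

-- id | name         | dw_md_missing_data
-- ---+--------------+----------------------------
--  1 | My Shop      | NULL
--  2 | Another Shop | {"new_column":"new_value"}
```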
At this point, the data flow keeps working without any data loss or the need for communication with the teams responsible for maintaining the schema in the operational databases. However, this data won't be easily accessible to data consumers, analysts, or other personas. It's worth noting that dw_md_missing_data is defined as a column of the SUPER data type, which was introduced in Amazon Redshift to store semistructured data or documents as values.
Monitoring mechanism
To track new columns added to a table, we have a scheduled process that runs weekly. This process checks for tables in Amazon Redshift with values in the dw_md_missing_data column and generates a list of tables requiring manual action to make this data available through a structured schema. A notification is then sent to the team.
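One way to drive such a check (a sketch, not our exact implementation) is to list every table that defines the dw_md_missing_data column from the Redshift system view SVV_COLUMNS and then, per table, count the rows still carrying values in it:

```sql
-- Find every table that has the dw_md_missing_data column
SELECT table_schema, table_name
FROM svv_columns
WHERE column_name = 'dw_md_missing_data';

-- For each of those tables, check whether any rows still need remediation
SELECT COUNT(*) AS pending_rows
FROM production.shops
WHERE dw_md_missing_data IS NOT NULL;
```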
Manual remediation steps
In the aforementioned example, the manual steps to make this column available would be:
- Add the new column to both the staging and production tables (see the SQL sketch after this list).
- Update the Kafka consumer's known schema. In this step, we just need to add the new column name to a simple array list.
- Update the DataLoader's SQL logic for the new column. A DataLoader is responsible for promoting the data from the staging area to the production table.
- Transfer the data that has been loaded in the meantime from the dw_md_missing_data SUPER column to the newly added column, and then clean up. In this step, we just need to run a data migration like the one sketched after this list.
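The following sketch illustrates the first and last steps for the running example, assuming the new column is a VARCHAR and the tables are named staging.shops and production.shops; the consumer and DataLoader changes are code changes in our applications and are omitted here. The exact SUPER navigation and casts may need adjusting for your schema:

```sql
-- Add the new column to both the staging and production tables
ALTER TABLE staging.shops ADD COLUMN new_column VARCHAR(255);
ALTER TABLE production.shops ADD COLUMN new_column VARCHAR(255);

-- Backfill rows loaded in the meantime from the SUPER column,
-- then clear dw_md_missing_data to finish the cleanup
UPDATE production.shops
SET new_column = dw_md_missing_data.new_column::VARCHAR,
    dw_md_missing_data = NULL
WHERE dw_md_missing_data.new_column IS NOT NULL;
```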
To perform the preceding operations, we make sure that no one else makes changes to the production.shops table, because we want no new data to be added to the dw_md_missing_data column.
Conclusion
The solution discussed in this post enabled Skroutz to manage schema evolution in operational databases while seamlessly updating the data warehouse. This alleviated the need for constant development team coordination and removed the risk of data loss during releases, ultimately fostering innovation rather than stifling it.
As Skroutz's migration to the AWS Cloud approaches, discussions are underway on how the current architecture can be adapted to align more closely with AWS-centered principles. To that end, one of the changes being considered is Amazon Redshift streaming ingestion from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or open source Kafka, which will make it possible for Skroutz to process large volumes of streaming data from multiple sources with low latency and high throughput to derive insights in seconds.
If you face similar challenges, talk with an AWS representative and work backward from your use case to arrive at the most suitable solution.
About the authors
Konstantina Mavrodimitraki is a Senior Solutions Architect at Amazon Web Services, where she assists customers in designing scalable, robust, and secure systems in global markets. With deep expertise in data strategy, data warehousing, and big data systems, she helps organizations transform their data landscapes. A passionate technologist and people person, Konstantina loves exploring emerging technologies and supports the local tech communities. Additionally, she enjoys reading books and playing with her dog.
Kostas Diamantis is the Head of the Data Warehouse at Skroutz. With a background in software engineering, he transitioned into data engineering, using his technical expertise to build scalable data solutions. Passionate about data-driven decision-making, he focuses on optimizing data pipelines, enhancing analytics capabilities, and driving business insights.