
Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator


Resilience has always been a top priority for customers running mission-critical Apache Kafka applications. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is deployed across multiple Availability Zones and provides resilience within an AWS Region. However, mission-critical Kafka deployments require cross-Region resilience to minimize downtime during a service impairment in a Region. With Amazon MSK Replicator, you can build multi-Region resilient streaming applications to provide business continuity, share data with partners, aggregate data from multiple clusters for analytics, and serve global clients with reduced latency. This post explains how to use MSK Replicator for cross-cluster data replication and details the failover and failback processes while keeping the same topic name across Regions.

MSK Replicator overview

Amazon MSK offers two cluster types: Provisioned and Serverless. Provisioned clusters support two broker types: Standard and Express. With the introduction of Amazon MSK Express brokers, you can now deploy MSK clusters that significantly reduce recovery time by up to 90% while delivering consistent performance. Express brokers provide up to three times the throughput per broker and scale up to 20 times faster compared to Standard brokers running Apache Kafka. MSK Replicator works with both broker types in Provisioned clusters as well as with Serverless clusters.

MSK Replicator supports an identical topic name configuration, enabling seamless topic name retention across both active-active and active-passive replication. This avoids the risk of infinite replication loops commonly associated with third-party or open source replication tools. When deploying an active-passive cluster architecture for regional resilience, where one cluster handles live traffic and the other acts as a standby, an identical topic name configuration simplifies the failover process. Applications can transition to the standby cluster without reconfiguration because topic names remain consistent across the source and target clusters.

To set up an active-passive deployment, you must enable multi-VPC connectivity for the MSK cluster in the primary Region and deploy an MSK Replicator in the secondary Region. The replicator consumes data from the primary Region's MSK cluster and asynchronously replicates it to the secondary Region. You connect the clients initially to the primary cluster but fail them over to the secondary cluster in the case of a primary Region impairment. When the primary Region recovers, you deploy a new MSK Replicator to replicate data back from the secondary cluster to the primary. You need to stop the client applications in the secondary Region and restart them in the primary Region.

Because replication with MSK Replicator is asynchronous, there is a possibility of duplicate data in the secondary cluster. During a failover, consumers might reprocess some messages from Kafka topics. To handle this, deduplication should occur on the consumer side, such as by using an idempotent downstream system like a database.

In the following sections, we demonstrate how to deploy MSK Replicator in an active-passive architecture with identical topic names. We provide a step-by-step guide for failing over to the secondary Region during a primary Region impairment and failing back when the primary Region recovers. For an active-active setup, refer to Create an active-active setup using MSK Replicator.

Solution overview

In this setup, we deploy a primary MSK Provisioned cluster with Express brokers in the us-east-1 Region. To provide cross-Region resilience for Amazon MSK, we establish a secondary MSK cluster with Express brokers in the us-east-2 Region and replicate topics from the primary MSK cluster to the secondary cluster using MSK Replicator. This configuration provides high resilience within each Region by using Express brokers, and cross-Region resilience is achieved through an active-passive architecture, with replication managed by MSK Replicator.

The following diagram illustrates the solution architecture.

The primary Region MSK cluster handles client requests. In the event of a failure to communicate with the MSK cluster due to a primary Region impairment, you need to fail over the clients to the secondary MSK cluster. The producer writes to the customer topic in the primary MSK cluster, and the consumer with the group ID msk-consumer reads from the same topic. As part of the active-passive setup, we configure MSK Replicator to use identical topic names, making sure that the customer topic remains consistent across both clusters without requiring changes from the clients. The entire setup is deployed within a single AWS account.

In the following sections, we describe how to set up a multi-Region resilient MSK cluster using MSK Replicator and also show the failover and failback strategy.

Provision an MSK cluster using AWS CloudFormation

We provide AWS CloudFormation templates to provision certain resources:

This will create the virtual private cloud (VPC), subnets, and the MSK Provisioned cluster with Express brokers within the VPC, configured with AWS Identity and Access Management (IAM) authentication in each Region. It will also create a Kafka client Amazon Elastic Compute Cloud (Amazon EC2) instance, where we can use the Kafka command line to create and view a Kafka topic and produce and consume messages to and from the topic.
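
If you prefer to deploy the templates from the command line, a minimal sketch looks like the following. The template file name and stack name are hypothetical placeholders, and you would repeat the deployment in us-east-2 for the secondary cluster:

aws cloudformation deploy \
--region us-east-1 \
--stack-name dr-test-primary \
--template-file msk-primary-cluster.yaml \
--capabilities CAPABILITY_NAMED_IAM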

Configure multi-VPC connectivity in the primary MSK cluster

After the clusters are deployed, you need to enable multi-VPC connectivity in the primary MSK cluster deployed in us-east-1. This allows MSK Replicator to connect to the primary MSK cluster using multi-VPC connectivity (powered by AWS PrivateLink). Multi-VPC connectivity is only required for cross-Region replication. For same-Region replication, MSK Replicator uses an IAM policy to connect to the primary MSK cluster.

MSK Replicator uses IAM authentication only to connect to both the primary and secondary MSK clusters. Therefore, although other Kafka clients can still continue to use SASL/SCRAM or mTLS authentication, IAM authentication must be enabled for MSK Replicator to work.

To enable multi-VPC connectivity, complete the following steps (a CLI alternative is shown after these steps):

  1. On the Amazon MSK console, navigate to the MSK cluster.
  2. On the Properties tab, under Network settings, choose Turn on multi-VPC connectivity on the Edit dropdown menu.

  3. For Authentication type, select IAM role-based authentication.
  4. Choose Turn on selection.
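
You can also turn on multi-VPC connectivity with the AWS CLI. The following is a minimal sketch, assuming the current shape of the update-connectivity API; the cluster ARN and current version are placeholders (you can get the version from describe-cluster-v2):

aws kafka update-connectivity \
--region us-east-1 \
--cluster-arn <primary MSK cluster ARN> \
--current-version <current cluster version> \
--connectivity-info '{"VpcConnectivity":{"ClientAuthentication":{"Sasl":{"Iam":{"Enabled":true}}}}}'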

Enabling multi-VPC connectivity is a one-time setup, and it can take approximately 30–45 minutes depending on the number of brokers. After it is enabled, you need to update the MSK cluster resource policy to allow MSK Replicator to communicate with the primary cluster (an example policy sketch follows the next steps).

  5. Under Security settings, choose Edit cluster policy.
  6. Select Include Kafka service principal.
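
If you manage the cluster policy programmatically instead, you can attach it with put-cluster-policy. The following is a minimal sketch; the cluster ARN is a placeholder, and you should confirm the exact set of actions the Kafka service principal needs in the Amazon MSK Replicator documentation:

aws kafka put-cluster-policy \
--region us-east-1 \
--cluster-arn <primary MSK cluster ARN> \
--policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "kafka.amazonaws.com" },
    "Action": [
      "kafka:CreateVpcConnection",
      "kafka:GetBootstrapBrokers",
      "kafka:DescribeClusterV2"
    ],
    "Resource": "<primary MSK cluster ARN>"
  }]
}'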

Now that the cluster is enabled to receive requests from MSK Replicator using PrivateLink, we need to set up the replicator.

Create an MSK Replicator

Complete the following steps to create an MSK Replicator:

  1. In the secondary Region (us-east-2), open the Amazon MSK console.
  2. Choose Replicators in the navigation pane.
  3. Choose Create replicator.
  4. Enter a name and optional description.

  5. In the Source cluster section, provide the following information:
    1. For Cluster region, choose us-east-1.
    2. For MSK cluster, enter the Amazon Resource Name (ARN) for the primary MSK cluster.

For a cross-Region setup, the primary cluster will appear disabled if multi-VPC connectivity isn't enabled and the cluster resource policy isn't configured on the primary MSK cluster. After you choose the primary cluster, it automatically selects the subnets associated with the primary cluster. Security groups are not required because access to the primary cluster is governed by the cluster resource policy.

Next, you select the target cluster. The target cluster Region defaults to the Region where the MSK Replicator is created. In this case, it's us-east-2.

  6. In the Target cluster section, provide the following information:
    1. For MSK cluster, enter the ARN of the secondary MSK cluster. This automatically selects the cluster subnets and the security group associated with the secondary cluster.
    2. For Security groups, choose any additional security groups.

Make sure the security groups have outbound rules that allow traffic to your secondary cluster's security groups. Also make sure that your secondary cluster's security groups have inbound rules that accept traffic from the MSK Replicator security groups provided here.

Now let's provide the MSK Replicator settings.

  7. In the Replicator settings section, enter the following information:
    1. For Topics to replicate, we keep the default value, which replicates all topics from the primary to the secondary cluster.
    2. For Replication starting position, we choose Earliest, so that we can get all the events from the start of the source topics.
    3. For Copy settings, select Keep the same topic names to configure the topic names in the secondary cluster to be identical to those in the primary cluster.

This makes sure that the MSK clients don't need to add a prefix to the topic names.

  8. For this example, we keep the Consumer group replication setting as default and set Target compression type as None.

MSK Replicator will also automatically create the required IAM policies.

  9. Choose Create to create the replicator.

The process takes around 15–20 minutes to deploy the replicator. After the MSK Replicator is running, this will be reflected in its status.
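
You can also check the status from the AWS CLI with describe-replicator. The replicator ARN below is a placeholder, and the state field should report RUNNING once deployment completes (check the describe-replicator output for the exact field name):

aws kafka describe-replicator \
--region us-east-2 \
--replicator-arn <MSK Replicator ARN> \
--query 'ReplicatorState'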

Configure the MSK client for the primary cluster

Complete the following steps to configure the MSK client:

  1. On the Amazon EC2 console, navigate to the EC2 instances of the primary Region (us-east-1) and connect to the EC2 instance dr-test-primary-KafkaClientInstance1 using Session Manager, a capability of AWS Systems Manager.

After you have logged in, you need to configure the primary MSK cluster bootstrap address to create a topic and publish data to the cluster. You can get the bootstrap address for IAM authentication on the Amazon MSK console under View client information on the cluster details page.

  2. Configure the bootstrap address with the following code (you can also retrieve it with the AWS CLI, as shown after this step):
sudo su - ec2-user

export BS_PRIMARY=<primary MSK cluster bootstrap address>
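
The following is a sketch of the equivalent AWS CLI call to fetch the IAM bootstrap address; the cluster ARN is a placeholder:

aws kafka get-bootstrap-brokers \
--region us-east-1 \
--cluster-arn <primary MSK cluster ARN> \
--query 'BootstrapBrokerStringSaslIam' \
--output text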

  3. Configure the client configuration for IAM authentication to communicate with the MSK cluster:
echo -n "security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
" > /home/ec2-user/kafka/config/client_iam.properties

Create a topic and produce and consume messages to the topic

Complete the following steps to create a topic and then produce and consume messages to it:

  1. Create a customer topic:
/home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server=$BS_PRIMARY \
--create --replication-factor 3 --partitions 3 \
--topic customer \
--command-config=/home/ec2-user/kafka/config/client_iam.properties

  2. Create a console producer to write to the topic:
/home/ec2-user/kafka/bin/kafka-console-producer.sh \
--bootstrap-server=$BS_PRIMARY --topic customer \
--producer.config=/home/ec2-user/kafka/config/client_iam.properties

  3. Produce the following sample text to the topic:
This is a customer topic
This is the 2nd message to the topic.

  4. Press Ctrl+C to exit the console prompt.
  5. Create a consumer with group.id msk-consumer to read all the messages from the beginning of the customer topic:
/home/ec2-user/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server=$BS_PRIMARY --topic customer --from-beginning \
--consumer.config=/home/ec2-user/kafka/config/client_iam.properties \
--consumer-property group.id=msk-consumer

This will consume both sample messages from the topic.

  6. Press Ctrl+C to exit the console prompt.

Configure the MSK client for the secondary MSK cluster

Go to the EC2 instance of the secondary Region (us-east-2) and follow the previously mentioned steps to configure an MSK client. The only difference from the earlier steps is that you should use the bootstrap address of the secondary MSK cluster as the environment variable. Configure the variable $BS_SECONDARY with the secondary Region MSK cluster bootstrap address.
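
For example, on the secondary client instance (the bootstrap address is a placeholder you copy from the secondary cluster's client information page):

sudo su - ec2-user

export BS_SECONDARY=<secondary MSK cluster bootstrap address>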

Verify replication

After the client is configured to communicate with the secondary MSK cluster using IAM authentication, list the topics in the cluster. Because the MSK Replicator is now running, the customer topic is replicated. To verify it, let's see the list of topics in the cluster:

/home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server=$BS_SECONDARY \
--list --command-config=/home/ec2-user/kafka/config/client_iam.properties

The topic name is customer, without any prefix.

By default, MSK Replicator replicates the details of all the consumer groups. Because you used the default configuration, you can verify with the following command that the consumer group ID msk-consumer is also replicated to the secondary cluster:

/home/ec2-user/kafka/bin/kafka-consumer-groups.sh --bootstrap-server=$BS_SECONDARY \
--list --command-config=/home/ec2-user/kafka/config/client_iam.properties
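
To also confirm that the consumer offsets were carried over, you can describe the replicated group on the secondary cluster; the output lists the customer topic partitions with the offsets committed by msk-consumer on the primary cluster:

/home/ec2-user/kafka/bin/kafka-consumer-groups.sh --bootstrap-server=$BS_SECONDARY \
--describe --group msk-consumer \
--command-config=/home/ec2-user/kafka/config/client_iam.properties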

Now that we have verified the topic is replicated, let's understand the key metrics to monitor.

Monitor replication

Monitoring MSK Replicator is critical to make sure that data replication is happening quickly. This reduces the risk of data loss if an unplanned failure occurs. Some important MSK Replicator metrics to monitor are ReplicationLatency, MessageLag, and ReplicatorThroughput. For a detailed list, see Monitor replication.

To understand how many bytes are processed by MSK Replicator, you should monitor the metric ReplicatorBytesInPerSec. This metric indicates the average number of bytes processed by the replicator per second. Data processed by MSK Replicator consists of all data MSK Replicator receives, including the data replicated to the target cluster and the data filtered out by MSK Replicator. This metric is applicable when you use Keep the same topic names in the MSK Replicator copy settings. During a failback scenario, MSK Replicator starts to read from the earliest offset and replicates records from the secondary back to the primary. Depending on the retention settings, some data might already exist in the primary cluster. To prevent duplicates, MSK Replicator processes the data but automatically filters out duplicate data.
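
As an example, the following sketch pulls the average ReplicationLatency for the last hour with the AWS CLI. It assumes the replicator metrics are published in the AWS/Kafka CloudWatch namespace with a ReplicatorName dimension; check Monitor replication for the exact namespace and dimension names:

aws cloudwatch get-metric-statistics \
--region us-east-2 \
--namespace AWS/Kafka \
--metric-name ReplicationLatency \
--dimensions Name=ReplicatorName,Value=<replicator name> \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 300 \
--statistics Average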

Fail over clients to the secondary MSK cluster

In the case of an unexpected event in the primary Region in which clients can't connect to the primary MSK cluster, or clients are receiving unexpected produce and consume errors, this could be a sign that the primary MSK cluster is impacted. You might notice a sudden spike in replication latency. If the latency continues to rise, it could indicate a regional impairment in Amazon MSK. To verify this, you can check the AWS Health Dashboard, though there is a chance that status updates may be delayed. When you identify signs of a regional impairment in Amazon MSK, you should prepare to fail over the clients to the secondary Region.

For critical workloads, we recommend not taking a dependency on control plane actions for failover. To mitigate this risk, you could implement a pilot light deployment, where essential components of the stack are kept running in a secondary Region and scaled up when the primary Region is impaired. Alternatively, for faster and smoother failover with minimal downtime, a hot standby approach is recommended. This involves pre-deploying the entire stack in a secondary Region so that, in a disaster recovery scenario, the pre-deployed clients can be quickly activated in the secondary Region.

Failover process

To perform the failover, you first need to stop the clients pointed to the primary MSK cluster. However, for the purpose of the demo, we're using console producers and consumers, so our clients are already stopped.

In a real failover scenario, using primary Region clients to communicate with the secondary Region MSK cluster isn't recommended, because it breaches fault isolation boundaries and leads to increased latency. To simulate the failover using the preceding setup, let's start a producer and consumer in the secondary Region (us-east-2). For this, run a console producer on the EC2 instance (dr-test-secondary-KafkaClientInstance1) of the secondary Region.

The following diagram illustrates this setup.

Complete the following steps to perform a failover:

  1. Create a console producer using the following code:
/home/ec2-user/kafka/bin/kafka-console-producer.sh \
--bootstrap-server=$BS_SECONDARY --topic customer \
--producer.config=/home/ec2-user/kafka/config/client_iam.properties

  2. Produce the following sample text to the topic:
This is the 3rd message to the topic.
This is the 4th message to the topic.

Now, let's create a console consumer. It's important to make sure the consumer group ID is exactly the same as that of the consumer attached to the primary MSK cluster. For this, we use the group.id msk-consumer to read the messages from the customer topic. This simulates bringing up the same consumer that was attached to the primary cluster.

  3. Create a console consumer with the following code:
/home/ec2-user/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server=$BS_SECONDARY --topic customer --from-beginning \
--consumer.config=/home/ec2-user/kafka/config/client_iam.properties \
--consumer-property group.id=msk-consumer

Although the consumer is configured to read all the data from the earliest offset, it only consumes the last two messages produced by the console producer. This is because MSK Replicator has replicated the consumer group details, along with the offsets read by the consumer with the consumer group ID msk-consumer. The console consumer with the same group.id mimics the behavior of the consumer failing over to the secondary Amazon MSK cluster.

Fail back clients to the primary MSK cluster

Failing back clients to the primary MSK cluster is the common pattern in an active-passive scenario when the service in the primary Region has recovered. Before we fail back clients to the primary MSK cluster, it's important to sync the primary MSK cluster with the secondary MSK cluster. For this, we need to deploy another MSK Replicator in the primary Region, configured to read from the earliest offset of the secondary MSK cluster and write to the primary cluster with the same topic names. This MSK Replicator will copy the data from the secondary MSK cluster to the primary MSK cluster. Although it is configured to start from the earliest offset, it won't duplicate the data already present in the primary MSK cluster. It will automatically filter out the existing messages and will only write back the new data produced in the secondary MSK cluster while the primary MSK cluster was down. The replication step from secondary to primary wouldn't be required if you don't have a business requirement to keep the data the same across both clusters.

After the MSK Replicator is up and running, monitor its MessageLag metric. This metric indicates how many messages are yet to be replicated from the secondary MSK cluster to the primary MSK cluster. The MessageLag metric should come down close to 0. Now you should stop the producers writing to the secondary MSK cluster and restart them connecting to the primary MSK cluster. You should also allow the consumers to keep reading data from the secondary MSK cluster until the MaxOffsetLag metric for the consumers is 0. This makes sure that the consumers have already processed all the messages from the secondary MSK cluster. The MessageLag metric should be 0 by this time, because no producer is producing records in the secondary cluster and MSK Replicator has replicated all messages from the secondary cluster to the primary cluster. At this point, you should start the consumer with the same group.id in the primary Region. You can then delete the MSK Replicator created to copy messages from the secondary to the primary cluster. Make sure that the previously existing MSK Replicator is in RUNNING status and successfully replicating messages from the primary to the secondary. This can be confirmed by looking at the ReplicatorThroughput metric, which should be greater than 0.
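
Before switching the consumers back, a quick way to confirm they have processed everything in the secondary cluster is to describe the consumer group there and check that the LAG column is 0 for each partition of the customer topic:

/home/ec2-user/kafka/bin/kafka-consumer-groups.sh --bootstrap-server=$BS_SECONDARY \
--describe --group msk-consumer \
--command-config=/home/ec2-user/kafka/config/client_iam.properties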

Failback process

To simulate a failback, you first need to enable multi-VPC connectivity in the secondary MSK cluster (us-east-2) and add a cluster policy for the Kafka service principal, as we did before.

Deploy the MSK Replicator in the primary Region (us-east-1) with the source MSK cluster pointed to us-east-2 and the target cluster pointed to us-east-1. Configure Replication starting position as Earliest and Copy settings as Keep the same topic names.

The following diagram illustrates this setup.

After the MSK Replicator is in RUNNING status, let's verify that there are no duplicates while replicating the data from the secondary to the primary MSK cluster.

Run a console consumer without the group.id on the EC2 instance (dr-test-primary-KafkaClientInstance1) of the primary Region (us-east-1):

/home/ec2-user/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server=$BS_PRIMARY --topic customer --from-beginning \
--consumer.config=/home/ec2-user/kafka/config/client_iam.properties

This should show the four messages without any duplicates. Although the consumer is set to read from the earliest offset, MSK Replicator makes sure that duplicate data isn't replicated back to the primary cluster from the secondary cluster.

This is a customer topic
This is the 2nd message to the topic.
This is the 3rd message to the topic.
This is the 4th message to the topic.

You can now point the clients to start producing to and consuming from the primary MSK cluster.

Clean up

At this point, you can tear down the MSK Replicator deployed in the primary Region.
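
If you prefer the AWS CLI for cleanup, the call is a sketch along these lines; the replicator ARN is a placeholder for the failback replicator you created in us-east-1:

aws kafka delete-replicator \
--region us-east-1 \
--replicator-arn <failback MSK Replicator ARN>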

Conclusion

This post explored how to enhance Kafka resilience by setting up a secondary MSK cluster in another Region and synchronizing it with the primary cluster using MSK Replicator. We demonstrated how to implement an active-passive disaster recovery strategy while maintaining consistent topic names across both clusters. We provided a step-by-step guide for configuring replication with identical topic names and detailed the processes for failover and failback. Additionally, we highlighted key metrics to monitor and outlined actions to provide efficient and continuous data replication.

For more information, refer to What is Amazon MSK Replicator? For a hands-on experience, try the Amazon MSK Replicator Workshop. We encourage you to try out this feature and share your feedback with us.


About the Author

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.
