Optimizing Flink’s be part of operations on Amazon EMR with Alluxio

February 4, 2026

3

Whenever you’re working with knowledge evaluation, you usually face the problem of successfully correlating real-time knowledge with historic knowledge to realize actionable insights. This turns into significantly crucial once you’re coping with eventualities like e-commerce order processing, the place your real-time selections can considerably affect enterprise outcomes. The complexity arises when you could mix streaming knowledge with static reference data to create a complete analytical framework that helps each your rapid operational wants and strategic planning

To deal with this problem, you’ll be able to make use of stream processing applied sciences that deal with steady knowledge flows whereas seamlessly integrating dwell knowledge streams with static dimension tables. These options allow you to carry out detailed evaluation and aggregation of information, supplying you with a complete view that mixes the immediacy of real-time knowledge with the depth of historic context. Apache Flink has emerged as a number one stream computing platform that gives strong capabilities for becoming a member of real-time and offline knowledge sources by means of its intensive connector ecosystem and SQL API.

On this submit, we present you the right way to implement real-time knowledge correlation utilizing Apache Flink to affix streaming order knowledge with historic buyer and product data, enabling you to make knowledgeable selections based mostly on complete, up-to-date analytics.

We additionally introduce an optimized resolution to routinely load Hive dimension desk knowledge into Alluxio Common Flash Storage (UFS) by means of the Alluxio cache layer. This permits Flink to carry out temporal joins on altering knowledge, precisely reflecting the content material of a desk at particular deadlines.

Resolution structure

Relating to becoming a member of Flink SQL tables with stream tables, the lookup be part of is a go-to technique. This strategy is especially efficient when you could correlate streaming knowledge with static or slowly altering knowledge. In Flink, you should utilize connectors just like the Flink Hive SQL connector or the FileSystem connector to archive the state of affairs.

The next structure reveals normal strategy which we describe forward：

Right here’s how we do that:

We use offline knowledge to assemble a Flink desk. This knowledge could possibly be from an offline Hive database desk or from recordsdata saved in a system like Amazon S3. Concurrently, we will create a stream desk from the information flowing in by means of a Kafka message stream
Use a batch cluster for offline knowledge processing. On this instance, we use an Amazon EMR cluster which creates a truth desk in it. It additionally supplies a Element Large Knowledge (DWD) desk which has been used as a Flink dynamic desk to carry out consequence processing after a lookup be part of
- It’s sometimes positioned within the center layer of an information warehouse, between the uncooked knowledge contained within the Operational Knowledge Retailer (ODS) and the extremely aggregated knowledge discovered within the Knowledge Warehouse (DW), or Knowledge Mart (DM).
- The first objective of the DWD layer is to assist advanced knowledge evaluation and reporting wants by offering an in depth and complete knowledge view.
- Each the actual fact desk and DWD desk are hive tables on Hadoop
Use a streaming cluster for the real-time processing. On this instance, we use an Amazon EMR cluster to stream occasion ingestion and analyze it utilizing Flink, utilizing Flink Kafka connector and Hive connector to affix the streaming occasion knowledge and statics dimension knowledge (truth desk)

One of many key challenges encountered with this strategy is expounded to the administration of the lookup dimension desk knowledge. Initially, when the Flink software is began, this knowledge is saved within the activity supervisor’s state. Nonetheless, throughout subsequent operations like steady queries or window aggregations, the dimension desk knowledge isn’t routinely refreshed. Which means that the operator should both restart the Flink software periodically or manually refresh the dimension desk knowledge within the short-term desk. This step is essential to make sure that the be part of operations and aggregations are at all times carried out with essentially the most present dimension knowledge.

One other important problem with this strategy is needing to drag all the dimension desk knowledge and carry out a chilly begin every time. This turns into significantly problematic when coping with a big quantity of dimension desk knowledge. As an example, when dealing with tables with tens of hundreds of thousands of registered customers or tens of hundreds of product SKU attributes, this course of generates substantial enter/output (IO) overhead. Consequently, it results in efficiency bottlenecks, impacting the effectivity of the system.

Flink’s checkpointing mechanism processes the information and shops checkpoint snapshots of all of the states throughout steady queries or window aggregations, leading to state snapshots knowledge bloat.

Optimizing the answer

This submit contains an optimized resolution to deal with the aforementioned challenges, by routinely loading Hive dimension desk knowledge into the Alluxio UFS by way of the Alluxio cache layer. We be part of this knowledge with Flink’s temporal joins to create a view on a altering desk. This view displays the content material of a desk at a selected time limit

Alluxio is a distributed cache engine for giant knowledge know-how stacks. It supplies a unified UFS that may connect with the underlying Amazon S3 and HDFS knowledge. Alluxio UFS learn and write operations heat up the distributed storage layers on S3 and HDFS and thus considerably improve throughput and lowering community overhead. Deeply built-in with higher stage computing engines corresponding to Hive, Spark, and Trino, Alluxio is a superb cache accelerator for offline dimension knowledge.

Moreover, we make the most of Flink’s temporal desk perform to go a time parameter. This perform returns a view of the temporal desk on the specified time. By doing so, when the primary desk of the real-time dynamic desk is correlated with the temporal desk, it may be related to a selected historic model of the dimension knowledge

Resolution implementation particulars

For this submit, we use “consumer conduct” log knowledge in Kafka as real-time stream truth desk knowledge, and consumer data knowledge on Hive as offline dimension desk knowledge. A demo with Alluxio + Flink temporal be part of is used to confirm the Flink be part of optimized resolution.

Actual-time truth tables

For this demonstration, we make the most of consumer conduct JSON knowledge simulated by the open-source part json-data-generator. We write the information to Amazon Managed Kafka (Amazon MSK) in real-time. Utilizing the Flink Kafka Connector, we convert this stream right into a Flink stream desk for steady queries. This served as our truth desk knowledge for real-time joins.

A pattern of the consumer conduct simulation knowledge in JSON format is as follows:

[{          
	"timestamp": "nowTimestamp()",
	"system": "BADGE",
	"actor": "Agnew",
	"action": "EXIT",
	"objects": ["Building 1"],
	"location": "45.5,44.3",
	"message": "Exited Constructing 1"
}]

It contains consumer conduct data corresponding to operation time, login system, consumer signature, behavioral actions, and repair objects, areas, and associated textual content fields. We create a truth desk in Flink SQL with the primary fields as follows:

CREATE TABLE logevent_source (`timestamp`  string, 
`system` string,
 actor STRING,
 motion STRING
) WITH (
'connector' = 'kafka',
'subject' = 'logevent',
'properties.bootstrap.servers' = 'b-6.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092,b-5.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092,b-1.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092 (http://b-6.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092percent2Cb-5.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092percent2Cb-1.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092/)',
'properties.group.id' = 'testGroup6',
'scan.startup.mode'='latest-offset',
'format' = 'json'
);

Caching dimension tables with Alluxio

Amazon EMR supplies stable integration with Alluxio. You should utilize the Amazon EMR bootstrap startup script to routinely deploy Alluxio parts and begin the Alluxio grasp and employee processes when an Amazon EMR cluster is created. For detailed set up and deployment steps, seek advice from the article Integrating Alluxio on Amazon EMR.

In an Amazon EMR cluster that integrates Alluxio, chances are you’ll use Alluxio to create a cache desk for the Hive offline dimension desk as follows:

##Arrange the consumer jar bundle in hive-env.sh:
$ export HIVE_AUX_JARS_PATH=//consumer/alluxio-2.2.0-client.jar:${HIVE_AU

##Ensure the UFS is configured on the EMR cluster the place Alluxio is put in and that the desk/db path has been created:
alluxio fs mkdir alluxio://ip-xxx-xxx-xxx-xxx.ap-southeast-1.compute.inner:19998/s3/buyer
alluxio fs chown hadoop:hadoop alluxio://ip-xxx-xxx-xxx-xxx.ap-southeast-1.compute.inner:19998/s3/buyer

##On the AWS EMR cluster, create a Hive desk path pointing to Alluxio namespace URI:
!join jdbc:hive2://xxx.xxx.xxx.xxx:10000/default;
hive> CREATE TABLE buyer(
    c_customer_sk             bigint,
    c_customer_id             string,
    c_current_cdemo_sk        bigint,
    c_current_hdemo_sk        bigint,
    c_current_addr_sk         bigint,
    c_first_shipto_date_sk    bigint,
    c_first_sales_date_sk     bigint,
    c_salutation              string,
    c_first_name              string,
    c_last_name               string,
    c_preferred_cust_flag     string,
    c_birth_day               int,
    c_birth_month             int,
    c_birth_year              int,
    c_birth_country           string,
    c_login                   string,
    c_email_address           string
)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'alluxio://ip-xxx-xxx-xxx-xxx.ap-southeast-1.compute.inner:19998/s3/buyer';
OK
Time taken: 3.485 seconds

As proven within the earlier part, the Alluxio desk location alluxio://ip-xxx-xx:19998/s3/buyer factors to the S3 path the place the Hive dimension desk is positioned; writing to the shopper dimension desk is routinely synchronized to the Alluxio cache.

After creating the Alluxio Hive offline dimension desk, you’ll be able to view the small print of the Alluxio cache desk by connecting to the Hive metadata by means of the Hive catalog in Flink SQL:

CREATE CATALOG hiveCatalog WITH (  'sort' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/and so forth/hive/conf/',
    'hive-version' = '3.1.2',
    'hadoop-conf-dir'='/and so forth/hadoop/conf/'
);
-- set the HiveCatalog as the present catalog of the session
USE CATALOG hiveCatalog;
present create desk buyer;
create exterior desk buyer(
    c_customer_sk             bigint,
    c_customer_id             string,
    c_current_cdemo_sk        bigint,
    c_current_hdemo_sk        bigint,
    c_current_addr_sk         bigint,
    c_first_shipto_date_sk    bigint,
    c_first_sales_date_sk     bigint,
    c_salutation              string,
    c_first_name              string,
    c_last_name               string,
    c_preferred_cust_flag     string,
    c_birth_day               int,
    c_birth_month             int,
    c_birth_year              int,
    c_birth_country           string,
    c_login                   string,
    c_email_address           string
) 
row format delimited fields terminated by '|'
location 'alluxio://ip-xxx-xxx-xxx-xxx.ap-southeast-1.compute.inner:19998/s3/30/buyer' 
TBLPROPERTIES (
  'streaming-source.allow' = 'false',  
  'lookup.be part of.cache.ttl' = '12 h'
)

As proven within the previous code, the placement path of the dimension desk is the UFS cache path Uniform Useful resource Identifier (URI). When the enterprise program reads and writes the dimension desk, Alluxio routinely updates the shopper dimension desk knowledge within the cache and asynchronously writes it to the Alluxio backend storage path of the S3 desk to realize desk knowledge synchronization within the knowledge lake.

Flink temporal desk be part of

Flink temporal desk can be a sort of dynamic desk. Every report within the temporal desk is correlated with a number of time fields. Once we be part of the actual fact desk and the dimension desk, we normally have to receive real-time dimension desk knowledge for the lookup be part of. Thus, when creating or becoming a member of a desk, we normally want to make use of the proctime() perform to specify the time subject of the actual fact desk. Once we be part of the tables, we use the syntax of FOR SYSTEM_TIME AS OF to specify the time model of the actual fact desk that corresponds to the time of the lookup dimension desk.

For this submit, the shopper data is a altering dimension desk within the Hive offline desk, whereas the shopper conduct is the actual fact desk in Kafka. We specified the time subject with proctime() within the Flink Kafka supply desk. Then when becoming a member of the Flink Hive desk, we used FOR SYSTEM_TIME AS OF to specify the time subject of the lookup Kafka supply desk to permit us to comprehend the Flink temporal desk be part of operation

As proven within the following code, a truth desk of consumer conduct is created by means of the Kafka Connector in Flink SQL. The ts subject refers back to the timestamp when the temporal desk is joined：

CREATE TABLE logevent_source (`timestamp`  string, 
`system` string,
 actor STRING,
 motion STRING,
 ts as PROCTIME()
) WITH (
'connector' = 'kafka',
'subject' = 'logevent',
'properties.bootstrap.servers' = 'b-6.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092,b-5.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092,b-1.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092 (http://b-6.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092percent2Cb-5.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092percent2Cb-1.msk06.dr04w4.c3.kafka.ap-southeast-1.amazonaws.com:9092/)',
'properties.group.id' = 'testGroup-01',
'scan.startup.mode'='latest-offset',
'format' = 'json'
);

The Flink offline dimension desk and the streaming real-time desk are joined as follows:

choose a.`timestamp`,a.`system`,a.actor,a.motion,b.c_login from 
       (choose *, proctime() as proctime from user_logevent_source) as a 
 left be part of buyer  FOR SYSTEM_TIME AS OF a.proctime as b on a.actor=b.c_last_name;

When the actual fact desk logevent_source joins the lookup dimension desk, the proctime perform ensures real-time joins by acquiring the most recent dimension desk model. This dimension knowledge, cached in Alluxio, delivers considerably higher learn efficiency than direct S3 entry.

On the identical time, the dimension desk knowledge is already cached in Alluxio; the learn efficiency is a lot better than offline knowledge learn on S3.

The comparability check reveals that Alluxio cache brings a transparent efficiency benefit by switching the S3 and Alluxio paths of the shopper dimension desk by means of Hive

You possibly can simply swap the native and cache location paths with alter desk in hive cli:

alter desk buyer set location "s3://xxxxxx/knowledge/s3/30/buyer";
alter desk buyer  set location "alluxio://ip-xxx-xxx-xxx-xxx.ap-southeast-1.compute.inner:19998/s3/30/buyer";

You can even choose the Activity Supervisor log from the Flink dashboard for a cut up check.

The efficiency of the actual fact desk load was doubled by means of the implementation of optimized knowledge processing methods.

Earlier than caching (S3 path learn): 5s load time

2022-06-29 02:54:34,791 INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem           [] - Opening 's3://salunchbucket/knowledge/s3/30/buyer/data-m-00029' for studying
2022-06-29 02:54:39,971 INFO  org.apache.flink.desk.filesystem.FileSystemLookupFunction   [] - Loaded 433000 row(s) into lookup be part of cache

After caching (Alluxio learn): 2s load time

2022-06-29 03:25:14,476 INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem           [] - Opening 's3://salunchbucket/knowledge/s3/30/buyer/data-m-00029' for studying
2022-06-29 03:25:16,397 INFO  org.apache.flink.desk.filesystem.FileSystemLookupFunction   [] - Loaded 433000 row(s) into lookup be part of cache

The timeline on JobManager clearly reveals the distinction in execution period beneath Alluxio and S3 paths.

For single activity question ,we speed up by greater than 1 instances utilizing this resolution. The general job efficiency enchancment is much more seen.

Different optimalizations to contemplate

Implementing a steady be part of requires pulling dimension knowledge each time. Does it result in Flink’s checkpoint state bloat that may trigger Flink TaskManager RocksDB to blow up or reminiscence overflow.

In Flink, the state comes with a TTL mechanism. You possibly can set a TTL expiration coverage to set off Flink to scrub up expired state knowledge. Flink SQL could be set utilizing the trace technique.

insert into logevent_sink
choose a.`timestamp`,a.`system`,a.actor,a.motion,b.c_login from 
(choose *, proctime() as proctime from logevent_source) as a 
  left be part of 
buyer/*+ OPTIONS('lookup.be part of.cache.ttl' = '5 min')*/  FOR SYSTEM_TIME AS OF a.proctime as b 
on a.actor=b.c_last_name;

Flink Desk/Streaming API is comparable:

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.days(7))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInRocksdbCompactFilter() 
    .construct();
ValueStateDescriptor lastUserLogin = 
    new ValueStateDescriptor("lastUserLogin", Lengthy.class);
lastUserLogin.enableTimeToLive(ttlConfig);
StreamTableEnvironment.getConfig().setIdleStateRetentionTime(min, max);

Restart the lookup be part of after the configuration. As you’ll be able to see from the Flink TM log, after TTL expires, it triggers clean-up and re-pull the Hive dimension desk knowledge：

2022-06-29 04:17:09,161 INFO  org.apache.flink.desk.filesystem.FileSystemLookupFunction   
[] - Lookup be part of cache has expired after 5 minute(s), reloading

As well as, you’ll be able to cut back the variety of checkpoint snapshots by configuring Flink state retention and thereby cut back the quantity of house taken up by state on the time of snapshot.

Flink job configuration as observe：
-D state.checkpoints.num-retained=5

After the configuration, you’ll be able to see that within the S3 checkpoint path, the Flink job routinely cleans up historic snapshots and retains the newest 5 snapshots, thus guaranteeing that checkpoint snapshots don’t accumulate.

[hadoop@ip-172-31-41-131 ~]$ aws s3 ls s3://salunchbucket/knowledge/checkpoints/7b9f2f9becbf3c879cd1e5f38c6239f8/
                           PRE chk-3/
                           PRE chk-4/
                           PRE chk-5/
                           PRE chk-6/
                           PRE chk-7/

Abstract

Prospects implementing Flink streaming framework to affix dimension and real-time truth tables steadily encounter efficiency challenges. On this submit, we offered an optimized resolution that makes use of Alluxio’s caching capabilities to routinely load Hive dimension desk knowledge into the UFS cache. By integrating with Flink temporal desk joins, dimension tables are reworked into time-versioned views, successfully addressing efficiency bottlenecks in conventional implementations.

Concerning the creator

Previous articleAWS IAM Id Middle now helps multi-Area replication for AWS account entry and utility use

Next articleThe SpaceX-xAI mega-merger is a gap gambit to a much bigger play

Optimizing Flink’s be part of operations on Amazon EMR with Alluxio

Resolution structure

Optimizing the answer

Resolution implementation particulars

Actual-time truth tables

Caching dimension tables with Alluxio

Flink temporal desk be part of

Different optimalizations to contemplate

Abstract

Concerning the creator

5 Kimi K2.5 Options for Builders [MUST TRY]

AI Just like Sweet AI for When You are Feeling Lonely at 2 AM

Construct an AI Examine Assistant with Claude Code + Android Studio

LEAVE A REPLY Cancel reply

Most Popular

5 Kimi K2.5 Options for Builders [MUST TRY]

Prime 5 Takeaways to Discover Your self within the Way forward for Knowledge Science

Section One Unveils New Sub-Centimeter Mapping Digital camera

Be taught, Scale & Win in a Subscription-First World

Recent Comments

ABOUT US

POPULAR POSTS

5 Kimi K2.5 Options for Builders [MUST TRY]

Prime 5 Takeaways to Discover Your self within the Way forward for Knowledge Science

Section One Unveils New Sub-Centimeter Mapping Digital camera

POPULAR CATEGORY