HomeBig DataConstruct an information pipeline from Google Search Console to Amazon Redshift utilizing...

Construct an information pipeline from Google Search Console to Amazon Redshift utilizing AWS Glue


Google Search Console (GSC) is a service provided by Google that helps you monitor, preserve, and troubleshoot your website’s presence in Google Search outcomes. It supplies you distinctive insights straight from Google about how the search engine sees your website, serving to you enhance your efficiency in Search Engine Outcomes Pages (SERPs).

When there’s a have to merge Google Search Console information with a number of information sources or conduct advanced efficiency evaluation, conventional strategies can grow to be time-consuming and error-prone. That is the place Amazon Redshift and AWS Glue supply a complete information integration resolution.

On this put up, we discover how AWS Glue extract, remodel, and cargo (ETL) capabilities join Google functions and Amazon Redshift, serving to you unlock deeper insights and drive data-informed choices by means of automated information pipeline administration. We stroll you thru the method of utilizing AWS Glue to combine information from Google Search Console and write it to Amazon Redshift.

Answer overview

AWS Glue is a serverless information integration service that helps uncover, put together, and mix information for analytics, machine studying (ML), and software growth. You should use AWS Glue to create, run, and monitor information integration and ETL pipelines and catalog your property throughout a number of information shops.

Amazon Redshift is a quick, scalable, and absolutely managed cloud information warehouse that permits you to to course of and run advanced SQL analytics workloads on structured and semi-structured information. It additionally helps you securely entry your information in operational databases, information lakes, or third-party datasets with minimal motion or copying of knowledge. Tens of hundreds of consumers use Amazon Redshift to course of giant quantities of knowledge, modernize their information analytics workloads, and supply insights for his or her enterprise customers.

The next diagram illustrates the structure that we implement on this put up.

Architecture diagram showing AWS Glue data pipeline workflow from Google Search Console to Amazon Redshift, illustrating the ETL process with AWS Glue job reading data from three Google Search Console entities (Search Analytics, Sites, and Sitemaps) and writing to a Redshift provisioned cluster.

The workflow consists of an AWS Glue job studying information from Google Search Console for the three entities that Google Search Console helps (Search Analytics, Websites, and Sitemaps), and writing the information in a Redshift provisioned cluster. AWS Glue helps Google Search Console API v3.

Within the following sections, we stroll by means of the next steps to configure AWS Glue to arrange a connection between Google Search Console and Amazon Redshift for information migration:

  1. Create an OAuth shopper.
  2. Create an IAM position for AWS Glue integration with Google Search Console, AWS Secrets and techniques Supervisor, and Amazon Redshift.
  3. Create a secret in Secrets and techniques Supervisor to retailer the shopper secret created within the earlier step.
  4. Create a connection to Google Search Console in AWS Glue.
  5. Create a connection to Amazon Redshift in AWS Glue.
  6. Arrange a desk and permissions in Amazon Redshift.
  7. Create an ETL job in AWS Glue.

Conditions

Earlier than beginning this walkthrough, it’s essential to have the next stipulations in place:

  • An AWS account.
  • A Google Cloud account and a Google Cloud venture.
  • In your Google Cloud venture, it’s essential to allow the Google Search Console API.

    For directions, see Allow and disable APIs on the API Console Assist for Google Cloud Platform.
  • A provisioned cluster or Amazon Redshift Serverless .

    On this put up, we use a single-node ra3.giant Redshift provisioned cluster deployed in a single Availability Zone. This configuration is used for demonstration functions solely. For manufacturing environments, we suggest utilizing multi-node clusters with a minimal of two nodes deployed throughout a number of Availability Zones for top availability and higher efficiency.
  • An Amazon Easy Service Storage (Amazon S3) bucket.
  • An AWS Id and Entry Administration (IAM) position that grants AWS Glue and Amazon Redshift read-only entry to Amazon S3. This position will probably be connected to the Redshift cluster or Redshift Serverless namespace throughout creation, and also will be used when working the AWS Glue job together with permissions to learn and write secrets and techniques to Secrets and techniques Supervisor. Discuss with the Amazon Redshift Database Developer Information for extra particulars.

Create OAuth shopper

To connect with Google Search Console, AWS Glue requires OAuth 2.0 for authentication. You need to create an OAuth 2.0 shopper ID, which AWS Glue makes use of when requesting an OAuth 2.0 entry token. To create an OAuth 2.0 shopper ID within the Google Cloud Platform console, comply with these steps:

  1. On the Google Cloud Platform console, from the initiatives checklist, select a venture or create a brand new one.
  2. If the APIs & Companies web page isn’t already open, select the menu icon on the higher left and select APIs & Companies.
  3. Within the navigation pane, select Credentials.
  4. Select Create Credentials, then select OAuth shopper ID.
  5. Choose Internet software as the applying sort, enter NewClient because the title, and supply https://console.aws.amazon.com for Licensed JavaScript origins.
  6. For Licensed redirect URIs, add https://us-east-1.console.aws.amazon.com/gluestudio/oauth. This instance makes use of us-east-1 for establishing AWS Glue jobs; change the redirect URIs in accordance with your AWS Area. A number of redirect URIs will also be specified.
  7. Select Create.
  8. Open the main points web page on your new shopper.
  9. Underneath Further info, be aware down the shopper ID and shopper secret. You have to these particulars when configuring the key in Secrets and techniques Supervisor.

Create IAM position for AWS Glue integration with Google Search Console, Secrets and techniques Supervisor, and Amazon Redshift

You should use AWS Glue to switch information from supported sources into your Redshift databases. You want an IAM position as a result of AWS Glue wants authorization to jot down into Redshift databases. To create a job, full the next steps:

  1. Check in to the IAM console with ample entry to create insurance policies.
  2. Select Insurance policies within the navigation pane.
  3. Select Create coverage.
  4. On the JSON tab, enter the next coverage. AWS Glue wants the next permissions to entry and run SQL statements within the Redshift database and create and retrieve secrets and techniques with Secrets and techniques Supervisor:
    {
        "Model": "2012-10-17",
        "Assertion": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:PutSecretValue",
                    "ec2:CreateNetworkInterface",
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DeleteNetworkInterface"
                ],
                "Useful resource": "*"
            },
            {
                "Impact": "Permit",
                "Motion": "s3:GetObject",
                "Useful resource": "arn:aws:s3:::aws-glue-studio-transforms-510798373988-prod-us-east-1/*"
            },
            {
                "Impact": "Permit",
                "Motion": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Useful resource": [
                    "arn:aws:s3:::aws-glue-assets-testbucket/*"
                ]
            },
            {
                "Sid": "DataAPIPermissions",
                "Impact": "Permit",
                "Motion": [
                    "redshift-data:ExecuteStatement",
                    "redshift-data:GetStatementResult",
                    "redshift-data:DescribeStatement"
                ],
                "Useful resource": "*"
            },
            {
                "Sid": "GetCredentialsForAPIUser",
                "Impact": "Permit",
                "Motion": "redshift:GetClusterCredentials",
                "Useful resource": [
                    "arn:aws:redshift:*:*:dbname:*/*",
                    "arn:aws:redshift:*:*:dbuser:*/*"
                ]
            },
            {
                "Sid": "GetCredentialsForServerless",
                "Impact": "Permit",
                "Motion": "redshift-serverless:GetCredentials",
                "Useful resource": "*"
            },
            {
                "Sid": "DenyCreateAPIUser",
                "Impact": "Deny",
                "Motion": "redshift:CreateClusterUser",
                "Useful resource": [
                    "arn:aws:redshift:*:*:dbuser:*/*"
                ]
            },
            {
                "Sid": "ServiceLinkedRole",
                "Impact": "Permit",
                "Motion": "iam:CreateServiceLinkedRole",
                "Useful resource": "arn:aws:iam::*:position/aws-service-role/redshift-data.amazonaws.com/AWSServiceRoleForRedshift",
                "Situation": {
                    "StringLike": {
                        "iam:AWSServiceName": "redshift-data.amazonaws.com"
                    }
                }
            }
        ]
    }

    Modify the S3 bucket title that you’re utilizing because the staging bucket. Moreover, AWS Glue will need to have entry to particular AWS owned S3 buckets for internet hosting AWS Glue transforms. On this instance, the IAM coverage makes use of aws-glue-studio-transforms-510798373988-prod-us-east-1, which is the AWS owned bucket within the us-east-1 Area. Discuss with Evaluate IAM permissions wanted for ETL jobs for the suitable bucket title on your Area.

  5. Select Subsequent.
  6. For Coverage title, enter a reputation (for this put up, we use glue-redshift-gsc-policy).
  7. Enter an outline, then select Create coverage.
  8. Within the navigation pane, select Roles and Create position.
  9. Select Customized belief coverage and enter the next, then select Subsequent.
    {
        "Model": "2012-10-17",
        "Assertion": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "glue.amazonaws.com"
                    ]
                },
                "Motion": "sts:AssumeRole"
            }
        ]
    }
    

  10. Seek for and choose the coverage glue-redshift-gsc-policy, then select Subsequent.
  11. Present the position title GlueIAMRoleRedshiftNew or one other title and related Description, then select Create position.
  12. After the position is created, select Add permissions and Connect insurance policies.
  13. Seek for AWSGlueServiceRole and select Add Permissions. This coverage is usually connected to roles specified when defining crawlers, jobs, and growth endpoints.

Screenshot of AWS IAM console showing the policy attachment interface where the AWSGlueServiceRole policy is being added to the GlueIAMRoleRedshiftNew role.

Create secret in Secrets and techniques Supervisor

Full the next steps to create a Secrets and techniques Supervisor secret:

  1. On the Secrets and techniques Supervisor console, select Retailer a brand new secret.
  2. Choose Different sort of secret.
  3. For the customer-managed linked software, the key ought to comprise the linked software’s client secret with USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET as the important thing and the shopper secret worth as created within the earlier step.

    Screenshot of AWS Secrets Manager console showing the "Store a new secret" interface with "Other type of secret" selected and a key-value pair entry for USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET.
  4. Select Subsequent.
  5. Enter a secret title and select Subsequent.
  6. Select Retailer.

Create connection to Google Search Console in AWS Glue

To create a connection to Google Search Console in AWS Glue, comply with these steps:

  1. Check in to the AWS Glue console with a certified e mail ID with permissions already offered in Google Search Console.
  2. Within the navigation pane, select Knowledge connections.
  3. Underneath Connections, select Create connection.
  4. In Knowledge sources, seek for Google Search Console and select Subsequent.

    Screenshot of AWS Glue console showing the Data connections page with Google Search Console selected as a data source in the connection creation wizard.
  5. For IAM Position ARN, select the position created earlier.
  6. For Token URL, use https://oauth2.googleapis.com/token, which is the default worth.
  7. For Person Managed Consumer Software ClientId, enter the shopper ID created earlier whereas creating the OAuth shopper.
  8. For AWS Secret, select the key created earlier.
  9. In case your AWS Glue jobs must run in an Amazon digital personal cloud (VPC), present applicable particulars. For extra info, seek advice from Configure a VPC on your ETL job.

    Screenshot of AWS Glue connection configuration form showing fields for IAM Role ARN, Token URL, User Managed Client Application ClientId, AWS Secret selection, and VPC configuration options
  10. Select Check connection, select your Google ID, and select Proceed.

    Google account selection dialog prompting the user to choose which Google account to use for authentication with the AWS Glue connection.
  11. Select Proceed to belief the connection.

    Google OAuth consent screen asking the user to continue and trust the connection between AWS Glue and their Google account.

    If the consumer has licensed entry, the connection take a look at will probably be profitable.

    AWS Glue console showing a successful connection test result with a green checkmark indicating the Google Search Console connection was established successfully.

  12. Select Subsequent.
  13. Present a connection title and select Create connection.

Create connection to Amazon Redshift in AWS Glue

Full the next steps to arrange an AWS Glue connection for Amazon Redshift. Discuss with Redshift connections for extra info.

  1. On the AWS Glue console, within the navigation pane, select Knowledge connections.
  2. Underneath Connections, select Create connection.
  3. In Knowledge sources, seek for JDBC and select Subsequent. For Amazon Redshift, you too can use Redshift connections. On this put up, we use JDBC. On this instance, we’re utilizing a Redshift provisioned cluster.
  4. Present the Amazon Redshift JDBC URL and both use a Secrets and techniques Supervisor secret for storing credentials or present the consumer title and password straight. As a greatest apply, it is strongly recommended to make use of Secrets and techniques Supervisor.
  5. Configure community choices with Amazon VPC settings for working the AWS Glue job in a VPC. On this instance, we use the identical VPC, subnet, and safety group the place the Redshift cluster is provisioned. All JDBC information shops have to be accessible from the VPC subnet. A VPC endpoint is required to entry Amazon S3 from inside your VPC. In case your job must entry each VPC sources and the general public web, configure a NAT gateway within the VPC.Screenshot of AWS Glue connection configuration for Amazon Redshift showing JDBC URL entry, credentials configuration options (Secrets Manager or direct username/password), and VPC network settings including VPC, subnet, and security group selections.

Arrange desk and permissions in Amazon Redshift

To arrange desk and permissions in Amazon Redshift, comply with these steps:

  1. On the Amazon Redshift console, select Question editor v2.
  2. Connect with your current Redshift cluster.
  3. Create a desk with the next DDL. For this put up, we create a brand new database named take a look at and create the next tables within the public schema of take a look at database:
    #Create Database command
    CREATE DATABASE take a look at; 
    
    #Sitemap desk creation
    CREATE TABLE public.sitemap(
        path VARCHAR(4096) ENCODE lzo,
        sort VARCHAR(255) ENCODE lzo,
        lastSubmitted TIMESTAMP ENCODE delta,
        isPending BOOLEAN NULL ENCODE uncooked,
        isSitemapsIndex BOOLEAN NULL ENCODE uncooked,
        lastDownloaded TIMESTAMP NULL ENCODE delta,
        warnings BIGINT NULL ENCODE delta,
        errors BIGINT NULL ENCODE delta,
        contents VARCHAR(65535) NULL ENCODE lzo) DISTSTYLE AUTO;
        
    #Search Analytics desk creation
    CREATE TABLE public.search_analytics (
        keys character various(2048) ENCODE lzo,
        clicks double precision ENCODE uncooked,
        impressions double precision ENCODE uncooked,
        ctr numeric(38, 18) ENCODE az64,
        place double precision ENCODE uncooked
    ) DISTSTYLE AUTO;
    
    #Websites desk creation
     CREATE TABLE public.websites (
        siteurl character various(2048) ENCODE lzo,
        permissionLevel character various(50) ENCODE lzo
    ) DISTSTYLE AUTO;

    Screenshot of AWS Glue ETL job visual editor showing the job creation interface with source and target selection options, displaying Google Search Console as source and Amazon Redshift as target.

Create ETL job in AWS Glue

To create an information move in AWS Glue, comply with these steps:

  1. On the AWS Glue console, select ETL jobs within the navigation pane.
  2. Select Visible ETL below Create job.

    Every ETL job in AWS Glue is priced based mostly on its period.

    Screenshot of AWS Glue visual ETL canvas showing a data flow diagram with Google Search Console source node connected to Amazon Redshift target node.

  3. For the supply, select Google Search Console, and for the goal, select Amazon Redshift.

    Screenshot of AWS Glue source node configuration panel showing Google Search Console connection settings with entity selection (Sites) and field selection options (siteUrl and permissionLevel).
  4. Select Supply (Google Search Console) to configure the properties, which opens in the suitable window pane.
  5. Select the Google Search Console connection created within the earlier sections, and supply the entity title. On the time of writing, there are three supported entities: Search Analytics, Websites, and Sitemaps, with a number of supported fields and operators for every entity. Select the entity title and the corresponding fields; by default, the connector selects all fields. The instance reveals choosing the entity Website and corresponding fields siteUrl and permissionLevel.

    Screenshot of AWS Glue target node configuration panel showing Amazon Redshift connection settings including schema selection, table name, data handling method (Append to target table), and S3 staging directory configuration.
  6. Select Target (Amazon Redshift) to configure the properties, which opens in the suitable pane.
  7. Select the Amazon Redshift connection, schema, and desk title that had been created within the earlier steps. On this instance, we use Append to focus on desk as the tactic for dealing with the information. An S3 listing is offered for staging non permanent information.

    Screenshot of AWS Glue target node configuration panel showing Amazon Redshift connection settings including schema selection, table name, data handling method (Append to target table), and S3 staging directory configuration.
  8. Navigate to Job particulars and supply a job title and IAM position (which the job will assume whereas working). This is similar position created earlier.
  9. Select Save and Run. For this instance, we use AWS Glue model 5.0, maintaining all different configuration values below Job particulars at their defaults. For this instance, now we have not applied any schema mapping, so the columns in Amazon Redshift had been created to match the output response for the Search entity.
  10. After the job has accomplished efficiently, navigate to Question Editor v2 in Amazon Redshift and question the Websites desk to preview the information.

    Screenshot of Amazon Redshift Query Editor v2 showing query results from the Sites table with columns for siteurl and permissionlevel, displaying sample data rows.Screenshot of Amazon Redshift Query Editor v2 showing query results from the Sites table with columns for siteurl and permissionlevel, displaying sample data rows.
  11. Within the case of job failures, validate the connections by doing an information preview, and seek advice from Troubleshooting AWS Glue.
  12. Much like the Website entity, you’ll be able to load Sitemap entity information by altering the supply properties and vacation spot desk within the goal Redshift cluster, then selecting Run.

    Screenshot of AWS Glue source node configuration showing Google Search Console entity selection changed to Sitemaps with corresponding fields selected.
  13. Navigate to Question Editor v2 in Amazon Redshift and question the sitemap desk to preview the information.

    Screenshot of Amazon Redshift Query Editor v2 showing query results from the sitemap table with columns including path, type, lastsubmitted, ispending, issitemapsindex, lastdownloaded, warnings, errors, and contents.
  14. Much like Sitemap, you’ll be able to load Search Analytics entity information by altering the supply properties and vacation spot desk within the goal Redshift cluster, then selecting Run.

    Screenshot of AWS Glue source node configuration showing Google Search Console entity selection changed to Search Analytics with corresponding fields selected.
  15. Navigate to Question Editor v2 in Amazon Redshift and question the search_analytics desk and preview the information.

    Screenshot of Amazon Redshift Query Editor v2 showing query results from the search_analytics table with columns for keys, clicks, impressions, ctr, and position.

Filter predicates with Search Analytics

The Search Analytics entity supplies help for a number of filters that can be utilized to view the visitors information for the websites. The next examples present use of some filter predicates you should use that Google Search Console connections help.

  • start_end_date – The default worth for start_end_date is between AND . To make use of a distinct date vary, use the between The next instance shows search information from January by means of September 2025:
    start_end_date between '2025-01-01' AND '2025-09-30'

    Screenshot of AWS Glue source node configuration showing Search Analytics entity with a filter predicate for start_end_date between '2025-01-01' AND '2025-09-30'.

  • machine – The machine filters outcome in opposition to specified machine sort like DESKOP, MOBILE, and TABLET:

    Screenshot of AWS Glue source node configuration showing Search Analytics entity with a filter predicate for device=

  • nation – You’ll be able to filter in opposition to the desired nation, as specified by three-letter nation code (ISO 3166-1 alpha-3):

    Screenshot of AWS Glue source node configuration showing Search Analytics entity with dimensions set to 'country'.

  • dimensions: Dimensions assist group zero or extra outcomes for filtering search information by nation or machine. The next instance shows search information grouped by nation, and likewise grouping by nation and filtering for cellular units:
    dimensions="nation" AND nation='ind' AND machine="MOBILE"

    Screenshot of AWS Glue source node configuration showing Search Analytics entity with multiple filter predicates including dimensions=

Run analytical queries on Amazon Redshift

On this part, we run analytical queries utilizing aggregated information throughout totally different search entities.

Record all international locations the place website place is lower than 10 and machine sort is MOBILE:

SELECT * from search_analytics_device_country the place place 

Screenshot of Amazon Redshift Query Editor v2 showing query results for countries where site position is less than 10 and device type is MOBILE, displaying data from the search_analytics_device_country table.

Record all international locations the place impressions are larger than 1 and place is lower than 10:

SELECT * FROM "take a look at"."public"."search_analytics_country" the place impressions > 1 and place 

Screenshot of Amazon Redshift Query Editor v2 showing query results for countries where impressions are greater than 1 and position is less than 10, displaying data from the search_analytics_country table.

Clear up

To keep away from incurring expenses, clear up the sources in your AWS account by finishing the next steps:

  1. On the AWS Glue console, within the navigation pane, select Job monitoring.
  2. Cease any working jobs created for Google Search Console connections.
  3. From the checklist of connections, choose the connection title created and delete it.
  4. Delete the Redshift provisioned cluster or the Redshift Serverless workspace and namespace. Amazon Redshift pricing is utilized through the cluster’s runtime based mostly on cluster configuration.
  5. Clear up sources in your Google account by deleting the venture that incorporates the Google Mission sources. For directions, seek advice from Delete your venture.

Conclusion

On this put up, we walked you thru the method of utilizing AWS Glue to combine information from Google Search Console and write it to Amazon Redshift, a petabyte-scale information warehouse. Whether or not you’re archiving historic information, performing advanced analytics, or getting ready information for machine studying, this connector streamlines the method and helps create an built-in information pipeline.

For extra info, seek advice from AWS Glue help for Google Search Console.


In regards to the authors

Anirudh Chawla

Anirudh Chawla

Anirudh is an AWS Analytics Specialist Options Architect. He likes to learn books, take lengthy walks in nature, and take part in group applications.

Shubham Purwar

Shubham Purwar

Shubham is an AWS Analytics Specialist Answer Architect. In his free time, Shubham likes to spend time together with his household and journey all over the world.

Shaswat Mandhanya

Shaswat Mandhanya

Shaswat is an AWS Analytics Specialist BD. In his free time, he likes to observe Components 1 races and journey throughout the nation.

Prabhu G

Prabhu G

Prabhu is a Options Architect at AWS. He’s an avid supporter of Chennai Tremendous Kings and a big-time fan of MS Dhoni.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments