
How to Build an Advanced BrightData Web Scraper with Google Gemini for AI-Powered Data Extraction


In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData's powerful proxy network alongside Google's Gemini API for intelligent data extraction. You'll see how to structure your Python project, install and import the necessary libraries, and encapsulate scraping logic within a clean, reusable BrightDataScraper class. Whether you're targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the scraper's modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, empowering you to pose natural language queries for on-the-fly data analysis.

!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai

We install all of the key libraries needed for the tutorial in a single step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.

import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google's Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.

class BrightDataScraper:
    """Enhanced web scraper using the BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)

        # The Gemini-powered agent is only created when a Google API key is supplied
        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run the AI agent with a natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return

        try:
            # Stream intermediate agent steps, printing the latest message each time
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")

        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()

The BrightDataScraper class encapsulates all BrightData web-scraping logic and optional Gemini-powered intelligence under a single, reusable interface. Its methods let you easily fetch Amazon product details, bestseller lists, and LinkedIn profiles, handling API calls, errors, and JSON formatting, and even stream natural-language agent queries when a Google API key is supplied. A convenient print_results helper ensures your output is always cleanly formatted for inspection.
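To see how the pieces fit together before the full demo below, here is a minimal usage sketch that instantiates the class with keys read from environment variables (the variable names are just a suggested convention, not required by either API) and runs a single product scrape:

import os

# Assumes BRIGHT_DATA_API_KEY (and optionally GOOGLE_API_KEY) were exported beforehand;
# the Gemini key is only needed if you plan to call run_agent_query.
scraper = BrightDataScraper(
    api_key=os.environ["BRIGHT_DATA_API_KEY"],
    google_api_key=os.environ.get("GOOGLE_API_KEY"),
)

result = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG")
scraper.print_results(result, "Quick Test")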

def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("🛍️ Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("📦 Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("👤 Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("🤖 Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)

The main() function ties everything together by setting your BrightData and Google API keys, instantiating the BrightDataScraper, and then demonstrating each feature: it scrapes Amazon India's bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.

if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()

Finally, this entry-point block ensures that, when run as a standalone script, the required scraping libraries are quietly installed and the BrightData API key is set in the environment. The main function is then executed to initiate all scraping and agent workflows.
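Hardcoding placeholder keys is fine for a demo, but in a notebook you may prefer to prompt for them at runtime instead. A minimal sketch using the standard library's getpass:

import os
from getpass import getpass

# Prompt for keys interactively so they never appear in the notebook source.
os.environ["BRIGHT_DATA_API_KEY"] = getpass("BrightData API key: ")
os.environ["GOOGLE_API_KEY"] = getpass("Google API key (optional): ")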

In conclusion, by the end of this tutorial, you'll have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you're now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-driven applications.
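As a concrete starting point for such an extension, a new dataset type only needs one more method following the same pattern as the class above. The dataset_type identifier below is an assumption for illustration, so verify it against BrightData's documentation before relying on it:

class ExtendedBrightDataScraper(BrightDataScraper):
    """Sketch of adding another dataset type to the same interface."""

    def scrape_linkedin_company(self, url: str) -> Dict[str, Any]:
        """Scrape a LinkedIn company page (dataset_type assumed, not confirmed)."""
        try:
            results = self.scraper.invoke({
                "url": url,
                # Hypothetical identifier -- check BrightData's docs for the
                # dataset types your account actually supports.
                "dataset_type": "linkedin_company_profile",
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}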


Check out the Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
