Getting Started with Mirascope: Removing Semantic Duplicates using an LLM

July 16, 2025


Mirascope is a powerful and user-friendly library that provides a unified interface for working with a wide range of Large Language Model (LLM) providers, including OpenAI, Anthropic, Mistral, Google (Gemini and Vertex AI), Groq, Cohere, LiteLLM, Azure AI, and Amazon Bedrock. It simplifies everything from text generation and structured data extraction to building complex AI-powered workflows and agent systems.

In this guide, we’ll focus on using Mirascope’s OpenAI integration to identify and remove semantic duplicates (entries that may differ in wording but carry the same meaning) from a list of customer reviews.

Installing the dependencies

pip install "mirascope[openai]"

OpenAI Key

To get an OpenAI API key, go to https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Defining the list of customer reviews

customer_reviews = [
    "Sound quality is amazing!",
    "Audio is crystal clear and very immersive.",
    "Incredible sound, especially the bass response.",
    "Battery doesn't last as advertised.",
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Setup was super easy and straightforward.",
    "Very user-friendly, even for my parents.",
    "Simple interface and smooth experience.",
    "Feels cheap and plasticky.",
    "Build quality could be better.",
    "Broke within the first week of use.",
    "People say they can't hear me during calls.",
    "Mic quality is terrible on Zoom meetings.",
    "Great product for the price!"
]

These reviews capture key customer sentiments: praise for sound quality and ease of use, complaints about battery life, build quality, and call/mic issues, along with a positive note on value for money. They reflect common themes found in real user feedback.

Defining a Pydantic Schema

This Pydantic model defines the response structure for a semantic deduplication task on customer reviews. It is used to structure and validate the output of a language model that clusters or deduplicates natural language input (e.g., user feedback, bug reports, product reviews).

from pydantic import BaseModel, Field

class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )
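As a quick offline check of the schema, the sketch below validates a hand-written payload against the model and shows that an incomplete payload is rejected. It assumes Pydantic v2 (for model_validate); the sample dictionary is hypothetical illustration data, not real model output, and the class is repeated only so the snippet runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )


# Hypothetical payload shaped like a structured response (not real LLM output).
sample = {
    "duplicates": [
        ["Needs charging too often.", "Battery drains quickly -- not ideal for travel."]
    ],
    "reviews": ["Battery life is shorter than expected."],
}

# Assumes Pydantic v2: model_validate parses and validates a plain dict.
parsed = DeduplicatedReviews.model_validate(sample)
print(parsed.reviews)

# Both fields are required, so a payload without "duplicates" fails validation.
try:
    DeduplicatedReviews.model_validate({"reviews": ["Only one field"]})
    rejected = False
except ValidationError:
    rejected = True
print("Rejected incomplete payload:", rejected)
```

The same validation is what makes a malformed response surface as an error instead of silently passing through to later code.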

Defining a Mirascope @openai.call for Semantic Deduplication

This code defines a semantic deduplication function using Mirascope’s @openai.call decorator, which enables seamless integration with OpenAI’s gpt-4o model. The deduplicate_customer_reviews function takes a list of customer reviews and uses a structured prompt, defined by the @prompt_template decorator, to guide the LLM in identifying and grouping semantically similar reviews.

The system message instructs the model to analyze the meaning, tone, and intent behind each review, clustering those that convey the same feedback even when worded differently. The function expects a structured response conforming to the DeduplicatedReviews Pydantic model, which includes two outputs: a list of unique, deduplicated review sentiments, and a list of grouped duplicates.

This design makes the LLM’s output machine-readable and validated against a fixed schema, which suits customer feedback analysis, survey deduplication, or product review clustering.

from mirascope.core import openai, prompt_template

@openai.call(model="gpt-4o", response_model=DeduplicatedReviews)
@prompt_template(
    """
    SYSTEM:
    You are an AI assistant helping to analyze customer reviews.
    Your task is to group semantically similar reviews together -- even if they are worded differently.

    - Use your understanding of meaning, tone, and implication to group duplicates.
    - Return two lists:
      1. A deduplicated list of the key distinct review sentiments.
      2. A list of grouped duplicates that share the same underlying feedback.

    USER:
    {reviews}
    """
)
def deduplicate_customer_reviews(reviews: list[str]): ...

The following code executes the deduplicate_customer_reviews function using a list of customer reviews and prints the structured output. First, it calls the function and stores the result in the response variable. To ensure that the model’s output conforms to the expected format, it uses an assert statement to validate that the response is an instance of the DeduplicatedReviews Pydantic model.

Once validated, it prints the deduplicated results in two sections. The first section, labeled “✅ Distinct Customer Feedback,” displays the list of unique review sentiments identified by the model. The second section, “🌀 Grouped Duplicates,” lists clusters of reviews that were recognized as semantically equivalent.

response = deduplicate_customer_reviews(customer_reviews)

# Ensure response format
assert isinstance(response, DeduplicatedReviews)

# Print Output
print("✅ Distinct Customer Feedback:")
for item in response.reviews:
    print("-", item)

print("\n🌀 Grouped Duplicates:")
for group in response.duplicates:
    print("-", group)

The output shows a clean summary of customer feedback by grouping semantically similar reviews. The Distinct Customer Feedback section highlights key insights, while the Grouped Duplicates section captures different phrasings of the same sentiment. This helps eliminate redundancy and makes the feedback easier to analyze.
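Because the groups are produced by a language model, it can be worth checking them against the input before relying on them. The sketch below is one possible sanity check, not part of Mirascope: check_grouping is a hypothetical helper that reports grouped strings that never appeared in the original list and original reviews that ended up in no group. The sample data is made up for illustration.

```python
def check_grouping(original: list[str], groups: list[list[str]]) -> dict[str, list[str]]:
    """Compare duplicate groups returned by a model with the original reviews."""
    original_set = set(original)
    # Flatten the groups into a single list of review strings.
    grouped = [review for group in groups for review in group]
    grouped_set = set(grouped)
    return {
        # Strings in the groups that were never part of the input.
        "unknown": [r for r in grouped if r not in original_set],
        # Input reviews that were not placed in any group.
        "ungrouped": [r for r in original if r not in grouped_set],
    }


# Hypothetical data: the second grouped string differs from the input by one dash.
original = [
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Great product for the price!",
]
groups = [
    ["Needs charging too often.", "Battery drains quickly - not ideal for travel."]
]

report = check_grouping(original, groups)
print("Unknown:", report["unknown"])
print("Ungrouped:", report["ungrouped"])
```

With real output, the same call would be check_grouping(customer_reviews, response.duplicates). The comparison is an exact string match, so a review the model rephrased slightly shows up under both keys; that is deliberate here, since it flags text that no longer matches the source.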


I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.
