In this tutorial, we'll explore how to use Microsoft's Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.
We'll cover how to:
- Set up and install the required Presidio packages
- Detect common PII entities such as names, phone numbers, and credit card details
- Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
- Create and register custom anonymizers (such as hashing or pseudonymization)
- Reuse anonymization mappings for consistent re-anonymization
Installing the libraries
To get started with Presidio, you'll need to install the following key libraries:
- presidio-analyzer: The core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: Provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks such as named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
If you are using Jupyter/Colab, you might need to restart the session after installing the libraries.
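Before moving on, you can quickly confirm that the installs succeeded from Python itself. This is just a convenience check with the standard library (the helper name `missing_packages` is our own, not a Presidio API):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Note: the pip package names use hyphens, but the importable
# module names use underscores.
print(missing_packages(["presidio_analyzer", "presidio_anonymizer", "spacy"]))
```

An empty list means everything is importable; any names printed still need installing.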
Presidio Analyzer
Basic PII Detection
In this block, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number in a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.
The AnalyzerEngine loads spaCy's NLP pipeline and predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine; loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call the analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
Creating a Custom PII Recognizer with a Deny List (Academic Titles)
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., "Dr.", "Prof."). The recognizer is added to Presidio's registry and used by the analyzer to scan input text.
While this tutorial covers only the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official docs: Adding Custom Recognizers.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create an analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
Presidio Anonymizer
This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating output from the Presidio Analyzer. These entities represent the names "Bond" and "James Bond" in the sample text.
We use the "replace" operator to substitute both names with a placeholder value ("BIP"), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.
This pattern can easily be extended to apply other built-in operators such as "redact" and "hash", or custom pseudonymization strategies.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (typically coming from presidio-analyzer) and
# operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
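Under the hood, replace-style anonymization boils down to substituting text spans identified by character offsets. A minimal standard-library sketch of that idea (not the Presidio implementation, just the concept):

```python
def replace_spans(text, spans, new_value):
    """Replace each (start, end) span in text with new_value.

    Spans are applied right-to-left so that earlier offsets
    remain valid after each substitution.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_value + text[end:]
    return text

sample = "My name is Bond, James Bond"
print(replace_spans(sample, [(11, 15), (17, 27)], "BIP"))
# → "My name is BIP, BIP"
```

This also shows why the RecognizerResult offsets above (11–15 and 17–27) matter: they are exactly the character ranges being rewritten.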
Custom Entity Recognition, Hash-Based Anonymization, and Consistent Re-Anonymization with Presidio
In this example, we take Presidio a step further by demonstrating:
- ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers
- 🔐 Anonymizing sensitive data with a custom hash-based operator (ReAnonymizer)
- ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original → hashed values
We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output to preserve consistency. This is particularly useful when anonymized data needs to retain some utility, for example when linking records by pseudonymous IDs.
Define a Custom Hash-Based Anonymizer (ReAnonymizer)
This block defines a custom Operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities, and ensures the same input always gets the same anonymized output by storing hashes in a shared mapping.
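The consistency trick is essentially just a dictionary keyed by entity type and original value. Stripped of the Presidio plumbing, the core logic looks like this (a standalone sketch; the `consistent_hash` helper is ours):

```python
import hashlib

def consistent_hash(text, entity_type, mapping):
    """Return a stable pseudonym for text, reusing any prior result."""
    bucket = mapping.setdefault(entity_type, {})
    if text not in bucket:
        # First time we see this value: compute and cache a truncated digest
        bucket[text] = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return bucket[text]

mapping = {}
a = consistent_hash("ABCDE1234F", "IND_PAN", mapping)
b = consistent_hash("ABCDE1234F", "IND_PAN", mapping)
print(a == b)  # → True: same input, same pseudonym
```

Because the mapping is passed in from outside, the same dict can be shared across many anonymization calls, which is exactly what the Presidio operator version does with its `entity_mapping` parameter.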
from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict

class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")

        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash and store (truncated SHA-256 digest)
        hashed = "<ANON_" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
Define Custom PII Recognizers for PAN and Aadhaar Numbers
We define two custom regex-based PatternRecognizers: one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
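As a quick sanity check of the two regex patterns themselves, independent of Presidio, you can exercise them with Python's built-in re module:

```python
import re

# Same patterns as in the recognizers above
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

print(PAN_RE.findall("PAN: ABCDE1234F"))              # matches the PAN
print(AADHAAR_RE.findall("Aadhaar: 1234-5678-9123"))  # matches the Aadhaar
print(PAN_RE.findall("not a pan: abcde1234f"))        # lowercase: no match
```

Note the `\b` word boundaries: without them the PAN pattern would also fire inside longer alphanumeric strings.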
Set Up the Analyzer and Anonymizer Engines
Here we set up the Presidio AnalyzerEngine, register the custom recognizers, and add the custom anonymizer to the AnonymizerEngine.
from presidio_anonymizer import AnonymizerEngine, OperatorConfig

# Initialize the analyzer and register the custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Initialize the anonymizer and add the custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}
Analyze and Anonymize the Input Texts
We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they are anonymized consistently across both inputs.
from pprint import pprint

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize the first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize the second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
View the Anonymization Results and Mapping
Finally, we print both anonymized outputs and inspect the mapping used internally to maintain consistent hashes across values.
print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)

print("\n📦 Mapping used:")
pprint(entity_mapping)