In this tutorial, we'll explore how to use Microsoft's Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.
We'll cover how to:
- Set up and install the required Presidio packages
- Detect common PII entities such as names, phone numbers, and credit card details
- Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
- Create and register custom anonymizers (such as hashing or pseudonymization)
- Reuse anonymization mappings for consistent re-anonymization
Installing the libraries
To get started with Presidio, you'll need to install the following key libraries:
- presidio-analyzer: The core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: Provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks such as named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
If you are using Jupyter/Colab, you might need to restart the session after installing the libraries.
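Before moving on, you can quickly confirm that the installs succeeded from Python itself. This is just a convenience check with the standard library (the helper name `missing_packages` is our own, not a Presidio API):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Note: the pip package names use hyphens, but the importable
# module names use underscores.
print(missing_packages(["presidio_analyzer", "presidio_anonymizer", "spacy"]))
```

An empty list means everything is importable; any names printed still need installing.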
Presidio Analyzer
Basic PII Detection
In this block, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number in a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.
The AnalyzerEngine loads spaCy's NLP pipeline and predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine; loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call the analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
Creating a Custom PII Recognizer with a Deny List (Academic Titles)
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., "Dr.", "Prof."). The recognizer is added to Presidio's registry and used by the analyzer to scan input text.
While this tutorial covers only the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official docs: Adding Custom Recognizers.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create an analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
Presidio Anonymizer
This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating output from the Presidio Analyzer. These entities represent the names "Bond" and "James Bond" in the sample text.
We use the "replace" operator to substitute both names with a placeholder value ("BIP"), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.
This pattern can easily be extended to apply other built-in operators such as "redact" and "hash", or custom pseudonymization strategies.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (typically coming from presidio-analyzer) and
# operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
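Under the hood, replace-style anonymization boils down to substituting text spans identified by character offsets. A minimal standard-library sketch of that idea (not the Presidio implementation, just the concept):

```python
def replace_spans(text, spans, new_value):
    """Replace each (start, end) span in text with new_value.

    Spans are applied right-to-left so that earlier offsets
    remain valid after each substitution.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_value + text[end:]
    return text

sample = "My name is Bond, James Bond"
print(replace_spans(sample, [(11, 15), (17, 27)], "BIP"))
# → "My name is BIP, BIP"
```

This also shows why the RecognizerResult offsets above (11–15 and 17–27) matter: they are exactly the character ranges being rewritten.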
Custom Entity Recognition, Hash-Based Anonymization, and Consistent Re-Anonymization with Presidio
In this example, we take Presidio a step further by demonstrating:
- ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers
- 🔐 Anonymizing sensitive data with a custom hash-based operator (ReAnonymizer)
- ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original → hashed values
We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output to preserve consistency. This is particularly useful when anonymized data needs to retain some utility, for example when linking records by pseudonymous IDs.
Define a Custom Hash-Based Anonymizer (ReAnonymizer)
This block defines a custom Operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities, and ensures the same input always gets the same anonymized output by storing hashes in a shared mapping.
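The consistency trick is essentially just a dictionary keyed by entity type and original value. Stripped of the Presidio plumbing, the core logic looks like this (a standalone sketch; the `consistent_hash` helper is ours):

```python
import hashlib

def consistent_hash(text, entity_type, mapping):
    """Return a stable pseudonym for text, reusing any prior result."""
    bucket = mapping.setdefault(entity_type, {})
    if text not in bucket:
        # First time we see this value: compute and cache a truncated digest
        bucket[text] = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return bucket[text]

mapping = {}
a = consistent_hash("ABCDE1234F", "IND_PAN", mapping)
b = consistent_hash("ABCDE1234F", "IND_PAN", mapping)
print(a == b)  # → True: same input, same pseudonym
```

Because the mapping is passed in from outside, the same dict can be shared across many anonymization calls, which is exactly what the Presidio operator version does with its `entity_mapping` parameter.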
from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict

class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")

        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash and store (truncated SHA-256 digest)
        hashed = "<ANON_" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
Define Custom PII Recognizers for PAN and Aadhaar Numbers
We define two custom regex-based PatternRecognizers: one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
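As a quick sanity check of the two regex patterns themselves, independent of Presidio, you can exercise them with Python's built-in re module:

```python
import re

# Same patterns as in the recognizers above
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

print(PAN_RE.findall("PAN: ABCDE1234F"))              # matches the PAN
print(AADHAAR_RE.findall("Aadhaar: 1234-5678-9123"))  # matches the Aadhaar
print(PAN_RE.findall("not a pan: abcde1234f"))        # lowercase: no match
```

Note the `\b` word boundaries: without them the PAN pattern would also fire inside longer alphanumeric strings.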
Set Up the Analyzer and Anonymizer Engines
Here we set up the Presidio AnalyzerEngine, register the custom recognizers, and add the custom anonymizer to the AnonymizerEngine.
from presidio_anonymizer import AnonymizerEngine, OperatorConfig

# Initialize the analyzer and register the custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Initialize the anonymizer and add the custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}
Analyze and Anonymize the Input Texts
We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they are anonymized consistently across both inputs.
from pprint import pprint

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize the first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize the second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)
View the Anonymization Results and Mapping
Finally, we print both anonymized outputs and inspect the mapping used internally to maintain consistent hashes across values.
print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)

print("\n📦 Mapping used:")
pprint(entity_mapping)