PII Detection and Redaction with GLiNER: Complete Guide
Build a privacy-compliant data processing pipeline that automatically detects and redacts personally identifiable information from text documents.
Overview
This cookbook shows how to use GLiNER to identify PII entities like names, email addresses, phone numbers, social security numbers, and other sensitive data—then redact or mask them for safe storage and sharing.
What You'll Learn
- Configure GLiNER for PII entity extraction
- Define comprehensive PII entity types (names, addresses, IDs, financial data)
- Implement different redaction strategies (masking, replacement, tokenization)
- Process documents while preserving formatting
- Validate redaction completeness
Prerequisites
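- Python 3.9+ (the examples use built-in generic type hints such as list[str] and tuple[str, dict])
- GLiNER library installed (plus Faker, only for the pseudonymization example)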
- Sample documents containing PII (for testing)
Use Cases
- GDPR and CCPA compliance
- Safe data sharing with third parties
- Training data anonymization
- Customer data protection in logs and analytics
The GLiNER PII Model
The knowledgator/gliner-pii-large-v1.0 model is specifically fine-tuned for detecting personally identifiable information across multiple categories. It provides high-accuracy extraction of sensitive data without requiring additional training.
Supported PII Entity Types
| Category | Entity Types |
|---|---|
| Personal Identifiers | person, username, password |
| Contact Information | email, phone_number, address, city, state, zip_code, country |
| Government IDs | social_security_number, driver_license, passport_number, tax_id |
| Financial Data | credit_card_number, bank_account, routing_number, iban |
| Healthcare | medical_record_number, health_insurance_id |
| Digital Identifiers | ip_address, mac_address, url, device_id |
| Temporal & Biographic | date_of_birth, age, gender, ethnicity, nationality |
| Employment | company, job_title, employee_id |
Installation
pip install gliner
Quick Start
from gliner import GLiNER
# Load the PII detection model
model = GLiNER.from_pretrained("knowledgator/gliner-pii-large-v1.0")
# Define PII entity types to detect
pii_labels = [
"person", "email", "phone_number", "address",
"social_security_number", "credit_card_number",
"date_of_birth", "driver_license"
]
# Sample text with PII
text = """
Dear John Smith,
Thank you for your application. We have your contact information on file:
Email: john.smith@email.com, Phone: (555) 123-4567.
For verification, please confirm your SSN (123-45-6789) and
date of birth (March 15, 1985). Your application ID is APP-2024-78432.
Best regards,
HR Department
"""
# Detect PII entities
entities = model.predict_entities(text, pii_labels, threshold=0.5)
# Display detected entities
for entity in entities:
    print(f"Type: {entity['label']}")
    print(f" Text: {entity['text']}")
    print(f" Position: {entity['start']}-{entity['end']}")
    print(f" Confidence: {entity['score']:.2f}")
    print()
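Output (illustrative; exact offsets depend on the whitespace in the input string, and scores depend on the model version):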
Type: person
Text: John Smith
Position: 5-15
Confidence: 0.98
Type: email
Text: john.smith@email.com
Position: 108-128
Confidence: 0.99
Type: phone_number
Text: (555) 123-4567
Position: 137-151
Confidence: 0.97
Type: social_security_number
Text: 123-45-6789
Position: 196-207
Confidence: 0.96
Type: date_of_birth
Text: March 15, 1985
Position: 229-243
Confidence: 0.94
Redaction Strategies
Strategy 1: Simple Masking
Replace PII with a fixed mask pattern:
def redact_with_mask(text: str, entities: list, mask: str = "[REDACTED]") -> str:
    """Replace all PII entities with a mask string."""
    # Sort entities by position (descending) so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for entity in sorted_entities:
        redacted = redacted[:entity['start']] + mask + redacted[entity['end']:]
    return redacted

# Apply masking
redacted_text = redact_with_mask(text, entities)
print(redacted_text)
Output:
Dear [REDACTED],
Thank you for your application. We have your contact information on file:
Email: [REDACTED], Phone: [REDACTED].
For verification, please confirm your SSN ([REDACTED]) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.
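The slicing in redact_with_mask assumes that detected spans do not overlap. Overlaps can arise, for instance, when entity lists from several detectors or passes are merged, and replacing two overlapping spans can corrupt the output. A minimal sketch of one way to handle this, keeping the highest-scoring span in each conflict; resolve_overlaps is a hypothetical helper, not part of GLiNER, and the sample entities are made up:

```python
def resolve_overlaps(entities: list) -> list:
    """Keep the highest-scoring entity among overlapping spans.

    Expects dicts with 'start', 'end', and 'score' keys, the same shape
    used in the examples above. Hypothetical helper, not a GLiNER API.
    """
    kept = []
    # Visit the most confident entities first so they win conflicts.
    for candidate in sorted(entities, key=lambda e: e["score"], reverse=True):
        overlaps = any(
            candidate["start"] < k["end"] and k["start"] < candidate["end"]
            for k in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return the survivors in document order.
    return sorted(kept, key=lambda e: e["start"])


# Made-up spans: the zip_code span sits inside the address span.
sample = [
    {"start": 0, "end": 24, "label": "address", "score": 0.91},
    {"start": 19, "end": 24, "label": "zip_code", "score": 0.80},
    {"start": 30, "end": 40, "label": "person", "score": 0.95},
]
non_overlapping = resolve_overlaps(sample)
print([e["label"] for e in non_overlapping])
```

Preferring the highest score is only one possible policy; preferring the longest span is a reasonable alternative when over-redaction is cheaper than a leak.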
Strategy 2: Type-Aware Masking
Replace PII with type-specific placeholders:
def redact_with_type_labels(text: str, entities: list) -> str:
    """Replace PII with type-specific placeholders."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for entity in sorted_entities:
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity['start']] + placeholder + redacted[entity['end']:]
    return redacted

redacted_text = redact_with_type_labels(text, entities)
print(redacted_text)
Output:
Dear [PERSON],
Thank you for your application. We have your contact information on file:
Email: [EMAIL], Phone: [PHONE_NUMBER].
For verification, please confirm your SSN ([SOCIAL_SECURITY_NUMBER]) and
date of birth ([DATE_OF_BIRTH]). Your application ID is APP-2024-78432.
Strategy 3: Consistent Pseudonymization
Replace PII with consistent fake values (same entity always maps to same pseudonym):
import hashlib
from faker import Faker

fake = Faker()
Faker.seed(42)

def generate_pseudonym(entity_text: str, entity_type: str) -> str:
    """Generate a consistent pseudonym based on entity hash."""
    # Create deterministic seed from entity text
    seed = int(hashlib.md5(entity_text.encode()).hexdigest(), 16) % (10**9)
    Faker.seed(seed)
    generators = {
        "person": fake.name,
        "email": fake.email,
        "phone_number": fake.phone_number,
        "address": fake.address,
        "social_security_number": lambda: fake.ssn(),
        "credit_card_number": fake.credit_card_number,
        "date_of_birth": lambda: fake.date_of_birth().strftime("%B %d, %Y"),
        "company": fake.company,
    }
    generator = generators.get(entity_type, lambda: f"[{entity_type.upper()}]")
    return generator()

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    """Replace PII with consistent pseudonyms. Returns text and mapping."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    mapping = {}
    result = text
    for entity in sorted_entities:
        original = entity['text']
        if original not in mapping:
            mapping[original] = generate_pseudonym(original, entity['label'])
        pseudonym = mapping[original]
        result = result[:entity['start']] + pseudonym + result[entity['end']:]
    return result, mapping

pseudonymized_text, pii_mapping = pseudonymize(text, entities)
print(pseudonymized_text)
print("\nMapping (keep secure):", pii_mapping)
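Tokenization is listed among the redaction strategies but has no example of its own. A minimal sketch using only the standard library, as an alternative to Faker-based pseudonyms: each value is replaced by a keyed HMAC token, so the same value always yields the same token without storing a mapping. The function names, token format, and key value are assumptions for illustration.

```python
import hashlib
import hmac


def make_token(entity_text: str, entity_type: str, key: bytes) -> str:
    """Derive a stable, non-reversible token for one PII value."""
    digest = hmac.new(key, entity_text.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; shorter tokens raise the collision risk.
    return f"<{entity_type.upper()}_{digest[:10]}>"


def tokenize(text: str, entities: list, key: bytes) -> str:
    """Replace each entity span with its keyed token."""
    result = text
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        token = make_token(entity["text"], entity["label"], key)
        result = result[:entity["start"]] + token + result[entity["end"]:]
    return result


# Assumption: in practice the key would come from a secrets manager.
demo_key = b"replace-with-a-secret-key"
demo_text = "Mail jane@example.com or jane@example.com again."
demo_entities = [
    {"text": "jane@example.com", "label": "email", "start": 5, "end": 21},
    {"text": "jane@example.com", "label": "email", "start": 25, "end": 41},
]
tokenized = tokenize(demo_text, demo_entities, demo_key)
print(tokenized)
```

Both occurrences of the address map to the same token, which keeps joins and counts usable downstream. Anyone holding the key can test guesses against a token, so the key deserves the same protection as a pseudonym mapping.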
Strategy 4: Partial Masking
Preserve partial information for usability:
def partial_mask(text: str, entities: list) -> str:
    """Apply partial masking based on entity type."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    result = text
    for entity in sorted_entities:
        original = entity['text']
        entity_type = entity['label']
        if entity_type == "email":
            # Show first char and domain: j***@email.com
            local, sep, domain = original.partition('@')
            # Fall back to a full mask if the span is not a well-formed address
            masked = f"{local[0]}***@{domain}" if sep and local else '***'
        elif entity_type == "phone_number":
            # Show last 4 digits: ***-***-4567
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-***-{digits[-4:]}" if len(digits) >= 4 else '***'
        elif entity_type == "social_security_number":
            # Show last 4 digits: ***-**-6789
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-**-{digits[-4:]}" if len(digits) >= 4 else '***'
        elif entity_type == "credit_card_number":
            # Show last 4 digits: ****-****-****-1234
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"****-****-****-{digits[-4:]}" if len(digits) >= 4 else '****'
        elif entity_type == "person":
            # Show initials: J.S.
            words = original.split()
            masked = '.'.join(w[0].upper() for w in words if w) + '.'
        else:
            masked = "[REDACTED]"
        result = result[:entity['start']] + masked + result[entity['end']:]
    return result

partial_masked = partial_mask(text, entities)
print(partial_masked)
Output:
Dear J.S.,
Thank you for your application. We have your contact information on file:
Email: j***@email.com, Phone: ***-***-4567.
For verification, please confirm your SSN (***-**-6789) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.
Complete PII Processor Class
from gliner import GLiNER
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RedactionStrategy(Enum):
    MASK = "mask"
    TYPE_LABEL = "type_label"
    PSEUDONYMIZE = "pseudonymize"
    PARTIAL = "partial"

@dataclass
class PIIEntity:
    text: str
    label: str
    start: int
    end: int
    score: float

class PIIProcessor:
    """Complete PII detection and redaction processor."""

    DEFAULT_LABELS = [
        "person", "email", "phone_number", "address", "city", "state", "zip_code",
        "social_security_number", "driver_license", "passport_number",
        "credit_card_number", "bank_account", "date_of_birth",
        "ip_address", "username", "password", "company", "job_title"
    ]

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-pii-large-v1.0",
        labels: Optional[list[str]] = None,
        threshold: float = 0.5
    ):
        self.model = GLiNER.from_pretrained(model_name)
        self.labels = labels or self.DEFAULT_LABELS
        self.threshold = threshold

    def detect(self, text: str) -> list[PIIEntity]:
        """Detect PII entities in text."""
        raw_entities = self.model.predict_entities(
            text, self.labels, threshold=self.threshold
        )
        return [
            PIIEntity(
                text=e['text'],
                label=e['label'],
                start=e['start'],
                end=e['end'],
                score=e['score']
            )
            for e in raw_entities
        ]

    def redact(
        self,
        text: str,
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL,
        entities: Optional[list[PIIEntity]] = None
    ) -> str:
        """Detect and redact PII from text."""
        if entities is None:
            entities = self.detect(text)
        # Sort by position descending so earlier offsets stay valid
        sorted_entities = sorted(entities, key=lambda x: x.start, reverse=True)
        result = text
        for entity in sorted_entities:
            if strategy == RedactionStrategy.MASK:
                replacement = "[REDACTED]"
            elif strategy == RedactionStrategy.TYPE_LABEL:
                replacement = f"[{entity.label.upper()}]"
            elif strategy == RedactionStrategy.PARTIAL:
                replacement = self._partial_mask(entity)
            else:
                # PSEUDONYMIZE is not implemented in this class;
                # it falls back to type labels.
                replacement = f"[{entity.label.upper()}]"
            result = result[:entity.start] + replacement + result[entity.end:]
        return result

    def _partial_mask(self, entity: PIIEntity) -> str:
        """Generate partial mask for an entity."""
        text = entity.text
        label = entity.label
        if label == "email":
            local, sep, domain = text.partition('@')
            # Fully mask spans that are not well-formed addresses
            return f"{local[0]}***@{domain}" if sep and local else "[REDACTED]"
        elif label in ("phone_number", "social_security_number", "credit_card_number"):
            digits = ''.join(c for c in text if c.isdigit())
            return f"***{digits[-4:]}" if len(digits) >= 4 else '***'
        elif label == "person":
            return '.'.join(w[0].upper() for w in text.split() if w) + '.'
        else:
            return "[REDACTED]"

    def process_batch(
        self,
        documents: list[dict],
        text_field: str = "text",
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL
    ) -> list[dict]:
        """Process multiple documents."""
        results = []
        for doc in documents:
            entities = self.detect(doc[text_field])
            redacted = self.redact(doc[text_field], strategy, entities)
            results.append({
                **doc,
                "redacted_text": redacted,
                "pii_count": len(entities),
                "pii_types": list(set(e.label for e in entities))
            })
        return results

    def get_report(self, text: str) -> dict:
        """Generate a PII detection report."""
        entities = self.detect(text)
        by_type = {}
        for entity in entities:
            if entity.label not in by_type:
                by_type[entity.label] = []
            by_type[entity.label].append({
                "text": entity.text,
                "position": f"{entity.start}-{entity.end}",
                "confidence": round(entity.score, 3)
            })
        return {
            "total_pii_found": len(entities),
            "pii_by_type": by_type,
            "risk_level": self._assess_risk(entities)
        }

    def _assess_risk(self, entities: list[PIIEntity]) -> str:
        """Assess risk level based on PII types found."""
        high_risk = {"social_security_number", "credit_card_number", "bank_account",
                     "driver_license", "passport_number", "password"}
        medium_risk = {"date_of_birth", "address", "phone_number", "medical_record_number"}
        found_types = set(e.label for e in entities)
        if found_types & high_risk:
            return "HIGH"
        elif found_types & medium_risk:
            return "MEDIUM"
        elif found_types:
            return "LOW"
        return "NONE"
Usage Examples
Basic Usage
processor = PIIProcessor()
text = """
Contact Jane Doe at jane.doe@company.com or call 555-867-5309.
Her SSN is 987-65-4321 and she lives at 123 Main St, Boston, MA 02101.
"""
# Detect PII
entities = processor.detect(text)
print(f"Found {len(entities)} PII entities")
# Redact with different strategies
print("\n--- Type Labels ---")
print(processor.redact(text, RedactionStrategy.TYPE_LABEL))
print("\n--- Partial Masking ---")
print(processor.redact(text, RedactionStrategy.PARTIAL))
Batch Processing
documents = [
{"id": 1, "text": "Patient John Smith, DOB: 01/15/1980, SSN: 123-45-6789"},
{"id": 2, "text": "Contact: mary@email.com, Phone: (555) 123-4567"},
{"id": 3, "text": "Card ending 4532, Account holder: Bob Johnson"},
]
processor = PIIProcessor(threshold=0.4)
results = processor.process_batch(documents)
for doc in results:
    print(f"Document {doc['id']}:")
    print(f" PII found: {doc['pii_count']} ({', '.join(doc['pii_types'])})")
    print(f" Redacted: {doc['redacted_text']}")
    print()
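Transformer-based models such as GLiNER generally have a bounded input length, so very long documents may be truncated and their tails left unscanned. One possible workaround, sketched under assumptions: split the text into line-aligned chunks, run detection on each chunk, and shift the offsets back into document coordinates. Here detect_fn stands for any callable that returns dicts with start and end keys (for example a thin wrapper around model.predict_entities); the function names, the max_chars value, and the regex-based toy_detector are hypothetical and exist only to make the demo self-contained.

```python
import re
from typing import Callable


def _shift(found: list, offset: int) -> list:
    """Translate chunk-relative spans to document-relative spans."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset} for e in found]


def detect_in_chunks(text: str, detect_fn: Callable[[str], list], max_chars: int = 1000) -> list:
    """Run detect_fn over line-aligned chunks and return document-level offsets."""
    entities = []
    chunk_start = 0
    chunk = ""
    # keepends=True preserves line terminators, so offsets stay exact.
    for line in text.splitlines(keepends=True):
        if chunk and len(chunk) + len(line) > max_chars:
            entities.extend(_shift(detect_fn(chunk), chunk_start))
            chunk_start += len(chunk)
            chunk = ""
        chunk += line
    if chunk:
        entities.extend(_shift(detect_fn(chunk), chunk_start))
    return entities


def toy_detector(chunk: str) -> list:
    """Stand-in detector for the demo: finds SSN-shaped strings with a regex."""
    return [
        {"text": m.group(), "label": "social_security_number",
         "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", chunk)
    ]


long_text = "filler line\n" * 50 + "SSN: 123-45-6789\n"
chunked = detect_in_chunks(long_text, toy_detector, max_chars=200)
for e in chunked:
    print(e["label"], e["start"], e["end"])
```

Chunking on line boundaries keeps single-line entities intact, but an entity that spans several lines (a multi-line postal address, say) can still be split across two chunks; overlapping chunks would be one way to reduce that risk.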
Generate PII Report
report = processor.get_report(text)
print(f"Total PII found: {report['total_pii_found']}")
print(f"Risk level: {report['risk_level']}")
print("\nBreakdown by type:")
for pii_type, instances in report['pii_by_type'].items():
    print(f" {pii_type}: {len(instances)} instance(s)")
Validation and Quality Assurance
Verify Redaction Completeness
def validate_redaction(original: str, redacted: str, entities: list[PIIEntity]) -> dict:
    """Verify that all PII was properly redacted."""
    issues = []
    for entity in entities:
        if entity.text in redacted:
            issues.append({
                "type": entity.label,
                "text": entity.text,
                "issue": "PII still present in redacted text"
            })
    return {
        "is_valid": len(issues) == 0,
        "issues": issues,
        "entities_processed": len(entities)
    }

# Validate: detect, redact, and check the same text and entity list
entities = processor.detect(text)
redacted_text = processor.redact(text, RedactionStrategy.TYPE_LABEL, entities)
validation = validate_redaction(text, redacted_text, entities)
if not validation['is_valid']:
    print("WARNING: Redaction incomplete!")
    for issue in validation['issues']:
        print(f" - {issue['type']}: '{issue['text']}'")
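The substring check above can only flag entities the model already detected. A complementary sketch, under stated assumptions: scan the redacted output with a few regular expressions for well-structured PII that the model may have missed. The patterns are deliberately simple, US-centric illustrations, and find_residual_pii is a hypothetical helper, not a GLiNER feature.

```python
import re

# Illustrative patterns only; real deployments need broader, tested patterns.
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}


def find_residual_pii(redacted: str) -> list:
    """Return (label, start, end) for pattern matches left in redacted text."""
    hits = []
    for label, pattern in RESIDUAL_PATTERNS.items():
        for match in pattern.finditer(redacted):
            # Report positions, not the matched value, to keep PII out of logs.
            hits.append((label, match.start(), match.end()))
    return hits


clean = "Email: [EMAIL], SSN ([SOCIAL_SECURITY_NUMBER])"
leaky = "Email: [EMAIL], backup contact bob@example.org"
print(find_residual_pii(clean))
print(find_residual_pii(leaky))
```

A non-empty result does not prove a leak (the patterns can match order numbers or similar strings), but it is a cheap signal that a document deserves a second look before release.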
Double-Pass Detection
def double_pass_redaction(processor: PIIProcessor, text: str) -> str:
    """Run detection twice to catch any missed PII."""
    # First pass
    redacted = processor.redact(text, RedactionStrategy.TYPE_LABEL)
    # Second pass on redacted text (catches edge cases)
    final = processor.redact(redacted, RedactionStrategy.TYPE_LABEL)
    return final
Configuration Options
Adjusting Detection Threshold
# High precision (fewer false positives)
strict_processor = PIIProcessor(threshold=0.7)
# High recall (catch more PII, may have false positives)
sensitive_processor = PIIProcessor(threshold=0.3)
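A single global threshold may not suit every entity type. One option, sketched here as an assumption rather than a built-in GLiNER feature, is to run detection once with a low threshold and then filter the results with per-label cutoffs. The helper name, the cutoff values, and the sample scores are made up for illustration.

```python
def filter_by_label_threshold(entities: list, thresholds: dict, default: float = 0.5) -> list:
    """Keep entities whose score meets the cutoff for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


# Hypothetical cutoffs: permissive for SSNs (a miss is costly),
# strict for company names (false positives are common).
label_thresholds = {"social_security_number": 0.3, "company": 0.7}

candidates = [
    {"label": "social_security_number", "score": 0.35},
    {"label": "company", "score": 0.60},
    {"label": "person", "score": 0.55},
]
kept = filter_by_label_threshold(candidates, label_thresholds)
print([e["label"] for e in kept])
```

For this to work, the detection pass itself must use a threshold no higher than the lowest per-label cutoff; otherwise low-scoring candidates never reach the filter.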
Custom Entity Types
# Focus on specific PII categories
financial_processor = PIIProcessor(
    labels=["credit_card_number", "bank_account", "routing_number", "iban"]
)
healthcare_processor = PIIProcessor(
    labels=["person", "date_of_birth", "medical_record_number",
            "health_insurance_id", "social_security_number"]
)
Best Practices
- Set appropriate thresholds: Start with 0.5; lower it (0.3-0.4) for highly sensitive data where a missed entity is costly, and raise it (0.6-0.7) for precision-critical applications.
- Use type-aware redaction: Type labels ([EMAIL], [SOCIAL_SECURITY_NUMBER]) preserve document structure better than generic masks.
- Validate redaction output: Always verify that PII was successfully removed, especially for compliance requirements.
- Consider partial masking for usability: Use it when recipients need some context (e.g., the last 4 digits of an SSN for verification).
- Log PII detection, not PII values: Track which types were found, not the actual sensitive data.
- Handle edge cases: Test with varied formats (international phone numbers, different date formats, etc.).
- Secure pseudonymization mappings: If using reversible pseudonymization, protect the mapping file with the same security as the original PII.
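The advice to log detections rather than values can be sketched as follows. This is a minimal illustration, assuming entities shaped like the dicts returned in the Quick Start; the logger name and the summarize_for_log helper are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("pii_audit")  # hypothetical logger name


def summarize_for_log(doc_id: str, entities: list) -> dict:
    """Build a log-safe summary: counts per PII type, never the values."""
    counts = Counter(e["label"] for e in entities)
    summary = {
        "doc_id": doc_id,
        "pii_total": sum(counts.values()),
        "pii_types": dict(counts),
    }
    logger.info("PII summary: %s", summary)
    return summary


found = [
    {"label": "person", "text": "Jane Doe"},
    {"label": "social_security_number", "text": "987-65-4321"},
    {"label": "person", "text": "Bob Johnson"},
]
summary = summarize_for_log("doc-1", found)
print(summary)
```

The entity text never enters the summary, so the audit trail can be retained and shared under looser controls than the source documents.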
Next Steps
- Named Entity Recognition Guide — Understand the underlying NER technology
- Social Media Categorization — Apply local classification to social media content
- Biomedical Entity Extraction — Extract medical entities with privacy awareness