PII Detection and Redaction with GLiNER: Complete Guide

Build a privacy-compliant data processing pipeline that automatically detects and redacts personally identifiable information from text documents.

Overview

This cookbook shows how to use GLiNER to identify PII entities such as names, email addresses, phone numbers, and social security numbers, and then redact or mask them for safe storage and sharing.

What You'll Learn

  • Configure GLiNER for PII entity extraction
  • Define comprehensive PII entity types (names, addresses, IDs, financial data)
  • Implement different redaction strategies (masking, replacement, tokenization)
  • Process documents while preserving formatting
  • Validate redaction completeness

Prerequisites

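  • Python 3.9+ (the examples use built-in generic type hints such as list[str] and tuple[str, dict])
  • GLiNER library installed
  • Faker library installed (only needed for the pseudonymization strategy)
  • Sample documents containing PII (for testing)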

Use Cases

  • GDPR and CCPA compliance
  • Safe data sharing with third parties
  • Training data anonymization
  • Customer data protection in logs and analytics

The GLiNER PII Model

The knowledgator/gliner-pii-large-v1.0 model is fine-tuned to detect personally identifiable information across multiple categories. It can be used as-is, without additional training.

Supported PII Entity Types

| Category | Entity Types |
|---|---|
| Personal Identifiers | person, username, password |
| Contact Information | email, phone_number, address, city, state, zip_code, country |
| Government IDs | social_security_number, driver_license, passport_number, tax_id |
| Financial Data | credit_card_number, bank_account, routing_number, iban |
| Healthcare | medical_record_number, health_insurance_id |
| Digital Identifiers | ip_address, mac_address, url, device_id |
| Temporal & Biographic | date_of_birth, age, gender, ethnicity, nationality |
| Employment | company, job_title, employee_id |
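One way to keep label lists manageable is to group them by category and build the list passed to the model from only the categories you need. The sketch below is an assumption about how you might organize this: `PII_CATEGORIES` and `labels_for` are hypothetical names (not part of the GLiNER API), and the label strings are copied from the entity types listed above.

```python
# Hypothetical grouping of the labels listed above; not part of the GLiNER API.
PII_CATEGORIES = {
    "personal": ["person", "username", "password"],
    "contact": ["email", "phone_number", "address", "city", "state", "zip_code", "country"],
    "government": ["social_security_number", "driver_license", "passport_number", "tax_id"],
    "financial": ["credit_card_number", "bank_account", "routing_number", "iban"],
    "healthcare": ["medical_record_number", "health_insurance_id"],
    "digital": ["ip_address", "mac_address", "url", "device_id"],
    "biographic": ["date_of_birth", "age", "gender", "ethnicity", "nationality"],
    "employment": ["company", "job_title", "employee_id"],
}


def labels_for(*categories: str) -> list[str]:
    """Return a de-duplicated, order-preserving label list for the given categories."""
    labels: list[str] = []
    for category in categories:
        for label in PII_CATEGORIES[category]:  # raises KeyError on an unknown category
            if label not in labels:
                labels.append(label)
    return labels


financial_and_ids = labels_for("financial", "government")
print(financial_and_ids)
```

The resulting list can be passed wherever the examples in this guide pass a list of labels.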

Installation

pip install gliner

Quick Start

from gliner import GLiNER

# Load the PII detection model
model = GLiNER.from_pretrained("knowledgator/gliner-pii-large-v1.0")

# Define PII entity types to detect
pii_labels = [
    "person", "email", "phone_number", "address",
    "social_security_number", "credit_card_number",
    "date_of_birth", "driver_license"
]

# Sample text with PII
text = """
Dear John Smith,

Thank you for your application. We have your contact information on file:
Email: john.smith@email.com, Phone: (555) 123-4567.

For verification, please confirm your SSN (123-45-6789) and
date of birth (March 15, 1985). Your application ID is APP-2024-78432.

Best regards,
HR Department
"""

# Detect PII entities
entities = model.predict_entities(text, pii_labels, threshold=0.5)

# Display detected entities
for entity in entities:
    print(f"Type: {entity['label']}")
    print(f" Text: {entity['text']}")
    print(f" Position: {entity['start']}-{entity['end']}")
    print(f" Confidence: {entity['score']:.2f}")
    print()

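Output (illustrative; exact character offsets and confidence scores depend on the input text and model version):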

Type: person
Text: John Smith
Position: 5-15
Confidence: 0.98

Type: email
Text: john.smith@email.com
Position: 108-128
Confidence: 0.99

Type: phone_number
Text: (555) 123-4567
Position: 137-151
Confidence: 0.97

Type: social_security_number
Text: 123-45-6789
Position: 196-207
Confidence: 0.96

Type: date_of_birth
Text: March 15, 1985
Position: 229-243
Confidence: 0.94

Redaction Strategies

Strategy 1: Simple Masking

Replace PII with a fixed mask pattern:

def redact_with_mask(text: str, entities: list, mask: str = "[REDACTED]") -> str:
    """Replace all PII entities with a mask string."""
    # Replace from the end of the text backward so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    redacted = text
    for entity in sorted_entities:
        redacted = redacted[:entity['start']] + mask + redacted[entity['end']:]

    return redacted

# Apply masking
redacted_text = redact_with_mask(text, entities)
print(redacted_text)

Output:

Dear [REDACTED],

Thank you for your application. We have your contact information on file:
Email: [REDACTED], Phone: [REDACTED].

For verification, please confirm your SSN ([REDACTED]) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.
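The reverse-order replacement above assumes that entity spans do not overlap. If your detection settings or post-processing can produce overlapping spans, one option is to resolve them before redacting. This is a minimal sketch, assuming entities are dicts with the `start`, `end`, and `score` keys shown in the Quick Start. `drop_overlaps` is a hypothetical helper, and keeping the higher-scoring span is only one possible policy.

```python
def drop_overlaps(entities: list[dict]) -> list[dict]:
    """Keep the highest-scoring entity among any overlapping spans."""
    kept: list[dict] = []
    # Visit the most confident entities first so they win conflicts.
    for candidate in sorted(entities, key=lambda e: e["score"], reverse=True):
        overlaps = any(
            candidate["start"] < other["end"] and other["start"] < candidate["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return the survivors in document order.
    return sorted(kept, key=lambda e: e["start"])


sample = [
    {"text": "John Smith", "label": "person", "start": 5, "end": 15, "score": 0.98},
    {"text": "Smith", "label": "username", "start": 10, "end": 15, "score": 0.55},
    {"text": "john.smith@email.com", "label": "email", "start": 30, "end": 50, "score": 0.99},
]
print([e["label"] for e in drop_overlaps(sample)])
```

For strict redaction you might prefer to merge overlapping spans into their union instead, so that no character of either span survives.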

Strategy 2: Type-Aware Masking

Replace PII with type-specific placeholders:

def redact_with_type_labels(text: str, entities: list) -> str:
    """Replace PII with type-specific placeholders."""
    # Replace from the end of the text backward so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    redacted = text
    for entity in sorted_entities:
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity['start']] + placeholder + redacted[entity['end']:]

    return redacted

redacted_text = redact_with_type_labels(text, entities)
print(redacted_text)

Output:

Dear [PERSON],

Thank you for your application. We have your contact information on file:
Email: [EMAIL], Phone: [PHONE_NUMBER].

For verification, please confirm your SSN ([SOCIAL_SECURITY_NUMBER]) and
date of birth ([DATE_OF_BIRTH]). Your application ID is APP-2024-78432.

Strategy 3: Consistent Pseudonymization

Replace PII with consistent fake values, so that the same entity always maps to the same pseudonym:

import hashlib
from faker import Faker

fake = Faker()
Faker.seed(42)

def generate_pseudonym(entity_text: str, entity_type: str) -> str:
    """Generate a consistent pseudonym based on entity hash."""
    # Derive a deterministic seed from the entity text.
    # MD5 is used only to produce a seed here, not for security.
    seed = int(hashlib.md5(entity_text.encode()).hexdigest(), 16) % (10**9)
    # Reseed Faker's shared random generator so the same text yields the same value
    Faker.seed(seed)

    generators = {
        "person": fake.name,
        "email": fake.email,
        "phone_number": fake.phone_number,
        "address": fake.address,
        "social_security_number": lambda: fake.ssn(),
        "credit_card_number": fake.credit_card_number,
        "date_of_birth": lambda: fake.date_of_birth().strftime("%B %d, %Y"),
        "company": fake.company,
    }

    # Unknown entity types fall back to a type-label placeholder
    generator = generators.get(entity_type, lambda: f"[{entity_type.upper()}]")
    return generator()

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    """Replace PII with consistent pseudonyms. Returns text and mapping."""
    # Replace from the end of the text backward so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    mapping = {}
    result = text

    for entity in sorted_entities:
        original = entity['text']
        # Reuse the pseudonym if this exact text was already seen
        if original not in mapping:
            mapping[original] = generate_pseudonym(original, entity['label'])

        pseudonym = mapping[original]
        result = result[:entity['start']] + pseudonym + result[entity['end']:]

    return result, mapping

pseudonymized_text, pii_mapping = pseudonymize(text, entities)
print(pseudonymized_text)
print("\nMapping (keep secure):", pii_mapping)
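The list of strategies at the top of this guide mentions tokenization, which none of the strategies above implement. Below is a minimal sketch using a keyed hash (HMAC) from the Python standard library, as an alternative to Faker when realistic-looking values are not needed. `tokenize`, `redact_with_tokens`, and the `SECRET_KEY` value are hypothetical. In practice the key would come from secure storage, and the 8-character token length is an arbitrary choice that trades readability against collision risk.

```python
import hashlib
import hmac

# Assumption: a placeholder key for illustration only. Load a real key from secure storage.
SECRET_KEY = b"replace-with-a-real-secret"


def tokenize(entity_text: str, entity_type: str, key: bytes = SECRET_KEY) -> str:
    """Map an entity to a stable token of the form [TYPE_xxxxxxxx]."""
    digest = hmac.new(key, entity_text.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[{entity_type.upper()}_{digest[:8]}]"


def redact_with_tokens(text: str, entities: list[dict]) -> str:
    """Replace each entity with its token, working from the end of the text backward."""
    result = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        token = tokenize(entity["text"], entity["label"])
        result = result[: entity["start"]] + token + result[entity["end"]:]
    return result


sample_text = "Mail a@b.com, then a@b.com again."
sample_entities = [
    {"text": "a@b.com", "label": "email", "start": 5, "end": 12},
    {"text": "a@b.com", "label": "email", "start": 19, "end": 26},
]
print(redact_with_tokens(sample_text, sample_entities))
```

Because the same input and key always produce the same token, records can still be joined on the token without storing a mapping table. The trade-off is that the original value cannot be recovered from the token.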

Strategy 4: Partial Masking

Preserve partial information for usability:

def partial_mask(text: str, entities: list) -> str:
    """Apply partial masking based on entity type."""
    # Replace from the end of the text backward so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    result = text
    for entity in sorted_entities:
        original = entity['text']
        entity_type = entity['label']

        if entity_type == "email":
            # Show first char and domain: j***@email.com
            # Fall back to a full mask if there is no '@' or no local part
            local, sep, domain = original.partition('@')
            masked = f"{local[0]}***@{domain}" if sep and local else '***'

        elif entity_type == "phone_number":
            # Show last 4 digits: ***-***-4567
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-***-{digits[-4:]}" if len(digits) >= 4 else '***'

        elif entity_type == "social_security_number":
            # Show last 4 digits: ***-**-6789
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-**-{digits[-4:]}" if len(digits) >= 4 else '***'

        elif entity_type == "credit_card_number":
            # Show last 4 digits: ****-****-****-1234
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"****-****-****-{digits[-4:]}" if len(digits) >= 4 else '****'

        elif entity_type == "person":
            # Show initials: J.S.
            words = original.split()
            masked = '.'.join(w[0].upper() for w in words if w) + '.'

        else:
            masked = "[REDACTED]"

        result = result[:entity['start']] + masked + result[entity['end']:]

    return result

partial_masked = partial_mask(text, entities)
print(partial_masked)

Output:

Dear J.S.,

Thank you for your application. We have your contact information on file:
Email: j***@email.com, Phone: ***-***-4567.

For verification, please confirm your SSN (***-**-6789) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.

Complete PII Processor Class

from gliner import GLiNER
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RedactionStrategy(Enum):
    MASK = "mask"
    TYPE_LABEL = "type_label"
    PSEUDONYMIZE = "pseudonymize"
    PARTIAL = "partial"

@dataclass
class PIIEntity:
    text: str
    label: str
    start: int
    end: int
    score: float

class PIIProcessor:
    """Complete PII detection and redaction processor."""

    DEFAULT_LABELS = [
        "person", "email", "phone_number", "address", "city", "state", "zip_code",
        "social_security_number", "driver_license", "passport_number",
        "credit_card_number", "bank_account", "date_of_birth",
        "ip_address", "username", "password", "company", "job_title"
    ]

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-pii-large-v1.0",
        labels: Optional[list[str]] = None,
        threshold: float = 0.5
    ):
        self.model = GLiNER.from_pretrained(model_name)
        self.labels = labels or self.DEFAULT_LABELS
        self.threshold = threshold

    def detect(self, text: str) -> list[PIIEntity]:
        """Detect PII entities in text."""
        raw_entities = self.model.predict_entities(
            text, self.labels, threshold=self.threshold
        )
        return [
            PIIEntity(
                text=e['text'],
                label=e['label'],
                start=e['start'],
                end=e['end'],
                score=e['score']
            )
            for e in raw_entities
        ]

    def redact(
        self,
        text: str,
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL,
        entities: Optional[list[PIIEntity]] = None
    ) -> str:
        """Detect and redact PII from text."""
        if entities is None:
            entities = self.detect(text)

        # Sort by position descending so earlier offsets stay valid
        sorted_entities = sorted(entities, key=lambda x: x.start, reverse=True)

        result = text
        for entity in sorted_entities:
            if strategy == RedactionStrategy.MASK:
                replacement = "[REDACTED]"
            elif strategy == RedactionStrategy.TYPE_LABEL:
                replacement = f"[{entity.label.upper()}]"
            elif strategy == RedactionStrategy.PARTIAL:
                replacement = self._partial_mask(entity)
            else:
                # PSEUDONYMIZE is not implemented in this class;
                # it falls back to type labels
                replacement = f"[{entity.label.upper()}]"

            result = result[:entity.start] + replacement + result[entity.end:]

        return result

    def _partial_mask(self, entity: PIIEntity) -> str:
        """Generate partial mask for an entity."""
        text = entity.text
        label = entity.label

        if label == "email" and '@' in text:
            local, _, domain = text.partition('@')
            # Guard against an empty local part such as "@example.com"
            return f"{local[0]}***@{domain}" if local else '***'
        elif label in ("phone_number", "social_security_number", "credit_card_number"):
            digits = ''.join(c for c in text if c.isdigit())
            return f"***{digits[-4:]}" if len(digits) >= 4 else '***'
        elif label == "person":
            return '.'.join(w[0].upper() for w in text.split() if w) + '.'
        else:
            return "[REDACTED]"

    def process_batch(
        self,
        documents: list[dict],
        text_field: str = "text",
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL
    ) -> list[dict]:
        """Process multiple documents."""
        results = []
        for doc in documents:
            entities = self.detect(doc[text_field])
            redacted = self.redact(doc[text_field], strategy, entities)
            results.append({
                **doc,
                "redacted_text": redacted,
                "pii_count": len(entities),
                # Sorted for stable, reproducible output
                "pii_types": sorted({e.label for e in entities})
            })
        return results

    def get_report(self, text: str) -> dict:
        """Generate a PII detection report."""
        entities = self.detect(text)

        by_type = {}
        for entity in entities:
            if entity.label not in by_type:
                by_type[entity.label] = []
            by_type[entity.label].append({
                "text": entity.text,
                "position": f"{entity.start}-{entity.end}",
                "confidence": round(entity.score, 3)
            })

        return {
            "total_pii_found": len(entities),
            "pii_by_type": by_type,
            "risk_level": self._assess_risk(entities)
        }

    def _assess_risk(self, entities: list[PIIEntity]) -> str:
        """Assess risk level based on PII types found."""
        high_risk = {"social_security_number", "credit_card_number", "bank_account",
                     "driver_license", "passport_number", "password"}
        medium_risk = {"date_of_birth", "address", "phone_number", "medical_record_number"}

        found_types = set(e.label for e in entities)

        if found_types & high_risk:
            return "HIGH"
        elif found_types & medium_risk:
            return "MEDIUM"
        elif found_types:
            return "LOW"
        return "NONE"

Usage Examples

Basic Usage

processor = PIIProcessor()

text = """
Contact Jane Doe at jane.doe@company.com or call 555-867-5309.
Her SSN is 987-65-4321 and she lives at 123 Main St, Boston, MA 02101.
"""

# Detect PII
entities = processor.detect(text)
print(f"Found {len(entities)} PII entities")

# Redact with different strategies
print("\n--- Type Labels ---")
print(processor.redact(text, RedactionStrategy.TYPE_LABEL))

print("\n--- Partial Masking ---")
print(processor.redact(text, RedactionStrategy.PARTIAL))

Batch Processing

documents = [
    {"id": 1, "text": "Patient John Smith, DOB: 01/15/1980, SSN: 123-45-6789"},
    {"id": 2, "text": "Contact: mary@email.com, Phone: (555) 123-4567"},
    {"id": 3, "text": "Card ending 4532, Account holder: Bob Johnson"},
]

processor = PIIProcessor(threshold=0.4)
results = processor.process_batch(documents)

for doc in results:
    print(f"Document {doc['id']}:")
    print(f" PII found: {doc['pii_count']} ({', '.join(doc['pii_types'])})")
    print(f" Redacted: {doc['redacted_text']}")
    print()
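Transformer-based models typically have a maximum input length, so long documents may need to be split before detection. The sketch below shows one way to do that while keeping character offsets valid for the original text. It assumes a `detect` callable that returns dicts with `start` and `end` keys (for example, a thin wrapper around `model.predict_entities` with fixed labels). `detect_in_chunks`, `_shift`, and the `max_chars` default are hypothetical and not tuned to any particular model. The regex-based `find_ssn_like` is only a stand-in detector so the example runs without loading a model.

```python
import re
from typing import Callable


def _shift(found: list[dict], offset: int) -> list[dict]:
    """Return copies of the entities with start/end moved by `offset`."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset} for e in found]


def detect_in_chunks(
    text: str,
    detect: Callable[[str], list[dict]],
    max_chars: int = 1000,
) -> list[dict]:
    """Run `detect` on line-aligned chunks and map offsets back to `text`."""
    entities: list[dict] = []
    chunk_start = 0
    chunk = ""
    # keepends=True preserves newlines so chunk offsets match the original text.
    for line in text.splitlines(keepends=True):
        if chunk and len(chunk) + len(line) > max_chars:
            entities.extend(_shift(detect(chunk), chunk_start))
            chunk_start += len(chunk)
            chunk = ""
        chunk += line
    if chunk:
        entities.extend(_shift(detect(chunk), chunk_start))
    return entities


# Stand-in detector for demonstration; in practice, wrap the model call instead.
def find_ssn_like(chunk: str) -> list[dict]:
    return [
        {"text": m.group(), "label": "social_security_number",
         "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\d{3}-\d{2}-\d{4}", chunk)
    ]


doc = "line one\nSSN 123-45-6789\nline three\n"
for entity in detect_in_chunks(doc, find_ssn_like, max_chars=12):
    # Offsets refer to the full document, not the chunk.
    print(entity["label"], doc[entity["start"]:entity["end"]])
```

Two limitations of this sketch: a single line longer than `max_chars` is still passed through as one oversized chunk, and an entity that spans a chunk boundary may be missed. Overlapping chunks are one way to reduce the second risk.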

Generate PII Report

report = processor.get_report(text)

print(f"Total PII found: {report['total_pii_found']}")
print(f"Risk level: {report['risk_level']}")
print("\nBreakdown by type:")
for pii_type, instances in report['pii_by_type'].items():
    print(f" {pii_type}: {len(instances)} instance(s)")

Validation and Quality Assurance

Verify Redaction Completeness

def validate_redaction(original: str, redacted: str, entities: list[PIIEntity]) -> dict:
    """Verify that all PII was properly redacted."""
    issues = []

    for entity in entities:
        # Any detected value that still appears verbatim counts as a leak
        if entity.text in redacted:
            issues.append({
                "type": entity.label,
                "text": entity.text,
                "issue": "PII still present in redacted text"
            })

    return {
        "is_valid": len(issues) == 0,
        "issues": issues,
        "entities_processed": len(entities)
    }

# Validate: detect once, redact using those entities, then check the result
entities = processor.detect(text)
redacted_text = processor.redact(text, RedactionStrategy.TYPE_LABEL, entities)
validation = validate_redaction(text, redacted_text, entities)
if not validation['is_valid']:
    print("WARNING: Redaction incomplete!")
    for issue in validation['issues']:
        print(f" - {issue['type']}: '{issue['text']}'")
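The substring check above only catches entities the model already found. As a complementary check, a few regular expressions can flag well-structured identifiers that may have slipped past detection. The patterns below are simplified assumptions (US-style SSNs and basic email shapes), not complete validators, and `RESIDUAL_PATTERNS` and `find_residual_pii` are hypothetical names.

```python
import re

# Simplified patterns for illustration; real data will need broader coverage.
RESIDUAL_PATTERNS = {
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_residual_pii(redacted: str) -> dict[str, int]:
    """Count pattern matches left in redacted text, without returning the values."""
    counts: dict[str, int] = {}
    for label, pattern in RESIDUAL_PATTERNS.items():
        hits = len(pattern.findall(redacted))
        if hits:
            counts[label] = hits
    return counts


print(find_residual_pii("Contact [PERSON] at [EMAIL]; SSN 987-65-4321"))
```

Returning counts rather than matched strings keeps the check itself from leaking PII into logs or reports.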

Double-Pass Detection

def double_pass_redaction(processor: PIIProcessor, text: str) -> str:
    """Run detection twice to catch any missed PII."""
    # First pass
    redacted = processor.redact(text, RedactionStrategy.TYPE_LABEL)

    # Second pass on redacted text (catches edge cases)
    final = processor.redact(redacted, RedactionStrategy.TYPE_LABEL)

    return final

Configuration Options

Adjusting Detection Threshold

# High precision (fewer false positives)
strict_processor = PIIProcessor(threshold=0.7)

# High recall (catch more PII, may have false positives)
sensitive_processor = PIIProcessor(threshold=0.3)
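A single global threshold may not suit every entity type. One option is to run detection with a low threshold (for example 0.3) and then filter per label afterwards. This is a sketch, assuming entities are dicts with `label` and `score` keys as in the Quick Start. `LABEL_THRESHOLDS`, its values, `DEFAULT_THRESHOLD`, and `filter_by_label_threshold` are hypothetical and should be tuned on your own data.

```python
# Hypothetical per-label thresholds: lower for high-risk types, higher for noisy ones.
LABEL_THRESHOLDS = {
    "social_security_number": 0.3,
    "credit_card_number": 0.3,
    "person": 0.6,
}
DEFAULT_THRESHOLD = 0.5


def filter_by_label_threshold(
    entities: list[dict],
    thresholds: dict[str, float] = LABEL_THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> list[dict]:
    """Keep entities whose score meets the threshold configured for their label."""
    return [e for e in entities if e["score"] >= thresholds.get(e["label"], default)]


detected = [
    {"text": "123-45-6789", "label": "social_security_number", "score": 0.35},
    {"text": "Main", "label": "person", "score": 0.55},
    {"text": "a@b.com", "label": "email", "score": 0.52},
]
kept = filter_by_label_threshold(detected)
print([e["label"] for e in kept])
```

The threshold passed to the model must be no higher than the lowest per-label value, otherwise low-scoring candidates never reach the filter.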

Custom Entity Types

# Focus on specific PII categories
financial_processor = PIIProcessor(
    labels=["credit_card_number", "bank_account", "routing_number", "iban"]
)

healthcare_processor = PIIProcessor(
    labels=["person", "date_of_birth", "medical_record_number",
            "health_insurance_id", "social_security_number"]
)

Best Practices

  1. Set appropriate thresholds: Start with 0.5. Lower it (0.3-0.4) when missing PII is costlier than over-redacting, and raise it (0.6-0.7) for precision-critical applications

  2. Use type-aware redaction: Type labels ([EMAIL], [SOCIAL_SECURITY_NUMBER]) tell the reader what kind of data was removed, which generic masks do not

  3. Validate redaction output: Always verify PII was successfully removed, especially for compliance requirements

  4. Consider partial masking for usability: When recipients need some context (e.g., last 4 of SSN for verification)

  5. Log PII detection, not PII values: Track what types were found, not the actual sensitive data

  6. Handle edge cases: Test with varied formats (international phone numbers, different date formats, etc.)

  7. Secure pseudonymization mappings: If using reversible pseudonymization, protect the mapping with the same security controls as the original PII
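Practice 5 can be followed by logging only aggregate metadata about what was detected. Here is a minimal sketch using the standard `logging` module, assuming entities are dicts with a `label` key. `summarize_for_log` and the logger name are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("pii_pipeline")  # hypothetical logger name


def summarize_for_log(doc_id: str, entities: list[dict]) -> dict:
    """Build a log-safe summary: counts per type, never the matched text."""
    counts = Counter(e["label"] for e in entities)
    return {
        "doc_id": doc_id,
        "pii_total": sum(counts.values()),
        "pii_by_type": dict(counts),
    }


summary = summarize_for_log(
    "doc-1",
    [
        {"text": "a@b.com", "label": "email"},
        {"text": "c@d.org", "label": "email"},
        {"text": "Jane Doe", "label": "person"},
    ],
)
logger.info("PII detected: %s", summary)
```

The summary deliberately omits the `text` field, so the log line records that two emails and one name were found without recording the values themselves.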

Next Steps