PII Detection and Redaction with GLiNER: Complete Guide

Build a privacy-compliant data processing pipeline that automatically detects and redacts personally identifiable information from text documents.

Overview

This cookbook shows how to use GLiNER to identify PII entities like names, email addresses, phone numbers, social security numbers, and other sensitive data—then redact or mask them for safe storage and sharing.

What You'll Learn

Configure GLiNER for PII entity extraction
Define comprehensive PII entity types (names, addresses, IDs, financial data)
Implement different redaction strategies (masking, replacement, tokenization)
Process documents while preserving formatting
Validate redaction completeness

Prerequisites

Python 3.9+ (the examples use built-in generic type hints such as list[str] and tuple[str, dict])
GLiNER library installed
Sample documents containing PII (for testing)

Use Cases

GDPR and CCPA compliance
Safe data sharing with third parties
Training data anonymization
Customer data protection in logs and analytics

The GLiNER PII Model

The knowledgator/gliner-pii-large-v1.0 model is fine-tuned for detecting personally identifiable information across multiple categories. Because GLiNER takes the entity labels as input at inference time, it can extract the categories below without additional training.

Supported PII Entity Types

Category	Entity Types
Personal Identifiers	`person`, `username`, `password`
Contact Information	`email`, `phone_number`, `address`, `city`, `state`, `zip_code`, `country`
Government IDs	`social_security_number`, `driver_license`, `passport_number`, `tax_id`
Financial Data	`credit_card_number`, `bank_account`, `routing_number`, `iban`
Healthcare	`medical_record_number`, `health_insurance_id`
Digital Identifiers	`ip_address`, `mac_address`, `url`, `device_id`
Temporal & Biographic	`date_of_birth`, `age`, `gender`, `ethnicity`, `nationality`
Employment	`company`, `job_title`, `employee_id`

Installation

pip install gliner
pip install faker  # optional: only needed for Strategy 3 (pseudonymization)

Quick Start

from gliner import GLiNER

# Load the PII detection model
model = GLiNER.from_pretrained("knowledgator/gliner-pii-large-v1.0")

# Define PII entity types to detect
pii_labels = [
    "person", "email", "phone_number", "address",
    "social_security_number", "credit_card_number",
    "date_of_birth", "driver_license"
]

# Sample text with PII
text = """
Dear John Smith,

Thank you for your application. We have your contact information on file:
Email: john.smith@email.com, Phone: (555) 123-4567.

For verification, please confirm your SSN (123-45-6789) and
date of birth (March 15, 1985). Your application ID is APP-2024-78432.

Best regards,
HR Department
"""

# Detect PII entities
entities = model.predict_entities(text, pii_labels, threshold=0.5)

# Display detected entities
for entity in entities:
    print(f"Type: {entity['label']}")
    print(f"  Text: {entity['text']}")
    print(f"  Position: {entity['start']}-{entity['end']}")
    print(f"  Confidence: {entity['score']:.2f}")
    print()

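Output (scores and character offsets are illustrative and will vary with the model version and the exact whitespace of the input):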

Type: person
  Text: John Smith
  Position: 5-15
  Confidence: 0.98

Type: email
  Text: john.smith@email.com
  Position: 108-128
  Confidence: 0.99

Type: phone_number
  Text: (555) 123-4567
  Position: 137-151
  Confidence: 0.97

Type: social_security_number
  Text: 123-45-6789
  Position: 196-207
  Confidence: 0.96

Type: date_of_birth
  Text: March 15, 1985
  Position: 229-243
  Confidence: 0.94

Redaction Strategies

Strategy 1: Simple Masking

Replace PII with a fixed mask pattern:

def redact_with_mask(text: str, entities: list, mask: str = "[REDACTED]") -> str:
    """Replace all PII entities with a mask string."""
    # Sort entities by position (descending) to preserve indices
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    redacted = text
    for entity in sorted_entities:
        redacted = redacted[:entity['start']] + mask + redacted[entity['end']:]

    return redacted

# Apply masking
redacted_text = redact_with_mask(text, entities)
print(redacted_text)

Output:

Dear [REDACTED],

Thank you for your application. We have your contact information on file:
Email: [REDACTED], Phone: [REDACTED].

For verification, please confirm your SSN ([REDACTED]) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.
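
The descending sort matters because each replacement changes the length of the string, which would shift the offsets of every entity that follows it. A small sketch with hand-written offsets (hypothetical values, not model output) shows the difference; replace_spans is an illustrative helper, not part of GLiNER:

```python
def replace_spans(text, spans, mask="[REDACTED]", reverse=True):
    """Replace (start, end) spans with a mask, in descending or ascending order."""
    for start, end in sorted(spans, reverse=reverse):
        text = text[:start] + mask + text[end:]
    return text

sample = "Ann called Bob"
spans = [(0, 3), (11, 14)]  # hand-written offsets for "Ann" and "Bob"

right_to_left = replace_spans(sample, spans, reverse=True)
left_to_right = replace_spans(sample, spans, reverse=False)

print(right_to_left)  # [REDACTED] called [REDACTED]
print(left_to_right)  # [REDACTED] [REDACTED]led Bob  (second span hit the wrong characters)
```

Processing from the end of the string backwards leaves all earlier offsets untouched, which is why every strategy in this guide sorts with reverse=True.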

Strategy 2: Type-Aware Masking

Replace PII with type-specific placeholders:

def redact_with_type_labels(text: str, entities: list) -> str:
    """Replace PII with type-specific placeholders."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    redacted = text
    for entity in sorted_entities:
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity['start']] + placeholder + redacted[entity['end']:]

    return redacted

redacted_text = redact_with_type_labels(text, entities)
print(redacted_text)

Output:

Dear [PERSON],

Thank you for your application. We have your contact information on file:
Email: [EMAIL], Phone: [PHONE_NUMBER].

For verification, please confirm your SSN ([SOCIAL_SECURITY_NUMBER]) and
date of birth ([DATE_OF_BIRTH]). Your application ID is APP-2024-78432.

Strategy 3: Consistent Pseudonymization

Replace PII with consistent fake values (same entity always maps to same pseudonym):

import hashlib
from faker import Faker

fake = Faker()
Faker.seed(42)

def generate_pseudonym(entity_text: str, entity_type: str) -> str:
    """Generate a consistent pseudonym based on entity hash."""
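    # Derive a deterministic seed from the entity text so the same input
    # always yields the same pseudonym. MD5 is used only for stable seeding,
    # not as a security measure.
    seed = int(hashlib.md5(entity_text.encode()).hexdigest(), 16) % (10**9)
    Faker.seed(seed)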

    generators = {
        "person": fake.name,
        "email": fake.email,
        "phone_number": fake.phone_number,
        "address": fake.address,
        "social_security_number": lambda: fake.ssn(),
        "credit_card_number": fake.credit_card_number,
        "date_of_birth": lambda: fake.date_of_birth().strftime("%B %d, %Y"),
        "company": fake.company,
    }

    generator = generators.get(entity_type, lambda: f"[{entity_type.upper()}]")
    return generator()

def pseudonymize(text: str, entities: list) -> tuple[str, dict]:
    """Replace PII with consistent pseudonyms. Returns text and mapping."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    mapping = {}
    result = text

    for entity in sorted_entities:
        original = entity['text']
        if original not in mapping:
            mapping[original] = generate_pseudonym(original, entity['label'])

        pseudonym = mapping[original]
        result = result[:entity['start']] + pseudonym + result[entity['end']:]

    return result, mapping

pseudonymized_text, pii_mapping = pseudonymize(text, entities)
print(pseudonymized_text)
print("\nMapping (keep secure):", pii_mapping)
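
If the mapping is retained, the pseudonymization can be reversed later. A minimal sketch, assuming the mapping has the original-to-pseudonym shape returned by pseudonymize and that no pseudonym also occurs as ordinary text in the document; restore_originals is a hypothetical helper, and the mapping below is hand-written rather than real Faker output:

```python
def restore_originals(pseudonymized: str, mapping: dict) -> str:
    """Swap pseudonyms back to original values using an original->pseudonym map."""
    reverse = {pseudonym: original for original, pseudonym in mapping.items()}
    restored = pseudonymized
    # Replace longer pseudonyms first so a pseudonym that is a substring
    # of another one does not corrupt the longer match.
    for pseudonym in sorted(reverse, key=len, reverse=True):
        restored = restored.replace(pseudonym, reverse[pseudonym])
    return restored

# Hand-written mapping for illustration only
demo_mapping = {
    "John Smith": "Alex Carter",
    "john.smith@email.com": "acarter@example.org",
}
demo_text = "Dear Alex Carter, we will write to acarter@example.org."
print(restore_originals(demo_text, demo_mapping))
# Dear John Smith, we will write to john.smith@email.com.
```

Because this reversal is possible, the mapping is as sensitive as the original data and should be stored accordingly.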

Strategy 4: Partial Masking

Preserve partial information for usability:

def partial_mask(text: str, entities: list) -> str:
    """Apply partial masking based on entity type."""
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    result = text
    for entity in sorted_entities:
        original = entity['text']
        entity_type = entity['label']

        if entity_type == "email":
            # Show first char and domain: j***@email.com
            # partition() avoids an IndexError on malformed spans such as "@email.com"
            local, _, domain = original.partition('@')
            masked = f"{local[0]}***@{domain}" if local and domain else '***'

        elif entity_type == "phone_number":
            # Show last 4 digits: ***-***-4567
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-***-{digits[-4:]}" if len(digits) >= 4 else '***'

        elif entity_type == "social_security_number":
            # Show last 4 digits: ***-**-6789
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"***-**-{digits[-4:]}" if len(digits) >= 4 else '***'

        elif entity_type == "credit_card_number":
            # Show last 4 digits: ****-****-****-1234
            digits = ''.join(c for c in original if c.isdigit())
            masked = f"****-****-****-{digits[-4:]}" if len(digits) >= 4 else '****'

        elif entity_type == "person":
            # Show initials: J.S.
            words = original.split()
            masked = '.'.join(w[0].upper() for w in words if w) + '.'

        else:
            masked = "[REDACTED]"

        result = result[:entity['start']] + masked + result[entity['end']:]

    return result

partial_masked = partial_mask(text, entities)
print(partial_masked)

Output:

Dear J.S.,

Thank you for your application. We have your contact information on file:
Email: j***@email.com, Phone: ***-***-4567.

For verification, please confirm your SSN (***-**-6789) and
date of birth ([REDACTED]). Your application ID is APP-2024-78432.
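
Email masking can also be factored into a small standalone helper that handles malformed spans, for example a detected span with an empty local part or no domain. A sketch using a hypothetical mask_email function:

```python
def mask_email(value: str) -> str:
    """Keep the first character of the local part and the domain; mask the rest."""
    local, sep, domain = value.partition("@")
    if not sep or not local or not domain:
        # Not a well-formed address: reveal nothing
        return "***"
    return f"{local[0]}***@{domain}"

print(mask_email("john.smith@email.com"))  # j***@email.com
print(mask_email("@email.com"))            # ***
print(mask_email("not-an-email"))          # ***
```

Falling back to a full mask when the span looks malformed errs on the side of revealing less.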

Complete PII Processor Class

from gliner import GLiNER
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RedactionStrategy(Enum):
    MASK = "mask"
    TYPE_LABEL = "type_label"
    PSEUDONYMIZE = "pseudonymize"
    PARTIAL = "partial"

@dataclass
class PIIEntity:
    text: str
    label: str
    start: int
    end: int
    score: float

class PIIProcessor:
    """Complete PII detection and redaction processor."""

    DEFAULT_LABELS = [
        "person", "email", "phone_number", "address", "city", "state", "zip_code",
        "social_security_number", "driver_license", "passport_number",
        "credit_card_number", "bank_account", "date_of_birth",
        "ip_address", "username", "password", "company", "job_title"
    ]

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-pii-large-v1.0",
        labels: Optional[list[str]] = None,
        threshold: float = 0.5
    ):
        self.model = GLiNER.from_pretrained(model_name)
        self.labels = labels or self.DEFAULT_LABELS
        self.threshold = threshold

    def detect(self, text: str) -> list[PIIEntity]:
        """Detect PII entities in text."""
        raw_entities = self.model.predict_entities(
            text, self.labels, threshold=self.threshold
        )
        return [
            PIIEntity(
                text=e['text'],
                label=e['label'],
                start=e['start'],
                end=e['end'],
                score=e['score']
            )
            for e in raw_entities
        ]

    def redact(
        self,
        text: str,
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL,
        entities: Optional[list[PIIEntity]] = None
    ) -> str:
        """Detect and redact PII from text."""
        if entities is None:
            entities = self.detect(text)

        # Sort by position descending
        sorted_entities = sorted(entities, key=lambda x: x.start, reverse=True)

        result = text
        for entity in sorted_entities:
            if strategy == RedactionStrategy.MASK:
                replacement = "[REDACTED]"
            elif strategy == RedactionStrategy.TYPE_LABEL:
                replacement = f"[{entity.label.upper()}]"
            elif strategy == RedactionStrategy.PARTIAL:
                replacement = self._partial_mask(entity)
            else:
                # PSEUDONYMIZE is not implemented in this class; fall back to type labels
                replacement = f"[{entity.label.upper()}]"

            result = result[:entity.start] + replacement + result[entity.end:]

        return result

    def _partial_mask(self, entity: PIIEntity) -> str:
        """Generate partial mask for an entity."""
        text = entity.text
        label = entity.label

        if label == "email" and '@' in text:
            local, _, domain = text.partition('@')
            # Guard against malformed spans such as "@example.com"
            return f"{local[0]}***@{domain}" if local and domain else '***'
        elif label in ("phone_number", "social_security_number", "credit_card_number"):
            digits = ''.join(c for c in text if c.isdigit())
            return f"***{digits[-4:]}" if len(digits) >= 4 else '***'
        elif label == "person":
            return '.'.join(w[0].upper() for w in text.split() if w) + '.'
        else:
            return "[REDACTED]"

    def process_batch(
        self,
        documents: list[dict],
        text_field: str = "text",
        strategy: RedactionStrategy = RedactionStrategy.TYPE_LABEL
    ) -> list[dict]:
        """Process multiple documents."""
        results = []
        for doc in documents:
            entities = self.detect(doc[text_field])
            redacted = self.redact(doc[text_field], strategy, entities)
            results.append({
                **doc,
                "redacted_text": redacted,
                "pii_count": len(entities),
                "pii_types": list(set(e.label for e in entities))
            })
        return results

    def get_report(self, text: str) -> dict:
        """Generate a PII detection report."""
        entities = self.detect(text)

        by_type = {}
        for entity in entities:
            if entity.label not in by_type:
                by_type[entity.label] = []
            by_type[entity.label].append({
                "text": entity.text,
                "position": f"{entity.start}-{entity.end}",
                "confidence": round(entity.score, 3)
            })

        return {
            "total_pii_found": len(entities),
            "pii_by_type": by_type,
            "risk_level": self._assess_risk(entities)
        }

    def _assess_risk(self, entities: list[PIIEntity]) -> str:
        """Assess risk level based on PII types found."""
        high_risk = {"social_security_number", "credit_card_number", "bank_account",
                     "driver_license", "passport_number", "password"}
        medium_risk = {"date_of_birth", "address", "phone_number", "medical_record_number"}

        found_types = set(e.label for e in entities)

        if found_types & high_risk:
            return "HIGH"
        elif found_types & medium_risk:
            return "MEDIUM"
        elif found_types:
            return "LOW"
        return "NONE"
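
The slicing in redact assumes that detected spans do not overlap. If a model or label configuration ever returned overlapping spans, the replacements would corrupt each other. One possible guard, sketched under the assumption that keeping the highest-scoring span is acceptable; drop_overlaps is a hypothetical helper that works on the raw dicts returned by predict_entities and would need attribute access instead of key access for PIIEntity objects:

```python
def drop_overlaps(entities: list) -> list:
    """Keep the highest-scoring entity among any set of overlapping spans."""
    kept = []
    # Consider high-confidence entities first
    for entity in sorted(entities, key=lambda e: e["score"], reverse=True):
        overlaps = any(
            entity["start"] < k["end"] and k["start"] < entity["end"]
            for k in kept
        )
        if not overlaps:
            kept.append(entity)
    # Return survivors in document order
    return sorted(kept, key=lambda e: e["start"])

# Hand-written entities for illustration
demo = [
    {"text": "123 Main St", "label": "address", "start": 10, "end": 21, "score": 0.80},
    {"text": "123 Main St, Boston", "label": "address", "start": 10, "end": 29, "score": 0.91},
    {"text": "Jane", "label": "person", "start": 0, "end": 4, "score": 0.95},
]
print([e["text"] for e in drop_overlaps(demo)])  # ['Jane', '123 Main St, Boston']
```

For redaction, a more conservative alternative would be to merge overlapping spans into their union so that nothing detected is left unmasked.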

Usage Examples

Basic Usage

processor = PIIProcessor()

text = """
Contact Jane Doe at jane.doe@company.com or call 555-867-5309.
Her SSN is 987-65-4321 and she lives at 123 Main St, Boston, MA 02101.
"""

# Detect PII
entities = processor.detect(text)
print(f"Found {len(entities)} PII entities")

# Redact with different strategies
print("\n--- Type Labels ---")
print(processor.redact(text, RedactionStrategy.TYPE_LABEL))

print("\n--- Partial Masking ---")
print(processor.redact(text, RedactionStrategy.PARTIAL))

Batch Processing

documents = [
    {"id": 1, "text": "Patient John Smith, DOB: 01/15/1980, SSN: 123-45-6789"},
    {"id": 2, "text": "Contact: mary@email.com, Phone: (555) 123-4567"},
    {"id": 3, "text": "Card ending 4532, Account holder: Bob Johnson"},
]

processor = PIIProcessor(threshold=0.4)
results = processor.process_batch(documents)

for doc in results:
    print(f"Document {doc['id']}:")
    print(f"  PII found: {doc['pii_count']} ({', '.join(doc['pii_types'])})")
    print(f"  Redacted: {doc['redacted_text']}")
    print()
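
Long documents may exceed the amount of text the model can process in a single call, in which case content near the end could go unscanned. One way to handle this is to run detection on chunks and shift the offsets back into the coordinates of the full document. A minimal sketch, assuming a detect_fn callable that returns dicts with start and end keys, as predict_entities does; the helper name, the character-based window, and the regex stand-in detector used for the demo are all illustrative assumptions:

```python
import re

def detect_in_chunks(text: str, detect_fn, max_chars: int = 1000) -> list:
    """Run detect_fn on line-aligned chunks and return document-level offsets."""
    entities = []
    chunk_start = 0
    while chunk_start < len(text):
        chunk_end = min(chunk_start + max_chars, len(text))
        if chunk_end < len(text):
            # Prefer to cut just after the last newline inside the window
            cut = text.rfind("\n", chunk_start, chunk_end)
            if cut > chunk_start:
                chunk_end = cut + 1
        chunk = text[chunk_start:chunk_end]
        for e in detect_fn(chunk):
            # Shift chunk-relative offsets into full-document coordinates
            entities.append({**e, "start": e["start"] + chunk_start,
                             "end": e["end"] + chunk_start})
        chunk_start = chunk_end
    return entities

def fake_detect(chunk: str) -> list:
    """Stand-in for model.predict_entities: flags SSN-shaped strings."""
    return [{"text": m.group(), "label": "social_security_number",
             "start": m.start(), "end": m.end(), "score": 1.0}
            for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", chunk)]

doc = "line one\nSSN 123-45-6789\n" * 3
found = detect_in_chunks(doc, fake_detect, max_chars=30)
print(len(found))  # 3
print(all(doc[e["start"]:e["end"]] == e["text"] for e in found))  # True
```

With the real model, detect_fn could be a lambda that wraps model.predict_entities with the desired labels and threshold. A line longer than max_chars is still cut mid-line, so an entity could straddle a boundary; overlapping windows would be one way to reduce that risk.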

Generate PII Report

report = processor.get_report(text)

print(f"Total PII found: {report['total_pii_found']}")
print(f"Risk level: {report['risk_level']}")
print("\nBreakdown by type:")
for pii_type, instances in report['pii_by_type'].items():
    print(f"  {pii_type}: {len(instances)} instance(s)")

Validation and Quality Assurance

Verify Redaction Completeness

def validate_redaction(original: str, redacted: str, entities: list[PIIEntity]) -> dict:
    """Verify that all PII was properly redacted."""
    issues = []

    for entity in entities:
        if entity.text in redacted:
            issues.append({
                "type": entity.label,
                "text": entity.text,
                "issue": "PII still present in redacted text"
            })

    return {
        "is_valid": len(issues) == 0,
        "issues": issues,
        "entities_processed": len(entities)
    }

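# Validate: redact and check against the same text and entities
redacted_text = processor.redact(text, RedactionStrategy.TYPE_LABEL, entities)
validation = validate_redaction(text, redacted_text, entities)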
if not validation['is_valid']:
    print("WARNING: Redaction incomplete!")
    for issue in validation['issues']:
        print(f"  - {issue['type']}: '{issue['text']}'")
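
The check above only covers entities the model detected, so it cannot flag PII the model missed. A pattern-based backstop can independently catch some well-structured formats. A rough sketch, in which the patterns are simplified assumptions (US-style SSNs and basic email shapes) rather than complete validators, and find_residual_pii is a hypothetical helper:

```python
import re

# Simplified patterns for illustration; real-world formats vary widely
BACKSTOP_PATTERNS = {
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def find_residual_pii(redacted: str) -> list:
    """Return (label, match) pairs for PII-shaped strings left in redacted text."""
    residual = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for match in pattern.finditer(redacted):
            residual.append((label, match.group()))
    return residual

clean = "Email: [EMAIL], SSN ([SOCIAL_SECURITY_NUMBER])"
leaky = "Email: [EMAIL], backup SSN 987-65-4321"
print(find_residual_pii(clean))  # []
print(find_residual_pii(leaky))  # [('social_security_number', '987-65-4321')]
```

A non-empty result would be a signal to lower the detection threshold or route the document for manual review.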

Double-Pass Detection

def double_pass_redaction(processor: PIIProcessor, text: str) -> str:
    """Run detection twice to catch any missed PII."""
    # First pass
    redacted = processor.redact(text, RedactionStrategy.TYPE_LABEL)

    # Second pass on redacted text (catches edge cases)
    final = processor.redact(redacted, RedactionStrategy.TYPE_LABEL)

    return final

Configuration Options

Adjusting Detection Threshold

# High precision (fewer false positives)
strict_processor = PIIProcessor(threshold=0.7)

# High recall (catch more PII, may have false positives)
sensitive_processor = PIIProcessor(threshold=0.3)
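
A single global threshold trades precision against recall for every label at once. One alternative is to detect once at a permissive threshold and then filter per label, so that high-risk types keep a low bar while noisier types require more confidence. A sketch with hypothetical threshold values and hand-written entities standing in for predict_entities output:

```python
def filter_by_label_threshold(entities: list, thresholds: dict, default: float = 0.5) -> list:
    """Keep entities whose score meets the threshold configured for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]

# Hypothetical per-label thresholds: lenient for SSNs, strict for company names
label_thresholds = {"social_security_number": 0.3, "company": 0.7}

# Hand-written entities for illustration
candidates = [
    {"text": "123-45-6789", "label": "social_security_number", "score": 0.35},
    {"text": "Acme", "label": "company", "score": 0.55},
    {"text": "Jane Doe", "label": "person", "score": 0.62},
]
kept = filter_by_label_threshold(candidates, label_thresholds)
print([e["text"] for e in kept])  # ['123-45-6789', 'Jane Doe']
```

Labels without an explicit entry fall back to the default, so the filter can be introduced gradually.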

Custom Entity Types

# Focus on specific PII categories
financial_processor = PIIProcessor(
    labels=["credit_card_number", "bank_account", "routing_number", "iban"]
)

healthcare_processor = PIIProcessor(
    labels=["person", "date_of_birth", "medical_record_number",
            "health_insurance_id", "social_security_number"]
)

Best Practices

Set appropriate thresholds: Start with 0.5, lower it (0.3-0.4) when missing PII is costlier than over-redacting, and raise it (0.6-0.7) for precision-critical applications
Use type-aware redaction: Type labels ([EMAIL], [SOCIAL_SECURITY_NUMBER]) preserve more context for readers and downstream tools than generic masks
Validate redaction output: Always verify that PII was successfully removed, especially for compliance requirements
Consider partial masking for usability: Use it when recipients need some context (e.g., the last 4 digits of an SSN for verification)
Log PII detection, not PII values: Track which types were found, not the actual sensitive data
Handle edge cases: Test with varied formats (international phone numbers, different date formats, etc.)
Secure pseudonymization mappings: If using reversible pseudonymization, protect the mapping file with the same security as the original PII
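
The advice to log detections rather than values can be sketched as a small helper that reduces entities to per-type counts before anything reaches the log. The function name and logger name are illustrative assumptions, and the entities are hand-written:

```python
import logging
from collections import Counter

logger = logging.getLogger("pii_audit")  # hypothetical logger name

def summarize_for_log(entities: list) -> dict:
    """Reduce entities to per-label counts so no raw PII reaches the logs."""
    return dict(Counter(e["label"] for e in entities))

# Hand-written entities for illustration
detected = [
    {"text": "Jane Doe", "label": "person"},
    {"text": "jane.doe@company.com", "label": "email"},
    {"text": "John Roe", "label": "person"},
]
summary = summarize_for_log(detected)
logger.info("PII detected: %s", summary)
print(summary)  # {'person': 2, 'email': 1}
```

Only the label counts are passed to the logger; the text field never leaves the function.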

Next Steps

Named Entity Recognition Guide — Understand the underlying NER technology
Social Media Categorization — Apply local classification to social media content
Biomedical Entity Extraction — Extract medical entities with privacy awareness
