How to Link Medical Entities to Hospital Databases

Connect extracted medical entities to your hospital's internal databases and terminology systems for consistent coding and record matching.

Overview

This cookbook demonstrates using entity linking to disambiguate medical terms extracted from clinical text and map them to entries in your hospital's master data—including patient records, medication formularies, and procedure catalogs.

What You'll Learn

  • Set up entity linking with custom medical knowledge bases
  • Handle medical synonyms and abbreviations
  • Link to internal hospital databases
  • Resolve ambiguous mentions using clinical context
  • Deduplicate records by matching against existing entries

Prerequisites

  • Python 3.8+
  • Access to hospital knowledge bases or terminology servers
  • Sample clinical text with entities to link

Use Cases

  • EHR data normalization
  • Medication reconciliation
  • Clinical record deduplication
  • Clinical research data harmonization

The GLinker Pipeline Approach

GLinker provides a layered pipeline architecture for entity extraction and linking. Each layer handles a specific task in the entity resolution process:

| Layer | Purpose | Description |
|-------|---------|-------------|
| L1 | Mention extraction | Zero-shot NER using GLiNER |
| L2 | Candidate retrieval | Dictionary lookup with exact and fuzzy matching |
| L3 | Entity disambiguation | Entity linking via GLiNER linker model |
| L0 | Aggregation | Filtering, confidence thresholds, and final output |

Models Used

  • L1 NER: knowledgator/gliner-bi-edge-v2.0 — Zero-shot entity extraction
  • L3 Linking: knowledgator/gliner-linker-base-v1.0 — Entity disambiguation

Setting Up Your Database

First, create a knowledge base file in JSONL format. Each record should include:

  • A unique identifier
  • The canonical entity name
  • Entity type
  • Aliases for name variation handling

Mock Data Format

Each line in the JSONL file represents a single entity record with the following structure:

| Field | Type | Description |
|-------|------|-------------|
| entity_id | string | Unique identifier for the entity (e.g., P001 for patients, D001 for doctors) |
| label | string | Canonical/official name of the entity |
| type | string | Entity category matching your NER labels (e.g., Patient, Doctor, Disease) |
| aliases | array | List of alternative names, abbreviations, and common misspellings |

Example mock_db.jsonl:

mock_data = """{"entity_id": "P001", "label": "John Doe", "type": "Patient", "aliases": ["Jon Doe", "John H Doe"]}
{"entity_id": "P002", "label": "Sarah Connor", "type": "Patient", "aliases": ["S. Connor"]}
{"entity_id": "D001", "label": "Dr. Gregory House", "type": "Doctor", "aliases": ["Dr. House", "Gregory House"]}
{"entity_id": "D002", "label": "Dr. Stephen Strange", "type": "Doctor", "aliases": ["Dr. Strange"]}
{"entity_id": "C001", "label": "Diabetes Mellitus", "type": "Disease", "aliases": ["diabetes", "DM"]}
{"entity_id": "C002", "label": "Hypertension", "type": "Disease", "aliases": ["high blood pressure", "HTN"]}
"""

with open("mock_db.jsonl", "w", encoding="utf-8") as f:
    f.write(mock_data)

print("mock_db.jsonl created successfully!")
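
Before loading a knowledge base file, it can help to confirm that every line parses and carries the four fields described above. The following is a minimal sketch using only the standard library; `validate_kb_lines` and `REQUIRED_FIELDS` are illustrative names, not part of GLinker.

```python
import json

# Field names come from the record format above; the type checks are an assumption.
REQUIRED_FIELDS = {"entity_id": str, "label": str, "type": str, "aliases": list}

def validate_kb_lines(lines):
    """Return (records, errors) for an iterable of JSONL lines."""
    records, errors, seen_ids = [], [], set()
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        problems = [
            name for name, expected in REQUIRED_FIELDS.items()
            if not isinstance(record.get(name), expected)
        ]
        if problems:
            errors.append(f"line {lineno}: missing or mistyped fields {problems}")
        elif record["entity_id"] in seen_ids:
            errors.append(f"line {lineno}: duplicate entity_id {record['entity_id']}")
        else:
            seen_ids.add(record["entity_id"])
            records.append(record)
    return records, errors

sample = [
    '{"entity_id": "P001", "label": "John Doe", "type": "Patient", "aliases": ["Jon Doe"]}',
    '{"entity_id": "P001", "label": "Duplicate", "type": "Patient", "aliases": []}',
    '{"entity_id": "C001", "label": "Diabetes Mellitus", "type": "Disease"}',
]
records, errors = validate_kb_lines(sample)
print(len(records), "valid;", len(errors), "problems")
for message in errors:
    print(message)
```

To check a file on disk, pass the open file object instead of the sample list, since iterating over a file yields its lines.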

Database Contents by Entity Type

| Type | ID | Canonical Name | Aliases |
|------|----|----------------|---------|
| Patient | P001 | John Doe | Jon Doe, John H Doe |
| Patient | P002 | Sarah Connor | S. Connor |
| Doctor | D001 | Dr. Gregory House | Dr. House, Gregory House |
| Doctor | D002 | Dr. Stephen Strange | Dr. Strange |
| Disease | C001 | Diabetes Mellitus | diabetes, DM |
| Disease | C002 | Hypertension | high blood pressure, HTN |

Alias Best Practices

Include these variations in your aliases:

  • Typos: Common misspellings (Jon for John)
  • Abbreviations: Standard medical abbreviations (DM for Diabetes Mellitus, HTN for Hypertension)
  • Informal names: Colloquial terms (high blood pressure for Hypertension)
  • Name variations: Middle initials, titles, shortened forms (Dr. House for Dr. Gregory House)
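
Some of these variations can be drafted mechanically and then reviewed by a person. The sketch below proposes alias candidates for person names by dropping titles and middle names and by abbreviating the first name. `suggest_person_aliases` is a hypothetical helper, and its rules are assumptions chosen for illustration; it does not cover typos or medical abbreviations.

```python
def suggest_person_aliases(full_name, titles=("Dr.",)):
    """Suggest alias candidates for a person name; a human should review them."""
    parts = full_name.split()
    title = parts[0] if parts and parts[0] in titles else None
    names = parts[1:] if title else parts
    if len(names) < 2:
        return []  # nothing sensible to derive from a single token
    first, last = names[0], names[-1]
    candidates = [
        f"{first} {last}",      # drop title and middle names
        f"{first[0]}. {last}",  # first initial plus surname
    ]
    if title:
        candidates.append(f"{title} {last}")  # title plus surname
    # Keep order, drop duplicates and the canonical name itself
    seen, aliases = {full_name}, []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            aliases.append(candidate)
    return aliases

print(suggest_person_aliases("Dr. Gregory House"))
print(suggest_person_aliases("Sarah Connor"))
```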

Installation

pip install glinker

Building the Pipeline

Step 1: Import and Initialize

from glinker import ConfigBuilder, DAGExecutor, DAGPipeline

Step 2: Configure the Pipeline

Use ConfigBuilder to define all four layers in a single configuration:

builder = ConfigBuilder(name="clinical_db_pipeline")

# Set schema template to use only labels (not descriptions) for L3 matching
builder.set_schema_template("{label}")

# L1: Zero-Shot NER — extract mentions with custom medical entity types
builder.l1.gliner(
    model="knowledgator/gliner-bi-edge-v2.0",
    labels=["Patient", "Doctor", "Disease", "Symptom"],
    threshold=0.2
)

# L2 (dictionary lookup): candidate generation with exact and fuzzy matching
builder.l2.add(
    "dict",
    priority=0,
    search_mode=["exact", "fuzzy"],
    fuzzy={"max_distance": 2, "min_similarity": 0.6}
)

# L3 (entity linking): disambiguate candidates using context
builder.l3.configure(
    model="knowledgator/gliner-linker-base-v1.0",
    threshold=0.3,
    device="cpu",
    max_length=512
)

# L0 (aggregation): filter results and include unlinked entities
builder.l0.configure(
    min_confidence=0.4,
    include_unlinked=True  # Include unlinked entities to detect new records
)
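
To build intuition for the L2 fuzzy settings, the sketch below shows one common way a maximum edit distance and a minimum similarity can filter candidates. It assumes Levenshtein distance and a length-normalized similarity; how GLinker computes these values internally is not stated here, so treat this as an illustration of the idea, not as the library's implementation. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def fuzzy_candidates(mention, names, max_distance=2, min_similarity=0.6):
    """Return (name, distance, similarity) for names close enough to the mention."""
    mention_norm = mention.lower()
    hits = []
    for name in names:
        distance = edit_distance(mention_norm, name.lower())
        # Assumed similarity: 1 minus distance divided by the longer length
        similarity = 1 - distance / max(len(mention_norm), len(name), 1)
        if distance <= max_distance and similarity >= min_similarity:
            hits.append((name, distance, round(similarity, 2)))
    return sorted(hits, key=lambda hit: hit[1])

names = ["John Doe", "Sarah Connor", "Hypertension", "Diabetes Mellitus"]
print(fuzzy_candidates("Jonh Doe", names))      # swapped letters
print(fuzzy_candidates("Hypertention", names))  # one wrong letter
print(fuzzy_candidates("Arrhythmia", names))    # nothing close enough
```

Under these assumptions a larger `max_distance` admits more misspellings but also more unrelated short names, which is why a similarity floor is applied alongside it.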

Step 3: Build and Load the Database

config = builder.get_config()
pipeline = DAGPipeline(**config)
executor = DAGExecutor(pipeline)

# Load existing hospital records into the pipeline
MOCK_DB_PATH = "mock_db.jsonl"
executor.load_entities(MOCK_DB_PATH, target_layers=['dict'])
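
In a real deployment the JSONL file would usually be exported from hospital master data instead of being written by hand. The sketch below shows one possible export from a SQL table into the record format used above. The `patients` table, its column names, and the pipe-separated alias column are all hypothetical; adapt the query to your own schema. An in-memory SQLite database stands in for the hospital system.

```python
import io
import json
import sqlite3

def export_patients(conn, out):
    """Write one knowledge-base record per patient row; return the row count."""
    rows = conn.execute(
        "SELECT patient_id, full_name, known_aliases FROM patients ORDER BY patient_id"
    )
    count = 0
    for patient_id, full_name, known_aliases in rows:
        record = {
            "entity_id": patient_id,
            "label": full_name,
            "type": "Patient",
            # Hypothetical storage: aliases kept as a pipe-separated string
            "aliases": [a for a in (known_aliases or "").split("|") if a],
        }
        out.write(json.dumps(record) + "\n")
        count += 1
    return count

# Stand-in database with the two patients from the mock data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id TEXT PRIMARY KEY, full_name TEXT, known_aliases TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("P001", "John Doe", "Jon Doe|John H Doe"), ("P002", "Sarah Connor", "S. Connor")],
)

buffer = io.StringIO()
exported = export_patients(conn, buffer)
print(exported, "records exported")
print(buffer.getvalue(), end="")
```

Passing an open file instead of the `StringIO` buffer produces a file that can be loaded with `executor.load_entities` as shown in Step 3.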

Step 4: Process Clinical Notes

note = "Dr. House checked patient Jon Doe who complained of high blood pressure."

context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]

    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label

        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"[EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
            print(" -> Action: SKIP INSERTION (Record exists)")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
            # Doctors and patients have their own tables; every other type falls through to DISEASES
            table = {"Doctor": "DOCTORS", "Patient": "PATIENTS"}.get(entity_type, "DISEASES")
            print(f" -> Action: INSERT into {table} table")

Expected Output:

[EXISTING] Matched 'Dr. House' (Doctor) -> ID: D001 (Dr. Gregory House)
 -> Action: SKIP INSERTION (Record exists)
[EXISTING] Matched 'Jon Doe' (Patient) -> ID: P001 (John Doe)
 -> Action: SKIP INSERTION (Record exists)
[EXISTING] Matched 'high blood pressure' (Disease) -> ID: C002 (Hypertension)
 -> Action: SKIP INSERTION (Record exists)

Step 5: Detect New Entities

Process a note containing entities not in the database:

note = "Referral: Dr. Meredith Grey examining new patient Jane Smith for possible Arrhythmia."

context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]

    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label

        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"[EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")

Expected Output:

[NEW RECORD] 'Dr. Meredith Grey' (Doctor) -> Insert into Database?
[NEW RECORD] 'Jane Smith' (Patient) -> Insert into Database?
[NEW RECORD] 'Arrhythmia' (Disease) -> Insert into Database?
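
Once a reviewer approves a new entity, it needs a record in the knowledge base file. The sketch below builds such a record and assigns the next free identifier for a prefix. It assumes the ID convention of the mock data (a letter prefix followed by a three-digit number); `next_entity_id` and `new_record_line` are hypothetical helpers.

```python
import json
import re

def next_entity_id(existing_ids, prefix):
    """Return the next free ID for a prefix, e.g. P003 after P001 and P002."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbers = [int(m.group(1)) for m in map(pattern.match, existing_ids) if m]
    return f"{prefix}{max(numbers, default=0) + 1:03d}"

def new_record_line(existing_ids, prefix, label, entity_type, aliases=()):
    """Build one JSONL line in the knowledge-base format."""
    record = {
        "entity_id": next_entity_id(existing_ids, prefix),
        "label": label,
        "type": entity_type,
        "aliases": list(aliases),
    }
    return json.dumps(record)

existing = ["P001", "P002", "D001", "D002", "C001", "C002"]
line = new_record_line(existing, "P", "Jane Smith", "Patient")
print(line)
```

Appending the line to the JSONL file and reloading the entities makes the new record available for later notes.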

Step 6: Handle Name Variations

The pipeline handles name variations through aliases, fuzzy matching, and contextual disambiguation:

note = "Patient John H Doe returned for follow-up appointment."

context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]

    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label

        # Skip generic entity type names extracted as mentions
        if entity_text.lower() == entity_type.lower():
            continue

        if ent.is_linked:
            link = ent.linked_entity
            print(f"[EXISTING] Matched '{entity_text}' -> ID: {link.entity_id} ({link.label})")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type})")

Expected Output:

[EXISTING] Matched 'John H Doe' -> ID: P001 (John Doe)
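
Steps 4 to 6 repeat the same loop, so it may be convenient to wrap it in one function that returns plain dictionaries. The sketch below uses only the attributes shown in the steps above (`mention_text`, `label`, `is_linked`, `linked_entity`). `link_note` is a hypothetical helper, and `FakeExecutor` is a stand-in so the example runs without downloading models; in practice, pass the real executor from Step 3.

```python
from types import SimpleNamespace

def link_note(executor, note):
    """Run one note through the pipeline and return one dict per mention."""
    context = executor.execute({"texts": [note]})
    results = context.data.get("l0_result")
    if not results or not results.entities:
        return []
    rows = []
    for ent in results.entities[0]:
        # Skip generic entity type names extracted as mentions (e.g. "Patient")
        if ent.mention_text.lower() == ent.label.lower():
            continue
        link = ent.linked_entity if ent.is_linked else None
        rows.append({
            "mention": ent.mention_text,
            "type": ent.label,
            "entity_id": link.entity_id if link else None,
            "canonical": link.label if link else None,
        })
    return rows

class FakeExecutor:
    """Stand-in that mimics the result shape used above; not part of GLinker."""
    def execute(self, payload):
        entities = [
            SimpleNamespace(mention_text="Patient", label="Patient",
                            is_linked=False, linked_entity=None),
            SimpleNamespace(mention_text="John H Doe", label="Patient",
                            is_linked=True,
                            linked_entity=SimpleNamespace(entity_id="P001", label="John Doe")),
        ]
        return SimpleNamespace(data={"l0_result": SimpleNamespace(entities=[entities])})

rows = link_note(FakeExecutor(), "Patient John H Doe returned for follow-up appointment.")
print(rows)
```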

Key Features

Zero-Shot Learning

GLinker uses GLiNER for zero-shot entity extraction, meaning you can define custom entity types without training data:

# Define any entity types relevant to your use case
builder.l1.gliner(
    model="knowledgator/gliner-bi-edge-v2.0",
    labels=["Patient", "Doctor", "Disease", "Symptom", "Medication", "Procedure"],
    threshold=0.2
)

Name Variation Handling

The pipeline automatically handles common name variations through multiple mechanisms:

| Input Mention | Resolved Entity | Method |
|---------------|-----------------|--------|
| "Jon Doe" | John Doe | Database alias |
| "John H Doe" | John Doe | Database alias |
| "Dr. House" | Dr. Gregory House | Database alias |
| "high blood pressure" | Hypertension | L2 fuzzy + L3 linking |

Deduplication Logic

The pipeline distinguishes between existing and new entities using is_linked:

for ent in entities:
    if ent.is_linked:
        # Entity found in database: skip insertion, use existing ID
        db_id = ent.linked_entity.entity_id
        print(f"Using existing record: {db_id}")
    else:
        # New entity: flag for insertion
        print(f"New entity detected: {ent.mention_text}")
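
A note can mention the same entity several times, and a new entity may appear with different capitalization. The sketch below extends the loop above so that each existing ID is reused once and each new mention is queued for review once. `plan_actions` and the `mention` helper are hypothetical, and the stand-in objects only mimic the attributes used in this cookbook.

```python
from types import SimpleNamespace

def plan_actions(entities):
    """Split mentions into existing IDs to reuse and new mentions to review."""
    reuse_ids, review, seen_new = [], [], set()
    for ent in entities:
        if ent.is_linked:
            entity_id = ent.linked_entity.entity_id
            if entity_id not in reuse_ids:
                reuse_ids.append(entity_id)
        else:
            # Case-insensitive key so "Jane Smith" and "jane smith" collapse
            key = (ent.label, ent.mention_text.casefold())
            if key not in seen_new:
                seen_new.add(key)
                review.append((ent.label, ent.mention_text))
    return reuse_ids, review

def mention(text, label, entity_id=None):
    """Build a stand-in entity object for the demo (not a GLinker class)."""
    linked = SimpleNamespace(entity_id=entity_id) if entity_id else None
    return SimpleNamespace(mention_text=text, label=label,
                           is_linked=entity_id is not None, linked_entity=linked)

entities = [
    mention("Dr. House", "Doctor", "D001"),
    mention("Gregory House", "Doctor", "D001"),
    mention("Jane Smith", "Patient"),
    mention("jane smith", "Patient"),
]
reuse_ids, review = plan_actions(entities)
print(reuse_ids)
print(review)
```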

Best Practices

  1. Maintain comprehensive aliases: Add common misspellings, abbreviations, and variations to your database aliases
  2. Set appropriate confidence thresholds: Lower thresholds catch more matches but may introduce false positives
  3. Review new entities regularly: Entities flagged as "new" should be reviewed before database insertion
  4. Use context for disambiguation: When multiple candidates match, L3 uses surrounding text to disambiguate
  5. Enable include_unlinked: Set this to True in L0 to detect new records that need to be added to your database
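
The trade-off in point 2 can be made concrete by scoring a small, manually labeled sample at several thresholds. The sketch below computes precision and recall for a list of (confidence, is_correct) pairs. The scores are made up for illustration and do not come from any model; `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored, threshold):
    """scored: list of (confidence, is_correct) pairs for candidate links."""
    accepted = [ok for score, ok in scored if score >= threshold]
    total_correct = sum(ok for _, ok in scored)
    true_positives = sum(accepted)
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct if total_correct else 1.0
    return precision, recall

# Illustrative, made-up scores: replace with links you have labeled by hand
scored = [
    (0.95, True), (0.80, True), (0.55, True),
    (0.45, False), (0.35, True), (0.25, False),
]
for threshold in (0.3, 0.4, 0.6):
    precision, recall = precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample, raising the threshold from 0.3 to 0.6 removes the false positive but also drops half of the correct links, which is the pattern to look for when tuning `min_confidence` on your own data.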

Next Steps