How to Link Medical Entities to Hospital Databases
Connect extracted medical entities to your hospital's internal databases and terminology systems for consistent coding and record matching.
Overview
This cookbook demonstrates using entity linking to disambiguate medical terms extracted from clinical text and map them to entries in your hospital's master data—including patient records, medication formularies, and procedure catalogs.
What You'll Learn
- Set up entity linking with custom medical knowledge bases
- Handle medical synonyms and abbreviations
- Link to internal hospital databases
- Resolve ambiguous mentions using clinical context
- Deduplicate records by matching against existing entries
Prerequisites
- Python 3.8+
- Access to hospital knowledge bases or terminology servers
- Sample clinical text with entities to link
Use Cases
- EHR data normalization
- Medication reconciliation
- Clinical record deduplication
- Clinical research data harmonization
The GLinker Pipeline Approach
GLinker provides a layered pipeline architecture for entity extraction and linking. Each layer handles a specific task in the entity resolution process:
| Layer | Purpose | Description |
|---|---|---|
| L1 | Mention extraction | Zero-shot NER using GLiNER |
| L2 | Candidate retrieval | Dictionary lookup with exact and fuzzy matching |
| L3 | Entity disambiguation | Entity linking via GLiNER linker model |
| L0 | Aggregation | Filtering, confidence thresholds, and final output |
Models Used
- L1 NER: knowledgator/gliner-bi-edge-v2.0 (zero-shot entity extraction)
- L3 Linking: knowledgator/gliner-linker-base-v1.0 (entity disambiguation)
Setting Up Your Database
First, create a knowledge base file in JSONL format. Each record should include:
- A unique identifier
- The canonical entity name
- Entity type
- Aliases for name variation handling
Mock Data Format
Each line in the JSONL file represents a single entity record with the following structure:
| Field | Type | Description |
|---|---|---|
| entity_id | string | Unique identifier for the entity (e.g., P001 for patients, D001 for doctors) |
| label | string | Canonical/official name of the entity |
| type | string | Entity category matching your NER labels (e.g., Patient, Doctor, Disease) |
| aliases | array | List of alternative names, abbreviations, and common misspellings |
Example mock_db.jsonl:
mock_data = """{"entity_id": "P001", "label": "John Doe", "type": "Patient", "aliases": ["Jon Doe", "John H Doe"]}
{"entity_id": "P002", "label": "Sarah Connor", "type": "Patient", "aliases": ["S. Connor"]}
{"entity_id": "D001", "label": "Dr. Gregory House", "type": "Doctor", "aliases": ["Dr. House", "Gregory House"]}
{"entity_id": "D002", "label": "Dr. Stephen Strange", "type": "Doctor", "aliases": ["Dr. Strange"]}
{"entity_id": "C001", "label": "Diabetes Mellitus", "type": "Disease", "aliases": ["diabetes", "DM"]}
{"entity_id": "C002", "label": "Hypertension", "type": "Disease", "aliases": ["high blood pressure", "HTN"]}
"""
# Write one JSON record per line
with open("mock_db.jsonl", "w", encoding="utf-8") as f:
    f.write(mock_data)
print("mock_db.jsonl created successfully!")
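Before loading a knowledge base file, it can help to sanity-check it. The following is a minimal sketch using only the standard library. The helper name validate_kb_lines is hypothetical and not part of GLinker; it checks the four fields from the table above and flags duplicate IDs.

```python
import json

# Field names and types taken from the mock data format table
REQUIRED_FIELDS = {"entity_id": str, "label": str, "type": str, "aliases": list}

def validate_kb_lines(lines):
    """Return a list of human-readable problems found in JSONL records."""
    problems = []
    seen_ids = set()
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: record is not a JSON object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in record:
                problems.append(f"line {line_no}: missing field '{field}'")
            elif not isinstance(record[field], expected):
                problems.append(
                    f"line {line_no}: field '{field}' should be {expected.__name__}"
                )
        entity_id = record.get("entity_id")
        if isinstance(entity_id, str):
            if entity_id in seen_ids:
                problems.append(f"line {line_no}: duplicate entity_id '{entity_id}'")
            seen_ids.add(entity_id)
    return problems

# Deliberately broken sample records (illustrative only)
sample = [
    '{"entity_id": "P001", "label": "John Doe", "type": "Patient", "aliases": ["Jon Doe"]}',
    '{"entity_id": "P001", "label": "Sarah Connor", "type": "Patient", "aliases": "S. Connor"}',
    '{"entity_id": "C001", "label": "Diabetes Mellitus", "aliases": ["DM"]}',
]
problems = validate_kb_lines(sample)
for problem in problems:
    print(problem)
```

To check the real file, pass an open file handle instead of the sample list, since iterating a text file yields its lines.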
Database Contents by Entity Type
| Type | ID | Canonical Name | Aliases |
|---|---|---|---|
| Patient | P001 | John Doe | Jon Doe, John H Doe |
| Patient | P002 | Sarah Connor | S. Connor |
| Doctor | D001 | Dr. Gregory House | Dr. House, Gregory House |
| Doctor | D002 | Dr. Stephen Strange | Dr. Strange |
| Disease | C001 | Diabetes Mellitus | diabetes, DM |
| Disease | C002 | Hypertension | high blood pressure, HTN |
Include these variations in your aliases:
- Typos: Common misspellings (Jon for John)
- Abbreviations: Standard medical abbreviations (DM for Diabetes Mellitus, HTN for Hypertension)
- Informal names: Colloquial terms (high blood pressure for Hypertension)
- Name variations: Middle initials, titles, shortened forms (Dr. House for Dr. Gregory House)
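Short abbreviations can point at more than one entity, so it may be worth auditing aliases before loading them. The sketch below builds a normalized alias index and reports surface forms shared by several IDs. The helpers normalize, build_alias_index, and find_collisions are hypothetical, and the C003 record is invented purely to trigger a collision; none of this is GLinker functionality.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def build_alias_index(records):
    """Map each normalized surface form to the set of entity IDs that use it."""
    index = defaultdict(set)
    for rec in records:
        for name in [rec["label"], *rec.get("aliases", [])]:
            index[normalize(name)].add(rec["entity_id"])
    return index

def find_collisions(index):
    """Return surface forms that resolve to more than one entity ID."""
    return {form: sorted(ids) for form, ids in index.items() if len(ids) > 1}

records = [
    {"entity_id": "C001", "label": "Diabetes Mellitus", "type": "Disease",
     "aliases": ["diabetes", "DM"]},
    {"entity_id": "C002", "label": "Hypertension", "type": "Disease",
     "aliases": ["high blood pressure", "HTN"]},
    # Hypothetical extra record, added only to demonstrate a collision on "DM"
    {"entity_id": "C003", "label": "Dermatomyositis", "type": "Disease",
     "aliases": ["DM"]},
]

index = build_alias_index(records)
collisions = find_collisions(index)
print(collisions)
```

A colliding alias is not necessarily wrong, but it means the mention can only be resolved from context, so it is worth knowing which ones exist.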
Installation
pip install glinker
Building the Pipeline
Step 1: Import and Initialize
from glinker import ConfigBuilder, DAGExecutor, DAGPipeline
Step 2: Configure the Pipeline
Use ConfigBuilder to define all four layers in a single configuration:
builder = ConfigBuilder(name="clinical_db_pipeline")
# Set schema template to use only labels (not descriptions) for L3 matching
builder.set_schema_template("{label}")
# L1: Zero-shot NER: extract mentions with custom medical entity types
builder.l1.gliner(
    model="knowledgator/gliner-bi-edge-v2.0",
    labels=["Patient", "Doctor", "Disease", "Symptom"],
    threshold=0.2
)
# L2: Dictionary lookup: candidate generation with exact and fuzzy matching
builder.l2.add(
    "dict",
    priority=0,
    search_mode=["exact", "fuzzy"],
    fuzzy={"max_distance": 2, "min_similarity": 0.6}
)
# L3: Entity linking: disambiguate candidates using context
builder.l3.configure(
    model="knowledgator/gliner-linker-base-v1.0",
    threshold=0.3,
    device="cpu",
    max_length=512
)
# L0: Aggregation: filter results and include unlinked entities
builder.l0.configure(
    min_confidence=0.4,
    include_unlinked=True  # Include unlinked entities to detect new records
)
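To build intuition for the max_distance and min_similarity values in the L2 configuration, here is a minimal sketch of edit-distance matching. It is not GLinker's implementation: the assumptions are that distance means Levenshtein distance, that similarity is 1 minus distance divided by the longer string's length, and that both conditions must hold. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def is_fuzzy_match(mention, candidate, max_distance=2, min_similarity=0.6):
    """Case-insensitive check that both thresholds are satisfied."""
    a, b = mention.lower(), candidate.lower()
    return levenshtein(a, b) <= max_distance and similarity(a, b) >= min_similarity

print(is_fuzzy_match("Jon Doe", "John Doe"))  # one edit over eight characters
print(is_fuzzy_match("DM", "DN"))             # one edit, but only two characters
```

Under these assumptions the similarity floor matters most for short abbreviations: a single edit in a two-letter code halves the similarity, so it is rejected even though the distance is small.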
Step 3: Build and Load the Database
config = builder.get_config()
pipeline = DAGPipeline(**config)
executor = DAGExecutor(pipeline)
# Load existing hospital records into the pipeline
MOCK_DB_PATH = "mock_db.jsonl"
executor.load_entities(MOCK_DB_PATH, target_layers=['dict'])
Step 4: Process Clinical Notes
note = "Dr. House checked patient Jon Doe who complained of high blood pressure."
context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')
if results and results.entities:
    entities = results.entities[0]
    # Map entity types to target tables; unmapped types (e.g., Symptom) go to review
    tables = {"Doctor": "DOCTORS", "Patient": "PATIENTS", "Disease": "DISEASES"}
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"[EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
            print(" -> Action: SKIP INSERTION (Record exists)")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
            table = tables.get(entity_type)
            if table:
                print(f" -> Action: INSERT into {table} table")
            else:
                print(f" -> Action: REVIEW (no table mapped for type {entity_type})")
Expected Output:
[EXISTING] Matched 'Dr. House' (Doctor) -> ID: D001 (Dr. Gregory House)
-> Action: SKIP INSERTION (Record exists)
[EXISTING] Matched 'Jon Doe' (Patient) -> ID: P001 (John Doe)
-> Action: SKIP INSERTION (Record exists)
[EXISTING] Matched 'high blood pressure' (Disease) -> ID: C002 (Hypertension)
-> Action: SKIP INSERTION (Record exists)
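If downstream code needs to act on these decisions rather than print them, the same logic can return structured data. The sketch below is one possible shape. Mention and LinkedEntity are stand-in classes that only mimic the attributes used in the Step 4 loop (mention_text, label, is_linked, linked_entity) so the example runs without GLinker; plan_action and TABLE_BY_TYPE are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkedEntity:
    """Stand-in for the object exposed as ent.linked_entity."""
    entity_id: str
    label: str

@dataclass
class Mention:
    """Stand-in exposing the attributes read in the Step 4 loop."""
    mention_text: str
    label: str
    linked_entity: Optional[LinkedEntity] = None

    @property
    def is_linked(self):
        return self.linked_entity is not None

# Hypothetical mapping from entity type to target table
TABLE_BY_TYPE = {"Doctor": "DOCTORS", "Patient": "PATIENTS", "Disease": "DISEASES"}

def plan_action(ent):
    """Return a dict describing what to do with one extracted entity."""
    if ent.is_linked:
        return {"action": "skip", "entity_id": ent.linked_entity.entity_id}
    table = TABLE_BY_TYPE.get(ent.label)
    if table is None:
        # Types without a table (e.g., Symptom) are held for manual review
        return {"action": "review", "reason": f"no table for type '{ent.label}'"}
    return {"action": "insert", "table": table, "value": ent.mention_text}

mentions = [
    Mention("Dr. House", "Doctor", LinkedEntity("D001", "Dr. Gregory House")),
    Mention("Jane Smith", "Patient"),
    Mention("chest pain", "Symptom"),
]
for m in mentions:
    print(m.mention_text, "->", plan_action(m))
```

Returning a plain dict keeps the decision separate from the database write, which makes it easier to log or review before anything is inserted.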
Step 5: Detect New Entities
Process a note containing entities not in the database:
note = "Referral: Dr. Meredith Grey examining new patient Jane Smith for possible Arrhythmia."
context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')
if results and results.entities:
    entities = results.entities[0]
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"[EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
Expected Output:
[NEW RECORD] 'Dr. Meredith Grey' (Doctor) -> Insert into Database?
[NEW RECORD] 'Jane Smith' (Patient) -> Insert into Database?
[NEW RECORD] 'Arrhythmia' (Disease) -> Insert into Database?
Step 6: Handle Name Variations
The pipeline handles name variations through aliases, fuzzy matching, and contextual disambiguation:
note = "Patient John H Doe returned for follow-up appointment."
context = executor.execute({"texts": [note]})
results = context.data.get('l0_result')
if results and results.entities:
    entities = results.entities[0]
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        # Skip generic entity type names extracted as mentions
        if entity_text.lower() == entity_type.lower():
            continue
        if ent.is_linked:
            link = ent.linked_entity
            print(f"[EXISTING] Matched '{entity_text}' -> ID: {link.entity_id} ({link.label})")
        else:
            print(f"[NEW RECORD] '{entity_text}' ({entity_type})")
Expected Output:
[EXISTING] Matched 'John H Doe' -> ID: P001 (John Doe)
Key Features
Zero-Shot Learning
GLinker uses GLiNER for zero-shot entity extraction, meaning you can define custom entity types without training data:
# Define any entity types relevant to your use case
builder.l1.gliner(
    model="knowledgator/gliner-bi-edge-v2.0",
    labels=["Patient", "Doctor", "Disease", "Symptom", "Medication", "Procedure"],
    threshold=0.2
)
Name Variation Handling
The pipeline resolves name variations through database aliases, fuzzy matching, and contextual disambiguation. Every mention in this cookbook's examples is listed as an alias in mock_db.jsonl, so each one resolves through an alias match:
| Input Mention | Resolved Entity | Method |
|---|---|---|
| "Jon Doe" | John Doe | Database alias |
| "John H Doe" | John Doe | Database alias |
| "Dr. House" | Dr. Gregory House | Database alias |
| "high blood pressure" | Hypertension | Database alias |
Deduplication Logic
The pipeline distinguishes between existing and new entities using is_linked:
for ent in entities:
    if ent.is_linked:
        # Entity found in database: skip insertion, use existing ID
        db_id = ent.linked_entity.entity_id
        print(f"Using existing record: {db_id}")
    else:
        # New entity: flag for insertion
        print(f"New entity detected: {ent.mention_text}")
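When several notes are processed together, the same new name may appear more than once. The sketch below groups unlinked mentions by normalized text and type so each candidate is reviewed once. It uses SimpleNamespace stand-ins that mimic the attributes shown above instead of real pipeline output; summarize and fake are hypothetical helpers.

```python
from collections import Counter
from types import SimpleNamespace

def summarize(entities):
    """Split mentions into existing IDs and counted, de-duplicated new candidates."""
    existing_ids = set()
    new_candidates = Counter()
    for ent in entities:
        if ent.is_linked:
            existing_ids.add(ent.linked_entity.entity_id)
        else:
            # Normalize case and whitespace so trivial variants collapse together
            key = (" ".join(ent.mention_text.lower().split()), ent.label)
            new_candidates[key] += 1
    return existing_ids, new_candidates

def fake(text, label, entity_id=None):
    """Build a stand-in entity with the attributes used by summarize()."""
    linked = SimpleNamespace(entity_id=entity_id) if entity_id else None
    return SimpleNamespace(
        mention_text=text,
        label=label,
        is_linked=linked is not None,
        linked_entity=linked,
    )

entities = [
    fake("Dr. House", "Doctor", "D001"),
    fake("Jane Smith", "Patient"),
    fake("jane  smith", "Patient"),
    fake("Dr. House", "Doctor", "D001"),
]
existing_ids, new_candidates = summarize(entities)
print(sorted(existing_ids))
print(dict(new_candidates))
```

The counts give a rough signal for prioritizing review: a name seen in many notes is more likely to be a real missing record than a one-off extraction error.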
Best Practices
- Maintain comprehensive aliases: Add common misspellings, abbreviations, and variations to your database aliases
- Set appropriate confidence thresholds: Lower thresholds catch more matches but may introduce false positives
- Review new entities regularly: Entities flagged as "new" should be reviewed before database insertion
- Use context for disambiguation: When multiple candidates match, L3 uses surrounding text to disambiguate
- Enable include_unlinked: Set this to True in L0 to detect new records that need to be added to your database
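One way to choose a confidence threshold is to hand-label a small sample of proposed links and see how precision and recall move as the cutoff changes. The sketch below assumes you have such a sample as (confidence, is_correct) pairs. The numbers are invented for illustration and precision_recall is a hypothetical helper.

```python
def precision_recall(scored_links, threshold):
    """Compute precision and recall for links accepted at a given threshold.

    scored_links: list of (confidence, is_correct) pairs for proposed links.
    Recall is measured against the correct links present in the sample.
    """
    accepted = [ok for score, ok in scored_links if score >= threshold]
    total_correct = sum(1 for _, ok in scored_links if ok)
    true_pos = sum(accepted)
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_correct if total_correct else 1.0
    return precision, recall

# Hypothetical hand-labeled review sample: (confidence, link was correct)
sample = [
    (0.95, True), (0.80, True), (0.55, True),
    (0.45, False), (0.35, True), (0.30, False),
]

for t in (0.3, 0.4, 0.6):
    p, r = precision_recall(sample, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy sample, raising the threshold from 0.3 to 0.6 moves precision from 0.67 to 1.00 while recall drops from 1.00 to 0.50, which is the trade-off to weigh when setting the L3 threshold and L0 min_confidence.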
Next Steps
- Biomedical Entity Extraction — Extract biomedical entities from clinical text with GLiNER
- Adverse Drug Event Detection — Detect ADEs in medical text with GLiClass
- PII Detection and Redaction — Protect patient privacy with GLiNER