Extract Biomedical Entities from Clinical Reports with GLiNER
Use GLiNER locally to identify diseases, medications, procedures, symptoms, and other biomedical entities in clinical text, with no API keys and no cloud dependencies.
Overview
This cookbook shows how to build a biomedical named entity recognition (NER) pipeline using GLiNER, an open-source zero-shot NER framework. GLiNER lets you define arbitrary entity types at inference time, making it ideal for clinical text where entity categories vary across specialties.
What you will build:
- A local biomedical entity extractor that runs on your own hardware
- Medication parsing with structured dosage records
- Abbreviation-aware extraction with expansion mappings
- Ontology normalization to ICD-10, RxNorm, and SNOMED-CT
- A complete clinical NER pipeline with batch processing and JSON export
Installation
pip install gliner
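GLiNER downloads model weights on first use. The knowledgator/gliner-multitask-large-v0.5 model is approximately 1.5 GB. The code in this cookbook uses type hints such as list[str] | None, which require Python 3.10 or later.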
Quick Start
Extract biomedical entities from a clinical sentence in under 15 lines:
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

text = (
    "Patient with type 2 diabetes mellitus and hypertension, "
    "currently on metformin 1000mg and lisinopril 20mg daily."
)
labels = ["disease", "medication", "dosage"]
entities = model.predict_entities(text, labels, threshold=0.5)

for e in entities:
    print(f"{e['text']:30} | {e['label']:12} | {e['score']:.2f}")
Expected output (scores are illustrative; exact values vary with model version and hardware):
type 2 diabetes mellitus       | disease      | 0.94
hypertension                   | disease      | 0.96
metformin                      | medication   | 0.93
1000mg                         | dosage       | 0.87
lisinopril                     | medication   | 0.91
20mg                           | dosage       | 0.85
Define Biomedical Entity Types
Core Clinical Entity Types
BIOMEDICAL_ENTITIES = [
    "disease or medical condition",
    "medication or drug",
    "medical procedure",
    "anatomical structure",
    "laboratory test",
    "symptom or clinical finding",
]
Detailed Entity Types
DETAILED_BIOMEDICAL_ENTITIES = [
    # Conditions
    "disease",
    "syndrome",
    "disorder",
    "injury",
    # Treatments
    "medication",
    "dosage",
    "route of administration",
    "therapeutic procedure",
    "surgical procedure",
    "diagnostic procedure",
    # Anatomy
    "body part",
    "organ",
    "tissue",
    "body system",
    # Clinical findings
    "symptom",
    "sign",
    "vital sign measurement",
    "laboratory value",
    # Temporal
    "duration",
    "frequency",
    "date of onset",
]
Domain-Specific Entity Sets
# Oncology
ONCOLOGY_ENTITIES = [
    "cancer type",
    "tumor location",
    "cancer stage",
    "tumor grade",
    "chemotherapy drug",
    "radiation therapy",
    "immunotherapy",
    "biomarker",
    "genetic mutation",
]

# Cardiology
CARDIOLOGY_ENTITIES = [
    "cardiac condition",
    "cardiovascular medication",
    "cardiac procedure",
    "cardiac anatomy",
    "ECG finding",
    "cardiac biomarker",
    "heart rhythm",
]

# Neurology
NEUROLOGY_ENTITIES = [
    "neurological condition",
    "neurological symptom",
    "brain region",
    "neurological medication",
    "neuroimaging finding",
    "cognitive assessment",
]
Basic Entity Extraction
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
def extract_biomedical_entities(
    text: str,
    entity_types: list[str] | None = None,
    threshold: float = 0.5,
) -> list[dict]:
    """
    Extract biomedical entities from clinical text.

    Args:
        text: Clinical text to analyze.
        entity_types: Entity labels for GLiNER. Defaults to BIOMEDICAL_ENTITIES.
        threshold: Minimum confidence score.

    Returns:
        List of entity dicts with keys: text, label, start, end, score.
    """
    if entity_types is None:
        entity_types = BIOMEDICAL_ENTITIES
    return model.predict_entities(text, entity_types, threshold=threshold)
# --- Example: discharge summary excerpt ---
clinical_note = """
DISCHARGE SUMMARY
Patient is a 67-year-old male with history of type 2 diabetes mellitus,
hypertension, and chronic kidney disease stage 3. He was admitted for
acute exacerbation of congestive heart failure.
Medications on discharge:
- Metformin 1000mg twice daily
- Lisinopril 20mg daily
- Furosemide 40mg daily
- Carvedilol 12.5mg twice daily
Patient underwent echocardiogram showing ejection fraction of 35%.
Recommend follow-up with cardiology in 2 weeks.
"""
entities = extract_biomedical_entities(clinical_note)
for e in entities:
    # Label column is 30 wide so the longest label (28 characters) stays aligned.
    print(f"{e['text']:30} | {e['label']:30} | {e['score']:.2f}")
Expected output (scores are illustrative; exact spans and values vary with model version):
type 2 diabetes mellitus       | disease or medical condition   | 0.94
hypertension                   | disease or medical condition   | 0.96
chronic kidney disease stage 3 | disease or medical condition   | 0.92
congestive heart failure       | disease or medical condition   | 0.95
Metformin                      | medication or drug             | 0.93
Lisinopril                     | medication or drug             | 0.91
Furosemide                     | medication or drug             | 0.90
Carvedilol                     | medication or drug             | 0.89
echocardiogram                 | medical procedure              | 0.88
ejection fraction              | laboratory test                | 0.85
Extract Medications with Dosage
Define medication-specific labels and parse the results into structured records:
from dataclasses import dataclass, field

MEDICATION_ENTITIES = [
    "drug name",
    "dosage amount",
    "frequency",
    "route of administration",
    "indication",
]


@dataclass
class MedicationRecord:
    drug_name: str
    start: int
    dosage_amount: str = ""
    frequency: str = ""
    route_of_administration: str = ""
    indication: str = ""


def extract_medications(text: str, threshold: float = 0.4) -> list[dict]:
    """Extract medication-related entities from text."""
    return model.predict_entities(text, MEDICATION_ENTITIES, threshold=threshold)


def parse_medication_list(text: str) -> list[MedicationRecord]:
    """
    Parse a medication list into structured records.

    Groups entities by proximity: each 'drug name' starts a new record.
    """
    entities = extract_medications(text)
    entities.sort(key=lambda e: e["start"])
    medications: list[MedicationRecord] = []
    current: MedicationRecord | None = None
    for entity in entities:
        if entity["label"] == "drug name":
            # A new drug name closes the previous record and opens a new one.
            if current is not None:
                medications.append(current)
            current = MedicationRecord(
                drug_name=entity["text"], start=entity["start"]
            )
        elif current is not None:
            # Labels map onto field names, e.g. "dosage amount" -> dosage_amount.
            attr = entity["label"].replace(" ", "_")
            if hasattr(current, attr):
                setattr(current, attr, entity["text"])
    if current is not None:
        medications.append(current)
    return medications
# --- Example ---
med_text = """
Current Medications:
1. Metformin 500mg PO twice daily for diabetes
2. Atorvastatin 40mg PO at bedtime for hyperlipidemia
3. Aspirin 81mg PO daily for cardiovascular protection
4. Insulin glargine 20 units subcutaneous at bedtime
"""
for med in parse_medication_list(med_text):
    print(med)
Expected output:
MedicationRecord(drug_name='Metformin', start=..., dosage_amount='500mg', frequency='twice daily', route_of_administration='PO', indication='diabetes')
MedicationRecord(drug_name='Atorvastatin', start=..., dosage_amount='40mg', frequency='at bedtime', route_of_administration='PO', indication='hyperlipidemia')
MedicationRecord(drug_name='Aspirin', start=..., dosage_amount='81mg', frequency='daily', route_of_administration='PO', indication='cardiovascular protection')
MedicationRecord(drug_name='Insulin glargine', start=..., dosage_amount='20 units', frequency='at bedtime', route_of_administration='subcutaneous', indication='')
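The dosage_amount field above is still a free-text string. If downstream code needs a numeric value and a unit, a small parser can split the two. The sketch below is an illustration, not part of GLiNER: the Dosage type, the parse_dosage name, and the regular expression are all assumptions, and the pattern only covers simple forms such as "500mg" or "20 units".

```python
import re
from typing import NamedTuple, Optional


class Dosage(NamedTuple):
    value: float
    unit: str


# Hypothetical pattern: a number (optionally decimal) followed by a unit word.
_DOSAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z%]+)\s*$")


def parse_dosage(dosage_amount: str) -> Optional[Dosage]:
    """Split a string like '12.5mg' or '20 units' into (value, unit)."""
    match = _DOSAGE_RE.match(dosage_amount)
    if match is None:
        return None  # leave unparseable strings for manual review
    return Dosage(value=float(match.group(1)), unit=match.group(2).lower())


for raw in ["500mg", "12.5mg", "20 units", "one tablet"]:
    print(raw, "->", parse_dosage(raw))
```

Strings the pattern cannot handle (ranges, compound doses, spelled-out numbers) return None rather than a guess, so they can be routed to review.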
Handle Clinical Abbreviations
Clinical text is dense with abbreviations. Listing example abbreviations in each label description can help GLiNER recognize them:
ABBREVIATION_AWARE_ENTITIES = [
    "disease or condition (including abbreviations like CHF, COPD, DM)",
    "medication (including abbreviations like ASA, HCTZ)",
    "procedure (including abbreviations like CABG, PCI, EGD)",
    "laboratory test (including abbreviations like CBC, BMP, HbA1c)",
]
abbreviation_text = """
72 y/o M w/ PMHx of HTN, DM2, CAD s/p CABG, CHF (EF 30%), CKD3.
Admitted for SOB and LE edema. Labs: BNP 1200, Cr 2.1, K 5.2.
Started on IV Lasix, continued home meds including ASA, metoprolol, lisinopril.
Echo showed worsening EF. Cards consulted for ?ICD placement.
"""
entities = model.predict_entities(
    abbreviation_text, ABBREVIATION_AWARE_ENTITIES, threshold=0.4
)
for e in entities:
    print(f"{e['text']:20} | {e['label'][:45]:45} | {e['score']:.2f}")
Abbreviation Expansion Mapping
After extraction, expand abbreviations to their full forms:
ABBREVIATION_MAP = {
    # Conditions
    "HTN": "hypertension",
    "DM": "diabetes mellitus",
    "DM2": "type 2 diabetes mellitus",
    "CAD": "coronary artery disease",
    "CHF": "congestive heart failure",
    "COPD": "chronic obstructive pulmonary disease",
    "CKD": "chronic kidney disease",
    "AFib": "atrial fibrillation",
    "MI": "myocardial infarction",
    "CVA": "cerebrovascular accident",
    "DVT": "deep vein thrombosis",
    "PE": "pulmonary embolism",
    # Procedures
    "CABG": "coronary artery bypass graft",
    "PCI": "percutaneous coronary intervention",
    "EGD": "esophagogastroduodenoscopy",
    "ERCP": "endoscopic retrograde cholangiopancreatography",
    "CT": "computed tomography",
    "MRI": "magnetic resonance imaging",
    # Medications
    "ASA": "aspirin",
    "HCTZ": "hydrochlorothiazide",
    "MVI": "multivitamin",
    "PPI": "proton pump inhibitor",
    # Labs
    "CBC": "complete blood count",
    "BMP": "basic metabolic panel",
    "CMP": "comprehensive metabolic panel",
    "LFT": "liver function tests",
    "HbA1c": "hemoglobin A1c",
    "BNP": "B-type natriuretic peptide",
}
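# Upper-cased copy of the map so lookups are case-insensitive. Without it,
# mixed-case keys such as "AFib" and "HbA1c" only match when typed exactly.
_ABBREVIATION_LOOKUP = {k.upper(): v for k, v in ABBREVIATION_MAP.items()}


def expand_abbreviations(entities: list[dict]) -> list[dict]:
    """Add an 'expanded' field to entities whose text matches a known abbreviation."""
    expanded = []
    for e in entities:
        expansion = _ABBREVIATION_LOOKUP.get(e["text"].strip().upper())
        expanded.append({
            **e,
            # Fall back to the original text when no expansion is known.
            "expanded": expansion if expansion else e["text"],
        })
    return expanded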
Normalize to Medical Ontologies
Map extracted entities to ICD-10, RxNorm, and SNOMED-CT codes. In production, connect to real terminology services (UMLS, NLM RxNorm API). The class below demonstrates the pattern with simplified lookup tables:
import re
from typing import Optional


class OntologyNormalizer:
    """Normalize extracted entities to standard medical ontologies."""

    def __init__(self) -> None:
        self.icd10_map = {
            "type 2 diabetes mellitus": "E11",
            "diabetes mellitus": "E11",
            "hypertension": "I10",
            "essential hypertension": "I10",
            "congestive heart failure": "I50.9",
            "heart failure": "I50.9",
            "chronic kidney disease": "N18",
            "atrial fibrillation": "I48",
            "copd": "J44.9",
            "chronic obstructive pulmonary disease": "J44.9",
            "pneumonia": "J18.9",
            "acute myocardial infarction": "I21.9",
        }
        self.rxnorm_map = {
            "metformin": "6809",
            "lisinopril": "29046",
            "atorvastatin": "83367",
            "aspirin": "1191",
            "furosemide": "4603",
            "carvedilol": "20352",
            "amlodipine": "17767",
            "omeprazole": "7646",
        }
        self.snomed_map = {
            "chest pain": "29857009",
            "dyspnea": "267036007",
            "fatigue": "84229001",
            "headache": "25064002",
            "nausea": "422587007",
            "fever": "386661006",
            "cough": "49727002",
            "edema": "267038008",
        }

    def normalize_condition(self, entity_text: str) -> Optional[dict]:
        """Normalize a disease or condition to ICD-10."""
        normalized = entity_text.lower().strip()
        if not normalized:
            return None  # an empty string would otherwise match every term
        # Loose two-way substring match; the first table entry that matches wins.
        for term, code in self.icd10_map.items():
            if term in normalized or normalized in term:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "ICD-10", "code": code}
        return None

    def normalize_medication(self, entity_text: str) -> Optional[dict]:
        """Normalize a medication to RxNorm."""
        # Strip dose strings such as "12.5mg" or "20 units" before matching.
        drug_name = re.sub(
            r"\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|units?)\b", "", entity_text.lower()
        ).strip()
        for term, code in self.rxnorm_map.items():
            if term in drug_name:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "RxNorm", "code": code}
        return None

    def normalize_symptom(self, entity_text: str) -> Optional[dict]:
        """Normalize a symptom to SNOMED-CT."""
        normalized = entity_text.lower().strip()
        if not normalized:
            return None  # an empty string would otherwise match every term
        for term, code in self.snomed_map.items():
            if term in normalized or normalized in term:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "SNOMED-CT", "code": code}
        return None

    def normalize(self, entity: dict) -> Optional[dict]:
        """Route normalization by entity label."""
        label = entity["label"].lower()
        if "disease" in label or "condition" in label:
            return self.normalize_condition(entity["text"])
        if "medication" in label or "drug" in label:
            return self.normalize_medication(entity["text"])
        if "symptom" in label:
            return self.normalize_symptom(entity["text"])
        return None
# --- Example ---
normalizer = OntologyNormalizer()
text = "Patient with hypertension and type 2 diabetes on metformin, presenting with chest pain."
entities = model.predict_entities(text, BIOMEDICAL_ENTITIES, threshold=0.4)
for e in entities:
    norm = normalizer.normalize(e)
    if norm:
        print(f"{e['text']:25} -> {norm['ontology']}: {norm['code']}")
    else:
        print(f"{e['text']:25} -> (no normalization)")
Expected output (assuming the model returns these four spans):
hypertension              -> ICD-10: I10
type 2 diabetes           -> ICD-10: E11
metformin                 -> RxNorm: 6809
chest pain                -> SNOMED-CT: 29857009
Full Clinical NER Pipeline
A complete pipeline class that loads the model once, processes documents, normalizes entities, and exports to JSON:
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import json
from gliner import GLiNER
@dataclass
class ExtractedEntity:
    """A single extracted entity with optional ontology normalization."""
    text: str
    label: str
    start: int
    end: int
    score: float
    normalization: Optional[dict] = None
@dataclass
class ClinicalDocument:
    """A processed clinical document."""
    document_id: str
    source_text: str
    entities: list[ExtractedEntity] = field(default_factory=list)
    processed_at: str = ""

    def to_dict(self) -> dict:
        return {
            "document_id": self.document_id,
            "entity_count": len(self.entities),
            "entities": [
                {
                    "text": e.text,
                    "label": e.label,
                    "position": [e.start, e.end],
                    "confidence": round(e.score, 4),
                    "normalization": e.normalization,
                }
                for e in self.entities
            ],
            "processed_at": self.processed_at,
        }
class ClinicalNERPipeline:
    """
    Local biomedical NER pipeline built on GLiNER.

    Loads the model once and reuses it for all documents.
    """

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-multitask-large-v0.5",
        entity_types: list[str] | None = None,
        threshold: float = 0.5,
    ) -> None:
        self.model = GLiNER.from_pretrained(model_name)
        self.entity_types = entity_types or BIOMEDICAL_ENTITIES
        self.threshold = threshold
        self.normalizer = OntologyNormalizer()

    def process_document(self, doc_id: str, text: str) -> ClinicalDocument:
        """Extract and normalize entities from a single document."""
        doc = ClinicalDocument(
            document_id=doc_id,
            source_text=text,
            processed_at=datetime.now(timezone.utc).isoformat(),
        )
        raw_entities = self.model.predict_entities(
            text, self.entity_types, threshold=self.threshold
        )
        for e in raw_entities:
            extracted = ExtractedEntity(
                text=e["text"],
                label=e["label"],
                start=e["start"],
                end=e["end"],
                score=e["score"],
                normalization=self.normalizer.normalize(e),
            )
            doc.entities.append(extracted)
        return doc

    def process_batch(self, documents: list[dict]) -> list[ClinicalDocument]:
        """
        Process multiple documents.

        Args:
            documents: List of dicts, each with 'id' and 'text' keys.

        Returns:
            List of ClinicalDocument objects.
        """
        return [
            self.process_document(doc["id"], doc["text"])
            for doc in documents
        ]

    @staticmethod
    def export_to_json(documents: list[ClinicalDocument], filepath: str) -> None:
        """Export processed documents to a JSON file."""
        data = [doc.to_dict() for doc in documents]
        # Explicit encoding keeps the output identical across platforms.
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
# --- Usage ---
pipeline = ClinicalNERPipeline(threshold=0.5)

documents = [
    {
        "id": "note_001",
        "text": (
            "Assessment: 65 y/o female with poorly controlled type 2 diabetes "
            "(HbA1c 9.2%), hypertension, and hyperlipidemia. Patient reports "
            "increased fatigue and polyuria.\n\n"
            "Plan:\n"
            "1. Increase metformin to 1000mg BID\n"
            "2. Add glipizide 5mg daily\n"
            "3. Continue lisinopril 20mg daily\n"
            "4. Recheck HbA1c in 3 months"
        ),
    },
    {
        "id": "note_002",
        "text": (
            "Chief Complaint: Chest pain\n\n"
            "HPI: 58 y/o male with history of CAD s/p stent placement 2019 "
            "presenting with substernal chest pain radiating to left arm, "
            "associated with diaphoresis. Pain started 2 hours ago at rest.\n\n"
            "Assessment: Acute coronary syndrome, rule out STEMI\n"
            "Plan: Serial troponins, ECG, cardiology consult, continue aspirin "
            "and heparin drip"
        ),
    },
]

processed = pipeline.process_batch(documents)

for doc in processed:
    print(f"\n=== {doc.document_id} ===")
    print(f"Entities found: {len(doc.entities)}")
    for entity in doc.entities[:5]:
        norm_str = ""
        if entity.normalization:
            norm_str = f" -> {entity.normalization['ontology']}: {entity.normalization['code']}"
        print(f"  {entity.text} ({entity.label}, {entity.score:.2f}){norm_str}")

# Export results
pipeline.export_to_json(processed, "clinical_ner_results.json")
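Full-length notes can exceed the model's input window, and GLiNER may truncate text beyond its configured maximum length. One way to handle this is to split the note into chunks, run extraction per chunk, and shift the offsets back to whole-document positions. The sketch below makes several assumptions: chunk_text, predict_long_text, and the 1000-character default are illustrative choices, not GLiNER features, and fake_predict is a stand-in so the example runs without the model.

```python
from typing import Callable


def chunk_text(text: str, max_chars: int = 1000) -> list[tuple[int, str]]:
    """Split text on line boundaries into (offset, chunk) pairs of roughly max_chars."""
    chunks: list[tuple[int, str]] = []
    start = 0      # offset of the current chunk within the original text
    current = ""
    for line in text.splitlines(keepends=True):
        # Close the current chunk when adding this line would exceed the budget.
        # A single line longer than max_chars still becomes its own chunk.
        if current and len(current) + len(line) > max_chars:
            chunks.append((start, current))
            start += len(current)
            current = ""
        current += line
    if current:
        chunks.append((start, current))
    return chunks


def predict_long_text(
    text: str,
    predict: Callable[[str], list[dict]],
    max_chars: int = 1000,
) -> list[dict]:
    """Run predict on each chunk and shift start/end to whole-document offsets."""
    results: list[dict] = []
    for offset, chunk in chunk_text(text, max_chars):
        for e in predict(chunk):
            results.append({**e, "start": e["start"] + offset, "end": e["end"] + offset})
    return results


def fake_predict(chunk: str) -> list[dict]:
    """Stand-in for the model: finds the literal word 'aspirin'."""
    i = chunk.find("aspirin")
    if i < 0:
        return []
    return [{"text": "aspirin", "label": "medication", "start": i, "end": i + 7, "score": 0.9}]


note = "Line one about nothing.\n" * 3 + "Continue aspirin daily.\n"
found = predict_long_text(note, fake_predict, max_chars=50)
print(note[found[0]["start"]:found[0]["end"]])

# With the real model, the predictor would be something like:
# lambda t: model.predict_entities(t, BIOMEDICAL_ENTITIES, threshold=0.5)
```

Because chunks break on line boundaries, an entity that spans two lines at a chunk edge can be missed; overlapping chunks would reduce that risk at the cost of duplicate spans to merge.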
Best Practices
Use domain-specific entity types
Tailor your label lists to the clinical specialty. Radiology reports benefit from labels like "imaging finding", "anatomical location", and "measurement". Pathology reports need "specimen type", "tumor grade", and "margin status". More specific labels produce more accurate extractions.
Adjust thresholds by entity type
Different entity types have different precision-recall tradeoffs. Run separate predict_entities calls with different thresholds:
def extract_with_variable_thresholds(text: str) -> list[dict]:
    """Use higher thresholds for medications, lower for symptoms."""
    meds = model.predict_entities(text, ["medication or drug"], threshold=0.6)
    symptoms = model.predict_entities(text, ["symptom or clinical finding"], threshold=0.4)
    # Merge the two result lists back into document order.
    return sorted(meds + symptoms, key=lambda e: e["start"])
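If running the model once per label is too slow, an alternative is a single call at a low threshold followed by a per-label filter. The helper below is a sketch under that assumption: filter_by_label_threshold is a hypothetical name, and the sample entities are hand-written in the dict shape used throughout this cookbook. Scores may differ somewhat when labels are passed together rather than separately, so treat the two approaches as similar rather than identical.

```python
def filter_by_label_threshold(
    entities: list[dict],
    thresholds: dict[str, float],
    default: float = 0.5,
) -> list[dict]:
    """Keep entities whose score meets the threshold configured for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


# Hand-written sample in the shape returned by predict_entities (values are made up).
sample = [
    {"text": "metformin", "label": "medication or drug", "score": 0.55, "start": 0, "end": 9},
    {"text": "fatigue", "label": "symptom or clinical finding", "score": 0.45, "start": 20, "end": 27},
]
label_thresholds = {"medication or drug": 0.6, "symptom or clinical finding": 0.4}

kept = filter_by_label_threshold(sample, label_thresholds)
print([e["text"] for e in kept])  # ['fatigue']
```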
Post-process for quality
Drop exact-duplicate spans and filter noise. When two entities share the same start and end, the higher-scoring one is kept:
def post_process_entities(entities: list[dict]) -> list[dict]:
    """Remove short spans, numeric-only spans, and duplicates."""
    seen: set[tuple[int, int]] = set()
    result = []
    # Highest score first, so the best entity for each span wins.
    for e in sorted(entities, key=lambda x: -x["score"]):
        if len(e["text"].strip()) < 2:
            continue
        if e["text"].strip().replace(".", "").isdigit():
            continue
        span_key = (e["start"], e["end"])
        if span_key in seen:
            continue
        seen.add(span_key)
        result.append(e)
    return result
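Exact-span deduplication does not handle partially overlapping spans, such as "chronic kidney disease" inside "chronic kidney disease stage 3". One common approach is greedy resolution: keep the highest-scoring span and drop anything that overlaps it. The function below is a minimal sketch of that idea; resolve_overlaps is a hypothetical helper, and the sample entities are made up for illustration.

```python
def resolve_overlaps(entities: list[dict]) -> list[dict]:
    """Greedy overlap resolution: keep the best-scoring span, drop spans overlapping it."""
    kept: list[dict] = []
    for e in sorted(entities, key=lambda x: -x["score"]):
        # Two half-open spans [start, end) overlap when each starts before the other ends.
        overlaps = any(e["start"] < k["end"] and k["start"] < e["end"] for k in kept)
        if not overlaps:
            kept.append(e)
    # Return survivors in document order.
    return sorted(kept, key=lambda x: x["start"])


sample = [
    {"text": "chronic kidney disease", "label": "disease", "start": 0, "end": 22, "score": 0.80},
    {"text": "chronic kidney disease stage 3", "label": "disease", "start": 0, "end": 30, "score": 0.92},
    {"text": "hypertension", "label": "disease", "start": 35, "end": 47, "score": 0.96},
]

resolved = resolve_overlaps(sample)
print([e["text"] for e in resolved])
# ['chronic kidney disease stage 3', 'hypertension']
```

Preferring the highest score is one policy among several; preferring the longest span is another reasonable choice when labels are coarse.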
Validate against known terminologies
Cross-reference extracted medications against drug databases (RxNorm, local formulary) to catch false positives:
KNOWN_DRUGS = {"metformin", "lisinopril", "aspirin", "atorvastatin", "furosemide"}
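def validate_medication(entity_text: str) -> bool:
    """Check whether the first word of the extracted text is a known drug name."""
    words = entity_text.lower().split()
    # Guard against empty or whitespace-only text, which has no first word.
    return bool(words) and words[0] in KNOWN_DRUGS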
Limitations and Considerations
- Not a clinical decision tool. Extracted entities require clinical validation before use in patient care.
- Abbreviation ambiguity. Some abbreviations have multiple meanings (e.g., "MS" could be multiple sclerosis, mitral stenosis, or morphine sulfate). Context-aware disambiguation requires additional logic.
- Negation and uncertainty. Entity extraction alone does not capture negation ("no chest pain") or hedging ("possible pneumonia"). Consider adding a negation detection layer.
- PHI compliance. Ensure all processing complies with HIPAA, GDPR, or applicable regulations. GLiNER runs locally, which avoids sending data to external APIs, but you still need proper data handling safeguards.
- Model size and latency. The multitask-large model requires approximately 1.5 GB of memory. For high-throughput production workloads, consider GPU acceleration or the smaller GLiNER model variants.
- Terminology updates. Medical terminology evolves. Regularly update normalization mappings and validation dictionaries.
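To illustrate the negation point above, here is a minimal rule-based sketch in the spirit of cue-based approaches such as NegEx. It is not a validated clinical algorithm: the cue list, the 40-character window, and the is_negated name are all assumptions chosen for the example, and it does not handle hedging such as "possible pneumonia".

```python
import re

# Small, hypothetical cue list; a real deployment would need a fuller lexicon.
NEGATION_CUES = ("no", "denies", "denied", "without", "negative for", "not")
_CUE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in NEGATION_CUES) + r")\b",
    re.IGNORECASE,
)


def is_negated(text: str, start: int, window: int = 40) -> bool:
    """Return True if a negation cue appears shortly before the entity, in the same sentence."""
    left = text[max(0, start - window):start]
    # Restrict the search to the current sentence or clause.
    left = re.split(r"[.;:\n]", left)[-1]
    return _CUE_RE.search(left) is not None


note = "Patient denies chest pain. Reports dyspnea on exertion."
for term in ("chest pain", "dyspnea"):
    start = note.index(term)
    print(term, "negated" if is_negated(note, start) else "affirmed")
# chest pain negated
# dyspnea affirmed
```

The start argument is the same character offset that predict_entities returns, so the check can be applied to each extracted entity before normalization.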
Next Steps
- Add negation detection to distinguish "patient denies chest pain" from "patient reports chest pain"
- Connect ontology normalization to live terminology services (UMLS API, NLM RxNorm REST API)
- Fine-tune GLiNER on your institution's annotated clinical notes for improved accuracy
- Combine entity extraction with a relation extraction model for treatment-condition linkage
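As a starting point for the live terminology lookup mentioned above, the sketch below queries the public RxNav REST service for an RxCUI by drug name. Treat the details as assumptions to verify against the current RxNav documentation: the rxcui.json endpoint with a name parameter, and a response shaped like {"idGroup": {"rxnormId": [...]}}. The function names are illustrative. Only the parsing step runs in the example, so it works offline.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public RxNav base URL (assumption: verify before relying on it).
RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST"


def parse_rxcui(payload: dict) -> Optional[str]:
    """Pull the first RxCUI out of an rxcui.json response (response shape assumed)."""
    ids = payload.get("idGroup", {}).get("rxnormId") or []
    return ids[0] if ids else None


def lookup_rxcui(drug_name: str, timeout: float = 10.0) -> Optional[str]:
    """Query RxNav for a drug name and return its RxCUI, or None if there is no match."""
    url = f"{RXNAV_BASE}/rxcui.json?" + urllib.parse.urlencode({"name": drug_name})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_rxcui(json.load(response))


# Offline check of the parsing logic with hand-written payloads in the assumed shape.
print(parse_rxcui({"idGroup": {"rxnormId": ["6809"]}}))  # 6809
print(parse_rxcui({"idGroup": {}}))                      # None
```

In a pipeline, lookup_rxcui could replace the static rxnorm_map lookup, ideally behind a local cache so repeated drug names do not trigger repeated network calls.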