Extract Biomedical Entities from Clinical Reports with GLiNER

Use GLiNER locally to identify diseases, medications, procedures, symptoms, and other biomedical entities from clinical text -- no API keys, no cloud dependencies.

Overview

This cookbook shows how to build a biomedical named entity recognition (NER) pipeline using GLiNER, an open-source zero-shot NER framework. GLiNER lets you define arbitrary entity types at inference time, making it ideal for clinical text where entity categories vary across specialties.

What you will build:

A local biomedical entity extractor that runs on your own hardware
Medication parsing with structured dosage records
Abbreviation-aware extraction with expansion mappings
Ontology normalization to ICD-10, RxNorm, and SNOMED-CT
A complete clinical NER pipeline with batch processing and JSON export

Installation

pip install gliner

GLiNER downloads model weights on first use. The knowledgator/gliner-multitask-large-v0.5 model is approximately 1.5 GB. The examples use `list[str] | None` style type annotations, which require Python 3.10 or later.

Quick Start

Extract biomedical entities from a clinical sentence in under 15 lines:

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

text = (
    "Patient with type 2 diabetes mellitus and hypertension, "
    "currently on metformin 1000mg and lisinopril 20mg daily."
)

labels = ["disease", "medication", "dosage"]

entities = model.predict_entities(text, labels, threshold=0.5)

for e in entities:
    print(f"{e['text']:30} | {e['label']:12} | {e['score']:.2f}")

Example output (exact scores will vary):

type 2 diabetes mellitus       | disease      | 0.94
hypertension                   | disease      | 0.96
metformin                      | medication   | 0.93
1000mg                         | dosage       | 0.87
lisinopril                     | medication   | 0.91
20mg                           | dosage       | 0.85

Define Biomedical Entity Types

Core Clinical Entity Types

BIOMEDICAL_ENTITIES = [
    "disease or medical condition",
    "medication or drug",
    "medical procedure",
    "anatomical structure",
    "laboratory test",
    "symptom or clinical finding",
]

Detailed Entity Types

DETAILED_BIOMEDICAL_ENTITIES = [
    # Conditions
    "disease",
    "syndrome",
    "disorder",
    "injury",
    # Treatments
    "medication",
    "dosage",
    "route of administration",
    "therapeutic procedure",
    "surgical procedure",
    "diagnostic procedure",
    # Anatomy
    "body part",
    "organ",
    "tissue",
    "body system",
    # Clinical findings
    "symptom",
    "sign",
    "vital sign measurement",
    "laboratory value",
    # Temporal
    "duration",
    "frequency",
    "date of onset",
]

Domain-Specific Entity Sets

# Oncology
ONCOLOGY_ENTITIES = [
    "cancer type",
    "tumor location",
    "cancer stage",
    "tumor grade",
    "chemotherapy drug",
    "radiation therapy",
    "immunotherapy",
    "biomarker",
    "genetic mutation",
]

# Cardiology
CARDIOLOGY_ENTITIES = [
    "cardiac condition",
    "cardiovascular medication",
    "cardiac procedure",
    "cardiac anatomy",
    "ECG finding",
    "cardiac biomarker",
    "heart rhythm",
]

# Neurology
NEUROLOGY_ENTITIES = [
    "neurological condition",
    "neurological symptom",
    "brain region",
    "neurological medication",
    "neuroimaging finding",
    "cognitive assessment",
]

Basic Entity Extraction

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")


def extract_biomedical_entities(
    text: str,
    entity_types: list[str] | None = None,
    threshold: float = 0.5,
) -> list[dict]:
    """
    Extract biomedical entities from clinical text.

    Args:
        text: Clinical text to analyze.
        entity_types: Entity labels for GLiNER. Defaults to BIOMEDICAL_ENTITIES.
        threshold: Minimum confidence score.

    Returns:
        List of entity dicts with keys: text, label, start, end, score.
    """
    if entity_types is None:
        entity_types = BIOMEDICAL_ENTITIES

    return model.predict_entities(text, entity_types, threshold=threshold)


# --- Example: discharge summary excerpt ---
clinical_note = """
DISCHARGE SUMMARY

Patient is a 67-year-old male with history of type 2 diabetes mellitus,
hypertension, and chronic kidney disease stage 3. He was admitted for
acute exacerbation of congestive heart failure.

Medications on discharge:
- Metformin 1000mg twice daily
- Lisinopril 20mg daily
- Furosemide 40mg daily
- Carvedilol 12.5mg twice daily

Patient underwent echocardiogram showing ejection fraction of 35%.
Recommend follow-up with cardiology in 2 weeks.
"""

entities = extract_biomedical_entities(clinical_note)

for e in entities:
    print(f"{e['text']:30} | {e['label']:30} | {e['score']:.2f}")

Example output (exact scores will vary):

type 2 diabetes mellitus       | disease or medical condition  | 0.94
hypertension                   | disease or medical condition  | 0.96
chronic kidney disease stage 3 | disease or medical condition  | 0.92
congestive heart failure       | disease or medical condition  | 0.95
Metformin                      | medication or drug            | 0.93
Lisinopril                     | medication or drug            | 0.91
Furosemide                     | medication or drug            | 0.90
Carvedilol                     | medication or drug            | 0.89
echocardiogram                 | medical procedure             | 0.88
ejection fraction              | laboratory test               | 0.85
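
A flat list of entities is often easier to review once it is grouped by label. The sketch below is one way to do that; `group_by_label` is a hypothetical helper, and the sample entities are hand-written stand-ins for model output with illustrative offsets and scores.

```python
from collections import defaultdict


def group_by_label(entities: list[dict]) -> dict[str, list[str]]:
    """Collect entity texts under their label, in document order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for e in sorted(entities, key=lambda x: x["start"]):
        grouped[e["label"]].append(e["text"])
    return dict(grouped)


# Hand-written stand-ins for predict_entities output (values are illustrative).
sample_entities = [
    {"text": "Metformin", "label": "medication or drug",
     "start": 40, "end": 49, "score": 0.93},
    {"text": "hypertension", "label": "disease or medical condition",
     "start": 10, "end": 22, "score": 0.96},
    {"text": "Lisinopril", "label": "medication or drug",
     "start": 60, "end": 70, "score": 0.91},
]

grouped = group_by_label(sample_entities)
for label, texts in grouped.items():
    print(f"{label}: {', '.join(texts)}")
```

Sorting by `start` before grouping keeps each list in the order the mentions appear in the note, which helps when comparing against the source text.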

Extract Medications with Dosage

Define medication-specific labels and parse the results into structured records:

from dataclasses import dataclass


MEDICATION_ENTITIES = [
    "drug name",
    "dosage amount",
    "frequency",
    "route of administration",
    "indication",
]


@dataclass
class MedicationRecord:
    drug_name: str
    start: int
    dosage_amount: str = ""
    frequency: str = ""
    route_of_administration: str = ""
    indication: str = ""


def extract_medications(text: str, threshold: float = 0.4) -> list[dict]:
    """Extract medication-related entities from text."""
    return model.predict_entities(text, MEDICATION_ENTITIES, threshold=threshold)


def parse_medication_list(text: str) -> list[MedicationRecord]:
    """
    Parse a medication list into structured records.
    Groups entities by proximity: each 'drug name' starts a new record.
    """
    entities = extract_medications(text)
    entities.sort(key=lambda e: e["start"])

    medications: list[MedicationRecord] = []
    current: MedicationRecord | None = None

    for entity in entities:
        if entity["label"] == "drug name":
            if current is not None:
                medications.append(current)
            current = MedicationRecord(
                drug_name=entity["text"], start=entity["start"]
            )
        elif current is not None:
            attr = entity["label"].replace(" ", "_")
            if hasattr(current, attr):
                setattr(current, attr, entity["text"])

    if current is not None:
        medications.append(current)

    return medications


# --- Example ---
med_text = """
Current Medications:
1. Metformin 500mg PO twice daily for diabetes
2. Atorvastatin 40mg PO at bedtime for hyperlipidemia
3. Aspirin 81mg PO daily for cardiovascular protection
4. Insulin glargine 20 units subcutaneous at bedtime
"""

for med in parse_medication_list(med_text):
    print(med)

Example output (`start` offsets elided; actual spans depend on the model):

MedicationRecord(drug_name='Metformin', start=..., dosage_amount='500mg', frequency='twice daily', route_of_administration='PO', indication='diabetes')
MedicationRecord(drug_name='Atorvastatin', start=..., dosage_amount='40mg', frequency='at bedtime', route_of_administration='PO', indication='hyperlipidemia')
MedicationRecord(drug_name='Aspirin', start=..., dosage_amount='81mg', frequency='daily', route_of_administration='PO', indication='cardiovascular protection')
MedicationRecord(drug_name='Insulin glargine', start=..., dosage_amount='20 units', frequency='at bedtime', route_of_administration='subcutaneous', indication='')
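
Grouping purely by order means an attribute that appears before its drug name attaches to the previous record. When the medication list has one drug per line, an alternative is to bucket entities by line first. The following is a minimal sketch under that one-drug-per-line assumption; `group_entities_by_line` and `_ent` are hypothetical helpers, and the entities are built by hand rather than by the model.

```python
def group_entities_by_line(text: str, entities: list[dict]) -> list[list[dict]]:
    """Bucket entities by the line of `text` that contains their start offset."""
    # Character range [lo, hi) of every line, including its newline.
    line_ranges = []
    offset = 0
    for line in text.splitlines(keepends=True):
        line_ranges.append((offset, offset + len(line)))
        offset += len(line)

    buckets: list[list[dict]] = [[] for _ in line_ranges]
    for e in sorted(entities, key=lambda x: x["start"]):
        for i, (lo, hi) in enumerate(line_ranges):
            if lo <= e["start"] < hi:
                buckets[i].append(e)
                break

    # Keep only lines that actually mention a drug.
    return [b for b in buckets if any(e["label"] == "drug name" for e in b)]


def _ent(text: str, span: str, label: str) -> dict:
    """Build a hand-written entity dict with real character offsets."""
    start = text.index(span)
    return {"text": span, "label": label, "start": start,
            "end": start + len(span), "score": 1.0}


sample = "1. Metformin 500mg for diabetes\n2. Aspirin 81mg daily\n"
sample_entities = [
    _ent(sample, "Metformin", "drug name"),
    _ent(sample, "500mg", "dosage amount"),
    _ent(sample, "diabetes", "indication"),
    _ent(sample, "Aspirin", "drug name"),
    _ent(sample, "81mg", "dosage amount"),
    _ent(sample, "daily", "frequency"),
]

groups = group_entities_by_line(sample, sample_entities)
for group in groups:
    print([e["text"] for e in group])
```

Each bucket can then be turned into a record the same way as before, with the guarantee that attributes never cross a line boundary.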

Handle Clinical Abbreviations

Clinical text is dense with abbreviations. Use expanded label descriptions so GLiNER recognizes them:

ABBREVIATION_AWARE_ENTITIES = [
    "disease or condition (including abbreviations like CHF, COPD, DM)",
    "medication (including abbreviations like ASA, HCTZ)",
    "procedure (including abbreviations like CABG, PCI, EGD)",
    "laboratory test (including abbreviations like CBC, BMP, HbA1c)",
]

abbreviation_text = """
72 y/o M w/ PMHx of HTN, DM2, CAD s/p CABG, CHF (EF 30%), CKD3.
Admitted for SOB and LE edema. Labs: BNP 1200, Cr 2.1, K 5.2.
Started on IV Lasix, continued home meds including ASA, metoprolol, lisinopril.
Echo showed worsening EF. Cards consulted for ?ICD placement.
"""

entities = model.predict_entities(
    abbreviation_text, ABBREVIATION_AWARE_ENTITIES, threshold=0.4
)

for e in entities:
    print(f"{e['text']:20} | {e['label'][:45]:45} | {e['score']:.2f}")

Abbreviation Expansion Mapping

After extraction, expand abbreviations to their full forms:

ABBREVIATION_MAP = {
    # Conditions
    "HTN": "hypertension",
    "DM": "diabetes mellitus",
    "DM2": "type 2 diabetes mellitus",
    "CAD": "coronary artery disease",
    "CHF": "congestive heart failure",
    "COPD": "chronic obstructive pulmonary disease",
    "CKD": "chronic kidney disease",
    "AFib": "atrial fibrillation",
    "MI": "myocardial infarction",
    "CVA": "cerebrovascular accident",
    "DVT": "deep vein thrombosis",
    "PE": "pulmonary embolism",
    # Procedures
    "CABG": "coronary artery bypass graft",
    "PCI": "percutaneous coronary intervention",
    "EGD": "esophagogastroduodenoscopy",
    "ERCP": "endoscopic retrograde cholangiopancreatography",
    "CT": "computed tomography",
    "MRI": "magnetic resonance imaging",
    # Medications
    "ASA": "aspirin",
    "HCTZ": "hydrochlorothiazide",
    "MVI": "multivitamin",
    "PPI": "proton pump inhibitor",
    # Labs
    "CBC": "complete blood count",
    "BMP": "basic metabolic panel",
    "CMP": "comprehensive metabolic panel",
    "LFT": "liver function tests",
    "HbA1c": "hemoglobin A1c",
    "BNP": "B-type natriuretic peptide",
}


# Case-insensitive lookup so variants such as "hba1c" or "AFIB" also match.
_ABBREVIATION_LOOKUP = {k.upper(): v for k, v in ABBREVIATION_MAP.items()}


def expand_abbreviations(entities: list[dict]) -> list[dict]:
    """Add an 'expanded' field to entities whose text matches a known abbreviation."""
    expanded = []
    for e in entities:
        key = e["text"].strip().upper()
        expanded.append({
            **e,
            # Fall back to the original text when no expansion is known.
            "expanded": _ABBREVIATION_LOOKUP.get(key, e["text"]),
        })
    return expanded
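
For downstream readers or models it can also help to write the expansions back into the note itself. The sketch below does this using entity offsets; `expand_in_text` is a hypothetical helper, the small map is an excerpt of the one above, and it assumes the entity spans do not overlap.

```python
SAMPLE_ABBREVIATIONS = {
    "HTN": "hypertension",
    "CHF": "congestive heart failure",
}


def expand_in_text(text: str, entities: list[dict], abbreviation_map: dict) -> str:
    """Return a copy of `text` with known abbreviation spans expanded.

    Assumes entity spans do not overlap.
    """
    result = text
    # Replace from right to left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda x: x["start"], reverse=True):
        expansion = abbreviation_map.get(e["text"].strip())
        if expansion:
            result = result[: e["start"]] + expansion + result[e["end"]:]
    return result


note = "PMHx of HTN and CHF."
# Hand-written stand-ins for model output, with offsets taken from the text.
note_entities = [
    {"text": "HTN", "label": "disease", "start": note.index("HTN"),
     "end": note.index("HTN") + 3, "score": 0.9},
    {"text": "CHF", "label": "disease", "start": note.index("CHF"),
     "end": note.index("CHF") + 3, "score": 0.9},
]

expanded_note = expand_in_text(note, note_entities, SAMPLE_ABBREVIATIONS)
print(expanded_note)
```

Working right to left avoids recomputing offsets after each replacement, since a change later in the string never shifts positions earlier in it.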

Normalize to Medical Ontologies

Map extracted entities to ICD-10, RxNorm, and SNOMED-CT codes. In production, connect to real terminology services (UMLS, NLM RxNorm API). The class below demonstrates the pattern with simplified lookup tables:

import re
from typing import Optional


class OntologyNormalizer:
    """Normalize extracted entities to standard medical ontologies."""

    def __init__(self) -> None:
        self.icd10_map = {
            "type 2 diabetes mellitus": "E11",
            "diabetes mellitus": "E11",
            "hypertension": "I10",
            "essential hypertension": "I10",
            "congestive heart failure": "I50.9",
            "heart failure": "I50.9",
            "chronic kidney disease": "N18",
            "atrial fibrillation": "I48",
            "copd": "J44.9",
            "chronic obstructive pulmonary disease": "J44.9",
            "pneumonia": "J18.9",
            "acute myocardial infarction": "I21.9",
        }
        self.rxnorm_map = {
            "metformin": "6809",
            "lisinopril": "29046",
            "atorvastatin": "83367",
            "aspirin": "1191",
            "furosemide": "4603",
            "carvedilol": "20352",
            "amlodipine": "17767",
            "omeprazole": "7646",
        }
        self.snomed_map = {
            "chest pain": "29857009",
            "dyspnea": "267036007",
            "fatigue": "84229001",
            "headache": "25064002",
            "nausea": "422587007",
            "fever": "386661006",
            "cough": "49727002",
            "edema": "267038008",
        }

    def normalize_condition(self, entity_text: str) -> Optional[dict]:
        """Normalize a disease or condition to ICD-10."""
        normalized = entity_text.lower().strip()
        if not normalized:
            return None  # an empty string is a substring of every term
        # Loose two-way substring match: fine for a demo, too permissive for production.
        for term, code in self.icd10_map.items():
            if term in normalized or normalized in term:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "ICD-10", "code": code}
        return None

    def normalize_medication(self, entity_text: str) -> Optional[dict]:
        """Normalize a medication to RxNorm."""
        # Strip dose strings such as "12.5mg" or "20 units" before matching.
        drug_name = re.sub(
            r"\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|units?)\b", "", entity_text.lower()
        ).strip()
        for term, code in self.rxnorm_map.items():
            if term in drug_name:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "RxNorm", "code": code}
        return None

    def normalize_symptom(self, entity_text: str) -> Optional[dict]:
        """Normalize a symptom to SNOMED-CT."""
        normalized = entity_text.lower().strip()
        if not normalized:
            return None  # an empty string is a substring of every term
        for term, code in self.snomed_map.items():
            if term in normalized or normalized in term:
                return {"original": entity_text, "normalized_term": term,
                        "ontology": "SNOMED-CT", "code": code}
        return None

    def normalize(self, entity: dict) -> Optional[dict]:
        """Route normalization by entity label."""
        label = entity["label"].lower()
        if "disease" in label or "condition" in label:
            return self.normalize_condition(entity["text"])
        if "medication" in label or "drug" in label:
            return self.normalize_medication(entity["text"])
        if "symptom" in label:
            return self.normalize_symptom(entity["text"])
        return None


# --- Example ---
normalizer = OntologyNormalizer()

text = "Patient with hypertension and type 2 diabetes on metformin, presenting with chest pain."
entities = model.predict_entities(text, BIOMEDICAL_ENTITIES, threshold=0.4)

for e in entities:
    norm = normalizer.normalize(e)
    if norm:
        print(f"{e['text']:25} -> {norm['ontology']}: {norm['code']}")
    else:
        print(f"{e['text']:25} -> (no normalization)")

Example output (actual spans depend on the model):

hypertension              -> ICD-10: I10
type 2 diabetes           -> ICD-10: E11
metformin                 -> RxNorm: 6809
chest pain                -> SNOMED-CT: 29857009
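
Two-way substring matching is convenient for a demo but can assign a code to a different condition: "pulmonary hypertension", for example, would receive the code for "hypertension". One possible refinement is to prefer exact matches and flag anything looser for review. The sketch below is not part of the class above; `normalize_condition_strict` and its `match` field are hypothetical, and the table is an excerpt of the ICD-10 map already shown. Unlike the original method, it does not match an entity that is only a fragment of a table term.

```python
import re
from typing import Optional

ICD10_EXCERPT = {
    "type 2 diabetes mellitus": "E11",
    "diabetes mellitus": "E11",
    "hypertension": "I10",
    "chronic kidney disease": "N18",
}


def normalize_condition_strict(entity_text: str, icd10_map: dict) -> Optional[dict]:
    """Prefer exact matches; otherwise use the longest whole-word term found."""
    normalized = entity_text.lower().strip()
    if not normalized:
        return None
    if normalized in icd10_map:
        term, match = normalized, "exact"
    else:
        candidates = [
            t for t in icd10_map
            if re.search(rf"\b{re.escape(t)}\b", normalized)
        ]
        if not candidates:
            return None
        # The longest term is the most specific one contained in the entity.
        term, match = max(candidates, key=len), "partial"
    return {"original": entity_text, "normalized_term": term,
            "ontology": "ICD-10", "code": icd10_map[term], "match": match}


for sample in ["Hypertension", "chronic kidney disease stage 3", "pulmonary hypertension"]:
    print(sample, "->", normalize_condition_strict(sample, ICD10_EXCERPT))
```

A "partial" result still carries a code, so the caller decides whether to accept it, route it to manual review, or discard it.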

Full Clinical NER Pipeline

A complete pipeline class that loads the model once, processes documents, normalizes entities, and exports to JSON:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import json

from gliner import GLiNER


@dataclass
class ExtractedEntity:
    """A single extracted entity with optional ontology normalization."""
    text: str
    label: str
    start: int
    end: int
    score: float
    normalization: Optional[dict] = None


@dataclass
class ClinicalDocument:
    """A processed clinical document."""
    document_id: str
    source_text: str
    entities: list[ExtractedEntity] = field(default_factory=list)
    processed_at: str = ""

    def to_dict(self) -> dict:
        return {
            "document_id": self.document_id,
            "entity_count": len(self.entities),
            "entities": [
                {
                    "text": e.text,
                    "label": e.label,
                    "position": [e.start, e.end],
                    "confidence": round(e.score, 4),
                    "normalization": e.normalization,
                }
                for e in self.entities
            ],
            "processed_at": self.processed_at,
        }


class ClinicalNERPipeline:
    """
    Local biomedical NER pipeline built on GLiNER.

    Loads the model once and reuses it for all documents.
    """

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-multitask-large-v0.5",
        entity_types: list[str] | None = None,
        threshold: float = 0.5,
    ) -> None:
        self.model = GLiNER.from_pretrained(model_name)
        self.entity_types = entity_types or BIOMEDICAL_ENTITIES
        self.threshold = threshold
        self.normalizer = OntologyNormalizer()

    def process_document(self, doc_id: str, text: str) -> ClinicalDocument:
        """Extract and normalize entities from a single document."""
        doc = ClinicalDocument(
            document_id=doc_id,
            source_text=text,
            processed_at=datetime.now(timezone.utc).isoformat(),
        )

        raw_entities = self.model.predict_entities(
            text, self.entity_types, threshold=self.threshold
        )

        for e in raw_entities:
            extracted = ExtractedEntity(
                text=e["text"],
                label=e["label"],
                start=e["start"],
                end=e["end"],
                score=e["score"],
                normalization=self.normalizer.normalize(e),
            )
            doc.entities.append(extracted)

        return doc

    def process_batch(self, documents: list[dict]) -> list[ClinicalDocument]:
        """
        Process multiple documents.

        Args:
            documents: List of dicts, each with 'id' and 'text' keys.

        Returns:
            List of ClinicalDocument objects.
        """
        return [
            self.process_document(doc["id"], doc["text"])
            for doc in documents
        ]

    @staticmethod
    def export_to_json(documents: list[ClinicalDocument], filepath: str) -> None:
        """Export processed documents to a JSON file."""
        data = [doc.to_dict() for doc in documents]
        with open(filepath, "w") as f:
            json.dump(data, f, indent=2)


# --- Usage ---
pipeline = ClinicalNERPipeline(threshold=0.5)

documents = [
    {
        "id": "note_001",
        "text": (
            "Assessment: 65 y/o female with poorly controlled type 2 diabetes "
            "(HbA1c 9.2%), hypertension, and hyperlipidemia. Patient reports "
            "increased fatigue and polyuria.\n\n"
            "Plan:\n"
            "1. Increase metformin to 1000mg BID\n"
            "2. Add glipizide 5mg daily\n"
            "3. Continue lisinopril 20mg daily\n"
            "4. Recheck HbA1c in 3 months"
        ),
    },
    {
        "id": "note_002",
        "text": (
            "Chief Complaint: Chest pain\n\n"
            "HPI: 58 y/o male with history of CAD s/p stent placement 2019 "
            "presenting with substernal chest pain radiating to left arm, "
            "associated with diaphoresis. Pain started 2 hours ago at rest.\n\n"
            "Assessment: Acute coronary syndrome, rule out STEMI\n"
            "Plan: Serial troponins, ECG, cardiology consult, continue aspirin "
            "and heparin drip"
        ),
    },
]

processed = pipeline.process_batch(documents)

for doc in processed:
    print(f"\n=== {doc.document_id} ===")
    print(f"Entities found: {len(doc.entities)}")
    for entity in doc.entities[:5]:
        norm_str = ""
        if entity.normalization:
            norm_str = f" -> {entity.normalization['ontology']}: {entity.normalization['code']}"
        print(f"  {entity.text} ({entity.label}, {entity.score:.2f}){norm_str}")

# Export results
pipeline.export_to_json(processed, "clinical_ner_results.json")
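
Once results are on disk, a quick summary helps check coverage across a batch. The sketch below reads a file in the shape produced by `ClinicalDocument.to_dict()` and counts entities per label; `summarize_export` is a hypothetical helper, and the sample data is hand-written so the example runs without the model.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_export(filepath: str) -> dict:
    """Count documents, entities per label, and normalized entities in an export."""
    with open(filepath, encoding="utf-8") as f:
        docs = json.load(f)
    all_entities = [e for d in docs for e in d["entities"]]
    return {
        "documents": len(docs),
        "label_counts": dict(Counter(e["label"] for e in all_entities)),
        "normalized": sum(1 for e in all_entities if e["normalization"]),
    }


# Hand-written sample in the export shape (values are illustrative).
sample_export = [
    {
        "document_id": "note_001",
        "entity_count": 2,
        "entities": [
            {"text": "hypertension", "label": "disease or medical condition",
             "position": [10, 22], "confidence": 0.96,
             "normalization": {"original": "hypertension",
                               "normalized_term": "hypertension",
                               "ontology": "ICD-10", "code": "I10"}},
            {"text": "polyuria", "label": "symptom or clinical finding",
             "position": [40, 48], "confidence": 0.71,
             "normalization": None},
        ],
        "processed_at": "2024-01-01T00:00:00+00:00",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clinical_ner_results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_export, f)
    summary = summarize_export(path)

print(summary)
```

A low ratio of normalized to total entities is a useful signal that the lookup tables need extending for your document mix.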

Best Practices

Use domain-specific entity types

Tailor your label lists to the clinical specialty. Radiology reports benefit from labels like "imaging finding", "anatomical location", and "measurement". Pathology reports need "specimen type", "tumor grade", and "margin status". More specific labels tend to produce more accurate extractions.
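
One way to keep specialty label sets organized is a small registry keyed by specialty. This is only a sketch: `SPECIALTY_ENTITIES` and `labels_for` are hypothetical names, and the label strings are the ones suggested in the paragraph above.

```python
from typing import Optional

SPECIALTY_ENTITIES = {
    "radiology": ["imaging finding", "anatomical location", "measurement"],
    "pathology": ["specimen type", "tumor grade", "margin status"],
}


def labels_for(specialty: str, base_labels: Optional[list[str]] = None) -> list[str]:
    """Combine base labels with specialty labels, dropping duplicates in order."""
    specific = SPECIALTY_ENTITIES.get(specialty.lower(), [])
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys((base_labels or []) + specific))


radiology_labels = labels_for("Radiology", ["measurement", "medication or drug"])
print(radiology_labels)
```

The resulting list can be passed as the label argument to `predict_entities` in place of a hard-coded set.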

Adjust thresholds by entity type

Different entity types have different precision-recall tradeoffs. Run separate predict_entities calls with different thresholds:

def extract_with_variable_thresholds(text: str) -> list[dict]:
    """Use higher thresholds for medications, lower for symptoms."""
    meds = model.predict_entities(text, ["medication or drug"], threshold=0.6)
    symptoms = model.predict_entities(text, ["symptom or clinical finding"], threshold=0.4)
    return meds + symptoms
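
When there are many labels, separate model calls per label become slow. An alternative is to call `predict_entities` once at the lowest threshold you need and then filter per label in plain Python. The sketch below shows only the filtering step on hand-written entities; `filter_by_label_threshold` is a hypothetical helper, and scores from a combined call may differ from scores obtained with one label at a time.

```python
def filter_by_label_threshold(
    entities: list[dict],
    thresholds: dict[str, float],
    default: float = 0.5,
) -> list[dict]:
    """Keep entities whose score meets the threshold configured for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


LABEL_THRESHOLDS = {
    "medication or drug": 0.6,
    "symptom or clinical finding": 0.4,
}

# Hand-written stand-ins for a single low-threshold model call.
candidates = [
    {"text": "metformin", "label": "medication or drug", "score": 0.55},
    {"text": "fatigue", "label": "symptom or clinical finding", "score": 0.45},
    {"text": "aspirin", "label": "medication or drug", "score": 0.72},
]

kept = filter_by_label_threshold(candidates, LABEL_THRESHOLDS)
print([e["text"] for e in kept])
```

Keeping the thresholds in a dictionary also makes them easy to tune per label against an annotated sample.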

Post-process for quality

Drop exact duplicate spans and filter noise such as very short or numeric-only spans:

def post_process_entities(entities: list[dict]) -> list[dict]:
    """Remove short spans, numeric-only spans, and duplicates."""
    seen: set[tuple[int, int]] = set()
    result = []
    for e in sorted(entities, key=lambda x: -x["score"]):
        if len(e["text"].strip()) < 2:
            continue
        if e["text"].strip().replace(".", "").isdigit():
            continue
        span_key = (e["start"], e["end"])
        if span_key in seen:
            continue
        seen.add(span_key)
        result.append(e)
    return result
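
The function above only removes spans with identical offsets. Partially overlapping spans, such as "chronic kidney disease" inside "chronic kidney disease stage 3", need a separate pass. A minimal greedy sketch follows; `remove_overlaps` is a hypothetical helper that keeps the highest-scoring span in each overlapping group, which is one reasonable policy among several.

```python
def remove_overlaps(entities: list[dict]) -> list[dict]:
    """Greedily keep the highest-scoring spans that do not overlap each other."""
    kept: list[dict] = []
    for e in sorted(entities, key=lambda x: -x["score"]):
        # Two spans are disjoint when one ends before the other starts.
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda x: x["start"])


# Hand-written stand-ins for model output (offsets and scores are illustrative).
overlapping = [
    {"text": "chronic kidney disease", "label": "disease",
     "start": 0, "end": 22, "score": 0.80},
    {"text": "chronic kidney disease stage 3", "label": "disease",
     "start": 0, "end": 30, "score": 0.92},
    {"text": "hypertension", "label": "disease",
     "start": 35, "end": 47, "score": 0.96},
]

resolved = remove_overlaps(overlapping)
print([e["text"] for e in resolved])
```

If you would rather prefer the longest span than the highest score, change the sort key to the span length.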

Validate against known terminologies

Cross-reference extracted medications against drug databases (RxNorm, local formulary) to catch false positives:

KNOWN_DRUGS = {"metformin", "lisinopril", "aspirin", "atorvastatin", "furosemide"}

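def validate_medication(entity_text: str) -> bool:
    """Check whether the first word of the extracted text is a known drug name."""
    words = entity_text.lower().split()
    # Guard against empty or whitespace-only spans, which have no first word.
    return bool(words) and words[0] in KNOWN_DRUGS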

Limitations and Considerations

Not a clinical decision tool. Extracted entities require clinical validation before use in patient care.
Abbreviation ambiguity. Some abbreviations have multiple meanings (e.g., "MS" could be multiple sclerosis, mitral stenosis, or morphine sulfate). Context-aware disambiguation requires additional logic.
Negation and uncertainty. Entity extraction alone does not capture negation ("no chest pain") or hedging ("possible pneumonia"). Consider adding a negation detection layer.
PHI compliance. Ensure all processing complies with HIPAA, GDPR, or applicable regulations. GLiNER runs locally, which avoids sending data to external APIs, but you still need proper data handling safeguards.
Model size and latency. The multitask-large model weights are approximately 1.5 GB, so plan for at least that much memory. For high-throughput production workloads, consider GPU acceleration or the smaller GLiNER model variants.
Terminology updates. Medical terminology evolves. Regularly update normalization mappings and validation dictionaries.
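
To illustrate the negation point above, here is a deliberately simple rule-based check that looks for a cue word shortly before an entity within the same sentence. It is a rough sketch, not a substitute for a validated negation algorithm: `NEGATION_CUES`, `is_negated`, and the 40-character window are all assumptions chosen for illustration, and it ignores scope terminators, uncertainty, and cues that follow the entity.

```python
import re

# Illustrative cue list; real systems use much larger, curated lists.
NEGATION_CUES = ("no", "denies", "without", "negative for")


def is_negated(text: str, entity: dict, window: int = 40) -> bool:
    """Return True if a negation cue precedes the entity in the same sentence."""
    preceding = text[max(0, entity["start"] - window): entity["start"]].lower()
    # Restrict the search to the current sentence or clause.
    preceding = re.split(r"[.;\n]", preceding)[-1]
    return any(
        re.search(rf"\b{re.escape(cue)}\b", preceding) for cue in NEGATION_CUES
    )


note = "Patient denies chest pain. Reports nausea."
# Hand-written stand-ins for model output, with offsets taken from the text.
chest_pain = {"text": "chest pain", "label": "symptom",
              "start": note.index("chest pain"),
              "end": note.index("chest pain") + len("chest pain"), "score": 0.9}
nausea = {"text": "nausea", "label": "symptom",
          "start": note.index("nausea"),
          "end": note.index("nausea") + len("nausea"), "score": 0.9}

print(is_negated(note, chest_pain))
print(is_negated(note, nausea))
```

The result can be stored as an extra field on each entity so that downstream consumers can separate asserted from negated findings.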

Next Steps

Add negation detection to distinguish "patient denies chest pain" from "patient reports chest pain"
Connect ontology normalization to live terminology services (UMLS API, NLM RxNorm REST API)
Fine-tune GLiNER on your institution's annotated clinical notes for improved accuracy
Combine entity extraction with a relation extraction model for treatment-condition linkage
