Extract Structured Data from Invoices with GLiNER

Extract key fields from unstructured invoice text using GLiNER's local entity recognition, then assemble them into structured JSON ready for accounting systems.

Overview

This cookbook uses an entity-extraction approach to invoice parsing. Instead of relying on cloud APIs or instruction-based prompting, we run a local GLiNER model to detect named entities that correspond to invoice fields -- vendor name, amounts, dates, line items, and more. The extracted entities are then grouped and assembled into a clean, structured JSON format suitable for export to QuickBooks, Xero, Sage, or CSV.

Why this approach works: Invoice fields are essentially named entities. An invoice number is a specific text span, a vendor name is a specific text span, and so on. GLiNER's zero-shot entity recognition lets you define arbitrary entity labels at inference time, so you can tailor the label set to match exactly the fields your accounting system expects.

Installation

pip install gliner

The model weights are downloaded automatically on first use (approximately 1.5 GB).

Quick Start

Extract the most important fields from an invoice in under 20 lines:

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

invoice_text = """
INVOICE #INV-2024-0892
Date: January 15, 2024
Due: February 14, 2024

From: TechSupply Corp, 123 Business Park Drive, San Francisco, CA 94102
Bill To: Acme Industries, 456 Corporate Blvd, New York, NY 10001

Widget Pro X200 10 x $49.99 $499.90
Premium Support 1 x $299.00 $299.00

Subtotal: $798.90 Tax (8.5%): $67.91 Total: $866.81
Payment Terms: Net 30
"""

labels = [
"invoice_number", "invoice_date", "due_date",
"vendor_name", "customer_name",
"subtotal", "tax_amount", "total", "payment_terms",
]

entities = model.predict_entities(invoice_text, labels, threshold=0.4)
for e in entities:
    print(f"{e['label']:>20s}: {e['text']} (score={e['score']:.2f})")

Define Invoice Entity Types

A complete invoice extraction requires a rich set of entity labels. Define them in groups so you can mix and match depending on the invoice format:

# Core identifiers
HEADER_LABELS = [
"invoice_number",
"invoice_date",
"due_date",
"currency",
"payment_terms",
]

# Parties
PARTY_LABELS = [
"vendor_name",
"vendor_address",
"vendor_tax_id",
"customer_name",
"customer_address",
]

# Line-item fields
LINE_ITEM_LABELS = [
"line_item_description",
"quantity",
"unit_price",
"line_total",
]

# Totals
TOTALS_LABELS = [
"subtotal",
"tax_rate",
"tax_amount",
"total",
]

ALL_LABELS = HEADER_LABELS + PARTY_LABELS + LINE_ITEM_LABELS + TOTALS_LABELS

Basic Invoice Extraction

A function that runs entity extraction and returns the raw entity list:

from typing import List, Dict
from gliner import GLiNER


def extract_invoice_entities(
    text: str,
    model: GLiNER,
    labels: List[str] | None = None,
    threshold: float = 0.4,
) -> List[Dict]:
    """
    Extract invoice-related entities from text.

    Args:
        text: Raw invoice text (from PDF text extraction or OCR).
        model: A loaded GLiNER model instance.
        labels: Entity labels to detect. Defaults to ALL_LABELS.
        threshold: Minimum confidence score.

    Returns:
        List of entity dicts with keys: text, label, start, end, score.
    """
    if labels is None:
        labels = ALL_LABELS

    entities = model.predict_entities(text, labels, threshold=threshold)

    # Sort by position in the document
    entities.sort(key=lambda e: e["start"])
    return entities
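
The function above sends the whole invoice to the model in one call. Transformer-based extractors generally read a bounded number of tokens per call, so a long multi-page invoice may be cut off; check the maximum length configured for your checkpoint. The sketch below is one way to work around that, not part of GLiNER itself: it splits the text on line boundaries, runs an extractor on each chunk, and shifts the start/end offsets back to positions in the full text. The name extract_in_chunks and the max_chars default are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical helper, not part of GLiNER. `extract` is any callable that
# takes a text chunk and returns entity dicts with "start" and "end" keys,
# e.g. lambda chunk: extract_invoice_entities(chunk, model).
ExtractFn = Callable[[str], List[Dict]]


def extract_in_chunks(text: str, extract: ExtractFn, max_chars: int = 1500) -> List[Dict]:
    """Extract entities chunk by chunk and map offsets back to the full text."""
    entities: List[Dict] = []
    chunk_lines: List[str] = []
    chunk_start = 0  # offset of the current chunk within `text`
    chunk_len = 0

    def flush() -> None:
        nonlocal chunk_lines, chunk_start, chunk_len
        if not chunk_lines:
            return
        chunk = "".join(chunk_lines)
        for ent in extract(chunk):
            shifted = dict(ent)
            shifted["start"] = ent["start"] + chunk_start
            shifted["end"] = ent["end"] + chunk_start
            entities.append(shifted)
        chunk_start += len(chunk)
        chunk_lines = []
        chunk_len = 0

    # keepends=True preserves every character, so offsets stay exact.
    for line in text.splitlines(keepends=True):
        if chunk_lines and chunk_len + len(line) > max_chars:
            flush()
        chunk_lines.append(line)
        chunk_len += len(line)
    flush()
    return entities
```

Splitting on line boundaries keeps each invoice row intact, but a line item whose description and price land in different chunks would still be separated, so choose max_chars with some headroom.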

Assemble Extracted Entities into JSON

The raw entity list needs to be grouped and structured. Single-value fields (invoice number, vendor name) keep the highest-scoring match. Repeating fields (line item descriptions, quantities, unit prices) are grouped by document position into line-item rows.

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class InvoiceData:
    """Structured invoice representation."""
    invoice_number: str = ""
    invoice_date: str = ""
    due_date: str = ""
    currency: str = ""
    payment_terms: str = ""
    vendor_name: str = ""
    vendor_address: str = ""
    vendor_tax_id: str = ""
    customer_name: str = ""
    customer_address: str = ""
    line_items: list = field(default_factory=list)
    subtotal: float = 0.0
    tax_rate: float = 0.0
    tax_amount: float = 0.0
    total: float = 0.0

    def to_dict(self) -> dict:
        return {
            "invoice_number": self.invoice_number,
            "invoice_date": self.invoice_date,
            "due_date": self.due_date,
            "currency": self.currency,
            "payment_terms": self.payment_terms,
            "vendor": {
                "name": self.vendor_name,
                "address": self.vendor_address,
                "tax_id": self.vendor_tax_id,
            },
            "customer": {
                "name": self.customer_name,
                "address": self.customer_address,
            },
            "line_items": self.line_items,
            "subtotal": self.subtotal,
            "tax_rate": self.tax_rate,
            "tax_amount": self.tax_amount,
            "total": self.total,
        }


# Labels that appear once per invoice -- keep the best match
SINGLE_VALUE_FIELDS = {
"invoice_number", "invoice_date", "due_date", "currency",
"payment_terms", "vendor_name", "vendor_address", "vendor_tax_id",
"customer_name", "customer_address",
"subtotal", "tax_rate", "tax_amount", "total",
}

# Labels that repeat per line item
LINE_ITEM_FIELDS = {"line_item_description", "quantity", "unit_price", "line_total"}


def _parse_number(text: str) -> float:
    """Convert entity text to a float, stripping currency symbols and commas."""
    cleaned = text.replace("$", "").replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return 0.0


def _group_line_items(entities: List[Dict]) -> list:
    """
    Group line-item entities into rows based on document order.

    Strategy: each occurrence of 'line_item_description' starts a new row.
    Subsequent quantity, unit_price, and line_total entities are attached
    to the most recently opened row.
    """
    rows: list = []
    current_row: dict = {}

    for ent in entities:
        label = ent["label"]
        if label not in LINE_ITEM_FIELDS:
            continue

        if label == "line_item_description":
            # Start a new line-item row
            if current_row:
                rows.append(current_row)
            current_row = {
                "description": ent["text"],
                "quantity": 0,
                "unit_price": 0.0,
                "total": 0.0,
            }
        elif current_row:
            if label == "quantity":
                current_row["quantity"] = int(_parse_number(ent["text"]))
            elif label == "unit_price":
                current_row["unit_price"] = _parse_number(ent["text"])
            elif label == "line_total":
                current_row["total"] = _parse_number(ent["text"])

    if current_row:
        rows.append(current_row)

    return rows


def assemble_invoice(entities: List[Dict]) -> InvoiceData:
    """
    Assemble a flat list of entities into a structured InvoiceData object.

    Args:
        entities: Output from extract_invoice_entities(), sorted by position.

    Returns:
        Populated InvoiceData instance.
    """
    invoice = InvoiceData()

    # For single-value fields, pick the highest-scoring entity
    best: dict = {}
    for ent in entities:
        label = ent["label"]
        if label in SINGLE_VALUE_FIELDS:
            if label not in best or ent["score"] > best[label]["score"]:
                best[label] = ent

    # Assign single-value fields
    numeric_fields = {"subtotal", "tax_rate", "tax_amount", "total"}
    for label, ent in best.items():
        value = ent["text"]
        if label in numeric_fields:
            setattr(invoice, label, _parse_number(value))
        else:
            setattr(invoice, label, value)

    # Group line items
    invoice.line_items = _group_line_items(entities)

    return invoice

Full extraction example:

model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

entities = extract_invoice_entities(invoice_text, model)
invoice = assemble_invoice(entities)

import json
print(json.dumps(invoice.to_dict(), indent=2))

Data Validation

Validate that the extracted data is complete and that the math adds up:

from dataclasses import dataclass
from typing import List
import re


@dataclass
class ValidationResult:
    """Result of invoice data validation."""
    is_valid: bool
    errors: List[str]
    warnings: List[str]


def validate_invoice(invoice: InvoiceData) -> ValidationResult:
    """
    Validate extracted invoice data for completeness and arithmetic consistency.

    Checks:
        - Required fields are present.
        - Line-item totals match quantity * unit_price.
        - Subtotal matches sum of line totals.
        - Tax amount matches subtotal * tax_rate.
        - Grand total matches subtotal + tax.
    """
    errors: List[str] = []
    warnings: List[str] = []

    # --- Required fields ---
    if not invoice.invoice_number:
        errors.append("Missing invoice_number")
    if not invoice.invoice_date:
        errors.append("Missing invoice_date")
    if not invoice.vendor_name:
        errors.append("Missing vendor_name")
    if invoice.total == 0.0:
        errors.append("Total is zero or missing")

    # --- Line items ---
    if not invoice.line_items:
        errors.append("No line items found")
    else:
        for i, item in enumerate(invoice.line_items, start=1):
            if not item.get("description"):
                warnings.append(f"Line item {i}: missing description")

            qty = item.get("quantity", 0)
            price = item.get("unit_price", 0.0)
            total = item.get("total", 0.0)

            if qty <= 0:
                warnings.append(f"Line item {i}: quantity is {qty}")
            if price < 0:
                errors.append(f"Line item {i}: negative unit_price")

            expected = round(qty * price, 2)
            if total and abs(expected - total) > 0.02:
                warnings.append(
                    f"Line item {i}: {qty} x {price} = {expected}, "
                    f"but reported total is {total}"
                )

    # --- Subtotal check ---
    if invoice.line_items and invoice.subtotal:
        calc_subtotal = round(sum(it.get("total", 0) for it in invoice.line_items), 2)
        if abs(calc_subtotal - invoice.subtotal) > 0.02:
            warnings.append(
                f"Subtotal mismatch: line items sum to {calc_subtotal}, "
                f"reported {invoice.subtotal}"
            )

    # --- Tax check ---
    if invoice.tax_rate and invoice.subtotal:
        expected_tax = round(invoice.subtotal * invoice.tax_rate / 100, 2)
        if abs(expected_tax - invoice.tax_amount) > 0.02:
            warnings.append(
                f"Tax mismatch: {invoice.subtotal} x {invoice.tax_rate}% "
                f"= {expected_tax}, reported {invoice.tax_amount}"
            )

    # --- Grand total check ---
    if invoice.subtotal and invoice.tax_amount:
        expected_total = round(invoice.subtotal + invoice.tax_amount, 2)
        if abs(expected_total - invoice.total) > 0.02:
            warnings.append(
                f"Total mismatch: {invoice.subtotal} + {invoice.tax_amount} "
                f"= {expected_total}, reported {invoice.total}"
            )

    # --- Date format check ---
    iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    for field_name in ["invoice_date", "due_date"]:
        value = getattr(invoice, field_name)
        if value and not iso_date.match(value):
            warnings.append(f"{field_name} is not ISO-8601: '{value}'")

    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
    )

Usage:

result = validate_invoice(invoice)

if result.is_valid:
    print("Invoice data is valid")
else:
    for err in result.errors:
        print(f" ERROR: {err}")

for w in result.warnings:
    print(f" WARNING: {w}")
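
The tax and grand-total checks above use float arithmetic with a 0.02 tolerance. As a worked illustration of the same arithmetic, here is a minimal sketch using decimal.Decimal, which avoids binary rounding surprises when amounts arrive as strings. The function names and the amounts in the usage lines are hypothetical and not taken from any invoice in this cookbook.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def expected_tax(subtotal: str, tax_rate_pct: str) -> Decimal:
    """subtotal * rate / 100, rounded half-up to whole cents."""
    raw = Decimal(subtotal) * Decimal(tax_rate_pct) / Decimal("100")
    return raw.quantize(CENT, rounding=ROUND_HALF_UP)


def totals_consistent(
    subtotal: str,
    tax_rate_pct: str,
    tax_amount: str,
    total: str,
    tolerance: str = "0.02",
) -> bool:
    """True when both the tax and the grand total are within the tolerance."""
    tol = Decimal(tolerance)
    tax_ok = abs(expected_tax(subtotal, tax_rate_pct) - Decimal(tax_amount)) <= tol
    total_ok = abs(Decimal(subtotal) + Decimal(tax_amount) - Decimal(total)) <= tol
    return tax_ok and total_ok


# Hypothetical amounts: 1200.00 x 7.25% = 87.00, and 1200.00 + 87.00 = 1287.00
print(expected_tax("1200.00", "7.25"))
print(totals_consistent("1200.00", "7.25", "87.00", "1287.00"))
```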

Handle Different Invoice Formats

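Different industries and regions use different fields. Adjust the entity label set accordingly. Note that assemble_invoice only recognizes the labels in ALL_LABELS, so the label sets below return raw entities that need their own field mapping before assembly.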

European Invoice (VAT)

EU_LABELS = [
"invoice_number", "invoice_date", "due_date",
"supplier_name", "supplier_address", "supplier_vat_number",
"customer_name", "customer_address", "customer_vat_number",
"line_item_description", "quantity", "unit_price_net",
"vat_rate", "line_total_net", "line_total_gross",
"total_net", "total_vat", "total_gross",
"currency", "iban", "bic", "payment_reference",
]

eu_invoice_text = """
RECHNUNG Nr. RE-2024-4781
Rechnungsdatum: 12. März 2024
Fälligkeitsdatum: 11. April 2024

Lieferant: Müller Maschinenbau GmbH
Industriestraße 42, 70565 Stuttgart, Deutschland
USt-IdNr.: DE298471562

Kunde: Dupont Électronique SA
14 Rue de la Paix, 75002 Paris, France
TVA: FR83429617503

Pos Beschreibung Menge Einzelpreis netto MwSt Gesamt brutto
1 CNC-Fräskopf Typ B400 2 €3.250,00 19% €7.735,00
2 Wartungskit Premium 5 €189,00 19% €1.124,55
3 Express-Lieferung 1 €420,00 19% €499,80

Nettobetrag: €7.865,00
MwSt (19%): €1.494,35
Bruttobetrag: €9.359,35

Bankverbindung: IBAN DE89 3704 0044 0532 0130 00 | BIC COBADEFFXXX
Zahlungsreferenz: RE-2024-4781-DUPONT
"""

eu_entities = model.predict_entities(eu_invoice_text, EU_LABELS, threshold=0.4)

Medical / Healthcare Invoice

MEDICAL_LABELS = [
"invoice_number", "service_date", "invoice_date",
"provider_name", "provider_npi", "provider_tax_id",
"patient_name", "patient_id", "date_of_birth",
"payer_name", "policy_number", "group_number",
"cpt_code", "service_description", "icd10_code",
"charge_amount", "allowed_amount", "adjustment",
"patient_responsibility",
"total_charges", "insurance_payment", "patient_balance",
]

medical_invoice_text = """
STATEMENT OF SERVICES
Invoice #: MED-2024-09831
Date of Service: February 5, 2024
Invoice Date: February 12, 2024

Provider: Riverside Family Medicine Associates
NPI: 1234567890 Tax ID: 45-6789012
200 Wellness Drive, Suite 310, Portland, OR 97201

Patient: Margaret J. Thompson
Patient ID: PT-884201 DOB: 03/15/1958

Insurance: Blue Cross Blue Shield of Oregon
Policy #: BCB-9920148763 Group #: GRP-44210

Code Description ICD-10 Charge Allowed Adjust
99214 Office visit, est. patient, 25 min Z00.00 $225.00 $187.50 $37.50
85025 Complete blood count (CBC) R79.89 $45.00 $38.20 $6.80
80053 Comprehensive metabolic panel R79.89 $95.00 $78.40 $16.60
36415 Venipuncture for lab draw Z00.00 $25.00 $21.00 $4.00

Total Charges: $390.00
Insurance Payment: $273.10
Patient Responsibility: $52.00
Patient Balance: $52.00
"""

medical_entities = model.predict_entities(
medical_invoice_text, MEDICAL_LABELS, threshold=0.35
)

Contractor / Construction Invoice

CONTRACTOR_LABELS = [
"invoice_number", "invoice_date",
"project_name", "contract_number",
"contractor_name", "contractor_license", "contractor_tax_id",
"client_name", "client_address",
"billing_period_start", "billing_period_end",
"labor_description", "labor_hours", "labor_rate", "labor_total",
"material_description", "material_quantity", "material_cost", "material_total",
"subtotal", "retention_percent", "retention_amount",
"previous_payments", "amount_due",
]

contractor_invoice_text = """
PROGRESS BILLING INVOICE
Invoice #: CBL-2024-0047
Date: March 1, 2024

Project: Oakwood Heights Residential Renovation
Contract #: CTR-2023-1182

Contractor: Summit Builders LLC
License #: CA-B-0987654 Tax ID: 82-3456789
8500 Redwood Highway, Mill Valley, CA 94941

Client: David & Sarah Chen
742 Hillcrest Avenue, San Rafael, CA 94901

Billing Period: February 1, 2024 -- February 29, 2024

LABOR
Description Hours Rate Total
Framing carpenter 120 $65.00 $7,800.00
Electrician (rough-in) 40 $85.00 $3,400.00
Plumber (rough-in) 32 $90.00 $2,880.00
General laborer 80 $42.00 $3,360.00

MATERIALS
Description Qty Cost Total
Douglas fir framing lumber 4,200 bf $1.15/bf $4,830.00
Romex 12/2 wire 2,500 ft $0.58/ft $1,450.00
PEX tubing 3/4" 800 ft $0.92/ft $736.00
Simpson Strong-Tie connectors 48 $12.50 $600.00

Subtotal: $25,056.00
Retention (10%): -$2,505.60
Previous Payments: $38,400.00
Amount Due This Period: $22,550.40
"""

contractor_entities = model.predict_entities(
contractor_invoice_text, CONTRACTOR_LABELS, threshold=0.4
)
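
The contractor label set has two separate tables (labor and materials), which the _group_line_items helper does not cover because it only knows the line_item_* labels. One possible generalization, sketched below under that assumption, is to pass in the label that opens a row and the labels that belong to it. The function name group_rows is hypothetical, and the entities in the usage lines are hand-written for illustration, not real model output.

```python
from typing import Dict, List, Optional, Sequence


def group_rows(
    entities: List[Dict],
    start_label: str,
    field_labels: Sequence[str],
) -> List[Dict[str, str]]:
    """Group entities into rows; each `start_label` entity opens a new row."""
    rows: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None

    for ent in sorted(entities, key=lambda e: e["start"]):
        label = ent["label"]
        if label == start_label:
            if current is not None:
                rows.append(current)
            current = {start_label: ent["text"]}
        elif label in field_labels and current is not None:
            # Keep the first value seen for each field in the row.
            current.setdefault(label, ent["text"])

    if current is not None:
        rows.append(current)
    return rows


# Hand-written entities for illustration only
sample = [
    {"label": "labor_description", "text": "Framing carpenter", "start": 0},
    {"label": "labor_hours", "text": "120", "start": 20},
    {"label": "labor_rate", "text": "$65.00", "start": 25},
    {"label": "labor_total", "text": "$7,800.00", "start": 33},
    {"label": "material_description", "text": 'PEX tubing 3/4"', "start": 50},
    {"label": "material_total", "text": "$736.00", "start": 80},
]
labor_rows = group_rows(
    sample, "labor_description", ["labor_hours", "labor_rate", "labor_total"]
)
material_rows = group_rows(
    sample,
    "material_description",
    ["material_quantity", "material_cost", "material_total"],
)
```

Like the original helper, this relies on document order, so it shares the same limitation on layouts where values precede their description.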

Export to Accounting Systems

Transform the structured invoice data into formats expected by common accounting platforms:

import json
from typing import List


class InvoiceExporter:
    """Export structured invoice data to accounting system formats."""

    @staticmethod
    def to_quickbooks(invoice: dict) -> dict:
        """Convert to QuickBooks Online API format."""
        return {
            "DocNumber": invoice.get("invoice_number"),
            "TxnDate": invoice.get("invoice_date"),
            "DueDate": invoice.get("due_date"),
            "CustomerRef": {
                "name": invoice.get("customer", {}).get("name"),
            },
            "Line": [
                {
                    "DetailType": "SalesItemLineDetail",
                    "Amount": item.get("total", 0),
                    "SalesItemLineDetail": {
                        "ItemRef": {"name": item.get("description")},
                        "Qty": item.get("quantity", 1),
                        "UnitPrice": item.get("unit_price", 0),
                    },
                }
                for item in invoice.get("line_items", [])
            ],
            "TotalAmt": invoice.get("total"),
        }

    @staticmethod
    def to_xero(invoice: dict) -> dict:
        """Convert to Xero API format."""
        return {
            "Type": "ACCPAY",
            "InvoiceNumber": invoice.get("invoice_number"),
            "Reference": invoice.get("invoice_number"),
            "Date": invoice.get("invoice_date"),
            "DueDate": invoice.get("due_date"),
            "Contact": {
                "Name": invoice.get("vendor", {}).get("name"),
            },
            "LineItems": [
                {
                    "Description": item.get("description"),
                    "Quantity": item.get("quantity", 1),
                    "UnitAmount": item.get("unit_price", 0),
                    "LineAmount": item.get("total", 0),
                    # ACCPAY is a bill, so use a purchase tax type rather
                    # than OUTPUT; exact codes vary by Xero region.
                    "TaxType": "INPUT",
                }
                for item in invoice.get("line_items", [])
            ],
            # to_dict() always sets "currency" (possibly to ""), so a
            # .get() default would never apply; fall back explicitly.
            "CurrencyCode": invoice.get("currency") or "USD",
            "Status": "DRAFT",
        }

    @staticmethod
    def to_csv_row(invoice: dict) -> dict:
        """Flatten invoice into a single CSV-friendly row."""
        vendor = invoice.get("vendor", {})
        return {
            "invoice_number": invoice.get("invoice_number"),
            "invoice_date": invoice.get("invoice_date"),
            "due_date": invoice.get("due_date"),
            "vendor_name": vendor.get("name"),
            "vendor_address": vendor.get("address"),
            "vendor_tax_id": vendor.get("tax_id"),
            "subtotal": invoice.get("subtotal"),
            "tax_rate": invoice.get("tax_rate"),
            "tax_amount": invoice.get("tax_amount"),
            "total": invoice.get("total"),
            "currency": invoice.get("currency"),
            "payment_terms": invoice.get("payment_terms"),
            "line_item_count": len(invoice.get("line_items", [])),
        }

    @staticmethod
    def to_sage(invoice: dict) -> dict:
        """Convert to Sage 50 import format."""
        return {
            "VendorId": "",  # Map to your vendor master
            "InvoiceNo": invoice.get("invoice_number"),
            "InvoiceDate": invoice.get("invoice_date"),
            "DueDate": invoice.get("due_date"),
            "TermsCode": invoice.get("payment_terms"),
            "DistributionLines": [
                {
                    "GLAccount": "",  # Map to your chart of accounts
                    "Description": item.get("description"),
                    "Amount": item.get("total", 0),
                }
                for item in invoice.get("line_items", [])
            ],
            "TaxAmount": invoice.get("tax_amount", 0),
            "InvoiceTotal": invoice.get("total"),
        }
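
to_csv_row returns a flat dict per invoice but does not write a file. A minimal sketch of turning such rows into CSV text with the standard library's csv.DictWriter follows; the helper name rows_to_csv is hypothetical, and it assumes every row has the same keys as the first one.

```python
import csv
import io
from typing import Dict, Iterable


def rows_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize flat dict rows to CSV text, using the first row's keys as the header."""
    rows = list(rows)
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)  # raises ValueError if a row has unexpected keys
    return buffer.getvalue()


# Hypothetical rows shaped like the output of to_csv_row
csv_text = rows_to_csv([
    {"invoice_number": "A-1", "vendor_name": "Acme, Inc.", "total": 10.5},
])
```

The csv module quotes values that contain commas, which matters for vendor names and addresses. To write a file instead, pass an open file handle (opened with newline="") to DictWriter in place of the StringIO buffer.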

Full Invoice Processing Pipeline

Tie everything together into a single pipeline that extracts, assembles, validates, and exports:

from dataclasses import dataclass, field
from typing import List, Dict, Optional
from enum import Enum
from gliner import GLiNER
import json


class InvoiceStatus(Enum):
    PENDING = "pending"
    EXTRACTED = "extracted"
    VALIDATED = "validated"
    EXPORTED = "exported"
    FAILED = "failed"


@dataclass
class ProcessedInvoice:
    """Result of processing a single invoice."""
    source: str
    status: InvoiceStatus
    data: Optional[dict] = None
    validation: Optional[ValidationResult] = None
    exports: Dict[str, dict] = field(default_factory=dict)
    error: Optional[str] = None


class InvoicePipeline:
    """
    End-to-end invoice processing pipeline.

    Loads the GLiNER model once and reuses it for every invoice.
    """

    def __init__(
        self,
        model_name: str = "knowledgator/gliner-multitask-large-v0.5",
        labels: List[str] | None = None,
        threshold: float = 0.4,
        strict: bool = False,
    ):
        self.model = GLiNER.from_pretrained(model_name)
        self.labels = labels or ALL_LABELS
        self.threshold = threshold
        self.strict = strict
        self.exporter = InvoiceExporter()
        self.results: List[ProcessedInvoice] = []

    def process_text(
        self,
        text: str,
        source: str = "<inline>",
        export_formats: List[str] | None = None,
    ) -> ProcessedInvoice:
        """
        Run a single invoice text through the full pipeline.

        Args:
            text: Raw invoice text.
            source: Label for this invoice (file path, ID, etc.).
            export_formats: List of formats to export
                ("quickbooks", "xero", "csv", "sage").

        Returns:
            ProcessedInvoice with extraction, validation, and export results.
        """
        result = ProcessedInvoice(source=source, status=InvoiceStatus.PENDING)

        try:
            # 1. Extract entities
            entities = extract_invoice_entities(
                text, self.model, self.labels, self.threshold
            )

            # 2. Assemble into structured format
            invoice = assemble_invoice(entities)
            result.data = invoice.to_dict()
            result.status = InvoiceStatus.EXTRACTED

            # 3. Validate
            validation = validate_invoice(invoice)
            result.validation = validation

            if not validation.is_valid and self.strict:
                result.status = InvoiceStatus.FAILED
                result.error = "; ".join(validation.errors)
                self.results.append(result)
                return result

            result.status = InvoiceStatus.VALIDATED

            # 4. Export
            if export_formats:
                export_map = {
                    "quickbooks": self.exporter.to_quickbooks,
                    "xero": self.exporter.to_xero,
                    "csv": self.exporter.to_csv_row,
                    "sage": self.exporter.to_sage,
                }
                for fmt in export_formats:
                    fn = export_map.get(fmt)
                    if fn:
                        result.exports[fmt] = fn(result.data)

                result.status = InvoiceStatus.EXPORTED

        except Exception as exc:
            result.status = InvoiceStatus.FAILED
            result.error = str(exc)

        self.results.append(result)
        return result

    def process_file(
        self,
        file_path: str,
        export_formats: List[str] | None = None,
    ) -> ProcessedInvoice:
        """Read a text file and process it."""
        # Explicit encoding so non-ASCII invoices (e.g. German or French)
        # read the same way on every platform.
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
        return self.process_text(text, source=file_path, export_formats=export_formats)

    def process_batch(
        self,
        texts: Dict[str, str],
        export_formats: List[str] | None = None,
    ) -> List[ProcessedInvoice]:
        """
        Process multiple invoices.

        Args:
            texts: Mapping of source label to invoice text.
            export_formats: Export formats to generate.

        Returns:
            List of ProcessedInvoice results.
        """
        return [
            self.process_text(text, source=src, export_formats=export_formats)
            for src, text in texts.items()
        ]

    def summary(self) -> dict:
        """Return aggregate statistics for all processed invoices."""
        total = len(self.results)
        if total == 0:
            return {"total": 0}

        status_counts: Dict[str, int] = {}
        for r in self.results:
            key = r.status.value
            status_counts[key] = status_counts.get(key, 0) + 1

        total_amount = sum(
            r.data.get("total", 0) for r in self.results if r.data
        )
        warning_count = sum(
            len(r.validation.warnings) for r in self.results if r.validation
        )

        return {
            "total": total,
            "status_breakdown": status_counts,
            "total_invoice_amount": total_amount,
            "total_warnings": warning_count,
            "success_rate_pct": round(
                status_counts.get("exported", 0) / total * 100, 1
            ),
        }

Usage:

pipeline = InvoicePipeline(strict=False)

# Single invoice
result = pipeline.process_text(
invoice_text,
source="inv_001.txt",
export_formats=["quickbooks", "xero", "csv"],
)

print(f"Status : {result.status.value}")
print(f"Total : {result.data['total']}")
print(f"Errors : {result.validation.errors}")
print(f"Warnings: {result.validation.warnings}")
print(json.dumps(result.exports["quickbooks"], indent=2))

# Batch
batch_texts = {
"inv_001": eu_invoice_text,
"inv_002": medical_invoice_text,
"inv_003": contractor_invoice_text,
}

results = pipeline.process_batch(batch_texts, export_formats=["quickbooks"])

print(pipeline.summary())

Best Practices

Use Specific Entity Labels

Generic labels like "name" or "amount" will match too broadly. Prefer specific labels that encode their role in the invoice:

# Too generic -- will match many spans
labels = ["name", "date", "amount"]

# Specific -- targets exactly the fields you need
labels = ["vendor_name", "invoice_date", "line_total"]

Preprocess OCR Text

Clean common OCR errors before running extraction:

def clean_ocr_text(text: str) -> str:
    """Fix frequent OCR misreads in invoices."""
    fixes = {
        "lnvoice": "Invoice",
        "0ate": "Date",
        "T0TAL": "TOTAL",
        "Qly": "Quantity",
    }
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text
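
Plain str.replace also rewrites matches that sit inside longer tokens, such as a reference number that happens to contain "0ate". A possible refinement, sketched here as an assumption rather than a required step, is to match whole tokens only with a single compiled regular expression. The names OCR_FIXES and clean_ocr_tokens are hypothetical.

```python
import re

# Hypothetical fix table: whole-token OCR misreads and their corrections.
OCR_FIXES = {
    "lnvoice": "Invoice",
    "0ate": "Date",
    "T0TAL": "TOTAL",
    "Qly": "Qty",
}

# One alternation wrapped in word boundaries, so only standalone tokens match.
_OCR_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(wrong) for wrong in OCR_FIXES) + r")\b"
)


def clean_ocr_tokens(text: str) -> str:
    """Replace known OCR misreads only where they appear as whole tokens."""
    return _OCR_PATTERN.sub(lambda m: OCR_FIXES[m.group(1)], text)


print(clean_ocr_tokens("lnvoice 0ate: 2024-01-15  T0TAL: $10.00  Ref: X20ate9"))
```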

Normalize Extracted Values

from datetime import datetime


def normalize_date(raw: str) -> str:
    """Attempt to parse a date string into ISO-8601 format."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def normalize_currency(symbol: str) -> str:
    """Map currency symbols to ISO 4217 codes."""
    # Unknown input (e.g. an existing code such as "chf") is upper-cased as-is.
    return {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY"}.get(
        symbol.strip(), symbol.strip().upper()
    )
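
normalize_date tries its formats in order, so a value such as 03/04/2024 is always read as month/day/year even though day/month/year would also parse. A small sketch for surfacing that ambiguity is shown below: it collects every distinct interpretation so that callers can flag values with more than one candidate for review. The helper name candidate_dates is hypothetical, and only numeric formats are used in the example calls.

```python
from datetime import datetime
from typing import List

DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")


def candidate_dates(raw: str) -> List[str]:
    """Return every distinct ISO-8601 reading of `raw` under the known formats."""
    seen: List[str] = []
    for fmt in DATE_FORMATS:
        try:
            iso = datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
        if iso not in seen:
            seen.append(iso)
    return seen


print(candidate_dates("03/04/2024"))  # two readings: ambiguous
print(candidate_dates("15/03/2024"))  # one reading: only day-first is valid
```

A value with exactly one candidate can be normalized automatically; anything else is a reasonable trigger for a validation warning.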

Tune the Confidence Threshold

  • Start with threshold=0.4 for broad recall.
  • Raise to 0.5--0.6 if you see too many false positives.
  • Lower to 0.3 if important fields are being missed.

You can also use per-field thresholds by running predict_entities multiple times with different label subsets and thresholds.
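
A minimal sketch of that multi-pass idea, assuming only the predict_entities(text, labels, threshold=...) call used throughout this cookbook. The function name predict_with_thresholds is hypothetical, and StubModel is a stand-in with canned results so the example runs without downloading weights; it is not real model output.

```python
from typing import Dict, List


def predict_with_thresholds(
    model,
    text: str,
    thresholds: Dict[float, List[str]],
) -> List[Dict]:
    """Run one predict_entities pass per threshold and merge the results."""
    entities: List[Dict] = []
    for threshold, labels in thresholds.items():
        entities.extend(model.predict_entities(text, labels, threshold=threshold))
    entities.sort(key=lambda e: e["start"])
    return entities


class StubModel:
    """Stand-in for a loaded GLiNER model, returning canned candidates."""

    def predict_entities(self, text, labels, threshold=0.5):
        candidates = [
            {"text": "INV-1", "label": "invoice_number", "start": 0, "end": 5, "score": 0.90},
            {"text": "Acme", "label": "vendor_name", "start": 10, "end": 14, "score": 0.35},
        ]
        return [
            c for c in candidates
            if c["label"] in labels and c["score"] >= threshold
        ]


merged = predict_with_thresholds(
    StubModel(),
    "INV-1     Acme",
    {0.6: ["invoice_number"], 0.3: ["vendor_name"]},
)
```

Each pass is a separate inference call, so cost grows with the number of threshold groups, and spans from different passes can overlap because the model never sees the competing labels together.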

Limitations and Considerations

  1. Text input only. GLiNER operates on text. For scanned or image-based PDFs, run OCR first (e.g., Tesseract, EasyOCR, or a cloud OCR service) and feed the resulting text to the pipeline.

  2. Line-item grouping is heuristic. The _group_line_items function assumes descriptions appear before their quantities and prices. Invoices with unusual layouts may need custom grouping logic.

  3. No layout awareness. GLiNER processes flat text without spatial information. If two columns contain different data at the same vertical position, they may be confused after text extraction.

  4. Model size. The gliner-multitask-large-v0.5 model is approximately 1.5 GB. For constrained environments, consider a smaller GLiNER variant.

  5. Handwritten content. Handwritten notes or signatures will not be recognized. Flag invoices with significant handwritten sections for manual review.

  6. Currency and locale. Number formats differ across locales (e.g., 1.000,50 vs 1,000.50). Adjust _parse_number for your target locales.
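
For the locale point in item 6, one possible adjustment is to tell the parser which character is the decimal separator. The sketch below is an assumption about how that could look, not a drop-in replacement shipped with this cookbook; the name parse_amount and its decimal_sep parameter are hypothetical. Like _parse_number, it returns 0.0 when the text cannot be parsed.

```python
import re


def parse_amount(text: str, decimal_sep: str = ".") -> float:
    """Parse '1,029.56' (decimal_sep='.') or '1.029,56' (decimal_sep=',')."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."

    # Drop currency symbols, percent signs, spaces, and other non-numeric text.
    cleaned = re.sub(r"[^\d.,\-]", "", text)
    # Remove grouping separators, then convert the decimal separator to '.'.
    cleaned = cleaned.replace(group_sep, "").replace(decimal_sep, ".")
    try:
        return float(cleaned)
    except ValueError:
        return 0.0


print(parse_amount("$1,029.56"))
print(parse_amount("€3.250,00", decimal_sep=","))
```

The separator has to come from somewhere outside the number itself, for example the invoice's detected currency or the vendor's country, because a value like 1.000 is valid in both conventions.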

Next Steps