# Usage
GLiNKER offers three ways to create an entity linking pipeline, from simplest to most configurable.
## Creating Pipelines

### Option 1: `create_simple` (Recommended Start)

`ProcessorFactory.create_simple` builds a pipeline in one call. Without an L1 NER step, the model links entities directly from the input text against all loaded entities.
```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
)
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy."]})
```
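Since `texts` is a list, several documents can be linked in one call:

```python
result = executor.execute({
    "texts": [
        "CRISPR-Cas9 enables precise gene therapy.",
        "Cas9 nickase variants reduce off-target edits.",
    ]
})
```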
With inline entities (no file needed):
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    entities=[
        {"entity_id": "Q101", "label": "insulin", "description": "Peptide hormone regulating blood glucose"},
        {"entity_id": "Q102", "label": "glucose", "description": "Primary blood sugar and key metabolic fuel"},
        {"entity_id": "Q103", "label": "GLUT4", "description": "Insulin-responsive glucose transporter in muscle and adipose tissue"},
        {"entity_id": "Q104", "label": "pancreatic beta cell", "description": "Endocrine cell type that secretes insulin"},
    ],
)
result = executor.execute({
    "texts": [
        "After a meal, pancreatic beta cells release insulin, which promotes GLUT4 translocation and increases glucose uptake in muscle."
    ]
})
```
With a reranker (L2 → L3 → L4 → L0):
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
    reranker_threshold=0.3,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
```
With entity descriptions in the template:
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    template="{label}: {description}",  # L3 sees "BRCA1: Breast cancer type 1 susceptibility protein"
    entities="data/entities.jsonl",
)
```
#### All `create_simple` Parameters
| Parameter | Default | Description |
|---|---|---|
| `model_name` | (required) | HuggingFace model ID or local path |
| `device` | `"cpu"` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`) |
| `threshold` | `0.5` | Minimum score for entity predictions |
| `template` | `"{label}"` | Format string for entity labels (e.g. `"{label}: {description}"`) |
| `max_length` | `512` | Max sequence length for tokenization |
| `token` | `None` | HuggingFace auth token for gated models |
| `entities` | `None` | Entity data to load immediately (file path, list of dicts, or dict of dicts) |
| `precompute_embeddings` | `False` | Pre-embed all entity labels after loading (BiEncoder only) |
| `verbose` | `False` | Enable verbose logging |
| `reranker_model` | `None` | GLiNER model for L4 reranking (adds an L4 node when set) |
| `reranker_max_labels` | `20` | Max candidate labels per L4 inference call |
| `reranker_threshold` | `None` | Score threshold for L4 (defaults to `threshold`) |
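For reference, a fuller call combining several of these parameters (the values are illustrative):

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    device="cuda:0",
    threshold=0.4,
    template="{label}: {description}",
    max_length=512,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
    verbose=True,
)
```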
### Option 2: From a YAML Config File
For full control over every layer, define the pipeline in YAML and load it:
```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("configs/pipelines/dict/simple.yaml")
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["TP53 mutations cause cancer"]})
```
See the Configuration guide for full YAML config examples.
### Option 3: `ConfigBuilder` (Programmatic)
Build configs in Python with full control over each layer:
```python
from glinker import ConfigBuilder, DAGExecutor

builder = ConfigBuilder(name="my_pipeline")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

executor = DAGExecutor(builder.get_config())
executor.load_entities("data/entities.jsonl", target_layers=["dict"])
```
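The resulting executor is used exactly like one from `create_simple`:

```python
result = executor.execute({"texts": ["BRCA1 variants raise breast cancer risk."]})
```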
With multiple database layers:
builder = ConfigBuilder(name="production")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein"])
builder.l2.add("redis", priority=2, ttl=3600)
builder.l2.add("elasticsearch", priority=1, ttl=86400)
builder.l2.add("postgres", priority=0)
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0", use_precomputed_embeddings=True)
builder.l0.configure(strict_matching=True, min_confidence=0.3)
builder.save("config.yaml")
With L4 reranker:
builder = ConfigBuilder(name="reranked")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-base-v1.0")
builder.l4.configure(
model="knowledgator/gliner-linker-rerank-v1.0",
threshold=0.3,
max_labels=20,
)
builder.save("config.yaml") # Generates L1 → L2 → L3 → L4 → L0
## Loading Entities

Entities can be loaded after pipeline creation via `executor.load_entities()`, or passed directly to `create_simple(entities=...)`. Three input formats are supported.

### From a JSONL File
One JSON object per line:
executor.load_entities("data/entities.jsonl")
# Or target specific database layers
executor.load_entities("data/entities.jsonl", target_layers=["dict", "postgres"])
`data/entities.jsonl`:

```jsonl
{"entity_id": "Q123", "label": "Kyiv", "description": "Capital and largest city of Ukraine", "entity_type": "city", "popularity": 1000000, "aliases": ["Kiev"]}
{"entity_id": "Q456", "label": "Dnipro River", "description": "Major river flowing through Ukraine and Belarus", "entity_type": "river", "popularity": 950000, "aliases": ["Dnieper"]}
{"entity_id": "Q789", "label": "Carpathian Mountains", "description": "Mountain range in Central and Eastern Europe", "entity_type": "mountain_range", "popularity": 800000, "aliases": ["Carpathians"]}
```
### From a Python List
```python
entities = [
    {
        "entity_id": "Q123",
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
        "aliases": ["Kiev"],
    },
    {
        "entity_id": "Q456",
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
        "aliases": ["Dnieper"],
    },
]
executor.load_entities(entities)
```
### From a Python Dict
Keys are entity IDs, values are entity data:
```python
entities = {
    "Q123": {
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
    },
    "Q456": {
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
    },
}
executor.load_entities(entities)
```
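Whichever format you use, `target_layers` restricts which database layers receive the data, as shown with file input above (assuming the keyword behaves the same for in-memory input):

```python
# Assumption: target_layers also applies to list/dict input
executor.load_entities(entities, target_layers=["dict"])
```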
### Entity Format Reference
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `entity_id` | `str` | yes | — | Unique identifier |
| `label` | `str` | yes | — | Primary name |
| `description` | `str` | no | `""` | Text description (used in templates like `"{label}: {description}"`) |
| `entity_type` | `str` | no | `""` | Category (e.g. `"gene"`, `"disease"`) |
| `aliases` | `list[str]` | no | `[]` | Alternative names for search matching |
| `popularity` | `int` | no | `0` | Ranking score for candidate ordering |
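Only `entity_id` and `label` are required; omitted fields fall back to the defaults above:

```python
executor.load_entities([
    {"entity_id": "Q1", "label": "aspirin"},  # description, entity_type, aliases, popularity use defaults
])
```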
## Advanced Features

### Precomputed Embeddings (BiEncoder)
For BiEncoder models, precomputing label embeddings gives 10-100x speedups:
```python
# Load entities, then precompute
executor.load_entities("data/entities.jsonl")
executor.precompute_embeddings(batch_size=64)

# Or do both in create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
```
Precomputed embeddings are especially beneficial when processing large document collections against the same entity set: compute once, reuse across millions of documents.
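A typical pattern: build the executor once with precomputed embeddings, then reuse it for every batch. Here `document_batches` stands in for your own iterable of `list[str]`:

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    entities="data/entities.jsonl",
    precompute_embeddings=True,  # label embeddings computed once, up front
)

for batch in document_batches:  # your own iterable of list[str]
    result = executor.execute({"texts": batch})
```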
### On-the-Fly Embedding Caching
Instead of precomputing all embeddings upfront, cache them as they are computed during inference:
```python
builder.l3.configure(
    model="knowledgator/gliner-linker-large-v1.0",
    cache_embeddings=True,
)
```
### L4 Reranker
When the candidate set from L2 is large (tens or hundreds of entities), the L4 reranker splits candidates into chunks for efficient processing:
```python
# Via ConfigBuilder
builder.l4.configure(
    model="knowledgator/gliner-multitask-large-v0.5",
    threshold=0.3,
    max_labels=20,  # candidates per inference call
)

# Via create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
)
```
### Custom Pipelines

L1 processors can also be pulled from the processor registry and composed with a custom post-processing pipeline:

```python
# processor_registry is GLiNKER's processor registry; import it from
# wherever your installation exposes it.
l1_processor = processor_registry.get("l1_spacy")(
    config_dict={"model": "en_core_sci_sm"},
    pipeline=[
        ("extract_entities", {}),
        ("filter_by_length", {"min_length": 3}),
        ("deduplicate", {}),
        ("sort_by_position", {}),
    ],
)
```
## Database Setup

### Quick Start (Docker)
```bash
# Start all databases
cd scripts/database
docker-compose up -d

# Load entities into the running databases
bash setup_all.sh
```
### Manual Setup
```python
from glinker import DAGExecutor

# `pipeline` here is a pipeline config, e.g. builder.get_config()
# or a config loaded from YAML (see Creating Pipelines above)
executor = DAGExecutor(pipeline)
executor.load_entities(
    filepath="data/entities.jsonl",
    target_layers=["redis", "elasticsearch", "postgres"],
    batch_size=1000,
)
```
For production deployments with large entity databases (100K+ entities), Elasticsearch or PostgreSQL backends are recommended over the in-memory dictionary backend.