# Usage
GLiNKER offers three ways to create an entity linking pipeline, from simplest to most configurable.
## Creating Pipelines

### Option 1: `create_simple` (Recommended Start)

`ProcessorFactory.create_simple` builds a pipeline in one call. Without an L1 NER step, the model links entities directly from the input text against all loaded entities.
```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
)
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy."]})
```
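Since `texts` is a list, several documents can be linked in one call:

```python
result = executor.execute({
    "texts": [
        "CRISPR-Cas9 enables precise gene therapy.",
        "Cas9 nickase variants reduce off-target edits.",
    ]
})
```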
With inline entities (no file needed):
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    entities=[
        {"entity_id": "Q101", "label": "insulin", "description": "Peptide hormone regulating blood glucose"},
        {"entity_id": "Q102", "label": "glucose", "description": "Primary blood sugar and key metabolic fuel"},
        {"entity_id": "Q103", "label": "GLUT4", "description": "Insulin-responsive glucose transporter in muscle and adipose tissue"},
        {"entity_id": "Q104", "label": "pancreatic beta cell", "description": "Endocrine cell type that secretes insulin"},
    ],
)
result = executor.execute({
    "texts": [
        "After a meal, pancreatic beta cells release insulin, which promotes GLUT4 translocation and increases glucose uptake in muscle."
    ]
})
```
With a reranker (L2 → L3 → L4 → L0):
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
    reranker_threshold=0.3,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
```
With entity descriptions in the template:
```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    template="{label}: {description}",  # L3 sees "BRCA1: Breast cancer type 1 susceptibility protein"
    entities="data/entities.jsonl",
)
```
#### All `create_simple` Parameters
| Parameter | Default | Description |
|---|---|---|
| `model_name` | (required) | HuggingFace model ID or local path |
| `device` | `"cpu"` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`) |
| `threshold` | `0.5` | Minimum score for entity predictions |
| `template` | `"{label}"` | Format string for entity labels (e.g. `"{label}: {description}"`) |
| `max_length` | `512` | Max sequence length for tokenization |
| `token` | `None` | HuggingFace auth token for gated models |
| `entities` | `None` | Entity data to load immediately (file path, list of dicts, or dict of dicts) |
| `precompute_embeddings` | `False` | Pre-embed all entity labels after loading (BiEncoder only) |
| `verbose` | `False` | Enable verbose logging |
| `reranker_model` | `None` | GLiNER model for L4 reranking (adds an L4 node when set) |
| `reranker_max_labels` | `20` | Max candidate labels per L4 inference call |
| `reranker_threshold` | `None` | Score threshold for L4 (defaults to `threshold`) |
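For reference, a fuller call combining several of these parameters (the values are illustrative):

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    device="cuda:0",
    threshold=0.4,
    template="{label}: {description}",
    max_length=512,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
    verbose=True,
)
```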
### Option 2: From a YAML Config File
For full control over every layer, define the pipeline in YAML and load it:
```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("configs/pipelines/dict/simple.yaml")
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["TP53 mutations cause cancer"]})
```
See the Configuration guide for full YAML config examples.
### Option 3: `ConfigBuilder` (Programmatic)
Build configs in Python with full control over each layer:
```python
from glinker import ConfigBuilder, DAGExecutor

builder = ConfigBuilder(name="my_pipeline")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

executor = DAGExecutor(builder.get_config())
executor.load_entities("data/entities.jsonl", target_layers=["dict"])
```
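The resulting executor is used exactly like one from `create_simple`:

```python
result = executor.execute({"texts": ["BRCA1 variants raise breast cancer risk."]})
```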
With multiple database layers:
builder = ConfigBuilder(name="production")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein"])
builder.l2.add("redis", priority=2, ttl=3600)
builder.l2.add("elasticsearch", priority=1, ttl=86400)
builder.l2.add("postgres", priority=0)
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0", use_precomputed_embeddings=True)
builder.l0.configure(strict_matching=True, min_confidence=0.3)
builder.save("config.yaml")
With L4 reranker:
builder = ConfigBuilder(name="reranked")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-base-v1.0")
builder.l4.configure(
model="knowledgator/gliner-linker-rerank-v1.0",
threshold=0.3,
max_labels=20,
)
builder.save("config.yaml") # Generates L1 → L2 → L3 → L4 → L0
## Loading Entities

Entities can be loaded after pipeline creation via `executor.load_entities()`, or passed directly to `create_simple(entities=...)`. Three input formats are supported.

### From a JSONL File
One JSON object per line:
executor.load_entities("data/entities.jsonl")
# Or target specific database layers
executor.load_entities("data/entities.jsonl", target_layers=["dict", "postgres"])
`data/entities.jsonl`:

```jsonl
{"entity_id": "Q123", "label": "Kyiv", "description": "Capital and largest city of Ukraine", "entity_type": "city", "popularity": 1000000, "aliases": ["Kiev"]}
{"entity_id": "Q456", "label": "Dnipro River", "description": "Major river flowing through Ukraine and Belarus", "entity_type": "river", "popularity": 950000, "aliases": ["Dnieper"]}
{"entity_id": "Q789", "label": "Carpathian Mountains", "description": "Mountain range in Central and Eastern Europe", "entity_type": "mountain_range", "popularity": 800000, "aliases": ["Carpathians"]}
```
### From a Python List
```python
entities = [
    {
        "entity_id": "Q123",
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
        "aliases": ["Kiev"],
    },
    {
        "entity_id": "Q456",
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
        "aliases": ["Dnieper"],
    },
]
executor.load_entities(entities)
```
### From a Python Dict
Keys are entity IDs, values are entity data:
```python
entities = {
    "Q123": {
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
    },
    "Q456": {
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
    },
}
executor.load_entities(entities)
```
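Whichever format you use, `target_layers` restricts which database layers receive the data, as shown with file input above (assuming the keyword behaves the same for in-memory input):

```python
# Assumption: target_layers also applies to list/dict input
executor.load_entities(entities, target_layers=["dict"])
```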
### Entity Format Reference
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `entity_id` | `str` | yes | — | Unique identifier |
| `label` | `str` | yes | — | Primary name |
| `description` | `str` | no | `""` | Text description (used in templates like `"{label}: {description}"`) |
| `entity_type` | `str` | no | `""` | Category (e.g. `"gene"`, `"disease"`) |
| `aliases` | `list[str]` | no | `[]` | Alternative names for search matching |
| `popularity` | `int` | no | `0` | Ranking score for candidate ordering |
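Only `entity_id` and `label` are required; omitted fields fall back to the defaults above:

```python
executor.load_entities([
    {"entity_id": "Q1", "label": "aspirin"},  # description, entity_type, aliases, popularity use defaults
])
```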
## Advanced Features

### Precomputed Embeddings (BiEncoder)
For BiEncoder models, precomputing label embeddings gives 10-100x speedups:
```python
# Load entities, then precompute
executor.load_entities("data/entities.jsonl")
executor.precompute_embeddings(batch_size=64)

# Or do both in create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
```
Precomputed embeddings are especially beneficial when processing large document collections against the same entity set: compute once, reuse across millions of documents.
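A typical pattern: build the executor once with precomputed embeddings, then reuse it for every batch. Here `document_batches` stands in for your own iterable of `list[str]`:

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    entities="data/entities.jsonl",
    precompute_embeddings=True,  # label embeddings computed once, up front
)

for batch in document_batches:  # your own iterable of list[str]
    result = executor.execute({"texts": batch})
```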
### On-the-Fly Embedding Caching
Instead of precomputing all embeddings upfront, cache them as they are computed during inference:
```python
builder.l3.configure(
    model="knowledgator/gliner-linker-large-v1.0",
    cache_embeddings=True,
)
```
### L4 Reranker
When the candidate set from L2 is large (tens or hundreds of entities), the L4 reranker splits candidates into chunks for efficient processing:
```python
# Via ConfigBuilder
builder.l4.configure(
    model="knowledgator/gliner-multitask-large-v0.5",
    threshold=0.3,
    max_labels=20,  # candidates per inference call
)

# Via create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
)
```
### Custom Pipelines

L1 processors can also be pulled from the processor registry and composed with a custom post-processing pipeline:

```python
# processor_registry is GLiNKER's processor registry; import it from
# wherever your installation exposes it.
l1_processor = processor_registry.get("l1_spacy")(
    config_dict={"model": "en_core_sci_sm"},
    pipeline=[
        ("extract_entities", {}),
        ("filter_by_length", {"min_length": 3}),
        ("deduplicate", {}),
        ("sort_by_position", {}),
    ],
)
```
## Database Setup

### Quick Start (Docker)
```bash
# Start all databases
cd scripts/database
docker-compose up -d

# Load entities into the running databases
bash setup_all.sh
```
### Manual Setup
```python
from glinker import DAGExecutor

# `pipeline` here is a pipeline config, e.g. builder.get_config()
# or a config loaded from YAML (see Creating Pipelines above)
executor = DAGExecutor(pipeline)
executor.load_entities(
    filepath="data/entities.jsonl",
    target_layers=["redis", "elasticsearch", "postgres"],
    batch_size=1000,
)
```
For production deployments with large entity databases (100K+ entities), Elasticsearch or PostgreSQL backends are recommended over the in-memory dictionary backend.