
Usage

GLiNKER offers three ways to create an entity linking pipeline, from simplest to most configurable.

Creating Pipelines

Option 1: ProcessorFactory.create_simple

ProcessorFactory.create_simple builds a pipeline in one call. Without an L1 NER step, the model links entities directly from the input text against all loaded entities.

from glinker import ProcessorFactory

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
)

executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy."]})

With inline entities (no file needed):

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    entities=[
        {"entity_id": "Q101", "label": "insulin", "description": "Peptide hormone regulating blood glucose"},
        {"entity_id": "Q102", "label": "glucose", "description": "Primary blood sugar and key metabolic fuel"},
        {"entity_id": "Q103", "label": "GLUT4", "description": "Insulin-responsive glucose transporter in muscle and adipose tissue"},
        {"entity_id": "Q104", "label": "pancreatic beta cell", "description": "Endocrine cell type that secretes insulin"},
    ],
)

result = executor.execute({
    "texts": [
        "After a meal, pancreatic beta cells release insulin, which promotes GLUT4 translocation and increases glucose uptake in muscle."
    ]
})

With a reranker (L2 → L3 → L4 → L0):

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
    reranker_threshold=0.3,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)

With entity descriptions in the template:

executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    template="{label}: {description}",  # L3 sees "BRCA1: Breast cancer type 1 susceptibility protein"
    entities="data/entities.jsonl",
)

All create_simple Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `model_name` | (required) | HuggingFace model ID or local path |
| `device` | `"cpu"` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`) |
| `threshold` | `0.5` | Minimum score for entity predictions |
| `template` | `"{label}"` | Format string for entity labels (e.g. `"{label}: {description}"`) |
| `max_length` | `512` | Max sequence length for tokenization |
| `token` | `None` | HuggingFace auth token for gated models |
| `entities` | `None` | Entity data to load immediately (file path, list of dicts, or dict of dicts) |
| `precompute_embeddings` | `False` | Pre-embed all entity labels after loading (BiEncoder only) |
| `verbose` | `False` | Enable verbose logging |
| `reranker_model` | `None` | GLiNER model for L4 reranking (adds L4 node when set) |
| `reranker_max_labels` | `20` | Max candidate labels per L4 inference call |
| `reranker_threshold` | `None` | Score threshold for L4 (defaults to `threshold`) |

Option 2: From a YAML Config File

For full control over every layer, define the pipeline in YAML and load it:

from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("configs/pipelines/dict/simple.yaml")
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["TP53 mutations cause cancer"]})

See the Configuration guide for full YAML config examples.

Option 3: ConfigBuilder (Programmatic)

Build configs in Python with full control over each layer:

from glinker import ConfigBuilder, DAGExecutor

builder = ConfigBuilder(name="my_pipeline")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

executor = DAGExecutor(builder.get_config())
executor.load_entities("data/entities.jsonl", target_layers=["dict"])

With multiple database layers:

builder = ConfigBuilder(name="production")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein"])
builder.l2.add("redis", priority=2, ttl=3600)
builder.l2.add("elasticsearch", priority=1, ttl=86400)
builder.l2.add("postgres", priority=0)
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0", use_precomputed_embeddings=True)
builder.l0.configure(strict_matching=True, min_confidence=0.3)
builder.save("config.yaml")

With L4 reranker:

builder = ConfigBuilder(name="reranked")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-base-v1.0")
builder.l4.configure(
    model="knowledgator/gliner-linker-rerank-v1.0",
    threshold=0.3,
    max_labels=20,
)
builder.save("config.yaml") # Generates L1 → L2 → L3 → L4 → L0

Loading Entities

Entities can be loaded after pipeline creation via executor.load_entities(), or passed directly to create_simple(entities=...). Three input formats are supported.

From a JSONL File

One JSON object per line:

executor.load_entities("data/entities.jsonl")

# Or target specific database layers
executor.load_entities("data/entities.jsonl", target_layers=["dict", "postgres"])

data/entities.jsonl:

{"entity_id": "Q123", "label": "Kyiv", "description": "Capital and largest city of Ukraine", "entity_type": "city", "popularity": 1000000, "aliases": ["Kiev"]}
{"entity_id": "Q456", "label": "Dnipro River", "description": "Major river flowing through Ukraine and Belarus", "entity_type": "river", "popularity": 950000, "aliases": ["Dnieper"]}
{"entity_id": "Q789", "label": "Carpathian Mountains", "description": "Mountain range in Central and Eastern Europe", "entity_type": "mountain_range", "popularity": 800000, "aliases": ["Carpathians"]}
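To sanity-check a JSONL entity file before handing it to load_entities, the standard library is enough. A sketch (GLiNKER's own loader may perform additional validation):

```python
import json

def read_entities_jsonl(lines):
    """Parse JSONL entity records, skipping blank lines."""
    entities = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "entity_id" not in record or "label" not in record:
            raise ValueError(f"missing required field in: {record}")
        entities.append(record)
    return entities

# In practice: with open("data/entities.jsonl") as f: entities = read_entities_jsonl(f)
sample = [
    '{"entity_id": "Q123", "label": "Kyiv", "aliases": ["Kiev"]}',
    '{"entity_id": "Q456", "label": "Dnipro River", "aliases": ["Dnieper"]}',
]
entities = read_entities_jsonl(sample)
```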

From a Python List

entities = [
    {
        "entity_id": "Q123",
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
        "aliases": ["Kiev"],
    },
    {
        "entity_id": "Q456",
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
        "aliases": ["Dnieper"],
    },
]

executor.load_entities(entities)

From a Python Dict

Keys are entity IDs, values are entity data:

entities = {
    "Q123": {
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
    },
    "Q456": {
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
    },
}

executor.load_entities(entities)
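The dict form is equivalent to the list form with the entity ID folded into each record. A pure-Python sketch of that mapping (an illustration, not GLiNKER API):

```python
def dict_to_list(entities_by_id):
    """Fold dict keys back into per-record entity_id fields."""
    return [{"entity_id": eid, **data} for eid, data in entities_by_id.items()]

entities = {
    "Q123": {"label": "Kyiv", "entity_type": "city"},
    "Q456": {"label": "Dnipro River", "entity_type": "river"},
}
as_list = dict_to_list(entities)
# as_list[0] == {"entity_id": "Q123", "label": "Kyiv", "entity_type": "city"}
```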

Entity Format Reference

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `entity_id` | `str` | yes | — | Unique identifier |
| `label` | `str` | yes | — | Primary name |
| `description` | `str` | no | `""` | Text description (used in templates like `"{label}: {description}"`) |
| `entity_type` | `str` | no | `""` | Category (e.g. `"gene"`, `"disease"`) |
| `aliases` | `list[str]` | no | `[]` | Alternative names for search matching |
| `popularity` | `int` | no | `0` | Ranking score for candidate ordering |
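A small validator that applies the defaults from the table above can catch malformed records early. This is a sketch mirroring the documented field contract, not GLiNKER's internal validation:

```python
ENTITY_DEFAULTS = {"description": "", "entity_type": "", "aliases": [], "popularity": 0}

def normalize_entity(record):
    """Check required fields and fill in the documented defaults."""
    for field in ("entity_id", "label"):
        if field not in record:
            raise ValueError(f"entity record missing required field {field!r}")
    # Copy list defaults so they are never shared between records.
    out = {k: (list(v) if isinstance(v, list) else v) for k, v in ENTITY_DEFAULTS.items()}
    out.update(record)
    return out

e = normalize_entity({"entity_id": "Q123", "label": "Kyiv"})
# e["aliases"] == [] and e["popularity"] == 0
```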

Advanced Features

Precomputed Embeddings (BiEncoder)

For BiEncoder models, precomputing label embeddings gives 10-100x speedups:

# Load entities, then precompute
executor.load_entities("data/entities.jsonl")
executor.precompute_embeddings(batch_size=64)

# Or do both in create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
**Tip:** Precomputed embeddings are especially beneficial when processing large document collections with the same entity set. Compute once, reuse across millions of documents.

On-the-Fly Embedding Caching

Instead of precomputing all embeddings upfront, cache them as they are computed during inference:

builder.l3.configure(
    model="knowledgator/gliner-linker-large-v1.0",
    cache_embeddings=True,
)
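Conceptually, this caching amounts to memoizing the embedding function, keyed by the rendered label. A stub sketch of the pattern (the encoder here is a hypothetical placeholder, not the real model):

```python
class EmbeddingCache:
    """Memoize embeddings so each label is encoded at most once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0

    def embed(self, label):
        if label not in self.cache:
            self.calls += 1  # counts how often the underlying encoder actually runs
            self.cache[label] = self.embed_fn(label)
        return self.cache[label]

# Hypothetical stand-in for a real encoder.
cache = EmbeddingCache(lambda label: [float(len(label))])
cache.embed("insulin")
cache.embed("insulin")  # second call is served from the cache
# cache.calls == 1
```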

L4 Reranker

When the candidate set from L2 is large (tens or hundreds of entities), the L4 reranker splits candidates into chunks for efficient processing:

# Via ConfigBuilder
builder.l4.configure(
    model="knowledgator/gliner-multitask-large-v0.5",
    threshold=0.3,
    max_labels=20,  # candidates per inference call
)

# Via create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
)
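The chunking itself is straightforward: candidates are split into groups of at most max_labels, and each group becomes one inference call. A pure-Python sketch of that split:

```python
def chunk_labels(candidates, max_labels):
    """Split a candidate list into reranker-sized batches."""
    return [candidates[i:i + max_labels] for i in range(0, len(candidates), max_labels)]

candidates = [f"entity_{i}" for i in range(45)]
batches = chunk_labels(candidates, max_labels=20)
# batch sizes: [20, 20, 5] -> three inference calls
```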

Custom Pipelines

# Custom L1 processing pipeline
l1_processor = processor_registry.get("l1_spacy")(
    config_dict={"model": "en_core_sci_sm"},
    pipeline=[
        ("extract_entities", {}),
        ("filter_by_length", {"min_length": 3}),
        ("deduplicate", {}),
        ("sort_by_position", {}),
    ],
)
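Each (name, kwargs) pair in pipeline names a processing step applied in order. A sketch of that dispatch pattern with toy steps (the step implementations below are illustrative, not GLiNKER's registry):

```python
def filter_by_length(spans, min_length):
    return [s for s in spans if len(s) >= min_length]

def deduplicate(spans):
    return list(dict.fromkeys(spans))  # preserves first-seen order

STEPS = {"filter_by_length": filter_by_length, "deduplicate": deduplicate}

def run_pipeline(spans, pipeline):
    """Apply each named step, in order, with its keyword arguments."""
    for name, kwargs in pipeline:
        spans = STEPS[name](spans, **kwargs)
    return spans

spans = ["TP53", "a", "TP53", "BRCA1"]
out = run_pipeline(spans, [("filter_by_length", {"min_length": 3}), ("deduplicate", {})])
# out == ["TP53", "BRCA1"]
```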

Database Setup

Quick Start (Docker)

# Start all databases
cd scripts/database
docker-compose up -d

# Load entities (run from scripts/database after the cd above)
bash setup_all.sh

Manual Setup

from glinker import DAGExecutor

executor = DAGExecutor(config)  # config from ConfigBuilder.get_config() or a loaded YAML pipeline
executor.load_entities(
    filepath="data/entities.jsonl",
    target_layers=["redis", "elasticsearch", "postgres"],
    batch_size=1000,
)
**Important:** For production deployments with large entity databases (100K+ entities), Elasticsearch or PostgreSQL backends are recommended over the in-memory dictionary backend.