Configuration

GLiNKER pipelines can be configured entirely through YAML files, which makes setups reproducible and easy to share. A YAML config gives full control over every node in the pipeline. Load one with:

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")
```
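
The factory returns an executor for the configured pipeline. The exact invocation API depends on your GLiNKER version; the sketch below assumes the executor exposes a `run` method that accepts the `$input` payload referenced by the configs on this page (a dict with a `texts` field). Verify both assumptions against your installation.

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")

# Hypothetical invocation: the method name `run` and the shape of the return
# value are assumptions, not confirmed GLiNKER API. The configs on this page
# read their input from "$input" with a "texts" field, so the payload mirrors
# that structure.
result = executor.run({"texts": ["Aspirin is commonly used to treat headaches."]})
print(result)  # inspect the output keys defined by the config (e.g. "l0_result")
```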

Simple Pipeline (L2 → L3 → L0, No NER)

Equivalent to `create_simple`. There is no L1 node; the input texts are passed directly to L2 and L3:

name: "simple"
description: "Simple pipeline - L3 only with entity database"

nodes:
- id: "l2"
processor: "l2_chain"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l2_result"
schema:
template: "{label}"
config:
max_candidates: 30
min_popularity: 0
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact"]

- id: "l3"
processor: "l3_batch"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}"
config:
model_name: "knowledgator/gliner-bi-base-v2.0"
device: "cpu"
threshold: 0.5
flat_ner: true
multi_label: false
use_precomputed_embeddings: true
cache_embeddings: false
max_length: 512

- id: "l0"
processor: "l0_aggregator"
requires: ["l2", "l3"]
inputs:
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l3_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: false
min_confidence: 0.0
include_unlinked: true
position_tolerance: 2

Full Pipeline with spaCy NER (L1 → L2 → L3 → L0)

name: "dict_default"
description: "In-memory dict layer with spaCy NER"

nodes:
- id: "l1"
processor: "l1_spacy"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l1_result"
config:
model: "en_core_sci_sm"
device: "cpu"
batch_size: 1
min_entity_length: 2
include_noun_chunks: true

- id: "l2"
processor: "l2_chain"
requires: ["l1"]
inputs:
mentions:
source: "l1_result"
fields: "entities"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 5
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact", "fuzzy"]
fuzzy:
max_distance: 64
min_similarity: 0.6

- id: "l3"
processor: "l3_batch"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-linker-large-v1.0"
device: "cpu"
threshold: 0.5
flat_ner: true
multi_label: false
max_length: 512

- id: "l0"
processor: "l0_aggregator"
requires: ["l1", "l2", "l3"]
inputs:
l1_entities:
source: "l1_result"
fields: "entities"
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l3_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: true
min_confidence: 0.0
include_unlinked: true
position_tolerance: 2

Pipeline with L4 Reranker (L1 → L2 → L3 → L4 → L0)

Use this pipeline when the candidate set is large. L4 splits the candidates into chunks of `max_labels` and runs GLiNER inference on each chunk; a short sketch of the chunking idea follows the config below:

name: "dict_reranker"
description: "In-memory dict with L4 GLiNER reranking"

nodes:
- id: "l1"
processor: "l1_gliner"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l1_result"
config:
model: "knowledgator/gliner-bi-base-v2.0"
labels: ["gene", "drug", "disease", "person", "organization"]
device: "cpu"

- id: "l2"
processor: "l2_chain"
requires: ["l1"]
inputs:
mentions:
source: "l1_result"
fields: "entities"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 100
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact", "fuzzy"]

- id: "l3"
processor: "l3_batch"
requires: ["l1", "l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-linker-base-v1.0"
device: "cpu"
threshold: 0.5
use_precomputed_embeddings: true

- id: "l4"
processor: "l4_reranker"
requires: ["l1", "l2", "l3"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
l1_entities:
source: "l1_result"
fields: "entities"
output:
key: "l4_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-multitask-large-v0.5"
device: "cpu"
threshold: 0.3
max_labels: 20 # candidates per inference call

- id: "l0"
processor: "l0_aggregator"
requires: ["l1", "l2", "l4"]
inputs:
l1_entities:
source: "l1_result"
fields: "entities"
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l4_result" # L0 reads from L4 instead of L3
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: true
min_confidence: 0.0
include_unlinked: true
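
To make the chunking behaviour concrete, here is a rough, self-contained illustration of how splitting a large candidate set into groups of `max_labels` bounds the number of labels per GLiNER call. This is a conceptual sketch of the idea, not the actual `l4_reranker` implementation.

```python
from typing import Iterator

def chunk_candidates(candidates: list[str], max_labels: int = 20) -> Iterator[list[str]]:
    """Yield candidate labels in groups of at most `max_labels`."""
    for start in range(0, len(candidates), max_labels):
        yield candidates[start:start + max_labels]

# With 100 candidates from L2 and max_labels = 20, the reranker runs five
# inference calls, each scoring the text against 20 candidate labels; the
# per-candidate scores are then merged to pick the best link.
candidates = [f"Entity {i}: description {i}" for i in range(100)]
for batch in chunk_candidates(candidates, max_labels=20):
    print(len(batch))  # 20, five times
```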

Simple Pipeline with Reranker Only (L2 → L4 → L0, No L1/L3)

This setup skips both NER and L3; L4 handles entity linking directly with chunked inference:

name: "simple_reranker"
description: "Simple pipeline with L4 reranker - no L1 or L3"

nodes:
- id: "l2"
processor: "l2_chain"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 100
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact"]

- id: "l4"
processor: "l4_reranker"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l4_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-multitask-large-v0.5"
device: "cpu"
threshold: 0.5
max_labels: 20

- id: "l0"
processor: "l0_aggregator"
requires: ["l2", "l4"]
inputs:
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l4_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: false
min_confidence: 0.0
include_unlinked: true

Production Config with Multiple Database Layers

name: "production_pipeline"

nodes:
- id: "l2"
processor: "l2_chain"
config:
layers:
- type: "redis"
priority: 2
ttl: 3600
- type: "elasticsearch"
priority: 1
ttl: 86400
- type: "postgres"
priority: 0

**Important:** Never store database passwords directly in YAML configuration files. Use environment variable references or a secrets manager for production deployments.
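
One way to keep credentials out of the checked-in YAML is to load the config, inject secrets from the environment at runtime, and hand GLiNKER a temporary, filled-in copy. The sketch below only relies on `ProcessorFactory.create_pipeline` from above; the `password` field name, the config path, and the environment variable name are assumptions about your setup, so adapt them to whatever your postgres layer actually expects.

```python
import os
import tempfile

import yaml

from glinker import ProcessorFactory

# Load the checked-in config, which deliberately contains no credentials.
with open("configs/production.yaml") as f:
    config = yaml.safe_load(f)

# Inject the password from the environment at runtime. "password" is a
# hypothetical field name; use the key your postgres layer expects.
for node in config.get("nodes", []):
    for layer in node.get("config", {}).get("layers", []):
        if layer.get("type") == "postgres":
            layer["password"] = os.environ["POSTGRES_PASSWORD"]

# Write a temporary, filled-in copy and build the pipeline from it.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(config, tmp)

executor = ProcessorFactory.create_pipeline(tmp.name)
```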


Node Configuration Reference

L1 Processors

l1_spacy

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | spaCy model name (e.g. "en_core_web_sm") |
| device | string | "cpu" | Torch device |
| batch_size | int | 1 | Batch size for processing |
| min_entity_length | int | 2 | Minimum entity character length |
| include_noun_chunks | bool | true | Include noun chunks as candidate mentions |

l1_gliner

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | GLiNER model ID |
| labels | list[str] | required | Entity labels to detect |
| device | string | "cpu" | Torch device |

L2 Processor (l2_chain)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| max_candidates | int | 30 | Max candidates per mention |
| min_popularity | int | 0 | Minimum entity popularity score |
| layers | list | required | Database layer configurations |

Database Layer Options

| Parameter | Type | Description |
|-----------|------|-------------|
| type | string | "dict", "redis", "elasticsearch", or "postgres" |
| priority | int | Higher-priority layers are queried first |
| write | bool | Whether to write results back to this layer |
| ttl | int | Cache time-to-live in seconds (Redis/Elasticsearch) |
| search_mode | list[str] | Search modes: "exact", "fuzzy" |

L3 Processor (l3_batch)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_name | string | required | GLiNER linker model ID |
| device | string | "cpu" | Torch device |
| threshold | float | 0.5 | Minimum confidence score |
| flat_ner | bool | true | Use flat NER mode |
| multi_label | bool | false | Allow multiple labels per span |
| use_precomputed_embeddings | bool | false | Use pre-computed label embeddings |
| cache_embeddings | bool | false | Cache embeddings on the fly |
| max_length | int | 512 | Max sequence length |

L4 Processor (l4_reranker)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_name | string | required | GLiNER reranker model ID |
| device | string | "cpu" | Torch device |
| threshold | float | 0.3 | Minimum confidence score |
| max_labels | int | 20 | Max candidate labels per inference call |

L0 Processor (l0_aggregator)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| strict_matching | bool | false | Require exact span matching between L1 and L3 |
| min_confidence | float | 0.0 | Minimum confidence threshold for results |
| include_unlinked | bool | true | Include unlinked mentions in output |
| position_tolerance | int | 2 | Character tolerance for span matching |
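
To illustrate what position_tolerance means in practice, the snippet below shows tolerance-based span matching in isolation. It is a conceptual sketch of the matching rule, not the aggregator's actual code.

```python
def spans_match(l1_span: tuple[int, int], l3_span: tuple[int, int],
                strict: bool = False, tolerance: int = 2) -> bool:
    """Return True if two (start, end) character spans refer to the same mention."""
    if strict:
        return l1_span == l3_span
    return (abs(l1_span[0] - l3_span[0]) <= tolerance
            and abs(l1_span[1] - l3_span[1]) <= tolerance)

# "Aspirin" detected at (0, 7) by L1 and (0, 8) by L3 still counts as the
# same mention under the default tolerance of 2 characters.
print(spans_match((0, 7), (0, 8)))               # True
print(spans_match((0, 7), (0, 8), strict=True))  # False
```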

Schema Configuration

The `schema` block controls how entities are represented as labels across the pipeline:

| Parameter | Type | Description |
|-----------|------|-------------|
| template | string | Format string for entity labels (e.g. "{label}", "{label}: {description}") |

**Note:** The same template should be used consistently across L2, L3, and L4 nodes to ensure entity representations match throughout the pipeline.
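
The templates look like standard Python format strings filled in with fields from the entity database (an assumption based on the `{label}` syntax; verify the exact field names against your database schema). A minimal sketch of why a template mismatch between nodes matters:

```python
# Hypothetical candidate record; real records come from the L2 entity database.
candidate = {"label": "Aspirin", "description": "non-steroidal anti-inflammatory drug"}

# The same candidate rendered under the two templates used on this page.
# If L2 and L3 use different templates, the label strings they exchange no
# longer line up, and linking quality degrades.
print("{label}".format(**candidate))                 # Aspirin
print("{label}: {description}".format(**candidate))  # Aspirin: non-steroidal anti-inflammatory drug
```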