# Configuration

GLiNKER pipelines can be fully configured through YAML files for reproducible, shareable setups, with control over every node in the pipeline. Load a config with:
```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")
```
## Simple Pipeline (L2 → L3 → L0, No NER)

Equivalent to `create_simple`. There is no L1 node; texts are passed directly to L2/L3:
```yaml
name: "simple"
description: "Simple pipeline - L3 only with entity database"

nodes:
  - id: "l2"
    processor: "l2_chain"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l2_result"
    schema:
      template: "{label}"
    config:
      max_candidates: 30
      min_popularity: 0
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact"]

  - id: "l3"
    processor: "l3_batch"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}"
    config:
      model_name: "knowledgator/gliner-bi-base-v2.0"
      device: "cpu"
      threshold: 0.5
      flat_ner: true
      multi_label: false
      use_precomputed_embeddings: true
      cache_embeddings: false
      max_length: 512

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l2", "l3"]
    inputs:
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l3_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: false
      min_confidence: 0.0
      include_unlinked: true
      position_tolerance: 2
```
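Each node reads its inputs either from the pipeline input (`$input`) or from another node's output key, and `requires` declares execution dependencies. As a rough, hypothetical sketch (the real executor returned by `ProcessorFactory.create_pipeline` is more involved), this wiring implies a topological execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical node specs mirroring the YAML above: each node lists the
# node ids it requires and the key its result is stored under.
nodes = {
    "l2": {"requires": [], "output_key": "l2_result"},
    "l3": {"requires": ["l2"], "output_key": "l3_result"},
    "l0": {"requires": ["l2", "l3"], "output_key": "l0_result"},
}

def execution_order(nodes):
    """Return node ids ordered so every node runs after its dependencies."""
    ts = TopologicalSorter({nid: spec["requires"] for nid, spec in nodes.items()})
    return list(ts.static_order())

print(execution_order(nodes))  # l2 runs before l3, which runs before l0
```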
## Full Pipeline with spaCy NER (L1 → L2 → L3 → L0)
```yaml
name: "dict_default"
description: "In-memory dict layer with spaCy NER"

nodes:
  - id: "l1"
    processor: "l1_spacy"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l1_result"
    config:
      model: "en_core_sci_sm"
      device: "cpu"
      batch_size: 1
      min_entity_length: 2
      include_noun_chunks: true

  - id: "l2"
    processor: "l2_chain"
    requires: ["l1"]
    inputs:
      mentions:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 5
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact", "fuzzy"]
          fuzzy:
            max_distance: 64
            min_similarity: 0.6

  - id: "l3"
    processor: "l3_batch"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-linker-large-v1.0"
      device: "cpu"
      threshold: 0.5
      flat_ner: true
      multi_label: false
      max_length: 512

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l1", "l2", "l3"]
    inputs:
      l1_entities:
        source: "l1_result"
        fields: "entities"
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l3_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: true
      min_confidence: 0.0
      include_unlinked: true
      position_tolerance: 2
```
## Pipeline with L4 Reranker (L1 → L2 → L3 → L4 → L0)

Use this when the candidate set is large. L4 splits the candidates into chunks of `max_labels` and runs GLiNER inference on each chunk:
```yaml
name: "dict_reranker"
description: "In-memory dict with L4 GLiNER reranking"

nodes:
  - id: "l1"
    processor: "l1_gliner"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l1_result"
    config:
      model: "knowledgator/gliner-bi-base-v2.0"
      labels: ["gene", "drug", "disease", "person", "organization"]
      device: "cpu"

  - id: "l2"
    processor: "l2_chain"
    requires: ["l1"]
    inputs:
      mentions:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 100
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact", "fuzzy"]

  - id: "l3"
    processor: "l3_batch"
    requires: ["l1", "l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-linker-base-v1.0"
      device: "cpu"
      threshold: 0.5
      use_precomputed_embeddings: true

  - id: "l4"
    processor: "l4_reranker"
    requires: ["l1", "l2", "l3"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
      l1_entities:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l4_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-multitask-large-v0.5"
      device: "cpu"
      threshold: 0.3
      max_labels: 20  # candidates per inference call

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l1", "l2", "l4"]
    inputs:
      l1_entities:
        source: "l1_result"
        fields: "entities"
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l4_result"  # L0 reads from L4 instead of L3
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: true
      min_confidence: 0.0
      include_unlinked: true
```
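The chunking behavior described above (splitting the candidate set into groups of at most `max_labels` per inference call) can be sketched as follows; the actual reranker wraps each chunk in a GLiNER call, which is omitted here:

```python
def chunk_candidates(candidates, max_labels=20):
    """Split a candidate list into chunks of at most max_labels items,
    so each inference call stays within the model's label limit."""
    return [candidates[i:i + max_labels]
            for i in range(0, len(candidates), max_labels)]

# 45 candidates with max_labels=20 -> 3 inference calls (20 + 20 + 5)
chunks = chunk_candidates([f"cand_{i}" for i in range(45)], max_labels=20)
print([len(c) for c in chunks])  # [20, 20, 5]
```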
## Simple Pipeline with Reranker Only (L2 → L4 → L0, No L1/L3)

Skips both NER and L3; L4 handles entity linking directly with chunked inference:
```yaml
name: "simple_reranker"
description: "Simple pipeline with L4 reranker - no L1 or L3"

nodes:
  - id: "l2"
    processor: "l2_chain"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 100
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact"]

  - id: "l4"
    processor: "l4_reranker"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l4_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-multitask-large-v0.5"
      device: "cpu"
      threshold: 0.5
      max_labels: 20

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l2", "l4"]
    inputs:
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l4_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: false
      min_confidence: 0.0
      include_unlinked: true
```
## Production Config with Multiple Database Layers
```yaml
name: "production_pipeline"

nodes:
  - id: "l2"
    processor: "l2_chain"
    config:
      layers:
        - type: "redis"
          priority: 2
          ttl: 3600
        - type: "elasticsearch"
          priority: 1
          ttl: 86400
        - type: "postgres"
          priority: 0
```
> **Security note:** Never store database passwords directly in YAML configuration files. Use environment variable references or a secrets manager for production deployments.
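One common pattern (not necessarily something GLiNKER provides out of the box) is to expand `${VAR}` placeholders in the raw YAML from the environment before parsing it:

```python
import os
import re

def expand_env_vars(raw_yaml: str) -> str:
    """Replace ${VAR} placeholders with values from the environment,
    raising if a referenced variable is unset."""
    def repl(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return re.sub(r"\$\{(\w+)\}", repl, raw_yaml)

os.environ["PG_PASSWORD"] = "example-secret"  # set by the deployment, not the config
raw = 'layers:\n  - type: "postgres"\n    password: "${PG_PASSWORD}"\n'
print(expand_env_vars(raw))
```

The expanded string can then be fed to the YAML parser instead of reading the file directly.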
## Node Configuration Reference

### L1 Processors
#### l1_spacy

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | spaCy model name (e.g. `"en_core_web_sm"`) |
| `device` | string | `"cpu"` | Torch device |
| `batch_size` | int | `1` | Batch size for processing |
| `min_entity_length` | int | `2` | Minimum entity character length |
| `include_noun_chunks` | bool | `true` | Include noun chunks as candidate mentions |
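For intuition, `min_entity_length` amounts to dropping very short mention strings before candidate lookup; a minimal sketch (the real processor works on spaCy spans, not bare strings):

```python
def filter_mentions(mentions, min_entity_length=2):
    """Drop mentions shorter than min_entity_length characters,
    measured after stripping surrounding whitespace."""
    return [m for m in mentions if len(m.strip()) >= min_entity_length]

print(filter_mentions(["BRCA1", "a", " X ", "aspirin"], min_entity_length=2))
# ['BRCA1', 'aspirin']
```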
#### l1_gliner

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | GLiNER model ID |
| `labels` | list[str] | required | Entity labels to detect |
| `device` | string | `"cpu"` | Torch device |
### L2 Processor (l2_chain)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_candidates` | int | `30` | Max candidates per mention |
| `min_popularity` | int | `0` | Minimum entity popularity score |
| `layers` | list | required | Database layer configurations |
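As a rough sketch, the two limits act as a popularity filter followed by truncation, assuming candidates carry a popularity score and arrive ranked by relevance (the names below are illustrative, not GLiNKER's internals):

```python
def select_candidates(candidates, max_candidates=30, min_popularity=0):
    """Keep candidates at or above min_popularity, then truncate to
    max_candidates. Candidates are (label, popularity) pairs, assumed
    pre-sorted by relevance."""
    kept = [c for c in candidates if c[1] >= min_popularity]
    return kept[:max_candidates]

cands = [("aspirin", 90), ("asparagine", 5), ("ASA", 40), ("aspartame", 1)]
print(select_candidates(cands, max_candidates=2, min_popularity=5))
# [('aspirin', 90), ('asparagine', 5)]
```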
#### Database Layer Options

| Parameter | Type | Description |
|---|---|---|
| `type` | string | `"dict"`, `"redis"`, `"elasticsearch"`, or `"postgres"` |
| `priority` | int | Higher-priority layers are queried first |
| `write` | bool | Whether to write results back to this layer |
| `ttl` | int | Cache time-to-live in seconds (Redis/ES) |
| `search_mode` | list[str] | Search modes: `"exact"`, `"fuzzy"` |
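Together, `priority` and `write` describe a read-through chain: query layers from highest to lowest priority, and on a hit, write the result back into writable layers that missed. A hypothetical sketch with plain dicts standing in for the backing stores:

```python
def chain_lookup(key, layers):
    """layers: list of dicts with 'priority', 'write', and 'store' (a dict).
    Query highest-priority first; on a hit, backfill writable layers that
    were tried earlier and missed."""
    ordered = sorted(layers, key=lambda l: l["priority"], reverse=True)
    missed = []
    for layer in ordered:
        if key in layer["store"]:
            for m in missed:  # write-through to faster layers that missed
                if m["write"]:
                    m["store"][key] = layer["store"][key]
            return layer["store"][key]
        missed.append(layer)
    return None

redis = {"priority": 2, "write": True, "store": {}}
postgres = {"priority": 0, "write": False, "store": {"Q1": "aspirin"}}
print(chain_lookup("Q1", [postgres, redis]))  # found in postgres, cached in redis
```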
### L3 Processor (l3_batch)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | string | required | GLiNER linker model ID |
| `device` | string | `"cpu"` | Torch device |
| `threshold` | float | `0.5` | Minimum confidence score |
| `flat_ner` | bool | `true` | Use flat NER mode |
| `multi_label` | bool | `false` | Allow multiple labels per span |
| `use_precomputed_embeddings` | bool | `false` | Use pre-computed label embeddings |
| `cache_embeddings` | bool | `false` | Cache embeddings on the fly |
| `max_length` | int | `512` | Max sequence length |
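For intuition, `flat_ner: true` corresponds to greedy selection of non-overlapping spans by score, a common decoding strategy (GLiNER's actual decoder may differ in detail):

```python
def flat_decode(spans, threshold=0.5):
    """spans: (start, end, label, score) tuples. Keep spans at or above
    the threshold, then greedily select by descending score, skipping
    any span that overlaps an already-selected one."""
    selected = []
    for span in sorted(spans, key=lambda s: s[3], reverse=True):
        if span[3] < threshold:
            break  # remaining spans all score lower
        if all(span[1] <= s[0] or span[0] >= s[1] for s in selected):
            selected.append(span)
    return sorted(selected, key=lambda s: s[0])

spans = [(0, 6, "drug", 0.9), (4, 10, "gene", 0.8), (12, 18, "disease", 0.4)]
print(flat_decode(spans))  # overlapping and sub-threshold spans dropped
```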
### L4 Processor (l4_reranker)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | string | required | GLiNER reranker model ID |
| `device` | string | `"cpu"` | Torch device |
| `threshold` | float | `0.3` | Minimum confidence score |
| `max_labels` | int | `20` | Max candidate labels per inference call |
### L0 Processor (l0_aggregator)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `strict_matching` | bool | `false` | Require exact span matching between L1 and L3 |
| `min_confidence` | float | `0.0` | Minimum confidence threshold for results |
| `include_unlinked` | bool | `true` | Include unlinked mentions in output |
| `position_tolerance` | int | `2` | Character tolerance for span matching |
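The `position_tolerance` behavior can be sketched as treating two spans as the same mention when both boundaries differ by at most that many characters (an illustrative reading of the matching rule, not GLiNKER's exact implementation):

```python
def spans_match(a, b, position_tolerance=2):
    """a, b: (start, end) character spans. Consider them the same mention
    if both boundaries differ by at most position_tolerance characters."""
    return (abs(a[0] - b[0]) <= position_tolerance
            and abs(a[1] - b[1]) <= position_tolerance)

print(spans_match((10, 20), (11, 21)))  # boundaries within tolerance
print(spans_match((10, 20), (15, 25)))  # boundaries too far apart
```

With `strict_matching: true`, this check effectively tightens to exact boundary equality.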
## Schema Configuration

The `schema` block controls how entities are represented as labels across the pipeline:

| Parameter | Type | Description |
|---|---|---|
| `template` | string | Format string for entity labels (e.g. `"{label}"`, `"{label}: {description}"`) |
The same template should be used consistently across L2, L3, and L4 nodes to ensure entity representations match throughout the pipeline.
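The template placeholders behave like a Python format string applied to each entity record (assuming the entity exposes `label` and `description` fields):

```python
entity = {"label": "aspirin", "description": "non-steroidal anti-inflammatory drug"}

# The same template string used in the L2/L3/L4 nodes above
template = "{label}: {description}"
print(template.format(**entity))  # aspirin: non-steroidal anti-inflammatory drug
```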