Configuration

GLiNKER pipelines can be configured entirely through YAML files, which makes setups reproducible and easy to share. A YAML config gives full control over every node in the pipeline. Load one with:

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")
```
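
The factory returns an executor for the configured pipeline. The exact invocation API depends on your GLiNKER version; the sketch below assumes the executor exposes a `run` method that accepts the `$input` payload referenced by the configs on this page (a dict with a `texts` field). Verify both assumptions against your installation.

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")

# Hypothetical invocation: the method name `run` and the shape of the return
# value are assumptions, not confirmed GLiNKER API. The configs on this page
# read their input from "$input" with a "texts" field, so the payload mirrors
# that structure.
result = executor.run({"texts": ["Aspirin is commonly used to treat headaches."]})
print(result)  # inspect the output keys defined by the config (e.g. "l0_result")
```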

Simple Pipeline (L2 → L3 → L0, No NER)

Equivalent to `create_simple`. There is no L1 node; the input texts are passed directly to L2 and L3:

name: "simple"
description: "Simple pipeline - L3 only with entity database"

nodes:
- id: "l2"
processor: "l2_chain"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l2_result"
schema:
template: "{label}"
config:
max_candidates: 30
min_popularity: 0
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact"]

- id: "l3"
processor: "l3_batch"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}"
config:
model_name: "knowledgator/gliner-bi-base-v2.0"
device: "cpu"
threshold: 0.5
flat_ner: true
multi_label: false
use_precomputed_embeddings: true
cache_embeddings: false
max_length: 512

- id: "l0"
processor: "l0_aggregator"
requires: ["l2", "l3"]
inputs:
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l3_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: false
min_confidence: 0.0
include_unlinked: true
position_tolerance: 2

Full Pipeline with spaCy NER (L1 → L2 → L3 → L0)

name: "dict_default"
description: "In-memory dict layer with spaCy NER"

nodes:
- id: "l1"
processor: "l1_spacy"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l1_result"
config:
model: "en_core_sci_sm"
device: "cpu"
batch_size: 1
min_entity_length: 2
include_noun_chunks: true

- id: "l2"
processor: "l2_chain"
requires: ["l1"]
inputs:
mentions:
source: "l1_result"
fields: "entities"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 5
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact", "fuzzy"]
fuzzy:
max_distance: 64
min_similarity: 0.6

- id: "l3"
processor: "l3_batch"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-linker-large-v1.0"
device: "cpu"
threshold: 0.5
flat_ner: true
multi_label: false
max_length: 512

- id: "l0"
processor: "l0_aggregator"
requires: ["l1", "l2", "l3"]
inputs:
l1_entities:
source: "l1_result"
fields: "entities"
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l3_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: true
min_confidence: 0.0
include_unlinked: true
position_tolerance: 2

Pipeline with L4 Reranker (L1 → L2 → L3 → L4 → L0)

Use this pipeline when the candidate set is large. L4 splits the candidates into chunks of `max_labels` and runs GLiNER inference on each chunk; a short sketch of the chunking idea follows the config below:

name: "dict_reranker"
description: "In-memory dict with L4 GLiNER reranking"

nodes:
- id: "l1"
processor: "l1_gliner"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l1_result"
config:
model: "knowledgator/gliner-bi-base-v2.0"
labels: ["gene", "drug", "disease", "person", "organization"]
device: "cpu"

- id: "l2"
processor: "l2_chain"
requires: ["l1"]
inputs:
mentions:
source: "l1_result"
fields: "entities"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 100
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact", "fuzzy"]

- id: "l3"
processor: "l3_batch"
requires: ["l1", "l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l3_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-linker-base-v1.0"
device: "cpu"
threshold: 0.5
use_precomputed_embeddings: true

- id: "l4"
processor: "l4_reranker"
requires: ["l1", "l2", "l3"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
l1_entities:
source: "l1_result"
fields: "entities"
output:
key: "l4_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-multitask-large-v0.5"
device: "cpu"
threshold: 0.3
max_labels: 20 # candidates per inference call

- id: "l0"
processor: "l0_aggregator"
requires: ["l1", "l2", "l4"]
inputs:
l1_entities:
source: "l1_result"
fields: "entities"
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l4_result" # L0 reads from L4 instead of L3
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: true
min_confidence: 0.0
include_unlinked: true
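
To make the chunking behaviour concrete, here is a rough, self-contained illustration of how splitting a large candidate set into groups of `max_labels` bounds the number of labels per GLiNER call. This is a conceptual sketch of the idea, not the actual `l4_reranker` implementation.

```python
from typing import Iterator

def chunk_candidates(candidates: list[str], max_labels: int = 20) -> Iterator[list[str]]:
    """Yield candidate labels in groups of at most `max_labels`."""
    for start in range(0, len(candidates), max_labels):
        yield candidates[start:start + max_labels]

# With 100 candidates from L2 and max_labels = 20, the reranker runs five
# inference calls, each scoring the text against 20 candidate labels; the
# per-candidate scores are then merged to pick the best link.
candidates = [f"Entity {i}: description {i}" for i in range(100)]
for batch in chunk_candidates(candidates, max_labels=20):
    print(len(batch))  # 20, five times
```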

Simple Pipeline with Reranker Only (L2 → L4 → L0, No L1/L3)

This setup skips both NER and L3; L4 handles entity linking directly with chunked inference:

name: "simple_reranker"
description: "Simple pipeline with L4 reranker - no L1 or L3"

nodes:
- id: "l2"
processor: "l2_chain"
inputs:
texts:
source: "$input"
fields: "texts"
output:
key: "l2_result"
schema:
template: "{label}: {description}"
config:
max_candidates: 100
layers:
- type: "dict"
priority: 0
write: true
search_mode: ["exact"]

- id: "l4"
processor: "l4_reranker"
requires: ["l2"]
inputs:
texts:
source: "$input"
fields: "texts"
candidates:
source: "l2_result"
fields: "candidates"
output:
key: "l4_result"
schema:
template: "{label}: {description}"
config:
model_name: "knowledgator/gliner-multitask-large-v0.5"
device: "cpu"
threshold: 0.5
max_labels: 20

- id: "l0"
processor: "l0_aggregator"
requires: ["l2", "l4"]
inputs:
l2_candidates:
source: "l2_result"
fields: "candidates"
l3_entities:
source: "l4_result"
fields: "entities"
output:
key: "l0_result"
config:
strict_matching: false
min_confidence: 0.0
include_unlinked: true

Production Config with Multiple Database Layers

name: "production_pipeline"

nodes:
- id: "l2"
processor: "l2_chain"
config:
layers:
- type: "redis"
priority: 2
ttl: 3600
- type: "elasticsearch"
priority: 1
ttl: 86400
- type: "postgres"
priority: 0

**Important:** Never store database passwords directly in YAML configuration files. Use environment variable references or a secrets manager for production deployments.
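
One way to keep credentials out of the checked-in YAML is to load the config, inject secrets from the environment at runtime, and hand GLiNKER a temporary, filled-in copy. The sketch below only relies on `ProcessorFactory.create_pipeline` from above; the `password` field name, the config path, and the environment variable name are assumptions about your setup, so adapt them to whatever your postgres layer actually expects.

```python
import os
import tempfile

import yaml

from glinker import ProcessorFactory

# Load the checked-in config, which deliberately contains no credentials.
with open("configs/production.yaml") as f:
    config = yaml.safe_load(f)

# Inject the password from the environment at runtime. "password" is a
# hypothetical field name; use the key your postgres layer expects.
for node in config.get("nodes", []):
    for layer in node.get("config", {}).get("layers", []):
        if layer.get("type") == "postgres":
            layer["password"] = os.environ["POSTGRES_PASSWORD"]

# Write a temporary, filled-in copy and build the pipeline from it.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(config, tmp)

executor = ProcessorFactory.create_pipeline(tmp.name)
```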


Node Configuration Reference

L1 Processors

l1_spacy

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | spaCy model name (e.g. "en_core_web_sm") |
| device | string | "cpu" | Torch device |
| batch_size | int | 1 | Batch size for processing |
| min_entity_length | int | 2 | Minimum entity character length |
| include_noun_chunks | bool | true | Include noun chunks as candidate mentions |

l1_gliner

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | GLiNER model ID |
| labels | list[str] | required | Entity labels to detect |
| device | string | "cpu" | Torch device |

L2 Processor (l2_chain)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| max_candidates | int | 30 | Max candidates per mention |
| min_popularity | int | 0 | Minimum entity popularity score |
| layers | list | required | Database layer configurations |

Database Layer Options

| Parameter | Type | Description |
|-----------|------|-------------|
| type | string | "dict", "redis", "elasticsearch", or "postgres" |
| priority | int | Higher-priority layers are queried first |
| write | bool | Whether to write results back to this layer |
| ttl | int | Cache time-to-live in seconds (Redis/Elasticsearch) |
| search_mode | list[str] | Search modes: "exact", "fuzzy" |

L3 Processor (l3_batch)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_name | string | required | GLiNER linker model ID |
| device | string | "cpu" | Torch device |
| threshold | float | 0.5 | Minimum confidence score |
| flat_ner | bool | true | Use flat NER mode |
| multi_label | bool | false | Allow multiple labels per span |
| use_precomputed_embeddings | bool | false | Use pre-computed label embeddings |
| cache_embeddings | bool | false | Cache embeddings on the fly |
| max_length | int | 512 | Max sequence length |

L4 Processor (l4_reranker)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model_name | string | required | GLiNER reranker model ID |
| device | string | "cpu" | Torch device |
| threshold | float | 0.3 | Minimum confidence score |
| max_labels | int | 20 | Max candidate labels per inference call |

L0 Processor (l0_aggregator)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| strict_matching | bool | false | Require exact span matching between L1 and L3 |
| min_confidence | float | 0.0 | Minimum confidence threshold for results |
| include_unlinked | bool | true | Include unlinked mentions in output |
| position_tolerance | int | 2 | Character tolerance for span matching |
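
To illustrate what position_tolerance means in practice, the snippet below shows tolerance-based span matching in isolation. It is a conceptual sketch of the matching rule, not the aggregator's actual code.

```python
def spans_match(l1_span: tuple[int, int], l3_span: tuple[int, int],
                strict: bool = False, tolerance: int = 2) -> bool:
    """Return True if two (start, end) character spans refer to the same mention."""
    if strict:
        return l1_span == l3_span
    return (abs(l1_span[0] - l3_span[0]) <= tolerance
            and abs(l1_span[1] - l3_span[1]) <= tolerance)

# "Aspirin" detected at (0, 7) by L1 and (0, 8) by L3 still counts as the
# same mention under the default tolerance of 2 characters.
print(spans_match((0, 7), (0, 8)))               # True
print(spans_match((0, 7), (0, 8), strict=True))  # False
```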

Schema Configuration

The `schema` block controls how entities are represented as labels across the pipeline:

| Parameter | Type | Description |
|-----------|------|-------------|
| template | string | Format string for entity labels (e.g. "{label}", "{label}: {description}") |

**Note:** The same template should be used consistently across L2, L3, and L4 nodes to ensure entity representations match throughout the pipeline.
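
The templates look like standard Python format strings filled in with fields from the entity database (an assumption based on the `{label}` syntax; verify the exact field names against your database schema). A minimal sketch of why a template mismatch between nodes matters:

```python
# Hypothetical candidate record; real records come from the L2 entity database.
candidate = {"label": "Aspirin", "description": "non-steroidal anti-inflammatory drug"}

# The same candidate rendered under the two templates used on this page.
# If L2 and L3 use different templates, the label strings they exchange no
# longer line up, and linking quality degrades.
print("{label}".format(**candidate))                 # Aspirin
print("{label}: {description}".format(**candidate))  # Aspirin: non-steroidal anti-inflammatory drug
```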