Training
GLiNER is straightforward to fine-tune thanks to its architecture and the pre-trained checkpoints available on Hugging Face.
Quickstart
Installation
pip install gliner[training]
Simple Training Example
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
train_data = [
{
"tokenized_text": ["Apple", "Inc.", "is", "headquartered", "in", "Cupertino"],
"ner": [[0, 1, "organization"], [5, 5, "location"]]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./my_model",
max_steps=1000,
learning_rate=5e-5,
per_device_train_batch_size=8,
)
trainer.save_model()
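After training, the saved checkpoint can be loaded back for inference. A minimal sketch (the text and label set below are illustrative):
from gliner import GLiNER

# Load the fine-tuned checkpoint saved by trainer.save_model()
model = GLiNER.from_pretrained("./my_model")

text = "Apple Inc. is headquartered in Cupertino"
labels = ["organization", "location"]  # illustrative label set

# predict_entities returns dicts with "text", "label", "score" and character offsets
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])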
Dataset Format
Basic Structure
All GLiNER training data follows this structure:
{
"tokenized_text": List[str], # Pre-tokenized text as list of tokens
"ner": List[List[Union[int, str]]] # [[start_idx, end_idx, label], ...]
}
Key Points:
- Indices are token-level (not character-level) and inclusive
- start_idx and end_idx both point to tokens within the entity span
- Example: ["Barack", "Obama", "was", "born"] → person at [0, 1] covers tokens 0 and 1
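Before training, it helps to sanity-check every example. A minimal sketch (plain Python, no GLiNER APIs) that verifies spans are in bounds and prints the tokens they cover:
def validate_example(example):
    tokens = example["tokenized_text"]
    for start, end, label in example["ner"]:
        # Indices are token-level and inclusive
        assert 0 <= start <= end < len(tokens), f"span [{start}, {end}] out of bounds"
        print(label, "->", " ".join(tokens[start:end + 1]))

validate_example({
    "tokenized_text": ["Barack", "Obama", "was", "born"],
    "ner": [[0, 1, "person"]],
})  # person -> Barack Obama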
Multi-Task Format
For multi-task models, explicitly control label sampling with:
{
"tokenized_text": ["tokens", "..."],
"ner": [[0, 1, "person"], [4, 4, "organization"]],
# Optional: explicit positive labels for this example
"ner_labels": ["person", "organization", "location"],
# Optional: hard negative labels to improve discrimination
"ner_negatives": ["product", "event", "date"]
}
Why use explicit labels?
- Better control over training label distribution
- Include hard negatives (similar types) for better discrimination
- Mix datasets with different annotation schemes
- Domain adaptation and curriculum learning
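One possible way to populate these optional fields is to sample negatives from a label pool that excludes each example's positive labels. The pool and sample size below are assumptions for illustration, not GLiNER requirements:
import random

LABEL_POOL = ["person", "organization", "location", "product", "event", "date"]

def add_label_fields(example, num_negatives=3):
    # Positive labels are whatever appears in the example's annotations
    positives = sorted({span[2] for span in example["ner"]})
    candidates = [label for label in LABEL_POOL if label not in positives]
    example["ner_labels"] = positives
    example["ner_negatives"] = random.sample(candidates, min(num_negatives, len(candidates)))
    return example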
Task-Specific Formats
GLiNER multi-task models support various information extraction tasks using the same core format but with different label types and optional prompt-based data preparation.
Named Entity Recognition (NER)
Standard entity extraction:
{
"tokenized_text": ["Microsoft", "was", "founded", "by", "Bill", "Gates", "in", "1975"],
"ner": [[0, 0, "organization"], [4, 5, "person"], [7, 7, "date"]]
}
Relation Extraction (GLiNER-multitask)
For multi-task models, relations are extracted by labeling spans with entity-relation pairs:
{
"tokenized_text": ["Microsoft", "was", "founded", "by", "Bill", "Gates"],
"ner": [
[4, 5, "Microsoft <> founder"], # "Bill Gates" is Microsoft's founder
[0, 0, "Bill Gates <> founded"] # "Microsoft" is what Bill Gates founded
],
"ner_labels": ["Microsoft <> founder", "Bill Gates <> founded", "Microsoft <> inception date"]
}
Format:
- Use "Entity <> relation" syntax in labels
- Mark spans that answer the relation query
- At inference, use the same syntax: labels = ["Microsoft <> founder"]
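A short inference sketch under this convention, assuming a multitask checkpoint like the one above is loaded as model:
text = "Microsoft was founded by Bill Gates"
labels = ["Microsoft <> founder"]  # same "Entity <> relation" syntax as in training

# Each returned span answers the relation query encoded in its label
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])  # e.g. Microsoft <> founder -> Bill Gates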
Relation Extraction (GLiNER-relex)
For dedicated relation extraction models (e.g., knowledgator/gliner-relex-large-v0.5), use explicit relation annotations:
{
"tokenized_text": ["Bill", "Gates", "founded", "Microsoft"],
"ner": [[0, 1, "person"], [3, 3, "organization"]],
# Relations reference entity positions in the "ner" list (not token positions)
"relations": [
[1, 0, "founded_by"] # organization at index 1 founded_by person at index 0
],
# Optional: explicit relation types
"rel_labels": ["founded_by", "works_at", "located_in"],
"rel_negatives": ["competitor_of", "acquired_by"]
}
Format:
- relations: List of [head_entity_idx, tail_entity_idx, relation_type]
- Indices refer to positions in the ner list (not token positions)
- Requires entity annotations first, then relation annotations between them
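To make the indexing concrete, here is a small helper (hypothetical, not part of GLiNER) that resolves relation triples back to the entity texts they reference:
def describe_relations(example):
    tokens = example["tokenized_text"]
    # Entity surface forms, in the same order as the "ner" list
    spans = [" ".join(tokens[start:end + 1]) for start, end, _ in example["ner"]]
    for head_idx, tail_idx, relation in example["relations"]:
        print(f"{spans[head_idx]} --{relation}--> {spans[tail_idx]}")

describe_relations({
    "tokenized_text": ["Bill", "Gates", "founded", "Microsoft"],
    "ner": [[0, 1, "person"], [3, 3, "organization"]],
    "relations": [[1, 0, "founded_by"]],
})  # Microsoft --founded_by--> Bill Gates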
Summarization
Extract important sentences or phrases:
{
"tokenized_text": ["The", "study", "shows", "...", "Additionally", "it", "found", "..."],
"ner": [
[0, 3, "summary"], # First important sentence
[10, 15, "summary"] # Second important sentence
],
"ner_labels": ["summary"]
}
At inference, prepend prompt: "Summarize the given text:\n" + text with labels ["summary"]
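A sketch of that inference call, assuming a multitask checkpoint loaded as model:
prompt = "Summarize the given text:\n"
text = "The study shows strong results. Additionally it found no side effects."
summary_spans = model.predict_entities(prompt + text, ["summary"], threshold=0.5)
print([span["text"] for span in summary_spans])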
Question Answering
Mark answer spans in the text:
{
"tokenized_text": ["Bill", "Gates", "was", "CEO", "of", "Microsoft"],
"ner": [[0, 1, "answer"]], # Answer to "Who was CEO of Microsoft?"
"ner_labels": ["answer"]
}
At inference, prepend question: question + text with labels ["answer"]
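And the corresponding sketch for question answering, with the question acting as the prompt:
question = "Who was CEO of Microsoft? "
text = "Bill Gates was CEO of Microsoft"
answers = model.predict_entities(question + text, ["answer"], threshold=0.5)
print([answer["text"] for answer in answers])  # e.g. ["Bill Gates"]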
Open Information Extraction
Extract information based on custom prompts:
{
"tokenized_text": ["The", "battery", "life", "is", "excellent", "and", "sound", "quality", "is", "great"],
"ner": [
[1, 4, "match"], # "battery life is excellent"
[6, 9, "match"] # "sound quality is great"
],
"ner_labels": ["match"]
}
At inference, use custom prompt: "Find all positive aspects:\n" + text with labels ["match"]
Text Classification
Classify by matching the text to labels:
{
"tokenized_text": ["This", "product", "is", "amazing", "!"],
"ner": [[0, 4, "match"]], # Entire text matches "positive review"
"ner_labels": ["match"]
}
At inference, specify classes in prompt: "Classify: positive, negative, neutral\n" + text with labels ["match"]
Key-Phrase Extraction
Similar to NER but for important phrases:
{
"tokenized_text": ["Deep", "learning", "and", "neural", "networks", "are", "popular"],
"ner": [[0, 1, "key_phrase"], [3, 4, "key_phrase"]]
}
Training Examples
Multi-Task Model Training
Train on multiple tasks simultaneously:
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
train_data = [
# NER task
{
"tokenized_text": ["Apple", "released", "iPhone", "15"],
"ner": [[0, 0, "organization"], [2, 3, "product"]],
"ner_labels": ["organization", "product", "person"],
"ner_negatives": ["summary", "answer"]
},
# Relation extraction task (multitask format)
{
"tokenized_text": ["Steve", "Jobs", "founded", "Apple"],
"ner": [
[3, 3, "Steve Jobs <> founded"], # What did Steve Jobs found?
[0, 1, "Apple <> founder"] # Who founded Apple?
],
"ner_labels": ["Apple <> founder", "Steve Jobs <> founded", "Apple <> CEO"]
},
# Summarization task
{
"tokenized_text": ["The", "study", "found", "significant", "results", "."],
"ner": [[0, 5, "summary"]],
"ner_labels": ["summary"],
"ner_negatives": ["organization", "person"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_multitask",
max_steps=10000,
per_device_train_batch_size=8,
learning_rate=1e-5,
others_lr=5e-5,
negatives=1.0,
save_steps=1000,
)
trainer.save_model()
Bi-Encoder Model Training
Bi-encoders use separate encoders for text and labels, enabling better zero-shot performance:
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")
train_data = [
{
"tokenized_text": ["Dr.", "Smith", "works", "at", "MIT"],
"ner": [[0, 1, "person"], [4, 4, "organization"]],
# Bi-encoders benefit from diverse, descriptive labels
"ner_labels": ["person", "organization", "institution", "location", "title"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_bi",
max_steps=5000,
learning_rate=1e-5,
negatives=1.5, # Higher negative sampling for bi-encoders
)
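Because the label encoder is separate, a fine-tuned bi-encoder can be probed with label names it never saw during training. A quick zero-shot check (the labels below are illustrative):
text = "Dr. Smith works at MIT in Cambridge"
unseen_labels = ["research institute", "city", "academic title"]  # not in the training labels above
for entity in model.predict_entities(text, unseen_labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])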
GLiNER-relex Model Training
GLiNER-relex models are specialized for relation extraction with explicit entity-relation annotations:
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")
train_data = [
{
"tokenized_text": ["John", "Smith", "works", "at", "Microsoft", "in", "Seattle"],
"ner": [
[0, 1, "person"],
[4, 4, "organization"],
[6, 6, "location"]
],
"relations": [
[0, 1, "works_at"], # person works_at organization
[1, 2, "located_in"] # organization located_in location
],
"ner_labels": ["person", "organization", "location"],
"rel_labels": ["works_at", "located_in", "founded_by"],
"rel_negatives": ["competitor_of", "subsidiary_of"]
},
{
"tokenized_text": ["Sarah", "founded", "TechCorp", "in", "2020"],
"ner": [
[0, 0, "person"],
[2, 2, "organization"],
[4, 4, "date"]
],
"relations": [
[1, 0, "founded_by"] # organization at index 1 founded_by person at index 0
],
"ner_labels": ["person", "organization", "date"],
"rel_labels": ["founded_by", "works_at", "located_in"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_relex",
max_steps=10000,
per_device_train_batch_size=4, # Relations require more memory
learning_rate=1e-5,
others_lr=5e-5,
)
trainer.save_model()
Training with Configuration Files
For reproducible training, use YAML configuration:
# config.yaml
training:
  prev_path: "knowledgator/gliner-multitask-large-v0.5"
  num_steps: 10000
  train_batch_size: 8
  lr_encoder: 1e-5
  lr_others: 5e-5
  negatives: 1.0
  save_total_limit: 3

data:
  train_data: "data/train.json"
  val_data_dir: "data/val.json"
  root_dir: "models"
Training script:
import json
from gliner import GLiNER
from gliner.utils import load_config_as_namespace
cfg = load_config_as_namespace("config.yaml")
with open(cfg.data.train_data) as f:
    train_data = json.load(f)
model = GLiNER.from_pretrained(cfg.training.prev_path)
trainer = model.train_model(
train_dataset=train_data,
output_dir=cfg.data.root_dir,
max_steps=cfg.training.num_steps,
per_device_train_batch_size=cfg.training.train_batch_size,
learning_rate=float(cfg.training.lr_encoder),
others_lr=float(cfg.training.lr_others),
negatives=float(cfg.training.negatives),
save_total_limit=cfg.training.save_total_limit,
)
trainer.save_model()
Best Practices
Data Preparation
- Start with pretrained models - Fine-tuning is more effective than training from scratch
- Validate data format - Ensure token indices are correct and within bounds
- Use explicit labels - Specify ner_labels and ner_negatives for multi-task models
- Include hard negatives - Add similar entity types as negatives to improve discrimination
Multi-Task Training
- Balance tasks - Ensure good coverage across different task types
- Task separation - Use negatives from other tasks to help the model distinguish task types
- Domain consistency - Use consistent label naming when mixing datasets
- Prompt preparation - For tasks like summarization and QA, consider including prompts in training data
Model-Specific
- Bi-encoders - Train with diverse entity types for better zero-shot generalization
- GLiNER-multitask relations - Use
"Entity <> relation"syntax in NER labels; spans answer relation queries - GLiNER-relex - Use explicit
"relations"field; ensure entity annotations are accurate first - Negative sampling - Experiment with
negativesparameter (typically 1.0-1.5) - Component freezing - Freeze encoder with
freeze_components=["text_encoder"]for quick adaptation
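A minimal sketch of quick adaptation with the encoder frozen, reusing a train_data list like the ones above; parameter names follow the reference table at the end of this page:
trainer = model.train_model(
    train_dataset=train_data,
    output_dir="./gliner_frozen_encoder",
    max_steps=500,
    learning_rate=5e-5,
    negatives=1.5,                        # negative sampling ratio
    freeze_components=["text_encoder"],   # keep the pretrained encoder fixed
)
trainer.save_model()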
Training Configuration
- Learning rates - Use lower LR for encoder (1e-5) and higher for other components (5e-5)
- Batch size - Larger batches (8-16) generally work better for span-based models
- Checkpointing - Save multiple checkpoints with save_total_limit > 1
- Monitoring - Track loss and metrics to detect overfitting
Training Parameters Reference
Common parameters for model.train_model():
| Parameter | Default | Description |
|---|---|---|
| max_steps | - | Total training steps |
| per_device_train_batch_size | 8 | Batch size per GPU |
| learning_rate | 5e-5 | Learning rate for the encoder |
| others_lr | 5e-5 | Learning rate for other components |
| weight_decay | 0.01 | Weight decay for the encoder |
| others_weight_decay | 0.01 | Weight decay for other components |
| negatives | 1.0 | Negative sampling ratio |
| masking | "none" | Masking strategy: "none", "global", "label", "span" |
| focal_loss_alpha | -1 | Focal loss alpha (-1 disables) |
| focal_loss_gamma | 0 | Focal loss gamma (0 disables) |
| warmup_ratio | 0.1 | Warmup ratio for the learning rate schedule |
| lr_scheduler_type | "linear" | Scheduler type: "linear", "cosine", etc. |
| save_steps | 500 | Save a checkpoint every N steps |
| save_total_limit | 3 | Maximum number of checkpoints to keep |
| freeze_components | None | Components to freeze, e.g. ["text_encoder"] |