Training
GLiNER is straightforward to fine-tune thanks to its architecture and the pre-trained checkpoints available on Hugging Face.
Quickstart
Installation
pip install gliner[training]
Simple Training Example
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
train_data = [
{
"tokenized_text": ["Apple", "Inc.", "is", "headquartered", "in", "Cupertino"],
"ner": [[0, 1, "organization"], [5, 5, "location"]]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./my_model",
max_steps=1000,
learning_rate=5e-5,
per_device_train_batch_size=8,
)
trainer.save_model()
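After training, the saved checkpoint can be loaded back for inference. A minimal sketch (the text and label set below are illustrative):
from gliner import GLiNER

# Load the fine-tuned checkpoint saved by trainer.save_model()
model = GLiNER.from_pretrained("./my_model")

text = "Apple Inc. is headquartered in Cupertino"
labels = ["organization", "location"]  # illustrative label set

# predict_entities returns dicts with "text", "label", "score" and character offsets
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])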
Dataset Format
Basic Structure
All GLiNER training data follows this structure:
{
"tokenized_text": List[str], # Pre-tokenized text as list of tokens
"ner": List[List[Union[int, str]]] # [[start_idx, end_idx, label], ...]
}
Key Points:
- Indices are token-level (not character-level) and inclusive
- start_idx and end_idx both point to tokens within the entity span
- Example: ["Barack", "Obama", "was", "born"] → person at [0, 1] covers tokens 0 and 1
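Before training, it helps to sanity-check every example. A minimal sketch (plain Python, no GLiNER APIs) that verifies spans are in bounds and prints the tokens they cover:
def validate_example(example):
    tokens = example["tokenized_text"]
    for start, end, label in example["ner"]:
        # Indices are token-level and inclusive
        assert 0 <= start <= end < len(tokens), f"span [{start}, {end}] out of bounds"
        print(label, "->", " ".join(tokens[start:end + 1]))

validate_example({
    "tokenized_text": ["Barack", "Obama", "was", "born"],
    "ner": [[0, 1, "person"]],
})  # person -> Barack Obama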
Multi-Task Format
For multi-task models, explicitly control label sampling with:
{
"tokenized_text": ["tokens", "..."],
"ner": [[0, 1, "person"], [4, 4, "organization"]],
# Optional: explicit positive labels for this example
"ner_labels": ["person", "organization", "location"],
# Optional: hard negative labels to improve discrimination
"ner_negatives": ["product", "event", "date"]
}
Why use explicit labels?
- Better control over training label distribution
- Include hard negatives (similar types) for better discrimination
- Mix datasets with different annotation schemes
- Domain adaptation and curriculum learning
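One possible way to populate these optional fields is to sample negatives from a label pool that excludes each example's positive labels. The pool and sample size below are assumptions for illustration, not GLiNER requirements:
import random

LABEL_POOL = ["person", "organization", "location", "product", "event", "date"]

def add_label_fields(example, num_negatives=3):
    # Positive labels are whatever appears in the example's annotations
    positives = sorted({span[2] for span in example["ner"]})
    candidates = [label for label in LABEL_POOL if label not in positives]
    example["ner_labels"] = positives
    example["ner_negatives"] = random.sample(candidates, min(num_negatives, len(candidates)))
    return example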
Task-Specific Formats
GLiNER multi-task models support various information extraction tasks using the same core format but with different label types and optional prompt-based data preparation.
Named Entity Recognition (NER)
Standard entity extraction:
{
"tokenized_text": ["Microsoft", "was", "founded", "by", "Bill", "Gates", "in", "1975"],
"ner": [[0, 0, "organization"], [4, 5, "person"], [7, 7, "date"]]
}
Relation Extraction (GLiNER-multitask)
For multi-task models, relations are extracted by labeling spans with entity-relation pairs:
{
"tokenized_text": ["Microsoft", "was", "founded", "by", "Bill", "Gates"],
"ner": [
[4, 5, "Microsoft <> founder"], # "Bill Gates" is Microsoft's founder
[0, 0, "Bill Gates <> founded"] # "Microsoft" is what Bill Gates founded
],
"ner_labels": ["Microsoft <> founder", "Bill Gates <> founded", "Microsoft <> inception date"]
}
Format:
- Use "Entity <> relation" syntax in labels
- Mark spans that answer the relation query
- At inference, use the same syntax: labels = ["Microsoft <> founder"]
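A short inference sketch under this convention, assuming a multitask checkpoint like the one above is loaded as model:
text = "Microsoft was founded by Bill Gates"
labels = ["Microsoft <> founder"]  # same "Entity <> relation" syntax as in training

# Each returned span answers the relation query encoded in its label
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])  # e.g. Microsoft <> founder -> Bill Gates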
Relation Extraction (GLiNER-relex)
For dedicated relation extraction models (e.g., knowledgator/gliner-relex-large-v0.5), use explicit relation annotations:
{
"tokenized_text": ["Bill", "Gates", "founded", "Microsoft"],
"ner": [[0, 1, "person"], [3, 3, "organization"]],
# Relations reference entity positions in the "ner" list (not token positions)
"relations": [
[1, 0, "founded_by"] # organization at index 1 founded_by person at index 0
],
# Optional: explicit relation types
"rel_labels": ["founded_by", "works_at", "located_in"],
"rel_negatives": ["competitor_of", "acquired_by"]
}
Format:
- relations: List of [head_entity_idx, tail_entity_idx, relation_type]
- Indices refer to positions in the ner list (not token positions)
- Requires entity annotations first, then relation annotations between them
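To make the indexing concrete, here is a small helper (hypothetical, not part of GLiNER) that resolves relation triples back to the entity texts they reference:
def describe_relations(example):
    tokens = example["tokenized_text"]
    # Entity surface forms, in the same order as the "ner" list
    spans = [" ".join(tokens[start:end + 1]) for start, end, _ in example["ner"]]
    for head_idx, tail_idx, relation in example["relations"]:
        print(f"{spans[head_idx]} --{relation}--> {spans[tail_idx]}")

describe_relations({
    "tokenized_text": ["Bill", "Gates", "founded", "Microsoft"],
    "ner": [[0, 1, "person"], [3, 3, "organization"]],
    "relations": [[1, 0, "founded_by"]],
})  # Microsoft --founded_by--> Bill Gates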
Summarization
Extract important sentences or phrases:
{
"tokenized_text": ["The", "study", "shows", "...", "Additionally", "it", "found", "..."],
"ner": [
[0, 3, "summary"], # First important sentence
[10, 15, "summary"] # Second important sentence
],
"ner_labels": ["summary"]
}
At inference, prepend prompt: "Summarize the given text:\n" + text with labels ["summary"]
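A sketch of that inference call, assuming a multitask checkpoint loaded as model:
prompt = "Summarize the given text:\n"
text = "The study shows strong results. Additionally it found no side effects."
summary_spans = model.predict_entities(prompt + text, ["summary"], threshold=0.5)
print([span["text"] for span in summary_spans])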
Question Answering
Mark answer spans in the text:
{
"tokenized_text": ["Bill", "Gates", "was", "CEO", "of", "Microsoft"],
"ner": [[0, 1, "answer"]], # Answer to "Who was CEO of Microsoft?"
"ner_labels": ["answer"]
}
At inference, prepend question: question + text with labels ["answer"]
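And the corresponding sketch for question answering, with the question acting as the prompt:
question = "Who was CEO of Microsoft? "
text = "Bill Gates was CEO of Microsoft"
answers = model.predict_entities(question + text, ["answer"], threshold=0.5)
print([answer["text"] for answer in answers])  # e.g. ["Bill Gates"]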
Open Information Extraction
Extract information based on custom prompts:
{
"tokenized_text": ["The", "battery", "life", "is", "excellent", "and", "sound", "quality", "is", "great"],
"ner": [
[1, 4, "match"], # "battery life is excellent"
[6, 9, "match"] # "sound quality is great"
],
"ner_labels": ["match"]
}
At inference, use custom prompt: "Find all positive aspects:\n" + text with labels ["match"]
Text Classification
Classify by matching the text to labels:
{
"tokenized_text": ["This", "product", "is", "amazing", "!"],
"ner": [[0, 4, "match"]], # Entire text matches "positive review"
"ner_labels": ["match"]
}
At inference, specify classes in prompt: "Classify: positive, negative, neutral\n" + text with labels ["match"]
Key-Phrase Extraction
Similar to NER but for important phrases:
{
"tokenized_text": ["Deep", "learning", "and", "neural", "networks", "are", "popular"],
"ner": [[0, 1, "key_phrase"], [3, 4, "key_phrase"]]
}
Training Examples
Multi-Task Model Training
Train on multiple tasks simultaneously:
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
train_data = [
# NER task
{
"tokenized_text": ["Apple", "released", "iPhone", "15"],
"ner": [[0, 0, "organization"], [2, 3, "product"]],
"ner_labels": ["organization", "product", "person"],
"ner_negatives": ["summary", "answer"]
},
# Relation extraction task (multitask format)
{
"tokenized_text": ["Steve", "Jobs", "founded", "Apple"],
"ner": [
[3, 3, "Steve Jobs <> founded"], # What did Steve Jobs found?
[0, 1, "Apple <> founder"] # Who founded Apple?
],
"ner_labels": ["Apple <> founder", "Steve Jobs <> founded", "Apple <> CEO"]
},
# Summarization task
{
"tokenized_text": ["The", "study", "found", "significant", "results", "."],
"ner": [[0, 5, "summary"]],
"ner_labels": ["summary"],
"ner_negatives": ["organization", "person"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_multitask",
max_steps=10000,
per_device_train_batch_size=8,
learning_rate=1e-5,
others_lr=5e-5,
negatives=1.0,
save_steps=1000,
)
trainer.save_model()
Bi-Encoder Model Training
Bi-encoders use separate encoders for text and labels, enabling better zero-shot performance:
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")
train_data = [
{
"tokenized_text": ["Dr.", "Smith", "works", "at", "MIT"],
"ner": [[0, 1, "person"], [4, 4, "organization"]],
# Bi-encoders benefit from diverse, descriptive labels
"ner_labels": ["person", "organization", "institution", "location", "title"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_bi",
max_steps=5000,
learning_rate=1e-5,
negatives=1.5, # Higher negative sampling for bi-encoders
)
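Because the label encoder is separate, a fine-tuned bi-encoder can be probed with label names it never saw during training. A quick zero-shot check (the labels below are illustrative):
text = "Dr. Smith works at MIT in Cambridge"
unseen_labels = ["research institute", "city", "academic title"]  # not in the training labels above
for entity in model.predict_entities(text, unseen_labels, threshold=0.5):
    print(entity["label"], "->", entity["text"])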
GLiNER-relex Model Training
GLiNER-relex models are specialized for relation extraction with explicit entity-relation annotations:
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")
train_data = [
{
"tokenized_text": ["John", "Smith", "works", "at", "Microsoft", "in", "Seattle"],
"ner": [
[0, 1, "person"],
[4, 4, "organization"],
[6, 6, "location"]
],
"relations": [
[0, 1, "works_at"], # person works_at organization
[1, 2, "located_in"] # organization located_in location
],
"ner_labels": ["person", "organization", "location"],
"rel_labels": ["works_at", "located_in", "founded_by"],
"rel_negatives": ["competitor_of", "subsidiary_of"]
},
{
"tokenized_text": ["Sarah", "founded", "TechCorp", "in", "2020"],
"ner": [
[0, 0, "person"],
[2, 2, "organization"],
[4, 4, "date"]
],
"relations": [
[1, 0, "founded_by"] # organization at index 1 founded_by person at index 0
],
"ner_labels": ["person", "organization", "date"],
"rel_labels": ["founded_by", "works_at", "located_in"]
}
]
trainer = model.train_model(
train_dataset=train_data,
output_dir="./gliner_relex",
max_steps=10000,
per_device_train_batch_size=4, # Relations require more memory
learning_rate=1e-5,
others_lr=5e-5,
)
trainer.save_model()
Training with Configuration Files
For reproducible training, use YAML configuration:
# config.yaml
training:
  prev_path: "knowledgator/gliner-multitask-large-v0.5"
  num_steps: 10000
  train_batch_size: 8
  lr_encoder: 1e-5
  lr_others: 5e-5
  negatives: 1.0
  save_total_limit: 3

data:
  train_data: "data/train.json"
  val_data_dir: "data/val.json"
  root_dir: "models"
Training script:
import json
from gliner import GLiNER
from gliner.utils import load_config_as_namespace
cfg = load_config_as_namespace("config.yaml")
with open(cfg.data.train_data) as f:
    train_data = json.load(f)
model = GLiNER.from_pretrained(cfg.training.prev_path)
trainer = model.train_model(
train_dataset=train_data,
output_dir=cfg.data.root_dir,
max_steps=cfg.training.num_steps,
per_device_train_batch_size=cfg.training.train_batch_size,
learning_rate=float(cfg.training.lr_encoder),
others_lr=float(cfg.training.lr_others),
negatives=float(cfg.training.negatives),
save_total_limit=cfg.training.save_total_limit,
)
trainer.save_model()
Best Practices
Data Preparation
- Start with pretrained models - Fine-tuning is more effective than training from scratch
- Validate data format - Ensure token indices are correct and within bounds
- Use explicit labels - Specify ner_labels and ner_negatives for multi-task models
- Include hard negatives - Add similar entity types as negatives to improve discrimination
Multi-Task Training
- Balance tasks - Ensure good coverage across different task types
- Task separation - Use negatives from other tasks to help the model distinguish task types
- Domain consistency - Use consistent label naming when mixing datasets
- Prompt preparation - For tasks like summarization and QA, consider including prompts in training data
Model-Specific
- Bi-encoders - Train with diverse entity types for better zero-shot generalization
- GLiNER-multitask relations - Use
"Entity <> relation"syntax in NER labels; spans answer relation queries - GLiNER-relex - Use explicit
"relations"field; ensure entity annotations are accurate first - Negative sampling - Experiment with
negativesparameter (typically 1.0-1.5) - Component freezing - Freeze encoder with
freeze_components=["text_encoder"]for quick adaptation
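A minimal sketch of quick adaptation with the encoder frozen, reusing a train_data list like the ones above; parameter names follow the reference table at the end of this page:
trainer = model.train_model(
    train_dataset=train_data,
    output_dir="./gliner_frozen_encoder",
    max_steps=500,
    learning_rate=5e-5,
    negatives=1.5,                        # negative sampling ratio
    freeze_components=["text_encoder"],   # keep the pretrained encoder fixed
)
trainer.save_model()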
Training Configuration
- Learning rates - Use lower LR for encoder (1e-5) and higher for other components (5e-5)
- Batch size - Larger batches (8-16) generally work better for span-based models
- Checkpointing - Save multiple checkpoints with save_total_limit > 1
- Monitoring - Track loss and metrics to detect overfitting
Training Parameters Reference
Common parameters for model.train_model():
| Parameter | Default | Description |
|---|---|---|
| max_steps | - | Total training steps |
| per_device_train_batch_size | 8 | Batch size per GPU |
| learning_rate | 5e-5 | Learning rate for the encoder |
| others_lr | 5e-5 | Learning rate for other components |
| weight_decay | 0.01 | Weight decay for the encoder |
| others_weight_decay | 0.01 | Weight decay for other components |
| negatives | 1.0 | Negative sampling ratio |
| masking | "none" | Masking strategy: "none", "global", "label", "span" |
| focal_loss_alpha | -1 | Focal loss alpha (-1 disables) |
| focal_loss_gamma | 0 | Focal loss gamma (0 disables) |
| warmup_ratio | 0.1 | Warmup ratio for the learning rate schedule |
| lr_scheduler_type | "linear" | Scheduler type: "linear", "cosine", etc. |
| save_steps | 500 | Save a checkpoint every N steps |
| save_total_limit | 3 | Maximum number of checkpoints to keep |
| freeze_components | None | Components to freeze, e.g. ["text_encoder"] |