Prepared Datasets
This page provides a detailed overview of the official datasets for GLiClass models.
| Name | Total examples | Unique labels | Cache size (GB) |
|---|---|---|---|
| gliclass-v2.0 | 1 196 218 | 1 382 952 | 1.11 |
| gliclass-v2.0-RAC | 612 142 | 857 027 | 1.31 |
gliclass-v2.0-RAC

To further enhance classification performance, we generated a Retrieval-Augmented Classification (RAC) dataset. Each text example in the gliclass-v2.0 dataset was encoded with the paraphrase-MiniLM-L6-v2 sentence transformer and indexed in an HNSW (Hierarchical Navigable Small World) database. For 250k randomly selected samples, we retrieved up to three of the most similar examples (cosine similarity > 0.5) from the dataset.
During augmentation:
- The number of retrieved examples per sample was randomly chosen between 1 and 3.
- 30% of retrieved examples were replaced with random, unrelated examples to introduce controlled noise.
- If true labels were present in a retrieved example, false labels were removed with a 50% probability to balance information clarity.
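The sampling scheme above can be sketched as follows. This is a minimal illustration of the stated probabilities; the function and field names are hypothetical, not the authors' actual pipeline:

```python
import random

def augment(neighbors, pool):
    """Pick 1-3 retrieved neighbors, replace 30% of them with random
    unrelated examples, and, when a retrieved example carries true
    labels, drop its false labels with 50% probability."""
    k = random.randint(1, 3)                        # 1-3 retrieved examples per sample
    augmented = []
    for ex in neighbors[:k]:
        if random.random() < 0.30:                  # 30%: swap in an unrelated example
            ex = random.choice(pool)
        true_labels = ex["true_labels"]
        false_labels = ex["false_labels"]
        if true_labels and random.random() < 0.50:  # 50%: remove false labels
            false_labels = []
        augmented.append({"text": ex["text"],
                          "true_labels": true_labels,
                          "false_labels": false_labels})
    return augmented
```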
Each retrieved example was formatted using structured <<EXAMPLE>> ... <</EXAMPLE>> tags, where:
- True labels were explicitly marked as <<TRUE_LABEL>> {label}.
- False labels were marked as <<FALSE_LABEL>> {label}, unless removed.
For each of the 250k randomly selected examples, the "text" field was modified as:

{original_text} <<EXAMPLE>> {retrieved_text} {true_labels_str} {false_labels_str} <</EXAMPLE>>...

Where:
- {original_text} is the original example text.
- {retrieved_text} is a similar or randomly selected example.
- {true_labels_str} contains true labels formatted as <<TRUE_LABEL>> {label}.
- {false_labels_str} contains false labels formatted as <<FALSE_LABEL>> {label} (unless removed with 50% probability).
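Putting the template together, a minimal formatting helper might look like this (illustrative names, not the original code; each retrieved example gets its own tagged block appended to the original text):

```python
def format_rac_text(original_text, retrieved):
    """Append each retrieved example to the original text inside
    <<EXAMPLE>> ... <</EXAMPLE>> tags, marking its labels with
    <<TRUE_LABEL>> / <<FALSE_LABEL>> prefixes."""
    parts = [original_text]
    for ex in retrieved:
        true_str = " ".join(f"<<TRUE_LABEL>> {label}" for label in ex["true_labels"])
        false_str = " ".join(f"<<FALSE_LABEL>> {label}" for label in ex["false_labels"])
        parts.append(f"<<EXAMPLE>> {ex['text']} {true_str} {false_str} <</EXAMPLE>>")
    return " ".join(parts)
```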
Such a strategy allows the model to learn how to utilize the provided information without overfocusing on RAC examples. With both relevant and randomly retrieved examples, the dataset maintains a balance between useful contextual information and controlled noise. This ensures that the model does not become overly reliant on retrieval-augmented inputs while still benefiting from additional context when available.
GLiClass-V3 Logic Dataset
Rows: 7,776 | Split: train only | Format: Parquet | Language: EN | License: Apache-2.0
What it is
A length-balanced corpus of single-sentence prompts built purely for inducing reasoning in language models.
Why it helps
- Teaches symbolic-logic patterns and multi-label behaviour: Models learn to handle complex logical reasoning tasks.
- Length-balanced training: Buckets cover 15 word-length ranges (4 → 1,024) in equal proportions, exposing models to both tiny and very long inputs.
- Variable answer sets: Each example has 1-50 true and 1-50 false labels, forcing the model to cope with large, variable answer sets.
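The length balancing can be sketched as a bucket assignment over word counts. The exact boundaries are not published, so the snippet below assumes 15 logarithmically spaced buckets between 4 and 1,024 words purely for illustration:

```python
import math

# Hypothetical bucketing: 15 log-spaced word-count ranges from 4 to
# 1,024 words. The dataset's actual boundaries are not published.
N_BUCKETS = 15
LO, HI = 4, 1024

def length_bucket(text):
    """Map a text to one of 15 word-length buckets (0..14)."""
    words = max(LO, min(HI, len(text.split())))
    frac = (math.log(words) - math.log(LO)) / (math.log(HI) - math.log(LO))
    return min(N_BUCKETS - 1, int(frac * N_BUCKETS))
```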
Where the prompts come from
Re-annotated snippets drawn from four public resources:
| Source dataset | Notes |
|---|---|
| FineWeb (clean web crawl) | Plain sentences automatically filtered for quality, then labelled with an LLM. |
| tau/CommonsenseQA | Question stems only; each converted to a declarative premise and re-labelled multi-label style. |
| GLiClass-2k prototype (BioMike/formal-logic-reasoning-gliclass-2k) | Earlier formal-logic items. |
| nyu-mll/MultiNLI | Premise/hypothesis pairs. |
Data schema
| Column | Type | Notes |
|---|---|---|
| text | string | Sentence or short passage. |
| true_labels | list<string> | All correct answers. |
| all_labels | list<string> | true_labels + distractors (shuffled). |
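Given this schema, the distractor labels for an example can be recovered by set difference, as in this small illustrative helper:

```python
def false_labels(example):
    """Return the distractors: labels in all_labels that are not true,
    preserving the (shuffled) order of all_labels."""
    true = set(example["true_labels"])
    return [label for label in example["all_labels"] if label not in true]
```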
Quick load
```python
from datasets import load_dataset

ds = load_dataset("knowledgator/gliclass-v3-logic-dataset")["train"]
```
Citation
```bibtex
@misc{stepanov2025gliclassgeneralistlightweightmodel,
  title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks},
  author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko},
  year={2025},
  eprint={2508.07662},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.07662},
}
```