
Prepared Datasets

This page provides a detailed overview of the official datasets for GLiClass models.

| Name | Total examples | Unique labels | Cache size (GB) |
| --- | --- | --- | --- |
| gliclass-v2.0 | 1 196 218 | 1 382 952 | 1.11 |
| gliclass-v2.0-RAC | 612 142 | 857 027 | 1.31 |

gliclass-v2.0-RAC


To further enhance classification performance, we generated a Retrieval-Augmented Classification (RAC) dataset. Each text example in the gliclass-v2.0 dataset was encoded with the paraphrase-MiniLM-L6-v2 sentence transformer and indexed in an HNSW (Hierarchical Navigable Small World) database. For 250k randomly selected samples, we retrieved up to three of the most similar examples (cosine similarity > 0.5) from the dataset.
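The retrieval step can be sketched as follows. This is a minimal illustration, assuming the hnswlib library and that the source texts come from the knowledgator/gliclass-v2.0 dataset on the Hugging Face Hub; the index parameters (ef_construction, M) used by the authors are not documented, so the values below are placeholders.

```python
import hnswlib
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Assumption: the source texts are the "text" column of knowledgator/gliclass-v2.0.
texts = load_dataset("knowledgator/gliclass-v2.0")["train"]["text"]

encoder = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
embeddings = encoder.encode(texts, normalize_embeddings=True)

# paraphrase-MiniLM-L6-v2 produces 384-dimensional embeddings;
# ef_construction and M here are illustrative, not the authors' values.
index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(embeddings, ids=list(range(len(texts))))

def retrieve(i, k=3, min_sim=0.5):
    """Return up to k neighbours of text i with cosine similarity > min_sim."""
    labels, distances = index.knn_query(embeddings[i], k=k + 1)  # +1 to skip the self-match
    neighbours = []
    for j, dist in zip(labels[0], distances[0]):
        sim = 1.0 - dist  # hnswlib's "cosine" space returns 1 - cosine similarity
        if j != i and sim > min_sim:
            neighbours.append(int(j))
    return neighbours[:k]
```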

During augmentation (see the code sketch after this list):

  • The number of retrieved examples per sample was randomly chosen between 1 and 3.
  • 30% of retrieved examples were replaced with random, unrelated examples to introduce controlled noise.
  • If true labels were present in a retrieved example, false labels were removed with a 50% probability to balance information clarity.
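A minimal sketch of these sampling rules, reusing the hypothetical retrieve helper from above and assuming dataset is a list of example dicts; the helper names are ours, not the authors':

```python
import random

def sample_retrieved(i, dataset):
    """Apply the augmentation sampling rules to example i (hypothetical helper)."""
    k = random.randint(1, 3)                      # 1-3 retrieved examples per sample
    chosen = [dataset[j] for j in retrieve(i, k=k)]
    for pos in range(len(chosen)):
        if random.random() < 0.3:                 # 30%: swap in a random, unrelated example
            chosen[pos] = dataset[random.randrange(len(dataset))]
    return chosen
```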

Each retrieved example was formatted using structured <<EXAMPLE>> ... <</EXAMPLE>> tags, where:

  • True labels were explicitly marked as <<TRUE_LABEL>> {label}.
  • False labels were marked as <<FALSE_LABEL>> {label}, unless removed.

For each of the 250k randomly selected examples, the "text" field was modified as {original_text} <<EXAMPLE>> {retrieved_text} {true_labels_str} {false_labels_str} <</EXAMPLE>>... (see the code sketch after this list), where:

  • {original_text} is the original example text.
  • {retrieved_text} is a similar or randomly selected example.
  • {true_labels_str} contains true labels formatted as <<TRUE_LABEL>> {label}.
  • {false_labels_str} contains false labels formatted as <<FALSE_LABEL>> {label} (unless removed with 50% probability).
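Putting the template together, the final string construction might look like the sketch below; the text, true_labels, and false_labels field names are assumptions about the underlying records, not a documented schema:

```python
import random

def format_rac_text(original, retrieved):
    """Build {original_text} <<EXAMPLE>> ... <</EXAMPLE>> strings (field names assumed)."""
    parts = [original["text"]]
    for ex in retrieved:
        true_str = " ".join(f"<<TRUE_LABEL>> {lbl}" for lbl in ex["true_labels"])
        false_str = " ".join(f"<<FALSE_LABEL>> {lbl}" for lbl in ex["false_labels"])
        if ex["true_labels"] and random.random() < 0.5:
            false_str = ""                        # 50%: drop false labels for clarity
        parts.append(f"<<EXAMPLE>> {ex['text']} {true_str} {false_str} <</EXAMPLE>>")
    return " ".join(parts)
```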

Such a strategy lets the model learn to use the provided information without over-focusing on the RAC examples. By mixing relevant and randomly retrieved examples, the dataset balances useful contextual information against controlled noise, so the model benefits from additional context when it is available without becoming overly reliant on retrieval-augmented inputs.

GLiClass-V3 Logic Dataset

Rows: 7,776 | Split: train only | Format: Parquet | Language: EN | License: Apache-2.0

What it is

A length-balanced corpus of single-sentence prompts built specifically to induce reasoning in language models.

Why it helps

  • Teaches symbolic-logic patterns and multi-label behaviour: Models learn to handle complex logical reasoning tasks.
  • Length-balanced training: Buckets cover 15 word-length ranges (4 → 1,024) in equal proportions, exposing models to both tiny and very long inputs (one way to inspect this distribution is sketched after this list).
  • Variable answer sets: Each example has 1-50 true and 1-50 false labels, forcing the model to cope with large, variable answer sets.
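The exact bucket boundaries are not published; the sketch below merely inspects the word-length distribution of the released data, using illustrative power-of-two edges rather than the dataset's actual 15 boundaries:

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("knowledgator/gliclass-v3-logic-dataset")["train"]

# Illustrative power-of-two edges; the real 15 bucket boundaries
# between 4 and 1,024 words are not documented.
edges = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
def bucket(n_words):
    return next((e for e in edges if n_words <= e), edges[-1])

counts = Counter(bucket(len(t.split())) for t in ds["text"])
print(sorted(counts.items()))
```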

Where the prompts come from

Re-annotated snippets drawn from four public resources:

| Source dataset | Notes |
| --- | --- |
| FineWeb (clean web crawl) | Plain sentences automatically filtered for quality, then labelled with an LLM. |
| tau/CommonsenseQA | Question stems only; each converted to a declarative premise and re-labelled in multi-label style. |
| GLiClass-2k prototype (BioMike/formal-logic-reasoning-gliclass-2k) | Earlier formal-logic items. |
| nyu-mll/MultiNLI | Premise/hypothesis pairs. |

Data schema

| Column | Type | Notes |
| --- | --- | --- |
| text | string | Sentence or short passage. |
| true_labels | list<string> | All correct answers. |
| all_labels | list<string> | true_labels + distractors (shuffled). |

Quick load

```python
from datasets import load_dataset

# Download from the Hugging Face Hub; the dataset ships a single "train" split.
ds = load_dataset("knowledgator/gliclass-v3-logic-dataset")["train"]
```
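A single row then exposes the three columns described in the schema above:

```python
row = ds[0]
print(row["text"])          # sentence or short passage
print(row["true_labels"])   # all correct answers
print(row["all_labels"])    # true labels mixed with shuffled distractors
```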

Citation

```bibtex
@misc{stepanov2025gliclassgeneralistlightweightmodel,
  title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks},
  author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko},
  year={2025},
  eprint={2508.07662},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.07662},
}
```