Knowledgator Docs
GitHubDiscord
  • 🛎️Welcome
  • ⚙️Models
    • 🧮Comprehend-it
      • Comprehend_it-base
      • Comprehend_it-multilingual-t5-base
    • 🦎UTC
  • 👷Frameworks
    • 💧LiqFit
      • Quick Start
      • Benchmarks
      • API Reference
        • Collators
          • NLICollator
          • Creating custom collator
        • Datasets
          • NLIDataset
        • Losses
          • Focal Loss
          • Binary Cross Entropy
          • Cross Entropy Loss
        • Modeling
          • LiqFitBackbone
          • LiqFitModel
        • Downstream Heads
          • LiqFitHead
          • LabelClassificationHead
          • ClassClassificationHead
          • ClassificationHead
        • Pooling
          • GlobalMaxPooling1D
          • GlobalAbsAvgPooling1D
          • GlobalAbsMaxPooling1D
          • GlobalRMSPooling1D
          • GlobalSumPooling1D
          • GlobalAvgPooling1D
          • FirstTokenPooling1D
        • Models
          • Deberta
          • T5
        • Pipelines
          • ZeroShotClassificationPipeline
  • 📚Datasets
    • Biotech news dataset
  • 👩‍🔧Support
  • API Reference
    • Comprehend-it API
    • Entity extraction
      • /fast
      • /deterministic
      • /advanced
    • Token searcher
    • Web2Meaning
    • Web2Meaning2
    • Relation extraction
    • Text2Table
      • /web2text
      • /text_preprocessing
      • /text2table
      • /merge_tables
Powered by GitBook
On this page
  • Supported languages
  • Usage instructions
  • How to use
  1. Models
  2. Comprehend-it

Comprehend_it-multilingual-t5-base

PreviousComprehend_it-baseNextUTC

Last updated 1 year ago

Supported languages

Supports 101 different languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.

Usage instructions

How to use

Install the neccessary libraries before using it

Because of the different model architecture, we can't use transformers' "zero-shot-classification" pipeline. For that, we developed a special library called . If you haven't installed the sentencepiece library you need to install it as well to use T5 tokenizers.

pip install liqfit sentencepiece

With the LiqFit pipeline

The model can be loaded with the zero-shot-classification pipeline like so:

from liqfit.pipeline import ZeroShotClassificationPipeline
from liqfit.models import T5ForZeroShotClassification
from transformers import T5Tokenizer

model = T5ForZeroShotClassification.from_pretrained('knowledgator/comprehend_it-multilingual-t5-base')
tokenizer = T5Tokenizer.from_pretrained('knowledgator/comprehend_it-multilingual-t5-base')
classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer,
                                                      hypothesis_template = '{}', encoder_decoder = True)

You can then use this pipeline to classify sequences into any of the class names you specify.

sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels, multi_label=False)
{'sequence': 'one day I will see the world',
 'labels': ['travel', 'cooking', 'dancing'],
 'scores': [0.7350383996963501, 0.1484801471233368, 0.1164814680814743]}

Among English you can use the model for many other languages, such as Ukrainian:

sequence_to_classify = "Одного дня я побачу цей світ."
candidate_labels = ['подорож', 'кулінарія', 'танці']
classifier(sequence_to_classify, candidate_labels, multi_label=False)
{'sequence': 'Одного дня я побачу цей світ.',
 'labels': ['подорож', 'кулінарія', 'танці'],
 'scores': [0.6393420696258545, 0.2657214105129242, 0.09493650496006012]}

The model works even if labels and text are in different languages:

sequence_to_classify = "Одного дня я побачу цей світ"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels, multi_label=False)
{'sequence': 'Одного дня я побачу цей світ',
 'labels': ['travel', 'cooking', 'dancing'],
 'scores': [0.7676175236701965, 0.15484870970249176, 0.07753374427556992]}
⚙️
🧮
LiqFit