🧮Comprehend-it

Overview

The Comprehend-it is a powerful NLP model based on DeBERTaV3-base, trained extensively on natural language inference (NLI) and various text classification datasets. It stands out for its performance in zero-shot and few-shot text classification, surpassing Bart-large-mnli in quality while maintaining a significantly smaller size.

Supported IE tasks

Text classification
Reranking of search results
Named-entity recognition
Relation extraction
Entity linking
Question-answering

Models

Model	Base model	Size	Input capacity	Languages	Access
Comprehend_it-base	DeBERTaV3-base	184M	3K	English	Open-sourced under Apache 2.0 Hugging Face API Access: Rapid API
Comprehend_it-multilingual-t5-base	mT5-base	390M	3K	101 languages	Open-sourced under Apache 2.0 Hugging Face

Model

Base model

Size

Input capacity

Languages

Access

Comprehend_it-base

DeBERTaV3-base

184M

English

Open-sourced under Apache 2.0

Hugging Face

API Access:

Rapid API

Comprehend_it-multilingual-t5-base

mT5-base

390M

101 languages

Open-sourced under Apache 2.0

Hugging Face

Benchmarking

Below, you can see the F1 score on several text classification datasets. All tested models were not fine-tuned on those datasets and were tested in a zero-shot setting.

Model	IMDB	AG_NEWS	Emotions
Bart-large-mnli (407 M)	0.89	0.6887	0.3765
Deberta-base-v3 (184 M)	0.85	0.6455	0.5095
Comprehendo (184M)	0.90	0.7982	0.5660
Comprehendo-multi-lang (390M)	0.88	0.8372	-
SetFit BAAI/bge-small-en-v1.5 (33.4M)	0.86	0.5636	0.5754

Usage instructions

For installation guidelines please refer to the model's detailed page.

Examples

candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
classifier(sequence_to_classify, candidate_labels, multi_label=True)
#{'labels': ['travel', 'exploration', 'dancing', 'cooking'],
# 'scores': [0.9945111274719238,
#  0.9383890628814697,
#  0.0057061901316046715,
#  0.0018193122232332826],
# 'sequence': 'one day I will see the world'}

Besides text classification, the model can be used for many other information extraction tasks.

Question-answering

The model can be used to solve open question-answering as well as reading comprehension tasks if it's possible to transform a task into a multi-choice Q&A.

#open question-answering
question = "What is the capital city of Ukraine?"
candidate_answers = ['Kyiv', 'London', 'Berlin', 'Warsaw']
classifier(question, candidate_answers)

# labels': ['Kyiv', 'Warsaw', 'London', 'Berlin'],
#  'scores': [0.8633171916007996,
#   0.11328165978193283,
#   0.012766502797603607,
#   0.010634596459567547]

#reading comprehension
question = 'In what country is Normandy located?'
text = 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'
input_ = f"{question}\n{text}"

candidate_answers = ['Denmark', 'Iceland', 'France', "Norway"]

classifier(input_, candidate_answers)

#  'labels': ['France', 'Iceland', 'Norway', 'Denmark'],
#  'scores': [0.9102861285209656,
#   0.03861876204609871,
#   0.028696594759821892,
#   0.02239849977195263]

#binary question-answering
question = "Does drug development regulation become more aligned with modern technologies and trends, choose yes or no?"
text = "Drug development has become unbearably slow and expensive. A key underlying problem is the clinical prediction challenge: the inability to predict which drug candidates will be safe in the human body and for whom. Recently, a dramatic regulatory change has removed FDA's mandated reliance on antiquated, ineffective animal studies. A new frontier is an integration of several disruptive technologies [machine learning (ML), patient-on-chip, real-time sensing, and stem cells], which, when integrated, have the potential to address this challenge, drastically cutting the time and cost of developing drugs, and tailoring them to individual patients."
input_ = f"{question}\n{text}"

candidate_answers = ['yes', 'no']

classifier(input_, candidate_answers)

# 'labels': ['yes', 'no'],
#  'scores': [0.5876278281211853, 0.4123721718788147]}

Named-entity classification and disambiguation

The model can be used to classify named entities or disambiguate similar ones. It can be put as one of the components in an entity-linking system as a reranker.

text = """Knowledgator is an open-source ML research organization focused on advancing the information extraction field."""

candidate_labels = ['Knowledgator - company',
 'Knowledgator - product', 
 'Knowledgator - city']

classifier(text, candidate_labels)

# 'labels': ['Knowledgator - company',
#   'Knowledgator - product',
#   'Knowledgator - city'],
#  'scores': [0.887371301651001, 0.097423255443573, 0.015205471776425838]

Relation classification

With the same principle, the model can be utilized to classify relations from a text.

text = """The FKBP5 gene codifies a co-chaperone protein associated with the modulation of glucocorticoid receptor interaction involved in the adaptive stress response. The FKBP5 intracellular concentration affects the binding affinity of the glucocorticoid receptor (GR) to glucocorticoids (GCs). This gene has glucocorticoid response elements (GRES) located in introns 2, 5 and 7, which affect its expression. Recent studies have examined GRE activity and the effects of genetic variants on transcript efficiency and their contribution to susceptibility to behavioral disorders. Epigenetic changes and environmental factors can influence the effects of these allele-specific variants, impacting the response to GCs of the FKBP5 gene. The main epigenetic mark investigated in FKBP5 intronic regions is DNA methylation, however, few studies have been performed for all GRES located in these regions. One of the major findings was the association of low DNA methylation levels in the intron 7 of FKBP5 in patients with psychiatric disorders. To date, there are no reports of DNA methylation in introns 2 and 5 of the gene associated with diagnoses of psychiatric disorders. This review highlights what has been discovered so far about the relationship between polymorphisms and epigenetic targets in intragenic regions, and reveals the gaps that need to be explored, mainly concerning the role of DNA methylation in these regions and how it acts in psychiatric disease susceptibility."""

candidate_labels = ['FKBP5-associated with -> PTSD',
 'FKBP5 - has no effect on -> PTSD',
 'FKBP5 - is similar to -> PTSD',
 'FKBP5 - inhibitor of-> PTSD',
 'FKBP5 - ancestor of -> PTSD']

classifier(text, candidate_labels)

#  'labels': ['FKBP5-associated with -> PTSD',
#   'FKBP5 - is similar to -> PTSD',
#   'FKBP5 - has no effect on -> PTSD',
#   'FKBP5 - ancestor of -> PTSD',
#   'FKBP5 - inhibitor of-> PTSD'],
#  'scores': [0.5880666971206665,
#   0.17369700968265533,
#   0.14067059755325317,
#   0.05044548586010933,
#   0.04712018370628357]

Fine-tuning

We recommend fine-tuning models using our LiqFit framework, which allows you to efficiently fine-tune models using about 8 training examples per label.

PreviousModels NextComprehend_it-base

Last updated 5 months ago