ZeroShotClassificationPipeline

The pipeline makes it easy to run zero-shot text classification with fine-tuned cross-encoders.

class ZeroShotClassificationPipeline(args_parser=ZeroShotClassificationArgumentHandler(), *args, **kwargs)

Parameters:

  • model (AutoModelForSequenceClassification | CrossFitModel | torch.nn.Module): the fine-tuned model that the pipeline uses for inference.

  • tokenizer (AutoTokenizer): the tokenizer that splits the input text into the tokens the model consumes.

  • hypothesis_template (str, default='{}'): an optional template for generating hypotheses. The template is a string containing a '{}' placeholder that is filled with each candidate label during inference. The default value, '{}', is a bare placeholder, so each label is used as the hypothesis unchanged. Users can supply their own template, for example 'This example is {}.'.

  • hypothesis_first (bool, default=False): whether to place the hypothesis before the premise. This can help models with a block attention mechanism, in which each token attends to the tokens inside a local window and to the first N tokens.

  • encoder_decoder (bool, default=True): whether the model is an encoder-decoder architecture. When set to True, the text is processed by the encoder and the labels are processed by the decoder.
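
To illustrate how hypothesis_template and hypothesis_first interact, here is a minimal sketch in plain Python. The helper build_pairs is hypothetical and is not part of liqfit; it only shows how each candidate label could be substituted into the template and how the order of the pair flips. The pipeline's internal implementation may differ.

```python
def build_pairs(sequence, candidate_labels, hypothesis_template="{}", hypothesis_first=False):
    """Hypothetical helper (not a liqfit function): build one text pair per label."""
    pairs = []
    for label in candidate_labels:
        # Fill the '{}' placeholder with the candidate label.
        hypothesis = hypothesis_template.format(label)
        if hypothesis_first:
            # Hypothesis comes before the premise (the input text).
            pairs.append((hypothesis, sequence))
        else:
            pairs.append((sequence, hypothesis))
    return pairs


pairs = build_pairs(
    "one day I will see the world",
    ["travel", "cooking"],
    hypothesis_template="This example is {}.",
)
print(pairs[0])  # ('one day I will see the world', 'This example is travel.')
```

With the default template '{}', the hypothesis is simply the label itself.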

Using ZeroShotClassificationPipeline:

The pipeline supports classical cross-encoder models with three output neurons, corresponding to the entailment, contradiction, and neutral classes.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

from liqfit.pipeline import ZeroShotClassificationPipeline


sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
template = 'This example is {}.'

model_path = 'knowledgator/comprehend_it-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Build the pipeline from the model, tokenizer, and template defined above
classifier = ZeroShotClassificationPipeline(model=model,
                                            tokenizer=tokenizer,
                                            hypothesis_template=template)

# multi_label=True scores every label independently
results = classifier(sequence_to_classify, candidate_labels, multi_label=True)
print(results)
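
For intuition about how three-class logits become label scores, the sketch below follows the approach used by the zero-shot classification pipeline in Hugging Face transformers; that liqfit scores labels in exactly the same way is an assumption. The function name nli_scores, the logit values, and the class order (contradiction, neutral, entailment) are illustrative; the real order depends on the model's label mapping.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def nli_scores(logits, multi_label, entail_id=2, contra_id=0):
    """Illustrative helper: logits holds one [contradiction, neutral, entailment]
    row per candidate label (class order is an assumption)."""
    if multi_label:
        # Each label is scored on its own: entailment vs. contradiction,
        # with the neutral logit ignored.
        return [softmax([row[contra_id], row[entail_id]])[1] for row in logits]
    # Single-label: entailment logits compete across all labels and sum to 1.
    return softmax([row[entail_id] for row in logits])


# Made-up logits for three candidate labels
logits = [[-2.0, 0.1, 3.0], [1.5, 0.2, -1.0], [0.5, 0.3, 0.4]]
print(nli_scores(logits, multi_label=True))
print(nli_scores(logits, multi_label=False))
```

In multi-label mode the scores are independent and need not sum to 1; in single-label mode they form a distribution over the candidate labels.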

The pipeline also supports binary reranking models in both single-label and multi-label scenarios:

model_path = 'BAAI/bge-reranker-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Reuses sequence_to_classify, candidate_labels, and template from the previous example
classifier = ZeroShotClassificationPipeline(model=model,
                                            tokenizer=tokenizer,
                                            hypothesis_template=template,
                                            hypothesis_first=False)

results = classifier(sequence_to_classify, candidate_labels, multi_label=True)
print(results)
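
A binary reranker emits a single relevance logit per text-hypothesis pair rather than three class logits. The sketch below shows one plausible way such logits could be mapped to scores: a sigmoid per label in the multi-label case and a softmax across labels in the single-label case. This mapping is an assumption for illustration, not a description of liqfit internals, and reranker_scores is a hypothetical helper.

```python
import math


def reranker_scores(logits, multi_label):
    """Hypothetical helper: logits holds one relevance logit per candidate label."""
    if multi_label:
        # Independent probability per label via the sigmoid function.
        return [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Single-label: softmax across labels so the scores sum to 1.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate labels
example_logits = [2.0, -1.0, 0.0]
print(reranker_scores(example_logits, multi_label=True))
print(reranker_scores(example_logits, multi_label=False))
```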

Encoder-decoder models are more flexible because they process the text once with the encoder and then use a smaller decoder to calculate the probability of each class. They also distinguish better between the text and the labels, because the two are processed by different parts of the model.

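from liqfit.pipeline import ZeroShotClassificationPipeline
from liqfit.models import T5ForZeroShotClassification
from transformers import T5Tokenizer

model_path = 'knowledgator/comprehend_it-multilingual-t5-base'
model = T5ForZeroShotClassification.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)

# encoder_decoder=True: the text goes to the encoder, the labels to the decoder
classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer,
                                            hypothesis_template='{}', encoder_decoder=True)

# Same call pattern as in the cross-encoder examples above
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
results = classifier(sequence_to_classify, candidate_labels, multi_label=True)
print(results)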
