🦎 UTC

Overview

UTC is a prompt-based token classification model built on DeBERTaV3 and T5 encoders. Trained on a variety of token classification tasks, it demonstrates strong generalization and excels in zero-shot and few-shot settings across diverse information extraction (IE) tasks, making it a versatile tool for a range of NLP applications.

Supported IE tasks

  • Named-entity recognition (NER)

  • Relation extraction

  • Summarization

  • Q&A

  • Text cleaning

  • Coreference resolution

Models

Model       Size   Input capacity   Language   Access

UTC-small   141M   3K tokens        English    Open-sourced under Apache 2.0
UTC-base    184M   3K tokens        English    Open-sourced under Apache 2.0
UTC-large   434M   3K tokens        English    Open-sourced under Apache 2.0
UTC-large   783M   3K tokens        English    Open-sourced under Apache 2.0

Common features

  • Prompt-based. The model was trained on multiple token classification tasks, making it adaptable to a variety of information extraction tasks through user prompts.

  • Supports zero-shot and few-shot learning. Capable of performing tasks with little to no training data, making it highly adaptable to new challenges.

  • 3K token capacity. Can process texts up to 3,000 tokens in length; we are working on expanding this capacity. For longer inputs, see the chunking sketch after this list.

  • Currently supports the English language only.
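
For texts longer than the token limit, one simple workaround is to split the input into overlapping character windows, run each window through the model, and shift the resulting spans back into document coordinates. A minimal sketch, assuming the process helper defined in the usage instructions below; the chunk_text helper, window sizes, and the long_text and prompt variables are illustrative:

def chunk_text(text, max_chars=8000, overlap=500):
  """Split text into overlapping character windows so each piece
  stays under the model's token limit (a rough character heuristic)."""
  chunks, start = [], 0
  while start < len(text):
    end = min(start + max_chars, len(text))
    chunks.append((start, text[start:end]))
    if end == len(text):
      break
    start = end - overlap
  return chunks

# Run the prompt over every chunk and shift spans back into
# document coordinates (overlapping windows may yield duplicates)
all_results = []
for offset, chunk in chunk_text(long_text):
  for r in process(chunk, prompt):
    r['start'] += offset
    r['end'] += offset
    all_results.append(r)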

Usage instructions

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-small")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

def process(text, prompt, threshold=0.5):
  """
  Processes text by prepending the prompt and adjusting result indices.

  Args:
    text (str): The text to process
    prompt (str): The prompt to prepend to the text
    threshold (float): Minimum score a span must reach to be kept

  Returns:
    list: A list of dicts with spans and scores adjusted to `text`
  """

  # Concatenate prompt and text for the full input
  input_ = f"{prompt}\n{text}"

  results = nlp(input_)  # Run the pipeline on the full input

  processed_results = []

  # Offset of `text` inside the input: prompt length plus the newline
  prompt_length = len(prompt) + 1

  for result in results:
    # Skip spans whose score is below the threshold
    if result['score'] < threshold:
        continue

    # Shift indices back into `text` coordinates
    start = result['start'] - prompt_length

    # Skip spans that belong to the prompt itself
    if start < 0:
        continue

    end = result['end'] - prompt_length

    # Extract the span from the original text using adjusted indices
    span = text[start:end]

    # Create processed result dict
    processed_results.append({
      'span': span,
      'start': start,
      'end': end,
      'score': result['score']
    })

  return processed_results

Examples
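
All of the examples below assume a text variable holding the document to analyze, for example this short sample:

text = """Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800."""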

Zero-shot NER

prompt = "Identify the following entity classes in the text: computer\nText:"
results = process(text, prompt)
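
Each returned item is a dict with span, start, end, and score keys, so matches can be inspected directly:

for r in results:
  print(f"{r['span']} ({r['start']}:{r['end']}) score={r['score']:.2f}")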

Question answering

question = "Who are the founders of Microsoft?"
results = process(text, question)

Text cleaning

prompt = "Clean the following text extracted from the web matching not relevant parts:"
results = process(text, prompt)

Relation extraction

rex_prompt = "Identify target entity given the following relation: '{}' and the following source entity: '{}'\nText:"
results = process(text, rex_prompt.format(relation, entity))
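
For instance, to extract who founded Microsoft from the sample text above (the relation and entity values are illustrative):

relation = "founded by"
entity = "Microsoft"
results = process(text, rex_prompt.format(relation, entity))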

Fine-tuning

Currently, you can fine-tune our models via the Hugging Face AutoTrain feature; a code-based alternative is sketched below.
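
If you prefer to stay in code, a minimal sketch of standard token-classification fine-tuning with the transformers Trainer is shown below. The training pair, the binary label scheme (1 = inside a target span, 0 = outside, -100 = ignored special token), and the hyperparameters are illustrative assumptions, not the official recipe:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-small")

def encode_example(prompt, text, spans):
  """Tokenize a prompt/text pair and derive token labels from
  character-level (start, end) spans given relative to `text`."""
  offset = len(prompt) + 1  # prompt, then the newline separator
  enc = tokenizer(f"{prompt}\n{text}", truncation=True,
                  return_offsets_mapping=True)
  labels = []
  for s, e in enc["offset_mapping"]:
    if e == s:  # special tokens: exclude from the loss
      labels.append(-100)
    elif any(s >= a + offset and e <= b + offset for a, b in spans):
      labels.append(1)  # token lies inside a target span
    else:
      labels.append(0)
  enc["labels"] = labels
  enc.pop("offset_mapping")
  return enc

# A single hypothetical training example; real data needs many of these.
train_data = [encode_example(
  "Identify the following entity classes in the text: city\nText:",
  "Paris is the capital of France.",
  [(0, 5)],  # the span "Paris"
)]

args = TrainingArguments(output_dir="utc-finetuned", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)

trainer = Trainer(model=model, args=args, train_dataset=train_data,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()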

Potential and limitations

  • Potential. The UTC-DeBERTa-small model's prompt-based approach allows for flexible adaptation to various tasks. Its strength in token-level analysis makes it highly effective for detailed text-processing tasks.

  • Limitations. While the model shows promise in summarization, it is currently not its strongest application. Enhancements in this area are a future development focus.
