🦎 UTC

Overview

UTC is a prompt-based token classification model built on DeBERTaV3 and T5 encoders. Trained on a variety of token classification tasks, it demonstrates strong generalization and excels in zero-shot and few-shot settings across diverse information extraction (IE) tasks, making it a versatile tool for a range of NLP applications.

Supported IE tasks

  • Named-entity recognition (NER)

  • Relation extraction

  • Summarization

  • Q&A

  • Text cleaning

  • Coreference resolution

Models

Model       Size   Input capacity   Language   Access

UTC-small   141M   3K tokens        English    Open-sourced under Apache 2.0
UTC-base    184M   3K tokens        English    Open-sourced under Apache 2.0
UTC-large   434M   3K tokens        English    Open-sourced under Apache 2.0
UTC-large   783M   3K tokens        English    Open-sourced under Apache 2.0

Common features

  • Prompt-based. The model was trained on multiple token classification tasks, making it adaptable to a variety of information extraction tasks through user prompts.

  • Supports zero-shot and few-shot learning. Capable of performing tasks with little to no training data, making it highly adaptable to new challenges.

  • 3K token capacity. Can process texts up to 3,000 tokens in length; we are working on expanding this capacity. For longer inputs, see the chunking sketch after this list.

  • Currently supports the English language only.
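
For texts longer than the token limit, one simple workaround is to split the input into overlapping character windows, run each window through the model, and shift the resulting spans back into document coordinates. A minimal sketch, assuming the process helper defined in the usage instructions below; the chunk_text helper, window sizes, and the long_text and prompt variables are illustrative:

def chunk_text(text, max_chars=8000, overlap=500):
  """Split text into overlapping character windows so each piece
  stays under the model's token limit (a rough character heuristic)."""
  chunks, start = [], 0
  while start < len(text):
    end = min(start + max_chars, len(text))
    chunks.append((start, text[start:end]))
    if end == len(text):
      break
    start = end - overlap
  return chunks

# Run the prompt over every chunk and shift spans back into
# document coordinates (overlapping windows may yield duplicates)
all_results = []
for offset, chunk in chunk_text(long_text):
  for r in process(chunk, prompt):
    r['start'] += offset
    r['end'] += offset
    all_results.append(r)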

Usage instructions

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-small")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

def process(text, prompt, threshold=0.5):
  """
  Processes text by prepending the prompt and adjusting result indices.

  Args:
    text (str): The text to process
    prompt (str): The prompt to prepend to the text
    threshold (float): Minimum score a span must reach to be kept

  Returns:
    list: A list of dicts with spans and scores adjusted to `text`
  """

  # Concatenate prompt and text for the full input
  input_ = f"{prompt}\n{text}"

  results = nlp(input_)  # Run the pipeline on the full input

  processed_results = []

  # Offset of `text` inside the input: prompt length plus the newline
  prompt_length = len(prompt) + 1

  for result in results:
    # Skip spans whose score is below the threshold
    if result['score'] < threshold:
        continue

    # Shift indices back into `text` coordinates
    start = result['start'] - prompt_length

    # Skip spans that belong to the prompt itself
    if start < 0:
        continue

    end = result['end'] - prompt_length

    # Extract the span from the original text using adjusted indices
    span = text[start:end]

    # Create processed result dict
    processed_results.append({
      'span': span,
      'start': start,
      'end': end,
      'score': result['score']
    })

  return processed_results

Examples
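
All of the examples below assume a text variable holding the document to analyze, for example this short sample:

text = """Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800."""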

Zero-shot NER

prompt = "Identify the following entity classes in the text: computer\nText:"
results = process(text, prompt)
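
Each returned item is a dict with span, start, end, and score keys, so matches can be inspected directly:

for r in results:
  print(f"{r['span']} ({r['start']}:{r['end']}) score={r['score']:.2f}")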

Question answering

question = "Who are the founders of Microsoft?"
results = process(text, question)

Text cleaning

prompt = "Clean the following text extracted from the web matching not relevant parts:"
results = process(text, prompt)

Relation extraction

rex_prompt = "Identify target entity given the following relation: '{}' and the following source entity: '{}'\nText:"
results = process(text, rex_prompt.format(relation, entity))
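
For instance, to extract who founded Microsoft from the sample text above (the relation and entity values are illustrative):

relation = "founded by"
entity = "Microsoft"
results = process(text, rex_prompt.format(relation, entity))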

Fine-tuning

Currently, you can fine-tune our models via the Hugging Face AutoTrain feature; a code-based alternative is sketched below.
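
If you prefer to stay in code, a minimal sketch of standard token-classification fine-tuning with the transformers Trainer is shown below. The training pair, the binary label scheme (1 = inside a target span, 0 = outside, -100 = ignored special token), and the hyperparameters are illustrative assumptions, not the official recipe:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-small")

def encode_example(prompt, text, spans):
  """Tokenize a prompt/text pair and derive token labels from
  character-level (start, end) spans given relative to `text`."""
  offset = len(prompt) + 1  # prompt, then the newline separator
  enc = tokenizer(f"{prompt}\n{text}", truncation=True,
                  return_offsets_mapping=True)
  labels = []
  for s, e in enc["offset_mapping"]:
    if e == s:  # special tokens: exclude from the loss
      labels.append(-100)
    elif any(s >= a + offset and e <= b + offset for a, b in spans):
      labels.append(1)  # token lies inside a target span
    else:
      labels.append(0)
  enc["labels"] = labels
  enc.pop("offset_mapping")
  return enc

# A single hypothetical training example; real data needs many of these.
train_data = [encode_example(
  "Identify the following entity classes in the text: city\nText:",
  "Paris is the capital of France.",
  [(0, 5)],  # the span "Paris"
)]

args = TrainingArguments(output_dir="utc-finetuned", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)

trainer = Trainer(model=model, args=args, train_dataset=train_data,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()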

Potential and limitations

  • Potential. The UTC-DeBERTa-small model's prompt-based approach allows for flexible adaptation to various tasks. Its strength in token-level analysis makes it highly effective for detailed text-processing tasks.

  • Limitations. While the model shows promise in summarization, it is currently not its strongest application. Enhancements in this area are a future development focus.
