UTC

Overview

UTC is a prompt-based token classification model built on DeBERTaV3 and T5 encoders. Trained on a variety of token classification tasks, it demonstrates strong generalization and excels in zero-shot and few-shot settings across diverse information extraction (IE) tasks, making it a versatile tool for a wide range of NLP applications.

Supported IE tasks

  • Named-entity recognition (NER)
  • Relation extraction
  • Summarization
  • Q&A
  • Text cleaning
  • Coreference resolution

Models

| Model     | Base model      | Size | Input capacity | Language | Access                        |
|-----------|-----------------|------|----------------|----------|-------------------------------|
| UTC-small | DeBERTaV3-small | 141M | 3K tokens      | English  | Open-sourced under Apache 2.0 |
| UTC-large | DeBERTaV3-large | 434M | 3K tokens      | English  | Open-sourced under Apache 2.0 |
| UTC-base  | DeBERTaV3-base  | 184M | 3K tokens      | English  | Open-sourced under Apache 2.0 |
| UTC-large | T5-large        | 783M | 3K tokens      | English  | Open-sourced under Apache 2.0 |

Common features

  • Prompt-based. The model was trained on multiple token classification tasks, making it adaptable to a variety of information extraction tasks through user prompts.
  • Zero-shot and few-shot learning. The model can perform tasks with little to no training data, making it highly adaptable to new problems.
  • 3K token capacity. The model processes texts up to 3,000 tokens long; expanding this capacity is ongoing work. A quick length check before inference is sketched after this list.
  • English only. Other languages are not supported yet.
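
Since the prompt and text are fed to the model together, it can be useful to verify that their combined length stays within the 3,000-token limit before inference. A minimal sketch, assuming the UTC-DeBERTa-small checkpoint used below; the fits_in_context helper is our own illustration, not part of the model's API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")

def fits_in_context(prompt: str, text: str, max_tokens: int = 3000) -> bool:
    """Check that prompt + text stay within the model's token capacity."""
    # The model receives the prompt and text joined by a newline (see below)
    input_ = f"{prompt}\n{text}"
    return len(tokenizer(input_)["input_ids"]) <= max_tokens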

Usage instructions

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-small")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-small")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

def process(text, prompt, threshold=0.5):
    """
    Processes text by preparing the prompt and adjusting indices.

    Args:
        text (str): The text to process
        prompt (str): The prompt to prepend to the text
        threshold (float): Minimum score a span must reach to be kept

    Returns:
        list: A list of dicts with adjusted spans and scores
    """
    # Concatenate prompt and text into the full model input
    input_ = f"{prompt}\n{text}"
    results = nlp(input_)  # Run the token classification pipeline
    processed_results = []

    prompt_length = len(prompt)
    for result in results:
        # Skip spans scoring below the threshold
        if result["score"] < threshold:
            continue

        # Adjust indices by subtracting the prompt length
        start = result["start"] - prompt_length
        # Skip spans that belong to the prompt itself
        if start < 0:
            continue
        end = result["end"] - prompt_length

        # Extract the span from the original text using adjusted indices
        span = text[start:end]

        processed_results.append({
            "span": span,
            "start": start,
            "end": end,
            "score": result["score"],
        })

    return processed_results
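
The examples below reuse this process helper and assume a text variable holding the passage to analyze. Each call returns a list of dicts with span, start, end, and score keys, re-indexed to the original text. The sample passage here is our own illustration:

text = ("Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 "
        "to develop and sell BASIC interpreters for the Altair 8800.")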

Examples

Zero-shot NER

prompt = "Identify the following entity classes in the text: computer\nText:"
results = process(text, prompt)
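
To cover several entity classes, one simple approach (our suggestion, not a documented batch format) is to run one prompt per class:

ner_prompt = "Identify the following entity classes in the text: {}\nText:"
for label in ["company", "person", "date"]:
    for result in process(text, ner_prompt.format(label)):
        print(label, result["span"], result["score"])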

Question answering

question = "Who are the founders of Microsoft?"
results = process(text, question)
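
If you only need the single best answer, keep the highest-scoring span (a convenience of ours, not part of the model's output):

answers = process(text, question)
if answers:
    best = max(answers, key=lambda r: r["score"])
    print(best["span"])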

Text cleaning

prompt = "Clean the following text extracted from the web matching not relevant parts:"
results = process(text, prompt)
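
The model only marks the irrelevant fragments as spans; removing them is left to you. A minimal helper of ours that cuts the flagged character ranges out of the text:

def remove_spans(text, spans):
    # Delete from right to left so earlier indices stay valid
    for s in sorted(spans, key=lambda r: r["start"], reverse=True):
        text = text[:s["start"]] + text[s["end"]:]
    return text

cleaned = remove_spans(text, results)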

Relation extraction

rex_prompt = "Identify target entity given the following relation: '{}' and the following source entity: '{}'\nText:"
results = process(text, rex_prompt.format(relation, entity))
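
For instance, to find who founded Microsoft in the sample passage above (the relation and entity values here are our illustration):

relation = "founders"
entity = "Microsoft"
results = process(text, rex_prompt.format(relation, entity))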

Fine-tuning

Currently, you can fine-tune our model via the Hugging Face AutoTrain feature.

Potential and limitations

  • Potential. The UTC-DeBERTa-small model's prompt-based approach allows flexible adaptation to a wide range of tasks, and its strength in token-level analysis makes it highly effective for fine-grained text-processing work.
  • Limitations. While the model shows promise in summarization, this is currently not its strongest application; improving it is a focus of future development.