How to Categorize Social Media Posts Locally with GLiClass
Run zero-shot text classification on your local machine to categorize social media posts by topic, sentiment, and intent without sending data to external APIs.
Overview
This cookbook demonstrates using the knowledgator/gliclass-edge-v3.0 model for local, privacy-preserving classification of social media content. The edge model is optimized for fast inference on consumer hardware while maintaining high accuracy.
What You'll Learn
- Set up GLiClass for local inference
- Define category taxonomies for social media content
- Classify posts by topic, sentiment, and engagement potential
- Handle multi-platform content (Twitter/X, Instagram, LinkedIn, TikTok)
- Optimize performance for batch processing
- Build a complete local classification pipeline
Prerequisites
- Python 3.10+
- 4GB+ RAM (8GB recommended)
- GPU optional but recommended for batch processing
- Sample social media posts for testing
Use Cases
- Content moderation and filtering
- Social media analytics dashboards
- Trend detection and monitoring
- Influencer content analysis
- Brand mention categorization
- Competitor content tracking
Why Run Locally?
Running classification locally offers several advantages:
| Benefit | Description |
|---|---|
| Privacy | Sensitive social data never leaves your infrastructure |
| Cost | No per-request API fees for high-volume processing |
| Latency | Sub-100ms inference without network round-trips |
| Offline | Works without internet connectivity |
| Control | Full control over model versions and updates |
The GLiClass Edge Model
The knowledgator/gliclass-edge-v3.0 model is a compact zero-shot classifier optimized for edge deployment:
| Specification | Value |
|---|---|
| Model size | ~200MB |
| Inference speed | ~50ms per text (CPU) |
| Max sequence length | 512 tokens |
| Zero-shot | Yes (no training required) |
| Multilingual | English primary, partial multilingual support |
Installation
Install Dependencies
pip install gliclass torch transformers
For GPU Acceleration (Optional)
# CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
Verify Installation
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer
# Load model and tokenizer (downloads on first run)
model = GLiClassModel.from_pretrained("knowledgator/gliclass-edge-v3.0")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-edge-v3.0")
# Create classification pipeline
pipeline = ZeroShotClassificationPipeline(
model, tokenizer,
classification_type='multi-label',
device='cpu'
)
print("Model loaded successfully!")
# Quick test
text = "Just shipped a new feature! So excited to share it with everyone."
labels = ["announcement", "question", "complaint", "casual conversation"]
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
print(f"{result['label']} => {result['score']:.3f}")
Step 1: Define Social Media Categories
Topic Categories
TOPIC_LABELS = [
"technology and software",
"business and entrepreneurship",
"health and fitness",
"food and cooking",
"travel and adventure",
"fashion and beauty",
"entertainment and pop culture",
"sports",
"politics and news",
"education and learning",
"art and creativity",
"personal life update",
"humor and memes",
"motivational and inspirational"
]
Content Type Categories
CONTENT_TYPE_LABELS = [
"product announcement or launch",
"promotional content or advertisement",
"educational tutorial or how-to",
"opinion or commentary",
"question asking for advice",
"personal story or experience",
"news or current events",
"behind-the-scenes content",
"user-generated testimonial",
"engagement bait or poll",
"meme or humorous content",
"inspirational quote or message"
]
Sentiment Categories
SENTIMENT_LABELS = [
"very positive and enthusiastic",
"positive and satisfied",
"neutral or informational",
"negative or disappointed",
"angry or frustrated",
"sarcastic or ironic"
]
Engagement Intent Categories
ENGAGEMENT_LABELS = [
"seeking likes and shares",
"starting a discussion",
"asking for help or advice",
"sharing information",
"promoting a product or service",
"building personal brand",
"networking and connecting",
"entertainment only"
]
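If you plan to run every dimension over each post (as the pipeline in Step 5 does), it helps to collect the label sets in one mapping so code can iterate over them. A minimal sketch; the TAXONOMIES name is a convention of this cookbook, not part of GLiClass:

# Group the label sets so downstream code can loop over every dimension at once
TAXONOMIES = {
    "topic": TOPIC_LABELS,
    "content_type": CONTENT_TYPE_LABELS,
    "sentiment": SENTIMENT_LABELS,
    "engagement_intent": ENGAGEMENT_LABELS,
}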
Step 2: Basic Classification
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer
# Initialize model and pipeline
model = GLiClassModel.from_pretrained("knowledgator/gliclass-edge-v3.0")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-edge-v3.0")
pipeline = ZeroShotClassificationPipeline(
model, tokenizer,
classification_type='multi-label',
device='cpu'
)
def classify_post(text: str, labels: list[str], threshold: float = 0.3) -> dict:
"""
Classify a social media post against given labels.
Args:
text: The post content
labels: List of category labels
threshold: Minimum confidence score
Returns:
Dictionary of label -> score for labels above threshold
"""
results = pipeline(text, labels, threshold=threshold)[0]
return {r['label']: r['score'] for r in results}
# Example post
post = """
🚀 Finally launched my new SaaS product after 6 months of building!
It helps small businesses automate their invoicing.
Would love your feedback - link in bio!
#startup #entrepreneurship #buildinpublic
"""
# Classify by topic
topic = classify_post(post, TOPIC_LABELS)
print("Topic:", topic)
# Output: {'business and entrepreneurship': 0.92, 'technology and software': 0.71}
# Classify by content type
content_type = classify_post(post, CONTENT_TYPE_LABELS)
print("Content Type:", content_type)
# Output: {'product announcement or launch': 0.95, 'promotional content or advertisement': 0.68}
# Classify sentiment
sentiment = classify_post(post, SENTIMENT_LABELS)
print("Sentiment:", sentiment)
# Output: {'very positive and enthusiastic': 0.89}
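The threshold argument decides how many labels survive, and the right value depends on your labels and content. If you are unsure, a quick sweep over a sample post shows how the cutoff changes the result; a small sketch reusing classify_post from above:

# Sweep thresholds on one post to see how many labels survive at each cutoff
for t in (0.2, 0.3, 0.4, 0.5, 0.6):
    matched = classify_post(post, TOPIC_LABELS, threshold=t)
    print(f"threshold={t}: {len(matched)} label(s) -> {sorted(matched)}")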
Step 3: Multi-Label Classification
Social media posts often belong to multiple categories:
def classify_multi_label(
text: str,
labels: list[str],
threshold: float = 0.4,
max_labels: int = 3
) -> list[tuple[str, float]]:
"""
Return multiple matching labels sorted by confidence.
"""
results = pipeline(text, labels, threshold=threshold)[0]
# Sort by score descending
sorted_results = sorted(
results,
key=lambda x: x['score'],
reverse=True
)
return [(r['label'], r['score']) for r in sorted_results[:max_labels]]
# Example: Post with multiple topics
post = """
Made this amazing avocado toast while watching the game.
Perfect Sunday vibes! Recipe in comments 🥑⚽
"""
labels = classify_multi_label(post, TOPIC_LABELS)
for label, score in labels:
print(f" {label}: {score:.2f}")
# Output:
# food and cooking: 0.87
# sports: 0.62
# personal life update: 0.58
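Multi-label mode fits topics, where several labels can apply at once, but for mutually exclusive dimensions such as sentiment you may want exactly one winner. GLiClass pipelines also accept classification_type='single-label'; a sketch, assuming your installed version supports that mode:

# A second pipeline in single-label mode: one dominant label expected per text
single_pipeline = ZeroShotClassificationPipeline(
    model, tokenizer,
    classification_type='single-label',
    device='cpu'
)
results = single_pipeline(post, SENTIMENT_LABELS, threshold=0.0)[0]
best = max(results, key=lambda r: r['score'])
print(f"Sentiment: {best['label']} ({best['score']:.3f})")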
Step 4: Platform-Specific Classification
Different platforms have different content styles:
from dataclasses import dataclass, field
from enum import Enum
class Platform(Enum):
TWITTER = "twitter"
INSTAGRAM = "instagram"
LINKEDIN = "linkedin"
TIKTOK = "tiktok"
FACEBOOK = "facebook"
# Platform-specific labels
PLATFORM_LABELS = {
Platform.TWITTER: [
"hot take or opinion",
"thread or long-form content",
"news commentary",
"viral moment reaction",
"community engagement",
"self-promotion",
"humor or shitpost"
],
Platform.INSTAGRAM: [
"lifestyle showcase",
"product feature",
"behind-the-scenes",
"aesthetic or mood post",
"story highlight",
"collaboration or partnership",
"user-generated content repost"
],
Platform.LINKEDIN: [
"career update or announcement",
"thought leadership",
"industry insight",
"job opportunity",
"company news",
"professional achievement",
"networking request",
"motivational content"
],
Platform.TIKTOK: [
"trend participation",
"tutorial or how-to",
"comedy skit",
"storytime",
"product review",
"dance or music content",
"day in my life",
"duet or stitch response"
]
}
@dataclass
class SocialPost:
"""Represents a social media post with metadata."""
text: str
platform: Platform
author: str = ""
    hashtags: list[str] = field(default_factory=list)
    mentions: list[str] = field(default_factory=list)
def classify_by_platform(post: SocialPost, threshold: float = 0.4) -> dict:
"""Classify post using platform-specific categories."""
platform_labels = PLATFORM_LABELS.get(post.platform, TOPIC_LABELS)
results = pipeline(post.text, platform_labels, threshold=threshold)[0]
categories = {r['label']: r['score'] for r in results}
return {
"platform": post.platform.value,
"categories": categories,
"primary_category": max(categories.items(), key=lambda x: x[1])[0] if categories else "uncategorized"
}
# Example usage
linkedin_post = SocialPost(
text="""
Excited to announce I've joined Acme Corp as Senior Engineer!
After 5 years at my previous role, I'm ready for this new challenge.
Grateful for everyone who supported me on this journey.
#newjob #career #grateful
""",
platform=Platform.LINKEDIN,
author="jane_doe",
hashtags=["newjob", "career", "grateful"]
)
result = classify_by_platform(linkedin_post)
print(f"Platform: {result['platform']}")
print(f"Primary: {result['primary_category']}")
print(f"All categories: {result['categories']}")
Step 5: Complete Classification Pipeline
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json
@dataclass
class ClassificationResult:
"""Complete classification output for a social media post."""
post_id: str
text: str
platform: str
topic: dict
content_type: dict
sentiment: dict
engagement_intent: dict
primary_topic: str
primary_sentiment: str
is_promotional: bool
    classified_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
def to_dict(self) -> dict:
return {
"post_id": self.post_id,
"text": self.text[:100] + "..." if len(self.text) > 100 else self.text,
"platform": self.platform,
"primary_topic": self.primary_topic,
"primary_sentiment": self.primary_sentiment,
"is_promotional": self.is_promotional,
"topic_scores": self.topic,
"sentiment_scores": self.sentiment,
"content_type_scores": self.content_type,
"engagement_intent_scores": self.engagement_intent,
"classified_at": self.classified_at
}
class SocialMediaClassifier:
"""
Complete local classification pipeline for social media posts.
"""
def __init__(
self,
model_name: str = "knowledgator/gliclass-edge-v3.0",
device: str = "cpu",
threshold: float = 0.4,
classification_type: str = "multi-label"
):
self.model = GLiClassModel.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.threshold = threshold
self.pipeline = ZeroShotClassificationPipeline(
self.model,
self.tokenizer,
classification_type=classification_type,
device=device
)
def _classify(self, text: str, labels: list[str]) -> dict:
"""Internal classification method."""
results = self.pipeline(text, labels, threshold=self.threshold)[0]
return {r['label']: r['score'] for r in results}
def classify(
self,
post_id: str,
text: str,
platform: str = "unknown"
) -> ClassificationResult:
"""Perform complete classification on a single post."""
# Topic classification
topic_result = self._classify(text, TOPIC_LABELS)
# Content type classification
content_type_result = self._classify(text, CONTENT_TYPE_LABELS)
# Sentiment classification
sentiment_result = self._classify(text, SENTIMENT_LABELS)
# Engagement intent classification
engagement_result = self._classify(text, ENGAGEMENT_LABELS)
# Determine primary classifications
primary_topic = max(topic_result.items(), key=lambda x: x[1])[0] if topic_result else "uncategorized"
primary_sentiment = max(sentiment_result.items(), key=lambda x: x[1])[0] if sentiment_result else "neutral"
# Check if promotional
promotional_indicators = ["promotional content", "promoting a product", "product announcement"]
is_promotional = any(
any(ind in label.lower() for ind in promotional_indicators)
for label in list(content_type_result.keys()) + list(engagement_result.keys())
)
return ClassificationResult(
post_id=post_id,
text=text,
platform=platform,
topic=topic_result,
content_type=content_type_result,
sentiment=sentiment_result,
engagement_intent=engagement_result,
primary_topic=primary_topic,
primary_sentiment=primary_sentiment,
is_promotional=is_promotional
)
def classify_batch(
self,
posts: list[dict],
text_field: str = "text",
id_field: str = "id",
platform_field: str = "platform"
) -> list[ClassificationResult]:
"""Classify multiple posts."""
results = []
for post in posts:
result = self.classify(
post_id=str(post.get(id_field, "")),
text=post[text_field],
platform=post.get(platform_field, "unknown")
)
results.append(result)
return results
def get_summary(self, results: list[ClassificationResult]) -> dict:
"""Generate summary statistics from classification results."""
topic_counts = {}
sentiment_counts = {}
promotional_count = 0
for result in results:
# Count topics
topic_counts[result.primary_topic] = topic_counts.get(result.primary_topic, 0) + 1
# Count sentiments
sentiment_counts[result.primary_sentiment] = sentiment_counts.get(result.primary_sentiment, 0) + 1
# Count promotional
if result.is_promotional:
promotional_count += 1
return {
"total_posts": len(results),
"topic_distribution": topic_counts,
"sentiment_distribution": sentiment_counts,
"promotional_percentage": round(promotional_count / len(results) * 100, 1) if results else 0
}
Step 6: Usage Examples
Basic Usage
# Initialize classifier
classifier = SocialMediaClassifier(device="cpu")
# Single post classification
post_text = """
Just discovered this amazing coffee shop in downtown!
The latte art is incredible and the vibes are immaculate ☕✨
Definitely my new favorite spot. Who else loves finding hidden gems?
"""
result = classifier.classify(
post_id="post_001",
text=post_text,
platform="instagram"
)
print(f"Topic: {result.primary_topic}")
print(f"Sentiment: {result.primary_sentiment}")
print(f"Is Promotional: {result.is_promotional}")
print(f"All topics: {result.topic}")
Batch Processing
# Sample posts from different platforms
posts = [
{
"id": "tw_001",
"text": "Hot take: tabs are better than spaces. Fight me.",
"platform": "twitter"
},
{
"id": "ig_001",
"text": "Morning routine 🌅 5am wake up, meditation, cold shower, gym. Discipline = freedom",
"platform": "instagram"
},
{
"id": "li_001",
"text": "Excited to share that our team just closed a $10M Series A! Thank you to all our investors and supporters.",
"platform": "linkedin"
},
{
"id": "tw_002",
"text": "Anyone else's code working perfectly locally but failing in production? Just me? 🙃",
"platform": "twitter"
},
{
"id": "ig_002",
"text": "Ad: Use code SUMMER20 for 20% off my new ebook! Link in bio 📚",
"platform": "instagram"
}
]
# Classify all posts
results = classifier.classify_batch(posts)
# Print results
for result in results:
print(f"\n{result.post_id} ({result.platform}):")
print(f" Topic: {result.primary_topic}")
print(f" Sentiment: {result.primary_sentiment}")
print(f" Promotional: {'Yes' if result.is_promotional else 'No'}")
# Get summary statistics
summary = classifier.get_summary(results)
print(f"\n--- Summary ---")
print(f"Total posts: {summary['total_posts']}")
print(f"Promotional: {summary['promotional_percentage']}%")
print(f"Topics: {summary['topic_distribution']}")
Step 7: Performance Optimization
GPU Acceleration
import torch
# Check GPU availability
if torch.cuda.is_available():
print(f"GPU available: {torch.cuda.get_device_name(0)}")
classifier = SocialMediaClassifier(device="cuda:0")
else:
print("Running on CPU")
classifier = SocialMediaClassifier(device="cpu")
Batch Processing with Progress
from tqdm import tqdm
def classify_with_progress(
classifier: SocialMediaClassifier,
posts: list[dict],
batch_size: int = 100
) -> list[ClassificationResult]:
"""Classify posts with progress bar."""
results = []
for i in tqdm(range(0, len(posts), batch_size), desc="Classifying"):
batch = posts[i:i + batch_size]
batch_results = classifier.classify_batch(batch)
results.extend(batch_results)
return results
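Note that classify_batch still calls the pipeline once per post, so batch_size above only chunks the progress bar. The pipeline itself appears to accept a list of texts in one call (the [0] indexing in earlier snippets reflects its list-shaped output); if your installed version supports that, a sketch of true batched inference:

def classify_texts_batched(texts: list[str], labels: list[str], threshold: float = 0.4) -> list[dict]:
    """One pipeline call for many texts; returns one label -> score dict per text."""
    all_results = classifier.pipeline(texts, labels, threshold=threshold)
    return [{r['label']: r['score'] for r in per_text} for per_text in all_results]

# Example: topic scores for the Step 6 sample posts in a single call
topic_dicts = classify_texts_batched([p["text"] for p in posts], TOPIC_LABELS)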
Memory-Efficient Processing
import json

def classify_stream(
classifier: SocialMediaClassifier,
post_iterator,
output_file: str
):
"""
Process posts as a stream, writing results immediately.
Memory-efficient for very large datasets.
"""
with open(output_file, 'w') as f:
f.write('[\n')
first = True
for post in post_iterator:
result = classifier.classify(
post_id=post['id'],
text=post['text'],
platform=post.get('platform', 'unknown')
)
if not first:
f.write(',\n')
first = False
json.dump(result.to_dict(), f, indent=2)
f.write('\n]')
# Usage with file-based iteration
def read_posts_from_file(filepath: str):
"""Generator to read posts line by line."""
with open(filepath, 'r') as f:
for line in f:
yield json.loads(line)
# Process large dataset
# classify_stream(classifier, read_posts_from_file("posts.jsonl"), "results.json")
Caching for Repeated Classifications
import hashlib
class CachedClassifier(SocialMediaClassifier):
"""Classifier with result caching for repeated texts."""
def __init__(self, cache_size: int = 10000, **kwargs):
super().__init__(**kwargs)
self.cache_size = cache_size
self._cache = {}
def _get_cache_key(self, text: str, labels: tuple) -> str:
"""Generate cache key from text and labels."""
content = f"{text}:{':'.join(sorted(labels))}"
return hashlib.md5(content.encode()).hexdigest()
    def _classify(
        self,
        text: str,
        labels: list[str]
    ) -> dict:
        """Classify with caching; overrides the base method so classify() uses the cache."""
        cache_key = self._get_cache_key(text, tuple(labels))
        if cache_key in self._cache:
            return self._cache[cache_key]
        result = super()._classify(text, labels)
        # Maintain cache size: evict the oldest entry (dicts preserve insertion order)
        if len(self._cache) >= self.cache_size:
            self._cache.pop(next(iter(self._cache)))
        self._cache[cache_key] = result
        return result
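Because classify() routes through _classify internally, the override above means repeated texts are served from the cache automatically. A quick check:

cached = CachedClassifier(cache_size=1000, device="cpu")
text = "Flash sale! 50% off everything this weekend only!"
cached.classify(post_id="a", text=text)  # computed: one cache entry per label set
cached.classify(post_id="b", text=text)  # identical text: all four lookups hit the cache
print(f"Cache entries: {len(cached._cache)}")  # 4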
Step 8: Content Moderation Example
MODERATION_LABELS = [
"spam or scam content",
"hate speech or discrimination",
"harassment or bullying",
"misinformation or fake news",
"adult or explicit content",
"violence or threats",
"self-harm or dangerous content",
"safe and appropriate content"
]
def moderate_content(
classifier: SocialMediaClassifier,
text: str,
threshold: float = 0.5
) -> dict:
"""
Check post for policy violations.
"""
results = classifier.pipeline(text, MODERATION_LABELS, threshold=threshold)[0]
result_dict = {r['label']: r['score'] for r in results}
# Determine if content is safe
safe_label = "safe and appropriate content"
is_safe = safe_label in result_dict and result_dict[safe_label] > 0.6
# Get violations (excluding safe label)
violations = {
k: v for k, v in result_dict.items()
if k != safe_label and v > threshold
}
return {
"is_safe": is_safe and not violations,
"violations": violations,
"requires_review": bool(violations),
"confidence": result_dict.get(safe_label, 0.0)
}
# Example
classifier = SocialMediaClassifier(device="cpu")
suspicious_post = "Make $10,000 from home! DM me for details! 💰🔥"
moderation_result = moderate_content(classifier, suspicious_post)
print(f"Safe: {moderation_result['is_safe']}")
print(f"Violations: {moderation_result['violations']}")
# Output:
# Safe: False
# Violations: {'spam or scam content': 0.89}
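In production you will usually route posts by severity rather than just flag them. A minimal sketch of one possible policy; the tiers and cutoffs below are illustrative choices, not recommendations:

def route_post(moderation: dict) -> str:
    """Map a moderate_content() result to an action tier."""
    if moderation["is_safe"]:
        return "publish"
    # Any high-confidence violation triggers an automatic block
    if any(score >= 0.8 for score in moderation["violations"].values()):
        return "block"
    return "human_review"

print(route_post(moderation_result))  # e.g. "block" for the spam example above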
Step 9: Export and Integration
Export to CSV
import csv
def export_to_csv(results: list[ClassificationResult], filepath: str):
"""Export classification results to CSV."""
with open(filepath, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow([
'post_id', 'platform', 'primary_topic', 'primary_sentiment',
'is_promotional', 'text_preview', 'classified_at'
])
for result in results:
writer.writerow([
result.post_id,
result.platform,
result.primary_topic,
result.primary_sentiment,
result.is_promotional,
result.text[:50] + "..." if len(result.text) > 50 else result.text,
result.classified_at
])
Export to JSON Lines
def export_to_jsonl(results: list[ClassificationResult], filepath: str):
"""Export results to JSON Lines format."""
with open(filepath, 'w') as f:
for result in results:
f.write(json.dumps(result.to_dict()) + '\n')
Webhook Integration
import requests
def send_to_webhook(
result: ClassificationResult,
webhook_url: str,
filter_promotional: bool = False
):
"""Send classification result to webhook."""
if filter_promotional and not result.is_promotional:
return None
payload = {
"event": "post_classified",
"data": result.to_dict()
}
    response = requests.post(webhook_url, json=payload, timeout=10)
return response.status_code == 200
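Webhook endpoints fail transiently, so a retry with backoff around send_to_webhook is worth having. A sketch; the retry count and delays are arbitrary choices:

import time

def send_with_retry(result: ClassificationResult, webhook_url: str, retries: int = 3) -> bool:
    """Retry the webhook with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            if send_to_webhook(result, webhook_url):
                return True
        except requests.RequestException:
            pass  # network error: fall through to backoff and retry
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    return False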
Best Practices
- Choose appropriate thresholds: Start with 0.4 for multi-label scenarios; increase to 0.6+ when you need single-label precision.
- Use descriptive labels: "product announcement or launch" works better than just "announcement".
- Preprocess text: Remove URLs, excessive emojis, and hashtags if they add noise:

import re

def clean_post(text: str) -> str:
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    return text.strip()

- Batch for throughput: Process posts in batches of 50-100 for optimal GPU utilization.
- Cache repeated content: Social media often contains duplicate or near-duplicate posts.
- Monitor model drift: Periodically validate classifications against human labels.
- Handle edge cases: Very short posts (fewer than 10 words) may have lower accuracy; consider flagging them for review.
Troubleshooting
| Issue | Solution |
|---|---|
| Out of memory | Reduce batch size, use CPU, or truncate long inputs |
| Slow inference | Use GPU, reduce labels per request, enable caching |
| Low accuracy | Use more descriptive labels, lower threshold, preprocess text |
| Model download fails | Check internet connection, set HF_HOME for custom cache location |
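For the cache-location fix in the last row: HF_HOME must be set before any library reads it, either in the shell or at the top of your script. For example (the path below is a placeholder):

import os

# Set before importing gliclass/transformers so the cache location is picked up
os.environ["HF_HOME"] = "/data/hf-cache"

from gliclass import GLiClassModel  # model files now cache under /data/hf-cache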
Next Steps
- PII Detection and Redaction — Remove personal data before analysis
- Customer Intent Classification — Apply similar techniques to support tickets
- Financial Spam Detection — Train custom classifiers with GLiClass