How to Categorize Social Media Posts Locally with GLiClass

Run zero-shot text classification on your local machine to categorize social media posts by topic, sentiment, and intent without sending data to external APIs.

Overview

This cookbook demonstrates using the knowledgator/gliclass-edge-v3.0 model for local, privacy-preserving classification of social media content. The edge model is optimized for fast inference on consumer hardware while maintaining high accuracy.

What You'll Learn

  • Set up GLiClass for local inference
  • Define category taxonomies for social media content
  • Classify posts by topic, sentiment, and engagement potential
  • Handle multi-platform content (Twitter/X, Instagram, LinkedIn, TikTok)
  • Optimize performance for batch processing
  • Build a complete local classification pipeline

Prerequisites

  • Python 3.10+
  • 4GB+ RAM (8GB recommended)
  • GPU optional but recommended for batch processing
  • Sample social media posts for testing

Use Cases

  • Content moderation and filtering
  • Social media analytics dashboards
  • Trend detection and monitoring
  • Influencer content analysis
  • Brand mention categorization
  • Competitor content tracking

Why Run Locally?

Running classification locally offers several advantages:

  • Privacy: sensitive social data never leaves your infrastructure
  • Cost: no per-request API fees for high-volume processing
  • Latency: sub-100ms inference without network round-trips
  • Offline: works without internet connectivity
  • Control: full control over model versions and updates

The GLiClass Edge Model

The knowledgator/gliclass-edge-v3.0 model is a compact zero-shot classifier optimized for edge deployment:

  • Model size: ~200MB
  • Inference speed: ~50ms per text (CPU)
  • Max sequence length: 512 tokens
  • Zero-shot: yes (no training required)
  • Multilingual: English primary, partial multilingual support

Installation

Install Dependencies

pip install gliclass torch transformers

For GPU Acceleration (Optional)

# CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121

Verify Installation

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

# Load model and tokenizer (downloads on first run)
model = GLiClassModel.from_pretrained("knowledgator/gliclass-edge-v3.0")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-edge-v3.0")

# Create classification pipeline
pipeline = ZeroShotClassificationPipeline(
    model, tokenizer,
    classification_type='multi-label',
    device='cpu'
)

print("Model loaded successfully!")

# Quick test
text = "Just shipped a new feature! So excited to share it with everyone."
labels = ["announcement", "question", "complaint", "casual conversation"]

results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(f"{result['label']} => {result['score']:.3f}")
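
The 512-token limit from the spec list above matters for long posts: anything beyond it gets truncated. Below is a minimal sketch, reusing the tokenizer just loaded, that trims oversized posts before classification. The truncate_to_limit helper is written for this cookbook, not part of the GLiClass API.

MAX_TOKENS = 512

def truncate_to_limit(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Trim text to the model's token window. Hypothetical helper for
    this cookbook, not part of the GLiClass API."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    # Keep the first max_tokens tokens and decode back to a string
    return tokenizer.decode(token_ids[:max_tokens], skip_special_tokens=True)

long_post = "word " * 1000  # deliberately oversized input
print(len(tokenizer.encode(long_post)))                      # well above 512
print(len(tokenizer.encode(truncate_to_limit(long_post))))   # at or below 512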

Step 1: Define Social Media Categories

Topic Categories

TOPIC_LABELS = [
    "technology and software",
    "business and entrepreneurship",
    "health and fitness",
    "food and cooking",
    "travel and adventure",
    "fashion and beauty",
    "entertainment and pop culture",
    "sports",
    "politics and news",
    "education and learning",
    "art and creativity",
    "personal life update",
    "humor and memes",
    "motivational and inspirational"
]

Content Type Categories

CONTENT_TYPE_LABELS = [
    "product announcement or launch",
    "promotional content or advertisement",
    "educational tutorial or how-to",
    "opinion or commentary",
    "question asking for advice",
    "personal story or experience",
    "news or current events",
    "behind-the-scenes content",
    "user-generated testimonial",
    "engagement bait or poll",
    "meme or humorous content",
    "inspirational quote or message"
]

Sentiment Categories

SENTIMENT_LABELS = [
    "very positive and enthusiastic",
    "positive and satisfied",
    "neutral or informational",
    "negative or disappointed",
    "angry or frustrated",
    "sarcastic or ironic"
]

Engagement Intent Categories

ENGAGEMENT_LABELS = [
    "seeking likes and shares",
    "starting a discussion",
    "asking for help or advice",
    "sharing information",
    "promoting a product or service",
    "building personal brand",
    "networking and connecting",
    "entertainment only"
]

Step 2: Basic Classification

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

# Initialize model and pipeline
model = GLiClassModel.from_pretrained("knowledgator/gliclass-edge-v3.0")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-edge-v3.0")

pipeline = ZeroShotClassificationPipeline(
    model, tokenizer,
    classification_type='multi-label',
    device='cpu'
)


def classify_post(text: str, labels: list[str], threshold: float = 0.3) -> dict:
    """
    Classify a social media post against given labels.

    Args:
        text: The post content
        labels: List of category labels
        threshold: Minimum confidence score

    Returns:
        Dictionary of label -> score for labels above threshold
    """
    results = pipeline(text, labels, threshold=threshold)[0]
    return {r['label']: r['score'] for r in results}


# Example post
post = """
🚀 Finally launched my new SaaS product after 6 months of building!
It helps small businesses automate their invoicing.
Would love your feedback - link in bio!
#startup #entrepreneurship #buildinpublic
"""

# Classify by topic
topic = classify_post(post, TOPIC_LABELS)
print("Topic:", topic)
# Output: {'business and entrepreneurship': 0.92, 'technology and software': 0.71}

# Classify by content type
content_type = classify_post(post, CONTENT_TYPE_LABELS)
print("Content Type:", content_type)
# Output: {'product announcement or launch': 0.95, 'promotional content or advertisement': 0.68}

# Classify sentiment
sentiment = classify_post(post, SENTIMENT_LABELS)
print("Sentiment:", sentiment)
# Output: {'very positive and enthusiastic': 0.89}

Step 3: Multi-Label Classification

Social media posts often belong to multiple categories:

def classify_multi_label(
    text: str,
    labels: list[str],
    threshold: float = 0.4,
    max_labels: int = 3
) -> list[tuple[str, float]]:
    """
    Return multiple matching labels sorted by confidence.
    """
    results = pipeline(text, labels, threshold=threshold)[0]

    # Sort by score descending
    sorted_results = sorted(
        results,
        key=lambda x: x['score'],
        reverse=True
    )

    return [(r['label'], r['score']) for r in sorted_results[:max_labels]]


# Example: Post with multiple topics
post = """
Made this amazing avocado toast while watching the game.
Perfect Sunday vibes! Recipe in comments 🥑⚽
"""

labels = classify_multi_label(post, TOPIC_LABELS)
for label, score in labels:
    print(f"  {label}: {score:.2f}")

# Output:
# food and cooking: 0.87
# sports: 0.62
# personal life update: 0.58

Step 4: Platform-Specific Classification

Different platforms have different content styles:

from dataclasses import dataclass
from enum import Enum

class Platform(Enum):
    TWITTER = "twitter"
    INSTAGRAM = "instagram"
    LINKEDIN = "linkedin"
    TIKTOK = "tiktok"
    FACEBOOK = "facebook"

# Platform-specific labels
PLATFORM_LABELS = {
    Platform.TWITTER: [
        "hot take or opinion",
        "thread or long-form content",
        "news commentary",
        "viral moment reaction",
        "community engagement",
        "self-promotion",
        "humor or shitpost"
    ],
    Platform.INSTAGRAM: [
        "lifestyle showcase",
        "product feature",
        "behind-the-scenes",
        "aesthetic or mood post",
        "story highlight",
        "collaboration or partnership",
        "user-generated content repost"
    ],
    Platform.LINKEDIN: [
        "career update or announcement",
        "thought leadership",
        "industry insight",
        "job opportunity",
        "company news",
        "professional achievement",
        "networking request",
        "motivational content"
    ],
    Platform.TIKTOK: [
        "trend participation",
        "tutorial or how-to",
        "comedy skit",
        "storytime",
        "product review",
        "dance or music content",
        "day in my life",
        "duet or stitch response"
    ]
}

@dataclass
class SocialPost:
    """Represents a social media post with metadata."""
    text: str
    platform: Platform
    author: str = ""
    hashtags: list[str] | None = None
    mentions: list[str] | None = None

    def __post_init__(self):
        # Normalize None defaults to empty lists
        self.hashtags = self.hashtags or []
        self.mentions = self.mentions or []


def classify_by_platform(post: SocialPost, threshold: float = 0.4) -> dict:
    """Classify post using platform-specific categories."""
    platform_labels = PLATFORM_LABELS.get(post.platform, TOPIC_LABELS)

    results = pipeline(post.text, platform_labels, threshold=threshold)[0]
    categories = {r['label']: r['score'] for r in results}

    return {
        "platform": post.platform.value,
        "categories": categories,
        "primary_category": max(categories.items(), key=lambda x: x[1])[0] if categories else "uncategorized"
    }


# Example usage
linkedin_post = SocialPost(
    text="""
    Excited to announce I've joined Acme Corp as Senior Engineer!
    After 5 years at my previous role, I'm ready for this new challenge.
    Grateful for everyone who supported me on this journey.
    #newjob #career #grateful
    """,
    platform=Platform.LINKEDIN,
    author="jane_doe",
    hashtags=["newjob", "career", "grateful"]
)

result = classify_by_platform(linkedin_post)
print(f"Platform: {result['platform']}")
print(f"Primary: {result['primary_category']}")
print(f"All categories: {result['categories']}")

Step 5: Complete Classification Pipeline

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json

@dataclass
class ClassificationResult:
    """Complete classification output for a social media post."""
    post_id: str
    text: str
    platform: str
    topic: dict
    content_type: dict
    sentiment: dict
    engagement_intent: dict
    primary_topic: str
    primary_sentiment: str
    is_promotional: bool
    classified_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_dict(self) -> dict:
        return {
            "post_id": self.post_id,
            "text": self.text[:100] + "..." if len(self.text) > 100 else self.text,
            "platform": self.platform,
            "primary_topic": self.primary_topic,
            "primary_sentiment": self.primary_sentiment,
            "is_promotional": self.is_promotional,
            "topic_scores": self.topic,
            "sentiment_scores": self.sentiment,
            "content_type_scores": self.content_type,
            "engagement_intent_scores": self.engagement_intent,
            "classified_at": self.classified_at
        }


class SocialMediaClassifier:
    """
    Complete local classification pipeline for social media posts.
    """

    def __init__(
        self,
        model_name: str = "knowledgator/gliclass-edge-v3.0",
        device: str = "cpu",
        threshold: float = 0.4,
        classification_type: str = "multi-label"
    ):
        self.model = GLiClassModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.threshold = threshold

        self.pipeline = ZeroShotClassificationPipeline(
            self.model,
            self.tokenizer,
            classification_type=classification_type,
            device=device
        )

    def _classify(self, text: str, labels: list[str]) -> dict:
        """Internal classification method."""
        results = self.pipeline(text, labels, threshold=self.threshold)[0]
        return {r['label']: r['score'] for r in results}

    def classify(
        self,
        post_id: str,
        text: str,
        platform: str = "unknown"
    ) -> ClassificationResult:
        """Perform complete classification on a single post."""

        # Topic classification
        topic_result = self._classify(text, TOPIC_LABELS)

        # Content type classification
        content_type_result = self._classify(text, CONTENT_TYPE_LABELS)

        # Sentiment classification
        sentiment_result = self._classify(text, SENTIMENT_LABELS)

        # Engagement intent classification
        engagement_result = self._classify(text, ENGAGEMENT_LABELS)

        # Determine primary classifications
        primary_topic = max(topic_result.items(), key=lambda x: x[1])[0] if topic_result else "uncategorized"
        primary_sentiment = max(sentiment_result.items(), key=lambda x: x[1])[0] if sentiment_result else "neutral"

        # Check if promotional
        promotional_indicators = ["promotional content", "promoting a product", "product announcement"]
        is_promotional = any(
            any(ind in label.lower() for ind in promotional_indicators)
            for label in list(content_type_result.keys()) + list(engagement_result.keys())
        )

        return ClassificationResult(
            post_id=post_id,
            text=text,
            platform=platform,
            topic=topic_result,
            content_type=content_type_result,
            sentiment=sentiment_result,
            engagement_intent=engagement_result,
            primary_topic=primary_topic,
            primary_sentiment=primary_sentiment,
            is_promotional=is_promotional
        )

    def classify_batch(
        self,
        posts: list[dict],
        text_field: str = "text",
        id_field: str = "id",
        platform_field: str = "platform"
    ) -> list[ClassificationResult]:
        """Classify multiple posts."""
        results = []

        for post in posts:
            result = self.classify(
                post_id=str(post.get(id_field, "")),
                text=post[text_field],
                platform=post.get(platform_field, "unknown")
            )
            results.append(result)

        return results

    def get_summary(self, results: list[ClassificationResult]) -> dict:
        """Generate summary statistics from classification results."""
        topic_counts = {}
        sentiment_counts = {}
        promotional_count = 0

        for result in results:
            # Count topics
            topic_counts[result.primary_topic] = topic_counts.get(result.primary_topic, 0) + 1

            # Count sentiments
            sentiment_counts[result.primary_sentiment] = sentiment_counts.get(result.primary_sentiment, 0) + 1

            # Count promotional
            if result.is_promotional:
                promotional_count += 1

        return {
            "total_posts": len(results),
            "topic_distribution": topic_counts,
            "sentiment_distribution": sentiment_counts,
            "promotional_percentage": round(promotional_count / len(results) * 100, 1) if results else 0
        }

Step 6: Usage Examples

Basic Usage

# Initialize classifier
classifier = SocialMediaClassifier(device="cpu")

# Single post classification
post_text = """
Just discovered this amazing coffee shop in downtown!
The latte art is incredible and the vibes are immaculate ☕✨
Definitely my new favorite spot. Who else loves finding hidden gems?
"""

result = classifier.classify(
    post_id="post_001",
    text=post_text,
    platform="instagram"
)

print(f"Topic: {result.primary_topic}")
print(f"Sentiment: {result.primary_sentiment}")
print(f"Is Promotional: {result.is_promotional}")
print(f"All topics: {result.topic}")

Batch Processing

# Sample posts from different platforms
posts = [
    {
        "id": "tw_001",
        "text": "Hot take: tabs are better than spaces. Fight me.",
        "platform": "twitter"
    },
    {
        "id": "ig_001",
        "text": "Morning routine 🌅 5am wake up, meditation, cold shower, gym. Discipline = freedom",
        "platform": "instagram"
    },
    {
        "id": "li_001",
        "text": "Excited to share that our team just closed a $10M Series A! Thank you to all our investors and supporters.",
        "platform": "linkedin"
    },
    {
        "id": "tw_002",
        "text": "Anyone else's code working perfectly locally but failing in production? Just me? 🙃",
        "platform": "twitter"
    },
    {
        "id": "ig_002",
        "text": "Ad: Use code SUMMER20 for 20% off my new ebook! Link in bio 📚",
        "platform": "instagram"
    }
]

# Classify all posts
results = classifier.classify_batch(posts)

# Print results
for result in results:
    print(f"\n{result.post_id} ({result.platform}):")
    print(f"  Topic: {result.primary_topic}")
    print(f"  Sentiment: {result.primary_sentiment}")
    print(f"  Promotional: {'Yes' if result.is_promotional else 'No'}")

# Get summary statistics
summary = classifier.get_summary(results)
print(f"\n--- Summary ---")
print(f"Total posts: {summary['total_posts']}")
print(f"Promotional: {summary['promotional_percentage']}%")
print(f"Topics: {summary['topic_distribution']}")

Step 7: Performance Optimization

GPU Acceleration

import torch

# Check GPU availability
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    classifier = SocialMediaClassifier(device="cuda:0")
else:
    print("Running on CPU")
    classifier = SocialMediaClassifier(device="cpu")

Batch Processing with Progress

from tqdm import tqdm

def classify_with_progress(
    classifier: SocialMediaClassifier,
    posts: list[dict],
    batch_size: int = 100
) -> list[ClassificationResult]:
    """Classify posts with progress bar."""
    results = []

    for i in tqdm(range(0, len(posts), batch_size), desc="Classifying"):
        batch = posts[i:i + batch_size]
        batch_results = classifier.classify_batch(batch)
        results.extend(batch_results)

    return results
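
Note that classify_with_progress still sends one post per pipeline call inside classify_batch. If throughput matters more than per-post detail, you can cut call overhead by handing the pipeline several texts at once. The sketch below assumes the GLiClass pipeline accepts a list of texts and returns one result list per text, which the [0] indexing used throughout this cookbook suggests; verify the behavior against your installed gliclass version.

def classify_texts_batched(
    texts: list[str],
    labels: list[str],
    threshold: float = 0.4
) -> list[dict]:
    """Classify several texts in one pipeline call.
    Assumes list input is supported; check your gliclass version."""
    batch_results = classifier.pipeline(texts, labels, threshold=threshold)
    # One {label: score} dict per input text
    return [{r['label']: r['score'] for r in per_text} for per_text in batch_results]

# Example: two posts classified in a single call
scores = classify_texts_batched(
    ["Just shipped v2.0 of our app!", "Why is my build failing again?"],
    TOPIC_LABELS
)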

Memory-Efficient Processing

import json

def classify_stream(
    classifier: SocialMediaClassifier,
    post_iterator,
    output_file: str
):
    """
    Process posts as a stream, writing results immediately.
    Memory-efficient for very large datasets.
    """
    with open(output_file, 'w') as f:
        f.write('[\n')
        first = True

        for post in post_iterator:
            result = classifier.classify(
                post_id=post['id'],
                text=post['text'],
                platform=post.get('platform', 'unknown')
            )

            if not first:
                f.write(',\n')
            first = False

            json.dump(result.to_dict(), f, indent=2)

        f.write('\n]')


# Usage with file-based iteration
def read_posts_from_file(filepath: str):
    """Generator to read posts line by line."""
    with open(filepath, 'r') as f:
        for line in f:
            yield json.loads(line)

# Process large dataset
# classify_stream(classifier, read_posts_from_file("posts.jsonl"), "results.json")

Caching for Repeated Classifications

import hashlib

class CachedClassifier(SocialMediaClassifier):
    """Classifier with result caching for repeated texts."""

    def __init__(self, cache_size: int = 10000, **kwargs):
        super().__init__(**kwargs)
        self.cache_size = cache_size
        self._cache = {}

    def _get_cache_key(self, text: str, labels: tuple) -> str:
        """Generate cache key from text and labels."""
        content = f"{text}:{':'.join(sorted(labels))}"
        return hashlib.md5(content.encode()).hexdigest()

    def _classify(self, text: str, labels: list[str]) -> dict:
        """Classify with caching. Overriding _classify means classify()
        and classify_batch() benefit from the cache automatically."""
        cache_key = self._get_cache_key(text, tuple(labels))

        if cache_key in self._cache:
            return self._cache[cache_key]

        result = super()._classify(text, labels)

        # Maintain cache size by evicting the oldest entry (insertion order)
        if len(self._cache) >= self.cache_size:
            self._cache.pop(next(iter(self._cache)))

        self._cache[cache_key] = result
        return result
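
A quick way to confirm the cache is working: classify the same text twice and time both calls. The second should return almost instantly, since all four label-set lookups hit the cache.

import time

cached = CachedClassifier(device="cpu")
text = "Hot take: tabs are better than spaces."

start = time.perf_counter()
cached.classify("dup_001", text)   # cold call: runs the model
cold = time.perf_counter() - start

start = time.perf_counter()
cached.classify("dup_002", text)   # warm call: served from the cache
warm = time.perf_counter() - start

print(f"cold: {cold:.3f}s, warm: {warm:.3f}s")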

Step 8: Content Moderation Example

MODERATION_LABELS = [
    "spam or scam content",
    "hate speech or discrimination",
    "harassment or bullying",
    "misinformation or fake news",
    "adult or explicit content",
    "violence or threats",
    "self-harm or dangerous content",
    "safe and appropriate content"
]

def moderate_content(
    classifier: SocialMediaClassifier,
    text: str,
    threshold: float = 0.5
) -> dict:
    """
    Check post for policy violations.
    """
    results = classifier.pipeline(text, MODERATION_LABELS, threshold=threshold)[0]
    result_dict = {r['label']: r['score'] for r in results}

    # Determine if content is safe
    safe_label = "safe and appropriate content"
    is_safe = safe_label in result_dict and result_dict[safe_label] > 0.6

    # Get violations (excluding safe label)
    violations = {
        k: v for k, v in result_dict.items()
        if k != safe_label and v > threshold
    }

    return {
        "is_safe": is_safe and not violations,
        "violations": violations,
        "requires_review": bool(violations),
        "confidence": result_dict.get(safe_label, 0.0)
    }


# Example
classifier = SocialMediaClassifier(device="cpu")
suspicious_post = "Make $10,000 from home! DM me for details! 💰🔥"
moderation_result = moderate_content(classifier, suspicious_post)

print(f"Safe: {moderation_result['is_safe']}")
print(f"Violations: {moderation_result['violations']}")
# Output:
# Safe: False
# Violations: {'spam or scam content': 0.89}

Step 9: Export and Integration

Export to CSV

import csv

def export_to_csv(results: list[ClassificationResult], filepath: str):
    """Export classification results to CSV."""
    with open(filepath, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([
            'post_id', 'platform', 'primary_topic', 'primary_sentiment',
            'is_promotional', 'text_preview', 'classified_at'
        ])

        for result in results:
            writer.writerow([
                result.post_id,
                result.platform,
                result.primary_topic,
                result.primary_sentiment,
                result.is_promotional,
                result.text[:50] + "..." if len(result.text) > 50 else result.text,
                result.classified_at
            ])

Export to JSON Lines

def export_to_jsonl(results: list[ClassificationResult], filepath: str):
    """Export results to JSON Lines format."""
    with open(filepath, 'w') as f:
        for result in results:
            f.write(json.dumps(result.to_dict()) + '\n')

Webhook Integration

import requests

def send_to_webhook(
    result: ClassificationResult,
    webhook_url: str,
    filter_promotional: bool = False
):
    """Send classification result to a webhook. If filter_promotional
    is True, only promotional posts are forwarded."""
    if filter_promotional and not result.is_promotional:
        return None

    payload = {
        "event": "post_classified",
        "data": result.to_dict()
    }

    response = requests.post(webhook_url, json=payload, timeout=10)
    return response.status_code == 200

Best Practices

  1. Choose appropriate thresholds: Start with 0.4 for multi-label scenarios, increase to 0.6+ for single-label precision

  2. Use descriptive labels: "product announcement or launch" works better than just "announcement"

  3. Preprocess text: Remove URLs, excessive emojis, and hashtags if they add noise

    import re

    def clean_post(text: str) -> str:
        text = re.sub(r'http\S+', '', text)  # Remove URLs
        text = re.sub(r'#\w+', '', text)     # Remove hashtags
        return text.strip()
  4. Batch for throughput: Process posts in batches of 50-100 for optimal GPU utilization

  5. Cache repeated content: Social media often has duplicate or near-duplicate posts

  6. Monitor model drift: Periodically validate classifications against human labels

  7. Handle edge cases: Very short posts (fewer than 10 words) may have lower accuracy; consider flagging them for review, as shown in the sketch after this list
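
A minimal sketch of that short-post check. The 10-word cutoff mirrors the rule of thumb in point 7 and is worth tuning on your own data:

def needs_review(text: str, min_words: int = 10) -> bool:
    """Flag very short posts, which tend to classify less reliably."""
    return len(text.split()) < min_words

post = "great stuff 🔥"
if needs_review(post):
    print("Queue for human review instead of trusting the classifier.")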


Troubleshooting

  • Out of memory: reduce the batch size, run on CPU, or enable gradient checkpointing
  • Slow inference: use a GPU, reduce the number of labels per request, enable caching
  • Low accuracy: use more descriptive labels, lower the threshold, preprocess the text
  • Model download fails: check your internet connection, or set HF_HOME for a custom cache location
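
For that last fix, HF_HOME is the standard Hugging Face environment variable for relocating the download cache. A small sketch; the path is an example to adjust for your setup:

import os

# Set before the first transformers/gliclass import in a fresh process
os.environ["HF_HOME"] = "/data/hf-cache"  # example path; adjust to your setup

from gliclass import GLiClassModel
model = GLiClassModel.from_pretrained("knowledgator/gliclass-edge-v3.0")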

Next Steps