Production Serving
GLiClass includes a Ray Serve deployment for production inference. It supports dynamic request batching, multiple replicas, CPU or GPU execution, model compilation, memory-aware batch sizing, an HTTP client, and an in-process Python API.
Installation
Install GLiClass with its serving dependencies:
pip install "gliclass[serve]"
The serve extra installs Ray Serve, Requests, and PyYAML in addition to the standard GLiClass dependencies.
Start the Server
Start the default model, knowledgator/gliclass-edge-v3.0, on port 8000:
python -m gliclass.serve
Select a model and port:
python -m gliclass.serve \
--model knowledgator/gliclass-edge-v3.0 \
--port 8000
The default endpoint is http://localhost:8000/gliclass. Run the following command to see all CLI options:
python -m gliclass.serve --help
The default configuration uses CUDA. For a machine without a GPU, add --device cpu --dtype float32 --num-gpus-per-replica 0.
Send Requests
Python Client
GLiClassClient.classify classifies one text and returns a list of label-score dictionaries:
from gliclass.serve import GLiClassClient
client = GLiClassClient("http://localhost:8000/gliclass")
result = client.classify(
    text="This is a great product!",
    labels=["positive", "negative", "neutral"],
    threshold=0.3,
    multi_label=True,
)
print(result)
# [{"label": "positive", "score": 0.95}, ...]
The client also accepts the pipeline's few-shot examples and task prompt:
result = client.classify(
    text="Fast delivery and the item works perfectly!",
    labels=["positive", "negative", "product", "shipping"],
    examples=[
        {"text": "Excellent quality.", "labels": ["positive", "product"]},
        {"text": "The package arrived late.", "labels": ["negative", "shipping"]},
    ],
    prompt="Classify the sentiment and subject of this review:",
)
Check whether the endpoint is reachable with client.health_check().
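A server that is still loading its model may not answer immediately, so startup scripts often poll the health check before sending traffic. The sketch below is one way to do that. `wait_until_healthy` is a hypothetical helper, not part of GLiClass, and it assumes that `client.health_check()` returns a truthy value when the endpoint is reachable and may raise while the server is still starting.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `check` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Assumption: errors raised during startup mean "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With a client in hand, the call would look like `wait_until_healthy(client.health_check, timeout_s=120)`.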
HTTP API
Send a POST request to the configured route:
curl -X POST http://localhost:8000/gliclass \
-H "Content-Type: application/json" \
-d '{
"text": "This is a great product!",
"labels": ["positive", "negative", "neutral"],
"threshold": 0.3,
"multi_label": true
}'
The request body supports these fields:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to classify. texts is accepted as a compatibility alias. |
| labels | string[] | Yes | Candidate labels. |
| threshold | number | No | Confidence threshold. Uses default_threshold when omitted. |
| multi_label | boolean | No | true for multi-label or false for single-label classification. Defaults to true. |
| examples | object[] | No | Few-shot examples in the same format as the pipeline. |
| prompt | string | No | Task description prompt. |
| adapter_id | string | No | ID of a loaded PolyLoRA adapter. |
The response is a JSON array:
[
{"label": "positive", "score": 0.95}
]
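The same request can be sent from Python without the GLiClass client, using only the standard library. This is a minimal sketch: `build_payload` and `post_classify` are hypothetical helper names, and the sketch assumes the server accepts a body that simply leaves out unset optional fields, as the field table suggests.

```python
import json
import urllib.request
from typing import Any, Optional


def build_payload(
    text: str,
    labels: list[str],
    threshold: Optional[float] = None,
    multi_label: Optional[bool] = None,
    adapter_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build a request body, leaving out optional fields that are unset."""
    payload: dict[str, Any] = {"text": text, "labels": labels}
    optional = {
        "threshold": threshold,
        "multi_label": multi_label,
        "adapter_id": adapter_id,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def post_classify(
    url: str, payload: dict[str, Any], timeout_s: float = 30.0
) -> list[dict[str, Any]]:
    """POST the payload as JSON and decode the JSON array response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `post_classify("http://localhost:8000/gliclass", build_payload("This is a great product!", ["positive", "negative"], threshold=0.3))` mirrors the curl command above.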
The HTTP endpoint processes one text per request. If texts is an array, only its first item is processed. Send concurrent requests to benefit from server-side dynamic batching, or use the pipeline directly for offline batches.
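Because each request carries one text, the server can only form batches when several requests are in flight at once. One way to do this from a client is a thread pool, sketched below. `classify_concurrently` is a hypothetical helper, and the sketch assumes that a single client object can safely be shared across threads; if that does not hold, create one client per worker instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def classify_concurrently(
    classify: Callable[..., Any],
    texts: Sequence[str],
    labels: Sequence[str],
    max_workers: int = 16,
    **kwargs: Any,
) -> list[Any]:
    """Send one request per text in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(classify, text=text, labels=list(labels), **kwargs)
            for text in texts
        ]
        # Collect in submission order so results line up with `texts`.
        return [future.result() for future in futures]
```

A call such as `classify_concurrently(client.classify, texts, labels, threshold=0.3)` keeps up to `max_workers` requests in flight, which gives the server something to batch.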
Configure the Deployment
For reproducible deployments, create a YAML file:
model: knowledgator/gliclass-edge-v3.0
device: cuda
dtype: float16
max_model_len: 2048
max_labels: -1
max_labels_alloc: dynamic
default_threshold: 0.5
num_replicas: 1
num_gpus_per_replica: 1.0
num_cpus_per_replica: 1.0
max_batch_size: 32
batch_wait_timeout_ms: 20.0
max_ongoing_requests: 256
queue_capacity: 4096
route_prefix: /gliclass
http_port: 8000
enable_compilation: false
precompile_on_startup: false
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]
calibrate_on_startup: false
use_memory_aware_batching: false
target_memory_fraction: 0.8
memory_overhead_factor: 1.3
ray_address: null
Start it with:
python -m gliclass.serve --config serve_config.yaml
CLI arguments override values loaded from the YAML file:
python -m gliclass.serve \
--config serve_config.yaml \
--num-replicas 2 \
--max-batch-size 64
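The precedence described here can be pictured as three layers merged in order. The sketch below only illustrates that rule and is not the actual `gliclass.serve` implementation; `merge_config` is a hypothetical function, and representing an omitted CLI flag as `None` is an assumption.

```python
from typing import Any, Mapping


def merge_config(
    defaults: Mapping[str, Any],
    yaml_values: Mapping[str, Any],
    cli_values: Mapping[str, Any],
) -> dict[str, Any]:
    """Layer settings: defaults < YAML file < explicitly passed CLI flags."""
    merged = dict(defaults)
    merged.update(yaml_values)
    # A CLI value of None stands for "flag not given", so it must not override.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


config = merge_config(
    defaults={"num_replicas": 1, "max_batch_size": 32, "http_port": 8000},
    yaml_values={"max_batch_size": 32, "http_port": 8000},
    cli_values={"num_replicas": 2, "max_batch_size": 64, "http_port": None},
)
print(config)
# {'num_replicas': 2, 'max_batch_size': 64, 'http_port': 8000}
```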
Important settings include:
| Setting | Default | Description |
|---|---|---|
| model | knowledgator/gliclass-edge-v3.0 from the CLI | Hugging Face model ID or local path. |
| device / dtype | cuda / bfloat16 | Inference device and model weight type. |
| max_model_len | 2048 | Maximum input sequence length. |
| max_labels | -1 | Maximum candidate labels; -1 is unlimited. Longer lists are truncated when a positive limit is set. |
| max_labels_alloc | dynamic | Label memory allocation strategy: dynamic, fixed, or an integer. |
| num_replicas | 1 | Number of Ray Serve model replicas. |
| max_batch_size | 32 | Maximum number of concurrent requests combined into one model batch. |
| batch_wait_timeout_ms | 20.0 | Maximum time to wait for requests to fill a batch. |
| route_prefix / http_port | /gliclass / 8000 | HTTP route and port. |
| ray_address | null | Existing Ray cluster address; null starts a local runtime. |
Compilation and Memory-Aware Batching
Enable model compilation and optional startup warmup with:
enable_compilation: true
precompile_on_startup: true
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]
warmup_iterations: 3
Memory-aware batching calibrates GPU memory use at startup and selects the largest configured precompiled batch size that fits the observed input length:
calibrate_on_startup: true
use_memory_aware_batching: true
target_memory_fraction: 0.8
memory_overhead_factor: 1.3
calibration_min_seq_len: 64
calibration_probe_batch_size: 2
This mode requires CUDA. Without calibration data, the server falls back to the largest configured precompiled batch size.
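To make the selection rule concrete, the sketch below picks a batch size from a memory budget. It is an illustration under stated assumptions, not the server's actual logic: `select_batch_size` and `bytes_per_token` are hypothetical, memory is assumed to grow linearly with sequence length, and the way `target_memory_fraction` and `memory_overhead_factor` enter the estimate is a guess based on their names.

```python
from typing import Optional, Sequence


def select_batch_size(
    batch_sizes: Sequence[int],
    bytes_per_token: Optional[float],
    seq_len: int,
    available_bytes: float,
    target_memory_fraction: float = 0.8,
    memory_overhead_factor: float = 1.3,
) -> int:
    """Pick the largest configured batch size whose estimated memory fits."""
    sizes = sorted(batch_sizes)
    if bytes_per_token is None:
        # No calibration data: fall back to the largest configured size.
        return sizes[-1]
    budget = available_bytes * target_memory_fraction
    # Assumed cost model: linear in sequence length, padded by the overhead factor.
    per_sample = bytes_per_token * seq_len * memory_overhead_factor
    fitting = [size for size in sizes if size * per_sample <= budget]
    # Assumption: if nothing fits, use the smallest size instead of failing.
    return fitting[-1] if fitting else sizes[0]
```

Under this model, longer inputs shrink the chosen batch size, while a missing calibration result leaves the largest precompiled size in use, matching the fallback described above.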
In-Process Python API
GLiClassFactory starts Ray Serve and exposes blocking and asynchronous prediction without HTTP:
from gliclass.serve import GLiClassFactory
with GLiClassFactory(
    model="knowledgator/gliclass-edge-v3.0",
    device="cuda",
    dtype="float16",
) as classifier:
    results = classifier.predict(
        ["Great product!", "Terrible experience."],
        labels=["positive", "negative", "neutral"],
    )
For async applications, use await classifier.predict_async(...). Concurrent predictions are accumulated by Ray Serve into model batches. The context manager shuts down both the Serve deployment and its Ray runtime.
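An async caller can issue several predictions at once and let Ray Serve group them. The sketch below shows one possible pattern. `predict_many` is a hypothetical helper, and it assumes that `predict_async` takes the same arguments as `predict` (a list of texts plus a `labels` keyword), which this page does not confirm.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def predict_many(
    predict_async: Callable[..., Awaitable[Any]],
    texts: Sequence[str],
    labels: Sequence[str],
) -> list[Any]:
    """Issue one asynchronous prediction per text and await them together."""
    # Assumption: predict_async mirrors predict(texts, labels=...).
    tasks = [predict_async([text], labels=list(labels)) for text in texts]
    # gather preserves the order of the awaitables it was given.
    return await asyncio.gather(*tasks)
```

Inside the context manager, this would be awaited as `results = await predict_many(classifier.predict_async, texts, labels)`.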
PolyLoRA Adapter Serving
The serving module can route requests to adapters through PolyLoRA. Install PolyLoRA separately; it is not included in the serve extra:
pip install polylora
Requested adapters must already exist in the adapter store.
Enable adapter serving in YAML:
enable_polylora: true
polylora_adapter_weight_modules: [query, key, value]
polylora_max_rank: 16
polylora_max_gpu_adapters: 8
polylora_max_cpu_adapters: 128
polylora_disk_cache_dir: /var/cache/gliclass/adapters
polylora_base_adapter_id: __base__
Select an adapter by passing adapter_id through the Python client or HTTP request. Omit it to use the base model.
Inspect the adapter cache:
status = client.adapter_cache_status()
adapter_status = client.adapter_cache_status("support-domain")
is_cached = client.is_adapter_cached("support-domain")
The equivalent HTTP endpoint is:
curl "http://localhost:8000/gliclass/adapter-cache?adapter_id=support-domain"
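When calling this endpoint from Python without the client, the query string should be URL-encoded. The sketch below builds the URL only. `adapter_cache_url` is a hypothetical helper, and it assumes that omitting `adapter_id` returns the overall cache status, by analogy with calling `client.adapter_cache_status()` without arguments.

```python
from typing import Optional
from urllib.parse import urlencode


def adapter_cache_url(base_url: str, adapter_id: Optional[str] = None) -> str:
    """Build the adapter-cache URL, adding the query only when an ID is given."""
    url = base_url.rstrip("/") + "/adapter-cache"
    if adapter_id is not None:
        url += "?" + urlencode({"adapter_id": adapter_id})
    return url
```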
Adapter IDs must match the configured polylora_adapter_id_pattern; the default accepts 1–128 letters, digits, underscores, periods, and hyphens.
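Validating IDs on the client side avoids a round trip for obviously malformed values. The regular expression below is a reading of the default described above, not the pattern shipped with GLiClass, which may differ in detail; `is_valid_adapter_id` is a hypothetical helper.

```python
import re

# Assumed equivalent of the documented default; the shipped pattern may differ.
ADAPTER_ID_PATTERN = re.compile(r"[A-Za-z0-9_.\-]{1,128}")


def is_valid_adapter_id(adapter_id: str) -> bool:
    """Return True when the whole ID matches the assumed default pattern."""
    return ADAPTER_ID_PATTERN.fullmatch(adapter_id) is not None
```

Using `fullmatch` rather than `match` matters here, because `match` would accept an ID that merely starts with valid characters.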
Source Configuration
See the upstream serve_config.yaml and gliclass.serve package for the complete configuration and implementation.