Production Serving

GLiClass includes a Ray Serve deployment for production inference. It supports dynamic request batching, multiple replicas, CPU or GPU execution, model compilation, memory-aware batch sizing, an HTTP client, and an in-process Python API.

Installation

Install GLiClass with its serving dependencies:

pip install "gliclass[serve]"

The serve extra installs Ray Serve, Requests, and PyYAML in addition to the standard GLiClass dependencies.

Start the Server

Start the default model, knowledgator/gliclass-edge-v3.0, on port 8000:

python -m gliclass.serve

Select a model and port:

python -m gliclass.serve \
  --model knowledgator/gliclass-edge-v3.0 \
  --port 8000

The default endpoint is http://localhost:8000/gliclass. Run the following command to see all CLI options:

python -m gliclass.serve --help

Note: The default configuration uses CUDA. On a machine without a GPU, add --device cpu --dtype float32 --num-gpus-per-replica 0.

Send Requests

Python Client

GLiClassClient.classify classifies one text and returns a list of label-score dictionaries:

from gliclass.serve import GLiClassClient

client = GLiClassClient("http://localhost:8000/gliclass")

result = client.classify(
    text="This is a great product!",
    labels=["positive", "negative", "neutral"],
    threshold=0.3,
    multi_label=True,
)

print(result)
# [{"label": "positive", "score": 0.95}, ...]

The client also accepts the pipeline's few-shot examples and task prompt:

result = client.classify(
    text="Fast delivery and the item works perfectly!",
    labels=["positive", "negative", "product", "shipping"],
    examples=[
        {"text": "Excellent quality.", "labels": ["positive", "product"]},
        {"text": "The package arrived late.", "labels": ["negative", "shipping"]},
    ],
    prompt="Classify the sentiment and subject of this review:",
)

Check whether the endpoint is reachable with client.health_check().
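A startup script may want to wait for the server before sending traffic. The following is a minimal sketch of a polling loop around client.health_check(). The helper name wait_until_ready and its parameters are hypothetical and not part of gliclass. It assumes health_check() returns a truthy value when the endpoint responds; since this page does not say whether a failed check returns a falsy value or raises, the sketch treats both as "not ready".

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], object],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `check` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Assumption: an exception means the server is not reachable yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Intended use (requires a running server):
#   ready = wait_until_ready(client.health_check, timeout_s=120)
```

Passing the bound method rather than the client keeps the helper independent of the client class, so it can be exercised with any zero-argument callable.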

HTTP API

Send a POST request to the configured route:

curl -X POST http://localhost:8000/gliclass \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a great product!",
    "labels": ["positive", "negative", "neutral"],
    "threshold": 0.3,
    "multi_label": true
  }'

The request body supports these fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to classify. texts is accepted as a compatibility alias. |
| labels | string[] | Yes | Candidate labels. |
| threshold | number | No | Confidence threshold. Uses default_threshold when omitted. |
| multi_label | boolean | No | true for multi-label or false for single-label classification. Defaults to true. |
| examples | object[] | No | Few-shot examples in the same format as the pipeline. |
| prompt | string | No | Task description prompt. |
| adapter_id | string | No | ID of a loaded PolyLoRA adapter. |
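For callers that build the request body in code, the fields above can be assembled as in the sketch below. build_payload is a hypothetical helper, not part of gliclass. It leaves out optional fields that are not set, so the server applies its own defaults such as default_threshold. The local check for an empty label list is an assumption about what makes a request useful, not documented server behavior.

```python
import json
from typing import Any, Optional


def build_payload(
    text: str,
    labels: list[str],
    threshold: Optional[float] = None,
    multi_label: Optional[bool] = None,
    examples: Optional[list[dict[str, Any]]] = None,
    prompt: Optional[str] = None,
    adapter_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build a request body, omitting optional fields left as None."""
    if not labels:
        # Assumed client-side guard; the server's own validation is not shown here.
        raise ValueError("labels must contain at least one candidate label")
    payload: dict[str, Any] = {"text": text, "labels": labels}
    optional = {
        "threshold": threshold,
        "multi_label": multi_label,
        "examples": examples,
        "prompt": prompt,
        "adapter_id": adapter_id,
    }
    # Keep explicit False/0 values; drop only fields that were never set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


body = build_payload(
    "This is a great product!",
    ["positive", "negative", "neutral"],
    threshold=0.3,
)
print(json.dumps(body))
```

The resulting dictionary can be serialized with json.dumps and sent as the POST body with any HTTP library.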

The response is a JSON array:

[
{"label": "positive", "score": 0.95}
]

Warning: The HTTP endpoint processes one text per request. If texts is an array, only its first item is processed. Send concurrent requests to benefit from server-side dynamic batching, or use the pipeline directly for offline batches.
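One way to send concurrent requests from Python is a thread pool that issues one call per text. The sketch below is an illustration under assumptions: classify_concurrently is a hypothetical helper, and it accepts any callable with the same keyword arguments as GLiClassClient.classify. Whether a single client instance is safe to share across threads is not stated on this page, so verify that before using it under load.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Result = list[dict]


def classify_concurrently(
    classify: Callable[..., Result],
    texts: Sequence[str],
    labels: Sequence[str],
    max_workers: int = 16,
    **kwargs,
) -> list[Result]:
    """Issue one classify call per text in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(classify, text=text, labels=list(labels), **kwargs)
            for text in texts
        ]
        # Collect in submission order so results[i] belongs to texts[i].
        return [future.result() for future in futures]


# Intended use (requires a running server):
#   results = classify_concurrently(client.classify, texts, labels, threshold=0.3)
```

Keeping max_workers at or above max_batch_size gives the server enough in-flight requests to fill a batch within batch_wait_timeout_ms.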

Configure the Deployment

For reproducible deployments, create a YAML file:

# serve_config.yaml
model: knowledgator/gliclass-edge-v3.0
device: cuda
dtype: float16

max_model_len: 2048
max_labels: -1
max_labels_alloc: dynamic
default_threshold: 0.5

num_replicas: 1
num_gpus_per_replica: 1.0
num_cpus_per_replica: 1.0

max_batch_size: 32
batch_wait_timeout_ms: 20.0
max_ongoing_requests: 256
queue_capacity: 4096

route_prefix: /gliclass
http_port: 8000

enable_compilation: false
precompile_on_startup: false
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]

calibrate_on_startup: false
use_memory_aware_batching: false
target_memory_fraction: 0.8
memory_overhead_factor: 1.3

ray_address: null

Start it with:

python -m gliclass.serve --config serve_config.yaml

CLI arguments override values loaded from the YAML file:

python -m gliclass.serve \
  --config serve_config.yaml \
  --num-replicas 2 \
  --max-batch-size 64
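The precedence can be pictured as three layers, with later layers winning. The following sketch only illustrates that rule and is not the gliclass.serve implementation. merge_config is a hypothetical function, and treating None as "flag not passed" is an assumption about how unset CLI options would be represented.

```python
from typing import Any, Mapping


def merge_config(
    defaults: Mapping[str, Any],
    yaml_values: Mapping[str, Any],
    cli_values: Mapping[str, Any],
) -> dict[str, Any]:
    """Layer settings: defaults < YAML file < CLI flags that were actually passed."""
    merged = dict(defaults)
    merged.update(yaml_values)
    # Treat None as "flag not given" so an unset CLI option keeps the YAML value.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


config = merge_config(
    defaults={"num_replicas": 1, "max_batch_size": 32, "http_port": 8000},
    yaml_values={"max_batch_size": 32, "http_port": 8000},
    cli_values={"num_replicas": 2, "max_batch_size": 64, "http_port": None},
)
print(config)
```

With the command above, num_replicas and max_batch_size come from the CLI, while http_port keeps the value from the YAML file.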

Important settings include:

| Setting | Default | Description |
| --- | --- | --- |
| model | knowledgator/gliclass-edge-v3.0 from the CLI | Hugging Face model ID or local path. |
| device / dtype | cuda / bfloat16 | Inference device and model weight type. |
| max_model_len | 2048 | Maximum input sequence length. |
| max_labels | -1 | Maximum candidate labels; -1 is unlimited. Longer lists are truncated when a positive limit is set. |
| max_labels_alloc | dynamic | Label memory allocation strategy: dynamic, fixed, or an integer. |
| num_replicas | 1 | Number of Ray Serve model replicas. |
| max_batch_size | 32 | Maximum number of concurrent requests combined into one model batch. |
| batch_wait_timeout_ms | 20.0 | Maximum time to wait for requests to fill a batch. |
| route_prefix / http_port | /gliclass / 8000 | HTTP route and port. |
| ray_address | null | Existing Ray cluster address; null starts a local runtime. |

Compilation and Memory-Aware Batching

Enable model compilation and optional startup warmup with:

enable_compilation: true
precompile_on_startup: true
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]
warmup_iterations: 3

Memory-aware batching calibrates GPU memory use at startup and then selects the largest configured precompiled batch size that fits in memory at the observed input length:

calibrate_on_startup: true
use_memory_aware_batching: true
target_memory_fraction: 0.8
memory_overhead_factor: 1.3
calibration_min_seq_len: 64
calibration_probe_batch_size: 2

This mode requires CUDA. Without calibration data, the server falls back to the largest configured precompiled batch size.
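To make the selection rule concrete, here is a rough sketch under a deliberately simple assumption: memory grows linearly with batch size times sequence length. The function pick_batch_size, the bytes_per_token calibration value, and the linear cost model are all hypothetical; the server's actual estimator may differ. Only the fallback to the largest configured size without calibration data is taken from the text above. Returning the smallest size when nothing fits is a further assumption.

```python
from typing import Optional, Sequence


def pick_batch_size(
    batch_sizes: Sequence[int],
    seq_len: int,
    bytes_per_token: Optional[float],
    total_memory_bytes: int,
    target_memory_fraction: float = 0.8,
    memory_overhead_factor: float = 1.3,
) -> int:
    """Return the largest configured batch size whose estimated memory fits the budget."""
    candidates = sorted(batch_sizes)
    if not candidates:
        raise ValueError("batch_sizes must not be empty")
    if bytes_per_token is None:
        # No calibration data: fall back to the largest configured size.
        return candidates[-1]
    budget = total_memory_bytes * target_memory_fraction
    fitting = [
        size
        for size in candidates
        # Assumed linear cost model, padded by the overhead factor.
        if size * seq_len * bytes_per_token * memory_overhead_factor <= budget
    ]
    # Assumption: if nothing fits, use the smallest size rather than failing.
    return fitting[-1] if fitting else candidates[0]


sizes = [1, 2, 4, 8, 16, 32]
print(pick_batch_size(sizes, seq_len=512, bytes_per_token=1_000_000,
                      total_memory_bytes=16 * 1024**3))
```

Under this model, raising memory_overhead_factor or lowering target_memory_fraction makes the choice more conservative.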

In-Process Python API

GLiClassFactory starts Ray Serve and exposes blocking and asynchronous prediction without HTTP:

from gliclass.serve import GLiClassFactory

with GLiClassFactory(
    model="knowledgator/gliclass-edge-v3.0",
    device="cuda",
    dtype="float16",
) as classifier:
    results = classifier.predict(
        ["Great product!", "Terrible experience."],
        labels=["positive", "negative", "neutral"],
    )

For async applications, use await classifier.predict_async(...). Concurrent predictions are accumulated by Ray Serve into model batches. The context manager shuts down both the Serve deployment and its Ray runtime.

PolyLoRA Adapter Serving

The serving module can route requests to adapters through PolyLoRA. Install PolyLoRA separately; it is not included in the serve extra:

pip install polylora

Requested adapters must already exist in the adapter store.

Enable adapter serving in YAML:

enable_polylora: true
polylora_adapter_weight_modules: [query, key, value]
polylora_max_rank: 16
polylora_max_gpu_adapters: 8
polylora_max_cpu_adapters: 128
polylora_disk_cache_dir: /var/cache/gliclass/adapters
polylora_base_adapter_id: __base__

Select an adapter by passing adapter_id through the Python client or HTTP request. Omit it to use the base model.

Inspect the adapter cache:

status = client.adapter_cache_status()
adapter_status = client.adapter_cache_status("support-domain")
is_cached = client.is_adapter_cached("support-domain")

The equivalent HTTP endpoint is:

curl "http://localhost:8000/gliclass/adapter-cache?adapter_id=support-domain"

Adapter IDs must match the configured polylora_adapter_id_pattern; the default accepts 1–128 letters, digits, underscores, periods, and hyphens.
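Client code can reject malformed adapter IDs before sending a request. The regular expression below is an approximation written from the description above, assuming "letters" means ASCII letters; the actual default value of polylora_adapter_id_pattern is not shown on this page. The name is_valid_adapter_id is hypothetical.

```python
import re

# Assumed approximation of the described default: 1 to 128 ASCII letters,
# digits, underscores, periods, and hyphens.
ASSUMED_ADAPTER_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_adapter_id(adapter_id: str) -> bool:
    """Check an adapter ID against the assumed default pattern."""
    return ASSUMED_ADAPTER_ID_PATTERN.fullmatch(adapter_id) is not None


print(is_valid_adapter_id("support-domain"))
print(is_valid_adapter_id("bad/id"))
```

If the deployment overrides polylora_adapter_id_pattern, compile that value instead so client and server agree.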

Source Configuration

See the upstream serve_config.yaml and gliclass.serve package for the complete configuration and implementation.