Production Serving

GLiClass includes a Ray Serve deployment for production inference. It supports dynamic request batching, multiple replicas, CPU or GPU execution, model compilation, memory-aware batch sizing, an HTTP client, and an in-process Python API.

Installation

Install GLiClass with its serving dependencies:

pip install "gliclass[serve]"

The serve extra installs Ray Serve, Requests, and PyYAML in addition to the standard GLiClass dependencies.

Start the Server

Start the default model, knowledgator/gliclass-edge-v3.0, on port 8000:

python -m gliclass.serve

Select a model and port:

python -m gliclass.serve \
  --model knowledgator/gliclass-edge-v3.0 \
  --port 8000

The default endpoint is http://localhost:8000/gliclass. Run the following command to see all CLI options:

python -m gliclass.serve --help

Note: The default configuration uses CUDA. On a machine without a GPU, add --device cpu --dtype float32 --num-gpus-per-replica 0.

Send Requests

Python Client

GLiClassClient.classify classifies one text and returns a list of label-score dictionaries:

from gliclass.serve import GLiClassClient

client = GLiClassClient("http://localhost:8000/gliclass")

result = client.classify(
    text="This is a great product!",
    labels=["positive", "negative", "neutral"],
    threshold=0.3,
    multi_label=True,
)

print(result)
# [{"label": "positive", "score": 0.95}, ...]

The client also accepts the pipeline's few-shot examples and task prompt:

result = client.classify(
    text="Fast delivery and the item works perfectly!",
    labels=["positive", "negative", "product", "shipping"],
    examples=[
        {"text": "Excellent quality.", "labels": ["positive", "product"]},
        {"text": "The package arrived late.", "labels": ["negative", "shipping"]},
    ],
    prompt="Classify the sentiment and subject of this review:",
)

Check whether the endpoint is reachable with client.health_check().
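When the server is started in the background, it can help to wait for the endpoint before sending traffic. The sketch below polls any zero-argument callable until it reports success. It assumes that client.health_check() returns a truthy value when the endpoint is reachable and either returns a falsy value or raises otherwise; that behavior is not documented here. The name wait_until_ready is hypothetical and not part of GLiClass.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `check` until it returns a truthy value or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Assumption: errors raised while the server is still starting
            # mean "not ready yet", so they are swallowed and retried.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the client from the example above, a call could look like wait_until_ready(client.health_check, timeout_s=120).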

HTTP API

Send a POST request to the configured route:

curl -X POST http://localhost:8000/gliclass \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a great product!",
    "labels": ["positive", "negative", "neutral"],
    "threshold": 0.3,
    "multi_label": true
  }'

The request body supports these fields:

Field	Type	Required	Description
`text`	string	Yes	Text to classify. `texts` is accepted as a compatibility alias.
`labels`	string[]	Yes	Candidate labels.
`threshold`	number	No	Confidence threshold. Uses `default_threshold` when omitted.
`multi_label`	boolean	No	`true` for multi-label or `false` for single-label classification. Defaults to `true`.
`examples`	object[]	No	Few-shot examples in the same format as the pipeline.
`prompt`	string	No	Task description prompt.
`adapter_id`	string	No	ID of a loaded PolyLoRA adapter.
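As a rough illustration of the field table, the helper below assembles a request body that always carries the two required fields and includes optional fields only when they are set, so the server-side defaults apply to anything omitted. The function name build_request is hypothetical; only the field names come from the table.

```python
import json
from typing import Any, Optional, Sequence


def build_request(
    text: str,
    labels: Sequence[str],
    threshold: Optional[float] = None,
    multi_label: Optional[bool] = None,
    examples: Optional[Sequence[dict]] = None,
    prompt: Optional[str] = None,
    adapter_id: Optional[str] = None,
) -> dict:
    """Build a JSON-serializable body for the classification endpoint."""
    if not text:
        raise ValueError("text is required")
    if not labels:
        raise ValueError("labels must contain at least one candidate label")

    body: dict = {"text": text, "labels": list(labels)}
    optional: dict = {
        "threshold": threshold,
        "multi_label": multi_label,
        "examples": list(examples) if examples is not None else None,
        "prompt": prompt,
        "adapter_id": adapter_id,
    }
    # Leave out unset fields so the server falls back to its own defaults.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


if __name__ == "__main__":
    payload = build_request("This is a great product!", ["positive", "negative"], threshold=0.3)
    print(json.dumps(payload, indent=2))
```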

The response is a JSON array:

[
  {"label": "positive", "score": 0.95}
]

Warning: The HTTP endpoint processes one text per request. If texts is an array, only its first item is processed. Send concurrent requests to benefit from server-side dynamic batching, or use the pipeline directly for offline batches.
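One way to send concurrent requests from a single process is a thread pool that issues one request per text. The sketch below is a minimal version that takes any single-text callable, so it does not depend on a particular client. The helper name classify_concurrently is hypothetical, and sharing one GLiClassClient across threads is an assumption that should be checked against the client implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def classify_concurrently(
    classify_one: Callable[[str], list],
    texts: Sequence[str],
    max_workers: int = 16,
) -> List[list]:
    """Call `classify_one` once per text in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `texts`,
        # regardless of which request finishes first.
        return list(pool.map(classify_one, texts))
```

With the Python client, classify_one could be lambda t: client.classify(text=t, labels=labels). Keeping max_workers at or above max_batch_size gives the server enough in-flight requests to fill a batch.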

Configure the Deployment

For reproducible deployments, create a YAML file:

serve_config.yaml
model: knowledgator/gliclass-edge-v3.0
device: cuda
dtype: float16

max_model_len: 2048
max_labels: -1
max_labels_alloc: dynamic
default_threshold: 0.5

num_replicas: 1
num_gpus_per_replica: 1.0
num_cpus_per_replica: 1.0

max_batch_size: 32
batch_wait_timeout_ms: 20.0
max_ongoing_requests: 256
queue_capacity: 4096

route_prefix: /gliclass
http_port: 8000

enable_compilation: false
precompile_on_startup: false
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]

calibrate_on_startup: false
use_memory_aware_batching: false
target_memory_fraction: 0.8
memory_overhead_factor: 1.3

ray_address: null

Start it with:

python -m gliclass.serve --config serve_config.yaml

CLI arguments override values loaded from the YAML file:

python -m gliclass.serve \
  --config serve_config.yaml \
  --num-replicas 2 \
  --max-batch-size 64
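The precedence rule can be pictured as a dictionary merge in which explicitly passed CLI values replace file values. This is only an illustration of the rule, not the code gliclass.serve uses; merge_config is a hypothetical name, and representing unset CLI flags as None is an assumption.

```python
from typing import Any, Dict, Mapping


def merge_config(
    file_values: Mapping[str, Any],
    cli_values: Mapping[str, Any],
) -> Dict[str, Any]:
    """Return file settings with explicitly set CLI values layered on top."""
    merged = dict(file_values)
    # Assumption: a CLI flag that was not passed shows up as None and is ignored.
    merged.update({key: value for key, value in cli_values.items() if value is not None})
    return merged


if __name__ == "__main__":
    from_file = {"num_replicas": 1, "max_batch_size": 32, "http_port": 8000}
    from_cli = {"num_replicas": 2, "max_batch_size": 64, "http_port": None}
    print(merge_config(from_file, from_cli))
```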

Important settings include:

Setting	Default	Description
`model`	`knowledgator/gliclass-edge-v3.0` (CLI default)	Hugging Face model ID or local path.
`device` / `dtype`	`cuda` / `bfloat16`	Inference device and model weight type.
`max_model_len`	`2048`	Maximum input sequence length.
`max_labels`	`-1`	Maximum candidate labels; `-1` is unlimited. Longer lists are truncated when a positive limit is set.
`max_labels_alloc`	`dynamic`	Label memory allocation strategy: `dynamic`, `fixed`, or an integer.
`num_replicas`	`1`	Number of Ray Serve model replicas.
`max_batch_size`	`32`	Maximum number of concurrent requests combined into one model batch.
`batch_wait_timeout_ms`	`20.0`	Maximum time to wait for requests to fill a batch.
`route_prefix` / `http_port`	`/gliclass` / `8000`	HTTP route and port.
`ray_address`	`null`	Existing Ray cluster address; `null` starts a local runtime.

Compilation and Memory-Aware Batching

Enable model compilation and optional startup warmup with:

enable_compilation: true
precompile_on_startup: true
precompiled_batch_sizes: [1, 2, 4, 8, 16, 32]
warmup_iterations: 3

Memory-aware batching calibrates GPU memory use at startup and then, for each batch, selects the largest configured precompiled batch size that fits in memory at the observed input length:

calibrate_on_startup: true
use_memory_aware_batching: true
target_memory_fraction: 0.8
memory_overhead_factor: 1.3
calibration_min_seq_len: 64
calibration_probe_batch_size: 2

This mode requires CUDA. Without calibration data, the server falls back to the largest configured precompiled batch size.
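The selection step can be sketched as follows. This is not the upstream algorithm: it assumes a simple linear memory model (memory grows with batch size times sequence length), treats bytes_per_token as the value produced by calibration, and guesses how target_memory_fraction and memory_overhead_factor enter the estimate. The function name pick_batch_size and the parameters free_bytes and bytes_per_token are hypothetical.

```python
from typing import Optional, Sequence


def pick_batch_size(
    seq_len: int,
    batch_sizes: Sequence[int],
    free_bytes: int,
    bytes_per_token: Optional[float],
    target_memory_fraction: float = 0.8,
    memory_overhead_factor: float = 1.3,
) -> int:
    """Pick the largest configured batch size whose estimated memory fits the budget."""
    sizes = sorted(batch_sizes)
    if bytes_per_token is None:
        # No calibration data: fall back to the largest configured size.
        return sizes[-1]

    budget = free_bytes * target_memory_fraction
    for size in reversed(sizes):
        # Assumed linear model, padded by the overhead factor.
        estimated = size * seq_len * bytes_per_token * memory_overhead_factor
        if estimated <= budget:
            return size
    # Assumption: if nothing fits, use the smallest configured size.
    return sizes[0]
```

Under this model, longer inputs push the selection toward smaller batch sizes, which is the behavior the calibration settings are meant to produce.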

In-Process Python API

GLiClassFactory starts Ray Serve and exposes blocking and asynchronous prediction without HTTP:

from gliclass.serve import GLiClassFactory

with GLiClassFactory(
    model="knowledgator/gliclass-edge-v3.0",
    device="cuda",
    dtype="float16",
) as classifier:
    results = classifier.predict(
        ["Great product!", "Terrible experience."],
        labels=["positive", "negative", "neutral"],
    )

For async applications, use await classifier.predict_async(...). Concurrent predictions are accumulated by Ray Serve into model batches. The context manager shuts down both the Serve deployment and its Ray runtime.
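A minimal sketch of issuing several asynchronous predictions at once is shown below. It assumes that predict_async accepts the same arguments as predict (a list of texts and a labels keyword); the section does not show its signature. The helper name predict_many is hypothetical.

```python
import asyncio
from typing import Any, List, Sequence


async def predict_many(
    classifier: Any,
    batches: Sequence[Sequence[str]],
    labels: Sequence[str],
) -> List[Any]:
    """Start one predict_async call per batch and await them together."""
    # Assumption: predict_async(texts, labels=...) mirrors predict(...).
    tasks = [classifier.predict_async(list(batch), labels=list(labels)) for batch in batches]
    # gather returns results in the same order as `batches`.
    return await asyncio.gather(*tasks)
```

Inside the with GLiClassFactory(...) as classifier block, this could be driven with asyncio.run(predict_many(classifier, batches, labels)), assuming no event loop is already running.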

PolyLoRA Adapter Serving

The serving module can route requests to adapters through PolyLoRA. Install PolyLoRA separately; it is not included in the serve extra:

pip install polylora

Requested adapters must already exist in the adapter store.

Enable adapter serving in YAML:

enable_polylora: true
polylora_adapter_weight_modules: [query, key, value]
polylora_max_rank: 16
polylora_max_gpu_adapters: 8
polylora_max_cpu_adapters: 128
polylora_disk_cache_dir: /var/cache/gliclass/adapters
polylora_base_adapter_id: __base__

Select an adapter by passing adapter_id through the Python client or HTTP request. Omit it to use the base model.

Inspect the adapter cache:

status = client.adapter_cache_status()
adapter_status = client.adapter_cache_status("support-domain")
is_cached = client.is_adapter_cached("support-domain")

The equivalent HTTP endpoint is:

curl "http://localhost:8000/gliclass/adapter-cache?adapter_id=support-domain"

Adapter IDs must match the configured polylora_adapter_id_pattern; the default accepts 1–128 letters, digits, underscores, periods, and hyphens.
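Validating adapter IDs on the client side avoids a round trip for obviously malformed values. The regular expression below is reconstructed from the description above and is not copied from the library; it assumes "letters" means ASCII letters. The names ADAPTER_ID_RE and is_valid_adapter_id are hypothetical, and a custom polylora_adapter_id_pattern would need its own expression.

```python
import re

# Assumed equivalent of the default pattern: 1 to 128 ASCII letters,
# digits, underscores, periods, or hyphens.
ADAPTER_ID_RE = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_adapter_id(adapter_id: str) -> bool:
    """Return True if the whole string matches the assumed default pattern."""
    return ADAPTER_ID_RE.fullmatch(adapter_id) is not None


if __name__ == "__main__":
    for candidate in ["support-domain", "__base__", "bad id", ""]:
        print(repr(candidate), is_valid_adapter_id(candidate))
```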

Source Configuration

See the upstream serve_config.yaml and gliclass.serve package for the complete configuration and implementation.
