@ci.inference wraps a serving function with profiling. Each call opens a request scope and attributes latency and cost to the deployment record the platform already has. For LLM calls, it also captures token counts, time-to-first-token, and tokens/second automatically.

The basics

import cirron as ci

@ci.inference
def predict(request):
    with ci.scope("preprocess"):
        x = preprocess(request)
    with ci.scope("model"):
        y = model(x)
    with ci.scope("postprocess"):
        return format_response(y)

The decorator does not change the function’s signature or return value. On each call it:
  1. Opens a request scope with an auto-generated request ID.
  2. Invokes your function.
  3. Closes the scope. Per-request latency, scope tree, and marks are attributed to the deployment.
Works with sync, async, and streaming functions. FastAPI, Flask, ASGI, and threaded serving frameworks all work out of the box.
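
For a streaming endpoint, the same decorator wraps a generator. The following is a minimal sketch, assuming the decorator passes generator functions through unchanged; stream_model is a hypothetical placeholder for your own chunked model interface:

import cirron as ci

@ci.inference
def predict_stream(request):
    with ci.scope("preprocess"):
        x = preprocess(request)
    with ci.scope("model"):
        # Chunks are yielded to the caller as the model produces them;
        # stream_model is a placeholder, not part of the SDK.
        for chunk in stream_model(x):
            yield chunk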

Per-request isolation

Every request gets its own scope tree via contextvars.ContextVar. Concurrent requests never contaminate each other’s scopes or marks, regardless of whether the runtime uses threads, asyncio, or both.

import aiohttp
import cirron as ci

@ci.inference
async def predict(request):
    async with aiohttp.ClientSession() as s:
        with ci.scope("fetch"):
            async with s.get(request["url"]) as resp:
                data = await resp.json()
        with ci.scope("model"):
            return model(data)

Config-driven capture

Pass a config= dict to toggle optional capture logic at runtime, without redeploying code.

import cirron as ci

config = ci.env("CONFIG") or {}

@ci.inference(config=config)
def predict(request):
    result = model(preprocess(request))

    if config.get("capture_embeddings"):
        ci.mark("embedding_norm", result.embedding.norm().item())

    if config.get("log_attention"):
        ci.mark("attention_entropy", compute_entropy(result.attention))

    threshold = config.get("threshold", 0.5)
    return {"label": "positive" if result.score > threshold else "negative",
            "score": result.score}
ci.env() reads from the deployment’s environment variables. On the dashboard’s deployment config panel, edit the CONFIG env var (or whichever key you chose) and hit apply. The platform triggers a rolling restart of the deployment’s containers with the new value, and the next call to ci.env("CONFIG") returns it. See ci.env in Configuration for the JSON auto-parsing rules.
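
For example, a JSON value set on the dashboard comes back as a plain dict. The value below is hypothetical and assumes the JSON auto-parsing behavior described in Configuration:

# Hypothetical value of the CONFIG env var, edited in the deployment config panel:
#   CONFIG={"capture_embeddings": true, "log_attention": false, "threshold": 0.7}
config = ci.env("CONFIG") or {}
# After the rolling restart, config.get("threshold") returns 0.7,
# and the optional ci.mark() calls above switch on or off accordingly.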

Automatic LLM detection

When the wrapped function calls an OpenAI-compatible client or HuggingFace generate, the SDK captures LLM-shaped metrics with no extra code:
  • OpenAI-compatible responses: if the return value has usage.prompt_tokens / usage.completion_tokens, they’re marked on the request scope.
  • HuggingFace generate: input_ids length and output length are captured.
  • Streaming responses: the time between scope open and first yield is marked as time-to-first-token; tokens/second is computed across the stream.
All detection is best-effort and wrapped in try/except; if it fails, your function still returns normally.
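
As an illustration, a deployment that calls an OpenAI-compatible chat endpoint needs no extra marks. This is a minimal sketch assuming the openai Python client; the model name and response handling are placeholders, not a prescribed setup:

import cirron as ci
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

@ci.inference
def predict(request):
    with ci.scope("llm"):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": request["prompt"]}],
        )
    # resp.usage.prompt_tokens / resp.usage.completion_tokens are detected
    # and marked on the request scope automatically; no ci.mark() needed.
    return {"text": resp.choices[0].message.content}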

Lifecycle and deployment context

When the SDK runs inside a Cirron deployment, ci.profile() is typically called at module import time, before the serving framework starts accepting traffic. The deployment’s runtime injects CIRRON_DEPLOYMENT_ID, CIRRON_WORKSPACE_ID, and any CIRRON_SECRET_* env vars your function reads via ci.secret(). When running standalone (with no deployment record), @ci.inference still produces local traces: the request scope lands in ./.cirron/spool/ like any other scope, just without deployment attribution.
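
Putting that together, a deployment entry point might look like the sketch below. It assumes ci.profile() can be called without arguments (see Configuration for its actual options) and that the secret name "API_KEY" resolves to an injected CIRRON_SECRET_API_KEY variable; both are assumptions for illustration, not confirmed API:

import cirron as ci

ci.profile()  # at module import, before the server starts accepting traffic

# Hypothetical secret name; assumed to map to the injected
# CIRRON_SECRET_API_KEY environment variable.
API_KEY = ci.secret("API_KEY")

@ci.inference
def predict(request):
    with ci.scope("model"):
        return model(preprocess(request))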

Next

Configuration

ci.env, ci.secret, and the Cirron class: what you’ll use to source config and credentials in a deployment.

Profiling

Training-side instrumentation if your deployment also trains or fine-tunes inline.