Decorator that wraps a serving function with profiling. Each call opens a
request scope with per-request `ContextVar` isolation and attributes
latency and cost to the deployment record the platform already has.
For LLMs, it also automatically captures token counts,
time-to-first-token, and tokens/second.
Signature
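A minimal sketch of the signature, inferred from the parameter table below; the bare/parameterized dual form and the return annotation are assumptions:

```python
from typing import Callable, Optional

def inference(
    fn: Optional[Callable] = None,    # set implicitly when used as bare @ci.inference
    *,
    config: Optional[dict] = None,    # runtime feature flags read via config.get()
) -> Callable:
    ...
```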
Parameters
| Name | Type | Default | Purpose |
|---|---|---|---|
| `fn` | `Callable?` | `None` | Implicit: set when used as bare `@ci.inference` |
| `config` | `dict?` | `None` | Runtime feature flags the wrapped function reads via `config.get()` |
Behavior
On each call the decorator:
- Allocates an auto-generated request ID (UUID4).
- Opens a `request` scope bound to that ID via `contextvars.ContextVar`.
- Invokes the wrapped function.
- Closes the scope. Per-request latency, the nested scope tree, and any marks emitted during the call attribute to the deployment.
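In plain code, the per-call behavior is roughly equivalent to the sketch below; names are illustrative, and the real decorator binds the request ID to the scope internally via `ContextVar`:

```python
import uuid
import ci

def serve(payload: dict) -> dict:       # stand-in for the wrapped function
    return payload

def handler(payload: dict) -> dict:
    request_id = uuid.uuid4()            # 1. auto-generated request ID
    with ci.scope("request"):            # 2. request scope (the real decorator binds
        result = serve(payload)          #    request_id to it via ContextVar)
        return result                    # 3. invoke the wrapped function
    # 4. on exit the scope closes; latency, nested scopes, and marks
    #    attribute to the deployment
```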
Concurrency
Per-request isolation via `ContextVar` means concurrent requests never
contaminate each other’s scopes or marks, regardless of whether the
runtime uses threads, asyncio, or both.
Works with:
- FastAPI / Starlette (async)
- Flask (threaded)
- ASGI servers directly
- Plain function calls in synchronous code
Examples
Basic
Inside an `@ci.inference` function, the full scope and mark surface is
available: `ci.scope` opens nested spans inside the auto-generated
request scope, and `ci.mark` attaches per-request values.
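A minimal sketch; the function name, payload shape, and the stand-in model logic are illustrative:

```python
import ci

@ci.inference
def serve(payload: dict) -> dict:
    with ci.scope("preprocess"):               # nested span inside the request scope
        text = payload.get("text", "")

    with ci.scope("model"):
        score = float(len(text))               # stand-in for a real model call

    ci.mark("score", score)                    # per-request value
    return {"score": score}
```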
Async
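A sketch assuming the decorator wraps coroutine functions the same way (the Concurrency section above covers asyncio); helper logic is illustrative:

```python
import asyncio
import ci

@ci.inference
async def serve(payload: dict) -> dict:
    with ci.scope("preprocess"):
        text = payload.get("text", "")

    with ci.scope("model"):
        await asyncio.sleep(0)                 # stand-in for an async model call
        score = float(len(text))

    ci.mark("score", score)
    return {"score": score}

async def main():
    # Concurrent calls stay isolated: each gets its own request scope.
    await asyncio.gather(serve({"text": "a"}), serve({"text": "bb"}))

asyncio.run(main())
```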
Config-driven capture
Set `CONFIG` on the deployment config panel in the dashboard and hit
apply; the platform triggers a rolling restart, and the next call to
`ci.env("CONFIG")` returns the new value.
FastAPI
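A sketch of the FastAPI pattern; the route path, app wiring, and stand-in model logic are illustrative:

```python
import ci
from fastapi import FastAPI

app = FastAPI()

@ci.inference
async def run_inference(payload: dict) -> dict:
    with ci.scope("model"):
        score = float(len(payload.get("text", "")))   # stand-in for a real model
    ci.mark("score", score)
    return {"score": score}

@app.post("/predict")
async def predict(payload: dict):
    # Each HTTP request gets its own request scope from the decorator.
    return await run_inference(payload)
```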
LLM auto-detection
Wrapped calls that hit an OpenAI-compatible client or HuggingFace `generate()` are detected automatically:
- OpenAI-shaped responses: if the return value has `usage.prompt_tokens`/`usage.completion_tokens` (the shape the `openai>=1.0` Python client returns), they're marked on the request scope.
- HuggingFace `generate`: calls into `transformers.GenerationMixin.generate` are detected; `input_ids` length and output length are marked.
- Streaming responses: when the wrapped function returns an iterator or async iterator of chunks, the time between scope open and first yield is marked as time-to-first-token; tokens/second is computed across the stream.
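A sketch of a call shape the detector recognizes, assuming the `openai>=1.0` client; the model name is illustrative:

```python
import ci
from openai import OpenAI

client = OpenAI()

@ci.inference
def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                     # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    # resp.usage.prompt_tokens / resp.usage.completion_tokens are the
    # OpenAI-shaped fields that get marked on the request scope.
    return resp.choices[0].message.content
```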
Auto-detection covers `openai>=1.0` clients and `transformers.generate`.
Custom streaming wrappers, other LLM SDKs, and hand-rolled
SSE/WebSocket clients may not be detected; fall back to explicit
`ci.mark("tokens", n)` calls when the auto-detection doesn't fire.
Detection is best-effort and wrapped in try/except; if it fails,
the wrapped function still returns normally.
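When auto-detection doesn't fire, explicit marks still work. A hedged sketch for a custom streaming client; the stream source and every mark name other than `"tokens"` are invented:

```python
import time
import ci

def fake_stream(prompt: str):
    # Stand-in for a custom streaming client that auto-detection misses.
    for word in prompt.split():
        yield word

@ci.inference
def stream_reply(prompt: str):
    start = time.monotonic()
    n_tokens = 0
    for i, token in enumerate(fake_stream(prompt)):
        if i == 0:
            ci.mark("ttft_s", time.monotonic() - start)   # invented mark name
        n_tokens += 1
        yield token
    ci.mark("tokens", n_tokens)                           # explicit count, per the docs
    ci.mark("tokens_per_s", n_tokens / max(time.monotonic() - start, 1e-9))  # invented
```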
Standalone use
Without a deployment record (running outside a Cirron deployment),
`@ci.inference` still produces local traces: the request scope
lands at `./.cirron/spool/` like any other scope, just without
deployment attribution.
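A minimal standalone sketch; the spool path comes from the paragraph above, everything else is illustrative:

```python
import ci

@ci.inference
def echo(payload: dict) -> dict:
    return payload

if __name__ == "__main__":
    echo({"msg": "hello"})   # writes a local request scope under ./.cirron/spool/
```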
Related
Inference guide
Narrative walk-through including FastAPI and Flask examples.
ci.scope
The `with ci.scope("preprocess"):` blocks the examples use.
ci.mark
Attach per-request values (tokens, scores, latencies).
ci.env
How `CONFIG` flows in from the deployment's env vars.