This guide walks the training-side surface as one story, from the one-line zero-touch setup to custom scopes and marks. For a flat signature reference, jump to the Reference section. For inference instrumentation, see the Inference guide.

The happy path

One call, made once per process. Framework hooks do the rest.
import cirron as ci

ci.profile()
That’s the whole setup. The SDK detects installed frameworks and installs hooks for each. It opens the cirron.session root scope, starts a background flush thread, and registers clean-shutdown handlers.

What the hooks produce

Framework | How it's triggered | What you get
Keras | model.fit() | epoch and batch scopes; logged metrics as marks
HuggingFace Trainer | trainer.train() | epoch and step scopes; end-of-epoch values as summary marks
PyTorch + DataLoader | for batch in loader: | data_load, forward, backward, optimizer_step, implicit step
import cirron as ci
ci.profile()

model.fit(X, y, epochs=20)

Distributed training

Every rank calls ci.profile(). The SDK reads RANK / LOCAL_RANK / WORLD_SIZE from the environment and tags every span with the rank. The platform merges views at query time. See ci.profile for the full signature and parameter table.
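For example, a script launched with torchrun needs nothing beyond the same one-line setup on every rank. The sketch below assumes a CUDA/NCCL setup; the model is a placeholder and the training loop itself is elided:
# A minimal per-rank sketch for a torchrun launch, e.g.:
#   torchrun --nproc_per_node=4 train.py
# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE, which ci.profile() reads.
import os

import torch
import torch.distributed as dist
import cirron as ci

ci.profile()  # one call per rank, exactly as in the single-process case

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.parallel.DistributedDataParallel(
    torch.nn.Linear(128, 10).cuda(local_rank),  # placeholder model
    device_ids=[local_rank],
)
# ... build a DistributedSampler-backed DataLoader and train as usual; every
# span the hooks emit is tagged with this process's rank.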

Custom loops

If your loop doesn’t fit the hook patterns (generator-based iteration, custom samplers, step counters without a DataLoader), wrap the iterables. They’re transparent: ci.epochs(range(20)) yields 0..19 exactly while opening an indexed epoch scope around each iteration.
for epoch in ci.epochs(range(20)):
    for batch in ci.batches(loader):
        loss = train_step(batch)
        ci.mark("loss", loss.item())
ci.batches() additionally measures DataLoader stall time (time spent waiting for data vs. time spent on compute) when the iterable is a torch.utils.data.DataLoader. See Loop wrappers.

Custom regions

Explicit scopes for regions the hooks and wrappers don’t cover: augmentation, beam search, custom schedulers, preprocessing passes.
with ci.scope("augmentation"):
    batch = augment(batch)

with ci.scope("postprocess", variant="beam-search"):
    output = beam_search(logits)
Scopes nest arbitrarily under whatever scope is already open, so the hooks’ epoch / batch / forward tree stays intact and your custom scope slots in at the right level. Max depth: 64. See ci.scope.

Values

Attach scalar values to the innermost open scope.
ci.mark("loss", loss.item())                                # point (default)
ci.mark("grad_norm", compute_grad_norm(model))
ci.mark("learning_rate", scheduler.get_last_lr()[0])
ci.mark("epoch_accuracy", val_acc, kind="summary")          # canonical epoch value
Two kinds:
  • kind="point" (default): time-series values recorded while the span is open. Viewers render as a chart.
  • kind="summary": a single canonical end-of-span value. Viewers render as one value on the span.
See ci.mark.

Framework hooks

Hooks are installed automatically by ci.profile() when the framework is importable. Each hook is wrapped in a top-level try/except, so a hook that fails logs a warning and training continues.

Priority

When multiple frameworks are installed, hooks fire in priority order:
transformers  >  tensorflow  >  torch
Higher-level frameworks claim ownership of the semantic scopes (epoch, step) first. Lower-level frameworks yield on those names so no semantic scope is duplicated; HuggingFace Trainer running over a PyTorch DataLoader gets one epoch span per epoch, not two.

PyTorch

Hook | Mechanism | Scope produced
Forward pass | nn.Module.__call__ pre/post hook | forward (with mode=train|eval)
Backward pass | Autograd hooks on Tensor.backward | backward
Optimizer | register_optimizer_step_pre/post_hook | optimizer_step
DataLoader | DataLoader.__iter__ / __next__ wrapping | data_load per batch
Step boundary | First __next__ after each optimizer_step | step wrapping the four above
CUDA time | torch.cuda.Event pairs per scope | gpu_ns attribute on spans
Gradient accumulation (multiple forward/backward pairs between optimizer steps) produces a single step span covering all of them. Epoch-boundary detection uses DataLoader iterator exhaustion: a new iterator begins a new epoch. The fallback is a fixed interval of every N optimizer steps (configurable, default 1000). When you use ci.epochs(), the wrapper marks epoch boundaries directly.
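As an illustration, a toy accumulation loop like the sketch below (model, loader, and accumulation factor are arbitrary placeholders) produces one step span per optimizer.step() call, with four forward/backward pairs nested under each:
import torch
import cirron as ci

ci.profile()

model = torch.nn.Linear(16, 1)                       # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = torch.utils.data.DataLoader(torch.randn(256, 16), batch_size=8)
ACCUM_STEPS = 4                                      # forward/backward pairs per optimizer step

for epoch in ci.epochs(range(2)):
    for i, batch in enumerate(ci.batches(loader)):
        loss = model(batch).pow(2).mean() / ACCUM_STEPS
        loss.backward()                              # one forward/backward pair per batch
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()                         # one step span covers the pairs above
            optimizer.zero_grad()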

TensorFlow / Keras

A keras.callbacks.Callback is auto-registered by patching Model.fit to inject it if not already present. Opens/closes scopes on on_epoch_begin/end and on_train_batch_begin/end. Logged metrics become marks automatically.

HuggingFace transformers

A TrainerCallback is auto-registered by patching Trainer.__init__. Opens scopes for on_train_begin, on_epoch_begin, on_step_begin. End-of-epoch values are marked kind="summary". Torch hooks nest correctly underneath HF’s step span.
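A minimal sketch of the setup order, assuming a small sequence-classification checkpoint and an already tokenized train_dataset (both placeholders); the point is only that ci.profile() runs before the Trainer is constructed, so the patched Trainer.__init__ can inject the callback:
import cirron as ci
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

ci.profile()  # patch Trainer.__init__ before any Trainer is constructed

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
)
trainer.train()  # epoch / step scopes open automatically; epoch-end metrics land as summary marks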

scikit-learn

No auto-hook. Opt in by wrapping the estimator:
from sklearn.ensemble import RandomForestClassifier
import cirron as ci

model = ci.wrap(RandomForestClassifier(n_estimators=100))
model.fit(X, y)      # opens a scope around fit, delegates everything else
See ci.wrap.

Snapshots

At each detected epoch boundary, the SDK captures per-tensor statistics for every parameter in the model being profiled. Three modes, controlled by ci.profile(snapshots=...).
Mode | Cost per epoch boundary | What's captured
"stats" | ≤ 50 ms (typical model) | {mean, std, min, max, norm, histogram[16]} per tensor
"sampled" | ≤ 200 ms on sampled steps | Stats + raw tensor values for random() < sample_rate epochs
"full" | unbounded; debug-only | Stats + raw tensor values every epoch
In "sampled" and "full" modes, raw tensors are written as safetensors blobs at ./.cirron/snapshots/<span_id>/weights.safetensors (and gradients.safetensors when gradients are non-None).

Model discovery

Keras and HuggingFace hooks discover the model from their callback kwargs. Bare PyTorch loops that don’t use ci.epochs() should register the model once with ci.watch(model) before training:
import cirron as ci

ci.profile(snapshots="stats")
ci.watch(model)

for epoch in range(20):
    ...
See ci.watch.
"full" mode is not recommended for models over 100M parameters. At 7B+, even "sampled" is expensive, drop the sample_rate.

Output sinks

By default ci.profile() writes each batch as a JSON file under ./.cirron/spool/. The output= parameter swaps that local destination or fans it out, so you can stream traces alongside training output or run purely in-memory:
ci.profile(output="stdout")            # live [cirron] lines per closed span
ci.profile(output="log")               # cirron.trace logger at INFO
ci.profile(output=["spool", "log"])    # disk + log
ci.profile(output="none")              # nothing written; pair with ci.trace()
Sinks are independent of the platform transport. output="none" inside a Cirron pipeline still ships traces to the platform over the kernel event stream; only the local mirroring is suppressed. See output= reference for the full table.
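Because the "log" sink goes through the standard cirron.trace logger at INFO, it composes with ordinary logging configuration. A minimal sketch that mirrors spans into a file alongside the spool (the file name and format string are arbitrary):
import logging

import cirron as ci

handler = logging.FileHandler("training-trace.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logging.getLogger("cirron.trace").addHandler(handler)
logging.getLogger("cirron.trace").setLevel(logging.INFO)

ci.profile(output=["spool", "log"])  # spool to disk and mirror each closed span into the log file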

In-process read-back

ci.trace() returns the current session’s scope tree without leaving the process. Useful in notebooks (the cell renders the tree inline) and for ad-hoc analysis (a flat DataFrame for quantiles and group-bys):
ci.trace()                          # text tree to stdout (or notebook value)
ci.trace(format="dict")             # nested dict
ci.trace(format="json")             # JSON string
ci.trace(format="df")               # pandas DataFrame, one row per span
ci.trace(name="epoch")              # filter by scope name
ci.trace(last=10)                   # 10 most recently closed spans
Works with or without an active profiler. When no profiler is attached, the call is purely in-memory and never writes a spool file as a side effect (safe on read-only filesystems). See ci.trace reference.
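As an example of the ad-hoc analysis path, the sketch below computes per-scope timing quantiles from the DataFrame. The column names used here ("name", "duration_ms") are illustrative assumptions, not a documented schema; see the ci.trace reference for the real columns:
# Per-scope duration quantiles from the read-back DataFrame. Column names are
# illustrative assumptions ("name", "duration_ms"), not a documented schema.
import cirron as ci

df = ci.trace(format="df")
steps = df[df["name"] == "step"]
print("p50 / p95 step time (ms):",
      steps["duration_ms"].quantile(0.5),
      steps["duration_ms"].quantile(0.95))
print(df.groupby("name")["duration_ms"].describe())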

Lifecycle

Three helpers for manual control. The atexit handler registered by ci.profile() calls them for you on process exit; reach for them only when you need deterministic behavior in tests or hot-reload scenarios.
ci.flush()       # synchronously drain scope + mark buffers to spool
ci.health()      # dict: enabled, drop counts, hook handles, transport, spool usage
ci.shutdown()    # close root scope, flush, stop flush thread, clear singleton
See Lifecycle reference.
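For deterministic teardown in tests, one pattern is a fixture that flushes and shuts down explicitly instead of waiting for atexit. A minimal pytest sketch; the output="none" choice simply keeps tests from writing spool files:
import cirron as ci
import pytest


@pytest.fixture
def profiler():
    ci.profile(output="none")   # no local spool writes during tests
    yield
    ci.flush()                  # drain scope + mark buffers synchronously
    ci.shutdown()               # close root scope, stop the flush thread, clear the singleton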

Next

Inference guide

@ci.inference, LLM detection, config-driven capture.

Reference: ci.profile

Full signature and parameter table.

Reference: ci.trace

Read-back formats and filters.