ci.load() is the single entry point for data access. One function, flat kwargs, local-first by default. Nothing hits the network unless you explicitly opt in via source="platform" or a scheme in the name string.

Signature

def load(
    name: str | list[str],
    *,
    source: Literal["local", "platform"] = "local",
    match: str | dict | None = None,
    ext: list[str] | None = None,
    columns: list[str] | None = None,
    map: Callable | None = None,
    where: str | None = None,
    as_: Literal["pandas", "polars", "iter", "tensor", "hf"] = "pandas",
    lazy: bool = False,
    batch_size: int = 10_000,
    confirm_large: bool = False,
) -> Any: ...
A scheme in the name string (s3://, gs://, postgres://, …) always overrides the source= kwarg. Without a scheme and with the default source="local", ci.load() probes the local filesystem and never calls the platform.
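
A quick illustration of that precedence (bucket name hypothetical):

# The s3:// scheme wins even though source= says "local"
df = ci.load("s3://ml-data/events/", source="local")

# No scheme: stays on the local filesystem, never calls the platform
df = ci.load("./events/")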

Where the data comes from

# Single file
df = ci.load("./data/training/events.parquet")

# Directory: probes ./training-data/, then ./data/training-data/
# First match wins; no fallback to later candidates if the first exists but is empty.
df = ci.load("training-data")

# Multi-source union: parallel load, concatenate
df = ci.load(["./data/a/", "./data/b/"])

Filtering and selection

match= and ext= work on any filesystem-backed source (local, S3, GCS, Azure, file://).

# Glob pattern
df = ci.load("s3://ml-data/events/", match="year=2025/month=*/*.parquet")

# Structured match: path glob + filename regex + columns pushdown
df = ci.load(
    "s3://ml-data/events/",
    match={
        "path": "year=2025/month=*/",
        "filename": r"events_.*\.parquet",
        "columns": ["user_id", "ts", "event_type"],
    },
)

# Extension shorthand: accepts multiple
df = ci.load("./data/", ext=["csv", "parquet"])

# Column selection: pushed down to Parquet / SQL readers
df = ci.load("./events.parquet", columns=["user_id", "ts", "event_type"])

where= is passed through to SQL sources unescaped: it's your query, against your data. Bound the result size with LIMIT when you can.
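
A sketch of the shape; the URI, how the table is addressed, and embedding LIMIT inside where= are all assumptions here:

# where= lands in the query verbatim; bound the result yourself
df = ci.load(
    "postgres://analytics.internal/warehouse",   # hypothetical database URI
    columns=["user_id", "ts", "event_type"],
    where="event_type = 'purchase' AND ts >= '2025-01-01' LIMIT 100000",
)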

Transforms at load time

# Row-wise: receives one row at a time
df = ci.load(
    "./raw/",
    columns=["raw_text", "label"],
    map=lambda row: {"text": row["raw_text"].lower(), "label": int(row["label"])},
)

# Batch-wise: receives the full frame at once
@ci.map
def to_features(frame):
    frame["text"] = frame["raw_text"].str.lower()
    return frame

df = ci.load("./raw/", map=to_features)

Use @ci.map when the transform is vectorizable against pandas / polars; use plain callables for per-row work. How the switch is made: the @ci.map decorator sets a _cirron_batch_map=True attribute on the callable. ci.load() checks for that attribute: present means the whole frame is passed in one call, absent means rows are iterated. That's the entire mechanism; decorate or don't.
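
A minimal sketch of that dispatch, assuming a pandas frame; _apply_map is an illustrative name, not the real internal:

import pandas as pd

def map(fn):                        # what @ci.map reduces to: stamp the callable
    fn._cirron_batch_map = True
    return fn

def _apply_map(frame, fn):
    if getattr(fn, "_cirron_batch_map", False):
        return fn(frame)            # batch path: whole frame, one call
    # row path: call fn per row, rebuild the frame from the mapped dicts
    return pd.DataFrame([fn(row) for _, row in frame.iterrows()])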

Return types

as_=        Returns                                     Requires
"pandas"    pandas.DataFrame                            cirron-sdk[pandas]
"polars"    polars.DataFrame or LazyFrame               cirron-sdk[polars]
"iter"      Iterator[dict] in batch_size batches        nothing extra
"tensor"    torch.Tensor or tf.Tensor (auto-detected)   framework installed
"hf"        datasets.Dataset                            cirron-sdk[hf]

If neither pandas nor polars is installed and as_= is not specified, ci.load() raises CirronDependencyError with an install hint.
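
Reading the table's Iterator[dict] literally, rows arrive one dict at a time and are fetched under the hood in batch_size chunks; a sketch with a hypothetical consumer:

# Stream rows without holding the dataset in memory
for row in ci.load("./big-events/", as_="iter", batch_size=50_000):
    handle_row(row)  # hypothetical per-row consumer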

Lazy loading

import polars as pl

handle = ci.load("./events.parquet", as_="polars", lazy=True)
filtered = handle.filter(pl.col("label") == 1).collect()

lazy=True returns a LazyHandle with .collect(); with as_="polars" it is a polars LazyFrame (see the return-type table), so filters and projections applied before .collect() narrow the plan instead of materializing first. Useful for large datasets.

Size guardrails

Before downloading anything, ci.load() sums the matched bytes across all sources (for multi-source calls) and applies a three-tier policy on the total:
Size       Behavior
< 1 GB     Silent
< 10 GB    WARNING log with narrowing hints (use match=, etc.)
≥ 10 GB    Raises CirronDataSizeError unless confirm_large=True

Configurable per Cirron instance:
from cirron import Cirron

c = Cirron(
    load_warn_bytes=500_000_000,   # 500 MB → warn
    load_max_bytes=5_000_000_000,  # 5 GB → error
)
c.load("large-bucket", source="platform")

SQL sources are exempt from the size tiers (they can't report a size before executing the query). Use LIMIT to bound results.
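
Past the hard cap, acknowledge the size explicitly (bucket hypothetical):

# ≥ load_max_bytes would raise CirronDataSizeError; confirm_large waives it
df = ci.load("s3://ml-data/full-history/", confirm_large=True)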

Credential resolution for SQL sources

Credentials resolve in this order, first match wins:
  1. URI inline: postgres://user:pass@host/db
  2. Platform integrations: GET /api/integrations/resolve with a scoped, short-lived token (requires a configured Cirron integration for that host)
  3. ci.secret("<scheme>-<host>"): platform-mounted secret
  4. Driver env var: PGPASSWORD / MYSQL_PWD / SNOWFLAKE_PASSWORD / DATABRICKS_TOKEN
Same code works across cloud, on-prem, and air-gapped environments; only the credential source changes.
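
For tier 3, the secret name follows the "<scheme>-<host>" pattern from the list above; host hypothetical:

# Matches postgres://db.internal.example/analytics when the URI carries no inline password
pw = ci.secret("postgres-db.internal.example")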

Not-yet-shipped

search= and top_k= are accepted today for API stability, but they raise the stdlib NotImplementedError (not a Cirron-specific exception) until the platform vector-index feature ships. The docs will update when it does.
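
Until then, callers can trap it like any stdlib exception (dataset name and fallback are hypothetical):

try:
    hits = ci.load("support-tickets", source="platform", search="refund policy", top_k=5)
except NotImplementedError:
    hits = None  # vector search not shipped yet; fall back to a plain load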

Errors

Exception               When
CirronDependencyError   as_= requires a backend that isn't installed (pandas, polars, hf)
CirronDataSizeError     Matched bytes ≥ load_max_bytes and confirm_large=False
CirronDatasetNotFound   source="platform" and the registered name doesn't exist
CirronPlatformRequired  source="platform" but credentials or network are unavailable

Next

Configuration

The Cirron class, ci.env, ci.secret, and where credentials come from.

ci.load reference

Full signature and parameter table.