
ci.load() is the single entry point for data loading: flat kwargs, local-first by default, and scheme-based routing for cloud and SQL sources.

Signature

def load(
    name: str | list[str],
    *,
    source: Literal["local", "platform"] = "local",
    match: str | dict | None = None,
    ext: list[str] | None = None,
    columns: list[str] | None = None,
    map: Callable | None = None,
    where: str | None = None,
    as_: Literal["pandas", "polars", "iter", "tensor", "hf"] = "pandas",
    lazy: bool = False,
    batch_size: int = 10_000,
    confirm_large: bool = False,
) -> Any

Parameters

| Name | Type | Default | Purpose |
|---|---|---|---|
| name | str or list[str] | required | Path, scheme URI, registered dataset name, or a list (multi-source) |
| source | "local" or "platform" | "local" | Backend for scheme-less strings; overridden by a scheme in name |
| match | str, dict, or None | None | Glob string or {path, filename, columns} dict for filesystem sources |
| ext | list[str] or None | None | Shorthand extension filter (["csv", "parquet"]) |
| columns | list[str] or None | None | Column selection pushed down to the reader |
| map | Callable or None | None | Row-wise or batch-wise transform (see below) |
| where | str or None | None | SQL WHERE clause pushed to SQL sources |
| as_ | "pandas", "polars", "iter", "tensor", or "hf" | "pandas" | Return type |
| lazy | bool | False | Return a LazyHandle; call .collect() to materialize |
| batch_size | int | 10_000 | Chunk size for "iter" and "tensor" return modes |
| confirm_large | bool | False | Bypass the 10 GB size-tier error |
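
The kwargs compose. A minimal sketch combining a few of them (the directory and column names are hypothetical):

# Local parquet files only, with column pushdown
df = ci.load(
    "./data/events/",
    ext=["parquet"],                          # skip everything but parquet
    columns=["user_id", "ts", "event_type"],  # pushed down to the reader
)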

Source resolution

A scheme in name always wins over source=:
| Input | Backend |
|---|---|
| "./path" or "name" (no scheme, source="local") | Local filesystem |
| "name" (no scheme, source="platform") | Platform dataset resolver |
| "s3://..." | S3 |
| "gs://..." | Google Cloud Storage |
| "azure://..." | Azure Blob Storage |
| "file://..." | Local filesystem |
| "postgres://..." | Postgres |
| "mysql://..." | MySQL |
| "databricks://..." | Databricks SQL |
| "snowflake://..." | Snowflake |

Return types

| as_= | Returns | Requires |
|---|---|---|
| "pandas" | pandas.DataFrame | cirron-sdk[pandas] |
| "polars" | polars.DataFrame or LazyFrame | cirron-sdk[polars] |
| "iter" | Iterator[dict] in batch_size batches | nothing extra |
| "tensor" | torch.Tensor or tf.Tensor (auto-detected) | framework installed |
| "hf" | datasets.Dataset | cirron-sdk[hf] |

Missing backends raise CirronDependencyError with an install hint.
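
As a sketch, "iter" needs no optional extras and streams rows as dicts (the file and column names are hypothetical):

# Rows come back as dicts, fetched in chunks of batch_size under the hood
rows = ci.load("./data/events.parquet", as_="iter", batch_size=5_000)
for row in rows:
    print(row["user_id"])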

match= shapes

# Glob string: simplest form
ci.load("s3://bucket/", match="year=2025/month=*/*.parquet")

# Structured dict: separate path glob, filename regex, column pushdown
ci.load("s3://bucket/", match={
    "path":     "year=2025/month=*/",
    "filename": r"events_.*\.parquet",
    "columns":  ["user_id", "ts", "event_type"],
})

map= shapes

# Row-wise: plain callable, receives one row at a time
ci.load("./raw/", map=lambda row: {"text": row["raw"].lower()})

# Batch-wise: decorate with @ci.map, receives the full frame
@ci.map
def to_features(frame):
    frame["text"] = frame["raw"].str.lower()
    return frame

ci.load("./raw/", map=to_features)

Size guardrails

Before downloading anything, ci.load() sums the matched bytes and applies the tier policy configured on the Cirron instance:
| Size | Behavior |
|---|---|
| < 1 GB | Silent |
| < 10 GB | WARNING log with narrowing hints |
| ≥ 10 GB | Raises CirronDataSizeError unless confirm_large=True |
SQL sources are exempt because result size cannot be estimated before the query executes; use LIMIT to bound result size.
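
A sketch of reacting to the guardrail (the S3 prefix is hypothetical, and it assumes the exception is exposed on the same ci namespace used elsewhere on this page):

try:
    df = ci.load("s3://ml-data/events/", match="year=*/month=*/*.parquet")
except ci.CirronDataSizeError:   # namespace assumed; see Errors below
    # Narrow the match first; confirm_large=True is the explicit opt-in for big pulls
    df = ci.load(
        "s3://ml-data/events/",
        match="year=2025/month=01/*.parquet",
        confirm_large=True,
    )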

SQL credential resolution

First match wins:
  1. URI inline: postgres://user:pass@host/db
  2. Platform integrations: GET /api/integrations/resolve with a scoped, short-lived token (requires a configured Cirron integration)
  3. ci.secret("<scheme>-<host>"): platform-mounted secret
  4. Driver env var: PGPASSWORD / MYSQL_PWD / SNOWFLAKE_PASSWORD / DATABRICKS_TOKEN
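
Two ends of that chain, as a sketch (host, database, and credentials are hypothetical):

# 1. Inline credentials in the URI always win
df = ci.load("postgres://analyst:s3cret@db.internal/warehouse", where="created_at > '2025-01-01'")

# 4. With no inline creds, integration, or mounted secret, the driver env var is used last
#    (export PGPASSWORD=... before starting the process)
df = ci.load("postgres://analyst@db.internal/warehouse", where="created_at > '2025-01-01'")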

Examples

# Local
df = ci.load("./data/events.parquet")
df = ci.load("training-data")                 # probes ./training-data/, ./data/training-data/

# Multi-source: parallel load, concatenate
df = ci.load(["./a/", "./b/"])

# Cloud
df = ci.load("s3://ml-data/events/", match="year=2025/month=*/*.parquet")
df = ci.load("gs://bucket/events/", ext=["parquet"], columns=["user_id", "ts"])

# SQL
df = ci.load("postgres://prod/events", where="created_at > '2025-01-01'")
df = ci.load("snowflake://wh/db/schema/table", where="region = 'EMEA' LIMIT 100000")

# Platform-registered
df = ci.load("bucket1", source="platform")

# Polars + lazy (assumes: import polars as pl)
handle = ci.load("./events.parquet", as_="polars", lazy=True)
out = handle.collect().filter(pl.col("label") == 1)

Errors

| Exception | Raised when |
|---|---|
| CirronDependencyError | as_= requires a backend that isn't installed |
| CirronDataSizeError | Matched bytes ≥ load_max_bytes and confirm_large=False |
| CirronDatasetNotFound | source="platform" and the name doesn't resolve |
| CirronPlatformRequired | source="platform" but credentials or network unavailable |
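
A sketch of catching the platform-related failures, assuming the exceptions live on the same ci namespace used elsewhere on this page:

try:
    df = ci.load("training-data", source="platform")
except ci.CirronDatasetNotFound:         # exception namespace assumed
    df = ci.load("./data/training-data/")   # fall back to a local copy
except ci.CirronPlatformRequired:
    raise SystemExit("configure platform credentials or rerun with source='local'")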

Not yet shipped

search= and top_k= (semantic search over a platform vector index) are accepted for API stability but raise NotImplementedError until the backend ships.
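
So a call that passes them today should expect a loud failure rather than a silent no-op (the dataset name is hypothetical):

try:
    df = ci.load("docs-corpus", source="platform", search="refund policy", top_k=20)
except NotImplementedError:
    df = ci.load("docs-corpus", source="platform")   # plain load until the feature ships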

See also

Data loading guide: narrative walk-through with more examples.

Errors: exception hierarchy for the data loader.