
ci.load() is the single entry point for data loading: flat kwargs, local-first by default, and scheme-based routing for cloud and SQL sources.

Signature

def load(
    name: str | list[str],
    *,
    source: Literal["local", "platform"] = "local",
    match: str | dict | None = None,
    ext: list[str] | None = None,
    columns: list[str] | None = None,
    map: Callable | None = None,
    where: str | None = None,
    as_: Literal["pandas", "polars", "iter", "tensor", "hf"] = "pandas",
    lazy: bool = False,
    batch_size: int = 10_000,
    confirm_large: bool = False,
) -> Any

Parameters

| Name | Type | Default | Purpose |
|---|---|---|---|
| name | str or list[str] | required | Path, scheme URI, registered dataset name, or a list (multi-source) |
| source | "local" or "platform" | "local" | Backend for scheme-less strings; overridden by a scheme in name |
| match | str, dict, or None | None | Glob string or {path, filename, columns} dict for filesystem sources |
| ext | list[str] or None | None | Shorthand extension filter (["csv", "parquet"]) |
| columns | list[str] or None | None | Column selection pushed down to the reader |
| map | Callable or None | None | Row-wise or batch-wise transform (see below) |
| where | str or None | None | SQL WHERE clause pushed to SQL sources |
| as_ | "pandas", "polars", "iter", "tensor", or "hf" | "pandas" | Return type |
| lazy | bool | False | Return a LazyHandle; call .collect() to materialize |
| batch_size | int | 10_000 | Chunk size for "iter" and "tensor" return modes |
| confirm_large | bool | False | Bypass the 10 GB size-tier error |
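
The kwargs compose. A minimal sketch combining a few of them (the directory and column names are hypothetical):

# Local parquet files only, with column pushdown
df = ci.load(
    "./data/events/",
    ext=["parquet"],                          # skip everything but parquet
    columns=["user_id", "ts", "event_type"],  # pushed down to the reader
)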

Source resolution

A scheme in name always wins over source=:
| Input | Backend |
|---|---|
| "./path" or "name" (no scheme, source="local") | Local filesystem |
| "name" (no scheme, source="platform") | Platform dataset resolver |
| "s3://..." | S3 |
| "gs://..." | Google Cloud Storage |
| "azure://..." | Azure Blob Storage |
| "file://..." | Local filesystem |
| "postgres://..." | Postgres |
| "mysql://..." | MySQL |
| "databricks://..." | Databricks SQL |
| "snowflake://..." | Snowflake |

Return types

| as_= | Returns | Requires |
|---|---|---|
| "pandas" | pandas.DataFrame | cirron-sdk[pandas] |
| "polars" | polars.DataFrame or LazyFrame | cirron-sdk[polars] |
| "iter" | Iterator[dict] in batch_size batches | nothing extra |
| "tensor" | torch.Tensor or tf.Tensor (auto-detected) | framework installed |
| "hf" | datasets.Dataset | cirron-sdk[hf] |

Missing backends raise CirronDependencyError with an install hint.
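
As a sketch, "iter" needs no optional extras and streams rows as dicts (the file and column names are hypothetical):

# Rows come back as dicts, fetched in chunks of batch_size under the hood
rows = ci.load("./data/events.parquet", as_="iter", batch_size=5_000)
for row in rows:
    print(row["user_id"])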

match= shapes

# Glob string: simplest form
ci.load("s3://bucket/", match="year=2025/month=*/*.parquet")

# Structured dict: separate path glob, filename regex, column pushdown
ci.load("s3://bucket/", match={
    "path":     "year=2025/month=*/",
    "filename": r"events_.*\.parquet",
    "columns":  ["user_id", "ts", "event_type"],
})

map= shapes

# Row-wise: plain callable, receives one row at a time
ci.load("./raw/", map=lambda row: {"text": row["raw"].lower()})

# Batch-wise: decorate with @ci.map, receives the full frame
@ci.map
def to_features(frame):
    frame["text"] = frame["raw"].str.lower()
    return frame

ci.load("./raw/", map=to_features)

Size guardrails

Before downloading anything, ci.load() sums the matched bytes and applies the tier policy configured on the Cirron instance:
| Size | Behavior |
|---|---|
| < 1 GB | Silent |
| < 10 GB | WARNING log with narrowing hints |
| ≥ 10 GB | Raises CirronDataSizeError unless confirm_large=True |
SQL sources are exempt because result size cannot be estimated before the query executes; use LIMIT to bound result size.
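
A sketch of reacting to the guardrail (the S3 prefix is hypothetical, and it assumes the exception is exposed on the same ci namespace used elsewhere on this page):

try:
    df = ci.load("s3://ml-data/events/", match="year=*/month=*/*.parquet")
except ci.CirronDataSizeError:   # namespace assumed; see Errors below
    # Narrow the match first; confirm_large=True is the explicit opt-in for big pulls
    df = ci.load(
        "s3://ml-data/events/",
        match="year=2025/month=01/*.parquet",
        confirm_large=True,
    )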

SQL credential resolution

First match wins:
  1. URI inline: postgres://user:pass@host/db
  2. Platform integrations: GET /api/integrations/resolve with a scoped, short-lived token (requires a configured Cirron integration)
  3. ci.secret("<scheme>-<host>"): platform-mounted secret
  4. Driver env var: PGPASSWORD / MYSQL_PWD / SNOWFLAKE_PASSWORD / DATABRICKS_TOKEN
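
Two ends of that chain, as a sketch (host, database, and credentials are hypothetical):

# 1. Inline credentials in the URI always win
df = ci.load("postgres://analyst:s3cret@db.internal/warehouse", where="created_at > '2025-01-01'")

# 4. With no inline creds, integration, or mounted secret, the driver env var is used last
#    (export PGPASSWORD=... before starting the process)
df = ci.load("postgres://analyst@db.internal/warehouse", where="created_at > '2025-01-01'")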

Examples

# Local
df = ci.load("./data/events.parquet")
df = ci.load("training-data")                 # probes ./training-data/, ./data/training-data/

# Multi-source: parallel load, concatenate
df = ci.load(["./a/", "./b/"])

# Cloud
df = ci.load("s3://ml-data/events/", match="year=2025/month=*/*.parquet")
df = ci.load("gs://bucket/events/", ext=["parquet"], columns=["user_id", "ts"])

# SQL
df = ci.load("postgres://prod/events", where="created_at > '2025-01-01'")
df = ci.load("snowflake://wh/db/schema/table", where="region = 'EMEA' LIMIT 100000")

# Platform-registered
df = ci.load("bucket1", source="platform")

# Polars + lazy (assumes: import polars as pl)
handle = ci.load("./events.parquet", as_="polars", lazy=True)
out = handle.collect().filter(pl.col("label") == 1)

Errors

| Exception | Raised when |
|---|---|
| CirronDependencyError | as_= requires a backend that isn't installed |
| CirronDataSizeError | Matched bytes ≥ load_max_bytes and confirm_large=False |
| CirronDatasetNotFound | source="platform" and the name doesn't resolve |
| CirronPlatformRequired | source="platform" but credentials or network unavailable |
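
A sketch of catching the platform-related failures, assuming the exceptions live on the same ci namespace used elsewhere on this page:

try:
    df = ci.load("training-data", source="platform")
except ci.CirronDatasetNotFound:         # exception namespace assumed
    df = ci.load("./data/training-data/")   # fall back to a local copy
except ci.CirronPlatformRequired:
    raise SystemExit("configure platform credentials or rerun with source='local'")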

Not yet shipped

search= and top_k= (semantic search over a platform vector index) are accepted for API stability but raise NotImplementedError until the backend ships.
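
So a call that passes them today should expect a loud failure rather than a silent no-op (the dataset name is hypothetical):

try:
    df = ci.load("docs-corpus", source="platform", search="refund policy", top_k=20)
except NotImplementedError:
    df = ci.load("docs-corpus", source="platform")   # plain load until the feature ships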

See also

Data loading guide: narrative walk-through with more examples.

Errors: exception hierarchy for the data loader.