ci.load() is the single entry point for data access. One function, flat kwargs, local-first by default. Nothing hits the network unless you explicitly opt in via source="platform" or a scheme in the name string.

Signature

def load(
    name: str | list[str],
    *,
    source: Literal["local", "platform"] = "local",
    match: str | dict | None = None,
    ext: list[str] | None = None,
    columns: list[str] | None = None,
    map: Callable | None = None,
    where: str | None = None,
    as_: Literal["pandas", "polars", "iter", "tensor", "hf"] = "pandas",
    lazy: bool = False,
    batch_size: int = 10_000,
    confirm_large: bool = False,
) -> Any: ...
A scheme in the name string (s3://, gs://, postgres://, …) always overrides the source= kwarg. Without a scheme and with the default source="local", ci.load() probes the local filesystem and never calls the platform.
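
A quick illustration of that precedence (bucket name hypothetical):

# The s3:// scheme wins even though source= says "local"
df = ci.load("s3://ml-data/events/", source="local")

# No scheme: stays on the local filesystem, never calls the platform
df = ci.load("./events/")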

Where the data comes from

# Single file
df = ci.load("./data/training/events.parquet")

# Directory: probes ./training-data/, then ./data/training-data/
# First match wins; no fallback to later candidates if the first exists but is empty.
df = ci.load("training-data")

# Multi-source union: parallel load, concatenate
df = ci.load(["./data/a/", "./data/b/"])

Filtering and selection

match= and ext= work on any filesystem-backed source (local, S3, GCS, Azure, file://).

# Glob pattern
df = ci.load("s3://ml-data/events/", match="year=2025/month=*/*.parquet")

# Structured match: path glob + filename regex + columns pushdown
df = ci.load(
    "s3://ml-data/events/",
    match={
        "path": "year=2025/month=*/",
        "filename": r"events_.*\.parquet",
        "columns": ["user_id", "ts", "event_type"],
    },
)

# Extension shorthand: accepts multiple
df = ci.load("./data/", ext=["csv", "parquet"])

# Column selection: pushed down to Parquet / SQL readers
df = ci.load("./events.parquet", columns=["user_id", "ts", "event_type"])

where= is passed through to SQL sources unescaped: it's your query, against your data. Bound the result size with LIMIT when you can.
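
A sketch of the shape; the URI, how the table is addressed, and embedding LIMIT inside where= are all assumptions here:

# where= lands in the query verbatim; bound the result yourself
df = ci.load(
    "postgres://analytics.internal/warehouse",   # hypothetical database URI
    columns=["user_id", "ts", "event_type"],
    where="event_type = 'purchase' AND ts >= '2025-01-01' LIMIT 100000",
)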

Transforms at load time

# Row-wise: receives one row at a time
df = ci.load(
    "./raw/",
    columns=["raw_text", "label"],
    map=lambda row: {"text": row["raw_text"].lower(), "label": int(row["label"])},
)

# Batch-wise: receives the full frame at once
@ci.map
def to_features(frame):
    frame["text"] = frame["raw_text"].str.lower()
    return frame

df = ci.load("./raw/", map=to_features)

Use @ci.map when the transform is vectorizable against pandas / polars; use plain callables for per-row work. How the switch is made: the @ci.map decorator sets a _cirron_batch_map=True attribute on the callable. ci.load() checks for that attribute: present means the whole frame is passed in one call, absent means rows are iterated. That's the entire mechanism; decorate or don't.
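
A minimal sketch of that dispatch, assuming a pandas frame; _apply_map is an illustrative name, not the real internal:

import pandas as pd

def map(fn):                        # what @ci.map reduces to: stamp the callable
    fn._cirron_batch_map = True
    return fn

def _apply_map(frame, fn):
    if getattr(fn, "_cirron_batch_map", False):
        return fn(frame)            # batch path: whole frame, one call
    # row path: call fn per row, rebuild the frame from the mapped dicts
    return pd.DataFrame([fn(row) for _, row in frame.iterrows()])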

Return types

as_=        Returns                                     Requires
"pandas"    pandas.DataFrame                            cirron-sdk[pandas]
"polars"    polars.DataFrame or LazyFrame               cirron-sdk[polars]
"iter"      Iterator[dict] in batch_size batches        nothing extra
"tensor"    torch.Tensor or tf.Tensor (auto-detected)   framework installed
"hf"        datasets.Dataset                            cirron-sdk[hf]

If neither pandas nor polars is installed and as_= is not specified, ci.load() raises CirronDependencyError with an install hint.
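
Reading the table's Iterator[dict] literally, rows arrive one dict at a time and are fetched under the hood in batch_size chunks; a sketch with a hypothetical consumer:

# Stream rows without holding the dataset in memory
for row in ci.load("./big-events/", as_="iter", batch_size=50_000):
    handle_row(row)  # hypothetical per-row consumer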

Lazy loading

import polars as pl

handle = ci.load("./events.parquet", as_="polars", lazy=True)
filtered = handle.filter(pl.col("label") == 1).collect()

lazy=True returns a LazyHandle with .collect(); with as_="polars" it is a polars LazyFrame (see the return-type table), so filters and projections applied before .collect() narrow the plan instead of materializing first. Useful for large datasets.

Size guardrails

Before downloading anything, ci.load() sums the matched bytes across all sources (for multi-source calls) and applies a three-tier policy on the total:
Size       Behavior
< 1 GB     Silent
< 10 GB    WARNING log with narrowing hints (use match=, etc.)
≥ 10 GB    Raises CirronDataSizeError unless confirm_large=True

Configurable per Cirron instance:
from cirron import Cirron

c = Cirron(
    load_warn_bytes=500_000_000,   # 500 MB → warn
    load_max_bytes=5_000_000_000,  # 5 GB → error
)
c.load("large-bucket", source="platform")

SQL sources are exempt from the size tiers (they can't report a size before executing the query). Use LIMIT to bound results.
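
Past the hard cap, acknowledge the size explicitly (bucket hypothetical):

# ≥ load_max_bytes would raise CirronDataSizeError; confirm_large waives it
df = ci.load("s3://ml-data/full-history/", confirm_large=True)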

Credential resolution for SQL sources

Credentials resolve in this order, first match wins:
  1. URI inline: postgres://user:pass@host/db
  2. Platform integrations: GET /api/integrations/resolve with a scoped, short-lived token (requires a configured Cirron integration for that host)
  3. ci.secret("<scheme>-<host>"): platform-mounted secret
  4. Driver env var: PGPASSWORD / MYSQL_PWD / SNOWFLAKE_PASSWORD / DATABRICKS_TOKEN
Same code works across cloud, on-prem, and air-gapped environments; only the credential source changes.
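
For tier 3, the secret name follows the "<scheme>-<host>" pattern from the list above; host hypothetical:

# Matches postgres://db.internal.example/analytics when the URI carries no inline password
pw = ci.secret("postgres-db.internal.example")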

Not-yet-shipped

search= and top_k= are accepted today for API stability, but they raise the stdlib NotImplementedError (not a Cirron-specific exception) until the platform vector-index feature ships. The docs will update when it does.
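
Until then, callers can trap it like any stdlib exception (dataset name and fallback are hypothetical):

try:
    hits = ci.load("support-tickets", source="platform", search="refund policy", top_k=5)
except NotImplementedError:
    hits = None  # vector search not shipped yet; fall back to a plain load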

Errors

Exception               When
CirronDependencyError   as_= requires a backend that isn't installed (pandas, polars, hf)
CirronDataSizeError     Matched bytes ≥ load_max_bytes and confirm_large=False
CirronDatasetNotFound   source="platform" and the registered name doesn't exist
CirronPlatformRequired  source="platform" but credentials or network are unavailable

Next

Configuration

The Cirron class, ci.env, ci.secret, and where credentials come from.

ci.load reference

Full signature and parameter table.