# ci.load()

Single entry point for data loading: flat kwargs, local-first by default, scheme routing for cloud and SQL sources.
## Signature

```python
def load(
    name: str | list[str],
    *,
    source: Literal["local", "platform"] = "local",
    match: str | dict | None = None,
    ext: list[str] | None = None,
    columns: list[str] | None = None,
    map: Callable | None = None,
    where: str | None = None,
    as_: Literal["pandas", "polars", "iter", "tensor", "hf"] = "pandas",
    lazy: bool = False,
    batch_size: int = 10_000,
    confirm_large: bool = False,
) -> Any
```
## Parameters

| Name | Type | Default | Purpose |
| --- | --- | --- | --- |
| `name` | `str \| list[str]` | - | Path, scheme URI, registered dataset name, or a list (multi-source) |
| `source` | `"local" \| "platform"` | `"local"` | Backend for scheme-less strings; overridden by a scheme in `name` |
| `match` | `str \| dict \| None` | `None` | Glob string or `{path, filename, columns}` dict for filesystem sources |
| `ext` | `list[str] \| None` | `None` | Shorthand extension filter (`["csv", "parquet"]`) |
| `columns` | `list[str] \| None` | `None` | Column selection pushed down to the reader |
| `map` | `Callable \| None` | `None` | Row-wise or batch-wise transform (see below) |
| `where` | `str \| None` | `None` | SQL `WHERE` clause pushed to SQL sources |
| `as_` | `"pandas" \| "polars" \| "iter" \| "tensor" \| "hf"` | `"pandas"` | Return type |
| `lazy` | `bool` | `False` | Return a `LazyHandle`; call `.collect()` to materialize |
| `batch_size` | `int` | `10_000` | Chunk size for `"iter"` and `"tensor"` return modes |
| `confirm_large` | `bool` | `False` | Bypass the 10 GB size-tier error |
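For `as_="iter"`, rows come back in `batch_size` chunks. Whether batches surface as lists or a flat iterator is an SDK detail, but the chunking logic itself can be sketched in plain Python (a hypothetical helper, not SDK code):

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[dict], batch_size: int = 10_000) -> Iterator[list[dict]]:
    """Yield rows in lists of at most batch_size items, mimicking as_="iter"."""
    it = iter(rows)
    # islice pulls up to batch_size rows per pass; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch
```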
## Source resolution

A scheme in `name` always wins over `source=`:

| Input | Backend |
| --- | --- |
| `"./path"` / `"name"` (no scheme, `source="local"`) | Local filesystem |
| `"name"` (no scheme, `source="platform"`) | Platform dataset resolver |
| `"s3://..."` | S3 |
| `"gs://..."` | Google Cloud Storage |
| `"azure://..."` | Azure Blob Storage |
| `"file://..."` | Local filesystem |
| `"postgres://..."` | Postgres |
| `"mysql://..."` | MySQL |
| `"databricks://..."` | Databricks SQL |
| `"snowflake://..."` | Snowflake |
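The routing rule above can be sketched in plain Python. This is a hypothetical helper, not SDK internals; the scheme-to-backend mapping is taken from the table, and the backend labels are illustrative:

```python
from urllib.parse import urlsplit

# Scheme-to-backend mapping from the reference table (labels illustrative).
SCHEME_BACKENDS = {
    "s3": "s3",
    "gs": "gcs",
    "azure": "azure_blob",
    "file": "local",
    "postgres": "postgres",
    "mysql": "mysql",
    "databricks": "databricks_sql",
    "snowflake": "snowflake",
}

def resolve_backend(name: str, source: str = "local") -> str:
    """A scheme in `name` always wins; otherwise fall back to `source=`."""
    scheme = urlsplit(name).scheme
    if scheme:
        return SCHEME_BACKENDS[scheme]
    return "local" if source == "local" else "platform"
```

Note that `resolve_backend("s3://...", source="platform")` still routes to S3: the scheme, not the keyword, decides.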
## Return types

| `as_=` | Returns | Requires |
| --- | --- | --- |
| `"pandas"` | `pandas.DataFrame` | `cirron-sdk[pandas]` |
| `"polars"` | `polars.DataFrame` or `LazyFrame` | `cirron-sdk[polars]` |
| `"iter"` | `Iterator[dict]` in `batch_size` batches | nothing extra |
| `"tensor"` | `torch.Tensor` or `tf.Tensor` (auto-detected) | framework installed |
| `"hf"` | `datasets.Dataset` | `cirron-sdk[hf]` |

Missing backends raise `CirronDependencyError` with an install hint.
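A dependency check like this can be approximated with the standard library alone. The error class, mapping, and hint format below are illustrative stand-ins, not the SDK's actual internals:

```python
from importlib.util import find_spec

class CirronDependencyError(ImportError):
    """Illustrative stand-in for the SDK's dependency error."""

def require_backend(as_: str) -> None:
    # Map each return mode to (module it needs, install extra to hint at).
    needs = {
        "pandas": ("pandas", "pandas"),
        "polars": ("polars", "polars"),
        "hf": ("datasets", "hf"),
    }
    if as_ not in needs:
        return  # "iter" needs nothing extra; "tensor" auto-detects a framework
    module, extra = needs[as_]
    # find_spec returns None when the module is not importable.
    if find_spec(module) is None:
        raise CirronDependencyError(
            f'as_="{as_}" requires {module}; try: pip install cirron-sdk[{extra}]'
        )
```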
## `match=` shapes

```python
# Glob string: simplest form
ci.load("s3://bucket/", match="year=2025/month=*/*.parquet")

# Structured dict: separate path glob, filename regex, column pushdown
ci.load("s3://bucket/", match={
    "path": "year=2025/month=*/",
    "filename": r"events_.*\.parquet",
    "columns": ["user_id", "ts", "event_type"],
})
```
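How a `{path, filename}` dict narrows a listing can be sketched with `fnmatch` and `re`. This hypothetical filter assumes, as the example above suggests, that the path part is a glob and the filename part is a regex:

```python
import re
from fnmatch import fnmatch
from posixpath import basename, dirname

def matches(key: str, match: dict) -> bool:
    """True if an object key passes both the path glob and the filename regex."""
    # Directory portion against the glob; trailing "/" mirrors the glob shape.
    path_ok = fnmatch(dirname(key) + "/", match.get("path", "*"))
    # Basename must fully match the filename regex.
    name_ok = re.fullmatch(match.get("filename", ".*"), basename(key)) is not None
    return path_ok and name_ok
```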
## `map=` shapes

```python
# Row-wise: plain callable, receives one row at a time
ci.load("./raw/", map=lambda row: {"text": row["raw"].lower()})

# Batch-wise: decorate with @ci.map, receives the full frame
@ci.map
def to_features(frame):
    frame["text"] = frame["raw"].str.lower()
    return frame

ci.load("./raw/", map=to_features)
```
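The row-wise/batch-wise split implies a dispatch somewhere. A minimal sketch of that dispatch, where the `_cirron_batch` marker attribute is purely an assumption about how a decorator like `@ci.map` could tag batch-wise callables:

```python
def batch_map(fn):
    """Hypothetical stand-in for @ci.map: tag fn as batch-wise."""
    fn._cirron_batch = True  # assumed marker, not a documented SDK attribute
    return fn

def apply_transform(rows: list[dict], fn) -> list[dict]:
    if getattr(fn, "_cirron_batch", False):
        return fn(rows)                    # batch-wise: whole batch at once
    return [fn(row) for row in rows]       # row-wise: one row at a time
```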
## Size guardrails

Before downloading anything, `ci.load()` sums the matched bytes and applies the tier policy configured on the Cirron instance:

| Size | Behavior |
| --- | --- |
| < 1 GB | Silent |
| < 10 GB | `WARNING` log with narrowing hints |
| ≥ 10 GB | Raises `CirronDataSizeError` unless `confirm_large=True` |

SQL sources opt out because they can’t estimate size before executing. Use `LIMIT` to bound result size.
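A plain-Python sketch of that tier policy, with thresholds taken from the table; the error class and return labels are illustrative, not the SDK's actual implementation:

```python
import logging

GB = 1024 ** 3

class CirronDataSizeError(RuntimeError):
    """Illustrative stand-in for the SDK's size-tier error."""

def check_size(total_bytes: int, confirm_large: bool = False) -> str:
    if total_bytes < 1 * GB:
        return "silent"
    if total_bytes < 10 * GB:
        logging.warning("Matched %.1f GB; narrow with match=/ext=/columns=",
                        total_bytes / GB)
        return "warned"
    if not confirm_large:
        raise CirronDataSizeError(
            f"Matched {total_bytes / GB:.1f} GB >= 10 GB; pass confirm_large=True"
        )
    return "confirmed"
```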
## SQL credential resolution

First match wins:

1. URI inline: `postgres://user:pass@host/db`
2. Platform integrations: `GET /api/integrations/resolve` with a scoped, short-lived token (requires a configured Cirron integration)
3. `ci.secret("<scheme>-<host>")`: platform-mounted secret
4. Driver env var: `PGPASSWORD` / `MYSQL_PWD` / `SNOWFLAKE_PASSWORD` / `DATABRICKS_TOKEN`
## Examples

```python
# Local
df = ci.load("./data/events.parquet")
df = ci.load("training-data")  # probes ./training-data/, ./data/training-data/

# Multi-source: parallel load, concatenate
df = ci.load(["./a/", "./b/"])

# Cloud
df = ci.load("s3://ml-data/events/", match="year=2025/month=*/*.parquet")
df = ci.load("gs://bucket/events/", ext=["parquet"], columns=["user_id", "ts"])

# SQL
df = ci.load("postgres://prod/events", where="created_at > '2025-01-01'")
df = ci.load("snowflake://wh/db/schema/table", where="region = 'EMEA' LIMIT 100000")

# Platform-registered
df = ci.load("bucket1", source="platform")

# Polars + lazy
handle = ci.load("./events.parquet", as_="polars", lazy=True)
out = handle.collect().filter(pl.col("label") == 1)
```
## Errors

| Exception | Raised when |
| --- | --- |
| `CirronDependencyError` | `as_=` requires a backend that isn’t installed |
| `CirronDataSizeError` | Matched bytes ≥ `load_max_bytes` and `confirm_large=False` |
| `CirronDatasetNotFound` | `source="platform"` and the name doesn’t resolve |
| `CirronPlatformRequired` | `source="platform"` but credentials or network unavailable |
## Not yet shipped

`search=` and `top_k=` (semantic search over a platform vector index) accept input for API stability but raise `NotImplementedError` until the backend ships.
- **Data loading guide**: narrative walk-through with more examples.
- **Errors**: exception hierarchy for the data loader.