CTable¶
A columnar compressed table backed by one physical container per column.
Scalar columns use NDArray; list-valued columns use
ListArray. Each column is stored, compressed, and queried
independently; rows are never materialised in their entirety unless you
explicitly call to_arrow() or iterate with
__iter__().
- class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None)[source]¶
Columnar compressed table with typed columns and row-oriented access.
- Attributes:
  - cbytes: Total compressed size in bytes (all columns + valid_rows mask).
  - computed_columns: Read-only view of the computed-column definitions.
  - cratio: Compression ratio for the whole table payload.
  - indexes: Return a list of blosc2.Index handles for all active indexes.
  - info: Get information about this table.
  - info_items: Structured summary items used by info().
  - nbytes: Total uncompressed size in bytes (all columns + valid_rows mask).
  - ncols: Total number of columns, including computed (virtual) columns.
  - nrows
  - schema: The compiled schema that drives this table’s columns and validation.
Methods
- add_column(name, spec): Add a new column filled from the default declared in spec.
- add_computed_column(name, expr, *[, dtype]): Add a read-only virtual column whose values are computed from other columns.
- append(data): Append a single row to the table.
- close(): Close any persistent backing store held by this table.
- column_schema(name): Return the CompiledColumn descriptor for name.
- compact(): Physically rewrite every column array keeping only live rows.
- compact_index([col_name, expression, name]): Compact an index, merging any incremental append runs.
- copy([compact, urlpath, overwrite]): Return a new standalone copy of this table.
- cov(): Return the covariance matrix as a numpy array.
- create_index([col_name, field, expression, ...]): Build and register an index for a stored column or table expression.
- delete(ind): Mark one or more rows as deleted (tombstone deletion).
- describe(): Print a per-column statistical summary.
- drop_column(name): Remove a column from the table.
- drop_computed_column(name): Remove a computed column from the table.
- drop_index([col_name, expression, name]): Remove an index and delete any sidecar files.
- extend(data, *[, validate]): Append multiple rows at once.
- from_arrow(schema, batches, *[, urlpath, ...]): Build a CTable from an Arrow schema and iterable of record batches.
- from_csv(path, row_cls, *[, header, sep]): Build a CTable from a CSV file.
- from_parquet(path, *[, columns, batch_size, ...]): Read a Parquet file into a CTable.
- head([N]): Return a view of the first N live rows (default 5).
- index([col_name, expression, name]): Return the index handle for a stored-column or expression target.
- iter_arrow_batches(*[, columns, batch_size, ...]): Yield live rows as bounded-size pyarrow.RecordBatch objects.
- iter_sorted(cols[, ascending, start, stop, ...]): Iterate rows in sorted order without materializing a full copy.
- load(urlpath): Load a persistent table from urlpath into RAM.
- materialize_computed_column(name, *[, ...]): Materialize a computed column into a new stored snapshot column.
- open(urlpath, *[, mode]): Open a persistent CTable from urlpath.
- rebuild_index([col_name, expression, name]): Drop and recreate an index with the same parameters.
- rename_column(old, new): Rename a column.
- sample(n, *[, seed]): Return a read-only view of n randomly chosen live rows.
- save(urlpath, *[, overwrite]): Persist this table to disk at urlpath.
- Return a JSON-compatible dict describing this table's schema.
- select(cols): Return a column-projection view exposing only cols.
- sort_by(cols[, ascending, inplace]): Return a copy of the table sorted by one or more columns.
- tail([N]): Return a view of the last N live rows (default 5).
- to_arrow(): Convert all live rows to a pyarrow.Table.
- to_b2d(urlpath, *[, overwrite, compact]): Write this table to a directory-backed store.
- to_b2z(urlpath, *[, overwrite, compact]): Write this table to a compact .b2z container.
- to_csv(path, *[, header, sep]): Write all live rows to a CSV file.
- to_parquet(path, *[, columns, batch_size, ...]): Write this table to a Parquet file batch-wise using pyarrow.
- view(new_valid_rows): Return a row-filter view backed by a boolean mask array without copying data.
- where(expr_result, *[, columns]): Return a row-filtered view matching a boolean predicate.
Special methods
- __len__(): Return the number of live (non-deleted) rows.
- __iter__(): Iterate over live rows in insertion order, yielding namedtuple-like row objects.
- __getitem__(key): Type-driven indexing for columns, rows, projections, and filters.
- __repr__(): Short CTable<cols>(N rows, X compressed) summary string.
- __str__(): Pandas-style tabular display with column names, dtypes, and a row count footer.
- __len__()[source]¶
Return the number of live (non-deleted) rows.
- __iter__()[source]¶
Iterate over live rows in insertion order, yielding namedtuple-like row objects with one attribute per column.
- __getitem__(key)[source]¶
Type-driven indexing for columns, rows, projections, and filters.
Supported keys are:
- str: return a Column when it matches a stored or computed column name; otherwise evaluate it as a boolean expression via where().
- boolean blosc2.LazyExpr or blosc2.NDArray: return the same filtered view as where(), e.g. t[t.temperature_f > 70].
- int: return one live row as a namedtuple-like object.
- slice: return a row-range view.
- integer array/list: return a gathered-row view.
- boolean NumPy array/list: return a boolean-mask filtered view.
- string list: return a column-projection view, equivalent to select().
Examples
Access columns and rows:
temps = t["temperature"]
first = t[0]
view = t[10:20]
Filter rows with a string expression, a stored-column expression, or a computed-column expression:
warm = t["temperature > 20"]
warm_active = t[(t.temperature > 20) & t.active]
hot_fahrenheit = t[t.temperature_f > 70]
Project columns:
slim = t[["sensor_id", "temperature_f"]]
- __str__() str[source]¶
Pandas-style tabular display with column names, dtypes, and a row count footer.
- classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
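The buffering policy that blosc2_batch_size describes can be pictured with a small pure-Python sketch. This is an illustration of the flush rule only, not blosc2's implementation; BufferedColumn and its methods are hypothetical names:

```python
class BufferedColumn:
    """Illustrative buffer that spills rows to a backend in batches."""

    def __init__(self, batch_size=2048):
        # batch_size=None means: keep everything pending until the final flush.
        self.batch_size = batch_size
        self.pending = []
        self.flushed = []  # stands in for the real compressed backend

    def append(self, value):
        self.pending.append(value)
        # With a numeric batch_size, spill once the buffer fills.
        if self.batch_size is not None and len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        self.flushed.extend(self.pending)
        self.pending.clear()


col = BufferedColumn(batch_size=3)
for v in ["a", "b", "c", "d"]:
    col.append(v)
print(len(col.flushed), len(col.pending))  # 3 1: one row still buffered
col.flush()                                # final flush drains the buffer
```

A larger batch_size trades transient RAM for fewer, larger writes to the backend.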
- classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
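The single-pass, column-wise load described above can be sketched with the stdlib csv module and NumPy. This is a simplified stand-in for the real loader; the Row dataclass and the direct field-type-as-dtype mapping are assumptions for illustration:

```python
import csv
import io
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class Row:              # hypothetical row type; field order must match the file
    sensor_id: int
    temperature: float


text = "sensor_id,temperature\n1,20.5\n2,21.0\n"
reader = csv.reader(io.StringIO(text))
next(reader)            # header=True: skip the header row

# Single pass into per-column Python lists ...
cols = {f.name: [] for f in fields(Row)}
for row in reader:
    for f, value in zip(fields(Row), row):
        cols[f.name].append(f.type(value))

# ... then one bulk write per column into a pre-allocated array.
n = len(cols["sensor_id"])
arrays = {f.name: np.empty(n, dtype=f.type) for f in fields(Row)}
for name, values in cols.items():
    arrays[name][:] = values   # one slice assignment per column
print(arrays["temperature"])
```

Reading columns rather than rows keeps the per-row Python overhead to a single append per field.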
- classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- add_column(name: str, spec: SchemaSpec | Field) None[source]¶
Add a new column filled from the default declared in spec.
- Parameters:
name¶ – Column name. Must follow the same naming rules as schema fields.
spec¶ – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.
- Raises:
ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.
TypeError – If a declared default cannot be coerced to spec’s dtype.
- add_computed_column(name: str, expr, *, dtype: dtype | None = None) None[source]¶
Add a read-only virtual column whose values are computed from other columns.
The column stores no data — it is evaluated on-the-fly when read. It participates in display, filtering, sorting, export (to_arrow / to_csv), and aggregates, but cannot be written to, indexed, or included in append/extend inputs.
- Parameters:
name¶ – Column name. Must not collide with any existing stored or computed column and must satisfy the usual naming rules.
expr¶ – Either a callable (cols: dict[str, NDArray]) -> LazyExpr or an expression string (e.g. "price * qty") where column names are referenced directly and resolved from stored columns.
dtype¶ – Override the inferred result dtype. When omitted the dtype is taken from the blosc2.LazyExpr.
- Raises:
ValueError – If called on a view, the table is read-only, name already exists, or an operand is not a stored column of this table.
TypeError – If expr is not a callable or string, or does not return a blosc2.LazyExpr.
- append(data: list | void | ndarray) None[source]¶
Append a single row to the table.
data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.
- column_schema(name: str) CompiledColumn[source]¶
Return the CompiledColumn descriptor for name.
- Raises:
KeyError – If name is not a column in this table.
- compact()[source]¶
Physically rewrite every column array keeping only live rows.
Closes the gaps left by prior delete() calls. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
- compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Compact an index, merging any incremental append runs.
- copy(compact: bool = True, *, urlpath: str | PathLike[str] | None = None, overwrite: bool = False) CTable[source]¶
Return a new standalone copy of this table.
- Parameters:
compact¶ – If True (default), only live (non-deleted) rows are copied. The result is a dense table with no tombstones and no parent dependency — ideal for materialising a filtered view. If False, all physical slots are copied including deleted gaps, preserving the tombstone state exactly for in-memory copies.
urlpath¶ – Destination path for a persistent copy. The .b2z extension selects a compact zip-backed store; any other path uses a directory-backed store. A .b2d suffix is recommended for directory-backed stores. If None (default), return an in-memory copy.
overwrite¶ – If True, replace an existing persistent destination.
- cov() ndarray[source]¶
Return the covariance matrix as a numpy array.
Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.
- Returns:
Shape (ncols, ncols). Column order matches col_names.
- Return type:
numpy.ndarray
- Raises:
TypeError – If any column has an unsupported dtype (complex, string, …).
ValueError – If the table has fewer than 2 live rows (covariance undefined).
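The documented dtype handling can be reproduced with NumPy: bool columns are cast to 0/1 integers, the columns are stacked as variables, and np.cov over the live rows yields the (ncols, ncols) matrix. This is a sketch of the semantics described above, not the library's internal code:

```python
import numpy as np

# Stand-ins for two live columns of a table.
cols = {
    "temperature": np.array([20.0, 22.0, 21.0, 25.0]),
    "active": np.array([True, False, True, True]),
}

# Bool columns are cast to int (0/1) before computation.
stacked = np.vstack([
    c.astype(np.int64) if c.dtype == np.bool_ else c
    for c in cols.values()
])

cov = np.cov(stacked)   # each row of `stacked` is one variable (column)
print(cov.shape)        # (2, 2): matches (ncols, ncols)
```

np.cov uses the sample covariance (ddof=1), which is why at least 2 live rows are required.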
- create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]¶
Build and register an index for a stored column or table expression.
- delete(ind: int | slice | str | Iterable) None[source]¶
Mark one or more rows as deleted (tombstone deletion).
ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
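The tombstone mechanism behind delete() and compact() can be illustrated with a plain NumPy boolean mask. This is a conceptual sketch of the documented behaviour, not blosc2's actual storage layout:

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])
valid_rows = np.ones(len(values), dtype=bool)   # True = live row

# delete(1) and delete(slice(3, 5)): mark tombstones, storage untouched.
valid_rows[1] = False
valid_rows[3:5] = False

# Queries and aggregates see only live rows.
live = values[valid_rows]
print(live.sum())        # 40

# compact(): physically rewrite, keeping only live rows; the mask resets.
values = values[valid_rows].copy()
valid_rows = np.ones(len(values), dtype=bool)
print(values)            # [10 30]
```

Marking tombstones is O(1) per row; reclaiming the storage is deferred to a single compact() pass.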
- describe() None[source]¶
Print a per-column statistical summary.
Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.
- drop_column(name: str) None[source]¶
Remove a column from the table.
For on-disk tables, the corresponding persisted column leaf is deleted.
- Raises:
ValueError – If the table is read-only, is a view, or name is the last column.
KeyError – If name does not exist.
- drop_computed_column(name: str) None[source]¶
Remove a computed column from the table.
- Parameters:
name¶ – Name of the computed column to remove.
- Raises:
KeyError – If name is not a computed column.
ValueError – If called on a view.
- drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]¶
Remove an index and delete any sidecar files.
- extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]¶
Append multiple rows at once.
data may be:
- a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;
- a list of rows, each compatible with append();
- another CTable — columns are matched by name.
Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
- index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Return the index handle for a stored-column or expression target.
- iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]¶
Yield live rows as bounded-size pyarrow.RecordBatch objects.
- iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]¶
Iterate rows in sorted order without materializing a full copy.
Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.
The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.
- Parameters:
cols¶ – Column name or list of column names to sort by.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
start¶ – Optional start of a slice applied to the sorted sequence before iteration.
stop¶ – Optional stop of that slice; e.g. stop=10 yields only the top-10 rows.
step¶ – Optional step of that slice; e.g. step=2 yields every other row in sorted order.
stop=10yields only the top-10 rows;step=2yields every other row in sorted order.batch_size¶ – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
- materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]¶
Materialize a computed column into a new stored snapshot column.
- Parameters:
- Raises:
ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.
KeyError – If name is not a computed column.
TypeError – If dtype is incompatible with the computed values.
- rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Drop and recreate an index with the same parameters.
- rename_column(old: str, new: str) None[source]¶
Rename a column.
For on-disk tables, the corresponding persisted column leaf is renamed.
- Raises:
ValueError – If the table is read-only, is a view, or new already exists.
KeyError – If old does not exist.
- sample(n: int, *, seed: int | None = None) CTable[source]¶
Return a read-only view of n randomly chosen live rows.
- save(urlpath: str, *, overwrite: bool = False) None[source]¶
Persist this table to disk at urlpath.
This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.
Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.
- Parameters:
urlpath¶ – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.
overwrite¶ – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.
- Raises:
ValueError – If urlpath already exists and overwrite=False.
- select(cols: list[str]) CTable[source]¶
Return a column-projection view exposing only cols.
The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.
- Parameters:
cols¶ – Ordered list of column names to keep.
- Raises:
KeyError – If any name in cols is not a column of this table.
ValueError – If cols is empty.
- sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]¶
Return a copy of the table sorted by one or more columns.
- Parameters:
cols¶ – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
inplace¶ – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.
- Raises:
ValueError – If called on a view or a read-only table when inplace=True.
KeyError – If any column name is not found.
TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
- to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a directory-backed store.
Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.
For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.
Examples
Fast-unpack an existing compact zip store into a directory-backed table:
table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()
Materialize a filtered view into a directory-backed store:
view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)
Force a logical compacted copy, even for a persistent .b2z table:
table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
- to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a compact .b2z container.
.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.
For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.
Examples
.b2ztables, orcompact=True, this falls back to the logicalsave()path, materializing only visible/live rows into a new.b2zstore.Examples
Fast-pack an existing directory-backed table into a compact zip store:
table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()
Materialize a filtered view into a new compact store:
view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)
Force a logical compacted copy, even for a persistent .b2d table:
table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
- to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]¶
Write all live rows to a CSV file.
Uses Python’s stdlib csv module — no extra dependency required. Each column is materialised once via col[:]; rows are then written one at a time.
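The column-at-a-time write can be sketched with the stdlib csv module. This is a simplified stand-in for the real method: each column is materialised once (standing in for col[:]), then rows are written one at a time:

```python
import csv
import io

import numpy as np

# Stand-ins for the col[:] materialisations of each stored column.
cols = {
    "sensor_id": np.array([1, 2]),
    "temperature": np.array([20.5, 21.0]),
}

buf = io.StringIO()            # a real implementation would open(path, "w")
writer = csv.writer(buf)       # sep="," is csv.writer's default delimiter
writer.writerow(cols.keys())   # header=True writes the column names first
for row in zip(*cols.values()):  # one row at a time across the columns
    writer.writerow(row)
print(buf.getvalue())
```

Materialising each column once avoids repeated per-row decompression of the underlying containers.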
- to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]¶
Write this table to a Parquet file batch-wise using pyarrow.
- view(new_valid_rows)[source]¶
Return a row-filter view backed by a boolean mask array without copying data.
- where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]¶
Return a row-filtered view matching a boolean predicate.
The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.
The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.
- Parameters:
expr_result¶ – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.
- Returns:
A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.
- Return type:
CTable
- Raises:
TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.
Examples
Filter using a string expression:
view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])
Filter using column arithmetic:
view = t.where((t.value * t.category) >= 150)
Blosc2 lazy functions can be used in column expressions:
view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)
For column names that are not valid Python identifiers, use item access:
view = t.where((t["unit price"] * t["quantity"]) > 100)
Notes
Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.
Use:
t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")
not:
t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)
- base: CTable | None¶
Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.
- property cbytes: int¶
Total compressed size in bytes (all columns + valid_rows mask).
- col_names: list[str]¶
Ordered list of stored column names. Computed columns are not included; access those via computed_columns.
- property computed_columns: dict[str, dict]¶
Read-only view of the computed-column definitions.
Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.
- property cratio: float¶
Compression ratio for the whole table payload.
- property indexes: list[Index]¶
Return a list of blosc2.Index handles for all active indexes.
- property info: _CTableInfoReporter¶
Get information about this table.
Examples
>>> print(t.info)
>>> t.info()
- property nbytes: int¶
Total uncompressed size in bytes (all columns + valid_rows mask).
- property ncols: int¶
Total number of columns, including computed (virtual) columns.
- property schema: CompiledSchema¶
The compiled schema that drives this table’s columns and validation.
Construction¶
- CTable(row_type, new_data=None, ...)
- CTable.open(urlpath, *[, mode]): Open a persistent CTable from urlpath.
- CTable.load(urlpath): Load a persistent table from urlpath into RAM.
- CTable.from_arrow(schema, batches, ...): Build a CTable from an Arrow schema and iterable of record batches.
- CTable.from_parquet(path, ...): Read a Parquet file into a CTable.
- CTable.from_csv(path, row_cls, ...): Build a CTable from a CSV file.
- CTable.__init__(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None) None[source]¶
- classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- classmethod CTable.load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
- classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
Null policy¶
Nullable scalar CTable columns are represented with per-column sentinel values,
not native validity bitmaps. When CTable has to infer those sentinels, the
selection can be customized with NullPolicy and scoped with
null_policy():
policy = blosc2.NullPolicy(
signed_int_strategy="max",
string_value="<NULL>",
column_null_values={"user_id": -1, "country": "NA"},
)
with blosc2.null_policy(policy):
table = blosc2.CTable.from_parquet("data.parquet")
The same policy is used by explicit nullable schema specs when no
null_value is supplied:
from dataclasses import dataclass
@dataclass
class Row:
user_id: int = blosc2.field(blosc2.int64(nullable=True))
country: str = blosc2.field(blosc2.string(nullable=True))
with blosc2.null_policy(policy):
table = blosc2.CTable(Row)
Sentinels are resolved in this order: explicit null_value in the schema,
NullPolicy.column_null_values for a matching column, then the type-wide
NullPolicy default. Columns without nullable=True or an explicit
null_value are not nullable.
- NullPolicy: default sentinels for inferred CTable scalar nulls.
- null_policy(): temporarily set the default policy for CTable null sentinel inference.
- get_null_policy(): return the current default null policy.
- class blosc2.NullPolicy(string_value: str = '__BLOSC2_NULL__', bytes_value: bytes = b'__BLOSC2_NULL__', float_value: float = nan, bool_value: int = 255, signed_int_strategy: ~typing.Literal['min', 'max'] = 'min', unsigned_int_strategy: ~typing.Literal['min', 'max'] = 'max', timestamp_value: int = -9223372036854775808, column_null_values: ~collections.abc.Mapping[str, ~typing.Any] = <factory>)[source]¶
Default sentinels for inferred CTable scalar nulls.
CTable nullable scalar columns are represented with per-column sentinel values. This policy is used when CTable has to infer those sentinels, such as when importing nullable scalar Arrow or Parquet columns without an explicit column-level null sentinel. The selected sentinel is stored in the resulting CTable schema, so existing tables remain self-describing.
Examples
Use blosc2.null_policy() to apply a policy while creating a CTable from data with nullable scalar columns:
policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)
with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")
The same policy is used for explicit nullable schema specs:
@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))

with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)
column_null_values takes precedence over the type-wide defaults in the policy. This is useful when a particular column needs a sentinel that is known not to collide with its real values.
Methods
sentinel_for_arrow_type(pa, pa_type)
Return the default sentinel for pa_type, or None if unsupported.
- blosc2.null_policy(policy: NullPolicy)¶
Temporarily set the default policy for CTable null sentinel inference.
- blosc2.get_null_policy() NullPolicy[source]¶
Return the current default null policy.
Attributes¶
- col_names: ordered list of stored column names.
- computed_columns: read-only view of the computed-column definitions.
- ncols: total number of columns, including computed (virtual) columns.
- cbytes: total compressed size in bytes (all columns + valid_rows mask).
- nbytes: total uncompressed size in bytes (all columns + valid_rows mask).
- schema: the compiled schema that drives this table's columns and validation.
- base: parent table when this instance is a row-filter or column-projection view.
- CTable.col_names: list[str]¶
Ordered list of stored column names. Computed columns are not included; access those via
computed_columns.
- property CTable.computed_columns: dict[str, dict]¶
Read-only view of the computed-column definitions.
Each value is a dict with keys
expression, col_deps, lazy (blosc2.LazyExpr), and dtype.
- property CTable.nrows: int¶
- property CTable.ncols: int¶
Total number of columns, including computed (virtual) columns.
- property CTable.cbytes: int¶
Total compressed size in bytes (all columns + valid_rows mask).
- property CTable.nbytes: int¶
Total uncompressed size in bytes (all columns + valid_rows mask).
- property CTable.schema: CompiledSchema¶
The compiled schema that drives this table’s columns and validation.
- CTable.base: CTable | None¶
Parent table when this instance is a row-filter or column-projection view (created by
where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.
Inserting data¶
- append(): append a single row to the table.
- extend(): append multiple rows at once.
- CTable.append(data: list | void | ndarray) None[source]¶
Append a single row to the table.
data may be a list, tuple,
numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.
- CTable.extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]¶
Append multiple rows at once.
data may be:
a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;
a list of rows, each compatible with append();
another CTable — columns are matched by name.
Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
Querying¶
Boolean expressions¶
Use bitwise operators (&, |, ~) or string expressions for
row-wise boolean logic. Python’s logical operators and, or and
not cannot be overloaded and therefore do not build lazy column
expressions.
Use column expressions with explicit parentheses around comparisons:
t.where((t.amount > 100) & (t.region == "North"))
t.where(~t.returned)
or use string expressions when that reads better:
t.where("amount > 100 and region == 'North'")
t.where("not returned")
t["not returned"]
The last three forms for negating a boolean column are equivalent:
t.where(~t.returned), t.where("not returned"), and
t["not returned"].
Indexing & projection¶
CTable indexing is type-driven:
t["amount"] # column access
t[3] # one row as a namedtuple-like object
t[3:8] # row view
t[[1, 4, 7]] # gathered-row view
t[mask] # filtered row view
t[t.amount > 100] # LazyExpr filtered row view, like where()
t[["region", "amount"]] # projected column view
String keys first try exact column-name lookup. If the string is not a
column name, it is interpreted as a boolean expression and behaves like
CTable.where(). Boolean LazyExpr and boolean
NDArray keys also behave like CTable.where(), so computed
column predicates such as t[t.temperature_f > 70] are supported.
For explicit filtered projection, use:
t.where("amount > 100", columns=["region", "amount"])
When a NumPy structured array is needed, materialize explicitly:
np.asarray(t[:10])
- where(): return a row-filtered view matching a boolean predicate.
- view(): return a row-filter view backed by a boolean mask array without copying data.
- select(): return a column-projection view exposing only cols.
- head(): return a view of the first N live rows (default 5).
- tail(): return a view of the last N live rows (default 5).
- sample(): return a read-only view of n randomly chosen live rows.
- sort_by(): return a copy of the table sorted by one or more columns.
- iter_sorted(): iterate rows in sorted order without materializing a full copy.
- CTable.where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]¶
Return a row-filtered view matching a boolean predicate.
Signature:
where(expr_result) -> CTable
The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table's columns. String expressions can reference stored and computed columns directly by name.
The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table's current live rows; selected column data is not copied.
- Parameters:
expr_result¶ – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.
- Returns:
A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.
- Return type:
CTable
- Raises:
TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.
Examples
Filter using a string expression:
view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])
Filter using column arithmetic:
view = t.where((t.value * t.category) >= 150)
Blosc2 lazy functions can be used in column expressions:
view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)
For column names that are not valid Python identifiers, use item access:
view = t.where((t["unit price"] * t["quantity"]) > 100)
Notes
Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python's logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.
Use:
t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")
not:
t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)
- CTable.view(new_valid_rows)[source]¶
Return a row-filter view backed by a boolean mask array without copying data.
- CTable.select(cols: list[str]) CTable[source]¶
Return a column-projection view exposing only cols.
The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.
- Parameters:
cols¶ – Ordered list of column names to keep.
- Raises:
KeyError – If any name in cols is not a column of this table.
ValueError – If cols is empty.
- CTable.sample(n: int, *, seed: int | None = None) CTable[source]¶
Return a read-only view of n randomly chosen live rows.
- CTable.sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]¶
Return a copy of the table sorted by one or more columns.
- Parameters:
cols¶ – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
inplace¶ – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.
- Raises:
ValueError – If called on a view or a read-only table when inplace=True.
KeyError – If any column name is not found.
TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
- CTable.iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]¶
Iterate rows in sorted order without materializing a full copy.
Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.
The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.
- Parameters:
cols¶ – Column name or list of column names to sort by.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
start¶ – Optional start of a slice applied to the sorted sequence before iteration.
stop¶ – Optional stop of that slice; e.g. stop=10 yields only the top-10 rows.
step¶ – Optional step of that slice; e.g. step=2 yields every other row in sorted order.
batch_size¶ – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
Mutations¶
In addition to physical schema changes such as CTable.add_column(),
CTables can host computed columns backed by a lazy expression over stored
columns. Computed columns are read-only, use no extra storage, participate in
display, filtering, sorting, and aggregates, and are persisted across
CTable.save(), CTable.load(), and CTable.open().
When a computed result should become a normal stored column, use
CTable.materialize_computed_column(). The materialized column is a stored
snapshot that can be indexed like any other stored column. New rows inserted
later via CTable.append() or CTable.extend() auto-fill omitted
materialized-column values from the recorded expression metadata.
- delete(): mark one or more rows as deleted (tombstone deletion).
- compact(): physically rewrite every column array keeping only live rows.
- add_column(): add a new column filled from the default declared in spec.
- add_computed_column(): add a read-only virtual column whose values are computed from other columns.
- materialize_computed_column(): materialize a computed column into a new stored snapshot column.
- drop_computed_column(): remove a computed column from the table.
- drop_column(): remove a column from the table.
- rename_column(): rename a column.
- CTable.delete(ind: int | slice | str | Iterable) None[source]¶
Mark one or more rows as deleted (tombstone deletion).
ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
- CTable.compact()[source]¶
Physically rewrite every column array keeping only live rows.
Closes the gaps left by prior delete() calls. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
- CTable.add_column(name: str, spec: SchemaSpec | Field) None[source]¶
Add a new column filled from the default declared in spec.
- Parameters:
name¶ – Column name. Must follow the same naming rules as schema fields.
spec¶ – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.
- Raises:
ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.
TypeError – If a declared default cannot be coerced to spec’s dtype.
- CTable.add_computed_column(name: str, expr, *, dtype: dtype | None = None) None[source]¶
Add a read-only virtual column whose values are computed from other columns.
The column stores no data — it is evaluated on-the-fly when read. It participates in display, filtering, sorting, export (to_arrow / to_csv), and aggregates, but cannot be written to, indexed, or included in append/extend inputs.
- Parameters:
name¶ – Column name. Must not collide with any existing stored or computed column and must satisfy the usual naming rules.
expr¶ – Either a callable (cols: dict[str, NDArray]) -> LazyExpr or an expression string (e.g. "price * qty") where column names are referenced directly and resolved from stored columns.
dtype¶ – Override the inferred result dtype. When omitted the dtype is taken from the blosc2.LazyExpr.
- Raises:
ValueError – If called on a view, the table is read-only, name already exists, or an operand is not a stored column of this table.
TypeError – If expr is not a callable or string, or does not return a blosc2.LazyExpr.
- CTable.materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]¶
Materialize a computed column into a new stored snapshot column.
- Parameters:
- Raises:
ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.
KeyError – If name is not a computed column.
TypeError – If dtype is incompatible with the computed values.
- CTable.drop_computed_column(name: str) None[source]¶
Remove a computed column from the table.
- Parameters:
name¶ – Name of the computed column to remove.
- Raises:
KeyError – If name is not a computed column.
ValueError – If called on a view.
Indexes¶
CTable indexes are created with CTable.create_index() and returned as
blosc2.Index handles. For tables, Index refers to an entry stored
in the table index catalog and delegates maintenance operations such as
drop(), rebuild(), and compact() back to the owning table. Users
normally only receive these handles from the CTable API; they do not instantiate
them directly.
Indexes can target stored columns or direct expressions over stored columns
via create_index(expression=...). This lets queries reuse indexes for
derived predicates without adding either a computed column or a materialized
stored one. A matching FULL direct-expression index can also be reused by
ordering paths such as CTable.sort_by() when sorting by a computed column
backed by the same expression. OPSI indexes are a separate exact-filtering
tier with a tunable number of iterative ordering cycles; they are not intended
to converge to a completely sorted FULL/CSI index, so use FULL when
globally sorted ordered reuse is required.
- create_index(): build and register an index for a stored column or table expression.
- index(): return the index handle for a stored-column or expression target.
- indexes: return a list of blosc2.Index handles for all active indexes.
- drop_index(): remove an index and delete any sidecar files.
- rebuild_index(): drop and recreate an index with the same parameters.
- compact_index(): compact an index, merging any incremental append runs.
- CTable.create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]¶
Build and register an index for a stored column or table expression.
- CTable.index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Return the index handle for a stored-column or expression target.
- CTable.indexes¶
Return a list of
blosc2.Index handles for all active indexes.
- CTable.drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]¶
Remove an index and delete any sidecar files.
- CTable.rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Drop and recreate an index with the same parameters.
- CTable.compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]¶
Compact an index, merging any incremental append runs.
See blosc2.Index for the returned handle attributes and methods.
Persistence¶
Persist CTables to disk or interchange formats, and restore them later without losing schema information. These methods cover native Blosc2 persistence as well as import/export paths for CSV, Arrow, and Parquet data.
- load(): load a persistent table from urlpath into RAM.
- open(): open a persistent CTable from urlpath.
- save(): persist this table to disk at urlpath.
- to_b2z(): write this table to a compact .b2z container.
- to_b2d(): write this table to a directory-backed store.
- to_csv(): write all live rows to a CSV file.
- to_arrow(): convert all live rows to a pyarrow table.
- to_parquet(): write this table to a Parquet file batch-wise using pyarrow.
- from_arrow(): build a CTable from an Arrow schema and iterable of record batches.
- from_parquet(): read a Parquet file into a CTable.
- from_csv(): build a CTable from a CSV file.
- classmethod CTable.load(urlpath: str) CTable[source]¶
Load a persistent table from urlpath into RAM.
The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.
- Parameters:
urlpath¶ – Path to the table root directory.
- Raises:
FileNotFoundError – If urlpath does not contain a CTable.
ValueError – If the metadata at urlpath does not identify a CTable.
- classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]¶
Open a persistent CTable from urlpath.
- CTable.save(urlpath: str, *, overwrite: bool = False) None[source]¶
Persist this table to disk at urlpath.
This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.
Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.
- Parameters:
urlpath¶ – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.
overwrite¶ – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.
- Raises:
ValueError – If urlpath already exists and overwrite=False.
- CTable.to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a compact .b2z container.
.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.
For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.
Examples
Fast-pack an existing directory-backed table into a compact zip store:
table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()
Materialize a filtered view into a new compact store:
view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)
Force a logical compacted copy, even for a persistent .b2d table:
table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
- CTable.to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a directory-backed store.
Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.
For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.
Examples
Fast-unpack an existing compact zip store into a directory-backed table:
table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()
Materialize a filtered view into a directory-backed store:
view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)
Force a logical compacted copy, even for a persistent .b2z table:
table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
- CTable.to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]¶
Write all live rows to a CSV file.
Uses Python's stdlib csv module — no extra dependency required. Each column is materialised once via col[:]; rows are then written one at a time.
- CTable.to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]¶
Write this table to a Parquet file batch-wise using pyarrow.
- classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
- classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a
CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a
CTable from a CSV file.
Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
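The single-pass strategy described above can be sketched with the stdlib csv module and numpy. This is illustrative only — the column names and dtypes are made up, and the real implementation writes into pre-allocated NDArrays rather than plain numpy arrays:

```python
import csv
import io

import numpy as np

# One pass over the CSV collecting plain Python lists, one per column;
# nothing is appended row by row into the backing arrays.
text = "id,score\n1,9.5\n2,7.0\n3,8.25\n"
reader = csv.reader(io.StringIO(text))
next(reader)  # header=True: skip the header row

cols = {"id": [], "score": []}
for row in reader:
    cols["id"].append(int(row[0]))
    cols["score"].append(float(row[1]))

# One bulk conversion per column at the end (mirrors the single slice
# assignment per column mentioned in the docstring).
id_arr = np.asarray(cols["id"], dtype=np.int64)
score_arr = np.asarray(cols["score"], dtype=np.float64)
```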
Inspection & statistics¶
Compute common descriptive statistics directly on CTable data without
materializing rows first. These methods operate column-wise on the compressed
representation, making it easy to summarize distributions or measure
relationships between numeric columns.
column_schema(name) – Return the CompiledColumn descriptor for name.
info() – Get information about this table.
schema_dict() – Return a JSON-compatible dict describing this table's schema.
describe() – Print a per-column statistical summary.
cov() – Return the covariance matrix as a numpy array.
- CTable.column_schema(name: str) CompiledColumn[source]¶
Return the CompiledColumn descriptor for name.
- Raises:
KeyError – If name is not a column in this table.
- CTable.info()¶
Get information about this table.
Examples
>>> print(t.info)
>>> t.info()
- CTable.schema_dict() dict[str, Any][source]¶
Return a JSON-compatible dict describing this table’s schema.
- CTable.describe() None[source]¶
Print a per-column statistical summary.
Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.
- CTable.cov() ndarray[source]¶
Return the covariance matrix as a numpy array.
Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.
- Returns:
Shape (ncols, ncols). Column order matches col_names.
- Return type:
numpy.ndarray
- Raises:
TypeError – If any column has an unsupported dtype (complex, string, …).
ValueError – If the table has fewer than 2 live rows (covariance undefined).
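The cast-then-covariance behaviour can be pictured with plain numpy. This is a sketch, not the library's code — the column data is invented, and the ddof convention shown here (population) may differ from the actual implementation:

```python
import numpy as np

# Three hypothetical numeric columns; the bool column is cast to 0/1
# before the covariance is computed, as described above.
price = np.array([1.0, 2.0, 3.0, 4.0])
qty = np.array([10.0, 20.0, 10.0, 40.0])
active = np.array([True, False, True, True]).astype(np.int64)

data = np.vstack([price, qty, active])
cov = np.cov(data, ddof=0)  # one matrix row/column per table column
```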
Column¶
A lazy column accessor returned by table["col_name"] or table.col_name.
All index operations and aggregates apply the table’s tombstone mask
(_valid_rows) so deleted rows are silently excluded.
- class blosc2.Column(table: CTable, col_name: str, mask=None)[source]¶
Column view for a CTable, with vectorized operations and reductions.
- Attributes:
dtype – NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
is_computed – True if this column is a virtual computed column (read-only).
is_list
is_varlen_scalar – True if this column holds variable-length scalar strings or bytes.
ndim – Number of logical dimensions.
null_value – The sentinel value that represents NULL for this column, or None.
shape – Logical shape of the live column values.
size – Number of live values in the column.
view – Return a ColumnViewIndexer for creating logical sub-views.
Methods
all() – Return True if every live, non-null value is True.
any() – Return True if at least one live, non-null value is True.
assign(data) – Replace all live values in this column with data.
is_null() – Return a boolean array True where the live value is the null sentinel.
iter_chunks([size]) – Iterate over live column values in chunks of size rows.
max(*[, where]) – Maximum live, non-null value.
mean(*[, where]) – Arithmetic mean of all live, non-null values.
min(*[, where]) – Minimum live, non-null value.
notnull() – Return a boolean array True where the live value is not the null sentinel.
null_count() – Return the number of live rows whose value equals the null sentinel.
std([ddof, where]) – Standard deviation of all live, non-null values (single-pass, Welford's algorithm).
sum([dtype, where, jit, jit_backend]) – Sum of all live, non-null values.
unique() – Return sorted array of unique live, non-null values.
value_counts() – Return a {value: count} dict sorted by count descending.
Special methods
__len__() – Return the number of live (non-deleted) values in this column.
__iter__() – Iterate over live column values in insertion order, skipping deleted rows.
__getitem__(key) – Return values for the given logical index.
__setitem__(key, value) – Set one or more live column values; accepts the same index forms as __getitem__().
__getitem__().- __len__()[source]¶
Return the number of live (non-deleted) values in this column.
- __iter__()[source]¶
Iterate over live column values in insertion order, skipping deleted rows.
- __getitem__(key: int | slice | list | ndarray)[source]¶
Return values for the given logical index.
int → scalar
slice → numpy.ndarray
list / np.ndarray → numpy.ndarray
bool np.ndarray → numpy.ndarray
For a writable logical sub-view use view.
- __setitem__(key: int | slice | list | ndarray, value)[source]¶
Set one or more live column values; accepts the same index forms as __getitem__().
- all() bool[source]¶
Return True if every live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.
- any() bool[source]¶
Return True if at least one live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.
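The short-circuit behaviour of any()/all() over chunked storage can be sketched like this (illustrative only — the real methods work on the compressed column, not on a list of arrays):

```python
import numpy as np

def any_sketch(chunks):
    # Stop scanning at the first chunk containing a True value.
    for chunk in chunks:
        if np.any(chunk):
            return True
    return False

def all_sketch(chunks):
    # Stop scanning at the first chunk containing a False value.
    for chunk in chunks:
        if not np.all(chunk):
            return False
    return True
```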
- assign(data) None[source]¶
Replace all live values in this column with data.
Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.
- Parameters:
data¶ – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.
- Raises:
ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.
TypeError – If values cannot be coerced to the column's dtype.
- is_null() ndarray[source]¶
Return a boolean array True where the live value is the null sentinel.
For varlen scalar columns (vlstring/vlbytes) nullability is represented as native
None values, so this returns True wherever the value is None.
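For a fixed-width column with a configured sentinel, the three nullable helpers reduce to simple mask arithmetic — a sketch with a made-up sentinel of -1:

```python
import numpy as np

values = np.array([5, -1, 7, -1, 9])
sentinel = -1  # hypothetical configured null_value

is_null = values == sentinel  # boolean mask over live values
notnull = ~is_null
null_count = int(is_null.sum())
```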
- iter_chunks(size: int = 65536)[source]¶
Iterate over live column values in chunks of size rows.
Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.
- Parameters:
size¶ – Number of live rows per yielded chunk. Defaults to 65 536.
- Yields:
numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.
Examples
>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
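The chunking semantics — deleted rows skipped, fixed-size chunks, a possibly shorter final chunk — can be sketched with a numpy tombstone mask (illustrative only, not the library's implementation):

```python
import numpy as np

def iter_chunks_sketch(values, valid, size=65536):
    # Deleted rows (valid == False) are silently excluded; each yielded
    # chunk holds at most `size` live values and the last may be shorter.
    live = values[valid]
    for start in range(0, live.size, size):
        yield live[start : start + size]

vals = np.arange(10)
valid = np.ones(10, dtype=bool)
valid[[2, 5]] = False  # two tombstoned rows

chunks = [c.tolist() for c in iter_chunks_sketch(vals, valid, size=3)]
# 8 live values with size=3 → chunks of 3, 3, 2 elements
```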
- max(*, where=None)[source]¶
Maximum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- mean(*, where=None) float[source]¶
Arithmetic mean of all live, non-null values.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.
- min(*, where=None)[source]¶
Minimum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- notnull() ndarray[source]¶
Return a boolean array True where the live value is not the null sentinel.
- null_count() int[source]¶
Return the number of live rows whose value equals the null sentinel.
Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.
- std(ddof: int = 0, *, where=None) float[source]¶
Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).
- Parameters:
ddof¶ – Delta degrees of freedom.
0 (default) gives the population std; 1 gives the sample std (divides by N-1).
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. Always returns a Python float.
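The single-pass Welford update named above can be sketched as follows (illustrative only — not the library's implementation, and without the null/tombstone filtering):

```python
import math

def welford_std(values, ddof=0):
    # Running mean and sum of squared deviations (m2), updated one value
    # at a time; no second pass over the data is needed.
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / (n - ddof))

pop = welford_std([1, 2, 3, 4])        # population std (ddof=0)
sample = welford_std([1, 2, 3, 4], 1)  # sample std (divides by N-1)
```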
- sum(dtype=None, *, where=None, jit=None, jit_backend=None)[source]¶
Sum of all live, non-null values.
Returns zero for an empty column or filtered view.
Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.
- Parameters:
dtype¶ – Optional accumulator dtype. When omitted, float columns use
np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.
jit¶ – Optional miniexpr JIT policy passed to the lazy reduction engine.
jit_backend¶ – Optional miniexpr JIT backend. Use "tcc" or "cc".
Examples
Sum values matching a predicate without materializing a filtered view:
total = t["amount"].sum(where=t.category == 3)
Combine several column predicates:
total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))
Nullable sentinel values are skipped automatically:
# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
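The three conditions combined by a filtered sum — row is live, predicate matches, value is not the null sentinel — can be pictured with plain numpy masks (a sketch with invented data; -1 plays the role of the configured sentinel):

```python
import numpy as np

amount = np.array([10, -1, 30, 40, -1], dtype=np.int64)  # -1 = null sentinel
category = np.array([3, 3, 3, 1, 3])
live = np.array([True, True, True, True, False])  # last row deleted

# Include a row only if all three conditions hold.
mask = live & (category == 3) & (amount != -1)
total = int(amount[mask].sum())  # 10 + 30
```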
- unique() ndarray[source]¶
Return sorted array of unique live, non-null values.
Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.
- value_counts() dict[source]¶
Return a {value: count} dict sorted by count descending.
Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.
Example
>>> t["active"].value_counts()
{True: 8432, False: 1568}
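The chunk-wise accumulation described above can be sketched with a Counter (illustrative only — the real method reads compressed chunks rather than Python lists):

```python
from collections import Counter

def value_counts_sketch(chunks, null_sentinel=None):
    # Accumulate counts chunk by chunk so the full column is never
    # materialised at once; sentinel values are excluded.
    counts = Counter()
    for chunk in chunks:
        counts.update(v for v in chunk if v != null_sentinel)
    # Sort by count descending, as the real method is documented to do.
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

result = value_counts_sketch([[True, True, False], [True, False], [True]])
```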
- property dtype¶
NumPy dtype of the underlying storage, or
None for variable-length columns (vlstring(), vlbytes(), list()).
- property ndim: int¶
Number of logical dimensions.
- property null_value¶
The sentinel value that represents NULL for this column, or
None.
- property shape: tuple[int]¶
Logical shape of the live column values.
- property size: int¶
Number of live values in the column.
- property view: ColumnViewIndexer¶
Return a
ColumnViewIndexer for creating logical sub-views.
Examples
Read a sub-view for chained aggregates:
sub = t.price.view[2:10]
sub.sum()
Bulk write through a sub-view:
t.price.view[0:5][:] = np.zeros(5)
Attributes¶
dtype – NumPy dtype of the underlying storage, or None for variable-length columns.
null_value – The sentinel value that represents NULL for this column, or None.
- property Column.dtype¶
NumPy dtype of the underlying storage, or
None for variable-length columns (vlstring(), vlbytes(), list()).
- property Column.null_value¶
The sentinel value that represents NULL for this column, or
None.
Data access¶
view – Return a ColumnViewIndexer for creating logical sub-views.
iter_chunks([size]) – Iterate over live column values in chunks of size rows.
assign(data) – Replace all live values in this column with data.
- property Column.view: ColumnViewIndexer¶
Return a
ColumnViewIndexer for creating logical sub-views.
Examples
Read a sub-view for chained aggregates:
sub = t.price.view[2:10]
sub.sum()
Bulk write through a sub-view:
t.price.view[0:5][:] = np.zeros(5)
- Column.iter_chunks(size: int = 65536)[source]¶
Iterate over live column values in chunks of size rows.
Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.
- Parameters:
size¶ – Number of live rows per yielded chunk. Defaults to 65 536.
- Yields:
numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.
Examples
>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
- Column.assign(data) None[source]¶
Replace all live values in this column with data.
Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.
- Parameters:
data¶ – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.
- Raises:
ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.
TypeError – If values cannot be coerced to the column's dtype.
Nullable helpers¶
is_null() – Return a boolean array True where the live value is the null sentinel.
notnull() – Return a boolean array True where the live value is not the null sentinel.
null_count() – Return the number of live rows whose value equals the null sentinel.
- Column.is_null() ndarray[source]¶
Return a boolean array True where the live value is the null sentinel.
For varlen scalar columns (vlstring/vlbytes) nullability is represented as native
None values, so this returns True wherever the value is None.
Unique values¶
unique() – Return sorted array of unique live, non-null values.
value_counts() – Return a {value: count} dict sorted by count descending.
Aggregates¶
Null sentinel values are automatically excluded from all aggregates.
sum([dtype, where, jit, jit_backend]) – Sum of all live, non-null values.
min(*[, where]) – Minimum live, non-null value.
max(*[, where]) – Maximum live, non-null value.
mean(*[, where]) – Arithmetic mean of all live, non-null values.
std([ddof, where]) – Standard deviation of all live, non-null values (single-pass, Welford's algorithm).
any() – Return True if at least one live, non-null value is True.
all() – Return True if every live, non-null value is True.
- Column.sum(dtype=None, *, where=None, jit=None, jit_backend=None)[source]¶
Sum of all live, non-null values.
Returns zero for an empty column or filtered view.
Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.
- Parameters:
dtype¶ – Optional accumulator dtype. When omitted, float columns use
np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.
jit¶ – Optional miniexpr JIT policy passed to the lazy reduction engine.
jit_backend¶ – Optional miniexpr JIT backend. Use "tcc" or "cc".
Examples
Sum values matching a predicate without materializing a filtered view:
total = t["amount"].sum(where=t.category == 3)
Combine several column predicates:
total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))
Nullable sentinel values are skipped automatically:
# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
- Column.min(*, where=None)[source]¶
Minimum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- Column.max(*, where=None)[source]¶
Maximum live, non-null value.
Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.
- Column.mean(*, where=None) float[source]¶
Arithmetic mean of all live, non-null values.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.
- Column.std(ddof: int = 0, *, where=None) float[source]¶
Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).
- Parameters:
ddof¶ – Delta degrees of freedom.
0 (default) gives the population std; 1 gives the sample std (divides by N-1).
where¶ – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.
Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. Always returns a Python float.
- Column.any() bool[source]¶
Return True if at least one live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.
- Column.all() bool[source]¶
Return True if every live, non-null value is True.
Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.
Schema Specs¶
Schema specs are passed to field() to declare a column’s type,
storage constraints, and optional null sentinel. They are also
available directly in the blosc2 namespace (e.g. blosc2.int64).
- blosc2.field(spec: ~blosc2.schema.SchemaSpec, *, default=<dataclasses._MISSING_TYPE object>, cparams: dict[str, ~typing.Any] | None = None, dparams: dict[str, ~typing.Any] | None = None, chunks: tuple[int, ...] | None = None, blocks: tuple[int, ...] | None = None) Field[source]¶
Attach a Blosc2 schema spec and per-column storage options to a dataclass field.
- Parameters:
spec¶ – A schema descriptor such as
b2.int64(ge=0) or b2.float64().
default¶ – Default value for the field. Omit for required fields.
cparams¶ – Compression parameters for this column’s NDArray.
dparams¶ – Decompression parameters for this column’s NDArray.
chunks¶ – Chunk shape for this column’s NDArray.
blocks¶ – Block shape for this column’s NDArray.
Examples
>>> from dataclasses import dataclass
>>> import blosc2 as b2
>>> @dataclass
... class Row:
...     id: int = b2.field(b2.int64(ge=0))
...     score: float = b2.field(b2.float64(ge=0, le=100))
...     active: bool = b2.field(b2.bool(), default=True)
Numeric¶
int8 – 8-bit signed integer column (−128 … 127).
int16 – 16-bit signed integer column (−32 768 … 32 767).
int32 – 32-bit signed integer column (−2 147 483 648 … 2 147 483 647).
int64 – 64-bit signed integer column.
uint8 – 8-bit unsigned integer column (0 … 255).
uint16 – 16-bit unsigned integer column (0 … 65 535).
uint32 – 32-bit unsigned integer column (0 … 4 294 967 295).
uint64 – 64-bit unsigned integer column.
float32 – 32-bit floating-point column (single precision).
float64 – 64-bit floating-point column (double precision).
timestamp – Timestamp column stored as signed 64-bit epoch offsets.
- class blosc2.int8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
8-bit signed integer column (−128 … 127).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int8
- class blosc2.int16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
16-bit signed integer column (−32 768 … 32 767).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int16
- class blosc2.int32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit signed integer column (−2 147 483 648 … 2 147 483 647).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int32
- class blosc2.int64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit signed integer column.
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int64
- class blosc2.uint8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
8-bit unsigned integer column (0 … 255).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of uint8
- class blosc2.uint16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
16-bit unsigned integer column (0 … 65 535).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of uint16
- class blosc2.uint32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit unsigned integer column (0 … 4 294 967 295).
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of uint32
- class blosc2.uint64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit unsigned integer column.
Methods
python_type – alias of int
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of uint64
- class blosc2.float32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
32-bit floating-point column (single precision).
Methods
python_type – alias of float
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of float32
- class blosc2.float64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]¶
64-bit floating-point column (double precision).
Methods
python_type – alias of float
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of float64
- class blosc2.timestamp(*, unit: str = 'us', timezone: str | None = None, nullable: bool = False, null_value=None)[source]¶
Timestamp column stored as signed 64-bit epoch offsets.
The physical storage dtype is
int64. unit follows Arrow/NumPy datetime units: "s", "ms", "us" or "ns". timezone is metadata preserved for Arrow/Parquet roundtrips.
Methods
python_type – alias of object
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of int64
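The storage model can be demonstrated with plain numpy: a datetime64 value is physically a signed 64-bit offset from the Unix epoch in the declared unit ("us" here). This is a sketch of the representation only, not the library's column code:

```python
import numpy as np

ts = np.array(["2025-01-01T12:00:00"], dtype="datetime64[us]")
raw = ts.view(np.int64)  # the underlying int64 storage: microseconds since epoch
roundtrip = raw.view("datetime64[us]")  # reinterpreting recovers the timestamp
```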
Complex¶
complex64 – 64-bit complex number column (two 32-bit floats).
complex128 – 128-bit complex number column (two 64-bit floats).
Boolean¶
bool – Boolean column.
- class blosc2.bool(*, nullable: bool = False, null_value=None)[source]¶
Boolean column.
Nullable bool columns use uint8 physical storage with values
0 (false), 1 (true), and 255 (null).
Methods
python_type – alias of bool
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
type – alias of bool
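The nullable-bool encoding described above (uint8 storage, 0 = false, 1 = true, 255 = null) can be sketched with a pair of helper functions. Illustrative only — these helpers are not part of the blosc2 API:

```python
import numpy as np

NULL = np.uint8(255)  # null marker per the encoding above

def encode(values):
    # None → 255, False → 0, True → 1
    return np.array([NULL if v is None else np.uint8(bool(v)) for v in values])

def decode(raw):
    return [None if b == NULL else bool(b) for b in raw]

raw = encode([True, None, False])
values = decode(raw)
```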
Text & binary¶
string – Fixed-width Unicode string column.
bytes – Fixed-width bytes column.
vlstring() – Build a variable-length scalar string schema descriptor.
vlbytes() – Build a variable-length scalar bytes schema descriptor.
struct() – Build a structured schema descriptor for dict-like CTable values.
object() – Build a schema-less Python object column descriptor for CTable.
list() – Build a list-valued schema descriptor for CTable and ListArray.
- class blosc2.string(*, min_length=None, max_length=None, pattern=None, nullable: bool = False, null_value=None)[source]¶
Fixed-width Unicode string column.
- Parameters:
max_length¶ – Maximum number of characters. Determines the NumPy U<n> dtype. Defaults to 32 if not specified.
min_length¶ – Minimum number of characters (validation only, no effect on dtype).
pattern¶ – Regex pattern the value must match (validation only).
nullable¶ – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.
null_value¶ – Explicit null sentinel. Takes precedence over nullable=True.
Methods
python_type – alias of str
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
- class blosc2.bytes(*, min_length=None, max_length=None, nullable: bool = False, null_value=None)[source]¶
Fixed-width bytes column.
- Parameters:
max_length¶ – Maximum number of bytes. Determines the NumPy S<n> dtype. Defaults to 32 if not specified.
min_length¶ – Minimum number of bytes (validation only, no effect on dtype).
nullable¶ – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.
null_value¶ – Explicit null sentinel. Takes precedence over nullable=True.
Methods
python_type – alias of bytes
to_metadata_dict() – Return a JSON-compatible dict for schema serialization.
to_pydantic_kwargs() – Return kwargs for building a Pydantic field annotation.
- blosc2.vlstring(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLStringSpec[source]¶
Build a variable-length scalar string schema descriptor.
Use this as an explicit opt-in when a CTable column holds long or wildly variable-length strings that would waste space in a fixed-width
string(max_length=N) column. Must be requested via blosc2.field(blosc2.vlstring()) — it is never inferred automatically from plain str annotations.
- blosc2.vlbytes(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLBytesSpec[source]¶
Build a variable-length scalar bytes schema descriptor.
Use this as an explicit opt-in when a CTable column holds long or wildly variable-length byte strings. Must be requested via
blosc2.field(blosc2.vlbytes()) — it is never inferred automatically from plain bytes annotations.
- blosc2.struct(fields: dict[str, SchemaSpec], *, nullable: bool = False) StructSpec[source]¶
Build a structured schema descriptor for dict-like CTable values.
Top-level struct columns store one dictionary (or
None when nullable) per row. Struct specs may also be nested as list item specs.
- blosc2.object(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) ObjectSpec[source]¶
Build a schema-less Python object column descriptor for CTable.
Values are stored via batched msgpack serialization. Prefer typed specs such as
struct(), list(), vlstring(), or vlbytes() when the data has a stable schema; use object for heterogeneous per-row payloads.
Timestamp columns¶
Timestamp columns are declared with blosc2.timestamp and store signed
64-bit epoch offsets with timestamp metadata. Column reads return
numpy.datetime64 values, comparisons accept numpy.datetime64 values,
ISO-like strings, or Python datetime objects, and Arrow/Parquet import/export
roundtrips timestamp units and time zones:
from dataclasses import dataclass
import numpy as np
import blosc2 as b2

@dataclass
class Event:
    when: np.datetime64 = b2.field(b2.timestamp(unit="us", nullable=True))
    value: int = b2.field(b2.int64())

table = b2.CTable(Event)
table.append(["2025-01-01T12:00:00", 42])
recent = table[table.when >= np.datetime64("2025-01-01", "us")]
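The comparison in the filter above ultimately rests on NumPy's datetime64 semantics, which order values across units and coerce ISO-like strings on construction. A standalone sketch of just that behavior:

```python
import numpy as np

# ISO-like strings coerce to datetime64; comparisons work across units
# (microseconds vs. seconds vs. day precision) by casting internally.
ts = np.datetime64("2025-01-01T12:00:00", "us")
assert ts >= np.datetime64("2025-01-01", "us")     # after the day boundary
assert ts == np.datetime64("2025-01-01T12:00:00")  # second-precision literal
```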
Object columns¶
Schema-less object columns are declared with blosc2.object() and store one
msgpack-serializable Python object (or None when nullable) per row in
batched variable-length storage. Prefer typed specs such as blosc2.struct()
or blosc2.list() when the payload has a stable schema; use object columns
for heterogeneous per-row payloads:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Event:
    id: int = b2.field(b2.int64())
    payload: object = b2.field(b2.object(nullable=True))

table = b2.CTable(Event)
table.append([1, {"kind": "click", "xy": [10, 20]}])
table.append([2, ("custom", {"nested": True})])
table.append([3, None])
Object columns have no fixed Arrow type, so CTable.to_arrow() and
CTable.to_parquet() raise for them unless users first convert the payloads
to a typed representation. They are not used as an implicit fallback during
Parquet import; unsupported Arrow/Parquet types still raise unless explicitly
imported through CTable.from_arrow() with object_fallback=True.
Struct columns¶
Struct columns are declared with blosc2.struct() and store one dictionary
(or None when nullable) per row in batched variable-length storage. They are
also used when importing top-level Arrow/Parquet struct<...> columns:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    properties: dict = b2.field(
        b2.struct({"code": b2.int32(), "label": b2.vlstring()}, nullable=True)
    )

table = b2.CTable(Product)
table.append([{"code": 1, "label": "fresh"}])
table.append([None])
List columns¶
List columns are declared with blosc2.list(), for example:
from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    code: str = b2.field(b2.string(max_length=8))
    tags: list[str] = b2.field(b2.list(b2.string(), nullable=True))
Whole-cell replacement is supported, so users should reassign modified lists:
row_tags = table.tags[0]
row_tags.append("extra") # local Python list only
table.tags[0] = row_tags # explicit write-back
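The explicit write-back is required because indexing a list column decodes and returns a fresh Python list on every read, never a view into the compressed storage. A minimal pure-Python sketch of these copy-out semantics (the class here is hypothetical, for illustration only):

```python
class CopyOutColumn:
    """Mimics a compressed list column: reads decode a fresh copy,
    writes re-encode the whole cell."""

    def __init__(self, cells):
        self._cells = [list(c) for c in cells]

    def __getitem__(self, i):
        return list(self._cells[i])   # decoded copy, not a view

    def __setitem__(self, i, value):
        self._cells[i] = list(value)  # whole-cell re-encode


tags = CopyOutColumn([["new"]])
row = tags[0]
row.append("extra")        # mutates only the local copy
assert tags[0] == ["new"]  # backing store is unchanged
tags[0] = row              # explicit write-back
assert tags[0] == ["new", "extra"]
```

This is why in-place mutation of the returned list silently does nothing to the table: the mutation lands on a decoded copy that the column never sees until it is assigned back.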