CTable

A columnar compressed table backed by one physical container per column. Scalar columns use NDArray; list-valued columns use ListArray. Each column is stored, compressed, and queried independently; rows are never materialised in their entirety unless you explicitly call to_arrow() or iterate with __iter__().
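
A minimal usage sketch (the dataclass fields, values, and variable names below are illustrative; plain type annotations are assumed to map to column types, as with from_csv()):

from dataclasses import dataclass
import blosc2

@dataclass
class Reading:
    sensor_id: int
    temperature: float
    active: bool

t = blosc2.CTable(Reading)                       # in-memory table
t.append([1, 21.5, True])                        # one row, schema column order
t.extend([[2, 19.0, False], [3, 25.3, True]])    # several rows at once

temps = t["temperature"]                         # column access
warm = t.where("temperature > 20")               # row-filtered view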

class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None)[source]

Columnar compressed table with typed columns and row-oriented access.

Attributes:
cbytes

Total compressed size in bytes (all columns + valid_rows mask).

computed_columns

Read-only view of the computed-column definitions.

cratio

Compression ratio for the whole table payload.

indexes

Return a list of blosc2.Index handles for all active indexes.

info

Get information about this table.

info_items

Structured summary items used by info().

nbytes

Total uncompressed size in bytes (all columns + valid_rows mask).

ncols

Total number of columns, including computed (virtual) columns.

nrows

schema

The compiled schema that drives this table’s columns and validation.

Methods

add_column(name, spec)

Add a new column filled from the default declared in spec.

add_computed_column(name, expr, *[, dtype])

Add a read-only virtual column whose values are computed from other columns.

append(data)

Append a single row to the table.

close()

Close any persistent backing store held by this table.

column_schema(name)

Return the CompiledColumn descriptor for name.

compact()

Physically rewrite every column array keeping only live rows.

compact_index([col_name, expression, name])

Compact an index, merging any incremental append runs.

copy([compact, urlpath, overwrite])

Return a new standalone copy of this table.

cov()

Return the covariance matrix as a numpy array.

create_index([col_name, field, expression, ...])

Build and register an index for a stored column or table expression.

delete(ind)

Mark one or more rows as deleted (tombstone deletion).

describe()

Print a per-column statistical summary.

drop_column(name)

Remove a column from the table.

drop_computed_column(name)

Remove a computed column from the table.

drop_index([col_name, expression, name])

Remove an index and delete any sidecar files.

extend(data, *[, validate])

Append multiple rows at once.

from_arrow(schema, batches, *[, urlpath, ...])

Build a CTable from an Arrow schema and iterable of record batches.

from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

from_parquet(path, *[, columns, batch_size, ...])

Read a Parquet file into a CTable.

head([N])

Return a view of the first N live rows (default 5).

index([col_name, expression, name])

Return the index handle for a stored-column or expression target.

iter_arrow_batches(*[, columns, batch_size, ...])

Yield live rows as bounded-size pyarrow.RecordBatch objects.

iter_sorted(cols[, ascending, start, stop, ...])

Iterate rows in sorted order without materializing a full copy.

load(urlpath)

Load a persistent table from urlpath into RAM.

materialize_computed_column(name, *[, ...])

Materialize a computed column into a new stored snapshot column.

open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

rebuild_index([col_name, expression, name])

Drop and recreate an index with the same parameters.

rename_column(old, new)

Rename a column.

sample(n, *[, seed])

Return a read-only view of n randomly chosen live rows.

save(urlpath, *[, overwrite])

Persist this table to disk at urlpath.

schema_dict()

Return a JSON-compatible dict describing this table's schema.

select(cols)

Return a column-projection view exposing only cols.

sort_by(cols[, ascending, inplace])

Return a copy of the table sorted by one or more columns.

tail([N])

Return a view of the last N live rows (default 5).

to_arrow()

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath, *[, overwrite, compact])

Write this table to a directory-backed store.

to_b2z(urlpath, *[, overwrite, compact])

Write this table to a compact .b2z container.

to_csv(path, *[, header, sep])

Write all live rows to a CSV file.

to_parquet(path, *[, columns, batch_size, ...])

Write this table to a Parquet file batch-wise using pyarrow.

view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

where(expr_result, *[, columns])

Return a row-filtered view matching a boolean predicate.

Special methods

CTable.__len__()

Return the number of live (non-deleted) rows.

CTable.__iter__()

Iterate over live rows in insertion order, yielding namedtuple-like row objects.

CTable.__getitem__(key)

Type-driven indexing for columns, rows, projections, and filters.

CTable.__repr__()

Short CTable<cols>(N rows, X compressed) summary string.

CTable.__str__()

Pandas-style tabular display with column names, dtypes, and a row count footer.

__len__()[source]

Return the number of live (non-deleted) rows.

__iter__()[source]

Iterate over live rows in insertion order, yielding namedtuple-like row objects.

Iterate over live rows in insertion order, yielding namedtuple-like row objects with one attribute per column.

__getitem__(key)[source]

Type-driven indexing for columns, rows, projections, and filters.

Supported keys are:

  • str: return a Column when it matches a stored or computed column name; otherwise evaluate it as a boolean expression via where().

  • boolean blosc2.LazyExpr or blosc2.NDArray: return the same filtered view as where(), e.g. t[t.temperature_f > 70].

  • int: return one live row as a namedtuple-like object.

  • slice: return a row-range view.

  • integer array/list: return a gathered-row view.

  • boolean NumPy array/list: return a boolean-mask filtered view.

  • string list: return a column-projection view, equivalent to select().

Examples

Access columns and rows:

temps = t["temperature"]
first = t[0]
view = t[10:20]

Filter rows with a string expression, a stored-column expression, or a computed-column expression:

warm = t["temperature > 20"]
warm_active = t[(t.temperature > 20) & t.active]
hot_fahrenheit = t[t.temperature_f > 70]

Project columns:

slim = t[["sensor_id", "temperature_f"]]

__repr__() str[source]

Short CTable<cols>(N rows, X compressed) summary string.

__str__() str[source]

Pandas-style tabular display with column names, dtypes, and a row count footer.

classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
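
Example (a hedged sketch assuming pyarrow is installed; data and column names are illustrative):

import pyarrow as pa
import blosc2

arrow_table = pa.table({"user_id": [1, 2, 3], "country": ["ES", "FR", "DE"]})

# Pass the Arrow schema plus an iterable of record batches.
t = blosc2.CTable.from_arrow(arrow_table.schema, arrow_table.to_batches())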

classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.
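
Example (a short sketch; the file name and dataclass fields are illustrative, and the CSV columns are assumed to follow the field order):

from dataclasses import dataclass
import blosc2

@dataclass
class Sale:
    region: str
    amount: float

t = blosc2.CTable.from_csv("sales.csv", Sale)              # comma-separated, with header
tsv = blosc2.CTable.from_csv("sales.tsv", Sale, sep="\t")  # tab-separated variant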

classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )

classmethod load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default) opens read-only; 'a' opens read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

add_column(name: str, spec: SchemaSpec | Field) None[source]

Add a new column filled from the default declared in spec.

Parameters:
  • name – Column name. Must follow the same naming rules as schema fields.

  • spec – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.

Raises:
  • ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.

  • TypeError – If a declared default cannot be coerced to spec’s dtype.
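
Example (assuming t is a writable, non-view CTable; the column name and spec are illustrative):

# Declaring a default lets existing rows be backfilled.
t.add_column("score", blosc2.field(blosc2.int64(ge=0), default=0))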

add_computed_column(name: str, expr, *, dtype: dtype | None = None) None[source]

Add a read-only virtual column whose values are computed from other columns.

The column stores no data — it is evaluated on-the-fly when read. It participates in display, filtering, sorting, export (to_arrow / to_csv), and aggregates, but cannot be written to, indexed, or included in append / extend inputs.

Parameters:
  • name – Column name. Must not collide with any existing stored or computed column and must satisfy the usual naming rules.

  • expr – Either a callable (cols: dict[str, NDArray]) -> LazyExpr or an expression string (e.g. "price * qty") where column names are referenced directly and resolved from stored columns.

  • dtype – Override the inferred result dtype. When omitted the dtype is taken from the blosc2.LazyExpr.

Raises:
  • ValueError – If called on a view, the table is read-only, name already exists, or an operand is not a stored column of this table.

  • TypeError – If expr is not a callable or string, or does not return a blosc2.LazyExpr.
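
Example (assuming t has stored columns price and qty):

t.add_computed_column("total", "price * qty")
total = t["total"]    # evaluated on the fly from price and qty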

append(data: list | void | ndarray) None[source]

Append a single row to the table.

data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.

close() None[source]

Close any persistent backing store held by this table.

column_schema(name: str) CompiledColumn[source]

Return the CompiledColumn descriptor for name.

Raises:

KeyError – If name is not a column in this table.

compact()[source]

Physically rewrite every column array keeping only live rows.

Closes the gaps left by prior delete() calls. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.

compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Compact an index, merging any incremental append runs.

copy(compact: bool = True, *, urlpath: str | PathLike[str] | None = None, overwrite: bool = False) CTable[source]

Return a new standalone copy of this table.

Parameters:
  • compact – If True (default), only live (non-deleted) rows are copied. The result is a dense table with no tombstones and no parent dependency — ideal for materialising a filtered view. If False, all physical slots are copied including deleted gaps, preserving the tombstone state exactly for in-memory copies.

  • urlpath – Destination path for a persistent copy. The .b2z extension selects a compact zip-backed store; any other path uses a directory-backed store. A .b2d suffix is recommended for directory-backed stores. If None (default), return an in-memory copy.

  • overwrite – If True, replace an existing persistent destination.

cov() ndarray[source]

Return the covariance matrix as a numpy array.

Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.

Returns:

Shape (ncols, ncols). Column order matches col_names.

Return type:

numpy.ndarray

Raises:
  • TypeError – If any column has an unsupported dtype (complex, string, …).

  • ValueError – If the table has fewer than 2 live rows (covariance undefined).

create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]

Build and register an index for a stored column or table expression.
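
Example (a minimal sketch; the column name and expression are illustrative, and index behaviour depends on kind and the other tuning parameters):

idx = t.create_index("user_id")                       # index a stored column
expr_idx = t.create_index(expression="price * qty")   # index a table expression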

delete(ind: int | slice | str | Iterable) None[source]

Mark one or more rows as deleted (tombstone deletion).

ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
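
Example (a minimal sketch; remember that indexes are dropped by compact() and must be recreated):

t.delete(0)               # one logical row
t.delete(slice(10, 20))   # a range of logical rows
t.compact()               # physically reclaim the tombstoned slots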

describe() None[source]

Print a per-column statistical summary.

Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.

drop_column(name: str) None[source]

Remove a column from the table.

On disk tables the corresponding persisted column leaf is deleted.

Raises:
  • ValueError – If the table is read-only, is a view, or name is the last column.

  • KeyError – If name does not exist.

drop_computed_column(name: str) None[source]

Remove a computed column from the table.

Parameters:

name – Name of the computed column to remove.

Raises:
  • KeyError – If name is not a computed column.

  • ValueError – If called on a view.

drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]

Remove an index and delete any sidecar files.

extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]

Append multiple rows at once.

data may be:

  • a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;

  • a list of rows, each compatible with append();

  • another CTable — columns are matched by name.

Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
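
Example of the dict-of-arrays form (a sketch assuming a table with sensor_id and temperature columns; the NumPy arrays are illustrative):

import numpy as np

t.extend({
    "sensor_id": np.arange(1_000, dtype=np.int64),
    "temperature": np.full(1_000, 20.5),
})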

head(N: int = 5) CTable[source]

Return a view of the first N live rows (default 5).

index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Return the index handle for a stored-column or expression target.

iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]

Yield live rows as bounded-size pyarrow.RecordBatch objects.
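
Example (a sketch assuming pyarrow is installed; column names are illustrative):

for batch in t.iter_arrow_batches(columns=["sensor_id", "temperature"], batch_size=10_000):
    print(batch.num_rows)    # each item is a pyarrow.RecordBatch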

iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]

Iterate rows in sorted order without materializing a full copy.

Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.

The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.

Parameters:
  • cols – Column name or list of column names to sort by.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • start, stop, step – Optional slice applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.

  • batch_size – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
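
Example (streaming the ten largest temperatures without building a sorted copy; the column names are illustrative):

for row in t.iter_sorted("temperature", ascending=False, stop=10):
    print(row.sensor_id, row.temperature)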

materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]

Materialize a computed column into a new stored snapshot column.

Parameters:
  • name – Existing computed column to materialize.

  • new_name – Name of the new stored column. Defaults to f"{name}_stored".

  • dtype – Optional target dtype for the stored column. Defaults to the computed column dtype.

  • cparams – Optional compression parameters for the new stored column.

Raises:
  • ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.

  • KeyError – If name is not a computed column.

  • TypeError – If dtype is incompatible with the computed values.
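
Example (assuming a computed column total was defined earlier):

t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")    # the stored snapshot can be indexed like any stored column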

rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Drop and recreate an index with the same parameters.

rename_column(old: str, new: str) None[source]

Rename a column.

On disk tables the corresponding persisted column leaf is renamed.

Raises:
  • ValueError – If the table is read-only, is a view, or new already exists.

  • KeyError – If old does not exist.

sample(n: int, *, seed: int | None = None) CTable[source]

Return a read-only view of n randomly chosen live rows.

Parameters:
  • n – Number of rows to sample. If n >= number of live rows, returns a view of the whole table.

  • seed – Optional random seed for reproducibility.

Returns:

A read-only view sharing columns with this table.

Return type:

CTable

save(urlpath: str, *, overwrite: bool = False) None[source]

Persist this table to disk at urlpath.

This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.

Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.

Parameters:
  • urlpath – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.

  • overwrite – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.

Raises:

ValueError – If urlpath already exists and overwrite=False.
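
Example (paths are illustrative):

t.save("readings.b2z", overwrite=True)   # compact zip-backed store
t.save("readings.b2d")                   # directory-backed store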

schema_dict() dict[str, Any][source]

Return a JSON-compatible dict describing this table’s schema.

select(cols: list[str]) CTable[source]

Return a column-projection view exposing only cols.

The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.

Parameters:

cols – Ordered list of column names to keep.

Raises:
  • KeyError – If any name in cols is not a column of this table.

  • ValueError – If cols is empty.

sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]

Return a copy of the table sorted by one or more columns.

Parameters:
  • cols – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • inplace – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.

Raises:
  • ValueError – If called on a view or a read-only table when inplace=True.

  • KeyError – If any column name is not found.

  • TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
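
Example (column names are illustrative):

by_region = t.sort_by(["region", "amount"], ascending=[True, False])
t.sort_by("amount", inplace=True)    # rewrite this table's physical order in place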

tail(N: int = 5) CTable[source]

Return a view of the last N live rows (default 5).

to_arrow()[source]

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a directory-backed store.

Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.

For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.

Examples

Fast-unpack an existing compact zip store into a directory-backed table:

table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()

Materialize a filtered view into a directory-backed store:

view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)

Force a logical compacted copy, even for a persistent .b2z table:

table.to_b2d("data-compact.b2d", overwrite=True, compact=True)

to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a compact .b2z container.

.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.

For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.

Examples

Fast-pack an existing directory-backed table into a compact zip store:

table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()

Materialize a filtered view into a new compact store:

view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)

Force a logical compacted copy, even for a persistent .b2d table:

table.to_b2z("data-compact.b2z", overwrite=True, compact=True)

to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]

Write all live rows to a CSV file.

Uses Python’s stdlib csv module — no extra dependency required. Each column is materialised once via col[:]; rows are then written one at a time.

Parameters:
  • path – Destination file path. Created or overwritten.

  • header – If True (default), write column names as the first row.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]

Write this table to a Parquet file batch-wise using pyarrow.
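
Example (a sketch assuming pyarrow is installed; the path and column names are illustrative):

t.to_parquet("readings.parquet")    # zstd compression by default
t.to_parquet(
    "slim.parquet",
    columns=["sensor_id", "temperature"],
    batch_size=50_000,
    include_computed=False,
)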

view(new_valid_rows)[source]

Return a row-filter view backed by a boolean mask array without copying data.

where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]

Return a row-filtered view matching a boolean predicate.

Signature:

where(expr_result) -> CTable

The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.

The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.

Parameters:

expr_result – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.

Returns:

A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.

Return type:

CTable

Raises:

TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.

Examples

Filter using a string expression:

view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])

Filter using column arithmetic:

view = t.where((t.value * t.category) >= 150)

Blosc2 lazy functions can be used in column expressions:

view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)

For column names that are not valid Python identifiers, use item access:

view = t.where((t["unit price"] * t["quantity"]) > 100)

Notes

Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use:

t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")

not:

t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)

base: CTable | None

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.

property cbytes: int

Total compressed size in bytes (all columns + valid_rows mask).

col_names: list[str]

Ordered list of stored column names. Computed columns are not included; access those via computed_columns.

property computed_columns: dict[str, dict]

Read-only view of the computed-column definitions.

Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.

property cratio: float

Compression ratio for the whole table payload.

property indexes: list[Index]

Return a list of blosc2.Index handles for all active indexes.

property info: _CTableInfoReporter

Get information about this table.

Examples

>>> print(t.info)
>>> t.info()

property info_items: list[tuple[str, object]]

Structured summary items used by info().

property nbytes: int

Total uncompressed size in bytes (all columns + valid_rows mask).

property ncols: int

Total number of columns, including computed (virtual) columns.

property schema: CompiledSchema

The compiled schema that drives this table’s columns and validation.

Construction

CTable.__init__(row_type[, new_data, ...])

CTable.open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

CTable.load(urlpath)

Load a persistent table from urlpath into RAM.

CTable.from_arrow(schema, batches, *[, ...])

Build a CTable from an Arrow schema and iterable of record batches.

CTable.from_parquet(path, *[, columns, ...])

Read a Parquet file into a CTable.

CTable.from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

CTable.__init__(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None) None[source]
classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default) opens read-only; 'a' opens read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.

classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )

classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.

Null policy

Nullable scalar CTable columns are represented with per-column sentinel values, not native validity bitmaps. When CTable has to infer those sentinels, the selection can be customized with NullPolicy and scoped with null_policy():

policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)

with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")

The same policy is used by explicit nullable schema specs when no null_value is supplied:

from dataclasses import dataclass

@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))

with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)

Sentinels are resolved in this order: explicit null_value in the schema, NullPolicy.column_null_values for a matching column, then the type-wide NullPolicy default. Columns without nullable=True or an explicit null_value are not nullable.

NullPolicy(string_value, bytes_value, ...)

Default sentinels for inferred CTable scalar nulls.

null_policy(policy)

Temporarily set the default policy for CTable null sentinel inference.

get_null_policy()

Return the current default null policy.

class blosc2.NullPolicy(string_value: str = '__BLOSC2_NULL__', bytes_value: bytes = b'__BLOSC2_NULL__', float_value: float = nan, bool_value: int = 255, signed_int_strategy: Literal['min', 'max'] = 'min', unsigned_int_strategy: Literal['min', 'max'] = 'max', timestamp_value: int = -9223372036854775808, column_null_values: Mapping[str, Any] = <factory>)[source]

Default sentinels for inferred CTable scalar nulls.

CTable nullable scalar columns are represented with per-column sentinel values. This policy is used when CTable has to infer those sentinels, such as when importing nullable scalar Arrow or Parquet columns without an explicit column-level null sentinel. The selected sentinel is stored in the resulting CTable schema, so existing tables remain self-describing.

Examples

Use blosc2.null_policy() to apply a policy while creating a CTable from data with nullable scalar columns:

policy = blosc2.NullPolicy(
    signed_int_strategy="max",
    string_value="<NULL>",
    column_null_values={"user_id": -1, "country": "NA"},
)

with blosc2.null_policy(policy):
    table = blosc2.CTable.from_parquet("data.parquet")

The same policy is used for explicit nullable schema specs:

@dataclass
class Row:
    user_id: int = blosc2.field(blosc2.int64(nullable=True))
    country: str = blosc2.field(blosc2.string(nullable=True))

with blosc2.null_policy(policy):
    table = blosc2.CTable(Row)

column_null_values takes precedence over the type-wide defaults in the policy. This is useful when a particular column needs a sentinel that is known not to collide with its real values.

Methods

sentinel_for_arrow_type(pa, pa_type)

Return the default sentinel for pa_type, or None if unsupported.

blosc2.null_policy(policy: NullPolicy)

Temporarily set the default policy for CTable null sentinel inference.

blosc2.get_null_policy() NullPolicy[source]

Return the current default null policy.

Attributes

CTable.col_names

Ordered list of stored column names.

CTable.computed_columns

Read-only view of the computed-column definitions.

CTable.nrows

CTable.ncols

Total number of columns, including computed (virtual) columns.

CTable.cbytes

Total compressed size in bytes (all columns + valid_rows mask).

CTable.nbytes

Total uncompressed size in bytes (all columns + valid_rows mask).

CTable.schema

The compiled schema that drives this table's columns and validation.

CTable.base

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()).

CTable.col_names: list[str]

Ordered list of stored column names. Computed columns are not included; access those via computed_columns.

property CTable.computed_columns: dict[str, dict]

Read-only view of the computed-column definitions.

Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.

property CTable.nrows: int

property CTable.ncols: int

Total number of columns, including computed (virtual) columns.

property CTable.cbytes: int

Total compressed size in bytes (all columns + valid_rows mask).

property CTable.nbytes: int

Total uncompressed size in bytes (all columns + valid_rows mask).

property CTable.schema: CompiledSchema

The compiled schema that drives this table’s columns and validation.

CTable.base: CTable | None

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.

Inserting data

CTable.append(data)

Append a single row to the table.

CTable.extend(data, *[, validate])

Append multiple rows at once.

CTable.append(data: list | void | ndarray) None[source]

Append a single row to the table.

data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.

CTable.extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]

Append multiple rows at once.

data may be:

  • a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;

  • a list of rows, each compatible with append();

  • another CTable — columns are matched by name.

Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.

Querying

Boolean expressions

Use bitwise operators (&, |, ~) or string expressions for row-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use column expressions with explicit parentheses around comparisons:

t.where((t.amount > 100) & (t.region == "North"))
t.where(~t.returned)

or use string expressions when that reads better:

t.where("amount > 100 and region == 'North'")
t.where("not returned")
t["not returned"]

These three forms for negating a boolean column are equivalent: t.where(~t.returned), t.where("not returned"), and t["not returned"].

Indexing & projection

CTable indexing is type-driven:

t["amount"]                 # column access
t[3]                        # one row as a namedtuple-like object
t[3:8]                      # row view
t[[1, 4, 7]]                # gathered-row view
t[mask]                     # filtered row view
t[t.amount > 100]           # LazyExpr filtered row view, like where()
t[["region", "amount"]]   # projected column view

String keys first try exact column-name lookup. If the string is not a column name, it is interpreted as a boolean expression and behaves like CTable.where(). Boolean LazyExpr and boolean NDArray keys also behave like CTable.where(), so computed column predicates such as t[t.temperature_f > 70] are supported.

For explicit filtered projection, use:

t.where("amount > 100", columns=["region", "amount"])

When a NumPy structured array is needed, materialize explicitly:

np.asarray(t[:10])

CTable.where(expr_result, *[, columns])

Return a row-filtered view matching a boolean predicate.

CTable.view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

CTable.select(cols)

Return a column-projection view exposing only cols.

CTable.head([N])

Return a view of the first N live rows (default 5).

CTable.tail([N])

Return a view of the last N live rows (default 5).

CTable.sample(n, *[, seed])

Return a read-only view of n randomly chosen live rows.

CTable.sort_by(cols[, ascending, inplace])

Return a copy of the table sorted by one or more columns.

CTable.iter_sorted(cols[, ascending, start, ...])

Iterate rows in sorted order without materializing a full copy.

CTable.where(expr_result: str | ndarray | NDArray | LazyExpr | Column, *, columns: list[str] | tuple[str, ...] | None = None) CTable[source]

Return a row-filtered view matching a boolean predicate.

Signature:

where(expr_result) -> CTable

The predicate can be supplied as a boolean blosc2.LazyExpr, a boolean blosc2.NDArray, a boolean NumPy array, a boolean Column, or a string expression evaluated against this table’s columns. String expressions can reference stored and computed columns directly by name.

The returned object is a CTable view sharing the original column data. The row-selection mask is evaluated immediately and intersected with the table’s current live rows; selected column data is not copied.

Parameters:

expr_result – Boolean predicate selecting rows. Strings are converted to a lazy expression with table columns as operands, e.g. "value * category >= 150". Column objects can also be used in Python expressions, e.g. (t.value * t.category) >= 150.

Returns:

A view over the same columns containing only rows where the predicate is true and the source row is live. When columns is provided, the returned view is additionally projected to that ordered subset of columns.

Return type:

CTable

Raises:

TypeError – If expr_result does not evaluate to a boolean Blosc2/NumPy array or lazy expression.

Examples

Filter using a string expression:

view = t.where("value * category >= 150")
slim = t.where("value * category >= 150", columns=["value", "category"])

Filter using column arithmetic:

view = t.where((t.value * t.category) >= 150)

Blosc2 lazy functions can be used in column expressions:

view = t.where(((t.value + 2) * blosc2.sin(t.category)) >= 10)

For column names that are not valid Python identifiers, use item access:

view = t.where((t["unit price"] * t["quantity"]) > 100)

Notes

Use bitwise operators (&, |, ~) or string expressions for element-wise boolean logic. Python’s logical operators and, or and not cannot be overloaded and therefore do not build lazy column expressions.

Use:

t.where((t.x > 0) & (t.y < 10))
t.where(~t.returned)
t.where("not returned")

not:

t.where((t.x > 0) and (t.y < 10))
t.where(not t.returned)

CTable.view(new_valid_rows)[source]

Return a row-filter view backed by a boolean mask array without copying data.

CTable.select(cols: list[str]) CTable[source]

Return a column-projection view exposing only cols.

The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.

Parameters:

cols – Ordered list of column names to keep.

Raises:
  • KeyError – If any name in cols is not a column of this table.

  • ValueError – If cols is empty.

CTable.head(N: int = 5) CTable[source]

Return a view of the first N live rows (default 5).

CTable.tail(N: int = 5) CTable[source]

Return a view of the last N live rows (default 5).

CTable.sample(n: int, *, seed: int | None = None) CTable[source]

Return a read-only view of n randomly chosen live rows.

Parameters:
  • n – Number of rows to sample. If n >= number of live rows, returns a view of the whole table.

  • seed – Optional random seed for reproducibility.

Returns:

A read-only view sharing columns with this table.

Return type:

CTable

CTable.sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]

Return a copy of the table sorted by one or more columns.

Parameters:
  • cols – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • inplace – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.

Raises:
  • ValueError – If called on a view or a read-only table when inplace=True.

  • KeyError – If any column name is not found.

  • TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).

CTable.iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]

Iterate rows in sorted order without materializing a full copy.

Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.

The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.

Parameters:
  • cols – Column name or list of column names to sort by.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • start, stop, step – Optional slice applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.

  • batch_size – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.

Mutations

In addition to physical schema changes such as CTable.add_column(), CTables can host computed columns backed by a lazy expression over stored columns. Computed columns are read-only, use no extra storage, participate in display, filtering, sorting, and aggregates, and are persisted across CTable.save(), CTable.load(), and CTable.open().

When a computed result should become a normal stored column, use CTable.materialize_computed_column(). The materialized column is a stored snapshot that can be indexed like any other stored column. New rows inserted later via CTable.append() or CTable.extend() auto-fill omitted materialized-column values from the recorded expression metadata.

CTable.delete(ind)

Mark one or more rows as deleted (tombstone deletion).

CTable.compact()

Physically rewrite every column array keeping only live rows.

CTable.add_column(name, spec)

Add a new column filled from the default declared in spec.

CTable.add_computed_column(name, expr, *[, ...])

Add a read-only virtual column whose values are computed from other columns.

CTable.materialize_computed_column(name, *[, ...])

Materialize a computed column into a new stored snapshot column.

CTable.drop_computed_column(name)

Remove a computed column from the table.

CTable.drop_column(name)

Remove a column from the table.

CTable.rename_column(old, new)

Rename a column.

CTable.delete(ind: int | slice | str | Iterable) None[source]

Mark one or more rows as deleted (tombstone deletion).

ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.

CTable.compact()[source]

Physically rewrite every column array keeping only live rows.

Closes the gaps left by prior delete() calls. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
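
A minimal sketch of tombstone deletion followed by compaction (row indices are illustrative):

t.delete(0)             # single logical row
t.delete([3, 5, 8])     # iterable of logical indices
t.compact()             # reclaim storage; recreate any indexes afterwards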

CTable.add_column(name: str, spec: SchemaSpec | Field) None[source]

Add a new column filled from the default declared in spec.

Parameters:
  • name – Column name. Must follow the same naming rules as schema fields.

  • spec – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.

Raises:
  • ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.

  • TypeError – If a declared default cannot be coerced to spec’s dtype.
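
A minimal sketch (the column name is illustrative):

import blosc2 as b2

# Existing live rows are backfilled with the declared default.
t.add_column("discount", b2.field(b2.float64(ge=0), default=0.0))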

CTable.add_computed_column(name: str, expr, *, dtype: dtype | None = None) None[source]

Add a read-only virtual column whose values are computed from other columns.

The column stores no data — it is evaluated on-the-fly when read. It participates in display, filtering, sorting, export (to_arrow / to_csv), and aggregates, but cannot be written to, indexed, or included in append / extend inputs.

Parameters:
  • name – Column name. Must not collide with any existing stored or computed column and must satisfy the usual naming rules.

  • expr – Either a callable (cols: dict[str, NDArray]) -> LazyExpr or an expression string (e.g. "price * qty") where column names are referenced directly and resolved from stored columns.

  • dtype – Override the inferred result dtype. When omitted the dtype is taken from the blosc2.LazyExpr.

Raises:
  • ValueError – If called on a view, the table is read-only, name already exists, or an operand is not a stored column of this table.

  • TypeError – If expr is not a callable or string, or does not return a blosc2.LazyExpr.

CTable.materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]

Materialize a computed column into a new stored snapshot column.

Parameters:
  • name – Existing computed column to materialize.

  • new_name – Name of the new stored column. Defaults to f"{name}_stored".

  • dtype – Optional target dtype for the stored column. Defaults to the computed column dtype.

  • cparams – Optional compression parameters for the new stored column.

Raises:
  • ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.

  • KeyError – If name is not a computed column.

  • TypeError – If dtype is incompatible with the computed values.

CTable.drop_computed_column(name: str) None[source]

Remove a computed column from the table.

Parameters:

name – Name of the computed column to remove.

Raises:
  • KeyError – If name is not a computed column.

  • ValueError – If called on a view.

CTable.drop_column(name: str) None[source]

Remove a column from the table.

For on-disk tables, the corresponding persisted column leaf is deleted.

Raises:
  • ValueError – If the table is read-only, is a view, or name is the last column.

  • KeyError – If name does not exist.

CTable.rename_column(old: str, new: str) None[source]

Rename a column.

For on-disk tables, the corresponding persisted column leaf is renamed.

Raises:
  • ValueError – If the table is read-only, is a view, or new already exists.

  • KeyError – If old does not exist.

Indexes

CTable indexes are created with CTable.create_index() and returned as blosc2.Index handles. For tables, Index refers to an entry stored in the table index catalog and delegates maintenance operations such as drop(), rebuild(), and compact() back to the owning table. Users normally only receive these handles from the CTable API; they do not instantiate them directly.

Indexes can target stored columns or direct expressions over stored columns via create_index(expression=...). This lets queries reuse indexes for derived predicates without adding either a computed column or a materialized stored one. A matching FULL direct-expression index can also be reused by ordering paths such as CTable.sort_by() when sorting by a computed column backed by the same expression. OPSI indexes are a separate exact-filtering tier with a tunable number of iterative ordering cycles; they are not intended to converge to a completely sorted FULL/CSI index, so use FULL when globally sorted ordered reuse is required.
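
A minimal sketch, assuming a table t with "user_id", "price", and "qty" columns:

idx = t.create_index("user_id")                      # stored-column index (default kind)
expr_idx = t.create_index(expression="price * qty")  # direct-expression index for derived predicates
t.indexes                                            # list of blosc2.Index handles
t.drop_index(expression="price * qty")               # remove it again
# A FULL index (kind parameter of create_index()) additionally enables sorted reuse
# by sort_by() / iter_sorted().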

CTable.create_index([col_name, field, ...])

Build and register an index for a stored column or table expression.

CTable.index([col_name, expression, name])

Return the index handle for a stored-column or expression target.

CTable.indexes

Return a list of blosc2.Index handles for all active indexes.

CTable.drop_index([col_name, expression, name])

Remove an index and delete any sidecar files.

CTable.rebuild_index([col_name, expression, ...])

Drop and recreate an index with the same parameters.

CTable.compact_index([col_name, expression, ...])

Compact an index, merging any incremental append runs.

CTable.create_index(col_name: str | None = None, *, field: str | None = None, expression: str | None = None, operands: dict | None = None, kind: IndexKind = IndexKind.BUCKET, optlevel: int = 5, name: str | None = None, build: str = 'auto', tmpdir: str | None = None, **kwargs) Index[source]

Build and register an index for a stored column or table expression.

CTable.index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Return the index handle for a stored-column or expression target.

CTable.indexes

Return a list of blosc2.Index handles for all active indexes.

CTable.drop_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) None[source]

Remove an index and delete any sidecar files.

CTable.rebuild_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Drop and recreate an index with the same parameters.

CTable.compact_index(col_name: str | None = None, *, expression: str | None = None, name: str | None = None) Index[source]

Compact an index, merging any incremental append runs.

See blosc2.Index for the returned handle attributes and methods.

Persistence

Persist CTables to disk or interchange formats, and restore them later without losing schema information. These methods cover native Blosc2 persistence as well as import/export paths for CSV, Arrow, and Parquet data.

CTable.load(urlpath)

Load a persistent table from urlpath into RAM.

CTable.open(urlpath, *[, mode])

Open a persistent CTable from urlpath.

CTable.save(urlpath, *[, overwrite])

Persist this table to disk at urlpath.

CTable.to_b2z(urlpath, *[, overwrite, compact])

Write this table to a compact .b2z container.

CTable.to_b2d(urlpath, *[, overwrite, compact])

Write this table to a directory-backed store.

CTable.to_csv(path, *[, header, sep])

Write all live rows to a CSV file.

CTable.to_arrow()

Convert all live rows to a pyarrow.Table.

CTable.to_parquet(path, *[, columns, ...])

Write this table to a Parquet file batch-wise using pyarrow.

CTable.from_arrow(schema, batches, *[, ...])

Build a CTable from an Arrow schema and iterable of record batches.

CTable.from_parquet(path, *[, columns, ...])

Read a Parquet file into a CTable.

CTable.from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

classmethod CTable.load(urlpath: str) CTable[source]

Load a persistent table from urlpath into RAM.

The schema is read from the table’s metadata — the original Python dataclass is not required. The returned table is fully in-memory and read/write.

Parameters:

urlpath – Path to the table root directory.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

classmethod CTable.open(urlpath: str, *, mode: str = 'r') CTable[source]

Open a persistent CTable from urlpath.

Parameters:
  • urlpath – Path to the table root directory (created by passing urlpath to CTable).

  • mode – 'r' (default) — read-only. 'a' — read/write.

Raises:
  • FileNotFoundError – If urlpath does not contain a CTable.

  • ValueError – If the metadata at urlpath does not identify a CTable.

CTable.save(urlpath: str, *, overwrite: bool = False) None[source]

Persist this table to disk at urlpath.

This writes a standalone copy and returns None; use copy() directly when the copied CTable object is needed.

Only live rows are written — the on-disk table is always compacted. A .b2z suffix selects the compact zip-backed format; any other suffix creates a directory-backed store. Use a .b2d suffix for directory-backed stores when possible so the format is clear.

Parameters:
  • urlpath – Destination path. Use a .b2z suffix for a compact zip-backed store; any other suffix creates a directory-backed store. A .b2d suffix is recommended for directory-backed stores.

  • overwrite – If False (default), raise ValueError when urlpath already exists. Set to True to replace an existing table.

Raises:

ValueError – If urlpath already exists and overwrite=False.
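
A minimal sketch (paths are illustrative):

t.save("events.b2d", overwrite=True)    # directory-backed store
t.save("events.b2z", overwrite=True)    # compact zip-backed store
t2 = blosc2.CTable.open("events.b2d", mode="r")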

CTable.to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a compact .b2z container.

.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.

For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.

Examples

Fast-pack an existing directory-backed table into a compact zip store:

table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()

Materialize a filtered view into a new compact store:

view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)

Force a logical compacted copy, even for a persistent .b2d table:

table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
CTable.to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a directory-backed store.

Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.

For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.

Examples

Fast-unpack an existing compact zip store into a directory-backed table:

table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()

Materialize a filtered view into a directory-backed store:

view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)

Force a logical compacted copy, even for a persistent .b2z table:

table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
CTable.to_csv(path: str, *, header: bool = True, sep: str = ',') None[source]

Write all live rows to a CSV file.

Uses Python’s stdlib csv module — no extra dependency required. Each column is materialised once via col[:]; rows are then written one at a time.

Parameters:
  • path – Destination file path. Created or overwritten.

  • header – If True (default), write column names as the first row.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.
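
A minimal sketch (paths are illustrative):

t.to_csv("events.csv")                           # comma-separated with a header row
t.to_csv("events.tsv", sep="\t", header=False)   # headerless TSV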

CTable.to_arrow()[source]

Convert all live rows to a pyarrow.Table.

CTable.to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]

Write this table to a Parquet file batch-wise using pyarrow.
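
A minimal sketch (column names are illustrative):

t.to_parquet("events.parquet", columns=["user_id", "amount"], compression="zstd")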

classmethod CTable.from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Arrow struct columns are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
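
A minimal sketch using pyarrow record batches (column names are illustrative):

import pyarrow as pa
import blosc2

schema = pa.schema([("user_id", pa.int64()), ("name", pa.string())])
batches = [
    pa.RecordBatch.from_pydict({"user_id": [1, 2], "name": ["ann", "bob"]}, schema=schema),
]

t = blosc2.CTable.from_arrow(schema, batches)                              # "name" becomes vlstring()
t_fixed = blosc2.CTable.from_arrow(schema, batches, string_max_length=16)  # fixed-width string columns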

classmethod CTable.from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method. Top-level Arrow struct<...> columns are imported as struct() columns backed by batched variable-length storage. Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow().

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
classmethod CTable.from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless of whether a header is present.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.
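
A minimal sketch, assuming a CSV file whose column order matches the dataclass fields:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Row:
    user_id: int = b2.field(b2.int64(ge=0))
    score: float = b2.field(b2.float64())

t = b2.CTable.from_csv("scores.csv", Row)   # header row skipped by default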

Inspection & statistics

Compute common descriptive statistics directly on CTable data without materializing rows first. These methods operate column-wise on the compressed representation, making it easy to summarize distributions or measure relationships between numeric columns.

CTable.column_schema(name)

Return the CompiledColumn descriptor for name.

CTable.info

Get information about this table.

CTable.schema_dict()

Return a JSON-compatible dict describing this table's schema.

CTable.describe()

Print a per-column statistical summary.

CTable.cov()

Return the covariance matrix as a numpy array.

CTable.column_schema(name: str) CompiledColumn[source]

Return the CompiledColumn descriptor for name.

Raises:

KeyError – If name is not a column in this table.

CTable.info()

Get information about this table.

Examples

>>> print(t.info)
>>> t.info()
CTable.schema_dict() dict[str, Any][source]

Return a JSON-compatible dict describing this table’s schema.

CTable.describe() None[source]

Print a per-column statistical summary.

Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.

CTable.cov() ndarray[source]

Return the covariance matrix as a numpy array.

Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.

Returns:

Shape (ncols, ncols). Column order matches col_names.

Return type:

numpy.ndarray

Raises:
  • TypeError – If any column has an unsupported dtype (complex, string, …).

  • ValueError – If the table has fewer than 2 live rows (covariance undefined).
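
A minimal sketch, assuming every column is numeric or bool:

t.describe()     # per-column text summary
m = t.cov()      # numpy array of shape (ncols, ncols)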


Column

A lazy column accessor returned by table["col_name"] or table.col_name. All index operations and aggregates apply the table’s tombstone mask (_valid_rows) so deleted rows are silently excluded.

class blosc2.Column(table: CTable, col_name: str, mask=None)[source]

Column view for a CTable, with vectorized operations and reductions.

Attributes:
dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

is_computed

True if this column is a virtual computed column (read-only).

is_list
is_varlen_scalar

True if this column holds variable-length scalar strings or bytes.

ndim

Number of logical dimensions.

null_value

The sentinel value that represents NULL for this column, or None.

shape

Logical shape of the live column values.

size

Number of live values in the column.

view

Return a ColumnViewIndexer for creating logical sub-views.

Methods

all()

Return True if every live, non-null value is True.

any()

Return True if at least one live, non-null value is True.

assign(data)

Replace all live values in this column with data.

is_null()

Return a boolean array True where the live value is the null sentinel.

iter_chunks([size])

Iterate over live column values in chunks of size rows.

max(*[, where])

Maximum live, non-null value.

mean(*[, where])

Arithmetic mean of all live, non-null values.

min(*[, where])

Minimum live, non-null value.

notnull()

Return a boolean array True where the live value is not the null sentinel.

null_count()

Return the number of live rows whose value equals the null sentinel.

std([ddof, where])

Standard deviation of all live, non-null values (single-pass, Welford's algorithm).

sum([dtype, where, jit, jit_backend])

Sum of all live, non-null values.

unique()

Return sorted array of unique live, non-null values.

value_counts()

Return a {value: count} dict sorted by count descending.

Special methods

Column.__len__()

Return the number of live (non-deleted) values in this column.

Column.__iter__()

Iterate over live column values in insertion order, skipping deleted rows.

Column.__getitem__(key)

Return values for the given logical index.

Column.__setitem__(key, value)

Set one or more live column values; accepts the same index forms as __getitem__().

__len__()[source]

Return the number of live (non-deleted) values in this column.

__iter__()[source]

Iterate over live column values in insertion order, skipping deleted rows.

__getitem__(key: int | slice | list | ndarray)[source]

Return values for the given logical index.

  • int → scalar

  • slicenumpy.ndarray

  • list / np.ndarraynumpy.ndarray

  • bool np.ndarraynumpy.ndarray

For a writable logical sub-view use view.

__setitem__(key: int | slice | list | ndarray, value)[source]

Set one or more live column values; accepts the same index forms as __getitem__().

all() bool[source]

Return True if every live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.

any() bool[source]

Return True if at least one live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.

assign(data) None[source]

Replace all live values in this column with data.

Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.

Parameters:

data – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.

Raises:
  • ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.

  • TypeError – If values cannot be coerced to the column’s dtype.
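
A minimal sketch (the "score" column and the where() filter are illustrative):

import numpy as np

t["score"].assign(np.zeros(len(t["score"])))        # overwrite every live value
view = t.where(t.score < 0)
view["score"].assign([0.0] * len(view["score"]))    # only rows visible through the view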

is_null() ndarray[source]

Return a boolean array True where the live value is the null sentinel.

For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None.

iter_chunks(size: int = 65536)[source]

Iterate over live column values in chunks of size rows.

Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.

Parameters:

size – Number of live rows per yielded chunk. Defaults to 65 536.

Yields:

numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.

Examples

>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
max(*, where=None)[source]

Maximum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

mean(*, where=None) float[source]

Arithmetic mean of all live, non-null values.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.

min(*, where=None)[source]

Minimum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

notnull() ndarray[source]

Return a boolean array True where the live value is not the null sentinel.

null_count() int[source]

Return the number of live rows whose value equals the null sentinel.

Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.

std(ddof: int = 0, *, where=None) float[source]

Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).

Parameters:
  • ddof – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. Always returns a Python float.

sum(dtype=None, *, where=None, jit=None, jit_backend=None)[source]

Sum of all live, non-null values.

Returns zero for an empty column or filtered view.

Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.

Parameters:
  • dtype – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.

  • jit – Optional miniexpr JIT policy passed to the lazy reduction engine.

  • jit_backend – Optional miniexpr JIT backend. Use "tcc" or "cc".

Examples

Sum values matching a predicate without materializing a filtered view:

total = t["amount"].sum(where=t.category == 3)

Combine several column predicates:

total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))

Nullable sentinel values are skipped automatically:

# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
unique() ndarray[source]

Return sorted array of unique live, non-null values.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

value_counts() dict[source]

Return a {value: count} dict sorted by count descending.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Example

>>> t["active"].value_counts()
{True: 8432, False: 1568}
property dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

property is_computed: bool

True if this column is a virtual computed column (read-only).

property is_varlen_scalar: bool

True if this column holds variable-length scalar strings or bytes.

property ndim: int

Number of logical dimensions.

property null_value

The sentinel value that represents NULL for this column, or None.

property shape: tuple[int]

Logical shape of the live column values.

property size: int

Number of live values in the column.

property view: ColumnViewIndexer

Return a ColumnViewIndexer for creating logical sub-views.

Examples

Read a sub-view for chained aggregates:

sub = t.price.view[2:10]
sub.sum()

Bulk write through a sub-view:

t.price.view[0:5][:] = np.zeros(5)

Attributes

Column.dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

Column.null_value

The sentinel value that represents NULL for this column, or None.

property Column.dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

property Column.null_value

The sentinel value that represents NULL for this column, or None.

Data access

Column.view

Return a ColumnViewIndexer for creating logical sub-views.

Column.iter_chunks([size])

Iterate over live column values in chunks of size rows.

Column.assign(data)

Replace all live values in this column with data.

property Column.view: ColumnViewIndexer

Return a ColumnViewIndexer for creating logical sub-views.

Examples

Read a sub-view for chained aggregates:

sub = t.price.view[2:10]
sub.sum()

Bulk write through a sub-view:

t.price.view[0:5][:] = np.zeros(5)
Column.iter_chunks(size: int = 65536)[source]

Iterate over live column values in chunks of size rows.

Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.

Parameters:

size – Number of live rows per yielded chunk. Defaults to 65 536.

Yields:

numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.

Examples

>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
Column.assign(data) None[source]

Replace all live values in this column with data.

Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.

Parameters:

data – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.

Raises:
  • ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.

  • TypeError – If values cannot be coerced to the column’s dtype.

Nullable helpers

Column.is_null()

Return a boolean array True where the live value is the null sentinel.

Column.notnull()

Return a boolean array True where the live value is not the null sentinel.

Column.null_count()

Return the number of live rows whose value equals the null sentinel.

Column.is_null() ndarray[source]

Return a boolean array True where the live value is the null sentinel.

For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None.

Column.notnull() ndarray[source]

Return a boolean array True where the live value is not the null sentinel.

Column.null_count() int[source]

Return the number of live rows whose value equals the null sentinel.

Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.
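
A minimal sketch, assuming a nullable "email" column:

col = t["email"]
col.null_count()                # live rows holding the null sentinel
present = col[col.notnull()]    # __getitem__ accepts the boolean mask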

Unique values

Column.unique()

Return sorted array of unique live, non-null values.

Column.value_counts()

Return a {value: count} dict sorted by count descending.

Column.unique() ndarray[source]

Return sorted array of unique live, non-null values.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Column.value_counts() dict[source]

Return a {value: count} dict sorted by count descending.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Example

>>> t["active"].value_counts()
{True: 8432, False: 1568}

Aggregates

Null sentinel values are automatically excluded from all aggregates.

Column.sum([dtype, where, jit, jit_backend])

Sum of all live, non-null values.

Column.min(*[, where])

Minimum live, non-null value.

Column.max(*[, where])

Maximum live, non-null value.

Column.mean(*[, where])

Arithmetic mean of all live, non-null values.

Column.std([ddof, where])

Standard deviation of all live, non-null values (single-pass, Welford's algorithm).

Column.any()

Return True if at least one live, non-null value is True.

Column.all()

Return True if every live, non-null value is True.

Column.sum(dtype=None, *, where=None, jit=None, jit_backend=None)[source]

Sum of all live, non-null values.

Returns zero for an empty column or filtered view.

Supported dtypes: bool, int, uint, float, complex. Bool values are counted as 0 / 1. Null sentinel values are skipped.

Parameters:
  • dtype – Optional accumulator dtype. When omitted, float columns use np.float64, complex columns use np.complex128, and integer / bool columns use np.int64.

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included. This enables direct filtered aggregate pushdown, avoiding creation of an intermediate filtered table view.

  • jit – Optional miniexpr JIT policy passed to the lazy reduction engine.

  • jit_backend – Optional miniexpr JIT backend. Use "tcc" or "cc".

Examples

Sum values matching a predicate without materializing a filtered view:

total = t["amount"].sum(where=t.category == 3)

Combine several column predicates:

total = t.col2.sum(where=(t.col1 < 300) & (t.col2 < 400))

Nullable sentinel values are skipped automatically:

# Equivalent to summing only live rows where predicate is true and
# t.col2 is not its configured null sentinel.
total = t.col2.sum(where=t.col1 < 300)
Column.min(*, where=None)[source]

Minimum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

Column.max(*, where=None)[source]

Maximum live, non-null value.

Supported dtypes: bool, int, uint, float, string, bytes. Strings are compared lexicographically. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included.

Column.mean(*, where=None) float[source]

Arithmetic mean of all live, non-null values.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. When where is provided, only rows matching the boolean predicate are included. Always returns a Python float.

Column.std(ddof: int = 0, *, where=None) float[source]

Standard deviation of all live, non-null values (single-pass, Welford’s algorithm).

Parameters:
  • ddof – Delta degrees of freedom. 0 (default) gives the population std; 1 gives the sample std (divides by N-1).

  • where – Optional boolean predicate. Only rows where the predicate is true, the table row is live, and this column is non-null are included.

Supported dtypes: bool, int, uint, float. Null sentinel values are skipped. Always returns a Python float.

Column.any() bool[source]

Return True if at least one live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first True found.

Column.all() bool[source]

Return True if every live, non-null value is True.

Supported dtypes: bool. Null sentinel values are skipped. Short-circuits on the first False found.


Schema Specs

Schema specs are passed to field() to declare a column’s type, storage constraints, and optional null sentinel. They are also available directly in the blosc2 namespace (e.g. blosc2.int64).

blosc2.field(spec: ~blosc2.schema.SchemaSpec, *, default=<dataclasses._MISSING_TYPE object>, cparams: dict[str, ~typing.Any] | None = None, dparams: dict[str, ~typing.Any] | None = None, chunks: tuple[int, ...] | None = None, blocks: tuple[int, ...] | None = None) Field[source]

Attach a Blosc2 schema spec and per-column storage options to a dataclass field.

Parameters:
  • spec – A schema descriptor such as b2.int64(ge=0) or b2.float64().

  • default – Default value for the field. Omit for required fields.

  • cparams – Compression parameters for this column’s NDArray.

  • dparams – Decompression parameters for this column’s NDArray.

  • chunks – Chunk shape for this column’s NDArray.

  • blocks – Block shape for this column’s NDArray.

Examples

>>> from dataclasses import dataclass
>>> import blosc2 as b2
>>> @dataclass
... class Row:
...     id: int = b2.field(b2.int64(ge=0))
...     score: float = b2.field(b2.float64(ge=0, le=100))
...     active: bool = b2.field(b2.bool(), default=True)

Numeric

int8(*[, ge, gt, le, lt, nullable, null_value])

8-bit signed integer column (−128 … 127).

int16(*[, ge, gt, le, lt, nullable, null_value])

16-bit signed integer column (−32 768 … 32 767).

int32(*[, ge, gt, le, lt, nullable, null_value])

32-bit signed integer column (−2 147 483 648 … 2 147 483 647).

int64(*[, ge, gt, le, lt, nullable, null_value])

64-bit signed integer column.

uint8(*[, ge, gt, le, lt, nullable, null_value])

8-bit unsigned integer column (0 … 255).

uint16(*[, ge, gt, le, lt, nullable, null_value])

16-bit unsigned integer column (0 … 65 535).

uint32(*[, ge, gt, le, lt, nullable, null_value])

32-bit unsigned integer column (0 … 4 294 967 295).

uint64(*[, ge, gt, le, lt, nullable, null_value])

64-bit unsigned integer column.

float32(*[, ge, gt, le, lt, nullable, ...])

32-bit floating-point column (single precision).

float64(*[, ge, gt, le, lt, nullable, ...])

64-bit floating-point column (double precision).

timestamp(*[, unit, timezone, nullable, ...])

Timestamp column stored as signed 64-bit epoch offsets.

class blosc2.int8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

8-bit signed integer column (−128 … 127).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int8

class blosc2.int16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

16-bit signed integer column (−32 768 … 32 767).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int16

class blosc2.int32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit signed integer column (−2 147 483 648 … 2 147 483 647).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int32

class blosc2.int64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit signed integer column.

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int64

class blosc2.uint8(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

8-bit unsigned integer column (0 … 255).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint8

class blosc2.uint16(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

16-bit unsigned integer column (0 … 65 535).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint16

class blosc2.uint32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit unsigned integer column (0 … 4 294 967 295).

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint32

class blosc2.uint64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit unsigned integer column.

Methods

python_type

alias of int

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of uint64

class blosc2.float32(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

32-bit floating-point column (single precision).

Methods

python_type

alias of float

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of float32

class blosc2.float64(*, ge=None, gt=None, le=None, lt=None, nullable: bool = False, null_value=None)[source]

64-bit floating-point column (double precision).

Methods

python_type

alias of float

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of float64

class blosc2.timestamp(*, unit: str = 'us', timezone: str | None = None, nullable: bool = False, null_value=None)[source]

Timestamp column stored as signed 64-bit epoch offsets.

The physical storage dtype is int64. unit follows Arrow/NumPy datetime units: "s", "ms", "us" or "ns". timezone is metadata preserved for Arrow/Parquet roundtrips.

Methods

python_type

alias of object

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of int64

Complex

complex64()

64-bit complex number column (two 32-bit floats).

complex128()

128-bit complex number column (two 64-bit floats).

class blosc2.complex64[source]

64-bit complex number column (two 32-bit floats).

Methods

python_type

alias of complex

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of complex64

class blosc2.complex128[source]

128-bit complex number column (two 64-bit floats).

Methods

python_type

alias of complex

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of complex128

Boolean

bool(*[, nullable, null_value])

Boolean column.

class blosc2.bool(*, nullable: bool = False, null_value=None)[source]

Boolean column.

Nullable bool columns use uint8 physical storage with values 0 (false), 1 (true), and 255 (null).

Methods

python_type

alias of bool

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

type

alias of bool

Text & binary

string(*[, min_length, max_length, pattern, ...])

Fixed-width Unicode string column.

bytes(*[, min_length, max_length, nullable, ...])

Fixed-width bytes column.

vlstring(*[, nullable, serializer, ...])

Build a variable-length scalar string schema descriptor.

vlbytes(*[, nullable, serializer, ...])

Build a variable-length scalar bytes schema descriptor.

struct(fields, *[, nullable])

Build a structured schema descriptor for dict-like CTable values.

object(*[, nullable, serializer, ...])

Build a schema-less Python object column descriptor for CTable.

list(item_spec, *[, nullable, storage, ...])

Build a list-valued schema descriptor for CTable and ListArray.

class blosc2.string(*, min_length=None, max_length=None, pattern=None, nullable: bool = False, null_value=None)[source]

Fixed-width Unicode string column.

Parameters:
  • max_length – Maximum number of characters. Determines the NumPy U<n> dtype. Defaults to 32 if not specified.

  • min_length – Minimum number of characters (validation only, no effect on dtype).

  • pattern – Regex pattern the value must match (validation only).

  • nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.

  • null_value – Explicit null sentinel. Takes precedence over nullable=True.

Methods

python_type

alias of str

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

class blosc2.bytes(*, min_length=None, max_length=None, nullable: bool = False, null_value=None)[source]

Fixed-width bytes column.

Parameters:
  • max_length – Maximum number of bytes. Determines the NumPy S<n> dtype. Defaults to 32 if not specified.

  • min_length – Minimum number of bytes (validation only, no effect on dtype).

  • nullable – If True and null_value is not set, choose a null sentinel from the current CTable null policy when the schema is compiled.

  • null_value – Explicit null sentinel. Takes precedence over nullable=True.

Methods

python_type

alias of bytes

to_metadata_dict()

Return a JSON-compatible dict for schema serialization.

to_pydantic_kwargs()

Return kwargs for building a Pydantic field annotation.

blosc2.vlstring(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLStringSpec[source]

Build a variable-length scalar string schema descriptor.

Use this as an explicit opt-in when a CTable column holds long or wildly variable-length strings that would waste space in a fixed-width string(max_length=N) column. Must be requested via blosc2.field(blosc2.vlstring()) — it is never inferred automatically from plain str annotations.

blosc2.vlbytes(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) VLBytesSpec[source]

Build a variable-length scalar bytes schema descriptor.

Use this as an explicit opt-in when a CTable column holds long or wildly variable-length byte strings. Must be requested via blosc2.field(blosc2.vlbytes()) — it is never inferred automatically from plain bytes annotations.
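
A minimal sketch contrasting fixed-width and variable-length text columns:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Doc:
    title: str = b2.field(b2.string(max_length=64))    # fixed-width, up to 64 characters
    body: str = b2.field(b2.vlstring(nullable=True))   # variable-length, explicit opt-in
    blob: bytes = b2.field(b2.vlbytes())                # variable-length bytes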

blosc2.struct(fields: dict[str, SchemaSpec], *, nullable: bool = False) StructSpec[source]

Build a structured schema descriptor for dict-like CTable values.

Top-level struct columns store one dictionary (or None when nullable) per row. Struct specs may also be nested as list item specs.

blosc2.object(*, nullable: bool = False, serializer: str = 'msgpack', batch_rows: int | None = 2048, items_per_block: int | None = None) ObjectSpec[source]

Build a schema-less Python object column descriptor for CTable.

Values are stored via batched msgpack serialization. Prefer typed specs such as struct(), list(), vlstring(), or vlbytes() when the data has a stable schema; use object for heterogeneous per-row payloads.

blosc2.list(item_spec: SchemaSpec, *, nullable: bool = False, storage: str = 'batch', serializer: str = 'msgpack', batch_rows: int | None = None, items_per_block: int | None = None) ListSpec[source]

Build a list-valued schema descriptor for CTable and ListArray.

Timestamp columns

Timestamp columns are declared with blosc2.timestamp and store signed 64-bit epoch offsets with timestamp metadata. Column reads return numpy.datetime64 values, comparisons accept numpy.datetime64 values, ISO-like strings, or Python datetime objects, and Arrow/Parquet import/export roundtrips timestamp units and time zones:

from dataclasses import dataclass
import numpy as np
import blosc2 as b2

@dataclass
class Event:
    when: np.datetime64 = b2.field(b2.timestamp(unit="us", nullable=True))
    value: int = b2.field(b2.int64())

table = b2.CTable(Event)
table.append(["2025-01-01T12:00:00", 42])
recent = table[table.when >= np.datetime64("2025-01-01", "us")]

Object columns

Schema-less object columns are declared with blosc2.object() and store one msgpack-serializable Python object (or None when nullable) per row in batched variable-length storage. Prefer typed specs such as blosc2.struct() or blosc2.list() when the payload has a stable schema; use object columns for heterogeneous per-row payloads:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Event:
    id: int = b2.field(b2.int64())
    payload: object = b2.field(b2.object(nullable=True))

table = b2.CTable(Event)
table.append([1, {"kind": "click", "xy": [10, 20]}])
table.append([2, ("custom", {"nested": True})])
table.append([3, None])

Object columns have no fixed Arrow type, so CTable.to_arrow() and CTable.to_parquet() raise for them unless users first convert the payloads to a typed representation. They are not used as an implicit fallback during Parquet import; unsupported Arrow/Parquet types still raise unless explicitly imported through CTable.from_arrow() with object_fallback=True.

Struct columns

Struct columns are declared with blosc2.struct() and store one dictionary (or None when nullable) per row in batched variable-length storage. They are also used when importing top-level Arrow/Parquet struct<...> columns:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    properties: dict = b2.field(
        b2.struct({"code": b2.int32(), "label": b2.vlstring()}, nullable=True)
    )

table = b2.CTable(Product)
table.append([{"code": 1, "label": "fresh"}])
table.append([None])

List columns

List columns are declared with blosc2.list(), for example:

from dataclasses import dataclass
import blosc2 as b2

@dataclass
class Product:
    code: str = b2.field(b2.string(max_length=8))
    tags: list[str] = b2.field(b2.list(b2.string(), nullable=True))

List cells are read and written as whole values; mutating the Python list returned by a read does not persist, so reassign the modified list:

table = b2.CTable(Product)
table.append(["A1", ["new"]])      # illustrative row

row_tags = table.tags[0]
row_tags.append("extra")      # local Python list only
table.tags[0] = row_tags      # explicit write-back