Miscellaneous

This page documents the miscellaneous members of the blosc2 module that do not fit into other categories.

blosc2.cpu_info = {'count': 4, 'l1_data_cache_size': 32768, 'l2_cache_size': 524288, 'l3_cache_size': 33554432}
class blosc2.finfo(dtype)

Machine limits for floating point types.

bits

The number of bits occupied by the type.

Type:

int

dtype

Returns the dtype for which finfo returns information. For complex input, the returned dtype is the associated float* dtype for its real and imaginary components.

Type:

dtype

eps

The difference between 1.0 and the next smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.

Type:

float

epsneg

The difference between 1.0 and the next smallest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.

Type:

float

iexp

The number of bits in the exponent portion of the floating point representation.

Type:

int

machep

The exponent that yields eps.

Type:

int

max

The largest representable number.

Type:

floating point number of the appropriate type

maxexp

The smallest positive power of the base (2) that causes overflow. Corresponds to the C standard MAX_EXP.

Type:

int

min

The smallest representable number, typically -max.

Type:

floating point number of the appropriate type

minexp

The most negative power of the base (2) consistent with there being no leading 0’s in the mantissa. Corresponds to the C standard MIN_EXP - 1.

Type:

int

negep

The exponent that yields epsneg.

Type:

int

nexp

The number of bits in the exponent including its sign and bias.

Type:

int

nmant

The number of explicit bits in the mantissa (excluding the implicit leading bit for normalized numbers).

Type:

int

precision

The approximate number of decimal digits to which this kind of float is precise.

Type:

int

resolution

The approximate decimal resolution of this type, i.e., 10**-precision.

Type:

floating point number of the appropriate type

tiny

An alias for smallest_normal, kept for backwards compatibility.

Type:

float

smallest_normal

The smallest positive floating point number with 1 as leading bit in the mantissa following IEEE-754 (see Notes).

Type:

float

smallest_subnormal

The smallest positive floating point number with 0 as leading bit in the mantissa following IEEE-754.

Type:

float

Parameters:

dtype (float, dtype, or instance) – Kind of floating point or complex floating point data-type about which to get information.

See also

iinfo

The equivalent for integer data types.

spacing

The distance between a value and the nearest adjacent number.

nextafter

The next floating point value after x1 towards x2.

Notes

For developers of NumPy: do not instantiate this at the module level. The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.

Note that smallest_normal is not actually the smallest positive representable value in a NumPy floating point type. As in the IEEE-754 standard [1], NumPy floating point types make use of subnormal numbers to fill the gap between 0 and smallest_normal. However, subnormal numbers may have significantly reduced precision [2].

For longdouble, the representation varies across platforms. On most platforms it is IEEE 754 binary128 (quad precision) or binary64-extended (80-bit extended precision). On PowerPC systems, it may use the IBM double-double format (a pair of float64 values), which has special characteristics for precision and range.

This function can also be used for complex data types. In that case, the output is the same as for the corresponding real float type (e.g. numpy.finfo(numpy.csingle) is the same as numpy.finfo(numpy.single)), and the reported limits apply to each of the real and imaginary components.

References

[1]

IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp.1-70, 2008, https://doi.org/10.1109/IEEESTD.2008.4610935

[2]

Wikipedia, “Denormal Numbers”, https://en.wikipedia.org/wiki/Denormal_number

Examples

>>> import numpy as np
>>> np.finfo(np.float64).dtype
dtype('float64')
>>> np.finfo(np.complex64).dtype
dtype('float32')
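
The attribute definitions above can be cross-checked for 64-bit floats with the standard library alone. The following is a minimal sketch, assuming Python 3.9 or later (for math.nextafter) and a platform where the built-in float is IEEE-754 binary64. It does not call blosc2.finfo itself; the float64 values reported there are expected, but not verified here, to match.

```python
import math
import sys

# eps: gap between 1.0 and the next representable float above it.
eps = math.nextafter(1.0, 2.0) - 1.0
# epsneg: gap between 1.0 and the next representable float below it.
epsneg = 1.0 - math.nextafter(1.0, 0.0)
# smallest_normal: smallest positive float with a leading 1 in the mantissa.
smallest_normal = sys.float_info.min
# smallest_subnormal: the first representable float above 0.0.
smallest_subnormal = math.nextafter(0.0, 1.0)

print(eps == 2.0**-52, epsneg == 2.0**-53)
print(smallest_normal == 2.0**-1022)
print(smallest_subnormal == math.ldexp(1.0, -1074))
```

The gap below 1.0 is half the gap above it because 1.0 sits at a power-of-two boundary, where the spacing between adjacent floats doubles.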
Attributes:
epsneg
iexp
machep
negep
nexp
resolution
tiny

Return the value for tiny, alias of smallest_normal.

Returns:

tiny (float) – Value for the smallest normal, alias of smallest_normal.

Warns:

UserWarning – If the calculated value for the smallest normal is requested for double-double.

class blosc2.iinfo(type)

Machine limits for integer types.

bits

The number of bits occupied by the type.

Type:

int

dtype

Returns the dtype for which iinfo returns information.

Type:

dtype

min

The smallest integer expressible by the type.

Type:

int

max

The largest integer expressible by the type.

Type:

int

Parameters:

int_type (integer type, dtype, or instance) – The kind of integer data type to get information about.

See also

finfo

The equivalent for floating point data types.

Examples

With types:

>>> import numpy as np
>>> ii16 = np.iinfo(np.int16)
>>> ii16.min
-32768
>>> ii16.max
32767
>>> ii32 = np.iinfo(np.int32)
>>> ii32.min
-2147483648
>>> ii32.max
2147483647

With instances:

>>> ii32 = np.iinfo(np.int32(10))
>>> ii32.min
-2147483648
>>> ii32.max
2147483647
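
The limits shown above follow directly from the bit width of a two's-complement integer. As a rough illustration, the helper below (int_limits is a hypothetical name, not part of blosc2 or NumPy) derives them in plain Python:

```python
def int_limits(bits: int, signed: bool = True) -> tuple[int, int]:
    """Return (min, max) for a two's-complement integer of the given width."""
    if signed:
        # One bit encodes the sign, so the magnitude range is 2**(bits - 1).
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2**bits - 1


print(int_limits(16))                # (-32768, 32767)
print(int_limits(32))                # (-2147483648, 2147483647)
print(int_limits(8, signed=False))   # (0, 255)
```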
Attributes:
max

Maximum value of given dtype.

min

Minimum value of given dtype.

blosc2.get_matmul_library() str | None[source]

Return the library used by the active matmul fast backend, if any.

Returns:

"Accelerate.framework" when the selected backend is Accelerate, the loaded CBLAS library path for runtime-discovered CBLAS backends, or None when the selected backend is naive.

Return type:

str | None
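
The three possible return values can be told apart with a small helper. This is only a sketch: describe_matmul_library is a hypothetical function, and the CBLAS path used in the usage line is an invented example value, not something blosc2 is known to return on a given system.

```python
def describe_matmul_library(lib):
    """Turn the documented return value into a short label."""
    if lib is None:
        return "naive backend"
    if lib == "Accelerate.framework":
        return "Accelerate backend"
    # Any other string is the path of the runtime-discovered CBLAS library.
    return f"CBLAS backend loaded from {lib}"


# In real use the argument would come from blosc2.get_matmul_library().
print(describe_matmul_library(None))
print(describe_matmul_library("Accelerate.framework"))
print(describe_matmul_library("/usr/lib/libopenblas.so"))  # hypothetical path
```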

Unclassified module members

The list below is intentionally generated from blosc2 module members that are not excluded above. It acts as a reminder to classify newly documented public objects into the appropriate reference section.

class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None, create_summary_index: bool = True)[source]

Columnar compressed table with typed columns and row-oriented access.

Attributes:
blocks

Block shape shared by the table’s aligned fixed-size columns.

cbytes

Total compressed size in bytes (all columns + valid_rows mask).

chunks

Chunk shape shared by the table’s aligned fixed-size columns.

computed_columns

Read-only view of the computed-column definitions.

cratio

Compression ratio for the whole table payload.

indexes

Return a list of blosc2.Index handles for all active indexes.

info

Get information about this table.

info_items

Structured summary items used by info().

nbytes

Total uncompressed size in bytes (all columns + valid_rows mask).

ncols

Total number of columns, including computed (virtual) columns.

nrows
schema

The compiled schema that drives this table’s columns and validation.

vlmeta

Variable-length metadata attached to this table.

Methods

add_column(name, spec)

Add a new column filled from the default declared in spec.

add_computed_column(name, expr, *[, dtype, ...])

Add a read-only virtual column computed from stored columns.

add_generated_column(name, *, values[, ...])

Add a stored generated column maintained by the table.

append(data)

Append a single row to the table.

close()

Close any persistent backing store held by this table.

column_schema(name)

Return the CompiledColumn descriptor for name.

compact()

Physically rewrite every column array keeping only live rows.

compact_index([col_name, expression, name])

Compact an index, merging any incremental append runs.

cov()

Return the covariance matrix as a numpy array.

create_index([col_name, field, expression, ...])

Build and register an index for a stored column or table expression.

delete(ind)

Mark one or more rows as deleted (tombstone deletion).

describe()

Print a per-column statistical summary.

drop_column(name)

Remove a column from the table.

drop_computed_column(name)

Remove a computed column from the table.

drop_index([col_name, expression, name])

Remove an index and delete any sidecar files.

extend(data, *[, validate])

Append multiple rows at once.

from_arrow(schema, batches, *[, urlpath, ...])

Build a CTable from an Arrow schema and iterable of record batches.

from_csv(path, row_cls, *[, header, sep])

Build a CTable from a CSV file.

from_pandas(df, row_cls)

Build a CTable from a pandas DataFrame.

from_parquet(path, *[, columns, batch_size, ...])

Read a Parquet file into a CTable.

group_by(keys, *[, sort, dropna, engine, ...])

Return a deferred group-by object for this table.

head([N])

Return a view of the first N live rows (default 5).

index([col_name, expression, name])

Return the index handle for a stored-column or expression target.

iter_arrow_batches(*[, columns, batch_size, ...])

Yield live rows as bounded-size pyarrow.RecordBatch objects.

iter_sorted(cols[, ascending, start, stop, ...])

Iterate rows in sorted order without materializing a full copy.

materialize_computed_column(name, *[, ...])

Materialize a computed column into a new stored snapshot column.

rebuild_index([col_name, expression, name])

Drop and recreate an index with the same parameters.

refresh_generated_column(name)

Recompute a stored generated/materialized column from its source columns.

refresh_generated_columns(*[, source])

Refresh all generated columns, optionally only those depending on source.

rename_column(old, new)

Rename a column.

sample(n, *[, seed])

Return a read-only view of n randomly chosen live rows.

schema_dict()

Return a JSON-compatible dict describing this table's schema.

select(cols)

Return a column-projection view exposing only cols.

slice(start[, stop, copy])

Return a contiguous range of live (non-deleted) rows.

sort_by(cols[, ascending, inplace])

Return a copy of the table sorted by one or more columns.

tail([N])

Return a view of the last N live rows (default 5).

to_arrow()

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath, *[, overwrite, compact])

Write this table to a directory-backed store.

to_b2z(urlpath, *[, overwrite, compact])

Write this table to a compact .b2z container.

to_csv([path, header, sep])

Write all live rows to CSV.

to_pandas()

Convert to a pandas DataFrame.

to_parquet(path, *[, columns, batch_size, ...])

Write this table to a Parquet file batch-wise using pyarrow.

to_string(*[, max_rows, max_width, ...])

Return a tabular string representation of the table.

trim_capacity()

Shrink fixed-width physical storage to the last live row position.

view(new_valid_rows)

Return a row-filter view backed by a boolean mask array without copying data.

add_column(name: str, spec: SchemaSpec | Field) None[source]

Add a new column filled from the default declared in spec.

Parameters:
  • name – Column name. Must follow the same naming rules as schema fields.

  • spec – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.

Raises:
  • ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.

  • TypeError – If a declared default cannot be coerced to spec’s dtype.

add_computed_column(name: str, expr: str | LazyExpr | DSLKernel | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None, inputs: list[str] | None = None) None[source]

Add a read-only virtual column computed from stored columns.

A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append() / extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().

Supported signatures are:

add_computed_column(name, "price * qty")
add_computed_column(name, lazy_expr)
add_computed_column(name, dsl_kernel, inputs=["price", "qty"])
add_computed_column(name, blosc2.lazyudf(dsl_kernel, (t.price, t.qty)))
add_computed_column(name, lambda cols: cols["price"] * cols["qty"])
Parameters:
  • name – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.

  • expr

    Definition of the virtual column. Accepted forms:

    • str: scalar expression over stored scalar columns, e.g. "price * qty".

    • blosc2.LazyExpr: lazy expression over stored columns of this table.

    • blosc2.dsl_kernel()-decorated kernel passed directly with inputs=[...] — one stored scalar column name per kernel parameter, bound positionally. The kernel may use loops, if/else and where(...). Its source is persisted and recompiled on open; the column stays virtual/unstored.

    • blosc2.LazyUDF built from a blosc2.dsl_kernel() via blosc2.lazyudf() — column bindings are inferred by identity from the operands, so inputs= is not needed. Accepted forms include blosc2.lazyudf(kernel, (t.col1, t.col2)) (using Column accessors) or the raw NDArray equivalents.

    • callable: called as expr(self._cols) and must return a blosc2.LazyExpr or a blosc2.LazyUDF backed by a blosc2.dsl_kernel().

    DSL columns (last three forms) are persisted — their source is stored and recompiled on open — and may be referenced inside where() predicates.

    Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....

  • dtype – Optional dtype override for the computed values. For expression forms it is inferred from the resulting blosc2.LazyExpr when omitted. For DSL forms, an omitted dtype is inferred by NumPy type promotion of the input column dtypes (correct for elementwise arithmetic kernels); pass dtype explicitly for kernels that change the type (comparisons/where/casts) or when the kernel has no column inputs. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.

  • inputs – Only used when expr is a bare blosc2.dsl_kernel(): a list of stored scalar column names, one per kernel parameter, bound positionally (kernel parameter i is bound to inputs[i]). Not needed when passing a blosc2.LazyUDF or a callable; bindings are inferred from the operands in those cases.

Examples

Add a computed column from a string expression and use it like a normal read-only column:

t.add_computed_column("total", "price * qty")
assert t.total[:].shape == (t.nrows,)

Add a computed column from a callable. The callable receives the table’s stored column mapping:

t.add_computed_column(
    "price_with_tax",
    lambda cols: cols["price"] * 1.21,
    dtype=np.float64,
)

Callable expressions can use normal Python logic while still returning a lazy expression:

def total_expr(cols):
    base = cols["price"] * cols["qty"]
    return base * 1.21 if include_tax else base

t.add_computed_column("total", total_expr)

They are also convenient for reusable, parameterized helpers:

def ratio(num, den):
    return lambda cols: cols[num] / cols[den]

t.add_computed_column("margin", ratio("profit", "revenue"))

Computed columns participate in filters and aggregates:

expensive = t.where(t.total > 100)
total_revenue = t.total.sum()

Computed columns are virtual and read-only and cannot be indexed. If you need to filter or sort by this value frequently, use a generated column instead — it is physically stored and can be indexed:

t.add_generated_column(
    "total_stored",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)

Or convert an existing computed column to a stored snapshot:

t.materialize_computed_column("total", new_name="total_stored")
t.create_index("total_stored")
Raises:
  • ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.

  • TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().

add_generated_column(name: str, *, values: str | LazyExpr | DSLKernel | Callable[[dict[str, Any]], LazyExpr] | RowTransformer, dtype=None, create_index: bool = False, inputs: list[str] | None = None) None[source]

Add a stored generated column maintained by the table.

A generated column is physical storage, not a virtual expression. The initial values are computed for all current live rows, and later append() / extend() calls automatically compute values for newly inserted rows when source columns are provided. If a source column is modified in-place, dependent generated columns are marked stale; call refresh_generated_column() or refresh_generated_columns() to recompute them.

Supported signatures are:

add_generated_column(name, *, values="price * qty", dtype=..., create_index=False)
add_generated_column(name, *, values=lazy_expr, dtype=...)
add_generated_column(name, *, values=dsl_kernel, inputs=["price", "qty"], dtype=...)
add_generated_column(name, *, values=blosc2.lazyudf(dsl_kernel, (t.price, t.qty)))
add_generated_column(name, *, values=lambda cols: cols["price"] * 1.21, dtype=...)
add_generated_column(name, *, values=t.embedding.row_transformer.norm(axis=0), dtype=...)
add_generated_column(name, *, values=t.image.row_transformer.mean(axis=(0, 1)),
                     dtype=blosc2.ndarray((3,), dtype=...))
Parameters:
  • name – Name of the generated column to create. It must be a valid column name and must not collide with an existing stored or computed column.

  • values

    Definition used to compute the generated values. Accepted forms:

    • str: scalar expression over stored scalar columns, e.g. "price * qty". The expression must produce one scalar value per row.

    • blosc2.LazyExpr: scalar lazy expression over stored columns of this table. It must produce a 1-D scalar stream.

    • blosc2.dsl_kernel()-decorated kernel passed directly with inputs=[...] — one stored scalar column name per kernel parameter, bound positionally. Produces one scalar per row. The kernel source is persisted and recompiled on open; appended rows are auto-filled and refresh_generated_column() recomputes after in-place edits.

    • blosc2.LazyUDF built from a blosc2.dsl_kernel() via blosc2.lazyudf() — column bindings are inferred by identity from the operands, so inputs= is not needed. Accepts Column accessors (t.col1) or raw NDArrays as operands. Same persistence and auto-fill behaviour as above.

    • callable: called as values(self._cols) and must return a blosc2.LazyExpr or a blosc2.LazyUDF backed by a blosc2.dsl_kernel().

    • RowTransformer: row-wise projection/reduction bound to a fixed-shape ndarray column, e.g. t.embedding.row_transformer.norm(axis=0) or t.image.row_transformer.mean(axis=(0, 1)). Row transformers may produce either one scalar per row or one fixed-shape ndarray item per row.

    Expression and DSL forms currently cannot depend on computed columns and cannot directly consume fixed-shape ndarray columns; use a row-transformer for ndarray row projections/reductions.

  • dtype – Output schema or dtype. Scalar outputs may pass a NumPy dtype or a Blosc2 scalar spec such as blosc2.float64(). Fixed-shape ndarray outputs must pass an ndarray spec such as blosc2.ndarray((3,), dtype=blosc2.float32()) unless the table has existing rows from which the output shape can be inferred. When omitted, dtype and fixed-shape output shape are inferred from the current generated values; this is not possible for an empty table.

  • create_index – If True, create an index on the generated column immediately. Only scalar generated columns can be indexed; fixed-shape ndarray generated columns raise ValueError when indexing is requested.

  • inputs – Only used when values is a bare blosc2.dsl_kernel(): a list of stored scalar column names, one per kernel parameter, bound positionally. Not needed when passing a blosc2.LazyUDF or a callable — bindings are inferred from the operands in those cases.

Examples

Create and index a scalar generated column from a string expression:

t.add_generated_column(
    "total",
    values="price * qty",
    dtype=blosc2.float64(),
    create_index=True,
)

Use a callable when normal Python composition is more convenient:

t.add_generated_column(
    "price_with_tax",
    values=lambda cols: cols["price"] * 1.21,
    dtype=blosc2.float64(),
)

Generate a scalar from each fixed-shape ndarray row. For row transformers, axes refer to the per-row item shape, so axis=0 is the embedding-coordinate axis for item_shape=(dim,):

t.add_generated_column(
    "embedding_norm",
    values=t.embedding.row_transformer.norm(axis=0, ord=2),
    dtype=blosc2.float64(),
    create_index=True,
)

Generate a fixed-shape ndarray value per row. Here an image column has item_shape=(height, width, 3) and the generated column stores one RGB vector per row:

t.add_generated_column(
    "image_mean_rgb",
    values=t.image.row_transformer.mean(axis=(0, 1)),
    dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
)

Generated columns are maintained on append/extend:

t.append((new_id, new_embedding, new_image))
assert t.embedding_norm[-1] == np.linalg.norm(new_embedding)

If source values are changed in place, refresh dependent generated columns before relying on them:

t.embedding[0] = new_embedding
t.refresh_generated_column("embedding_norm")
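
The contrast between a virtual computed column and a stored generated column, including the stale-then-refresh cycle, can be pictured with a toy model in plain Python. This is not how blosc2 implements either feature; ToyTable and all of its members are hypothetical and exist only to illustrate the behaviour described in this section.

```python
class ToyTable:
    """Toy model: one virtual column and one stored generated column."""

    def __init__(self, price, qty):
        self.price = list(price)
        self.qty = list(qty)
        # Stored generated column: computed once for the current rows.
        self.total_stored = [p * q for p, q in zip(self.price, self.qty)]
        self.stale = False

    @property
    def total(self):
        # Virtual column: recomputed from the sources on every read.
        return [p * q for p, q in zip(self.price, self.qty)]

    def append(self, price, qty):
        # New rows get their generated value filled in automatically.
        self.price.append(price)
        self.qty.append(qty)
        self.total_stored.append(price * qty)

    def set_price(self, i, value):
        # An in-place edit of a source column marks the stored values stale.
        self.price[i] = value
        self.stale = True

    def refresh(self):
        # Recompute the stored column from its sources.
        self.total_stored = self.total
        self.stale = False


t = ToyTable(price=[2.0, 3.0], qty=[5, 1])
t.append(4.0, 2)
t.set_price(0, 10.0)
print(t.total)         # virtual: reflects the edit immediately
print(t.total_stored)  # stored: still holds the old value for row 0
t.refresh()
print(t.total_stored)  # stored values now match the sources again
```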
Raises:
  • ValueError – If called on a view or read-only table, if name already exists, if generated output length/shape is incompatible with the table, or if create_index=True is requested for an ndarray generated column.

  • TypeError – If values has an unsupported form, references unsupported source columns, or cannot be coerced to dtype.

  • KeyError – If a RowTransformer references a missing source column.

append(data: list | void | ndarray) None[source]

Append a single row to the table.

data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.

For tables with nested (dotted) column names the row dict may be supplied either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape — both are accepted and automatically flattened to the physical dotted leaf names:

# flat dotted keys
t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8,
          "payment.fare": 12.5})

# original nested dict (auto-flattened)
t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}},
          "payment": {"fare": 12.5}})
base: CTable | None

Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.

property blocks: tuple | None

Block shape shared by the table’s aligned fixed-size columns.

None if the table has no fixed-size scalar columns. See chunks for the matching chunk shape.

property cbytes: int

Total compressed size in bytes (all columns + valid_rows mask).

property chunks: tuple | None

Chunk shape shared by the table’s aligned fixed-size columns.

None if the table has no fixed-size scalar columns. See blocks for the matching block shape.

close() None[source]

Close any persistent backing store held by this table.

On the first close of a writable root table, this also builds the automatic SUMMARY indexes (unless create_summary_index=False); see the create_summary_index parameter of CTable for how this interacts with in-memory vs. persistent tables.

col_names: list[str]

Ordered list of stored column names. Computed columns are not included; access those via computed_columns.

column_schema(name: str) CompiledColumn[source]

Return the CompiledColumn descriptor for name.

Raises:

KeyError – If name is not a column in this table.

compact()[source]

Physically rewrite every column array keeping only live rows.

Closes the gaps left by prior delete() calls by shuffling live data to the front of each column array. The underlying NDArray allocations are not resized — each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.

property computed_columns: dict[str, dict]

Read-only view of the computed-column definitions.

Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.

cov() ndarray[source]

Return the covariance matrix as a numpy array.

Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.

Returns:

Shape (ncols, ncols). Column order matches col_names.

Return type:

numpy.ndarray

Raises:
  • TypeError – If any column has an unsupported dtype (complex, string, …).

  • ValueError – If the table has fewer than 2 live rows (covariance undefined).

property cratio: float

Compression ratio for the whole table payload.

delete(ind: int | slice | str | Iterable) None[source]

Mark one or more rows as deleted (tombstone deletion).

ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
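
Tombstone deletion and the later compaction step can be illustrated with a toy model. The sketch below is plain Python and makes no claim about blosc2 internals; ToyColumn is a hypothetical class that mimics a fixed-capacity column plus a validity mask.

```python
class ToyColumn:
    """Toy model of tombstone deletion over a fixed-capacity column."""

    def __init__(self, values, capacity):
        padding = capacity - len(values)
        self.data = list(values) + [None] * padding
        # valid[i] is True while physical slot i holds a live row.
        self.valid = [True] * len(values) + [False] * padding

    def live(self):
        return [v for v, ok in zip(self.data, self.valid) if ok]

    def delete(self, logical_index):
        # Translate the logical index (counting live rows only) to a slot.
        physical = [i for i, ok in enumerate(self.valid) if ok][logical_index]
        self.valid[physical] = False  # tombstone: the value stays in place

    def compact(self):
        # Move live values to the front; the capacity is left unchanged.
        live = self.live()
        padding = len(self.data) - len(live)
        self.data = live + [None] * padding
        self.valid = [True] * len(live) + [False] * padding


col = ToyColumn([10, 20, 30, 40], capacity=6)
col.delete(1)      # removes the live row holding 20
col.delete(1)      # logical index 1 now refers to 30
print(col.live())  # [10, 40]
print(col.data)    # 20 and 30 are still physically present
col.compact()
print(col.data, len(col.data))
```

Note that the second delete(1) hits a different physical slot than the first, because logical indices count live rows only.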

describe() None[source]

Print a per-column statistical summary.

Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.

drop_column(name: str) None[source]

Remove a column from the table.

For on-disk tables, the corresponding persisted column leaf is deleted.

Raises:
  • ValueError – If the table is read-only, is a view, or name is the last column.

  • KeyError – If name does not exist.

drop_computed_column(name: str) None[source]

Remove a computed column from the table.

Parameters:

name – Name of the computed column to remove.

Raises:
  • KeyError – If name is not a computed column.

  • ValueError – If called on a view.

extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]

Append multiple rows at once.

data may be:

  • a dict of arrays {"col": array, ...} — all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;

  • a list of rows, each compatible with append();

  • another CTable — columns are matched by name.

Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.

For tables with nested (dotted) column names both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:

# nested dict of arrays
t.extend({
    "trip": {"begin": {"lon": lons, "lat": lats}},
    "payment": {"fare": fares},
})

# list of nested dicts
t.extend([
    {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
    {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
])
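
The auto-flattening of nested rows into dotted leaf names, as accepted by append() and extend(), can be sketched in a few lines of plain Python. The flatten helper below is hypothetical and only illustrates the name mapping; it is not the routine blosc2 uses.

```python
def flatten(row, prefix=""):
    """Flatten a nested dict into a flat dict keyed by dotted leaf names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dicts, extending the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


nested = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
print(flatten(nested))
```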
classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False, create_summary_index: bool = True, chunks: int | tuple[int, ...] | None = None, blocks: int | tuple[int, ...] | None = None) CTable[source]

Build a CTable from an Arrow schema and iterable of record batches.

Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:

import pyarrow as pa, blosc2
trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
schema = pa.schema([pa.field("trip", trip_type)])
t = blosc2.CTable.from_arrow(schema, batches)
t.col_names          # ['trip.begin.lon']
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
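
Rebuilding the nested dict shape from dotted leaf names, as row reads are described to do, can be pictured with the following sketch. The unflatten helper is hypothetical and written in plain Python for illustration only.

```python
def unflatten(flat):
    """Rebuild a nested dict from a flat dict keyed by dotted leaf names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for part in parents:
            # Create intermediate dicts on demand while walking down.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5}
print(unflatten(row))
```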

When string_max_length is None (the default), scalar Arrow string / large_string columns are imported as vlstring() columns and binary / large_binary columns are imported as vlbytes() columns. Struct columns that are not flattened into leaf columns (those not containing only scalar leaves) are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.

When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string() / bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring() / vlbytes() columns.

blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.

list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.

Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().

column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.

classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]

Build a CTable from a CSV file.

Schema comes from row_cls (a dataclass) — CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).

Parameters:
  • path – Source CSV file path.

  • row_cls – A dataclass whose fields define the column names and types.

  • header – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

A new in-memory CTable containing all rows from the CSV file.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If a row has a different number of fields than the schema.
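
As a rough illustration of the single-pass, per-column read described above, the following stdlib-only sketch parses CSV text into one Python list per dataclass field. It is not the library's implementation: read_columns and the Row schema are hypothetical, and the final bulk write into NDArrays is left out.

```python
import csv
import dataclasses
import io
import typing


@dataclasses.dataclass
class Row:  # hypothetical schema used only for this sketch
    id: int
    score: float


def read_columns(text, row_cls, *, header=True, sep=","):
    """Read CSV text into per-column lists, typed by the dataclass fields."""
    if not dataclasses.is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    hints = typing.get_type_hints(row_cls)
    names = [f.name for f in dataclasses.fields(row_cls)]
    columns = {name: [] for name in names}
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    if header:
        next(reader, None)  # header is skipped; order comes from row_cls
    for row in reader:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(row)}")
        for name, cell in zip(names, row):
            columns[name].append(hints[name](cell))
    return columns


columns = read_columns("id,score\n1,0.5\n2,1.5\n", Row)
```

Each list in columns could then be written to storage with a single slice assignment per column.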

classmethod from_pandas(df, row_cls) CTable[source]

Build a CTable from a pandas DataFrame.

Schema comes from row_cls (a dataclass) — CTable is always typed. Object-dtype DataFrame columns are not automatically inferred as ndarray columns; the row_cls must explicitly declare blosc2.ndarray() fields.

Parameters:
  • df – Source pandas DataFrame.

  • row_cls – A dataclass whose fields define the column names and types.

Returns:

A new CTable containing all DataFrame rows.

Return type:

CTable

Raises:
  • TypeError – If row_cls is not a dataclass.

  • ValueError – If DataFrame columns do not match the row_cls schema.

classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]

Read a Parquet file into a CTable.

The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.

This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.

Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns — the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:

t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names               # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
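
The flattening and row reconstruction described above can be sketched on plain dicts. This is only an illustration of the dotted-name convention; flatten and unflatten are hypothetical helpers, not blosc2 functions.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into a flat mapping keyed by dotted leaf names."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out


def unflatten(flat):
    """Rebuild the nested dict shape from dotted leaf names."""
    out = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = out
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return out


row = {"trip": {"begin": {"lon": -73.98, "lat": 40.75}}}
flat = flatten(row)
restored = unflatten(flat)
```

Here flat has the keys trip.begin.lon and trip.begin.lat, and restored equals the original nested row.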

Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.

Parameters:
  • path (str or path-like) – Path to the source Parquet file.

  • columns (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.

  • batch_size (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.

  • urlpath (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.

  • mode (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().

  • cparams (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • dparams (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().

  • validate (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.

  • auto_null_sentinels (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.

  • blosc2_batch_size (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().

  • blosc2_items_per_block (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors compression ratio but makes random access slower.

  • list_serializer ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.

  • separate_nested_cols (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.

  • max_rows (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.

  • **kwargs – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
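
To make the interaction between separate_nested_cols and max_rows more concrete, here is a stdlib-only sketch of the root record stream idea: each Parquet row holds a list of structs, every list element becomes one output row, and max_rows caps the flattened element rows. The function explode_root_list is a hypothetical name, and plain Python lists stand in for Arrow data.

```python
from itertools import chain, islice


def explode_root_list(parquet_rows, max_rows=None):
    """Turn rows of list<struct> into one output row per list element."""
    if max_rows is not None and max_rows < 0:
        raise ValueError("max_rows must be non-negative")
    elements = chain.from_iterable(parquet_rows)
    # islice(..., None) keeps every element.
    return list(islice(elements, max_rows))


parquet_rows = [
    [{"id": 1}, {"id": 2}],  # first Parquet row: two elements
    [{"id": 3}],             # second Parquet row: one element
]
all_rows = explode_root_list(parquet_rows)
capped = explode_root_list(parquet_rows, max_rows=2)
```

Two Parquet rows yield three element rows here, and max_rows=2 keeps the first two elements rather than the first two Parquet rows.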

Returns:

A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file. If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.

Return type:

CTable

Raises:
  • ImportError – If pyarrow is not installed.

  • ValueError – If batch_size is not greater than 0.

  • ValueError – If max_rows is negative.

  • ValueError – If columns contains duplicate names.

  • Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.

Examples

Load an entire Parquet file into an in-memory table:

>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")

Load only a subset of columns:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )

Create a disk-backed table while reading in batches:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )

Pass additional options through to PyArrow’s Parquet reader:

>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]

Return a deferred group-by object for this table.

Parameters:
  • keys – Column name or sequence of column names to group by.

  • sort – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.

  • dropna – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.

  • engine – Execution engine. Phase 1 accepts "auto" and uses the NumPy chunked implementation.

  • chunk_size – Optional number of physical rows processed per chunk.

Returns:

A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.

Return type:

CTableGroupBy
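
The effect of sort and dropna can be sketched with a plain dict-based count. This is not the library's engine: group_sizes is a hypothetical helper, first-seen order is assumed to stand in for the hash aggregation order, and placing the null group last when sorting is also an assumption.

```python
import math


def is_null_key(key):
    return key is None or (isinstance(key, float) and math.isnan(key))


def group_sizes(keys, *, sort=False, dropna=True):
    """Count rows per key, mimicking the sort/dropna switches."""
    counts = {}
    for key in keys:
        if is_null_key(key):
            if dropna:
                continue
            key = None  # fold every null/NaN key into a single group
        counts[key] = counts.get(key, 0) + 1
    if sort:
        present = sorted((k, v) for k, v in counts.items() if k is not None)
        nulls = [(k, v) for k, v in counts.items() if k is None]
        return dict(present + nulls)  # assumption: null group goes last
    return counts


keys = ["b", "a", None, "b", float("nan")]
unsorted_sizes = group_sizes(keys)
sorted_sizes = group_sizes(keys, sort=True)
with_nulls = group_sizes(keys, dropna=False)
```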

head(N: int = 5) CTable[source]

Return a view of the first N live rows (default 5).

property info: _CTableInfoReporter

Get information about this table.

Examples

>>> print(t.info)
>>> t.info()
property info_items: list[tuple[str, object]]

Structured summary items used by info().

iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]

Yield live rows as bounded-size pyarrow.RecordBatch objects.

iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]

Iterate rows in sorted order without materializing a full copy.

Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.

The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. During iteration, positions are decompressed batch_size at a time.

Parameters:
  • cols – Column name or list of column names to sort by.

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • start, stop, step – Optional slice applied to the sorted sequence before iteration. E.g. stop=10 yields only the top-10 rows; step=2 yields every other row in sorted order.

  • batch_size – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
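
The ordering, slicing and batching steps can be sketched without blosc2 or NumPy. In this illustration, sorted_positions plays the role of the lexsort fallback using repeated stable sorts, and iter_batches applies the start/stop/step slice before handing out positions batch_size at a time. Both function names are hypothetical.

```python
def sorted_positions(columns, cols, ascending=True):
    """Return row positions ordered by one or more key columns."""
    if isinstance(cols, str):
        cols = [cols]
    if isinstance(ascending, bool):
        ascending = [ascending] * len(cols)
    if len(ascending) != len(cols):
        raise ValueError("ascending must have the same length as cols")
    positions = list(range(len(columns[cols[0]])))
    # Stable sorts applied from the last key to the first give a lexicographic order.
    for name, asc in reversed(list(zip(cols, ascending))):
        positions.sort(key=columns[name].__getitem__, reverse=not asc)
    return positions


def iter_batches(positions, start=None, stop=None, step=None, batch_size=4096):
    """Slice the sorted positions, then yield them in bounded batches."""
    selected = positions[slice(start, stop, step)]
    for i in range(0, len(selected), batch_size):
        yield selected[i:i + batch_size]


columns = {"city": ["b", "a", "b", "a"], "fare": [1, 2, 3, 4]}
order = sorted_positions(columns, ["city", "fare"], ascending=[True, False])
top3_batches = list(iter_batches(order, stop=3, batch_size=2))
```

The first key is the primary one: rows are ordered by city ascending, and ties are broken by fare descending.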

materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]

Materialize a computed column into a new stored snapshot column.

Parameters:
  • name – Existing computed column to materialize.

  • new_name – Name of the new stored column. Defaults to f"{name}_stored".

  • dtype – Optional target dtype for the stored column. Defaults to the computed column dtype.

  • cparams – Optional compression parameters for the new stored column.

Raises:
  • ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.

  • KeyError – If name is not a computed column.

  • TypeError – If dtype is incompatible with the computed values.

property nbytes: int

Total uncompressed size in bytes (all columns + valid_rows mask).

property ncols: int

Total number of columns, including computed (virtual) columns.

refresh_generated_column(name: str) None[source]

Recompute a stored generated/materialized column from its source columns.

refresh_generated_columns(*, source: str | None = None) None[source]

Refresh all generated columns, optionally only those depending on source.

rename_column(old: str, new: str) None[source]

Rename a column.

On disk-backed tables, the corresponding persisted column leaf is renamed.

Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:

t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()   # works as a regular Column

Raises:
  • ValueError – If the table is read-only, is a view, or new already exists.

  • KeyError – If old does not exist.

sample(n: int, *, seed: int | None = None) CTable[source]

Return a read-only view of n randomly chosen live rows.

Parameters:
  • n – Number of rows to sample. If n >= number of live rows, returns a view of the whole table.

  • seed – Optional random seed for reproducibility.

Returns:

A read-only view sharing columns with this table.

Return type:

CTable

property schema: CompiledSchema

The compiled schema that drives this table’s columns and validation.

schema_dict() dict[str, Any][source]

Return a JSON-compatible dict describing this table’s schema.

select(cols: list[str]) CTable[source]

Return a column-projection view exposing only cols.

The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.

Parameters:

cols

Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:

t.select(["trip.begin"])   # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])          # expands to all trip.* leaves

Raises:
  • KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).

  • ValueError – If cols is empty.
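
The struct-prefix expansion can be sketched as a lookup over the dotted column names. The helper expand_selection is hypothetical and only mirrors the documented rules: exact names are kept, prefixes expand to their descendant leaves, unknown names raise KeyError, and an empty list raises ValueError.

```python
def expand_selection(col_names, cols):
    """Expand leaf names and struct prefixes into an ordered list of leaves."""
    if not cols:
        raise ValueError("cols must not be empty")
    selected = []
    for name in cols:
        if name in col_names:
            matches = [name]
        else:
            matches = [c for c in col_names if c.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        for match in matches:
            if match not in selected:
                selected.append(match)
    return selected


col_names = ["id", "trip.begin.lon", "trip.begin.lat", "trip.sec"]
begin_leaves = expand_selection(col_names, ["trip.begin"])
trip_and_id = expand_selection(col_names, ["trip", "id"])
```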

slice(start, stop=None, /, *, copy: bool = True) CTable[source]

Return a contiguous range of live (non-deleted) rows.

The range is given the way range() takes its bounds, either as a single stop (table.slice(stop)), as start/stop integers (table.slice(start, stop)), or as a Python slice (table.slice(slice(start, stop))). Negative bounds count from the end; step is not supported.

Parameters:
  • start, stop – Range bounds, interpreted as logical positions among the live rows.

  • copy – When True (the default, mirroring NDArray.slice()) a compact copy of the range is returned. When False a zero-copy view is returned instead, sharing the parent’s column data (read-only, like head()/tail()).

Returns:

out – The requested rows, re-indexed from 0.

Return type:

CTable
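
How the bounds map onto live rows can be sketched with a validity mask and a plain list. The helper live_slice is hypothetical; it returns the physical positions of the selected live rows, and rejecting any explicit step is an assumption about how "step is not supported" is enforced.

```python
def live_slice(valid_rows, start, stop=None):
    """Resolve range()-style bounds against the live (non-deleted) rows."""
    live = [pos for pos, ok in enumerate(valid_rows) if ok]
    if isinstance(start, slice):
        if start.step is not None:
            raise ValueError("step is not supported")
        bounds = start
    elif stop is None:
        bounds = slice(start)  # single argument acts as stop, like range(stop)
    else:
        bounds = slice(start, stop)
    return live[bounds]


valid_rows = [True, False, True, True, False, True]  # live rows: 0, 2, 3, 5
first_two = live_slice(valid_rows, 2)
middle = live_slice(valid_rows, 1, 3)
last_two = live_slice(valid_rows, slice(-2, None))
```

Logical positions skip deleted rows, so the slice 1:3 selects physical rows 2 and 3 in this example.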

sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]

Return a copy of the table sorted by one or more columns.

Parameters:
  • cols

    Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:

    t.sort_by("trip.begin.lon")
    t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
    

  • ascending – Sort direction. A single bool applies to all keys; a list must have the same length as cols.

  • inplace – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.

Raises:
  • ValueError – If called on a view or a read-only table when inplace=True.

  • KeyError – If any column name is not found.

  • TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).

tail(N: int = 5) CTable[source]

Return a view of the last N live rows (default 5).

to_arrow()[source]

Convert all live rows to a pyarrow.Table.

to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a directory-backed store.

Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For a persistent, non-view .b2z table opened read-only, and with compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.

For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.

Examples

Fast-unpack an existing compact zip store into a directory-backed table:

table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()

Materialize a filtered view into a directory-backed store:

view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)

Force a logical compacted copy, even for a persistent .b2z table:

table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]

Write this table to a compact .b2z container.

.b2z is the compact zip-backed CTable format. For a persistent, non-view directory-backed table, and with compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.

For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.

Examples

Fast-pack an existing directory-backed table into a compact zip store:

table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()

Materialize a filtered view into a new compact store:

view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)

Force a logical compacted copy, even for a persistent .b2d table:

table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
to_csv(path: str | None = None, *, header: bool = True, sep: str = ',') str | None[source]

Write all live rows to CSV.

Uses Python’s stdlib csv module — no extra dependency required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").

Parameters:
  • path – Destination file path (created or overwritten). If None (the default), nothing is written and the CSV is returned as a string, like pandas DataFrame.to_csv().

  • header – If True (default), write column names as the first row.

  • sep – Field delimiter. Defaults to ","; use "\t" for TSV.

Returns:

The CSV text when path is None, otherwise None.

Return type:

str or None
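
The JSON-array encoding of ndarray cells can be reproduced with the stdlib csv and json modules. This sketch is not the library's writer: rows_to_csv is a hypothetical helper, Python lists stand in for NumPy arrays, and the line terminator is set to "\n" purely to keep the example output short.

```python
import csv
import io
import json


def rows_to_csv(col_names, rows, *, header=True, sep=","):
    """Serialise rows to CSV text, writing array-like cells as JSON arrays."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    if header:
        writer.writerow(col_names)
    for row in rows:
        writer.writerow(
            [json.dumps(cell) if isinstance(cell, (list, tuple)) else cell for cell in row]
        )
    return buf.getvalue()


text = rows_to_csv(
    ["id", "embedding"],
    [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0, 6.0])],
)

# Reading back: the csv module unquotes the cell, json.loads restores the list.
parsed = list(csv.reader(io.StringIO(text)))
first_embedding = json.loads(parsed[1][1])
```

Because the JSON text contains commas, the csv module quotes the whole cell, which is what keeps the array in a single field.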

to_pandas()[source]

Convert to a pandas DataFrame.

Scalar columns become regular DataFrame columns. Fixed-shape ndarray columns become object-dtype columns whose cells hold NumPy arrays of per-row shape item_shape.

Return type:

pandas.DataFrame

Examples

>>> import blosc2
>>> from dataclasses import dataclass
>>> import numpy as np
>>> @dataclass
... class Row:
...     id: int = blosc2.field(blosc2.int64())
...     embedding: object = blosc2.field(blosc2.ndarray((3,), dtype=blosc2.float32()))
>>> t = blosc2.CTable(Row, new_data=[
...     (1, np.array([1, 2, 3], dtype=np.float32)),
...     (2, np.array([4, 5, 6], dtype=np.float32)),
... ])
>>> df = t.to_pandas()
>>> df["id"].tolist()
[1, 2]
>>> df["embedding"].dtype
dtype('O')
>>> np.testing.assert_array_equal(df["embedding"][0], np.array([1, 2, 3], dtype=np.float32))
to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]

Write this table to a Parquet file batch-wise using pyarrow.

to_string(*, max_rows: int | None = None, max_width: int | None = None, show_dimensions: bool | str = False, display_index: bool | None = None, index_name: str = '') str[source]

Return a tabular string representation of the table.

By default (max_rows=None, max_width=None) this renders the whole table — every row and every column — like pandas DataFrame.to_string(). This is independent of the global blosc2.set_printoptions(); those only affect the truncated str/repr/print view.

Parameters:
  • max_rows – Maximum number of rows before truncating to a compact head/tail view. None (default) shows all rows; -1 also means all, 0 shows none, a positive int caps it.

  • max_width – Character budget for column fitting. None (default) or -1 shows all columns; a positive int truncates the middle ones with ... to fit.

  • show_dimensions – Whether to append a [N rows x M columns] footer. False (default) omits it, matching pandas to_string(); True always shows it; "truncate" shows it only when the view is truncated (the behaviour of str/repr).

  • display_index – Whether to include a pandas-like logical row index column. If None (default), use the global value configured with blosc2.set_printoptions().

  • index_name – Optional label for the displayed index column.

trim_capacity() None[source]

Shrink fixed-width physical storage to the last live row position.

This removes unused append capacity while preserving holes left by deletes before the last live row. List and variable-length scalar columns already grow to their logical length and are left untouched.
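
The trimming rule can be sketched on a validity mask alone. The helper trimmed_length is hypothetical; it computes the physical length that remains once everything after the last live row is dropped, leaving earlier holes in place.

```python
def trimmed_length(valid_rows):
    """Physical length after trimming to the last live row position."""
    for pos in range(len(valid_rows) - 1, -1, -1):
        if valid_rows[pos]:
            return pos + 1
    return 0  # no live rows at all


# Row 1 was deleted; positions 3..5 are unused append capacity.
valid_rows = [True, False, True, False, False, False]
new_length = trimmed_length(valid_rows)
trimmed_mask = valid_rows[:new_length]
```

The hole at position 1 survives the trim; only the tail after the last live row is removed.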

view(new_valid_rows)[source]

Return a row-filter view backed by a boolean mask array without copying data.

property vlmeta

Variable-length metadata attached to this table.

Returns a mapping-like proxy that supports item access, iteration, and the [:] bulk getter. Values are serialised via msgpack, so all standard types (int, float, str, bool, list, dict) are supported. The metadata is stored separately from the internal schema metadata and persists through close() / reopen for disk-backed tables.

Examples

>>> import blosc2
>>> import dataclasses
>>> @dataclasses.dataclass
... class Row:
...     x: int = 0
>>> t = blosc2.CTable(Row)
>>> t.vlmeta["author"] = "Alice"
>>> t.vlmeta["tags"] = ["alpha", "beta"]
>>> t.vlmeta["count"] = 42
>>> print(t.vlmeta["author"])
Alice
>>> print(t.vlmeta[:])
{'author': 'Alice', 'tags': ['alpha', 'beta'], 'count': 42}
>>> del t.vlmeta["count"]
>>> for name in t.vlmeta:
...     print(name, t.vlmeta[name])
...
author Alice
tags ['alpha', 'beta']
class blosc2.Column(table: CTable, col_name: str, mask=None)[source]

Column view for a CTable, with vectorized operations and reductions.

Attributes:
dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

info

Get information about this column.

info_items

Structured summary items used by info.

is_computed

True if this column is a virtual computed column (read-only).

is_dictionary

True if this column is a dictionary-encoded string column.

is_generated

True if this column is a stored generated/materialized column.

is_list
is_ndarray

True if this column stores fixed-shape N-D array values per row.

is_stale

True if this generated column needs to be refreshed before use.

is_varlen_scalar

True if this column holds variable-length scalar strings or bytes.

item_ndim

Number of per-row item dimensions.

item_shape

Per-row item shape; () for scalar columns.

item_size

Number of scalar values stored in each row item.

ndim

Number of logical dimensions.

null_value

The sentinel value that represents NULL for this column, or None.

raw

The underlying storage container for this column, without null-value processing.

row_transformer

Build row-wise projections/reductions for generated columns.

shape

Logical shape of the live column values.

size

Number of live scalar values in the logical column array.

view

Return a ColumnViewIndexer for creating logical sub-views.

Methods

assign(data)

Replace all live values in this column with data.

is_null()

Return a boolean array True where the live value is the null sentinel.

isin(values)

Return a boolean array True where the live value is in values.

iter_chunks([size])

Iterate over live column values in chunks of size rows.

norm([ord, axis, where])

Vector/matrix norm of a fixed-shape ndarray column.

notnull()

Return a boolean array True where the live value is not the null sentinel.

null_count()

Return the number of live rows whose value equals the null sentinel.

read_stale([key])

Read stored values even when this generated column is marked stale.

summary()

Return and print a compact summary for this column.

unique()

Return sorted array of unique live, non-null values.

value_counts()

Return a {value: count} dict sorted by count descending.

assign(data) None[source]

Replace all live values in this column with data.

Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.

Parameters:

data – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.

Raises:
  • ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.

  • TypeError – If values cannot be coerced to the column’s dtype.

property dtype

NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).

property info: _CTableInfoReporter

Get information about this column.

The report includes both logical/live-row details and, when available, the physical storage details used internally by lazy predicates.

Examples

>>> print(t["score"].info)
>>> t["score"].info()
property info_items: list[tuple[str, object]]

Structured summary items used by info.

property is_computed: bool

True if this column is a virtual computed column (read-only).

property is_dictionary: bool

True if this column is a dictionary-encoded string column.

property is_generated: bool

True if this column is a stored generated/materialized column.

property is_ndarray: bool

True if this column stores fixed-shape N-D array values per row.

is_null() ndarray[source]

Return a boolean array True where the live value is the null sentinel.

For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).
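
The three null conventions mentioned above (sentinel values, native None for varlen scalars, and a null code for dictionary columns) can be sketched with one hypothetical helper. The NaN branch is an assumption added here because NaN never compares equal to itself; the source does not describe how NaN sentinels are matched.

```python
import math


def is_null_mask(values, null_value=None, *, varlen=False):
    """Boolean mask that is True where a value counts as null."""
    if varlen:
        return [v is None for v in values]  # varlen scalars use native None
    if null_value is None:
        return [False] * len(values)  # no sentinel configured
    if isinstance(null_value, float) and math.isnan(null_value):
        # Assumption: NaN sentinels are matched with isnan, not ==.
        return [isinstance(v, float) and math.isnan(v) for v in values]
    return [v == null_value for v in values]


int_mask = is_null_mask([10, -999, 30], null_value=-999)
nan_mask = is_null_mask([1.5, float("nan")], null_value=float("nan"))
varlen_mask = is_null_mask(["a", None, "c"], varlen=True)
code_mask = is_null_mask([0, -1, 2], null_value=-1)  # dictionary codes, null_code -1
null_count = sum(int_mask)
```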

property is_stale: bool

True if this generated column needs to be refreshed before use.

property is_varlen_scalar: bool

True if this column holds variable-length scalar strings or bytes.

isin(values) ndarray[source]

Return a boolean array True where the live value is in values.

For dictionary columns this performs efficient integer-code membership testing (no decoding of all values). Values absent from the dictionary are treated as not-present.

For non-dictionary columns this decodes all live values and tests membership in a set.
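
For the dictionary case, the idea of testing membership on integer codes rather than decoded values can be sketched as follows. The helper isin_dictionary is hypothetical, and the data layout (a list of unique values plus one code per row, with -1 as the null code) is a simplified stand-in for a dictionary-encoded column.

```python
def isin_dictionary(dictionary, codes, values):
    """Membership test on integer codes, without decoding the rows."""
    code_of = {value: code for code, value in enumerate(dictionary)}
    # Values absent from the dictionary simply contribute no code.
    wanted = {code_of[v] for v in values if v in code_of}
    return [code in wanted for code in codes]


dictionary = ["ES", "FR", "US"]
codes = [0, 2, 2, 1, -1]  # one code per row; -1 marks a null
mask = isin_dictionary(dictionary, codes, ["US", "DE"])
```

Only the requested values are translated to codes once; each row is then a cheap integer lookup.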

property item_ndim: int

Number of per-row item dimensions.

property item_shape: tuple[int, ...]

Per-row item shape; () for scalar columns.

property item_size: int

Number of scalar values stored in each row item.

iter_chunks(size: int = 65536)[source]

Iterate over live column values in chunks of size rows.

Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.

Parameters:

size – Number of live rows per yielded chunk. Defaults to 65536.

Yields:

numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.

Examples

>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
property ndim: int

Number of logical dimensions.

norm(ord=None, axis=None, *, where=None)[source]

Vector/matrix norm of a fixed-shape ndarray column.

The column is treated as a logical array of shape (nrows, *item_shape). For example, axis=1 computes one norm per row for a 1-D item shape.
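
As a worked example of the axis=1 case, the sketch below computes one norm per row for a column whose item shape is (2,). It assumes the default is the Euclidean (2-)norm, as in numpy.linalg.norm for vectors; the source does not state the default. The helper row_norms is hypothetical and uses nested lists in place of an ndarray column.

```python
import math


def row_norms(rows):
    """Euclidean norm of each row of a (nrows, n) nested list."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]


# Logical shape (3, 2): three rows, item_shape (2,)
column = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
norms = row_norms(column)
```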

notnull() ndarray[source]

Return a boolean array True where the live value is not the null sentinel.

null_count() int[source]

Return the number of live rows whose value equals the null sentinel.

Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.

property null_value

The sentinel value that represents NULL for this column, or None.

property raw

The underlying storage container for this column, without null-value processing.

Returns the raw blosc2.NDArray, ListArray, DictionaryColumn, or scalar varlen array directly. Unlike __getitem__(), which always materializes NumPy arrays, this is the column as a blosc2-native compressed object: usable as a lazy-expression operand without decompressing, and exposing storage details such as schunk, chunks, cparams or iterchunks_info().

This is a physical view of the column: fixed-width containers are over-allocated to chunk capacity for appends, so their first axis is longer than len(column) and positions of rows deleted from the table still hold their old values. No validity-mask or null-sentinel processing is applied; use the Column interface for logical reads.

Raises AttributeError for computed (virtual) columns, which have no backing storage.

read_stale(key=slice(None, None, None))[source]

Read stored values even when this generated column is marked stale.

This is an explicit escape hatch for inspecting the last materialized values. Normal reads raise for stale generated columns so outdated values are not used accidentally.

property row_transformer: RowTransformer

Build row-wise projections/reductions for generated columns.

property shape: tuple[int, ...]

Logical shape of the live column values.

property size: int

Number of live scalar values in the logical column array.

summary() str[source]

Return and print a compact summary for this column.

For fixed-shape ndarray columns this includes logical shape, storage, and row-norm statistics when numeric. Scalar columns fall back to info.

unique() ndarray[source]

Return sorted array of unique live, non-null values.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

value_counts() dict[source]

Return a {value: count} dict sorted by count descending.

Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.

Example

>>> t["active"].value_counts()
{True: 8432, False: 1568}
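
The chunked counting described above can be sketched with collections.Counter. This is not the library's code: value_counts and iter_value_chunks are hypothetical helpers, and a plain list sliced into chunks stands in for the compressed column.

```python
from collections import Counter


def iter_value_chunks(values, size):
    for i in range(0, len(values), size):
        yield values[i:i + size]


def value_counts(values, *, null_value=None, size=65536):
    """Count values chunk by chunk, skipping the null sentinel."""
    counts = Counter()
    for chunk in iter_value_chunks(values, size):
        counts.update(v for v in chunk if v != null_value)
    # most_common() orders the pairs by count, highest first.
    return dict(counts.most_common())


values = ["b", "a", "a", "", "a", "b"]
counts = value_counts(values, null_value="", size=2)
```

Only one chunk is held at a time, and the running Counter carries the totals across chunks.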
property view: ColumnViewIndexer

Return a ColumnViewIndexer for creating logical sub-views.

Examples

Read a sub-view for chained aggregates:

sub = t.price.view[2:10]
sub.sum()

Bulk write through a sub-view:

t.price.view[0:5][:] = np.zeros(5)
class blosc2.NestedColumn(table: CTable, prefix: str)[source]

A read-only accessor for a nested (dotted) group of CTable columns.

Returned by attribute access on a CTable (or on another NestedColumn) when the name refers to an internal node of the dotted column tree rather than a leaf. For a table flattened from a struct/list<struct> schema, t.trip is a NestedColumn grouping every leaf under the "trip." prefix, while a leaf such as t.trip.sec (or t.trip.begin.lon) is a Column. Drilling into an intermediate node (e.g. t.trip.begin) yields another NestedColumn.

Exposes aggregate metadata over its descendant leaf columns (col_names, nrows, ncols, nbytes, cbytes, cratio) and an info report.

Examples

>>> t.trip
<NestedColumn 'trip'>
>>> t.trip.col_names
['sec', 'km', 'begin.lon', ...]
>>> t.trip.sec                  # a leaf -> Column
Attributes:
cbytes

Compressed size in bytes for stored descendant columns.

col_names

Descendant leaf column names relative to this nested prefix.

cratio

Compression ratio for stored descendant columns.

info

Get information about this nested column namespace.

info_items

Structured summary items used by info.

nbytes

Uncompressed size in bytes for stored descendant columns.

ncols

Number of descendant leaf columns in this nested namespace.

nrows

Number of logical rows in this nested namespace.

property cbytes: int

Compressed size in bytes for stored descendant columns.

property col_names: list[str]

Descendant leaf column names relative to this nested prefix.

property cratio: float

Compression ratio for stored descendant columns.

property info: _CTableInfoReporter

Get information about this nested column namespace.

Examples

>>> print(t.trip.info)
>>> t.trip.info()
property info_items: list[tuple[str, object]]

Structured summary items used by info.

property nbytes: int

Uncompressed size in bytes for stored descendant columns.

property ncols: int

Number of descendant leaf columns in this nested namespace.

property nrows: int

Number of logical rows in this nested namespace.
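As a rough illustration of how the size properties relate, the sketch below aggregates per-leaf sizes in plain Python. It assumes that cratio is the total uncompressed size divided by the total compressed size, which is the usual convention but is not confirmed here for NestedColumn; the sizes in leaf_sizes are made up.

```python
# Hypothetical per-leaf sizes in bytes: relative name -> (nbytes, cbytes).
leaf_sizes = {
    "sec": (80_000, 10_000),
    "km": (80_000, 20_000),
    "begin.lon": (80_000, 10_000),
}

# Aggregate over every descendant leaf.
nbytes = sum(n for n, _ in leaf_sizes.values())
cbytes = sum(c for _, c in leaf_sizes.values())

# Assumed definition: ratio of the totals, not the mean of per-leaf ratios.
cratio = nbytes / cbytes if cbytes else float("inf")
ncols = len(leaf_sizes)
```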

blosc2.get_printoptions() → dict[str, Any]

Return a copy of the global CTable display options.

blosc2.printoptions(**kwargs: Any)

Temporarily set CTable display options, restored on exit.

Accepts the same keyword arguments as set_printoptions(). Handy for a one-off full dump, e.g.:

with blosc2.printoptions(display_rows=-1, display_width=-1):
    print(ctable)
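The "restored on exit" behaviour follows the usual context-manager pattern. The sketch below shows that pattern in plain Python, assuming a module-level option dictionary; _options, get_options and temporary_options are illustrative stand-ins, not the blosc2 implementation.

```python
from contextlib import contextmanager

# Stand-in for a library's global option store (names are illustrative).
_options = {"display_rows": 10, "display_width": None}


def get_options():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_options)


@contextmanager
def temporary_options(**kwargs):
    """Apply `kwargs` for the duration of a with-block, then restore."""
    saved = get_options()
    _options.update(kwargs)
    try:
        yield
    finally:
        # Restore the previous values even if the body raised.
        _options.clear()
        _options.update(saved)


with temporary_options(display_rows=-1):
    inside = get_options()["display_rows"]   # temporary value is active here
after = get_options()["display_rows"]        # previous value is back
```

Restoring in a finally clause is what keeps a failed print from leaving the global options changed.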
blosc2.set_printoptions(*, display_index: bool | None = None, display_rows: int | None = None, display_width: int | None = <object object>, display_precision: int | None = None, fancy: bool | None = None) → None

Set global display options for CTable string representations.

These options affect str(ctable)/repr(ctable)/print(ctable) (the interactive, truncated view). They do not affect CTable.to_string(), which renders everything by default.

Parameters:
  • display_index – Whether the display should include a pandas-like logical row index column. None leaves the current setting unchanged.

  • display_rows – Maximum number of rows shown before truncating to a compact head/tail view (the first five and last five rows, when possible). -1 shows all rows, 0 shows none. None leaves the current setting unchanged.

  • display_width – Character budget used to decide how many columns fit before truncating the middle ones with .... None (the default) auto-detects the terminal width, -1 shows all columns, a positive int sets a fixed budget. Omit the argument to leave the current setting unchanged.

  • display_precision – Number of digits after the decimal point for floating-point values in table displays. Trailing zeros are trimmed. None leaves the current setting unchanged.

  • fancy – Whether to use the more decorated table display, including separator rules and a detailed footer. False (default) uses a simpler pandas-like footer such as [726017 rows x 5 columns] and omits separator rules. None leaves the current setting unchanged.