Miscellaneous¶
This page documents the miscellaneous members of the blosc2 module that do not fit into other categories.
- blosc2.cpu_info = {'count': 4, 'l1_data_cache_size': 32768, 'l2_cache_size': 524288, 'l3_cache_size': 33554432}¶
- class blosc2.finfo(dtype)¶
Machine limits for floating point types.
- bits¶
The number of bits occupied by the type.
- Type:
int
- dtype¶
Returns the dtype for which finfo returns information. For complex input, the returned dtype is the associated float* dtype for its real and imaginary components.
- Type:
dtype
- eps¶
The difference between 1.0 and the smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.
- Type:
float
- epsneg¶
The difference between 1.0 and the largest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.
- Type:
float
- iexp¶
The number of bits in the exponent portion of the floating point representation.
- Type:
int
- machep¶
The exponent that yields eps.
- Type:
int
- max¶
The largest representable number.
- Type:
floating point number of the appropriate type
- maxexp¶
The smallest positive power of the base (2) that causes overflow. Corresponds to the C standard MAX_EXP.
- Type:
int
- min¶
The smallest representable number, typically -max.
- Type:
floating point number of the appropriate type
- minexp¶
The most negative power of the base (2) consistent with there being no leading 0’s in the mantissa. Corresponds to the C standard MIN_EXP - 1.
- Type:
int
- negep¶
The exponent that yields epsneg.
- Type:
int
- nexp¶
The number of bits in the exponent including its sign and bias.
- Type:
int
- nmant¶
The number of explicit bits in the mantissa (excluding the implicit leading bit for normalized numbers).
- Type:
int
- precision¶
The approximate number of decimal digits to which this kind of float is precise.
- Type:
int
- resolution¶
The approximate decimal resolution of this type, i.e., 10**-precision.
- Type:
floating point number of the appropriate type
- tiny¶
An alias for smallest_normal, kept for backwards compatibility.
- Type:
float
- smallest_normal¶
The smallest positive floating point number with 1 as leading bit in the mantissa following IEEE-754 (see Notes).
- Type:
float
- smallest_subnormal¶
The smallest positive floating point number with 0 as leading bit in the mantissa following IEEE-754.
- Type:
float
- Parameters:
dtype¶ (float, dtype, or instance) – Kind of floating point or complex floating point data-type about which to get information.
See also
Notes
For developers of NumPy: do not instantiate this at the module level. The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.
Note that smallest_normal is not actually the smallest positive representable value in a NumPy floating point type. As in the IEEE-754 standard [1], NumPy floating point types make use of subnormal numbers to fill the gap between 0 and smallest_normal. However, subnormal numbers may have significantly reduced precision [2].
For longdouble, the representation varies across platforms. On most platforms it is IEEE 754 binary128 (quad precision) or binary64-extended (80-bit extended precision). On PowerPC systems, it may use the IBM double-double format (a pair of float64 values), which has special characteristics for precision and range.
This function can also be used for complex data types. In that case, the output is the same as for the corresponding real float type (e.g. numpy.finfo(numpy.csingle) is the same as numpy.finfo(numpy.single)), and it applies to both the real and imaginary components.
References
[1] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1-70, 2008, https://doi.org/10.1109/IEEESTD.2008.4610935
[2] Wikipedia, “Denormal Numbers”, https://en.wikipedia.org/wiki/Denormal_number
Examples
>>> import numpy as np
>>> np.finfo(np.float64).dtype
dtype('float64')
>>> np.finfo(np.complex64).dtype
dtype('float32')
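The relationships between the attributes above can be checked directly. The following is a minimal sketch that uses numpy.finfo, on the assumption (suggested by the NumPy-based examples, but not confirmed here) that blosc2.finfo reports the same values.

```python
import numpy as np

# Assumption: blosc2.finfo mirrors numpy.finfo, so NumPy is used here.
fi = np.finfo(np.float64)

# eps is the gap between 1.0 and the next float above it; machep is its exponent.
assert fi.eps == 2.0 ** fi.machep == np.nextafter(1.0, 2.0) - 1.0

# epsneg is the gap between 1.0 and the next float below it; negep is its exponent.
assert fi.epsneg == 2.0 ** fi.negep == 1.0 - np.nextafter(1.0, 0.0)

# Subnormal numbers fill the gap between 0 and smallest_normal.
assert 0.0 < fi.smallest_subnormal < fi.smallest_normal

# tiny is kept as an alias of smallest_normal.
assert fi.tiny == fi.smallest_normal

# For complex input, the limits are those of the underlying float type.
assert np.finfo(np.complex128).dtype == np.dtype(np.float64)
```

Each assertion restates one sentence of the attribute descriptions, so a failure would point at a platform where the float type deviates from IEEE-754 binary64.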
- Attributes:
- epsneg
- iexp
- machep
- negep
- nexp
- resolution
- tiny
Return the value for tiny, alias of smallest_normal.
- Returns:
tiny (float) – Value for the smallest normal, alias of smallest_normal.
- Warns:
UserWarning – If the calculated value for the smallest normal is requested for double-double.
- class blosc2.iinfo(type)¶
Machine limits for integer types.
- bits¶
The number of bits occupied by the type.
- Type:
int
- dtype¶
Returns the dtype for which iinfo returns information.
- Type:
dtype
- min¶
The smallest integer expressible by the type.
- Type:
int
- max¶
The largest integer expressible by the type.
- Type:
int
- Parameters:
int_type¶ (integer type, dtype, or instance) – The kind of integer data type to get information about.
See also
finfo – The equivalent for floating point data types.
Examples
With types:
>>> import numpy as np
>>> ii16 = np.iinfo(np.int16)
>>> ii16.min
-32768
>>> ii16.max
32767
>>> ii32 = np.iinfo(np.int32)
>>> ii32.min
-2147483648
>>> ii32.max
2147483647
With instances:
>>> ii32 = np.iinfo(np.int32(10))
>>> ii32.min
-2147483648
>>> ii32.max
2147483647
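A common use of these limits is checking that a value fits before choosing or casting to an integer type. The sketch below uses numpy.iinfo, assuming blosc2.iinfo behaves the same way; `fits` and `smallest_signed_dtype` are hypothetical helpers written for this illustration.

```python
import numpy as np

def fits(value: int, dtype) -> bool:
    """Return True if the Python int `value` is representable in integer `dtype`."""
    info = np.iinfo(dtype)
    return info.min <= value <= info.max

def smallest_signed_dtype(value: int) -> np.dtype:
    """Pick the narrowest signed integer dtype that can hold `value`."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        if fits(value, candidate):
            return np.dtype(candidate)
    raise OverflowError(f"{value} does not fit in a signed 64-bit integer")

# 32767 is the int16 maximum, so one more needs int32.
chosen = smallest_signed_dtype(32768)
```

Comparing against min and max before casting avoids the silent wrap-around that a plain astype() would produce for out-of-range values.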
- blosc2.get_matmul_library() str | None[source]¶
Return the library used by the active matmul fast backend, if any.
- Returns:
"Accelerate.framework" when the selected backend is Accelerate, the loaded CBLAS library path for runtime-discovered CBLAS backends, or None when the selected backend is naive.
- Return type:
str | None
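The three documented return values can be told apart with a small helper. This is only a sketch: `describe_matmul_backend` is a hypothetical name, and the import is guarded because the function may be missing from a given blosc2 build.

```python
def describe_matmul_backend(library):
    """Turn the documented return value (str or None) into a readable label."""
    if library is None:
        return "naive backend (no external library)"
    if library == "Accelerate.framework":
        return "Apple Accelerate"
    # Any other string is documented as the path of the loaded CBLAS library.
    return f"CBLAS loaded from {library}"

try:
    import blosc2

    # getattr guards against blosc2 builds that lack this function.
    getter = getattr(blosc2, "get_matmul_library", None)
    if getter is not None:
        print(describe_matmul_backend(getter()))
except ImportError:
    pass  # blosc2 is not installed; the helper above still works on its own
```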
Unclassified module members¶
The list below is intentionally generated from blosc2 module members that
are not excluded above. It acts as a reminder to classify newly documented
public objects into the appropriate reference section.
- class blosc2.CTable(row_type: type[RowT], new_data=None, *, urlpath: str | None = None, mode: str = 'a', expected_size: int | None = None, compact: bool = False, validate: bool = True, cparams: dict[str, Any] | None = None, dparams: dict[str, Any] | None = None, create_summary_index: bool = True)[source]¶
Columnar compressed table with typed columns and row-oriented access.
- Attributes:
blocks – Block shape shared by the table’s aligned fixed-size columns.
cbytes – Total compressed size in bytes (all columns + valid_rows mask).
chunks – Chunk shape shared by the table’s aligned fixed-size columns.
computed_columns – Read-only view of the computed-column definitions.
cratio – Compression ratio for the whole table payload.
indexes – Return a list of blosc2.Index handles for all active indexes.
info – Get information about this table.
info_items – Structured summary items used by info().
nbytes – Total uncompressed size in bytes (all columns + valid_rows mask).
ncols – Total number of columns, including computed (virtual) columns.
nrows
schema – The compiled schema that drives this table’s columns and validation.
vlmeta – Variable-length metadata attached to this table.
Methods
add_column(name, spec) – Add a new column filled from the default declared in spec.
add_computed_column(name, expr, *[, dtype, ...]) – Add a read-only virtual column computed from stored columns.
add_generated_column(name, *, values[, ...]) – Add a stored generated column maintained by the table.
append(data) – Append a single row to the table.
close() – Close any persistent backing store held by this table.
column_schema(name) – Return the CompiledColumn descriptor for name.
compact() – Physically rewrite every column array keeping only live rows.
compact_index([col_name, expression, name]) – Compact an index, merging any incremental append runs.
cov() – Return the covariance matrix as a numpy array.
create_index([col_name, field, expression, ...]) – Build and register an index for a stored column or table expression.
delete(ind) – Mark one or more rows as deleted (tombstone deletion).
describe() – Print a per-column statistical summary.
drop_column(name) – Remove a column from the table.
drop_computed_column(name) – Remove a computed column from the table.
drop_index([col_name, expression, name]) – Remove an index and delete any sidecar files.
extend(data, *[, validate]) – Append multiple rows at once.
from_arrow(schema, batches, *[, urlpath, ...]) – Build a CTable from an Arrow schema and iterable of record batches.
from_csv(path, row_cls, *[, header, sep]) – Build a CTable from a CSV file.
from_pandas(df, row_cls) – Build a CTable from a pandas DataFrame.
from_parquet(path, *[, columns, batch_size, ...]) – Read a Parquet file into a CTable.
group_by(keys, *[, sort, dropna, engine, ...]) – Return a deferred group-by object for this table.
head([N]) – Return a view of the first N live rows (default 5).
index([col_name, expression, name]) – Return the index handle for a stored-column or expression target.
iter_arrow_batches(*[, columns, batch_size, ...]) – Yield live rows as bounded-size pyarrow.RecordBatch objects.
iter_sorted(cols[, ascending, start, stop, ...]) – Iterate rows in sorted order without materializing a full copy.
materialize_computed_column(name, *[, ...]) – Materialize a computed column into a new stored snapshot column.
rebuild_index([col_name, expression, name]) – Drop and recreate an index with the same parameters.
refresh_generated_column(name) – Recompute a stored generated/materialized column from its source columns.
refresh_generated_columns(*[, source]) – Refresh all generated columns, optionally only those depending on source.
rename_column(old, new) – Rename a column.
sample(n, *[, seed]) – Return a read-only view of n randomly chosen live rows.
Return a JSON-compatible dict describing this table's schema.
select(cols) – Return a column-projection view exposing only cols.
slice(start[, stop, copy]) – Return a contiguous range of live (non-deleted) rows.
sort_by(cols[, ascending, inplace]) – Return a copy of the table sorted by one or more columns.
tail([N]) – Return a view of the last N live rows (default 5).
to_arrow() – Convert all live rows to a pyarrow.Table.
to_b2d(urlpath, *[, overwrite, compact]) – Write this table to a directory-backed store.
to_b2z(urlpath, *[, overwrite, compact]) – Write this table to a compact .b2z container.
to_csv([path, header, sep]) – Write all live rows to CSV.
Convert to a pandas DataFrame.
to_parquet(path, *[, columns, batch_size, ...]) – Write this table to a Parquet file batch-wise using pyarrow.
to_string(*[, max_rows, max_width, ...]) – Return a tabular string representation of the table.
Shrink fixed-width physical storage to the last live row position.
view(new_valid_rows) – Return a row-filter view backed by a boolean mask array without copying data.
- add_column(name: str, spec: SchemaSpec | Field) None[source]¶
Add a new column filled from the default declared in spec.
- Parameters:
name¶ – Column name. Must follow the same naming rules as schema fields.
spec¶ – A schema descriptor such as b2.int64(ge=0) or a field descriptor such as b2.field(b2.int64(ge=0), default=0). When the table already has live rows, use blosc2.field(...) with a default declared so those rows can be backfilled.
- Raises:
ValueError – If the table is read-only, is a view, the column already exists, or a non-empty table is given a column with no default declared.
TypeError – If a declared default cannot be coerced to spec’s dtype.
- add_computed_column(name: str, expr: str | LazyExpr | DSLKernel | Callable[[dict[str, Any]], LazyExpr], *, dtype: dtype | None = None, inputs: list[str] | None = None) None[source]¶
Add a read-only virtual column computed from stored columns.
A computed column has no physical storage. It is backed by a blosc2.LazyExpr and is evaluated when values are read, filtered, displayed, exported, or aggregated. Because it is virtual, it is read-only, cannot be indexed directly, and is not supplied in append()/extend() inputs. To store and optionally index a computed result, use add_generated_column() or materialize an existing computed column with materialize_computed_column().
Supported signatures are:
    add_computed_column(name, "price * qty")
    add_computed_column(name, lazy_expr)
    add_computed_column(name, dsl_kernel, inputs=["price", "qty"])
    add_computed_column(name, blosc2.lazyudf(dsl_kernel, (t.price, t.qty)))
    add_computed_column(name, lambda cols: cols["price"] * cols["qty"])
- Parameters:
name¶ – Name of the virtual computed column. It must be a valid column name and must not collide with an existing stored or computed column.
expr¶ – Definition of the virtual column. Accepted forms:
- str: scalar expression over stored scalar columns, e.g. "price * qty".
- blosc2.LazyExpr: lazy expression over stored columns of this table.
- blosc2.dsl_kernel()-decorated kernel passed directly with inputs=[...]: one stored scalar column name per kernel parameter, bound positionally. The kernel may use loops, if/else and where(...). Its source is persisted and recompiled on open; the column stays virtual/unstored.
- blosc2.LazyUDF built from a blosc2.dsl_kernel() via blosc2.lazyudf(): column bindings are inferred by identity from the operands, so inputs= is not needed. Accepted forms include blosc2.lazyudf(kernel, (t.col1, t.col2)) (using Column accessors) or the raw NDArray equivalents.
- callable: called as expr(self._cols) and must return a blosc2.LazyExpr or a blosc2.LazyUDF backed by a blosc2.dsl_kernel().
DSL columns (last three forms) are persisted (their source is stored and recompiled on open) and may be referenced inside where() predicates.
Expressions must depend only on stored columns of this table; computed columns cannot depend on other computed columns in this version. Fixed-shape ndarray columns are not accepted in computed column expressions yet. For row-wise ndarray projections or reductions, use add_generated_column() with values=t.ndarray_col.row_transformer....
dtype¶ – Optional dtype override for the computed values. For expression forms it is inferred from the resulting blosc2.LazyExpr when omitted. For DSL forms, an omitted dtype is inferred by NumPy type promotion of the input column dtypes (correct for elementwise arithmetic kernels); pass dtype explicitly for kernels that change the type (comparisons/where/casts) or when the kernel has no column inputs. This changes the dtype reported by the CTable column wrapper; it does not create physical storage.
inputs¶ – Only used when expr is a bare blosc2.dsl_kernel(): a list of stored scalar column names, one per kernel parameter, bound positionally (kernel parameter i ← inputs[i]). Not needed when passing a blosc2.LazyUDF or a callable; bindings are inferred from the operands in those cases.
Examples
Add a computed column from a string expression and use it like a normal read-only column:
    t.add_computed_column("total", "price * qty")
    assert t.total[:].shape == (t.nrows,)
Add a computed column from a callable. The callable receives the table’s stored column mapping:
    t.add_computed_column(
        "price_with_tax",
        lambda cols: cols["price"] * 1.21,
        dtype=np.float64,
    )
Callable expressions can use normal Python logic while still returning a lazy expression:
    def total_expr(cols):
        base = cols["price"] * cols["qty"]
        return base * 1.21 if include_tax else base

    t.add_computed_column("total", total_expr)
They are also convenient for reusable, parameterized helpers:
    def ratio(num, den):
        return lambda cols: cols[num] / cols[den]

    t.add_computed_column("margin", ratio("profit", "revenue"))
Computed columns participate in filters and aggregates:
    expensive = t.where(t.total > 100)
    total_revenue = t.total.sum()
Computed columns are virtual and read-only and cannot be indexed. If you need to filter or sort by this value frequently, use a generated column instead; it is physically stored and can be indexed:
    t.add_generated_column(
        "total_stored",
        values="price * qty",
        dtype=blosc2.float64(),
        create_index=True,
    )
Or convert an existing computed column to a stored snapshot:
    t.materialize_computed_column("total", new_name="total_stored")
    t.create_index("total_stored")
- Raises:
ValueError – If called on a view or read-only table, if name already exists, or if an expression operand does not reference a stored column of this table.
TypeError – If expr has an unsupported form, does not produce a blosc2.LazyExpr, references unsupported source columns, or if a RowTransformer is passed. Row transformers are only accepted by add_generated_column().
- add_generated_column(name: str, *, values: str | LazyExpr | DSLKernel | Callable[[dict[str, Any]], LazyExpr] | RowTransformer, dtype=None, create_index: bool = False, inputs: list[str] | None = None) None[source]¶
Add a stored generated column maintained by the table.
A generated column is physical storage, not a virtual expression. The initial values are computed for all current live rows, and later append()/extend() calls automatically compute values for newly inserted rows when source columns are provided. If a source column is modified in-place, dependent generated columns are marked stale; call refresh_generated_column() or refresh_generated_columns() to recompute them.
Supported signatures are:
    add_generated_column(name, *, values="price * qty", dtype=..., create_index=False)
    add_generated_column(name, *, values=lazy_expr, dtype=...)
    add_generated_column(name, *, values=dsl_kernel, inputs=["price", "qty"], dtype=...)
    add_generated_column(name, *, values=blosc2.lazyudf(dsl_kernel, (t.price, t.qty)))
    add_generated_column(name, *, values=lambda cols: cols["price"] * 1.21, dtype=...)
    add_generated_column(name, *, values=t.embedding.row_transformer.norm(axis=0), dtype=...)
    add_generated_column(name, *, values=t.image.row_transformer.mean(axis=(0, 1)), dtype=blosc2.ndarray((3,), dtype=...))
- Parameters:
name¶ – Name of the generated column to create. It must be a valid column name and must not collide with an existing stored or computed column.
values¶ – Definition used to compute the generated values. Accepted forms:
- str: scalar expression over stored scalar columns, e.g. "price * qty". The expression must produce one scalar value per row.
- blosc2.LazyExpr: scalar lazy expression over stored columns of this table. It must produce a 1-D scalar stream.
- blosc2.dsl_kernel()-decorated kernel passed directly with inputs=[...]: one stored scalar column name per kernel parameter, bound positionally. Produces one scalar per row. The kernel source is persisted and recompiled on open; appended rows are auto-filled and refresh_generated_column() recomputes after in-place edits.
- blosc2.LazyUDF built from a blosc2.dsl_kernel() via blosc2.lazyudf(): column bindings are inferred by identity from the operands, so inputs= is not needed. Accepts Column accessors (t.col1) or raw NDArrays as operands. Same persistence and auto-fill behaviour as above.
- callable: called as values(self._cols) and must return a blosc2.LazyExpr or a blosc2.LazyUDF backed by a blosc2.dsl_kernel().
- RowTransformer: row-wise projection/reduction bound to a fixed-shape ndarray column, e.g. t.embedding.row_transformer.norm(axis=0) or t.image.row_transformer.mean(axis=(0, 1)). Row transformers may produce either one scalar per row or one fixed-shape ndarray item per row.
Expression and DSL forms currently cannot depend on computed columns and cannot directly consume fixed-shape ndarray columns; use a row-transformer for ndarray row projections/reductions.
dtype¶ – Output schema or dtype. Scalar outputs may pass a NumPy dtype or a Blosc2 scalar spec such as blosc2.float64(). Fixed-shape ndarray outputs must pass an ndarray spec such as blosc2.ndarray((3,), dtype=blosc2.float32()) unless the table has existing rows from which the output shape can be inferred. When omitted, dtype and fixed-shape output shape are inferred from the current generated values; this is not possible for an empty table.
create_index¶ – If True, create an index on the generated column immediately. Only scalar generated columns can be indexed; fixed-shape ndarray generated columns raise ValueError when indexing is requested.
inputs¶ – Only used when values is a bare blosc2.dsl_kernel(): a list of stored scalar column names, one per kernel parameter, bound positionally. Not needed when passing a blosc2.LazyUDF or a callable; bindings are inferred from the operands in those cases.
Examples
Create and index a scalar generated column from a string expression:
    t.add_generated_column(
        "total",
        values="price * qty",
        dtype=blosc2.float64(),
        create_index=True,
    )
Use a callable when normal Python composition is more convenient:
    t.add_generated_column(
        "price_with_tax",
        values=lambda cols: cols["price"] * 1.21,
        dtype=blosc2.float64(),
    )
Generate a scalar from each fixed-shape ndarray row. For row transformers, axes refer to the per-row item shape, so axis=0 is the embedding-coordinate axis for item_shape=(dim,):
    t.add_generated_column(
        "embedding_norm",
        values=t.embedding.row_transformer.norm(axis=0, ord=2),
        dtype=blosc2.float64(),
        create_index=True,
    )
Generate a fixed-shape ndarray value per row. Here an image column has item_shape=(height, width, 3) and the generated column stores one RGB vector per row:
    t.add_generated_column(
        "image_mean_rgb",
        values=t.image.row_transformer.mean(axis=(0, 1)),
        dtype=blosc2.ndarray((3,), dtype=blosc2.float32()),
    )
Generated columns are maintained on append/extend:
    t.append((new_id, new_embedding, new_image))
    assert t.embedding_norm[-1] == np.linalg.norm(new_embedding)
If source values are changed in place, refresh dependent generated columns before relying on them:
    t.embedding[0] = new_embedding
    t.refresh_generated_column("embedding_norm")
- Raises:
ValueError – If called on a view or read-only table, if name already exists, if generated output length/shape is incompatible with the table, or if create_index=True is requested for an ndarray generated column.
TypeError – If values has an unsupported form, references unsupported source columns, or cannot be coerced to dtype.
KeyError – If a RowTransformer references a missing source column.
- append(data: list | void | ndarray) None[source]¶
Append a single row to the table.
data may be a list, tuple, numpy.void, or structured numpy.ndarray whose fields match the schema column order. Materialized columns whose values are omitted are auto-filled from their recorded expression. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names, the row dict may be supplied either as a flat mapping of dotted keys or as a nested dict that mirrors the original struct shape; both are accepted and automatically flattened to the physical dotted leaf names:
    # flat dotted keys
    t.append({"trip.begin.lon": -87.6, "trip.begin.lat": 41.8, "payment.fare": 12.5})
    # original nested dict (auto-flattened)
    t.append({"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}})
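The auto-flattening of nested row dicts can be pictured with a short pure-Python sketch. `flatten_row` is a hypothetical helper written for this illustration; it is not part of blosc2, and the library's own flattening logic may differ in detail.

```python
from typing import Any

def flatten_row(row: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a flat mapping keyed by dotted leaf names."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structs, extending the dotted prefix.
            flat.update(flatten_row(value, name))
        else:
            flat[name] = value
    return flat

nested = {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}}
flat = flatten_row(nested)

# A row that is already flat passes through unchanged.
assert flatten_row(flat) == flat
```

Because flattening an already-flat dict is a no-op, a single code path can accept both input shapes.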
- base: CTable | None¶
Parent table when this instance is a row-filter or column-projection view (created by where(), select(), or view()). None for top-level tables. Structural mutations such as add_column() and drop_column() are blocked on views.
- property blocks: tuple | None¶
Block shape shared by the table’s aligned fixed-size columns.
None if the table has no fixed-size scalar columns. See chunks for the matching chunk shape.
- property cbytes: int¶
Total compressed size in bytes (all columns + valid_rows mask).
- property chunks: tuple | None¶
Chunk shape shared by the table’s aligned fixed-size columns.
None if the table has no fixed-size scalar columns. See blocks for the matching block shape.
- close() None[source]¶
Close any persistent backing store held by this table.
On the first close of a writable root table, this also builds the automatic SUMMARY indexes (unless create_summary_index=False); see the create_summary_index parameter of CTable for how this interacts with in-memory vs. persistent tables.
- col_names: list[str]¶
Ordered list of stored column names. Computed columns are not included; access those via computed_columns.
- column_schema(name: str) CompiledColumn[source]¶
Return the CompiledColumn descriptor for name.
- Raises:
KeyError – If name is not a column in this table.
- compact()[source]¶
Physically rewrite every column array keeping only live rows.
Closes the gaps left by prior delete() calls by shuffling live data to the front of each column array. The underlying NDArray allocations are not resized; each column retains its original capacity. To actually reclaim memory, use copy() with compact=True instead, which allocates fresh arrays sized to the live row count. All existing indexes are dropped and must be recreated afterwards. Raises ValueError if the table is read-only or a view.
- property computed_columns: dict[str, dict]¶
Read-only view of the computed-column definitions.
Each value is a dict with keys expression, col_deps, lazy (blosc2.LazyExpr), and dtype.
- cov() ndarray[source]¶
Return the covariance matrix as a numpy array.
Only int, float, and bool columns are supported. Bool columns are cast to int (0/1) before computation. Complex columns raise TypeError.
- Returns:
Shape (ncols, ncols). Column order matches col_names.
- Return type:
numpy.ndarray
- Raises:
TypeError – If any column has an unsupported dtype (complex, string, …).
ValueError – If the table has fewer than 2 live rows (covariance undefined).
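The shape and bool-casting rules can be reproduced with plain NumPy. This is a sketch on hypothetical column data; it assumes NumPy's default normalisation (ddof=1), which the text above does not confirm for CTable.cov().

```python
import numpy as np

# Hypothetical column data; a real table would supply these arrays.
columns = {
    "price": np.array([1.0, 2.0, 4.0, 8.0]),
    "qty": np.array([3, 1, 4, 1]),
    "in_stock": np.array([True, False, True, True]),
}

# Cast bool columns to int (0/1), then stack one column per matrix row.
stacked = np.vstack([
    col.astype(np.int64) if col.dtype == np.bool_ else col
    for col in columns.values()
])

# np.cov treats each row as one variable, so the result is (ncols, ncols).
# Assumption: default ddof=1; the normalisation used by CTable is not stated.
cov = np.cov(stacked)
```

The diagonal holds each column's variance and the matrix is symmetric, with rows and columns in the same order as the input columns.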
- property cratio: float¶
Compression ratio for the whole table payload.
- delete(ind: int | slice | str | Iterable) None[source]¶
Mark one or more rows as deleted (tombstone deletion).
ind may be a logical row index (int), a slice, or an iterable of logical indices. Deleted rows are excluded from all subsequent queries and aggregates. Physical storage is not reclaimed until compact() is called. Raises ValueError if the table is read-only or a view.
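Tombstone deletion and later compaction can be illustrated with a toy single-column store built on a NumPy validity mask. `TombstoneColumn` is a hypothetical class for this sketch only; it is not how blosc2 implements CTable.

```python
import numpy as np

class TombstoneColumn:
    """Toy single-column store with a validity mask (not blosc2's implementation)."""

    def __init__(self, values):
        self.data = np.asarray(values).copy()
        self.valid = np.ones(len(self.data), dtype=bool)

    def live(self):
        """Return only the rows that have not been deleted."""
        return self.data[self.valid]

    def delete(self, logical_index: int) -> None:
        """Mark the logical_index-th live row as deleted; storage is untouched."""
        physical = np.flatnonzero(self.valid)[logical_index]
        self.valid[physical] = False

    def compact(self) -> None:
        """Move live rows to the front; capacity (len(data)) stays the same."""
        live = self.live()  # boolean indexing returns a copy
        n = len(live)
        self.data[:n] = live
        self.valid[:n] = True
        self.valid[n:] = False

col = TombstoneColumn([10, 20, 30, 40])
col.delete(1)  # removes 20
col.delete(1)  # logical index 1 now refers to 30
```

Deleting only flips mask bits, so it is cheap; the data is rewritten once, when compact() is called.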
- describe() None[source]¶
Print a per-column statistical summary.
Numeric columns (int, float): count, mean, std, min, max. Bool columns: count, true-count, true-%. String columns: count, min (lex), max (lex), n-unique.
- drop_column(name: str) None[source]¶
Remove a column from the table.
For on-disk tables, the corresponding persisted column leaf is deleted.
- Raises:
ValueError – If the table is read-only, is a view, or name is the last column.
KeyError – If name does not exist.
- drop_computed_column(name: str) None[source]¶
Remove a computed column from the table.
- Parameters:
name¶ – Name of the computed column to remove.
- Raises:
KeyError – If name is not a computed column.
ValueError – If called on a view.
- extend(data: list | CTable | Any, *, validate: bool | None = None) None[source]¶
Append multiple rows at once.
data may be:
- a dict of arrays {"col": array, ...}: all arrays must have the same length; omitted columns are filled from their declared default; columns with no default declared must be provided;
- a list of rows, each compatible with append();
- another CTable: columns are matched by name.
Pass validate=False to skip per-row Pydantic validation on trusted bulk imports. Raises ValueError if the table is read-only or a view.
For tables with nested (dotted) column names, both the dict-of-arrays and list-of-dicts forms accept the original nested dict shape and auto-flatten it to physical dotted leaf names:
    # nested dict of arrays
    t.extend({
        "trip": {"begin": {"lon": lons, "lat": lats}},
        "payment": {"fare": fares},
    })
    # list of nested dicts
    t.extend([
        {"trip": {"begin": {"lon": -87.6, "lat": 41.8}}, "payment": {"fare": 12.5}},
        {"trip": {"begin": {"lon": -87.5, "lat": 41.7}}, "payment": {"fare": 8.0}},
    ])
- classmethod from_arrow(schema, batches, *, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, capacity_hint: int | None = None, string_max_length: int | Mapping[str, int] | None = None, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'msgpack', object_fallback: bool = False, column_cparams: Mapping[str, dict[str, Any]] | None = None, separate_nested_cols: bool = False, create_summary_index: bool = True, chunks: int | tuple[int, ...] | None = None, blocks: int | tuple[int, ...] | None = None) CTable[source]¶
Build a CTable from an Arrow schema and iterable of record batches.
Nested struct flattening: top-level Arrow struct<…> fields are automatically and recursively flattened into dotted leaf columns. For example, a field trip: struct<begin: struct<lon: float64, lat: float64>> becomes two CTable columns trip.begin.lon and trip.begin.lat. Each leaf is stored as an independent compressed NDArray. Row reads via t[i] reconstruct the original nested dict shape. Use t["trip.begin.lon"] or t.trip.begin.lon to access a leaf:
    import blosc2
    import pyarrow as pa

    trip_type = pa.struct([("begin", pa.struct([("lon", pa.float64())]))])
    schema = pa.schema([pa.field("trip", trip_type)])
    # batches: an iterable of record batches matching the schema
    t = blosc2.CTable.from_arrow(schema, batches)
    t.col_names  # ['trip.begin.lon']
    t["trip.begin.lon"].mean()
    t.trip.begin.lon.max()
When string_max_length is None (the default), scalar Arrow string/large_string columns are imported as vlstring() columns and binary/large_binary columns are imported as vlbytes() columns. struct columns that are not flattened (not containing only scalar leaves) are imported as struct() columns backed by batched variable-length storage. Null values for these variable-length scalar columns are represented as native None with no sentinel needed.
When string_max_length is set to a positive integer, scalar string and binary columns are imported as fixed-width string()/bytes() columns whose dtype is sized to string_max_length characters/bytes. It may also be a mapping from column name to max length; omitted string/binary columns remain vlstring()/vlbytes() columns.
blosc2_batch_size controls how many rows are buffered before BatchArray-backed imported columns (list columns and variable-length scalar columns such as vlstring, vlbytes, struct, and schema-less object columns) are flushed to their backend. Set it to None to keep those columns pending until the final flush.
list_serializer selects the backend serializer for imported list columns. "msgpack" is the default; "arrow" stores Arrow list batches directly and can be much faster for deeply nested list columns.
Unsupported Arrow types raise by default. Pass object_fallback=True to import such columns as schema-less object() columns. This fallback is intentionally not used by from_parquet().
column_cparams optionally maps column names to per-column compression parameters. These override the table-level cparams for fixed-width columns imported from Arrow.
- classmethod from_csv(path: str, row_cls, *, header: bool = True, sep: str = ',') CTable[source]¶
Build a CTable from a CSV file.
Schema comes from row_cls (a dataclass); CTable is always typed. All rows are read in a single pass into per-column Python lists, then each column is bulk-written into a pre-allocated NDArray (one slice assignment per column, no extend()).
- Parameters:
path¶ – Source CSV file path.
row_cls¶ – A dataclass whose fields define the column names and types.
header¶ – If True (default), the first row is treated as a header and skipped. Column order in the file must match row_cls field order regardless.
sep¶ – Field delimiter. Defaults to ","; use "\t" for TSV.
- Returns:
A new in-memory CTable containing all rows from the CSV file.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If a row has a different number of fields than the schema.
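The single-pass, per-column loading strategy described for from_csv() can be sketched with the standard library and NumPy. `Row` and `read_columns` are hypothetical names for this illustration, and plain NumPy arrays stand in for the pre-allocated NDArray containers.

```python
import csv
import io
from dataclasses import dataclass, fields, is_dataclass

import numpy as np

@dataclass
class Row:  # hypothetical schema for illustration
    id: int
    price: float

def read_columns(text_stream, row_cls, *, header=True, sep=","):
    """Read CSV rows into per-column lists, then convert each list to an array."""
    if not is_dataclass(row_cls):
        raise TypeError("row_cls must be a dataclass")
    schema = fields(row_cls)
    columns = {f.name: [] for f in schema}
    reader = csv.reader(text_stream, delimiter=sep)
    if header:
        next(reader, None)  # skip the header; column order comes from row_cls
    for record in reader:
        if len(record) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(record)}")
        for f, raw in zip(schema, record):
            columns[f.name].append(f.type(raw))  # convert text via the field type
    # One bulk conversion per column rather than one write per row.
    return {name: np.array(values) for name, values in columns.items()}

table = read_columns(io.StringIO("id,price\n1,9.5\n2,12.0\n"), Row)
```

Collecting values per column first means each column is written once, which is the point of the "one slice assignment per column" design noted above.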
- classmethod from_pandas(df, row_cls) CTable[source]¶
Build a CTable from a pandas DataFrame.
Schema comes from row_cls (a dataclass); CTable is always typed. Object-dtype DataFrame columns are not automatically inferred as ndarray columns; the row_cls must explicitly declare blosc2.ndarray() fields.
- Parameters:
- Returns:
A new CTable containing all DataFrame rows.
- Return type:
CTable
- Raises:
TypeError – If row_cls is not a dataclass.
ValueError – If DataFrame columns do not match the row_cls schema.
- classmethod from_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, urlpath: str | None = None, mode: str = 'w', cparams=None, dparams=None, validate: bool = False, auto_null_sentinels: bool = True, blosc2_batch_size: int | None = 2048, blosc2_items_per_block: int | None = None, list_serializer: Literal['msgpack', 'arrow'] = 'arrow', separate_nested_cols: bool = True, max_rows: int | None = None, **kwargs) CTable[source]¶
Read a Parquet file into a CTable.
The Parquet file is streamed batch by batch through pyarrow and then converted into a typed CTable. By default, the result is created in memory, but you can also persist it on disk via urlpath.
This method delegates the actual table construction to CTable.from_arrow(), so Arrow schema handling, nullable-column support, and Blosc2 write tuning follow the same rules as that method.
Nested struct flattening: top-level Parquet struct<…> fields are automatically and recursively flattened into dotted leaf columns, the same as in from_arrow(). For example, a Parquet file that contains a column trip: struct<begin: struct<lon: double, lat: double>> produces two CTable columns trip.begin.lon and trip.begin.lat. Row reads reconstruct the original nested dict shape; individual leaves are accessed via dotted names or attribute-chain proxies:
t = blosc2.CTable.from_parquet("trips.parquet")
t.col_names  # e.g. ['trip.begin.lon', 'trip.begin.lat', ...]
t["trip.begin.lon"].mean()
t.trip.begin.lon.max()
Unsupported Parquet types are not silently imported as schema-less object() columns; they raise so callers can decide how to handle them explicitly.
- Parameters:
path¶ (str or path-like) – Path to the source Parquet file.
columns¶ (list[str] or None, optional) – Subset of columns to read from the Parquet file. If provided, only these columns are loaded and their order in the resulting table matches the order in this list. Column names must be unique.
batch_size¶ (int, optional) – Number of rows per Arrow batch read from the Parquet file. This controls how much data is pulled from the file at a time before being handed off to the CTable builder. Must be greater than 0.
urlpath¶ (str or None, optional) – Destination storage path for the resulting CTable. If None (the default), the table is created in memory. If provided, the table is backed by persistent on-disk storage.
mode¶ (str, optional) – Storage open mode for urlpath. Defaults to "w". This is passed through to CTable.from_arrow().
cparams¶ (object, optional) – Compression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
dparams¶ (object, optional) – Decompression parameters for the created Blosc2 containers. Passed through to CTable.from_arrow().
validate¶ (bool, optional) – Whether to enable extra internal validation while building the table. Defaults to False.
auto_null_sentinels¶ (bool, optional) – If True (default), nullable scalar columns imported from Parquet may automatically receive per-column null sentinel values when needed. Sentinel selection follows the current null-policy rules used by CTable schema handling.
blosc2_batch_size¶ (int or None, optional) – Number of items written to Blosc2 containers per internal write batch. Passed through to CTable.from_arrow().
blosc2_items_per_block¶ (int or None, optional) – Target number of items per internal Blosc2 block. Passed through to CTable.from_arrow(). In general, a larger number of items per block favors the compression ratio but makes random access slower.
list_serializer¶ ({"msgpack", "arrow"}, optional) – Serializer used for imported list columns. The default, "arrow", stores Arrow list batches directly and is much faster for deeply nested or list<struct<...>> columns. The tradeoff is that accessing those list columns later requires PyArrow. Use "msgpack" to keep list-column stores independent of PyArrow at read time; it can be smaller for simple lists but is much slower and more memory-intensive for deeply nested data.
separate_nested_cols¶ (bool, optional) – Whether to separate qualifying nested columns during import. Defaults to True. In particular, a single unnamed top-level list<struct<...>> field is treated as a root record stream: each list element becomes a CTable row and struct leaves become ordinary nested CTable columns. Use separate_nested_cols=False when closer fidelity to the original Parquet row/schema shape is more important than the separated column layout.
max_rows¶ (int or None, optional) – Maximum number of rows to import. For ordinary Parquet files this limits Parquet/CTable rows. For unnamed-root list<struct<...>> files imported with separate_nested_cols=True, this limits flattened element rows.
**kwargs¶ – Additional keyword arguments forwarded to pyarrow.parquet.ParquetFile. Use these for Parquet-reader-specific options supported by PyArrow.
- Returns:
A new CTable populated from the Parquet file. The table contains all selected columns and all rows from the file (or at most max_rows rows when that limit is set). If urlpath is provided, the returned table is disk-backed; otherwise it is in-memory.
- Return type:
CTable
- Raises:
ImportError – If pyarrow is not installed.
ValueError – If batch_size is not greater than 0.
ValueError – If max_rows is negative.
ValueError – If columns contains duplicate names.
Exception – Any exception raised by pyarrow while opening or reading the Parquet file, or by CTable.from_arrow() while converting Arrow data into a CTable.
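The nested struct flattening described for from_parquet() can be illustrated with plain dictionaries. This is a minimal sketch of the naming scheme only, assuming a row arrives as a nested dict; `flatten` and `unflatten` are hypothetical helpers, not part of blosc2.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into {dotted_leaf_name: value}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into the struct
        else:
            flat[name] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict shape from dotted leaf names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"trip": {"begin": {"lon": -73.98, "lat": 40.75}}, "fare": 12.5}
flat = flatten(row)
```

Here `flat` has the keys `trip.begin.lon`, `trip.begin.lat` and `fare`, and `unflatten(flat)` gives back the original nested shape, which mirrors how row reads reconstruct the nested dict.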
Examples
Load an entire Parquet file into an in-memory table:
>>> import blosc2
>>> t = blosc2.CTable.from_parquet("data.parquet")
Load only a subset of columns:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     columns=["user_id", "amount", "country"],
... )
Create a disk-backed table while reading in batches:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     batch_size=50_000,
...     urlpath="data.ctable",
... )
Pass additional options through to PyArrow’s Parquet reader:
>>> t = blosc2.CTable.from_parquet(
...     "data.parquet",
...     memory_map=True,
... )
- group_by(keys: str | Sequence[str], *, sort: bool = False, dropna: bool = True, engine: str = 'auto', chunk_size: int | None = None)[source]¶
Return a deferred group-by object for this table.
- Parameters:
keys¶ – Column name or sequence of column names to group by.
sort¶ – If True, sort the result by the group keys. The default False preserves the hash aggregation order and is usually faster.
dropna¶ – If True (default), rows with null/NaN group keys are skipped. If False, null/NaN keys form their own group.
engine¶ – Execution engine. Phase 1 accepts "auto" and uses the NumPy chunked implementation.
chunk_size¶ – Optional number of physical rows processed per chunk.
- Returns:
A lightweight deferred operation builder. Call methods such as .size(), .count(column) or .agg({...}) to materialize a grouped result as a new CTable.
- Return type:
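The sort and dropna semantics of group_by() can be sketched as a chunked hash aggregation over a plain list of keys. This is an illustration, not the NumPy engine itself: `group_sizes` is a hypothetical helper, and placing the null group last when sorting is an assumption made for this sketch.

```python
import math


def group_sizes(keys, *, sort=False, dropna=True, chunk_size=4):
    """Count rows per key, processing the input chunk by chunk."""
    counts = {}
    for start in range(0, len(keys), chunk_size):
        for key in keys[start:start + chunk_size]:
            is_null = key is None or (isinstance(key, float) and math.isnan(key))
            if is_null:
                if dropna:
                    continue
                key = None  # all null/NaN keys share one group
            counts[key] = counts.get(key, 0) + 1
    if sort:
        # Sort non-null keys; the null group (if any) goes last (assumption).
        ordered = sorted(k for k in counts if k is not None)
        if None in counts:
            ordered.append(None)
        return {k: counts[k] for k in ordered}
    return counts  # first-seen (hash aggregation) order


keys = ["b", "a", None, "b", float("nan"), "a", "b"]
unsorted_sizes = group_sizes(keys)
sorted_sizes = group_sizes(keys, sort=True)
with_nulls = group_sizes(keys, dropna=False)
```

With sort=False the groups come out in first-seen order (`b` before `a`), which is why skipping the sort is usually cheaper.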
- property info: _CTableInfoReporter¶
Get information about this table.
Examples
>>> print(t.info)
>>> t.info()
- iter_arrow_batches(*, columns: list[str] | None = None, batch_size: int = 2048, include_computed: bool = True)[source]¶
Yield live rows as bounded-size pyarrow.RecordBatch objects.
- iter_sorted(cols: str | list[str], ascending: bool | list[bool] = True, *, start: int | None = None, stop: int | None = None, step: int | None = None, batch_size: int = 4096)[source]¶
Iterate rows in sorted order without materializing a full copy.
Uses a FULL index when available (no sort needed); otherwise falls back to np.lexsort on live physical positions. Yields namedtuple-like row objects in the same way as __iter__.
The sorted positions array is stored as a compressed blosc2.NDArray to keep RAM usage low for large tables. batch_size positions are decompressed at a time during iteration.
- Parameters:
cols¶ – Column name or list of column names to sort by.
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
start¶, stop¶, step¶ – Optional slice bounds applied to the sorted sequence before iteration. E.g. stop=10 yields only the first 10 rows in sorted order; step=2 yields every other row in sorted order.
batch_size¶ – Number of positions decompressed per iteration step. Larger values reduce decompression overhead; smaller values use less transient RAM. Default is 4096.
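The ordering and slicing rules of iter_sorted() can be sketched in pure Python. This is a sketch under assumptions: it swaps np.lexsort for repeated stable sorts (which give the same lexicographic order), keeps positions in a plain list rather than a compressed array, and uses the hypothetical helpers `sorted_positions` and `iter_sorted_rows`.

```python
from itertools import islice


def sorted_positions(columns, cols, ascending=True):
    """Return row positions ordered by several keys (first key is primary)."""
    if isinstance(cols, str):
        cols = [cols]
    if isinstance(ascending, bool):
        ascending = [ascending] * len(cols)
    if len(ascending) != len(cols):
        raise ValueError("ascending must have the same length as cols")
    positions = list(range(len(columns[cols[0]])))
    # Stable sorts applied from the last key to the first give lexicographic order.
    for name, asc in reversed(list(zip(cols, ascending))):
        positions.sort(key=lambda i: columns[name][i], reverse=not asc)
    return positions


def iter_sorted_rows(columns, cols, ascending=True, *, start=None, stop=None, step=None):
    """Yield rows as tuples in sorted order, after slicing the sorted sequence."""
    positions = sorted_positions(columns, cols, ascending)
    for i in islice(positions, start, stop, step):
        yield tuple(columns[name][i] for name in columns)


table = {"city": ["b", "a", "b", "a"], "score": [1, 5, 3, 2]}
order = sorted_positions(table, ["city", "score"], ascending=[True, False])
top2 = list(iter_sorted_rows(table, ["city", "score"], [True, False], stop=2))
```

The slice is applied to the sorted positions, not to the physical rows, so stop=2 yields the first two rows of the sorted sequence.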
- materialize_computed_column(name: str, *, new_name: str | None = None, dtype: dtype | None = None, cparams: dict | CParams | None = None) None[source]¶
Materialize a computed column into a new stored snapshot column.
- Parameters:
- Raises:
ValueError – If called on a view, on a read-only table, or if the target name collides with an existing stored or computed column.
KeyError – If name is not a computed column.
TypeError – If dtype is incompatible with the computed values.
- property nbytes: int¶
Total uncompressed size in bytes (all columns + valid_rows mask).
- property ncols: int¶
Total number of columns, including computed (virtual) columns.
- refresh_generated_column(name: str) None[source]¶
Recompute a stored generated/materialized column from its source columns.
- refresh_generated_columns(*, source: str | None = None) None[source]¶
Refresh all generated columns, optionally only those depending on source.
- rename_column(old: str, new: str) None[source]¶
Rename a column.
For on-disk tables, the corresponding persisted column leaf is renamed.
Renaming a flat column to a dotted name (e.g. "trip.begin.lon") promotes it to a nested leaf column: it will be stored under the hierarchical path /_cols/trip/begin/lon on disk and can be accessed via t["trip.begin.lon"] or the attribute-chain proxy t.trip.begin.lon. This is the primary way to define nested columns when importing from non-Arrow sources:
t.rename_column("trip_begin_lon", "trip.begin.lon")
t["trip.begin.lon"].mean()  # works as a regular Column
- Raises:
ValueError – If the table is read-only, is a view, or new already exists.
KeyError – If old does not exist.
- sample(n: int, *, seed: int | None = None) CTable[source]¶
Return a read-only view of n randomly chosen live rows.
- property schema: CompiledSchema¶
The compiled schema that drives this table’s columns and validation.
- select(cols: list[str]) CTable[source]¶
Return a column-projection view exposing only cols.
The returned object shares the underlying NDArrays with this table (no data is copied). Row filtering and value writes work as usual; structural mutations (add/drop/rename column, append, …) are blocked.
- Parameters:
cols¶ – Ordered list of column names to keep. For tables with nested (dotted) column names, a struct-prefix name automatically expands to all descendant leaves:
t.select(["trip.begin"])  # expands to trip.begin.lon, trip.begin.lat
t.select(["trip"])  # expands to all trip.* leaves
- Raises:
KeyError – If any name in cols is not a column of this table (and does not match any struct prefix).
ValueError – If cols is empty.
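The struct-prefix expansion used by select() can be sketched as a lookup over the list of leaf names. `expand_selection` is a hypothetical helper written for this illustration; dropping repeated names is an assumption of the sketch.

```python
def expand_selection(col_names, cols):
    """Expand struct prefixes in cols to their descendant leaf names."""
    if not cols:
        raise ValueError("cols must not be empty")
    selected = []
    for name in cols:
        if name in col_names:
            matches = [name]  # exact leaf name
        else:
            # A prefix matches every leaf below it in the dotted tree.
            matches = [c for c in col_names if c.startswith(name + ".")]
        if not matches:
            raise KeyError(name)
        selected.extend(m for m in matches if m not in selected)
    return selected


col_names = ["trip.begin.lon", "trip.begin.lat", "trip.sec", "fare"]
begin_only = expand_selection(col_names, ["trip.begin"])
fare_and_trip = expand_selection(col_names, ["fare", "trip"])
```

The result keeps the order given in cols, with each prefix replaced by its leaves in table order.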
- slice(start, stop=None, /, *, copy: bool = True) CTable[source]¶
Return a contiguous range of live (non-deleted) rows.
The range is given the way range() takes its bounds, either as a single stop (table.slice(stop)), as start/stop integers (table.slice(start, stop)), or as a Python slice (table.slice(slice(start, stop))). Negative bounds count from the end; step is not supported.
- Parameters:
start¶, stop¶ – Range bounds, interpreted as logical positions among the live rows.
copy¶ – When True (the default, mirroring NDArray.slice()) a compact copy of the range is returned. When False a zero-copy view is returned instead, sharing the parent’s column data (read-only, like head()/tail()).
- Returns:
out – The requested rows, re-indexed from 0.
- Return type:
CTable
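The three calling conventions of slice() can be sketched as a small bounds resolver. `resolve_bounds` is a hypothetical helper, and accepting a Python slice whose step is 1 is an assumption of this sketch.

```python
def resolve_bounds(nrows, start, stop=None):
    """Map slice()-style arguments to (start, stop) positions among live rows."""
    if isinstance(start, slice):
        if stop is not None:
            raise TypeError("pass either a slice or start/stop integers")
        if start.step not in (None, 1):
            raise ValueError("step is not supported")
        bounds = start
    elif stop is None:
        bounds = slice(start)  # a single argument is the stop, like range(stop)
    else:
        bounds = slice(start, stop)
    lo, hi, _ = bounds.indices(nrows)  # clamps and resolves negative bounds
    return lo, max(lo, hi)


first_three = resolve_bounds(10, 3)
middle = resolve_bounds(10, 2, 5)
last_three = resolve_bounds(10, slice(-3, None))
```

For a table with 10 live rows these resolve to (0, 3), (2, 5) and (7, 10); the returned rows would then be re-indexed from 0.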
- sort_by(cols: str | list[str], ascending: bool | list[bool] = True, *, inplace: bool = False) CTable[source]¶
Return a copy of the table sorted by one or more columns.
- Parameters:
cols¶ – Column name or list of column names to sort by. When multiple columns are given, the first is the primary key, the second is the tiebreaker, and so on. For tables with nested (dotted) column names, pass the dotted leaf name directly:
t.sort_by("trip.begin.lon")
t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])
ascending¶ – Sort direction. A single bool applies to all keys; a list must have the same length as cols.
inplace¶ – If True, rewrite the physical data in place and return self (like compact() but sorted). If False (default), return a new in-memory CTable leaving this one untouched.
- Raises:
ValueError – If called on a view or a read-only table when inplace=True.
KeyError – If any column name is not found.
TypeError – If a column used as a sort key does not support ordering (e.g. complex numbers).
- to_b2d(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a directory-backed store.
Directory-backed CTable stores may use any path that does not end in .b2z; using a .b2d suffix is recommended for clarity. For persistent, non-view .b2z tables opened read-only and compact=False, this uses a fast physical-unpack path: the zip members are extracted as already-compressed leaves. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns.
For in-memory tables, views, writable .b2z tables, existing directory-backed tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new directory-backed store.
Examples
Fast-unpack an existing compact zip store into a directory-backed table:
table = blosc2.CTable.open("data.b2z", mode="r")
table.to_b2d("data.b2d", overwrite=True)
table.close()
Materialize a filtered view into a directory-backed store:
view = table.where(table["score"] > 10)
view.to_b2d("high-score.b2d", overwrite=True)
Force a logical compacted copy, even for a persistent .b2z table:
table.to_b2d("data-compact.b2d", overwrite=True, compact=True)
- to_b2z(urlpath: str, *, overwrite: bool = False, compact: bool = False) str[source]¶
Write this table to a compact .b2z container.
.b2z is the compact zip-backed CTable format. For persistent, non-view directory-backed tables and compact=False, this uses a fast physical-pack path: the backing TreeStore directory is zipped with already-compressed leaves stored as-is. This preserves the physical layout, including deleted rows and spare capacity, and does not recompress columns. A .b2d suffix is recommended for directory-backed stores, but not required.
For in-memory tables, views, existing .b2z tables, or compact=True, this falls back to the logical save() path, materializing only visible/live rows into a new .b2z store.
Examples
Fast-pack an existing directory-backed table into a compact zip store:
table = blosc2.CTable.open("data.b2d", mode="r")
table.to_b2z("data.b2z", overwrite=True)
table.close()
Materialize a filtered view into a new compact store:
view = table.where(table["score"] > 10)
view.to_b2z("high-score.b2z", overwrite=True)
Force a logical compacted copy, even for a persistent .b2d table:
table.to_b2z("data-compact.b2z", overwrite=True, compact=True)
- to_csv(path: str | None = None, *, header: bool = True, sep: str = ',') str | None[source]¶
Write all live rows to CSV.
Uses Python’s stdlib csv module, so no extra dependency is required. Fixed-shape ndarray column cells are serialised as JSON arrays for readability and shape safety (e.g. "[1.0, 2.0, 3.0]").
- Parameters:
- Returns:
The CSV text when path is None, otherwise None.
- Return type:
str or None
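The JSON-array cell encoding mentioned for to_csv() can be sketched with the stdlib csv and json modules. This is an illustration only: `rows_to_csv` is a hypothetical helper, and array cells are represented as plain Python lists rather than NumPy arrays.

```python
import csv
import io
import json


def rows_to_csv(col_names, rows, *, header=True, sep=","):
    """Return CSV text; list-valued cells are written as JSON arrays."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    if header:
        writer.writerow(col_names)
    for row in rows:
        writer.writerow(
            [json.dumps(cell) if isinstance(cell, list) else cell for cell in row]
        )
    return buf.getvalue()


text = rows_to_csv(["id", "embedding"], [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0, 6.0])])

# Reading back: the csv module unquotes the cell, json.loads restores the list.
parsed = list(csv.reader(io.StringIO(text)))
first_embedding = json.loads(parsed[1][1])
```

Because the JSON text contains commas, the csv writer quotes the whole cell, so the array survives a round trip as a single field.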
- to_pandas()[source]¶
Convert to a pandas DataFrame.
Scalar columns become regular DataFrame columns. Fixed-shape ndarray columns become object-dtype columns whose cells hold NumPy arrays of per-row shape item_shape.
- Return type:
pandas.DataFrame
Examples
>>> import blosc2
>>> from dataclasses import dataclass
>>> import numpy as np
>>> @dataclass
... class Row:
...     id: int = blosc2.field(blosc2.int64())
...     embedding: object = blosc2.field(blosc2.ndarray((3,), dtype=blosc2.float32()))
>>> t = blosc2.CTable(Row, new_data=[
...     (1, np.array([1, 2, 3], dtype=np.float32)),
...     (2, np.array([4, 5, 6], dtype=np.float32)),
... ])
>>> df = t.to_pandas()
>>> df["id"].tolist()
[1, 2]
>>> df["embedding"].dtype
dtype('O')
>>> np.testing.assert_array_equal(df["embedding"][0], np.array([1, 2, 3], dtype=np.float32))
- to_parquet(path, *, columns: list[str] | None = None, batch_size: int = 2048, compression: str | None = 'zstd', row_group_size: int | None = None, include_computed: bool = True, **kwargs) None[source]¶
Write this table to a Parquet file batch-wise using pyarrow.
- to_string(*, max_rows: int | None = None, max_width: int | None = None, show_dimensions: bool | str = False, display_index: bool | None = None, index_name: str = '') str[source]¶
Return a tabular string representation of the table.
By default (max_rows=None, max_width=None) this renders the whole table, every row and every column, like pandas’ DataFrame.to_string(). This is independent of the global blosc2.set_printoptions(); those only affect the truncated str/repr/print view.
- Parameters:
max_rows¶ – Maximum number of rows before truncating to a compact head/tail view. None (default) shows all rows; -1 also means all, 0 shows none, a positive int caps it.
max_width¶ – Character budget for column fitting. None (default) or -1 shows all columns; a positive int truncates the middle ones with ... to fit.
show_dimensions¶ – Whether to append a [N rows x M columns] footer. False (default) omits it, matching pandas’ to_string(); True always shows it; "truncate" shows it only when the view is truncated (the behaviour of str/repr).
display_index¶ – Whether to include a pandas-like logical row index column. If None (default), use the global value configured with blosc2.set_printoptions().
index_name¶ – Optional label for the displayed index column.
- trim_capacity() None[source]¶
Shrink fixed-width physical storage to the last live row position.
This removes unused append capacity while preserving holes left by deletes before the last live row. List and variable-length scalar columns already grow to their logical length and are left untouched.
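The effect of trim_capacity() on the physical layout can be sketched with a boolean validity mask. `trimmed_length` is a hypothetical helper, and the mask is a plain list here rather than a compressed array.

```python
def trimmed_length(valid_rows):
    """Physical length after trimming: one past the last live row."""
    for pos in range(len(valid_rows) - 1, -1, -1):
        if valid_rows[pos]:
            return pos + 1
    return 0  # no live rows at all


# True = live row, False = deleted row or unused append capacity
valid_rows = [True, False, True, True, False, False, False, False]
new_len = trimmed_length(valid_rows)
trimmed = valid_rows[:new_len]
```

The four trailing unused slots are dropped, while the hole at position 1 (a delete before the last live row) is kept.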
- view(new_valid_rows)[source]¶
Return a row-filter view backed by a boolean mask array without copying data.
- property vlmeta¶
Variable-length metadata attached to this table.
Returns a mapping-like proxy that supports item access, iteration, and the [:] bulk getter. Values are serialised via msgpack, so all standard types (int, float, str, bool, list, dict) are supported. The metadata is stored separately from the internal schema metadata and persists through close() / reopen for disk-backed tables.
Examples
>>> import blosc2
>>> import dataclasses
>>> @dataclasses.dataclass
... class Row:
...     x: int = 0
>>> t = blosc2.CTable(Row)
>>> t.vlmeta["author"] = "Alice"
>>> t.vlmeta["tags"] = ["alpha", "beta"]
>>> t.vlmeta["count"] = 42
>>> print(t.vlmeta["author"])
Alice
>>> print(t.vlmeta[:])
{'author': 'Alice', 'tags': ['alpha', 'beta'], 'count': 42}
>>> del t.vlmeta["count"]
>>> for name in t.vlmeta:
...     print(name, t.vlmeta[name])
...
author Alice
tags ['alpha', 'beta']
- class blosc2.Column(table: CTable, col_name: str, mask=None)[source]¶
Column view for a CTable, with vectorized operations and reductions.
- Attributes:
dtype – NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
info – Get information about this column.
info_items – Structured summary items used by info.
is_computed – True if this column is a virtual computed column (read-only).
is_dictionary – True if this column is a dictionary-encoded string column.
is_generated – True if this column is a stored generated/materialized column.
is_list
is_ndarray – True if this column stores fixed-shape N-D array values per row.
is_stale – True if this generated column needs to be refreshed before use.
is_varlen_scalar – True if this column holds variable-length scalar strings or bytes.
item_ndim – Number of per-row item dimensions.
item_shape – Per-row item shape; () for scalar columns.
item_size – Number of scalar values stored in each row item.
ndim – Number of logical dimensions.
null_value – The sentinel value that represents NULL for this column, or None.
raw – The underlying storage container for this column, without null-value processing.
row_transformer – Build row-wise projections/reductions for generated columns.
shape – Logical shape of the live column values.
size – Number of live scalar values in the logical column array.
view – Return a ColumnViewIndexer for creating logical sub-views.
Methods
assign(data) – Replace all live values in this column with data.
is_null() – Return a boolean array True where the live value is the null sentinel.
isin(values) – Return a boolean array True where the live value is in values.
iter_chunks([size]) – Iterate over live column values in chunks of size rows.
norm([ord, axis, where]) – Vector/matrix norm of a fixed-shape ndarray column.
notnull() – Return a boolean array True where the live value is not the null sentinel.
null_count() – Return the number of live rows whose value equals the null sentinel.
read_stale([key]) – Read stored values even when this generated column is marked stale.
summary() – Return and print a compact summary for this column.
unique() – Return sorted array of unique live, non-null values.
value_counts() – Return a {value: count} dict sorted by count descending.
- assign(data) None[source]¶
Replace all live values in this column with data.
Works on both full tables and views — on a view, only the rows visible through the view’s mask are overwritten.
- Parameters:
data¶ – List, numpy array, or any iterable. Must have exactly as many elements as there are live rows in this column. Values are coerced to the column’s dtype if possible.
- Raises:
ValueError – If len(data) does not match the number of live rows, or the table is opened read-only.
TypeError – If values cannot be coerced to the column’s dtype.
- property dtype¶
NumPy dtype of the underlying storage, or None for variable-length columns (vlstring(), vlbytes(), list()).
- property info: _CTableInfoReporter¶
Get information about this column.
The report includes both logical/live-row details and, when available, the physical storage details used internally by lazy predicates.
Examples
>>> print(t["score"].info)
>>> t["score"].info()
- is_null() ndarray[source]¶
Return a boolean array True where the live value is the null sentinel.
For varlen scalar columns (vlstring/vlbytes) nullability is represented as native None values, so this returns True wherever the value is None. For dictionary columns, returns True where the code equals the null_code (-1 by default).
- isin(values) ndarray[source]¶
Return a boolean array True where the live value is in values.
For dictionary columns this performs efficient integer-code membership testing (no decoding of all values). Values absent from the dictionary are treated as not-present.
For non-dictionary columns this decodes all live values and tests membership in a set.
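The integer-code membership test described for dictionary columns can be sketched as follows. The layout is an assumption for illustration (a list of distinct values plus one code per row, with -1 as the null code); `isin_codes` is a hypothetical helper.

```python
dictionary = ["DE", "ES", "FR"]  # distinct values, stored once
codes = [2, 0, 0, 1, -1, 2]      # one integer code per row; -1 marks null


def isin_codes(codes, dictionary, values):
    """Membership test on integer codes; no row value is decoded."""
    index = {value: code for code, value in enumerate(dictionary)}
    # Values absent from the dictionary contribute no code.
    wanted = {index[v] for v in values if v in index}
    return [code in wanted for code in codes]


mask = isin_codes(codes, dictionary, ["FR", "IT"])
```

Only the requested values are looked up in the dictionary; the per-row work is a set lookup on small integers, and "IT" simply matches nothing.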
- property item_ndim: int¶
Number of per-row item dimensions.
- property item_shape: tuple[int, ...]¶
Per-row item shape; () for scalar columns.
- property item_size: int¶
Number of scalar values stored in each row item.
- iter_chunks(size: int = 65536)[source]¶
Iterate over live column values in chunks of size rows.
Yields numpy arrays of at most size elements each, skipping deleted rows. The last chunk may be smaller than size.
- Parameters:
size¶ – Number of live rows per yielded chunk. Defaults to 65536.
- Yields:
numpy.ndarray – A 1-D array of up to size live values with this column’s dtype.
Examples
>>> for chunk in t["score"].iter_chunks(size=100_000):
...     process(chunk)
- property ndim: int¶
Number of logical dimensions.
- norm(ord=None, axis=None, *, where=None)[source]¶
Vector/matrix norm of a fixed-shape ndarray column.
The column is treated as a logical array of shape (nrows, *item_shape). For example, axis=1 computes one norm per row for a 1-D item shape.
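What "one norm per row" means for axis=1 can be shown with a small pure-Python example. It assumes the Euclidean (2-norm) case and represents the column as a list of lists; the data is made up for the illustration.

```python
import math

# A hypothetical column of 3 rows with item_shape == (2,): logical shape (3, 2).
embeddings = [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]]

# Reducing over axis 1 leaves one Euclidean norm per row.
row_norms = [math.hypot(*row) for row in embeddings]
```

The result has one entry per row (length 3), since the item axis is the one being reduced.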
- notnull() ndarray[source]¶
Return a boolean array True where the live value is not the null sentinel.
- null_count() int[source]¶
Return the number of live rows whose value equals the null sentinel.
Returns 0 in O(1) if no null_value is configured for this column and the column is not a varlen scalar column.
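The relationship between is_null(), notnull() and null_count() for sentinel-based columns can be sketched on a plain list. `is_null_mask` is a hypothetical helper, -1 is a made-up sentinel, and the NaN branch is an assumption about how a NaN sentinel would need to be compared.

```python
import math


def is_null_mask(values, null_value):
    """Boolean list that is True where a value equals the null sentinel."""
    if null_value is None:
        return [False] * len(values)  # no sentinel configured
    if isinstance(null_value, float) and math.isnan(null_value):
        # NaN never compares equal to itself, so test it explicitly.
        return [isinstance(v, float) and math.isnan(v) for v in values]
    return [v == null_value for v in values]


ages = [31, -1, 47, -1, 25]  # -1 is a hypothetical sentinel
nulls = is_null_mask(ages, -1)
notnull = [not flag for flag in nulls]
null_count = sum(nulls)
```

With no sentinel configured the mask is all False, which is why the count can be returned as 0 without scanning the data.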
- property null_value¶
The sentinel value that represents NULL for this column, or None.
- property raw¶
The underlying storage container for this column, without null-value processing.
Returns the raw blosc2.NDArray, ListArray, DictionaryColumn, or scalar varlen array directly. Unlike __getitem__(), which always materializes NumPy arrays, this is the column as a blosc2-native compressed object: usable as a lazy-expression operand without decompressing, and exposing storage details such as schunk, chunks, cparams or iterchunks_info().
This is a physical view of the column: fixed-width containers are over-allocated to chunk capacity for appends, so their first axis is longer than len(column) and positions of rows deleted from the table still hold their old values. No validity-mask or null-sentinel processing is applied; use the Column interface for logical reads.
Raises AttributeError for computed (virtual) columns, which have no backing storage.
- read_stale(key=slice(None, None, None))[source]¶
Read stored values even when this generated column is marked stale.
This is an explicit escape hatch for inspecting the last materialized values. Normal reads raise for stale generated columns so outdated values are not used accidentally.
- property row_transformer: RowTransformer¶
Build row-wise projections/reductions for generated columns.
- property shape: tuple[int, ...]¶
Logical shape of the live column values.
- property size: int¶
Number of live scalar values in the logical column array.
- summary() str[source]¶
Return and print a compact summary for this column.
For fixed-shape ndarray columns this includes logical shape, storage, and row-norm statistics when numeric. Scalar columns fall back to info.
- unique() ndarray[source]¶
Return sorted array of unique live, non-null values.
Null sentinel values are excluded. Processes data in chunks — never loads the full column at once.
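The chunk-at-a-time processing used by unique() (and by value_counts()) can be sketched with a running Counter. This is an illustration only: the helper names are hypothetical, -1 is a made-up null sentinel, and the input is an in-memory list standing in for a column that would be decompressed one chunk at a time.

```python
from collections import Counter


def iter_chunks(values, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def unique_and_counts(values, *, null_value=None, size=3):
    """Accumulate per-value counts chunk by chunk, skipping the null sentinel."""
    counts = Counter()
    for chunk in iter_chunks(values, size):
        counts.update(v for v in chunk if v != null_value)
    unique = sorted(counts)                   # sorted unique non-null values
    by_count = dict(counts.most_common())     # sorted by count, descending
    return unique, by_count


scores = [7, -1, 3, 7, 3, 7, -1, 9]  # -1 is a hypothetical null sentinel
unique, by_count = unique_and_counts(scores, null_value=-1)
```

Only the running counter and the current chunk are held at any time, so memory grows with the number of distinct values rather than the number of rows.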
- value_counts() dict[source]¶
Return a {value: count} dict sorted by count descending.
Null sentinel values are excluded. Processes data in chunks; the full column is never loaded at once.
Example
>>> t["active"].value_counts()
{True: 8432, False: 1568}
- property view: ColumnViewIndexer¶
Return a ColumnViewIndexer for creating logical sub-views.
Examples
Read a sub-view for chained aggregates:
sub = t.price.view[2:10]
sub.sum()
Bulk write through a sub-view:
t.price.view[0:5][:] = np.zeros(5)
- class blosc2.NestedColumn(table: CTable, prefix: str)[source]¶
A read-only accessor for a nested (dotted) group of CTable columns.
Returned by attribute access on a CTable (or on another NestedColumn) when the name refers to an internal node of the dotted column tree rather than a leaf. For a table flattened from a struct/list<struct> schema, t.trip is a NestedColumn grouping every leaf under the trip. prefix, while a leaf such as t.trip.sec (or t.trip.begin.lon) is a Column. Drilling into an intermediate node (e.g. t.trip.begin) yields another NestedColumn.
Exposes aggregate metadata over its descendant leaf columns (col_names, nrows, ncols, nbytes, cbytes, cratio) and an info report.
Examples
>>> t.trip
<NestedColumn 'trip'>
>>> t.trip.col_names
['sec', 'km', 'begin.lon', ...]
>>> t.trip.sec  # a leaf -> Column
- Attributes:
cbytes – Compressed size in bytes for stored descendant columns.
col_names – Descendant leaf column names relative to this nested prefix.
cratio – Compression ratio for stored descendant columns.
info – Get information about this nested column namespace.
info_items – Structured summary items used by info.
nbytes – Uncompressed size in bytes for stored descendant columns.
ncols – Number of descendant leaf columns in this nested namespace.
nrows – Number of logical rows in this nested namespace.
- property cbytes: int¶
Compressed size in bytes for stored descendant columns.
- property col_names: list[str]¶
Descendant leaf column names relative to this nested prefix.
- property cratio: float¶
Compression ratio for stored descendant columns.
- property info: _CTableInfoReporter¶
Get information about this nested column namespace.
Examples
>>> print(t.trip.info)
>>> t.trip.info()
- property nbytes: int¶
Uncompressed size in bytes for stored descendant columns.
- property ncols: int¶
Number of descendant leaf columns in this nested namespace.
- property nrows: int¶
Number of logical rows in this nested namespace.
- blosc2.get_printoptions() dict[str, Any][source]¶
Return a copy of the global CTable display options.
- blosc2.printoptions(**kwargs: Any)¶
Temporarily set CTable display options, restored on exit.
Accepts the same keyword arguments as set_printoptions(). Handy for a one-off full dump, e.g.:
with blosc2.printoptions(display_rows=-1, display_width=-1):
    print(ctable)
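The set-then-restore behaviour of printoptions() follows the usual context-manager pattern, which can be sketched with contextlib. Everything here is hypothetical (the `_options` dict, its defaults, and this stand-alone `printoptions`); it only illustrates the pattern, not how blosc2 stores its settings.

```python
from contextlib import contextmanager

_options = {"display_rows": 10, "display_width": None}  # hypothetical defaults


@contextmanager
def printoptions(**kwargs):
    """Apply options for the duration of a with-block, then restore them."""
    unknown = set(kwargs) - set(_options)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    saved = dict(_options)
    _options.update(kwargs)
    try:
        yield
    finally:
        _options.clear()
        _options.update(saved)  # restored even if the block raises


with printoptions(display_rows=-1):
    inside = dict(_options)
after = dict(_options)
```

The try/finally ensures the previous settings come back even when the body of the with-block raises.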
- blosc2.set_printoptions(*, display_index: bool | None = None, display_rows: int | None = None, display_width: int | None = <object object>, display_precision: int | None = None, fancy: bool | None = None) None[source]¶
Set global display options for CTable string representations.
These options affect str(ctable) / repr(ctable) / print(ctable) (the interactive, truncated view). They do not affect CTable.to_string(), which renders everything by default.
- Parameters:
display_index¶ – Whether the display should include a pandas-like logical row index column. None leaves the current setting unchanged.
display_rows¶ – Maximum number of rows shown before truncating to a compact head/tail view (the first five and last five rows, when possible). -1 shows all rows, 0 shows none. None leaves the current setting unchanged.
display_width¶ – Character budget used to decide how many columns fit before truncating the middle ones with .... None (the default) auto-detects the terminal width, -1 shows all columns, a positive int sets a fixed budget. Omit the argument to leave the current setting unchanged.
display_precision¶ – Number of digits after the decimal point for floating-point values in table displays. Trailing zeros are trimmed. None leaves the current setting unchanged.
fancy¶ – Whether to use the more decorated table display, including separator rules and a detailed footer. False (default) uses a simpler pandas-like footer such as [726017 rows x 5 columns] and omits separator rules. None leaves the current setting unchanged.
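The head/tail truncation controlled by display_rows can be sketched as a function that picks which row positions are shown. `visible_rows` is a hypothetical helper, and the way an odd budget is split between head and tail is an assumption of this sketch.

```python
def visible_rows(nrows, display_rows):
    """Return the row positions shown, with None marking the elided middle."""
    if display_rows == -1 or nrows <= display_rows:
        return list(range(nrows))  # everything fits, or no limit
    if display_rows == 0:
        return []
    head = (display_rows + 1) // 2  # assumption: the head gets the extra row
    tail = display_rows - head
    return list(range(head)) + [None] + list(range(nrows - tail, nrows))


shown = visible_rows(100, 10)
```

With a budget of 10 rows on a 100-row table this gives positions 0 to 4, a gap marker, and positions 95 to 99, matching the "first five and last five" view.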