Indexing CTables¶
CTable supports persistent, table-owned indexes that speed up where() queries on numeric columns. An index maps sorted-value ranges to the chunk positions that contain matching rows, allowing Blosc2 to skip large parts of the table without reading every row.
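As a mental model, the chunk-skipping idea can be sketched in plain Python. This is an illustrative toy (not the blosc2 implementation): each chunk records its value range, and a predicate prunes chunks whose range cannot contain matches.

```python
# Toy sketch: a bucket index stores the (min, max) of each chunk, so a
# query can skip chunks whose value range cannot contain matching rows.

def build_bucket_index(chunks):
    """Record (min, max) per chunk of a numeric column."""
    return [(min(c), max(c)) for c in chunks]

def query_gt(chunks, bucket_index, threshold):
    """Return values > threshold, scanning only chunks whose max exceeds it."""
    hits, scanned = [], 0
    for chunk, (lo, hi) in zip(chunks, bucket_index):
        if hi <= threshold:  # whole chunk out of range: skip it
            continue
        scanned += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, scanned

chunks = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
bidx = build_bucket_index(chunks)
hits, scanned = query_gt(chunks, bidx, 250)
print(scanned, len(hits))  # 1 of 3 chunks scanned, 49 matches
```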
This tutorial covers:
Creating an index on a CTable column
Querying with an index (automatic)
Stale detection and automatic scan fallback
Rebuilding and dropping indexes
Persistent tables: indexes survive close/reopen
Views and indexes
Setup¶
We will use a simple measurement table with three numeric columns.
[1]:
import dataclasses
import numpy as np
import blosc2

@dataclasses.dataclass
class Measurement:
    sensor_id: int = blosc2.field(blosc2.int32())
    temperature: float = blosc2.field(blosc2.float64())
    region: int = blosc2.field(blosc2.int32())

N = 500
t = blosc2.CTable(Measurement)
rng = np.random.default_rng(42)
for i in range(N):
    t.append([i, 15.0 + rng.random() * 25, int(rng.integers(0, 4))])
print(f"Table: {N} rows")
Table: 500 rows
Computed columns (quick note)¶
CTables can also expose computed columns via add_computed_column(...). They are read-only, use no extra storage, and participate in display, filtering, sorting, and aggregates.
For indexing, you now have two options:
Materialize first with materialize_computed_column(...), then call create_index() on the new stored column. Materialized columns are stored snapshots, and future append()/extend() calls auto-fill omitted values for them.
Build a direct expression index with create_index(expression=...) over stored table columns. Matching where() predicates can reuse that index directly, and a matching FULL expression index can also be reused when ordering by a computed column backed by the same expression.
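The expression-index idea can be sketched in plain Python. This is an assumed model, not blosc2 internals: the expression is evaluated per chunk and its (min, max) is recorded, so predicates on the computed value can prune chunks without re-evaluating every row.

```python
# Sketch (assumed semantics, plain Python): an expression index precomputes
# the expression per chunk and records its (min, max) for pruning.

def expr(celsius):  # the computed column: temperature in Fahrenheit
    return celsius * 9 / 5 + 32

def build_expr_index(chunks):
    return [(min(map(expr, c)), max(map(expr, c))) for c in chunks]

chunks = [[15.0, 20.0], [30.0, 39.9]]
expr_index = build_expr_index(chunks)
# A predicate like temperature_f > 90 can skip the first chunk entirely,
# since its max is only 68 degrees F.
print([(round(lo, 2), round(hi, 2)) for lo, hi in expr_index])
```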
[2]:
t.add_computed_column("temperature_f", "temperature * 9 / 5 + 32")
print(t.select(["sensor_id", "temperature", "temperature_f"]).head(3))
sensor_id temperature temperature_f
int32 float64 float64
──────────── ───────────────── ─────────────────
0 34.34890121389… 93.82802218501…
1 36.46494799778… 97.63690639601…
2 32.43420072648… 90.38156130767…
──────────── ───────────────── ─────────────────
3 rows × 3 columns
Creating an index¶
Call create_index(col_name) to build a bucket index on a column. The returned CTableIndex handle shows the column name, kind, and whether the index is stale.
[3]:
idx = t.create_index("sensor_id")
print(idx)
print("stale?", idx.stale)
print("all indexes:", t.indexes)
<CTableIndex col='sensor_id' kind='bucket' name='__self__'>
stale? False
all indexes: [<CTableIndex col='sensor_id' kind='bucket' name='__self__'>]
Querying with an index¶
where() automatically uses an available (non-stale) index when the filter expression matches the indexed column. The result is identical to a full scan.
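The use-index-or-scan decision can be sketched with a simple dict-based catalog. The names here are hypothetical, not blosc2 internals: the index is used only when one exists for the filtered column and it is fresh; otherwise the query falls back to a full scan.

```python
# Sketch of where()-style dispatch (hypothetical names, not blosc2 code):
# use an index only when it exists for the column and is not stale.

def choose_plan(catalog, column):
    entry = catalog.get(column)
    if entry is not None and not entry["stale"]:
        return "index"
    return "scan"  # no index, or a stale one: always-correct fallback

catalog = {"sensor_id": {"stale": False}}
print(choose_plan(catalog, "sensor_id"))    # index
print(choose_plan(catalog, "temperature"))  # scan
catalog["sensor_id"]["stale"] = True        # a mutation happened
print(choose_plan(catalog, "sensor_id"))    # scan
```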
[4]:
result = t.where(t["sensor_id"] > 450)
print("Rows sensor_id > 450:", len(result))
print("sensor_ids:", sorted(int(v) for v in result["sensor_id"][:]))
Rows sensor_id > 450: 49
sensor_ids: [451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499]
Stale detection¶
Any mutation — append, extend, Column.__setitem__, Column.assign, sort_by, compact — marks all indexes stale. When an index is stale, where() falls back to a full scan automatically so results are always correct.
[5]:
t.append([9999, 30.0, 1]) # any mutation marks indexes stale
idx = t.index("sensor_id")
print("stale after append?", idx.stale)
# Query still works — scan fallback
result_stale = t.where(t["sensor_id"] == 9999)
print("Found row:", len(result_stale))
stale after append? True
Found row: 1
Note: delete() only bumps the visibility epoch (it does not change column values) so it does not mark indexes stale.
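The distinction above can be sketched with two counters. This assumes epoch semantics like those hinted at for delete(), and is not blosc2 code: value-changing mutations bump a data epoch that invalidates indexes, while delete() bumps only a visibility epoch.

```python
# Sketch (assumed semantics): indexes record the data epoch they were built
# at; they are stale only when the data epoch has moved on.

class TableState:
    def __init__(self):
        self.data_epoch = 0
        self.visibility_epoch = 0

    def append(self):  # any value-changing mutation
        self.data_epoch += 1

    def delete(self):  # hides rows, column values unchanged
        self.visibility_epoch += 1

class IndexState:
    def __init__(self, table):
        self.built_at = table.data_epoch

    def stale(self, table):
        return table.data_epoch != self.built_at

ts = TableState()
ixs = IndexState(ts)
ts.delete()
print(ixs.stale(ts))  # False: delete does not invalidate the index
ts.append()
print(ixs.stale(ts))  # True: append does
```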
Rebuilding an index¶
rebuild_index(col_name) drops the old index and builds a fresh one from the current table state.
[6]:
idx = t.rebuild_index("sensor_id")
print("stale after rebuild?", idx.stale)
result_rebuilt = t.where(t["sensor_id"] == 9999)
print("Found row via rebuilt index:", len(result_rebuilt))
stale after rebuild? False
Found row via rebuilt index: 1
Dropping an index¶
drop_index(col_name) removes the index from the catalog and deletes any sidecar files (for persistent tables).
[7]:
t.drop_index("sensor_id")
print("Indexes after drop:", t.indexes)
Indexes after drop: []
Persistent tables¶
Indexes on persistent tables (tables with a urlpath) survive close and reopen because the catalog is stored inside the table’s own /_meta sidecar and the index data lives under <table.b2d>/_indexes/<col_name>/.
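The sidecar pattern can be sketched with a JSON catalog in a temporary directory. This layout is illustrative only (blosc2's actual on-disk format may differ): the catalog is written next to the table data, so reopening the directory restores the list of indexes.

```python
# Sketch of the sidecar pattern (illustrative layout, not blosc2's
# actual on-disk format): persist the index catalog beside the data,
# then reload it on "reopen".
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "sensors.b2d"
(root / "_indexes" / "sensor_id").mkdir(parents=True)
meta = root / "_meta"
meta.mkdir()

# "Close": persist the catalog.
(meta / "index_catalog.json").write_text(
    json.dumps([{"col": "sensor_id", "kind": "bucket"}])
)

# "Reopen": reload the catalog from disk.
catalog = json.loads((meta / "index_catalog.json").read_text())
print(catalog)  # [{'col': 'sensor_id', 'kind': 'bucket'}]
```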
[8]:
import shutil
import tempfile
from pathlib import Path
tmpdir = Path(tempfile.mkdtemp())
path = str(tmpdir / "sensors.b2d")
# Create a persistent table and build an index
pt = blosc2.CTable(Measurement, urlpath=path, mode="w")
rng2 = np.random.default_rng(0)
for i in range(300):
    pt.append([i, 15.0 + rng2.random() * 25, int(rng2.integers(0, 4))])
pidx = pt.create_index("sensor_id")
print("Created:", pidx)
# Sidecar files
index_dir = Path(path) / "_indexes" / "sensor_id"
print("Sidecar files:", len(list(index_dir.glob("**/*.b2nd"))))
# Query before close
r1 = pt.where(pt["sensor_id"] > 280)
print("Rows > 280 (before close):", len(r1))
Created: <CTableIndex col='sensor_id' kind='bucket' name='__self__'>
Sidecar files: 7
Rows > 280 (before close): 19
[9]:
# Close and reopen — catalog is preserved
del pt
pt2 = blosc2.open(path, mode="a")  # pass mode explicitly; the default will change to "r"
print("Indexes after reopen:", pt2.indexes)
r2 = pt2.where(pt2["sensor_id"] > 280)
print("Rows > 280 (after reopen):", len(r2))
ids1 = sorted(int(v) for v in r1["sensor_id"][:])
ids2 = sorted(int(v) for v in r2["sensor_id"][:])
assert ids1 == ids2, "Results differ!"
print("Results match ✓")
shutil.rmtree(tmpdir, ignore_errors=True)
Indexes after reopen: [<CTableIndex col='sensor_id' kind='bucket' name='__self__'>]
Rows > 280 (after reopen): 19
Results match ✓
Views and indexes¶
A view (the result of where()) is a filtered window into the underlying table. Index management methods (create_index, drop_index, rebuild_index, compact_index) are not available on views — they raise ValueError.
[10]:
t2 = blosc2.CTable(Measurement)
for i in range(50):
    t2.append([i, 20.0, i % 3])
t2.create_index("sensor_id")
view = t2.where(t2["sensor_id"] > 10)
print("View type:", type(view).__name__)
try:
    view.create_index("sensor_id")
except ValueError as e:
    print("create_index on view:", e)
try:
    view.drop_index("sensor_id")
except ValueError as e:
    print("drop_index on view:", e)
View type: CTable
create_index on view: Cannot create an index on a view.
drop_index on view: Cannot drop an index from a view.
Summary¶
| Operation | Method |
|---|---|
| Build index | t.create_index(col) |
| Query (auto) | t.where(...) |
| Check if stale | idx.stale |
| Rebuild | t.rebuild_index(col) |
| Drop | t.drop_index(col) |
| Compact (full indexes) | t.compact_index(col) |
| List all | t.indexes |
Key behaviours:
Mutations (append, extend, Column.__setitem__, Column.assign, sort_by, compact) mark indexes stale.
Stale indexes trigger automatic scan fallback — no user intervention needed.
Persistent indexes survive table close and reopen.
Views cannot own indexes; only root tables can.