Indexing CTables¶
CTable supports persistent, table-owned indexes that speed up where() queries on numeric columns. An index maps sorted-value ranges to the chunk positions that contain matching rows, allowing Blosc2 to skip large parts of the table without reading every row.
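As a mental model, the chunk-skipping idea can be sketched in plain Python. This is an illustrative toy (not the blosc2 implementation): each chunk records its value range, and a predicate prunes chunks whose range cannot contain matches.

```python
# Toy sketch: a bucket index stores the (min, max) of each chunk, so a
# query can skip chunks whose value range cannot contain matching rows.

def build_bucket_index(chunks):
    """Record (min, max) per chunk of a numeric column."""
    return [(min(c), max(c)) for c in chunks]

def query_gt(chunks, bucket_index, threshold):
    """Return values > threshold, scanning only chunks whose max exceeds it."""
    hits, scanned = [], 0
    for chunk, (lo, hi) in zip(chunks, bucket_index):
        if hi <= threshold:  # whole chunk out of range: skip it
            continue
        scanned += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, scanned

chunks = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
bidx = build_bucket_index(chunks)
hits, scanned = query_gt(chunks, bidx, 250)
print(scanned, len(hits))  # 1 of 3 chunks scanned, 49 matches
```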
This tutorial covers:
Creating an index on a CTable column
Querying with an index (automatic)
Stale detection and automatic scan fallback
Rebuilding and dropping indexes
Persistent tables: indexes survive close/reopen
Views and indexes
Setup¶
We will use a simple measurement table with three numeric columns.
[1]:
import dataclasses
import numpy as np
import blosc2

@dataclasses.dataclass
class Measurement:
    sensor_id: int = blosc2.field(blosc2.int32())
    temperature: float = blosc2.field(blosc2.float64())
    region: int = blosc2.field(blosc2.int32())

N = 500
t = blosc2.CTable(Measurement)
rng = np.random.default_rng(42)
for i in range(N):
    t.append([i, 15.0 + rng.random() * 25, int(rng.integers(0, 4))])
print(f"Table: {N} rows")
Table: 500 rows
Computed columns (quick note)¶
CTables can also expose computed columns via add_computed_column(...). They are read-only, use no extra storage, and participate in display, filtering, sorting, and aggregates.
For indexing, you now have two options:
Materialize first with materialize_computed_column(...), then call create_index() on the new stored column. Materialized columns are stored snapshots, and future append()/extend() calls auto-fill omitted values for them.
Build a direct expression index with create_index(expression=...) over stored table columns. Matching where() predicates can reuse that index directly, and a matching FULL expression index can also be reused when ordering by a computed column backed by the same expression.
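The expression-index idea can be sketched in plain Python. This is an assumed model, not blosc2 internals: the expression is evaluated per chunk and its (min, max) is recorded, so predicates on the computed value can prune chunks without re-evaluating every row.

```python
# Sketch (assumed semantics, plain Python): an expression index precomputes
# the expression per chunk and records its (min, max) for pruning.

def expr(celsius):  # the computed column: temperature in Fahrenheit
    return celsius * 9 / 5 + 32

def build_expr_index(chunks):
    return [(min(map(expr, c)), max(map(expr, c))) for c in chunks]

chunks = [[15.0, 20.0], [30.0, 39.9]]
expr_index = build_expr_index(chunks)
# A predicate like temperature_f > 90 can skip the first chunk entirely,
# since its max is only 68 degrees F.
print([(round(lo, 2), round(hi, 2)) for lo, hi in expr_index])
```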
[2]:
t.add_computed_column("temperature_f", "temperature * 9 / 5 + 32")
print(t.select(["sensor_id", "temperature", "temperature_f"]).head(3))
sensor_id temperature temperature_f
int32 float64 float64
──────────── ───────────────── ─────────────────
0 34.34890121389… 93.82802218501…
1 36.46494799778… 97.63690639601…
2 32.43420072648… 90.38156130767…
──────────── ───────────────── ─────────────────
3 rows × 3 columns
Creating an index¶
Call create_index(col_name) to build a bucket index on a column. The returned CTableIndex handle shows the column name, kind, and whether the index is stale.
[3]:
idx = t.create_index("sensor_id")
print(idx)
print("stale?", idx.stale)
print("all indexes:", t.indexes)
<CTableIndex col='sensor_id' kind='bucket' name='__self__'>
stale? False
all indexes: [<CTableIndex col='sensor_id' kind='bucket' name='__self__'>]
Querying with an index¶
where() automatically uses an available (non-stale) index when the filter expression matches the indexed column. The result is identical to a full scan.
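The use-index-or-scan decision can be sketched with a simple dict-based catalog. The names here are hypothetical, not blosc2 internals: the index is used only when one exists for the filtered column and it is fresh; otherwise the query falls back to a full scan.

```python
# Sketch of where()-style dispatch (hypothetical names, not blosc2 code):
# use an index only when it exists for the column and is not stale.

def choose_plan(catalog, column):
    entry = catalog.get(column)
    if entry is not None and not entry["stale"]:
        return "index"
    return "scan"  # no index, or a stale one: always-correct fallback

catalog = {"sensor_id": {"stale": False}}
print(choose_plan(catalog, "sensor_id"))    # index
print(choose_plan(catalog, "temperature"))  # scan
catalog["sensor_id"]["stale"] = True        # a mutation happened
print(choose_plan(catalog, "sensor_id"))    # scan
```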
[4]:
result = t.where(t["sensor_id"] > 450)
print("Rows sensor_id > 450:", len(result))
print("sensor_ids:", sorted(int(v) for v in result["sensor_id"][:]))
Rows sensor_id > 450: 49
sensor_ids: [451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499]
Stale detection¶
Any mutation — append, extend, Column.__setitem__, Column.assign, sort_by, compact — marks all indexes stale. When an index is stale, where() falls back to a full scan automatically so results are always correct.
[5]:
t.append([9999, 30.0, 1]) # any mutation marks indexes stale
idx = t.index("sensor_id")
print("stale after append?", idx.stale)
# Query still works — scan fallback
result_stale = t.where(t["sensor_id"] == 9999)
print("Found row:", len(result_stale))
stale after append? True
Found row: 1
Note: delete() only bumps the visibility epoch (it does not change column values) so it does not mark indexes stale.
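The distinction above can be sketched with two counters. This assumes epoch semantics like those hinted at for delete(), and is not blosc2 code: value-changing mutations bump a data epoch that invalidates indexes, while delete() bumps only a visibility epoch.

```python
# Sketch (assumed semantics): indexes record the data epoch they were built
# at; they are stale only when the data epoch has moved on.

class TableState:
    def __init__(self):
        self.data_epoch = 0
        self.visibility_epoch = 0

    def append(self):  # any value-changing mutation
        self.data_epoch += 1

    def delete(self):  # hides rows, column values unchanged
        self.visibility_epoch += 1

class IndexState:
    def __init__(self, table):
        self.built_at = table.data_epoch

    def stale(self, table):
        return table.data_epoch != self.built_at

ts = TableState()
ixs = IndexState(ts)
ts.delete()
print(ixs.stale(ts))  # False: delete does not invalidate the index
ts.append()
print(ixs.stale(ts))  # True: append does
```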
Rebuilding an index¶
rebuild_index(col_name) drops the old index and builds a fresh one from the current table state.
[6]:
idx = t.rebuild_index("sensor_id")
print("stale after rebuild?", idx.stale)
result_rebuilt = t.where(t["sensor_id"] == 9999)
print("Found row via rebuilt index:", len(result_rebuilt))
stale after rebuild? False
Found row via rebuilt index: 1
Dropping an index¶
drop_index(col_name) removes the index from the catalog and deletes any sidecar files (for persistent tables).
[7]:
t.drop_index("sensor_id")
print("Indexes after drop:", t.indexes)
Indexes after drop: []
Persistent tables¶
Indexes on persistent tables (tables with a urlpath) survive close and reopen because the catalog is stored inside the table’s own /_meta sidecar and the index data lives under <table.b2d>/_indexes/<col_name>/.
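The sidecar pattern can be sketched with a JSON catalog in a temporary directory. This layout is illustrative only (blosc2's actual on-disk format may differ): the catalog is written next to the table data, so reopening the directory restores the list of indexes.

```python
# Sketch of the sidecar pattern (illustrative layout, not blosc2's
# actual on-disk format): persist the index catalog beside the data,
# then reload it on "reopen".
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "sensors.b2d"
(root / "_indexes" / "sensor_id").mkdir(parents=True)
meta = root / "_meta"
meta.mkdir()

# "Close": persist the catalog.
(meta / "index_catalog.json").write_text(
    json.dumps([{"col": "sensor_id", "kind": "bucket"}])
)

# "Reopen": reload the catalog from disk.
catalog = json.loads((meta / "index_catalog.json").read_text())
print(catalog)  # [{'col': 'sensor_id', 'kind': 'bucket'}]
```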
[8]:
import shutil
import tempfile
from pathlib import Path
tmpdir = Path(tempfile.mkdtemp())
path = str(tmpdir / "sensors.b2d")
# Create a persistent table and build an index
pt = blosc2.CTable(Measurement, urlpath=path, mode="w")
rng2 = np.random.default_rng(0)
for i in range(300):
    pt.append([i, 15.0 + rng2.random() * 25, int(rng2.integers(0, 4))])
pidx = pt.create_index("sensor_id")
print("Created:", pidx)
# Sidecar files
index_dir = Path(path) / "_indexes" / "sensor_id"
print("Sidecar files:", len(list(index_dir.glob("**/*.b2nd"))))
# Query before close
r1 = pt.where(pt["sensor_id"] > 280)
print("Rows > 280 (before close):", len(r1))
Created: <CTableIndex col='sensor_id' kind='bucket' name='__self__'>
Sidecar files: 7
Rows > 280 (before close): 19
[9]:
# Close and reopen — catalog is preserved
del pt
pt2 = blosc2.open(path, mode="a")  # pass mode explicitly; the default will change to "r"
print("Indexes after reopen:", pt2.indexes)
r2 = pt2.where(pt2["sensor_id"] > 280)
print("Rows > 280 (after reopen):", len(r2))
ids1 = sorted(int(v) for v in r1["sensor_id"][:])
ids2 = sorted(int(v) for v in r2["sensor_id"][:])
assert ids1 == ids2, "Results differ!"
print("Results match ✓")
shutil.rmtree(tmpdir, ignore_errors=True)
Indexes after reopen: [<CTableIndex col='sensor_id' kind='bucket' name='__self__'>]
Rows > 280 (after reopen): 19
Results match ✓
Views and indexes¶
A view (the result of where()) is a filtered window into the underlying table. Index management methods (create_index, drop_index, rebuild_index, compact_index) are not available on views — they raise ValueError.
[10]:
t2 = blosc2.CTable(Measurement)
for i in range(50):
    t2.append([i, 20.0, i % 3])
t2.create_index("sensor_id")
view = t2.where(t2["sensor_id"] > 10)
print("View type:", type(view).__name__)
try:
    view.create_index("sensor_id")
except ValueError as e:
    print("create_index on view:", e)
try:
    view.drop_index("sensor_id")
except ValueError as e:
    print("drop_index on view:", e)
View type: CTable
create_index on view: Cannot create an index on a view.
drop_index on view: Cannot drop an index from a view.
Summary¶
| Operation | Method |
|---|---|
| Build index | t.create_index(col) |
| Query (auto) | t.where(...) |
| Check if stale | idx.stale |
| Rebuild | t.rebuild_index(col) |
| Drop | t.drop_index(col) |
| Compact (full indexes) | t.compact_index(col) |
| List all | t.indexes |
Key behaviours:
Mutations (append, extend, Column.__setitem__, Column.assign, sort_by, compact) mark indexes stale.
Stale indexes trigger automatic scan fallback — no user intervention needed.
Persistent indexes survive table close and reopen.
Views cannot own indexes; only root tables can.