Indexing Arrays¶
Blosc2 can attach indexes to 1-D NDArray objects and to fields inside 1-D structured arrays. These indexes accelerate selective mask evaluation, and full indexes can also drive ordered access directly through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), and iter_sorted(...).
This tutorial covers:
- how to create field and expression indexes,
- how to tell whether a mask is using an index,
- what sort of acceleration different index kinds can deliver on a selective mask,
- how index persistence works,
- when to rebuild indexes,
- and a recommended workflow for keeping append-heavy full indexes compact.
Setup¶
[1]:
import statistics
import time
from pathlib import Path

import numpy as np

import blosc2


def format_bytes(nbytes):
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    value = float(nbytes)
    for unit in units:
        if value < 1024.0 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} {unit}"
            return f"{value:.2f} {unit}"
        value /= 1024.0
    return f"{value:.2f} {units[-1]}"


def show_index_summary(label, descriptor):
    print(
        f"{label}: kind={descriptor['kind']}, persistent={descriptor['persistent']}, "
        f"ooc={descriptor['ooc']}, stale={descriptor['stale']}"
    )


def explain_subset(expr):
    info = expr.explain()
    keep = {}
    for key in ("will_use_index", "reason", "kind", "level", "lookup_path", "full_runs"):
        if key in info:
            keep[key] = info[key]
    return keep


def median_ms(func, repeats=5, warmup=1):
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


paths = [
    Path("indexing_tutorial_partial.b2nd"),
    Path("indexing_tutorial_append_full.b2nd"),
]
for path in paths:
    blosc2.remove_urlpath(path)
Index kinds and how to create them¶
Blosc2 currently supports four index kinds:
- summary: compact summaries only,
- bucket: summary levels plus lightweight per-block payloads,
- partial: richer payloads for positional filtering,
- full: globally sorted payloads for positional filtering and ordered reuse.
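To build intuition for the gap between the cheapest and richest kinds, here is a loose NumPy analogy (illustrative only, not Blosc2's internal layout): a summary behaves like a per-chunk zone map that can only skip whole chunks whose min/max miss the predicate, while a full index keeps a globally sorted view that answers a range predicate directly.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.permutation(np.arange(-500, 500).astype(np.float64))
chunks = values.reshape(10, 100)  # pretend each row is one stored chunk

# Summary-style metadata: only min/max per chunk (tiny, but coarse).
zone_map = [(c.min(), c.max()) for c in chunks]

# A range predicate can only skip chunks whose [min, max] misses it entirely.
lo, hi = -5.0, 5.0
touched = [i for i, (cmin, cmax) in enumerate(zone_map) if cmax >= lo and cmin < hi]

# Full-style payload: a global sort order, answering the range directly.
order = np.argsort(values)
sorted_vals = values[order]
start, stop = np.searchsorted(sorted_vals, [lo, hi])
hits = np.sort(order[start:stop])

print(len(touched), len(hits))
```

Because the values are shuffled, the min/max of almost every chunk straddles the predicate, so the zone map skips little; the sorted payload narrows the answer to exactly the matching rows with two binary searches.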
There is one active index per target field or expression. If you create another index on the same target, it replaces the previous one. The easiest way to compare kinds is to build them on separate arrays.
The next cell times index creation and reports the compressed storage footprint of each index relative to the compressed base array.
[2]:
N_ROWS = 10_000_000
MASK_TEXT = "(id >= -5.0) & (id < 5.0)"
rng = np.random.default_rng(0)
dtype = np.dtype([("id", np.float64), ("payload", np.int32)])
ids = np.arange(-N_ROWS // 2, N_ROWS // 2, dtype=np.float64)
rng.shuffle(ids)
data = blosc2.fromiter(((id_, i) for i, id_ in enumerate(ids)), shape=(N_ROWS,), dtype=dtype)

indexed_arrays = {}
build_rows = []
base_cbytes = data.cbytes
for kind in (
    blosc2.IndexKind.SUMMARY,
    blosc2.IndexKind.BUCKET,
    blosc2.IndexKind.PARTIAL,
    blosc2.IndexKind.FULL,
):
    arr = data.copy()
    t0 = time.perf_counter()
    arr.create_index(field="id", kind=kind)
    build_ms = (time.perf_counter() - t0) * 1e3
    index_obj = arr.index("id")
    indexed_arrays[kind.value] = arr
    build_rows.append((kind.value, build_ms, index_obj.cbytes, index_obj.cbytes / base_cbytes))

print(f"Compressed base array size: {format_bytes(base_cbytes)}")
print(f"{'kind':<12} {'build_ms':>10} {'index_size':>12} {'overhead':>10}")
for kind, build_ms, index_cbytes, overhead in build_rows:
    print(f"{kind:<12} {build_ms:10.3f} {format_bytes(index_cbytes):>12} {overhead:>9.2f}x")
Compressed base array size: 30.74 MiB
kind build_ms index_size overhead
summary 26.726 142 B 0.00x
bucket 455.373 26.04 MiB 0.85x
partial 404.564 34.99 MiB 1.14x
full 1635.311 28.44 MiB 0.93x
Using an index for masks¶
Range predicates are planned automatically when you use where(...). If you just want the matching values, expr[:] is the shortest form. In the comparisons below we use compute() so the result stays as an NDArray, and we force a scan by passing _use_index=False.
[3]:
partial_arr = indexed_arrays["partial"]
expr = blosc2.lazyexpr(MASK_TEXT, partial_arr.fields).where(partial_arr)
print(explain_subset(expr))
indexed = expr.compute()
scanned = expr.compute(_use_index=False)
np.testing.assert_array_equal(indexed, scanned)
print(f"Matched rows: {len(indexed)}")
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'partial', 'level': 'partial', 'lookup_path': 'chunk-nav', 'full_runs': 0}
Matched rows: 10
Timing the mask with and without indexes¶
The next cell measures the same selective mask on all four index kinds and compares it with a forced full scan. On this workload, partial and full usually show the clearest benefit because they carry richer payloads for positional filtering.
[4]:
timing_rows = []
expected = None
for kind, arr in indexed_arrays.items():
    expr = blosc2.lazyexpr(MASK_TEXT, arr.fields).where(arr)
    result = expr.compute()
    if expected is None:
        expected = result
    else:
        np.testing.assert_array_equal(result, expected)
    scan_ms = median_ms(lambda expr=expr: expr.compute(_use_index=False), repeats=3)
    index_ms = median_ms(lambda expr=expr: expr.compute(), repeats=3)
    timing_rows.append((kind, scan_ms, index_ms, scan_ms / index_ms))

print(f"Selective mask over {N_ROWS:,} rows")
print(f"{'kind':<12} {'scan_ms':>11} {'index_ms':>10} {'speedup':>10}")
for kind, scan_ms, index_ms, speedup in timing_rows:
    print(f"{kind:<12} {scan_ms:11.3f} {index_ms:10.3f} {speedup:10.2f}x")
Selective mask over 10,000,000 rows
kind scan_ms index_ms speedup
summary 47.485 49.725 0.95x
bucket 43.921 0.941 46.68x
partial 42.991 0.921 46.67x
full 43.695 0.944 46.28x
Full indexes and ordered access¶
A full index stores a global sorted payload. This is the required index tier for direct ordered reuse. Build it directly with create_index(kind=blosc2.IndexKind.FULL).
[5]:
ordered_dtype = np.dtype([("id", np.int64), ("payload", np.int64)])
ordered_data = np.array(
    [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)],
    dtype=ordered_dtype,
)
ordered_arr = blosc2.asarray(ordered_data)
ordered_arr.create_index("id", kind=blosc2.IndexKind.FULL)
print("Sorted positions:", ordered_arr.argsort(order=["id", "payload"])[:])
print("Sorted rows:")
print(ordered_arr.sort(order=["id", "payload"])[:])
Sorted positions: [7 5 3 1 6 4 2 0]
Sorted rows:
[(1, 2) (1, 4) (1, 6) (1, 8) (2, 3) (2, 5) (2, 7) (2, 9)]
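Conceptually, what the full index contributes here is a stored permutation that both sort and argsort can reuse. A plain NumPy sketch of the same ordered access (not the Blosc2 implementation, just the underlying idea):

```python
import numpy as np

dtype = np.dtype([("id", np.int64), ("payload", np.int64)])
rows = np.array(
    [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)],
    dtype=dtype,
)

# np.lexsort takes keys in reverse priority: the primary key "id" goes last.
perm = np.lexsort((rows["payload"], rows["id"]))
print(perm)        # [7 5 3 1 6 4 2 0], the same positions as argsort above
print(rows[perm])  # the same rows as sort(order=["id", "payload"])
```

Once such a permutation is materialized and persisted, ordered reads become pure gather operations instead of repeated sorts.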
Expression indexes¶
You can also index a deterministic scalar expression stream. Expression indexes are matched by normalized expression identity, so the same expression can be reused for masks and ordered access.
[6]:
expr_dtype = np.dtype([("x", np.int64), ("payload", np.int32)])
expr_data = np.array([(-8, 0), (5, 1), (-2, 2), (11, 3), (3, 4), (-3, 5), (2, 6), (-5, 7)], dtype=expr_dtype)
expr_arr = blosc2.asarray(expr_data)
expr_arr.create_index(expression="abs(x)", kind=blosc2.IndexKind.FULL, name="abs_x")
ordered_expr = blosc2.lazyexpr("(abs(x) >= 2) & (abs(x) < 8)", expr_arr.fields).where(expr_arr)
print(explain_subset(ordered_expr))
print("Expression-order positions:", ordered_expr.argsort(order="abs(x)").compute()[:])
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'sidecar-stream', 'full_runs': 0}
Expression-order positions: [2 6 4 5 1 7]
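The reuse promise is that one precomputed key stream serves both the mask and the ordering. A NumPy sketch of that idea (illustrative only, not Blosc2's sidecar layout):

```python
import numpy as np

x = np.array([-8, 5, -2, 11, 3, -3, 2, -5], dtype=np.int64)

key = np.abs(x)  # computed once, reusable for filtering and ordering

# Filter on the precomputed key stream instead of re-evaluating abs(x).
selected = np.flatnonzero((key >= 2) & (key < 8))
# kind="stable" preserves original order among equal keys.
ordered = selected[np.argsort(key[selected], kind="stable")]
print(ordered)  # matches the expression-order positions printed above
```

This is why expression indexes are matched by normalized expression identity: any lazy expression that reduces to the same key stream can share the same persisted payloads.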
Persistence: automatic or manual?¶
Index persistence follows the base array by default:
- for a persistent array (urlpath=...), persistent=None means the index sidecars are persisted automatically,
- for an in-memory array, the index lives only in memory,
- on a persistent array, persistent=False keeps the index process-local instead of writing sidecars.
In practice, if you want an index to survive reopen, persist the array and use the default behavior.
[7]:
persistent_arr = data.copy(urlpath=paths[0], mode="w")
persistent_descriptor = persistent_arr.create_index(field="id", kind=blosc2.IndexKind.PARTIAL)
show_index_summary("persistent partial", persistent_descriptor)
reopened = blosc2.open(paths[0], mode="a")
print(f"Reopened index count: {len(reopened.indexes)}")
print(f"Persisted sidecar path: {reopened.indexes[0]['partial']['values_path']}")
persistent partial: kind=partial, persistent=True, ooc=True, stale=False
Reopened index count: 1
Persisted sidecar path: indexing_tutorial_partial.__index__.id.partial.partial.values.b2nd
When to rebuild an index¶
Appending is special-cased and keeps compatible indexes current. General mutation and resize operations do not. After unsupported mutations, the index is marked stale and should be refreshed explicitly with rebuild_index().
[8]:
mutable_arr = blosc2.arange(20, dtype=np.int64)
mutable_arr.create_index(kind=blosc2.IndexKind.FULL)
mutable_arr[:3] = -1
print("Stale after direct mutation:", mutable_arr.indexes[0]["stale"])
mutable_arr.rebuild_index()
print("Stale after rebuild:", mutable_arr.indexes[0]["stale"])
Stale after direct mutation: True
Stale after rebuild: False
Recommended workflow for append-heavy full indexes¶
Appending to a full index is intentionally cheap: appended tails become sorted runs instead of forcing an immediate rewrite of the compact base sidecars.
That means the recommended workflow is:
- create a persistent full index once,
- append freely during ingestion,
- let masks keep working while runs accumulate,
- call compact_index() after ingestion windows or before latency-sensitive read phases.
The next example uses a larger append-heavy array and times the same selective mask before and after compaction. The positional-filter path reports whether it is using a compact lookup layout or a run-aware fallback. After compaction, full["runs"] becomes empty again.
[9]:
append_dtype = np.dtype([("id", np.int64), ("payload", np.int32)])
base_rows = 200_000
append_batch = 500
num_runs = 40
append_data = blosc2.zeros(base_rows, dtype=append_dtype)[:]
append_data["id"] = blosc2.arange(base_rows, dtype=np.int64)
append_data["payload"] = blosc2.arange(base_rows, dtype=np.int32)
append_arr = blosc2.asarray(append_data, urlpath=paths[1], mode="w")
append_arr.create_index(field="id", kind=blosc2.IndexKind.FULL)
for run in range(num_runs):
    start = 300_000 + run * append_batch
    batch = blosc2.zeros(append_batch, dtype=append_dtype)[:]
    batch["id"] = blosc2.arange(start, start + append_batch, dtype=np.int64)
    batch["payload"] = blosc2.arange(append_batch, dtype=np.int32)
    append_arr.append(batch)
mask_str = "(id >= 310_000) & (id < 310_020)"
append_expr = blosc2.lazyexpr(mask_str, append_arr.fields).where(append_arr)
before_info = explain_subset(append_expr)
before_ms = median_ms(lambda: append_expr.compute(), repeats=5)
print("Before compaction:", before_info)
print("Pending runs:", len(append_arr.indexes[0]["full"]["runs"]))
print(f"Median mask time before compaction: {before_ms:.3f} ms")
append_arr.compact_index("id")
append_expr = blosc2.lazyexpr(mask_str, append_arr.fields).where(append_arr)
after_info = explain_subset(append_expr)
after_ms = median_ms(lambda: append_expr.compute(), repeats=5)
print("After compaction:", after_info)
print("Pending runs:", len(append_arr.indexes[0]["full"]["runs"]))
print(f"Median mask time after compaction: {after_ms:.3f} ms")
print(f"Speedup after compaction: {before_ms / after_ms:.2f}x")
Before compaction: {'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'run-bounded-ooc', 'full_runs': 40}
Pending runs: 40
Median mask time before compaction: 0.299 ms
After compaction: {'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'compact-selective-ooc', 'full_runs': 0}
Pending runs: 0
Median mask time after compaction: 0.266 ms
Speedup after compaction: 1.12x
Practical guidance¶
- Use partial when your main goal is faster selective masks.
- Use full when you also want ordered reuse through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), or iter_sorted(...).
- Persist the base array if you want indexes to survive reopen automatically.
- After unsupported mutations, use rebuild_index().
- For append-heavy full indexes, compact explicitly at convenient maintenance boundaries instead of on every append.
- Measure your own workload: compact indexes, predicate selectivity, and ordered access needs all affect which kind is best.
[10]:
for path in paths:
blosc2.remove_urlpath(path)