Indexing Arrays¶
Blosc2 can attach indexes to 1-D NDArray objects and to fields inside 1-D structured arrays. These indexes accelerate selective mask evaluation, and full indexes can also drive ordered access directly through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), and iter_sorted(...).
This tutorial covers:
- how to create field and expression indexes,
- how to tell whether a mask is using an index,
- what sort of acceleration different index kinds can deliver on a selective mask,
- how index persistence works,
- when to rebuild indexes,
- and a recommended workflow for keeping append-heavy full indexes compact.
Setup¶
[1]:
import statistics
import time
from pathlib import Path

import numpy as np

import blosc2


def format_bytes(nbytes):
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    value = float(nbytes)
    for unit in units:
        if value < 1024.0 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} {unit}"
            return f"{value:.2f} {unit}"
        value /= 1024.0
    return f"{value:.2f} {units[-1]}"


def show_index_summary(label, descriptor):
    print(
        f"{label}: kind={descriptor['kind']}, persistent={descriptor['persistent']}, "
        f"ooc={descriptor['ooc']}, stale={descriptor['stale']}"
    )


def explain_subset(expr):
    info = expr.explain()
    keep = {}
    for key in ("will_use_index", "reason", "kind", "level", "lookup_path", "full_runs"):
        if key in info:
            keep[key] = info[key]
    return keep


def median_ms(func, repeats=5, warmup=1):
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


paths = [
    Path("indexing_tutorial_partial.b2nd"),
    Path("indexing_tutorial_append_full.b2nd"),
]
for path in paths:
    blosc2.remove_urlpath(path)
Index kinds and how to create them¶
Blosc2 currently supports four index kinds:
- summary: compact summaries only,
- bucket: summary levels plus lightweight per-block payloads,
- partial: richer payloads for positional filtering,
- full: globally sorted payloads for positional filtering and ordered reuse.
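To build intuition for the gap between the cheapest and richest kinds, here is a loose NumPy analogy (illustrative only, not Blosc2's internal layout): a summary behaves like a per-chunk zone map that can only skip whole chunks whose min/max miss the predicate, while a full index keeps a globally sorted view that answers a range predicate directly.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.permutation(np.arange(-500, 500).astype(np.float64))
chunks = values.reshape(10, 100)  # pretend each row is one stored chunk

# Summary-style metadata: only min/max per chunk (tiny, but coarse).
zone_map = [(c.min(), c.max()) for c in chunks]

# A range predicate can only skip chunks whose [min, max] misses it entirely.
lo, hi = -5.0, 5.0
touched = [i for i, (cmin, cmax) in enumerate(zone_map) if cmax >= lo and cmin < hi]

# Full-style payload: a global sort order, answering the range directly.
order = np.argsort(values)
sorted_vals = values[order]
start, stop = np.searchsorted(sorted_vals, [lo, hi])
hits = np.sort(order[start:stop])

print(len(touched), len(hits))
```

Because the values are shuffled, the min/max of almost every chunk straddles the predicate, so the zone map skips little; the sorted payload narrows the answer to exactly the matching rows with two binary searches.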
There is one active index per target field or expression. If you create another index on the same target, it replaces the previous one. The easiest way to compare kinds is to build them on separate arrays.
The next cell times index creation and reports the compressed storage footprint of each index relative to the compressed base array.
[2]:
N_ROWS = 10_000_000
MASK_TEXT = "(id >= -5.0) & (id < 5.0)"
rng = np.random.default_rng(0)
dtype = np.dtype([("id", np.float64), ("payload", np.int32)])
ids = np.arange(-N_ROWS // 2, N_ROWS // 2, dtype=np.float64)
rng.shuffle(ids)
data = blosc2.fromiter(((id_, i) for i, id_ in enumerate(ids)), shape=(N_ROWS,), dtype=dtype)

indexed_arrays = {}
build_rows = []
base_cbytes = data.cbytes
for kind in (
    blosc2.IndexKind.SUMMARY,
    blosc2.IndexKind.BUCKET,
    blosc2.IndexKind.PARTIAL,
    blosc2.IndexKind.FULL,
):
    arr = data.copy()
    t0 = time.perf_counter()
    arr.create_index(field="id", kind=kind)
    build_ms = (time.perf_counter() - t0) * 1e3
    index_obj = arr.index("id")
    indexed_arrays[kind.value] = arr
    build_rows.append((kind.value, build_ms, index_obj.cbytes, index_obj.cbytes / base_cbytes))

print(f"Compressed base array size: {format_bytes(base_cbytes)}")
print(f"{'kind':<12} {'build_ms':>10} {'index_size':>12} {'overhead':>10}")
for kind, build_ms, index_cbytes, overhead in build_rows:
    print(f"{kind:<12} {build_ms:10.3f} {format_bytes(index_cbytes):>12} {overhead:>9.2f}x")
Compressed base array size: 30.74 MiB
kind build_ms index_size overhead
summary 26.726 142 B 0.00x
bucket 455.373 26.04 MiB 0.85x
partial 404.564 34.99 MiB 1.14x
full 1635.311 28.44 MiB 0.93x
Using an index for masks¶
Range predicates are planned automatically when you use where(...). If you just want the matching values, expr[:] is the shortest form. In the comparisons below we use compute() so the result stays as an NDArray, and we force a scan by passing _use_index=False.
[3]:
partial_arr = indexed_arrays["partial"]
expr = blosc2.lazyexpr(MASK_TEXT, partial_arr.fields).where(partial_arr)
print(explain_subset(expr))
indexed = expr.compute()
scanned = expr.compute(_use_index=False)
np.testing.assert_array_equal(indexed, scanned)
print(f"Matched rows: {len(indexed)}")
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'partial', 'level': 'partial', 'lookup_path': 'chunk-nav', 'full_runs': 0}
Matched rows: 10
Timing the mask with and without indexes¶
The next cell measures the same selective mask on all four index kinds and compares it with a forced full scan. On this workload, partial and full usually show the clearest benefit because they carry richer payloads for positional filtering.
[4]:
timing_rows = []
expected = None
for kind, arr in indexed_arrays.items():
    expr = blosc2.lazyexpr(MASK_TEXT, arr.fields).where(arr)
    result = expr.compute()
    if expected is None:
        expected = result
    else:
        np.testing.assert_array_equal(result, expected)
    scan_ms = median_ms(lambda expr=expr: expr.compute(_use_index=False), repeats=3)
    index_ms = median_ms(lambda expr=expr: expr.compute(), repeats=3)
    timing_rows.append((kind, scan_ms, index_ms, scan_ms / index_ms))

print(f"Selective mask over {N_ROWS:,} rows")
print(f"{'kind':<12} {'scan_ms':>11} {'index_ms':>10} {'speedup':>10}")
for kind, scan_ms, index_ms, speedup in timing_rows:
    print(f"{kind:<12} {scan_ms:11.3f} {index_ms:10.3f} {speedup:10.2f}x")
Selective mask over 10,000,000 rows
kind scan_ms index_ms speedup
summary 47.485 49.725 0.95x
bucket 43.921 0.941 46.68x
partial 42.991 0.921 46.67x
full 43.695 0.944 46.28x
Full indexes and ordered access¶
A full index stores a global sorted payload. This is the required index tier for direct ordered reuse. Build it directly with create_index(kind=blosc2.IndexKind.FULL).
[5]:
ordered_dtype = np.dtype([("id", np.int64), ("payload", np.int64)])
ordered_data = np.array(
    [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)],
    dtype=ordered_dtype,
)
ordered_arr = blosc2.asarray(ordered_data)
ordered_arr.create_index("id", kind=blosc2.IndexKind.FULL)
print("Sorted positions:", ordered_arr.argsort(order=["id", "payload"])[:])
print("Sorted rows:")
print(ordered_arr.sort(order=["id", "payload"])[:])
Sorted positions: [7 5 3 1 6 4 2 0]
Sorted rows:
[(1, 2) (1, 4) (1, 6) (1, 8) (2, 3) (2, 5) (2, 7) (2, 9)]
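Conceptually, what the full index contributes here is a stored permutation that both sort and argsort can reuse. A plain NumPy sketch of the same ordered access (not the Blosc2 implementation, just the underlying idea):

```python
import numpy as np

dtype = np.dtype([("id", np.int64), ("payload", np.int64)])
rows = np.array(
    [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)],
    dtype=dtype,
)

# np.lexsort takes keys in reverse priority: the primary key "id" goes last.
perm = np.lexsort((rows["payload"], rows["id"]))
print(perm)        # [7 5 3 1 6 4 2 0], the same positions as argsort above
print(rows[perm])  # the same rows as sort(order=["id", "payload"])
```

Once such a permutation is materialized and persisted, ordered reads become pure gather operations instead of repeated sorts.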
Expression indexes¶
You can also index a deterministic scalar expression stream. Expression indexes are matched by normalized expression identity, so the same expression can be reused for masks and ordered access.
[6]:
expr_dtype = np.dtype([("x", np.int64), ("payload", np.int32)])
expr_data = np.array([(-8, 0), (5, 1), (-2, 2), (11, 3), (3, 4), (-3, 5), (2, 6), (-5, 7)], dtype=expr_dtype)
expr_arr = blosc2.asarray(expr_data)
expr_arr.create_index(expression="abs(x)", kind=blosc2.IndexKind.FULL, name="abs_x")
ordered_expr = blosc2.lazyexpr("(abs(x) >= 2) & (abs(x) < 8)", expr_arr.fields).where(expr_arr)
print(explain_subset(ordered_expr))
print("Expression-order positions:", ordered_expr.argsort(order="abs(x)").compute()[:])
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'sidecar-stream', 'full_runs': 0}
Expression-order positions: [2 6 4 5 1 7]
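The reuse promise is that one precomputed key stream serves both the mask and the ordering. A NumPy sketch of that idea (illustrative only, not Blosc2's sidecar layout):

```python
import numpy as np

x = np.array([-8, 5, -2, 11, 3, -3, 2, -5], dtype=np.int64)

key = np.abs(x)  # computed once, reusable for filtering and ordering

# Filter on the precomputed key stream instead of re-evaluating abs(x).
selected = np.flatnonzero((key >= 2) & (key < 8))
# kind="stable" preserves original order among equal keys.
ordered = selected[np.argsort(key[selected], kind="stable")]
print(ordered)  # matches the expression-order positions printed above
```

This is why expression indexes are matched by normalized expression identity: any lazy expression that reduces to the same key stream can share the same persisted payloads.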
Persistence: automatic or manual?¶
Index persistence follows the base array by default:
- for a persistent array (urlpath=...), persistent=None means the index sidecars are persisted automatically,
- for an in-memory array, the index lives only in memory,
- on a persistent array, persistent=False keeps the index process-local instead of writing sidecars.
In practice, if you want an index to survive reopen, persist the array and use the default behavior.
[7]:
persistent_arr = data.copy(urlpath=paths[0], mode="w")
persistent_descriptor = persistent_arr.create_index(field="id", kind=blosc2.IndexKind.PARTIAL)
show_index_summary("persistent partial", persistent_descriptor)
reopened = blosc2.open(paths[0], mode="a")
print(f"Reopened index count: {len(reopened.indexes)}")
print(f"Persisted sidecar path: {reopened.indexes[0]['partial']['values_path']}")
persistent partial: kind=partial, persistent=True, ooc=True, stale=False
Reopened index count: 1
Persisted sidecar path: indexing_tutorial_partial.__index__.id.partial.partial.values.b2nd
When to rebuild an index¶
Appending is special-cased and keeps compatible indexes current. General mutation and resize operations do not. After unsupported mutations, the index is marked stale and should be refreshed explicitly with rebuild_index().
[8]:
mutable_arr = blosc2.arange(20, dtype=np.int64)
mutable_arr.create_index(kind=blosc2.IndexKind.FULL)
mutable_arr[:3] = -1
print("Stale after direct mutation:", mutable_arr.indexes[0]["stale"])
mutable_arr.rebuild_index()
print("Stale after rebuild:", mutable_arr.indexes[0]["stale"])
Stale after direct mutation: True
Stale after rebuild: False
Recommended workflow for append-heavy full indexes¶
Appending to a full index is intentionally cheap: appended tails become sorted runs instead of forcing an immediate rewrite of the compact base sidecars.
That means the recommended workflow is:
- create a persistent full index once,
- append freely during ingestion,
- let masks keep working while runs accumulate,
- call compact_index() after ingestion windows or before latency-sensitive read phases.
The next example uses a larger append-heavy array and times the same selective mask before and after compaction. The positional-filter path reports whether it is using a compact lookup layout or a run-aware fallback. After compaction, full["runs"] becomes empty again.
[9]:
append_dtype = np.dtype([("id", np.int64), ("payload", np.int32)])
base_rows = 200_000
append_batch = 500
num_runs = 40
append_data = blosc2.zeros(base_rows, dtype=append_dtype)[:]
append_data["id"] = blosc2.arange(base_rows, dtype=np.int64)
append_data["payload"] = blosc2.arange(base_rows, dtype=np.int32)
append_arr = blosc2.asarray(append_data, urlpath=paths[1], mode="w")
append_arr.create_index(field="id", kind=blosc2.IndexKind.FULL)
for run in range(num_runs):
    start = 300_000 + run * append_batch
    batch = blosc2.zeros(append_batch, dtype=append_dtype)[:]
    batch["id"] = blosc2.arange(start, start + append_batch, dtype=np.int64)
    batch["payload"] = blosc2.arange(append_batch, dtype=np.int32)
    append_arr.append(batch)
mask_str = "(id >= 310_000) & (id < 310_020)"
append_expr = blosc2.lazyexpr(mask_str, append_arr.fields).where(append_arr)
before_info = explain_subset(append_expr)
before_ms = median_ms(lambda: append_expr.compute(), repeats=5)
print("Before compaction:", before_info)
print("Pending runs:", len(append_arr.indexes[0]["full"]["runs"]))
print(f"Median mask time before compaction: {before_ms:.3f} ms")
append_arr.compact_index("id")
append_expr = blosc2.lazyexpr(mask_str, append_arr.fields).where(append_arr)
after_info = explain_subset(append_expr)
after_ms = median_ms(lambda: append_expr.compute(), repeats=5)
print("After compaction:", after_info)
print("Pending runs:", len(append_arr.indexes[0]["full"]["runs"]))
print(f"Median mask time after compaction: {after_ms:.3f} ms")
print(f"Speedup after compaction: {before_ms / after_ms:.2f}x")
Before compaction: {'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'run-bounded-ooc', 'full_runs': 40}
Pending runs: 40
Median mask time before compaction: 0.299 ms
After compaction: {'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'compact-selective-ooc', 'full_runs': 0}
Pending runs: 0
Median mask time after compaction: 0.266 ms
Speedup after compaction: 1.12x
Practical guidance¶
- Use partial when your main goal is faster selective masks.
- Use full when you also want ordered reuse through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), or iter_sorted(...).
- Persist the base array if you want indexes to survive reopen automatically.
- After unsupported mutations, use rebuild_index().
- For append-heavy full indexes, compact explicitly at convenient maintenance boundaries instead of on every append.
- Measure your own workload: compact indexes, predicate selectivity, and ordered access needs all affect which kind is best.
[10]:
for path in paths:
blosc2.remove_urlpath(path)