Indexing Arrays

Blosc2 can attach indexes to 1-D NDArray objects and to fields inside 1-D structured arrays. These indexes accelerate selective mask queries, and full indexes can also drive ordered access directly through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), and iter_sorted(...).

This tutorial covers:

  • how to create field and expression indexes,

  • how to tell whether a mask is using an index,

  • what sort of acceleration different index kinds can deliver on a selective mask,

  • how index persistence works,

  • when to rebuild indexes,

  • and a recommended workflow for keeping append-heavy full indexes compact.

Setup

[1]:
import statistics
import time
from pathlib import Path

import numpy as np

import blosc2


def format_bytes(nbytes):
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    value = float(nbytes)
    for unit in units:
        if value < 1024.0 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} {unit}"
            return f"{value:.2f} {unit}"
        value /= 1024.0


def show_index_summary(label, descriptor):
    print(
        f"{label}: kind={descriptor['kind']}, persistent={descriptor['persistent']}, "
        f"ooc={descriptor['ooc']}, stale={descriptor['stale']}"
    )


def explain_subset(expr):
    info = expr.explain()
    keys = ("will_use_index", "reason", "kind", "level", "lookup_path", "full_runs")
    return {key: info[key] for key in keys if key in info}


def median_ms(func, repeats=5, warmup=1):
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


paths = [
    Path("indexing_tutorial_partial.b2nd"),
    Path("indexing_tutorial_append_full.b2nd"),
]
for path in paths:
    blosc2.remove_urlpath(path)

Index kinds and how to create them

Blosc2 currently supports four index kinds:

  • summary: compact summaries only,

  • bucket: summary levels plus lightweight per-block payloads,

  • partial: richer payloads for positional filtering,

  • full: globally sorted payloads for positional filtering and ordered reuse.

There is one active index per target field or expression. If you create another index on the same target, it replaces the previous one. The easiest way to compare kinds is to build them on separate arrays.
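To build intuition for the cheapest tier, here is a conceptual sketch of how a summary-style index can prune work (plain Python with invented helper names, not Blosc2 internals): keep only a (min, max) pair per chunk, and a range predicate then has to scan just the chunks whose ranges can overlap the query.

```python
def build_summary(values, chunk_size):
    """Record (min, max) per chunk -- the essence of a summary index."""
    chunks = (values[i : i + chunk_size] for i in range(0, len(values), chunk_size))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def chunks_to_scan(summary, lo, hi):
    """Chunks whose [min, max] range can intersect the half-open query [lo, hi)."""
    return [i for i, (cmin, cmax) in enumerate(summary) if cmax >= lo and cmin < hi]


values = [7, 9, 8, 6, 1, 0, 3, 2, 12, 15, 13, 14]  # three chunks of four
summary = build_summary(values, chunk_size=4)
print(chunks_to_scan(summary, lo=0, hi=4))  # -> [1]: only the middle chunk qualifies
```

Note that this only pays off when values are clustered per chunk; on well-shuffled data every chunk overlaps every query, which is why summaries alone may not beat a scan.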

The next cell times index creation and reports the compressed storage footprint of each index relative to the compressed base array.

[2]:
N_ROWS = 10_000_000
MASK_TEXT = "(id >= -5.0) & (id < 5.0)"

rng = np.random.default_rng(0)
dtype = np.dtype([("id", np.float64), ("payload", np.int32)])
ids = np.arange(-N_ROWS // 2, N_ROWS // 2, dtype=np.float64)
rng.shuffle(ids)
data = blosc2.fromiter(((id_, i) for i, id_ in enumerate(ids)), shape=(N_ROWS,), dtype=dtype)

indexed_arrays = {}
build_rows = []
base_cbytes = data.cbytes
for kind in (
    blosc2.IndexKind.SUMMARY,
    blosc2.IndexKind.BUCKET,
    blosc2.IndexKind.PARTIAL,
    blosc2.IndexKind.FULL,
):
    arr = data.copy()
    t0 = time.perf_counter()
    arr.create_index(field="id", kind=kind)
    build_ms = (time.perf_counter() - t0) * 1e3
    index_obj = arr.index("id")
    indexed_arrays[kind.value] = arr
    build_rows.append((kind.value, build_ms, index_obj.cbytes, index_obj.cbytes / base_cbytes))

print(f"Compressed base array size: {format_bytes(base_cbytes)}")
print(f"{'kind':<12} {'build_ms':>10} {'index_size':>12} {'overhead':>10}")
for kind, build_ms, index_cbytes, overhead in build_rows:
    print(f"{kind:<12} {build_ms:10.3f} {format_bytes(index_cbytes):>12} {overhead:>9.2f}x")
Compressed base array size: 30.74 MiB
kind           build_ms   index_size   overhead
summary          26.726        142 B      0.00x
bucket          455.373    26.04 MiB      0.85x
partial         404.564    34.99 MiB      1.14x
full           1635.311    28.44 MiB      0.93x

Using an index for masks

Range predicates are planned against available indexes automatically when you use where(...). If you just want the matching values, expr[:] is the shortest form. In the comparisons below we use compute() so the result stays an NDArray, and we force a full scan by passing _use_index=False.

[3]:
partial_arr = indexed_arrays["partial"]
expr = blosc2.lazyexpr(MASK_TEXT, partial_arr.fields).where(partial_arr)

print(explain_subset(expr))

indexed = expr.compute()
scanned = expr.compute(_use_index=False)
np.testing.assert_array_equal(indexed, scanned)
print(f"Matched rows: {len(indexed)}")
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'partial', 'level': 'partial', 'lookup_path': 'chunk-nav', 'full_runs': 0}
Matched rows: 10

Timing the mask with and without indexes

The next cell measures the same selective mask on all four index kinds and compares each against a forced full scan. On this workload, the positional kinds (bucket, partial, and full) deliver the clearest benefit: their per-position payloads let the engine jump straight to the matching rows, while a summary alone cannot prune chunks of well-shuffled data enough to beat a scan.
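Why a positional payload makes a selective range predicate so cheap can be seen in miniature (a sketch with invented helper names, not the library's data layout): with (value, position) pairs kept in value order, a range query reduces to two binary searches plus a short slice, instead of a pass over every row.

```python
import bisect


def build_payload(values):
    # Sorted (value, original_position) pairs: the essence of a positional payload.
    return sorted((v, pos) for pos, v in enumerate(values))


def range_positions(payload, lo, hi):
    # Two binary searches bound the matching run; no full scan needed.
    keys = [v for v, _ in payload]
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_left(keys, hi)
    return sorted(pos for _, pos in payload[start:stop])


values = [40, -3, 18, 2, -7, 5]
print(range_positions(build_payload(values), lo=-5, hi=5))  # -> [1, 3]
```

The work scales with the number of matches rather than the number of rows, which is why the speedup grows as the predicate gets more selective.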

[4]:
timing_rows = []
expected = None
for kind, arr in indexed_arrays.items():
    expr = blosc2.lazyexpr(MASK_TEXT, arr.fields).where(arr)
    result = expr.compute()
    if expected is None:
        expected = result
    else:
        np.testing.assert_array_equal(result, expected)

    scan_ms = median_ms(lambda expr=expr: expr.compute(_use_index=False), repeats=3)
    index_ms = median_ms(lambda expr=expr: expr.compute(), repeats=3)
    timing_rows.append((kind, scan_ms, index_ms, scan_ms / index_ms))

print(f"Selective mask over {N_ROWS:,} rows")
print(f"{'kind':<12} {'scan_ms':>11} {'index_ms':>10} {'speedup':>10}")
for kind, scan_ms, index_ms, speedup in timing_rows:
    print(f"{kind:<12} {scan_ms:11.3f} {index_ms:10.3f} {speedup:10.2f}x")
Selective mask over 10,000,000 rows
kind             scan_ms   index_ms    speedup
summary           47.485     49.725       0.95x
bucket            43.921      0.941      46.68x
partial           42.991      0.921      46.67x
full              43.695      0.944      46.28x

Full indexes and ordered access

A full index stores a global sorted payload. This is the required index tier for direct ordered reuse. Build it directly with create_index(kind=blosc2.IndexKind.FULL).
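The idea behind ordered reuse can be sketched without Blosc2, using the same eight rows as the next cell (plain Python, illustration only): once a globally sorted permutation of positions exists, both argsort-style and sort-style answers fall out of the same payload.

```python
rows = [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)]

# One globally sorted permutation of positions, built once...
order = sorted(range(len(rows)), key=lambda i: rows[i])

# ...answers both "argsort" (positions) and "sort" (materialized rows).
print(order)                    # -> [7, 5, 3, 1, 6, 4, 2, 0]
print([rows[i] for i in order])  # the rows in (id, payload) order
```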

[5]:
ordered_dtype = np.dtype([("id", np.int64), ("payload", np.int64)])
ordered_data = np.array(
    [(2, 9), (1, 8), (2, 7), (1, 6), (2, 5), (1, 4), (2, 3), (1, 2)],
    dtype=ordered_dtype,
)
ordered_arr = blosc2.asarray(ordered_data)
ordered_arr.create_index("id", kind=blosc2.IndexKind.FULL)

print("Sorted positions:", ordered_arr.argsort(order=["id", "payload"])[:])
print("Sorted rows:")
print(ordered_arr.sort(order=["id", "payload"])[:])
Sorted positions: [7 5 3 1 6 4 2 0]
Sorted rows:
[(1, 2) (1, 4) (1, 6) (1, 8) (2, 3) (2, 5) (2, 7) (2, 9)]

Expression indexes

You can also index a deterministic scalar expression stream. Expression indexes are matched by normalized expression identity, so the same expression can be reused for masks and ordered access.
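A minimal sketch of matching by normalized expression identity, using the same x values as the next cell (the normalize helper is invented and far simpler than real expression canonicalization): the index is keyed by a canonical form of the expression text, so trivially different spellings of the same expression hit the same entry.

```python
xs = [-8, 5, -2, 11, 3, -3, 2, -5]


def normalize(expression):
    # Toy canonicalization: strip all whitespace. (Real matching is richer.)
    return "".join(expression.split())


# Registry keyed by normalized expression text; the payload is a permutation
# of positions sorted by the expression's value (a full-index-style payload).
registry = {normalize("abs(x)"): sorted(range(len(xs)), key=lambda i: abs(xs[i]))}

order = registry.get(normalize("abs( x )"))  # different spelling, same entry
matches = [i for i in order if 2 <= abs(xs[i]) < 8]
print(matches)  # -> [2, 6, 4, 5, 1, 7]
```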

[6]:
expr_dtype = np.dtype([("x", np.int64), ("payload", np.int32)])
expr_data = np.array([(-8, 0), (5, 1), (-2, 2), (11, 3), (3, 4), (-3, 5), (2, 6), (-5, 7)], dtype=expr_dtype)
expr_arr = blosc2.asarray(expr_data)
expr_arr.create_index(expression="abs(x)", kind=blosc2.IndexKind.FULL, name="abs_x")

ordered_expr = blosc2.lazyexpr("(abs(x) >= 2) & (abs(x) < 8)", expr_arr.fields).where(expr_arr)
print(explain_subset(ordered_expr))
print("Expression-order positions:", ordered_expr.argsort(order="abs(x)").compute()[:])
{'will_use_index': True, 'reason': 'multi-field positional indexes selected', 'kind': 'full', 'level': 'partial', 'lookup_path': 'sidecar-stream', 'full_runs': 0}
Expression-order positions: [2 6 4 5 1 7]

Persistence: automatic or manual?

Index persistence follows the base array by default:

  • for a persistent array (urlpath=...), persistent=None means the index sidecars are persisted automatically,

  • for an in-memory array, the index lives only in memory,

  • on a persistent array, persistent=False keeps the index process-local instead of writing sidecars.

In practice, if you want an index to survive reopen, persist the array and use the default behavior.

[7]:
persistent_arr = data.copy(urlpath=paths[0], mode="w")
persistent_descriptor = persistent_arr.create_index(field="id", kind=blosc2.IndexKind.PARTIAL)
show_index_summary("persistent partial", persistent_descriptor)

reopened = blosc2.open(paths[0], mode="a")
print(f"Reopened index count: {len(reopened.indexes)}")
print(f"Persisted sidecar path: {reopened.indexes[0]['partial']['values_path']}")
persistent partial: kind=partial, persistent=True, ooc=True, stale=False
Reopened index count: 1
Persisted sidecar path: indexing_tutorial_partial.__index__.id.partial.partial.values.b2nd

When to rebuild an index

Appending is special-cased and keeps compatible indexes current. General mutation and resize operations do not. After unsupported mutations, the index is marked stale and should be refreshed explicitly with rebuild_index().
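The stale flag follows a common invalidation pattern, sketched here in plain Python (an invented class, not Blosc2's implementation): appends maintain the payload incrementally, untracked mutations flip a flag, and a rebuild recomputes the payload and clears it.

```python
import bisect


class IndexedList:
    def __init__(self, values):
        self.values = list(values)
        self.rebuild_index()

    def rebuild_index(self):
        # Recompute the sorted-position payload and clear the stale flag.
        self.order = sorted(range(len(self.values)), key=self.values.__getitem__)
        self.stale = False

    def append(self, value):
        # Appends keep the index current via a cheap incremental insert.
        keys = [self.values[i] for i in self.order]
        self.values.append(value)
        self.order.insert(bisect.bisect_left(keys, value), len(self.values) - 1)

    def __setitem__(self, where, value):
        # General mutation is not tracked: mark the index stale.
        self.values[where] = value
        self.stale = True


arr = IndexedList(range(5))
arr[0] = 99
print(arr.stale)  # -> True: mutation invalidated the index
arr.rebuild_index()
print(arr.stale)  # -> False
```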

[8]:
mutable_arr = blosc2.arange(20, dtype=np.int64)
mutable_arr.create_index(kind=blosc2.IndexKind.FULL)
mutable_arr[:3] = -1

print("Stale after direct mutation:", mutable_arr.indexes[0]["stale"])
mutable_arr.rebuild_index()
print("Stale after rebuild:", mutable_arr.indexes[0]["stale"])
Stale after direct mutation: True
Stale after rebuild: False

Practical guidance

  • Use partial when your main goal is faster selective masks.

  • Use full when you also want ordered reuse through sort(order=...), NDArray.argsort(order=...), LazyExpr.argsort(order=...), or iter_sorted(...).

  • Persist the base array if you want indexes to survive reopen automatically.

  • After unsupported mutations, use rebuild_index().

  • For append-heavy full indexes, compact explicitly at convenient maintenance boundaries instead of on every append.

  • Measure your own workload: compact indexes, predicate selectivity, and ordered access needs all affect which kind is best.
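The compaction idea in the append-heavy bullet can be sketched generically (invented helper, illustration only): incremental appends tend to leave several small sorted runs of (value, position) pairs, and compaction merges them back into one globally sorted payload in a single pass.

```python
import heapq

# Small sorted (value, position) runs left behind by incremental appends.
runs = [[(1, 0), (4, 3)], [(2, 5), (3, 6)], [(0, 8)]]


def compact(sorted_runs):
    # k-way merge of already-sorted runs into one globally sorted payload.
    return list(heapq.merge(*sorted_runs))


print(compact(runs))  # -> [(0, 8), (1, 0), (2, 5), (3, 6), (4, 3)]
```

Doing this once at a maintenance boundary amortizes the merge cost over many appends, instead of paying it on every append.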

[10]:
for path in paths:
    blosc2.remove_urlpath(path)