Working with BatchArray
A BatchArray is a batch-oriented container for variable-length Python items backed by a single SChunk. Each batch is stored in one compressed chunk, and each chunk may contain one or more internal variable-length blocks.
This makes BatchArray a good fit when data arrives naturally in batches and you want efficient batch append/update operations together with occasional item-level access inside each batch.
[1]:
import blosc2
def show(label, value):
    print(f"{label}: {value}")
urlpath = "batcharray_tutorial.b2b"
copy_path = "batcharray_tutorial_copy.b2b"
blosc2.remove_urlpath(urlpath)
blosc2.remove_urlpath(copy_path)
Creating and populating a BatchArray
A BatchArray is indexed by batch. Batches can be appended one by one with append() or in bulk with extend(). Here we set a small items_per_block just so the internal block structure is easy to observe in .info.
[2]:
store = blosc2.BatchArray(urlpath=urlpath, mode="w", contiguous=True, items_per_block=2)
store.append(
    [
        {"name": "alpha", "count": 1},
        {"name": "beta", "count": 2},
        {"name": "gamma", "count": 3},
    ]
)
store.append(
    [
        {"name": "delta", "count": 4},
        {"name": "epsilon", "count": 5},
    ]
)
store.extend(
    [
        [{"name": "zeta", "count": 6}],
        [{"name": "eta", "count": 7}, {"name": "theta", "count": 8}],
        [
            {"name": "iota", "count": 9},
            {"name": "kappa", "count": 10},
            {"name": "lambda", "count": 11},
        ],
    ]
)
show("Batches", [batch[:] for batch in store])
show("Number of batches", len(store))
Batches: [[{'name': 'alpha', 'count': 1}, {'name': 'beta', 'count': 2}, {'name': 'gamma', 'count': 3}], [{'name': 'delta', 'count': 4}, {'name': 'epsilon', 'count': 5}], [{'name': 'zeta', 'count': 6}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
Number of batches: 5
Batch and item access
Indexing the store returns a batch. Indexing a batch returns an item inside that batch. Flat item-wise traversal is available through iter_items().
[3]:
show("First batch", store[0][:])
show("Second item in first batch", store[0][1])
show("Slice of second batch", store[1][:1])
show("All items", list(store.iter_items()))
First batch: [{'name': 'alpha', 'count': 1}, {'name': 'beta', 'count': 2}, {'name': 'gamma', 'count': 3}]
Second item in first batch: {'name': 'beta', 'count': 2}
Slice of second batch: [{'name': 'delta', 'count': 4}]
All items: [{'name': 'alpha', 'count': 1}, {'name': 'beta', 'count': 2}, {'name': 'gamma', 'count': 3}, {'name': 'delta', 'count': 4}, {'name': 'epsilon', 'count': 5}, {'name': 'zeta', 'count': 6}, {'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}, {'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]
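The two indexing levels and the flat traversal can be seen in isolation with plain Python lists. This is a minimal sketch, not blosc2 code: the nested list stands in for a BatchArray, and `iter_flat` is a hypothetical helper that only mirrors the traversal order shown by iter_items() above.

```python
from itertools import chain

# Plain lists stand in for the batches of a BatchArray (illustrative data only).
batches = [
    [{"name": "alpha", "count": 1}, {"name": "beta", "count": 2}],
    [{"name": "delta", "count": 4}],
]


def iter_flat(batches):
    """Yield every item of every batch, batch by batch (hypothetical helper)."""
    return chain.from_iterable(batches)


# First index selects a batch, second index selects an item inside it.
second_item_first_batch = batches[0][1]

# Flat traversal visits items in batch order.
flat_names = [item["name"] for item in iter_flat(batches)]

print(second_item_first_batch)
print(flat_names)
```

The sketch says nothing about how blosc2 decompresses chunks; it only shows the order in which items are visited.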
Updating, inserting, and deleting batches
Mutation is batch-oriented too: you overwrite, insert, delete, and pop whole batches.
[4]:
store[1] = [
    {"name": "delta*", "count": 40},
    {"name": "epsilon*", "count": 50},
]
store.insert(2, [{"name": "between-a", "count": 99}, {"name": "between-b", "count": 100}])
removed = store.pop(3)
del store[0]
store.insert(0, [{"name": "alpha*", "count": 10}, {"name": "beta*", "count": 20}])
show("Popped batch", removed)
show("After updates", [batch[:] for batch in store])
Popped batch: [{'name': 'zeta', 'count': 6}]
After updates: [[{'name': 'alpha*', 'count': 10}, {'name': 'beta*', 'count': 20}], [{'name': 'delta*', 'count': 40}, {'name': 'epsilon*', 'count': 50}], [{'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
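Because each insert and delete shifts the batch indices, it can help to trace the same sequence on an ordinary list of lists. The sketch below assumes that BatchArray's batch-level mutation follows Python list semantics, which is consistent with the output above. Short strings replace the dictionaries to keep it readable.

```python
# Each inner list stands in for one batch (names only, for brevity).
batches = [
    ["alpha", "beta", "gamma"],
    ["delta", "epsilon"],
    ["zeta"],
    ["eta", "theta"],
    ["iota", "kappa", "lambda"],
]

batches[1] = ["delta*", "epsilon*"]            # overwrite batch 1 in place
batches.insert(2, ["between-a", "between-b"])  # "zeta" batch shifts from index 2 to 3
removed = batches.pop(3)                       # so pop(3) removes the "zeta" batch
del batches[0]                                 # drop the original first batch
batches.insert(0, ["alpha*", "beta*"])         # and put a new one in its place

print(removed)
print(batches)
```

The trace shows why `pop(3)` returns the `zeta` batch rather than the `eta`/`theta` one: the earlier `insert(2, ...)` had already moved it up by one position.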
Iteration and summary info
Iterating a BatchArray yields batches. The .info summary reports both batch-level and internal block-level statistics.
[5]:
show("Batches via iteration", [batch[:] for batch in store])
print(store.info)
Batches via iteration: [[{'name': 'alpha*', 'count': 10}, {'name': 'beta*', 'count': 20}], [{'name': 'delta*', 'count': 40}, {'name': 'epsilon*', 'count': 50}], [{'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
type : BatchArray
serializer : msgpack
nbatches : 5 (items per batch: mean=2.20, max=3, min=2)
nblocks : 6 (items per block: mean=1.83, max=2, min=1)
nitems : 11
nbytes : 226 (226 B)
cbytes : 680 (680 B)
cratio : 0.33
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=5, use_dict=False, typesize=1,
: nthreads=12, blocksize=0, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0,
: 0, 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=12)
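The block and ratio figures in this summary can be reproduced by hand. The sketch below assumes that a batch of n items is split into ceil(n / items_per_block) blocks, with the last block holding the remainder. That rule is inferred from the numbers above, not taken from the blosc2 source. The batch sizes and byte counts are copied from the output.

```python
import math

items_per_block = 2            # value passed when the store was created
batch_sizes = [2, 2, 2, 2, 3]  # items in each batch after the updates above

# Assumed splitting rule: full blocks first, remainder in a final shorter block.
blocks_per_batch = [math.ceil(n / items_per_block) for n in batch_sizes]

nitems = sum(batch_sizes)
nblocks = sum(blocks_per_batch)
mean_items_per_batch = nitems / len(batch_sizes)
mean_items_per_block = nitems / nblocks

# cratio is uncompressed size divided by compressed size.
nbytes, cbytes = 226, 680
cratio = nbytes / cbytes

print(f"nblocks={nblocks}, items/batch={mean_items_per_batch:.2f}, "
      f"items/block={mean_items_per_block:.2f}, cratio={cratio:.2f}")
```

A cratio below 1 means the stored form is larger than the serialized payload. With only 11 tiny items, fixed per-chunk and per-frame overhead most likely dominates, so the figure says little about how realistic batch sizes would compress.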
Copying and changing storage settings
Like other Blosc2 containers, BatchArray.copy() can write a new persistent store while changing storage or compression settings.
[6]:
store_copy = store.copy(
    urlpath=copy_path,
    contiguous=False,
    cparams={"codec": blosc2.Codec.LZ4, "clevel": 5},
)
show("Copied batches", [batch[:] for batch in store_copy])
show("Copy serializer", store_copy.serializer)
show("Copy codec", store_copy.cparams.codec)
Copied batches: [[{'name': 'alpha*', 'count': 10}, {'name': 'beta*', 'count': 20}], [{'name': 'delta*', 'count': 40}, {'name': 'epsilon*', 'count': 50}], [{'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
Copy serializer: msgpack
Copy codec: Codec.LZ4
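Round-tripping through cframes and reopening from disk
Persistent stores are tagged as BatchArray, so blosc2.open() returns a BatchArray without being told the container type. A serialized cframe buffer round-trips the same way through blosc2.from_cframe().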
[7]:
cframe = store.to_cframe()
restored = blosc2.from_cframe(cframe)
show("from_cframe type", type(restored).__name__)
show("from_cframe batches", [batch[:] for batch in restored])
reopened = blosc2.open(urlpath, mode="r", mmap_mode="r")
show("Reopened type", type(reopened).__name__)
show("Reopened batches", [batch[:] for batch in reopened])
from_cframe type: BatchArray
from_cframe batches: [[{'name': 'alpha*', 'count': 10}, {'name': 'beta*', 'count': 20}], [{'name': 'delta*', 'count': 40}, {'name': 'epsilon*', 'count': 50}], [{'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
Reopened type: BatchArray
Reopened batches: [[{'name': 'alpha*', 'count': 10}, {'name': 'beta*', 'count': 20}], [{'name': 'delta*', 'count': 40}, {'name': 'epsilon*', 'count': 50}], [{'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}], [{'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}], [{'name': 'iota', 'count': 9}, {'name': 'kappa', 'count': 10}, {'name': 'lambda', 'count': 11}]]
Clearing and reusing a store
Calling clear() removes every batch and resets the backing storage, leaving an empty container that is ready to accept new batches.
[8]:
scratch = store.copy()
scratch.clear()
scratch.extend(
    [
        [{"name": "fresh", "count": 1}],
        [{"name": "again", "count": 2}, {"name": "done", "count": 3}],
    ]
)
show("After clear + extend", [batch[:] for batch in scratch])
After clear + extend: [[{'name': 'fresh', 'count': 1}], [{'name': 'again', 'count': 2}, {'name': 'done', 'count': 3}]]
Flat item access with .items
The main BatchArray API remains batch-oriented, but the .items accessor offers a read-only flat view across all items. Integer indexing returns one item and slicing returns a Python list.
[9]:
show("Flat item 0", store.items[0])
show("Flat item 6", store.items[6])
show("Flat slice 3:8", store.items[3:8])
Flat item 0: {'name': 'alpha*', 'count': 10}
Flat item 6: {'name': 'eta', 'count': 7}
Flat slice 3:8: [{'name': 'epsilon*', 'count': 50}, {'name': 'between-a', 'count': 99}, {'name': 'between-b', 'count': 100}, {'name': 'eta', 'count': 7}, {'name': 'theta', 'count': 8}]
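One way to picture a flat index is as a (batch, offset) pair derived from the cumulative batch sizes. The sketch below is only an illustration of that mapping; `locate` is a hypothetical helper, and nothing here is claimed to be how the .items accessor is implemented. The batch sizes match the store at this point in the tutorial.

```python
from bisect import bisect_right
from itertools import accumulate

batch_sizes = [2, 2, 2, 2, 3]  # items per batch in the current store

# Flat index of the first item of each batch: [0, 2, 4, 6, 8].
starts = [0, *accumulate(batch_sizes)][:-1]
total = sum(batch_sizes)


def locate(flat_index):
    """Map a flat item index to (batch index, offset inside that batch)."""
    if not 0 <= flat_index < total:
        raise IndexError(f"flat index {flat_index} out of range")
    batch = bisect_right(starts, flat_index) - 1
    return batch, flat_index - starts[batch]


print(locate(0))   # first item of the first batch
print(locate(6))   # lands at the start of the fourth batch ('eta' above)
print(locate(3))   # second item of the second batch ('epsilon*' above)
```

The binary search keeps each lookup logarithmic in the number of batches, which is one reason a cumulative-offset table is a common choice for this kind of flat view.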
[10]:
blosc2.remove_urlpath(urlpath)
blosc2.remove_urlpath(copy_path)