BatchArray¶
Overview¶
BatchArray is a batch-oriented container for variable-length Python items
backed by a single Blosc2 SChunk.
Each batch is stored in one compressed chunk:
- batches contain one or more Python items
- each chunk may contain one or more internal variable-length blocks
- the store itself is indexed by batch
- item-wise traversal is available via BatchArray.iter_items()
BatchArray is a good fit when data arrives naturally in batches and you want:
- efficient batch append/update operations
- persistent .b2b stores
- item-level reads inside a batch
- compact summary information about batches and internal blocks via .info
Serializer support¶
BatchArray currently supports two serializers:
"msgpack": the default and general-purpose choice for Python items"arrow": optional and requirespyarrow; mainly useful when data is already Arrow-shaped before ingestion
Quick example¶
import blosc2
store = blosc2.BatchArray(urlpath="example_batch_array.b2b", mode="w", contiguous=True)
store.append([{"red": 1, "green": 2, "blue": 3}, {"red": 4, "green": 5, "blue": 6}])
store.append([{"red": 7, "green": 8, "blue": 9}])
print(store[0]) # first batch
print(store[0][1]) # second item in first batch
print(list(store.iter_items()))
reopened = blosc2.open("example_batch_array.b2b", mode="r")
print(type(reopened).__name__)
print(reopened.info)
Note
BatchArray is batch-oriented by design. store[i] returns a batch, not a
single item. Use BatchArray.iter_items() for flat item-wise traversal.
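The batch-versus-item distinction can be sketched with plain Python lists. This is a conceptual analogy only, not the blosc2 implementation: a BatchArray behaves like a list of batches, and iter_items() corresponds to flattening those batches in order.

```python
from itertools import chain

# Analogy (assumption): a BatchArray is shaped like a list of batches,
# and iter_items() yields the items of every batch, in order.
batches = [
    [{"red": 1, "green": 2, "blue": 3}, {"red": 4, "green": 5, "blue": 6}],
    [{"red": 7, "green": 8, "blue": 9}],
]

first_batch = batches[0]       # indexing returns a whole batch
second_item = batches[0][1]    # item access inside a batch
all_items = list(chain.from_iterable(batches))  # flat item-wise traversal

# len(batches) == 2 but len(all_items) == 3
```

The same shape explains the quick example above: two append() calls yield two batches but three items in total.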
- class blosc2.BatchArray(items_per_block: int | None = None, serializer: str = 'msgpack', _from_schunk: SChunk | None = None, **kwargs: Any)[source]¶
A batched container for variable-length Python items.
BatchArray stores data as a sequence of batches, where each batch contains one or more Python items. Each batch is stored in one compressed chunk, and each chunk is internally split into one or more variable-length blocks for efficient item access.
The main abstraction is batch-oriented:
- indexing the store returns batches
- iterating the store yields batches
- iter_items() provides flat item-wise traversal
BatchArray is a good fit when:
- data arrives naturally in batches
- batch-level append/update operations are important
- occasional item-level reads are needed inside a batch
- Parameters:
items_per_block¶ (int, optional) – Maximum number of items stored in each internal variable-length block. The last block in a batch may contain fewer items than this cap. If not provided, a value is inferred from the first batch.
serializer¶ ({"msgpack", "arrow"}, optional) – Serializer used for batch payloads.
"msgpack"is the default and is the general-purpose choice for Python items, including nested Blosc2 containers such asblosc2.NDArray,blosc2.SChunk,blosc2.VLArray,blosc2.BatchArray, andblosc2.EmbedStore, which are serialized transparently viato_cframe()/blosc2.from_cframe(). Msgpack also supports structured Blosc2 reference objects, currentlyblosc2.C2Array,blosc2.LazyExpr, andblosc2.LazyUDFbacked byblosc2.dsl_kernel(). These lazy objects preserve reference semantics, so only persistent local operands,blosc2.C2Arrayoperands, andblosc2.DictStoremembers are supported; purely in-memory operands are rejected. Plain Pythonblosc2.LazyUDFcallables are not serialized by msgpack."arrow"is optional and requirespyarrow._from_schunk¶ (blosc2.SChunk, optional) – Internal hook used when reopening an already-tagged BatchArray.
**kwargs¶ – Storage, compression, and decompression arguments accepted by the constructor.
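The blocking rule behind items_per_block can be sketched as follows. split_into_blocks is a hypothetical helper written for illustration; it is not part of the blosc2 API, but it shows the documented invariant that every internal block holds at most items_per_block items and only the last block of a batch may hold fewer.

```python
# Hypothetical helper (not blosc2 code) illustrating the
# items_per_block rule: blocks are filled to the cap, and the
# last block in a batch may contain fewer items.
def split_into_blocks(items, items_per_block):
    return [items[i:i + items_per_block]
            for i in range(0, len(items), items_per_block)]

blocks = split_into_blocks(["a", "b", "c", "d", "e"], 2)
# blocks == [["a", "b"], ["c", "d"], ["e"]]
```

With this layout, reading one item only requires decoding the block that contains it rather than the whole batch payload.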
- Attributes:
- cbytes
- contiguous
- cparams
- cratio
- dparams
- info: Return an info reporter with a compact summary of the store.
- info_items: Return summary information as (name, value) pairs.
- items
- items_per_block: Maximum number of items per internal block.
- meta
- nbytes
- serializer: Serializer name used for batch payloads.
- typesize
- urlpath
- vlmeta
Methods
- append(value): Append one batch and return the new number of batches.
- clear(): Remove all entries from the container.
- copy(**kwargs): Create a copy of the store with optional constructor overrides.
- delete(index): Delete the batch at index and return the new number of batches.
- extend(values): Append all batches from an iterable of batches.
- insert(index, value): Insert one batch at index and return the new number of batches.
- iter_items(): Iterate over all items across all batches in order.
- pop([index]): Remove and return the batch at index as a Python list.
- to_cframe(): Serialize the full store to a Blosc2 cframe buffer.
Constructors¶
- __init__(items_per_block: int | None = None, serializer: str = 'msgpack', _from_schunk: SChunk | None = None, **kwargs: Any) None[source]¶
Create a new BatchArray or reopen an existing one.
When a persistent urlpath points to an existing BatchArray and the mode is "r" or "a", the container is reopened automatically. Otherwise a new empty store is created.
Batch Interface¶
Mutation¶
- insert(index: int, value: object) int[source]¶
Insert one batch at index and return the new number of batches.
- delete(index: int | slice) int[source]¶
Delete the batch at index and return the new number of batches.
- copy(**kwargs: Any) BatchArray[source]¶
Create a copy of the store with optional constructor overrides.
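The mutation methods above operate on whole batches, and their index semantics mirror the corresponding Python list operations. A plain-list analogy (an assumption for illustration, not blosc2 code):

```python
# Analogy (assumption): insert/delete/pop act on whole batches the
# way the matching list operations act on list elements.
store = [["a", "b"], ["c"]]     # two batches
store.insert(1, ["x", "y"])     # insert a batch at index 1
removed = store.pop(0)          # remove and return the first batch

# store == [["x", "y"], ["c"]] and removed == ["a", "b"]
```

Note that each operation moves a batch, never an individual item; items can only be reached through a batch.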
Context Manager¶
- __enter__() BatchArray[source]¶
Public Members¶
- class blosc2.Batch(parent: BatchArray, nbatch: int, lazybatch: bytes)[source]¶
A lazy sequence representing one batch in a BatchArray.
Batch provides sequence-style access to the items stored in a single batch. Integer indexing can use block-local reads when possible, while slicing materializes the full batch into Python items.
Batch instances are normally obtained via BatchArray indexing or iteration rather than constructed directly.
- Attributes:
- cbytes
- cratio
- lazybatch
- nbytes
Methods
- count(value)
- index(value, [start, [stop]]): Raises ValueError if the value is not present.
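count() and index() follow standard Python sequence semantics. A plain-list sketch of the expected behavior (an analogy, not the Batch implementation):

```python
# Analogy (assumption): a Batch supports sequence-style lookups,
# so count() and index() behave like their list counterparts.
batch = [{"red": 1}, {"red": 4}, {"red": 1}]

occurrences = batch.count({"red": 1})   # number of matching items
position = batch.index({"red": 4})      # index of the first match
try:
    batch.index({"red": 99})            # absent value raises ValueError
except ValueError:
    position_missing = None
```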