What is it?

Python-Blosc2 is a high-performance compressed ndarray library with a flexible compute engine. It uses the C-Blosc2 library as its compression backend. C-Blosc2 is the next generation of Blosc, an award-winning library that has been around for more than a decade and is used by many projects, including PyTables and Zarr.

Python-Blosc2 is a Python wrapper around the C-Blosc2 library, enhanced with an integrated compute engine. This allows for complex computations on compressed data, whether the operands are in memory, on disk, or accessed over a network. This capability makes it easier to work with very large datasets, even in distributed environments.

Most importantly, Python-Blosc2 uses C-Blosc2's simple, open format for storing compressed data, which facilitates integration with other systems and tools.

Interacting with the Ecosystem

Python-Blosc2 is designed to integrate seamlessly with existing libraries and tools, offering:

  • Support for NumPy’s universal functions mechanism, enabling the combination of NumPy and Blosc2 computation engines.

  • Excellent integration with Numba and Cython via User Defined Functions.

  • Lazy expressions that are evaluated only when needed and can be stored for future use.

Python-Blosc2 leverages both NumPy and NumExpr to achieve high performance, but with key differences. The main distinctions between the new computing engine and NumPy or NumExpr include:

  • Support for compressed ndarrays stored in memory, on disk, or over the network.

  • Ability to evaluate various mathematical expressions, including reductions, indexing, and filters.

  • Support for broadcasting operations, enabling operations on arrays with different shapes.

  • Improved adherence to NumPy casting rules compared to NumExpr.

  • Support for proxies, facilitating work with compressed data on local or remote machines.

Data Containers

The main data container objects in Python-Blosc2 are:

  • SChunk: A 64-bit compressed store suitable for any data type supporting the buffer protocol.

  • NDArray: An N-Dimensional store that mirrors the NumPy API, enhanced with efficient compressed data storage.

These containers are described in more detail below.

SChunk: a 64-bit compressed store

SChunk is a simple data container that handles setting, expanding and getting data and metadata. In contrast to chunks, a super-chunk can update and resize the data that it contains, supports user metadata, and has virtually unlimited storage capacity (chunks, on the other hand, cannot store more than 2 GB).

Additionally, you can convert a SChunk into a contiguous, serialized buffer (aka cframe) and vice-versa; as a bonus, the serialization/deserialization process also works with NumPy arrays and PyTorch/TensorFlow tensors at lightning-fast speed:

[Figure: Compression speed for different codecs]

[Figure: Decompression speed for different codecs]

while reaching excellent compression ratios:

[Figure: Compression ratio for different codecs]

Also, if you are a Mac M1/M2 owner, do yourself a favor and use its native arm64 arch (yes, we are distributing Mac arm64 wheels too; you’re welcome ;-) ):

[Figure: Compression speed for different codecs on Apple M1]

[Figure: Decompression speed for different codecs on Apple M1]

Read more about SChunk features in our blog entry at: https://www.blosc.org/posts/python-blosc2-improvements

NDArray: an N-Dimensional store

A recent feature in Python-Blosc2 is the NDArray object. It builds upon the SChunk object, offering a NumPy-like API for compressed n-dimensional data.

It efficiently reads/writes n-dimensional datasets using an n-dimensional two-level partitioning scheme, enabling fine-grained slicing of large, compressed data:

https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true

As an example, see how the NDArray object excels at retrieving slices orthogonal to different axes of a 4-dimensional dataset:

https://github.com/Blosc/python-blosc2/blob/main/images/Read-Partial-Slices-B2ND.png?raw=true

More information is available in this blog post: https://www.blosc.org/posts/blosc2-ndim-intro

This short video explains why pineapple-style slicing (aka double partitioning) is useful:

[Video: Slicing a dataset in pineapple-style]

Operating with NDArrays

Python-Blosc2’s NDArray objects are designed for ease of use, demonstrated by this example:

import blosc2

N = 20_000
# N = 70_000 # for large scenario
a = blosc2.linspace(0, 1, N * N).reshape(N, N)
b = blosc2.linspace(1, 2, N * N).reshape(N, N)
c = blosc2.linspace(-10, 10, N * N).reshape(N, N)
expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)

out = expr.compute()
print(out.info)

NDArray instances resemble NumPy arrays but store compressed data, processed efficiently by Python-Blosc2’s engine.

When the operands fit in memory (20,000 x 20,000 in this example), performance approaches that of top-tier libraries like NumExpr and exceeds NumPy and Numba, while the default compression keeps memory use low. As the plot below shows, Blosc2 compression can even speed up computation, thanks to fast codecs and filters plus efficient use of CPU caches.

[Figure: Performance when operands comfortably fit in-memory]

For larger datasets whose uncompressed operands do not fit in memory (70,000 x 70,000 in this example), Python-Blosc2 rivals Dask+Zarr in performance.

[Figure: Performance when operands do not fit in memory (uncompressed)]

Blosc2 can use MKL-enabled NumExpr for optimized transcendental functions on Intel-compatible CPUs (as was done for the plots above).

Benchmark notebooks are available in the Blosc/python-blosc2 repository.

Reductions and disk-based computations

One key feature of Python-Blosc2’s compute engine is its ability to perform reductions on compressed data, optionally stored on disk, enabling calculations on datasets too large for memory.

Example:

import numpy as np
import blosc2

N = 20_000  # for small scenario
# N = 100_000 # for large scenario
a = blosc2.linspace(0, 1, N * N, shape=(N, N), urlpath="a.b2nd", mode="w")
b = blosc2.linspace(1, 2, N * N, shape=(N, N), urlpath="b.b2nd", mode="w")
c = blosc2.linspace(-10, 10, N * N, shape=(N, N))  # small and in-memory
# Expression
expr = np.sum(((a**3 + np.sin(a * 2)) < c) & (b > 0), axis=1)

# Evaluate and get a NDArray as result
out = expr.compute()
print(out.info)

This example computes the sum of a boolean array resulting from an expression, where the operands are on disk, with the result being a 1D array stored in memory (or optionally on disk via the out= parameter in compute() or sum() functions).

Check out a blog post about this feature, with performance comparisons, at: https://ironarray.io/blog/compute-bigger

Hopefully, this overview has provided a good understanding of Python-Blosc2’s capabilities. To begin your journey with Python-Blosc2, proceed to the installation instructions. Then explore the tutorials and reference sections for further information!