What is it?#
Python-Blosc2 is a high-performance compressed ndarray library with a flexible compute engine. It uses the C-Blosc2 library as the compression backend. C-Blosc2 is the next generation of Blosc, an award-winning library that has been around for more than a decade, and that is been used by many projects, including PyTables or Zarr.
Python-Blosc2 is Python wrapper that exposes the C-Blosc2 API, plus an integrated compute engine. This allows to perform complex calculations on compressed data in a way that operands do not need to be in-memory, but can be stored on disk or on the network. This makes possible to work with data no matter how large it is, and that can be stored in a distributed fashion.
Most importantly, Python-Blosc2 uses the C-Blosc2 simple and open format for storing compressed data, making it easy to integrate with other systems and tools.
Interacting with the ecosystem#
Python-Blosc2 makes special emphasis on interacting well with existing libraries and tools. In particular, it provides:
Support for NumPy universal functions mechanism, allowing to mix and match NumPy and Blosc2 computation engines.
Excellent integration with Numba and Cython via User Defined Functions.
Lazy expressions that are computed only when needed, and that can be stored for later use.
Python-Blosc2 leverages both NumPy and NumExpr for achieving great performance, but with a twist. Among the main differences between the new computing engine and NumPy or numexpr, you can find:
Support for ndarrays that can be compressed and stored in-memory, on-disk or on the network.
Can perform many kind of math expressions, including reductions, indexing, filters and more.
Support for broadcasting operations. Allows to perform operations on arrays of different shapes.
Much better adherence to the NumPy casting rules than numexpr.
Persistent reductions where ndarrays that can be updated incrementally.
Support for proxies that allow to work with compressed data on local or remote machines.
Data containers#
The main data container objects in Python-Blosc2 are:
SChunk
: a 64-bit compressed store. It can be used to store any kind of data that supports the buffer protocol.NDArray
: an N-Dimensional store. This mimic the NumPy API, but with the added capability of storing compressed data in a more efficient way.
They are described in more detail below.
SChunk: a 64-bit compressed store#
SChunk
is the simple data container that handles setting, expanding and getting
data and metadata. Contrarily to chunks, a super-chunk can update and resize the data
that it contains, supports user metadata, and it does not have the 2 GB storage limitation.
Additionally, you can convert a SChunk into a contiguous, serialized buffer (aka cframe) and vice-versa; as a bonus, the serialization/deserialization process also works with NumPy arrays and PyTorch/TensorFlow tensors at a blazing speed:
while reaching excellent compression ratios:
![Compression ratio for different codecs](https://github.com/Blosc/python-blosc2/blob/main/images/pack-array-cratios.png?raw=true)
Also, if you are a Mac M1/M2 owner, make you a favor and use its native arm64 arch (yes, we are distributing Mac arm64 wheels too; you are welcome ;-):
Read more about SChunk
features in our blog entry at: https://www.blosc.org/posts/python-blosc2-improvements
NDArray: an N-Dimensional store#
One of the latest and more exciting additions in Python-Blosc2 is the NDArray object. It can write and read n-dimensional datasets in an extremely efficient way thanks to a n-dim 2-level partitioning, allowing to slice and dice arbitrary large and compressed data in a more fine-grained way:
![https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true](https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true)
To wet you appetite, here it is how the NDArray
object performs on getting slices
orthogonal to the different axis of a 4-dim dataset:
![https://github.com/Blosc/python-blosc2/blob/main/images/Read-Partial-Slices-B2ND.png?raw=true](https://github.com/Blosc/python-blosc2/blob/main/images/Read-Partial-Slices-B2ND.png?raw=true)
We have blogged about this: https://www.blosc.org/posts/blosc2-ndim-intro
We also have a ~2 min explanatory video on why slicing in a pineapple-style (aka double partition) is useful:
![Slicing a dataset in pineapple-style](https://github.com/Blosc/blogsite/blob/master/files/images/slicing-pineapple-style.png?raw=true)
Operating with NDArrays#
The NDArray
objects are easy to work with in Python-Blosc2.
Here it is a simple example:
import blosc2
N = 20_000 # for small scenario
# N = 70_000 # for large scenario
a = blosc2.linspace(0, 1, N * N).reshape(N, N)
b = blosc2.linspace(1, 2, N * N).reshape(N, N)
c = blosc2.linspace(-10, 10, N * N).reshape(N, N)
# Expression
expr = ((a**3 + blosc2.sin(c * 2)) < b) & (c > 0)
# Evaluate and get a NDArray as result
out = expr.compute()
print(out.info)
As you can see, the NDArray
instances are very similar to NumPy arrays,
but behind the scenes, they store compressed data that can be processed
efficiently using the new computing engine included in Python-Blosc2.
To wet your appetite, here is the performance (measured on a modern desktop machine) that you can achieve when the operands in the expression above fit comfortably in memory (20_000 x 20_000):
![Performance when operands comfortably fit in-memory](https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-dask-small.png?raw=true)
In this case, the performance is somewhat below that of top-tier libraries like Numexpr, but still quite good, specially when compared with plain NumPy. For these short benchmarks, Numba normally loses because its relatively large compiling overhead cannot be amortized. And although Dask implements a similar lazy evaluation mechanism, it is not as efficient as the one in Python-Blosc2.
One important point is that the memory consumption when using the LazyArray.compute()
method is pretty low (does not exceed 100 MB) because the output is an NDArray
object,
which is compressed by default. On the other hand, the LazyArray.__getitem__()
method
returns an actual NumPy array and hence takes about 400 MB of memory (the 20,000 x 20,000
array of booleans), so using it is not recommended for large datasets, (although it may
still be convenient for small outputs, and most specially slices).
Another point is that, when using the Blosc2 engine, computation with compression is actually faster than without it (not by a large margin, but still). To understand why, you may want to read this paper.
And here it is the performance when the operands and result (70,000 x 70,000) cannot fit in memory in an uncompressed form (a machine with 64 GB of RAM, for a working set of 115 GB, uncompressed):
![Performance when operands do not fit in memory (uncompressed)](https://github.com/Blosc/python-blosc2/blob/main/images/lazyarray-dask-large.png?raw=true)
In this latter case, the memory consumption figures do not seem extreme; this
is because both Blosc2 and Dask are using compressed operands. The only difference
between the cases is that the LazyArray.__getitem__()
and Dask.compute()
methods create an uncompressed output, which is why the memory consumption is higher.
Here, the performance compared to Dask is pretty competitive. Note that, when the output is compressed (lower plot), the memory consumption is much lower than Dask, and kept constant during the computation, which is testimonial of the smart use of CPU caches and memory by the Blosc2 engine –for example, the CPU used in the experiment has 128 MB of L3, which is very close to the amount of memory used by Blosc2. This is an important point, because fitting the working set in memory is not enough; you also need to use caches and memory efficiently to get the best performance.
Last but not least, as Blosc2 can use the Numexpr engine, if you have a MKL-enabled
Numexpr (e.g. conda install numexpr mkl
), you benefit from the
Intel MKL
library, which provides a very fast and optimized library for transcendental functions
(among others). This is the version that has been used in the benchmarks above.
You can find the notebooks for these benchmarks at:
Reductions and disk-based computations#
One of the most interesting features of the new computing engine in Python-Blosc2 is the ability to perform reductions on compressed data that can optionally be stored on disk. This allows to perform calculations on data that is too large to fit in memory.
Here is a simple example:
import blosc2
N = 20_000 # for small scenario
# N = 100_000 # for large scenario
a = blosc2.linspace(0, 1, N * N, shape=(N, N), urlpath="a.b2nd", mode="w")
b = blosc2.linspace(1, 2, N * N, shape=(N, N), urlpath="b.b2nd", mode="w")
c = blosc2.linspace(-10, 10, N * N, shape=(N, N)) # small and in-memory
# Expression
expr = np.sum(((a**3 + np.sin(a * 2)) < c) & (b > 0), axis=1)
# Evaluate and get a NDArray as result
out = expr.compute()
print(out.info)
In this case, we are performing a reduction that computes the sum of the boolean
array that results from a expression. The result is a 1D array that will be
stored in memory, but that can be optionally stored on disk (via the out=
parameter in the compute()
or sum()
methods).
Here you can see a plot on how this performs for a series of operands that vary in size (run on a modern desktop machine, using Linux and a 8-core AMD CPU, with a large 96 MB L3 cache, and 64 GB of RAM):
![Performance vs operand sizes for reductions](https://github.com/Blosc/python-blosc2/blob/main/images/reduc-float64-amd.png?raw=true)
As you can see, we are expressing the performance in GB/s, which is a very useful metric when dealing with large datasets. The performance is quite good and, when compression is used, it is kept constant for all operand sizes, which is a sign that Blosc2 is using the CPU caches (and memory) efficiently.
On the other hand, when compression is not used, the performance degrades as the operand size increases, which is a sign that the CPU caches are not being used efficiently. This is a because data needs more time to be fetched from (slower disk) storage, and the CPU is not able to keep up with the data flow.
Finally, here is a plot for a much larger set of datasets (up to 400,000 x 400,000, or 2.3 TB), where the operands do not fit in memory, even when compressed:
![Performance vs large operand sizes for reductions](https://github.com/Blosc/python-blosc2/blob/main/images/reduc-float64-log-amd.png?raw=true)
In this case, we see that for operand sizes exceeding ~1 TB, the performance degrades significantly as well, but it is still quite good, specially when using disk-based operands. This demonstrates how Blosc2 is able to load data from disk more efficiently than the swap subsystem of the operating system; it can do so because it is able to grab data from disk while it is computing, so it can overlap I/O with computation.
You can find the script for these benchmarks at:
All in all, thanks to compression, a fine-tuned partitioning for leveraging modern CPU caches, and an efficient I/O that overlaps with computation, the Blosc2 compute engine allows to perform calculations on data that is too large to fit in memory, and that can be stored in memory, on disk or on the network.