What is it?
C-Blosc2 is a blocking, shuffling and lossless compression library for numerical data, written in C. Blosc2 is the next generation of Blosc, an award-winning library that has been around for more than a decade and is being used by many projects, including PyTables and Zarr.
On top of C-Blosc2 we built Python-Blosc2, a Python wrapper that exposes the C-Blosc2 API, plus many extensions that allow it to work transparently with NumPy arrays, while performing advanced computations on compressed data that can be stored either in-memory, on-disk or on the network (via the Caterva2 library).
Python-Blosc2 leverages both NumPy and numexpr to achieve great performance, but with a twist. Among the main differences between the new computing engine and NumPy or numexpr, you can find:
Support for n-dim arrays that are compressed in-memory, on-disk or on the network.
High-performance compression codecs for integer, floating point, complex, boolean, string and structured data.
Can evaluate many kinds of math expressions, including reductions, indexing, filters and more.
Support for the NumPy ufunc mechanism, allowing you to mix and match NumPy and Blosc2 computations.
Excellent integration with Numba and Cython via User Defined Functions.
Support for broadcasting operations. This is a powerful feature that lets you perform operations on arrays of different shapes.
Much better adherence to the NumPy casting rules than numexpr.
Lazy expressions that are computed only when needed, and can be stored for later use.
Persistent reductions that can be updated incrementally.
Support for proxies that let you work with compressed data on local or remote machines.
Currently, Python-Blosc2 reproduces the API of Python-Blosc, so it can be used as a drop-in replacement. However, there are a few exceptions to full compatibility.
In addition, Python-Blosc2 aims to leverage the new C-Blosc2 API so as to support super-chunks, multi-dimensional arrays (NDArray), serialization and other bells and whistles introduced in C-Blosc2. Although this is an endless process, we have already caught up with most of the C-Blosc2 API capabilities.
Note: Python-Blosc2 is meant to be backward compatible with Python-Blosc data. That means that it can read data generated with Python-Blosc, but the opposite is not true (i.e. there is no forward compatibility).
The main data container objects in Python-Blosc2 are:
SChunk: a 64-bit compressed store. It can be used to store any kind of data that supports the buffer protocol.

NDArray: an N-Dimensional store. This mimics the NumPy API, but with the added capability of storing compressed data in a more efficient way.
They are described in more detail below.
SChunk: a 64-bit compressed store
SChunk is the simple data container that handles setting, expanding and getting data and metadata. Unlike chunks, a super-chunk can update and resize the data that it contains, supports user metadata, and does not have the 2 GB storage limitation.
Additionally, you can convert an SChunk into a contiguous, serialized buffer (aka cframe) and vice versa; as a bonus, the serialization/deserialization process also works with NumPy arrays and PyTorch/TensorFlow tensors at blazing speed, while reaching excellent compression ratios.
Also, if you are a Mac M1/M2 owner, do yourself a favor and use the native arm64 arch (yes, we are distributing Mac arm64 wheels too; you are welcome ;-).
Read more about SChunk features in our blog entry at: https://www.blosc.org/posts/python-blosc2-improvements
NDArray: an N-Dimensional store
One of the latest and most exciting additions to Python-Blosc2 is the NDArray object. It can write and read n-dimensional datasets extremely efficiently thanks to an n-dim two-level partitioning, allowing you to slice and dice arbitrarily large compressed data in a fine-grained way.
To whet your appetite, the NDArray object performs remarkably well at getting slices orthogonal to the different axes of a 4-dim dataset.
We have blogged about this: https://www.blosc.org/posts/blosc2-ndim-intro
We also have a ~2 min explanatory video on why slicing in a pineapple-style (aka double partition) is useful.