20 years of PyTables

Back in October 2002 the first version of PyTables was released. It was an attempt to store a large amount of tabular data while being able to provide a hierarchical structure around it. Here it is the first public announcement by me:

Hi!,

PyTables is a Python package which allows dealing with HDF5 tables.
Such a table is defined as a collection of records whose values are
stored in fixed-length fields.  PyTables is intended to be easy-to-use,
and tried to be a high-performance interface to HDF5.  To achieve this,
the newest improvements in Python 2.2 (like generators or slots and
metaclasses in brand-new classes) has been used.  Python creation
extension tool has been chosen to access the HDF5 library.

This package should be platform independent, but until now I’ve tested
it only with Linux.  It’s the first public release (v 0.1), and it is
in alpha state.

As noted, PyTables was an early adopter of generators and metaclasses that were introduced in the new (by that time) Python 2.2. It turned out that generators demonstrated to be an excellent tool in many libraries related with data science. Also, Pyrex adoption (which was released just a few months ago) greatly simplified the wrapping of native C libraries like HDF5.

By that time there were not that much Python libraries for persisting tabular data with a format that allowed on-the-flight compression, and that gave PyTables a chance to be considered as a good option. Some months later, PyCon 2003 accepted our first talk about PyTables. Since then, we (mainly me, with the support from Scott Prater on the documentation part) gave several presentations in different international conferences, like SciPy or EuroSciPy and its popularity skyrocketed somehow.

Cárabos Coop. V.

In 2005, and after receiving some good inputs on PyTables by some customers (including The HDF Group), we decided to try to make a life out of PyTables development and together with Vicent Mas and Ivan Vilata, we set out to create a cooperative called Cárabos Coop V. Unfortunately, and after 3 years of enthusiastic (and hard) work, we did not succeed in making the project profitable, and we had to close by 2008.

During this period we managed to make a professional version of PyTables that was using out-of core indexes (aka OPSI) as well as a GUI called ViTables. After closing Cárabos we open sourced both technologies, and we are happy to say that they are still in good use, most specially OPSI indexes, that are meant to perform fast queries in very large datasets; OPSI can still be used straight from pandas.

Crew renewal

After Cárabos closure, I (Francesc Alted) continued to maintain PyTables for a while, but in 2010 I expressed my desire to handover the project, and shortly after, a new gang of people, including Anthony Scopatz and Antonio Valentino, with Andrea Bedini joining shortly after, stepped ahead and took the challenge. This is where open source is strong: whenever a project faces difficulties, there are always people eager to jump up to the wagon and continue providing traction for it.

Attempt to merge with h5py

Meanwhile, the h5py package was receiving a great adoption, specially from the community that valued more the multidimensional arrays than the tabular side of the things. There was a feeling that we were duplicating efforts and by 2016, Andrea Bedini, with the help of Anthony Scopatz, organized a HackFest in Perth, Australia where developers of the h5py and PyTables gathered to attempt a merge of the two projects. After the initial work there, we continued this effort with a grant from NumFOCUS.

Unfortunately, the effort demonstrated to be rather complex, and we could not finish it properly (for the sake of curiosity, the attempt is still available). At any rate, we are actively encouraging people using both packages depending on the need; see for example, the tutorial on h5py/PyTables that Tom Kooij taught at SciPy 2017.

Satellite Projects: Blosc and numexpr

As many other open sources libraries, PyTables stands in the shoulders of giants, and makes use of amazing libraries like HDF5 or NumPy for doing its magic. In addition to that, and in order to allow PyTables push against the hardware I/O and computational limits, it leverages two high-performance packages: Blosc and numexpr. Blosc is in charge of compressing data efficiently and at very high speeds to overcome the limits imposed by the I/O subsystem, while numexpr allows to get maximum performance from computations in CPU when querying large tables. Both projects have been substantially improved by the PyTables crew, and actually, they are quite popular by themselves.

Specifically, the Blosc compressor, although born out of the needs of PyTables, it spun off as a standalone compressor (or meta-compressor, as it can use several codecs internally) meant to accelerate not just disk I/O, but also memory access in general. In an unexpected twist, Blosc2, has developed its own multi-level data partitioning system, which goes beyond the single-level partitions in HDF5, and is currently helping PyTables to reach new performance heights. By teaming with the HDF5 library (and hence PyTables), Blosc2 is allowing PyTables to query 100 trillion rows in human timeframes.

Thank you!

It has been a long way since PyTables started 20 years ago. We are happy to have helped in providing a useful framework for data storage and querying needs for many people during the journey.

Many thanks to all maintainers and contributors (either with code or donations) to the project; they are too numerous to mention them all here, but if you are reading this and are among them, you should be proud to have contributed to PyTables. In hindsight, the road may have been certainly bumpy, but it somehow worked and many difficulties have been surpassed; such is the magic and grace of Open Source!

Blosc2 Meets PyTables: Making HDF5 I/O Performance Awesome

PyTables lets users to easily handle data tables and array objects in a hierarchical structure. It also supports a variety of different data compression libraries through HDF5 filters. With the release of PyTables 3.8.0, the Blosc Development Team is pleased to announce the availability of Blosc2, acting not only as another HDF5 filter, but also as an additional partition tool (aka 'second partition') that complements the existing HDF5 chunking schema.

By providing support for a second partition in HDF5, the chunks (aka the 'first partition') can be made larger, ideally fitting in cache level 3 in modern CPUs (see below for advantages of this). Meanwhile, Blosc2 will use its internal blocks (aka the second partition) as the minimum amount of data that should be read and decompressed during data retrieval, no matter how small the hyperslice to be read is.

When Blosc2 is used to implement a second partition for data (referred ahead as 'optimized Blosc2' too), it can bypass the HDF5 pipeline for writing and for reading. This brings another degree of freedom when choosing the different internal I/O buffers, which can be of extraordinary importance in terms of performance and/or resource saving.

How second partition allows for Big Chunking

Blosc2 in PyTables is meant for compressing data in big chunks (typically in the range of level 3 caches in modern CPUs, that is, 10 ~ 1000 MB). This has some interesting advantages:

  • It allows to reduce the number of entries in the HDF5 hash table. This means less resource consumption in the HDF5 side, so PyTables can handle larger tables using less resources.

  • It speeds-up compression and decompression because multithreading works better with more blocks. Remember that you can specify the number of threads to use by using the MAX_BLOSC_THREADS parameter, or by using the BLOSC_NTHREADS environment variable.

However, the traditional drawback of having large chunks is that getting small slices would take long time because the whole chunk has to be read completely and decompressed. Blosc2 surmounts that difficulty like this: it asks HDF5 where chunks start on-disk (via H5Dget_chunk_info()), and then it access to the internal blocks (aka the second partition) independently instead of having to decompress the entire chunk. This effectively avoids penalizing access to small data slices.

In the graphic below you can see the second partition in action where, in order to retrieve the green slice, only blocks 2 and 3 needs to be addressed and decompressed, instead of the (potentially much) larger chunk 0 and 1, which would be the case for the traditional 1 single partition in HDF5:

/images/blosc2_pytables/block-slice.png

In the benchmarks below we are comparing the performance of existing filters inside PyTables (like Zlib or Blosc(1)) against Blosc2, both working as a filter or in optimized mode, that is, bypassing the HDF5 filter pipeline completely.

Benchmarks

The data used in this section have been fetched from ERA5 database (see downloading script), which provides hourly estimates of a large number of atmospheric, land and oceanic climate variables. To build the tables used for reading and writing operations, we have used five different ERA5 datasets with the same shape (100 x 720 x 1440) and the same variables (latitude, longitude and time). Then, we have built a table with a column for each variable and each dataset and added the latitude, longitude and time as columns (for a total of 8 cols). Finally, there have been written 100 x 720 x 1440 rows (more than 100 million) to this table, which makes for a total data size of 3.1 GB.

We present different scenarios when comparing resource usage for writing and reading between the Blosc and Blosc2 filters, including the Blosc2 optimized versions. First one is when PyTables is choosing the chunkshape automatically (the default); as Blosc2 is meant towards large chunks, PyTables has been tuned to produce far larger chunks for Blosc2 in this case (Blosc and other filters will remain using the same chunk sizes as usual). Second, we will visit the case where the chunkshape is equal for both Blosc and Blosc2. Spoiler alert: we will see how Blosc2 behaves well (and sometimes much beter) in both scenarios.

Automatic chunkshape

Inkernel searches

We start by performing inkernel queries where the chunkshape for the table is chosen automatically by the PyTables machinery (see query script here). This size is the same for Blosc, Zlib and uncompressed cases which are all using 16384 rows (about 512 KB), whereas for Blosc2 the computed chunkshape is quite larger: 1179648 rows (about 36 MB; this actually depends on the size of the L3 cache, which is automatically queried in real-time by PyTables and it turns out to be exactly 36 MB for our CPU, an Intel i9-13900K).

Now, we are going to analyze the memory and time usage of performing six inkernel searches, which means scanning the full table six times, in different cases:

  • With no compression; size is 3,1 GB.

  • Using HDF5 with ZLIB + Shuffle; size is 407 MB.

  • Using Blosc filter with BloscLZ codec + Bitshuffle; size is 468 MB.

  • Using Blosc2 filter with BloscLZ codec + Bitshuffle; size is 421 MB.

  • Using Blosc2 filter with Zstd codec + Bitshuffle; size is 341 MB.

/images/blosc2_pytables/inkernel-nocomp-zlib-blosc-zstd.png

As we can see, the queries with no compression enable do not take much time or memory consumption, but it requires storing the full 3.1 GB of data. When using ZLIB, which is the HDF5 default, it does not require much memory either, but it takes a much more time (about 10x more), although the stored data is more than 6x smaller. When using Blosc, the time spent in (de-)compression is much less, but the queries still takes more time (1.7x more) than the no compression case; in addition, the compression ratio is quite close to the ZLIB case.

However, the big jump comes when using Blosc2 with BloscLZ and BitShuffle, since although it uses just a little bit more memory than Blosc (a consequence of using larger chunks), in exchange it is quite faster than the previous methods while achieving a noticeably better compression ratio. Actually, this combination is 1.3x faster than using no compression; this is one of the main features of Blosc (and even more with Blosc2): it can accelerate operation by using compression.

Finally, in case we want to improve compression further, Blosc2 can be used with the ZSTD codec, which achieves the best compression ratio here, in exchange for a slightly slower time (but still 1.15x faster than not using compression).

PyTables inkernel vs pandas queries

Now that we have seen how Blosc2 can help PyTables in getting great query performance, we are going to compare it against pandas queries; to make things more interesting, we will be using the same NumExpr engine in both PyTables (where it is used in inkernel queries) and pandas.

For this benchmark, we have been exploring the best configuration for speed, so we will be using 16 threads (for both Blosc2 and NumExpr) and the Shuffle filter instead of Bitshuffle; this leads to slightly less compression ratios (see below), but now the goal is getting full speed, not reducing storage (keep in mind that Pandas stores data in-memory without compression).

Here it is how PyTables and pandas behave when doing the same 6 queries than in the previous section.

/images/blosc2_pytables/inkernel-pandas.png

And here it is another plot for the same queries, but comparing raw I/O performance for a variety of codecs and filters:

/images/blosc2_pytables/inkernel-vs-pandas2.png

As we can see, the queries using Blosc2 are generally faster (up to 2x faster) than not using compression. Furthermore, Blosc2 + LZ4 get nearly as good times as pandas, while the memory consumption is much smaller with Blosc2 (as much as 20x less in this case; more for larger tables indeed). This is remarkable, as this means that Blosc2 compression results in acceleration that almost compensates for all the additional layers in PyTables (the disk subsystem and the HDF5 library itself).

And in case you wonder how much compression ratio we have lost by switching from Bitshuffle to Shuffle, not much actually:

/images/blosc2_pytables/shuffle-bitshuffle-ratios.png

The take away message here is that, when used correctly, compression can make out-of-core queries go as fast as pure in-memory ones (even when using a high performance tool-set like pandas + NumExpr).

Writing and reading speed with automatic chunkshape

Now, let's have a look at the raw write and read performance. In this case we are going to compare Blosc, Blosc2 as an HDF5 filter, and the optimized Blosc2 (acting as a de facto second partition). Remember that in this section the chunkshape determination is still automatic and different for Blosc (16384 rows, about 512 KB) and Blosc2 (1179648 rows, about 36 MB).

/images/blosc2_pytables/append-expectedrows.png

For writing, optimized Blosc2 is able to do the job faster and get better compression ratios than others, mainly because it uses the HDF5 direct chunking mechanism, bypassing the overhead of the HDF5 pipeline.

Note: the standard Blosc2 filter cannot make of use HDF5 direct chunking, but it still has an advantage when using bigger chunks because it allows for more threads working in parallel and hence, allowing improved parallel (de-)compression.

The plot below shows how optimized Blosc2 is able to read the table faster and how the performance advantage grows as we use more threads.

/images/blosc2_pytables/read-expectedrows.png

And now, let's compare the mean times of Blosc and Blosc2 optimized to read a small slice. In this case, Blosc chunkshape is much smaller, but optimized Blosc2 still can reach similar speed since it uses blocks that are similar in size to Blosc chunks.

/images/blosc2_pytables/slice-read-expectedrows.png

Writing and reading speed when using the same chunkshape

In this scenario, we are choosing the same chunkshape (720 x 1440 rows, about 32 MB) for both Blosc and Blosc2. Let's see how this affects performance:

/images/blosc2_pytables/append-chunklen.png

The plot above shows how optimized Blosc2 manages to write the table faster (mainly because it can bypass the HDF5 pipeline); with the advantage being larger as more threads are used.

/images/blosc2_pytables/read-chunklen.png

Regarding reading, the optimized Blosc2 is able to perform faster too, and we continue to see the same trend of getting more speed when more threads are thrown at the task, with optimized Blosc2 scaling better.

Finally, let's compare the mean times of Blosc and Blosc2 when reading a small slice in this same chunkshape scenario. In this case, since chunkshapes are equal and large, optimized Blosc2 is much faster than the others because it has the ability to decompresses just the necessary internal blocks, instead of the whole chunks. However, the Blosc and the Blosc2 filter still need to decompress the whole chunk, so getting much worse times. See this effect below:

/images/blosc2_pytables/slice-read-chunklen.png

Final words

By allowing a second partition on top of the HDF5 layer, Blosc2 provides a great boost in PyTables I/O speed, specially when using big chunks (mainly when they fit in L3 CPU cache). That means that you can read, write and query large compressed datasets in less time. Interestingly, Blosc2 compression can make these operations faster than when using no compression at all, and even being competitive against a pure in-memory solution like pandas (but consuming vastly less memory).

On the other hand, there are situations where using big chunks would not be acceptable. For example, when using other HDF5 apps that do not support the optimized paths for Blosc2 second partition, and one is forced to use the plain Blosc2 filter. In this case having large chunks would penalize the retrieval of small data slices too much. By the way, you can find a nice assortment of generic filters (including Blosc2) for HDF5 in the hdf5plugin library.

Also note that, in the current implementation we have just provided optimized Blosc2 paths for the Table object in PyTables. That makes sense because Table is probably the most used entity in PyTables. Other chunked objects in PyTables (like EArray or CArray) could be optimized with Blosc2 in the future too (although that would require providing a *multidimensional* second partition for Blosc2).

Last but not least, we would like to thank NumFOCUS and other PyTables donors for providing the funds required to implement Blosc2 support in PyTables. If you like what we are doing, and would like our effort to continue developing well, you can support our work by donating to PyTables project or Blosc project teams. Thank you!

User Defined Pipeline for Python-Blosc2

The Blosc Development Team is happy to announce that the latest version of Python-Blosc2 allows user-defined Python functions all throughout its compression pipeline: you can use Python for building prefilters, postfilters, filters, codecs for Blosc2 and explore all its capabilities. And if done correctly (by using e.g. NumPy, numba, numexpr...), most of the time you won't even need to translate those into C for speed.

The Blosc2 pipeline

The Blosc2 pipeline includes all the functions that are applied to the data until it is finally compressed (and decompressed back). As can be seen in the image below, during compression the first function that can be applied to the data is the prefilter (if any), then the filters pipeline (with a maximum of six filters) and, last but not least, the codec itself. For decompressing, the order will be the other way around: first the codec, then the filters pipeline and finally the postfilter (if any).

blosc2-pipeline

Defining prefilters and postfilters

A prefilter is a function that is applied to the SChunk each time you add data into it (e.g. when appending or updating). It is executed for each data block and receives three parameters: input, output and offset. For convenience, the input and output are presented as NumPy arrays; the former is a view of the input data and the later is an empty NumPy array that must be filled (this is actually what the first filter in the filters pipeline will receive). Regarding the offset, it is an integer which indicates where the corresponding block begins inside the SChunk container.

You can easily set a prefilter to a specific SChunk through a decorator. For example:

schunk = blosc2.SChunk()
@schunk.prefilter(np.int64, np.float64)
def pref(input, output, offset):
    output[:] = input - np.pi + offset

This decorator requires the data types for the input (original data) and output NumPy arrays, which must be of same itemsize.

If you do not want the prefilter to be applied anymore, you can always remove it:

schunk.remove_prefilter("pref")

As for the postfilters, they are applied at the end of the pipeline during decompression. A postfilter receives the same parameters as the prefilter and can be set in the same way:

@schunk.postfilter(np.float64, np.int64)
def postf(input, output, offset):
    output[:] = input + np.pi - offset

In this case, the input data is the one from the buffer returned by the filter pipeline, and the output data type should be the same as the original data (for a good round-trip).

You can also remove postfilters whenever you want:

schunk.remove_postfilter("postf")

Fillers

Before we move onto the next step in the pipeline, we need to introduce the fillers. A filler is similar to a prefilter but with a twist. It is used to fill an empty SChunk and you can pass to it any extra parameter you want, as long as it is a NumPy array, SChunk or Python Scalar. All these extra parameters will arrive to the filler function as a tuple containing just the corresponding block slice for each parameter (except for the scalars, that are passed untouched). To declare a filter, you will also need to pass the inputs along with its data type. For example:

@schunk.filler(((schunk0, dtype0), (ndarray1, dtype1), (py_scalar3, dtype2)), schunk_dtype)
def filler(inputs_tuple, output, offset):
    output[:] = inputs_tuple[0] - inputs_tuple[1] * inputs_tuple[2]

This will automatically append the data to the schunk, but applying the filler function first. After that the schunk would be completely filled, the filler will be de-registered, so the next time you update some data the it would not be executed; a filler is meant to build new SChunk objects from other containers.

User-defined filters and codecs

The main difference between prefilters/postfilters and their filters/codecs counterparts is that the former ones are meant to run for an specific SChunk instance, whereas the later can be locally registered and hence, used in any SChunk.

Let's start describing the user-defined filters. A filter is composed by two functions: one for the compression process (forward), and another one for the decompression process (backward). Such functions receive the input and output as NumPy arrays of type uint8 (bytes), plus the filter meta and the SChunk instance to which the data belongs to. The forward function will fill the output with the modified data from input. The backward will be responsible of reversing the changes done by forward so that the data returned at the end of the decompression should be the same as the one received at the beginning of the compression. Check the drawing below:

forwardbackward

So, once we have the pair of forward and backward functions defined, they can be registered locally associating to a filter ID between 160 and 255:

blosc2.register_filter(id, forward, backward)

Now, we can use the user-defined filter in any SChunk instance by choosing the new local ID in the filters pipeline:

schunk.cparams = {"filters": [id], "filters_meta": [meta]}

Regarding the user-defined codecs, they do not differ too much from its filter counterparts. The main difference is that, because their goal is to actually compress data, the corresponding functions (in this case encoder and decoder) will need to return the size in bytes of the compressed/decompressed data respectively. This time the scheme would be:

encoderdecoder

To register a codec, you name it, assign an ID to it and pass the user-defined functions:

blosc2.register_codec(codec_name, id, encoder, decoder)

And to use it you just use its ID in the cparams:

schunk.cparams = {"codec": id, "codec_meta": meta}

Final words

We have seen how easily you can define your own filters and codecs for the Blosc2 compression pipeline. They are very easy to use because they conveniently wrap input and output data as NumPy arrays. Now, you can start experimenting with different filter/compression algorithms straight from Python. You can even come with a library of such filters/codecs that can be used in all your data pipeline processing. Welcome to compression made easy!

See more examples in the repository.

Find the complete documentation at: https://www.blosc.org/python-blosc2/python-blosc2.html.

This work has been made thanks to a Small Development Grant from NumFOCUS. NumFOCUS is a non-profit organization supporting open code for better science. If you like this, consider giving a donation to them (and if you like our work, you can nominate it to our project too!). Thanks!

New features in Python-Blosc2

The Blosc Development Team is happy to announce that the latest version of Python-Blosc2 0.4. It comes with new and exciting features, like a handier way of setting, expanding and getting the data and metadata from a super-chunk (SChunk) instance. Contrarily to chunks, a super-chunk can update and resize the data that it containes, supports user metadata, and it does not have the 2 GB storage limitation.

Additionally, you can now convert now a SChunk into a contiguous, serialized buffer (aka cframe) and vice-versa; as a bonus, this serialization process also works with a NumPy array at a blazing speed.

Continue reading for knowing the new features a bit more in depth.

Retrieve data with __getitem__ and get_slice

The most general way to store data in Python-Blosc2 is through a SChunk (super-chunk) object. Here the data is split into chunks of the same size. So until now, the only way of working with it was chunk by chunk (see tutorial).

With the new version, you can get general data slices with the handy __getitem__() method without having to mess with chunks manually. The only inconvenience is that this returns a bytes object, which is difficult to read by humans. To overcome this, we have also implemented the get_slice() method; it comes with two optional params: start and stop for selecting the slice you are interested in. Also, you can pass to out any Python object supporting the Buffer Protocol and it will be filled with the data slice. One common example is to pass a NumPy array in the out argument:

out_slice = numpy.empty(chunksize * nchunks, dtype=numpy.int32)
schunk.get_slice(out=out_slice)

We have now the out_slice NumPy array filled with the schunk data. Easy and effective.

Set data with __setitem__

Similarly, if we would like to set data, we had different ways of doing it, e.g. with the update_chunk() or the update_data() methods. But those work, again, chunk by chunk, which was a bummer. That's why we also implemented the convenient __setitem__() method. In a similar way to the get_slice() method, the value to be set can be any Python object supporting the Buffer Protocol. In addition, this method is very flexible because it not only can set the data of an already existing slice of the SChunk, but it also can expand (and update at the same time) it.

To do so, the stop param will set the new number of items in SChunk:

start = schunk_nelems - 123
stop = start + new_value.size   # new number of items
schunk[start:stop] = new_value

In the code above, the data between start and the SChunk current size will be updated and then, the data between the previous SChunk size and the new stop will be appended automatically for you. This is very handy indeed (note that the step parameter is not yet supported though).

Serialize SChunk from/to a contiguous compressed buffer

Super-chunks can be serialized in two slightly different format frames: contiguous and sparse. A contiguous frame (aka cframe) serializes the whole super-chunk data and metadata into a sequential buffer, whereas the sparse frame (aka sframe) uses a contiguous frame for metadata and the data is stored separately in so-called chunks. Here it is how they look like:

Contiguous and sparse frames

The contiguous and sparse formats come with its own pros and cons. A contiguous frame is ideal for transmitting / storing data as a whole buffer / file, while the sparse one is better suited to act as a store while a super-chunk is being built.

In this new version of Python-Blosc2 we have added a method to convert from a SChunk to a contiguous, serialized buffer:

buf = schunk.to_cframe()

as well as a function to build back a SChunk instance from that buffer:

schunk = schunk_from_cframe(buf)

This allows for a nice way to serialize / deserialize super-chunks for transmission / storage purposes. Also, and for performance reasons, and for reducing memory usage to a maximum, these functions avoid copies as much as possible. For example, the schunk_from_cframe function can build a SChunk instance without copying the data in cframe. Such a capability makes the use of cframes very desirable whenever you have to transmit and re-create data from one machine to another in a very efficient way.

Serializing NumPy arrays

Last but not least, you can also serialize NumPy arrays with the new pair of functions pack_array2() / unpack_array2(). Although you could already do this with the existing pack_array() / unpack_array() functions, the new ones are much faster and do not have the 2 GB size limitation. To prove this, let's see its performance by looking at some benchmark results obtained with an Intel box (i9-10940X CPU @ 3.30GHz, 14 cores) running Ubuntu 22.04.

In this benchmark we are comparing a plain NumPy array copy against compression/decompression through different compressors and functions (compress() / decompress(), pack_array() / unpack_array() and pack_array2() / unpack_array2()). The data distribution for the plots below is for 3 different data distributions: arange, linspace and random:

Compression ratio for different codecs

As can be seen, different codecs offer different compression ratios for the different distributions. Note in particular how linear distributions (arange for int64 and linspace for float64) can reach really high compression ratios (very low entropy).

Let's see the speed for compression / decompression; in order to not show too many info in this blog, we will show just the plots for the linspace linear distribution:

Compression speed for different codecsDecompression speed for different codecs

Here we can see that the pair pack_array2() / unpack_array2() is consistently (much) faster than their previous version pack_array() / unpack_array(). Despite that, the fastest is the compress() / decompress() pair; however this is not serializing all the properties of a NumPy array, and has the limitation of not being able to compress data larger than 2 GB.

You can test the speed in your box by running the pack_compress bench.

Also, if you would like to store the contiguous buffer on-disk, you can directly use the pair of functions save_array(), load_array().

Native performance on Apple M1 processors

Contrariliy to Blosc1, Blosc2 comes with native support for ARM processors (it leverages the NEON SIMD instruction set there), and that means that it runs very fast in this architecture. As an example, let's see how the new pack_array2() / unpack_array2() works in an Apple M1 laptop (Macbook Air).

Compression speed for different codecsDecompression speed for different codecs

As can be seen, running Blosc2 in native arm64 mode on M1 offers quite a bit more performance (specially during compression) than using the i386 emulation. If speed is important to you, and you have a M1/M2 processor, make sure that you are running Blosc2 in native mode (arm64).

Conclusions

The new features added to python-blosc2 offer an easy way of creating, getting, setting and expanding data by using a SChunk instance. Furthermore, you can get a contiguous compressed representation (aka cframe) of it and re-create it again latter. And you can do the same with NumPy arrays (either in-memory or on-disk) faster than with the former functions, and even faster than a plain memcpy().

For more info on how to use these useful new features, see this Jupyter notebook tutorial.

Finally, the complete documentation is at: https://www.blosc.org/python-blosc2/python-blosc2.html. Thanks to Marc Garcia (@datapythonista) for his fine work and enthusiasm in helping us in providing a better structure to the Blosc documentation!

This work has been possible thanks to a Small Development Grant from NumFOCUS. NumFOCUS is a non-profit organization supporting open code for better science. If you like the goal, consider giving a donation to NumFOCUS (you can optionally make it go to our project too, to which we would be very grateful indeed :-).

Announcing Support for Lossy ZFP Codec as a Plugin for C-Blosc2

Announcing Support for Lossy ZFP Codec as a Plugin for C-Blosc2

Blosc supports different filters and codecs for compressing data, like e.g. the lossless NDLZ codec and the NDCELL filter. These have been developed explicitly to be used in multidimensional datasets (via Caterva or ironArray Community Edition).

However, a lossy codec like ZFP allows for much better compression ratios at the expense of loosing some precision in floating point data. Moreover, while NDLZ is only available for 2-dim datasets, ZFP can be used up to 4-dim datasets.

How ZFP works?

ZFP partitions datasets into cells of 4^(number of dimensions) values, i.e., 4, 16, 64, or 256 values for 1D, 2D, 3D, and 4D arrays, respectively. Each cell is then (de)compressed independently, and the resulting bit strings are concatenated into a single stream of bits.

Furthermore, ZFP usually truncates each input value either to a fixed number of bits to meet a storage budget or to some variable length needed to meet a chosen error tolerance. For more info on how this works, see zfp overview docs.

ZFP implementation

Similarly to other registered Blosc2 official plugins, this codec is now available at the blosc2/plugins directory of the C-Blosc2 repository. However, as there are different modes for working with ZFP, there are several associated codec IDs (see later).

So, in order to use ZFP, users just have to choose the ID for the desired ZFP mode between the ones listed in blosc2/codecs-registry.h. For more info on how the plugin selection mechanism works, see https://www.blosc.org/posts/registering-plugins/.

ZFP modes

ZFP is a lossy codec, but it still lets the user to choose the degree of the data loss. There are different compression modes:

  • BLOSC_CODEC_ZFP_FIXED_ACCURACY: The user can choose the absolute error in truncation. For example, if the desired absolute error is 0.01, each value loss must be less than or equal to 0.01. With that, if 23.0567 is a value of the original input, after compressing and decompressing this input with error=0.01, the new value must be between 23.0467 and 23.0667.

  • BLOSC_CODEC_ZFP_FIXED_PRECISION: The user specifies the maximum number of bit planes encoded during compression (relative error). This is, for each input value, the number of most significant bits that will be encoded.

  • BLOSC_CODEC_ZFP_FIXED_RATE: The user chooses the size that the compressed cells must have based on the input cell size. For example, if the cell size is 2000 bytes and user chooses ratio=50, the output cell size will be 50% of 2000 = 1000 bytes.

For more info, see: https://github.com/Blosc/c-blosc2/blob/main/plugins/codecs/zfp/README.md

Benchmark: ZFP FIXED-ACCURACY VS FIXED_PRECISION VS FIXED-RATE modes

The dataset used in this benchmark is called precipitation_amount_1hour_Accumulation.zarr and has been fetched from ERA5 database, which provides hourly estimates of a large number of atmospheric, land and oceanic climate variables.

Specifically, the downloaded dataset in Caterva format has this parameters:

  • ndim = 3

  • type = float32

  • shape = [720, 721, 1440]

  • chunkshape = [128, 128, 256]

  • blockshape = [16, 32, 64]

The next plots represent the compression results obtained by using the different ZFP modes to compress the already mentioned dataset.

Note: It is important to remark that this is a specific dataset and the codec may perform differently for other ones.

/images/zfp-plugin/ratio_zfp.png/images/zfp-plugin/times_zfp.png

Below the bars it is annotated what parameter is used for each test. For example, for the first column, the different compression modes are setup like this:

  • FIXED-ACCURACY: for each input value, the absolute error is 10^(-6) = 0.000001.

  • FIXED-PRECISION: for each input value, only the 20 most significant bits for the mantissa will be encoded.

  • FIXED-RATE: the size of the output cells is 30% of the input cell size.

Although the FIXED-PRECISION mode does not obtain great results, we see that with the FIXED-ACCURACY mode we do get better performance as the absolute error increases. Similarly, we can see how the FIXED-RATE mode gets the requested ratios, which is cool but, in exchange, the amount of data loss is unknown.

Also, while FIXED-ACCURACY and FIXED-RATE modes consume similar times, the FIXED-PRECISION mode, which seems to have less data loss, also takes longer to compress. Generally speaking we can see how, the more data loss (more data truncation) achieved by a mode, the faster it operates.

"Third partition"

One of the most appealing features of Caterva besides supporting multi-dimensionality is its implementation of a second partition, making slicing more efficient. As one of the distinctive characteristics of ZFP is that it compresses data in independent (and small) cells, we have been toying with the idea of implementing a third partition so that slicing of thin selections or just single-point selection can be made faster.

So, as part of the current ZFP implementation, we have combined the Caterva/Blosc2 partitioning (chunking and blocking) with the independent cell handling of ZFP, allowing to extract single cells within the ZFP streams (blocks in Blosc jargon). Due to the properties and limitations of the different ZFP compression modes, we have been able to implement a sort of "third partition" just for the FIXED-RATE mode when used together with the blosc2_getitem_ctx() function.

Such a combination of the existing partitioning and single cell extraction is useful for selecting more narrowly the data to extract, saving time and memory. As an example, below you can see a comparison of the mean times that it takes to retrieve a bunch of single elements out of different multidimensional arrays from the ERA5 dataset (see above). Here we have used Blosc2 with a regular LZ4 codec compared against the FIXED-RATE mode of the new ZFP codec:

/images/zfp-plugin/zfp_fixed_rate.png

As you can see, using the ZFP codec in FIXED-RATE mode allows for a good improvement in speed (up to more than 2x) for retrieving single elements (or, in general an amount not exceeding the cell size) in comparison with the existing codecs (even the fastest ones, like LZ4) inside Blosc2. As the performance improvement is of the same order than random access time of modern SSDs, we anticipate that this could be a major win in scenarios where random access is important.

If you are curious on how this new functionality performs for your own datasets and computer, you can use/adapt our benchmark code.

Conclusions

The integration of ZFP as a codec plugin will greatly enhance the capabilities of lossy compression inside C-Blosc2. The current ZFP plugin supports different modes; if users want to specify data loss during compression, it is recommended to use the FIXED-ACCURACY or FIXED-PRECISION modes (and most specially the former because of its better compression performance).

However, if the priority is to get specific compression ratios without paying too much attention to the amount of data loss, one should use the FIXED-RATE mode, which let choose the desired compression ratio. This mode also has the advantage that a "third partition" can be used for improving random elements access speed.

This work has been done thanks to a Small Development Grant from the NumFOCUS Foundation, to whom we are very grateful indeed. NumFOCUS is doing a excellent job in sponsoring scientific projects and you can donate to the Blosc project (or many others under the NumFOCUS umbrella) via its donation page.

Caterva Slicing Performance: A Study

/images/cat_slicing/caterva.png

Caterva is a C library for handling multi-dimensional, chunked, compressed datasets in an easy and fast way. It is build on top of the C-Blosc2 library, leveraging all its avantages on modern CPUs.

Caterva can be used in a lot of different situations; however, where it really stands out is for extracting multidimensional slices of compressed datasets because, thanks to the double partitioning schema that it implements, the amount of data that has to be decompressed so as to get the slice is minimized, making data extraction faster (usually). In this installment, you will be offered a rational on how double partitioning works, together with some examples where it shines, and others where it is not that good.

Double partitioning

/images/cat_slicing/cat_vs_zarr,hdf5.png

Some libraries like HDF5 or Zarr store data into multidimensional chunks. This makes slice extraction from compressed datasets more efficient than using monolithic compression, since only the chunks containing the interesting slice are decompressed instead of the entire array.

In addition, Caterva introduces a new level of partitioning. Within each chunk, the data is re-partitioned into smaller multidimensional sets called blocks. This generally improves the slice extraction, since this allows to decompress only the blocks containing the data in desired slice instead of the whole chunks.

Slice extraction with Caterva, HDF5 and Zarr

So as to see how the double partitioning performs with respect to a traditional single partition schema, we are going to compare the ability to extract multidimensional slices from compressed data of Caterva, HDF5 and Zarr. The examples below consist on extracting some hyper-planes from chunked arrays with different properties and seeing how Caterva performs compared with traditional libraries.

Note: So as to better compare apples with apples, all the benchmarks below have been run using Blosc (with LZ4 as the internal codec) as the compressor by default, with the shuffle filter. Even if Caterva uses the newest C-Blosc2 compressor, and HDF5 and Zarr uses its C-Blosc(1) antecessor, the performance of both libraries are very similar. Also, for easier interactivity, we have used the libraries via Python wrappers (python-caterva, h5py, Zarr).

2-dimensional array

This is a 2-dimensional array and has the following properties, designed to optimize slice extraction from the second dimension:

shape = (8_000, 8_000)
chunkshape = (4_000, 100)
blockshape = (500, 25)

Here we can see that the ratio between chunkshape and blockshape is 8x in dimension 0 and 4x in dimension 1.

/images/cat_slicing/dim0.png/images/cat_slicing/dim1.png

Now we are going to extract some planes from the chunked arrays and will plot the performance. For dimension 0 we extract a hyperplane [i, :], and for dimension 1, [:, i], where i is a random integer.

/images/cat_slicing/2dim.png

Here we see that the slicing times are similar in the dimension 1. However, Caterva performs better in the dimension 0. This is because with double partitioning you only have to decompress the blocks containing the slice instead of the whole chunk.

In fact, Caterva is around 12x faster than HDF5 and 9x faster than Zarr for slicing the dimension 0, which makes sense since Caterva decompresses 8x less data. For the dimension 1, Caterva is approximately 3x faster than HDF5 and Zarr; in this case Caterva has to decompress 4x less data.

That is, the difference in slice extraction speed depends largely on the ratio between the chunk size and the block size. Therefore, for slices where the chunks that contain the slice also have many items that do not belong to it, the existence of blocks (i.e. the second partition) allows to significantly reduce the amount of data to decompress.

Overhead of the second partition

So as to better assess the possible performance cost of the second partition, let's analyze a new case of a 3-dimensional array with the following parameters:

shape = (800, 600, 300)
chunkshape = (200, 100, 80)
blockshape = (20, 100, 10)

So, in the dimensions 0 and 2 the difference between shape and chunkshape is not too big whereas the difference between chunkshape and blockshape is remarkable.

However, for the dimension 1, there is not a difference at all between chunkshape and blockshape. This means that in dim 1 the Caterva machinery will make extra work because of the double partitioning, but it will not get any advantage of it since the block size is going to be equal to the chunk size. This a perfect scenario for measuring the overhead of the second partition.

The slices to extract will be [i, :, :], [:, i, :] or [:, :, i]. Let's see the execution times for slicing these planes:

/images/cat_slicing/3dim.png

As we can see, the performance in dim 1 is around the same order than HDF5 and Zarr (Zarr being a bit faster actually), but difference is not large, so that means that the overhead introduced purely by the second partition is not that important. However, in the other dimensions Caterva still outperforms (by far) Zarr and HDF5. This is because the two level partitioning works as intended here.

A last hyper-slicing example

Let's see a final example showing the double partitioning working on a wide range of dimensions. In this case we choose a 4-dimensional array with the following parameters:

shape = (400, 80, 100, 50)
chunkshape = (100, 40, 10, 50)
blockshape = (30, 5, 2, 10)

Here the last dimension (3) is not optimized for getting hyper-slices, specially in containers with just single partitioning (Zarr and HDF5). However, Caterva should still perform well in this situation because of the double partitioning.

The slices we are going to extract will be [i, :, :, :], [:, i, :, :], [:, :, i, :] or [:, :, :, i]. Let's see the execution times for slicing these hyperplanes:

/images/cat_slicing/4dim.png

As we can see, in this case Caterva outperforms Zarr and HDF5 in all dimensions. However, the advantage is not that important for the last dimension. The reason is that in this last dimension Caterva has a noticeably lower ratio between its shape and blockshape than in the other dimensions.

Final thoughts

We have seen that adding a second partition is beneficial for improving slicing performance in general. Of course, there are some situations where the overhead of the second partition can be noticeable, but the good news is that such an overhead does not get too large when compared with containers with only one level of partitioning.

Finally, we can conclude that Caterva usually obtains better results due to its second partitioning, but when it shines the most is when the two levels of partitioning are well balanced among them and also with respect to the shape of the container.

As always, there is no replacement for experimentation so, in case you want to try Caterva by yourself (and you should if you really care about this problem), you can use our Caterva poster; it is based on a Jupyter notebook that you can adapt to your own scenarios.

Registering plugins in C-Blosc2

Blosc has traditionally supported different filters and codecs for compressing data, and it was up to the user to choose one or another depending on her needs. However, there will always be scenarios where a more richer variety of them could be useful.

Blosc2 has now a new plugin register capability in place so that the info about the new filters and codecs can be persisted and transmitted to different machines. In this way Blosc can figure out the info of persistent plugins, and use them so as to decompress the data without problems.

Besides, the Blosc Development Team has implemented a centralized repository so that people can propose new plugins; and if these plugins fulfill a series of requirements, they will be officially accepted, and distributed within the C-Blosc2 library. This provides an easy path for extending C-Blosc2 and hence, better adapt to user needs.

The plugins that can be registered in the repository can be either codecs or filters.

  • A codec is a program able to compress and decompress a data stream with the objective of reducing its size and to enable a faster transmission of data.

  • A filter is a program that reorders the data without changing its size, so that the initial and final size are equal. A filter consists of encoder and decoder. The filter encoder is applied before the codec compressor (or codec encoder) in order to make data easier to compress and the filter decoder is used after codec decompressor (or codec decoder) to restore the original data arrangement.

Here it is an example on how the compression process goes:

--------------------   filter encoder  -------------------   codec encoder   -------
|        src        |   ----------->  |        tmp        |   ---------->   | c_src |
--------------------                   -------------------                   -------

And the decompression process:

--------   codec decoder    -------------------   filter decoder  -------------------
| c_src |   ----------->   |        tmp        |   ---------->   |        src        |
--------                    -------------------                   -------------------

Register for user plugins

User registered plugins are plugins that users register locally so that they can be used in the same way as Blosc official codecs and filters. This option is perfect for users that want to try new filters or codecs on their own.

The register process is quite simple. You just use the blosc2_register_filter() or blosc2_register_codec() function and then the Blosc2 machinery will store its info with the rest of plugins. After that you will be able to access your plugin through its ID by setting Blosc2 compression or decompression params.

                                             filters pipeline
                                          ----------------------
                                         |  BLOSC_SHUFFLE     1 |
                                          ----------------------
                                         |  BLOSC_BITSHUFFLE  2 |
                                          ----------------------
                                         |  BLOSC_DELTA       3 |
                                          ----------------------
                                         |  BLOSC_TRUNC       4 |
                                          ----------------------
                                         |         ...          |
                                          ----------------------
                                         |  BLOSC_NDCELL     32 |
                                          ----------------------
                                         |  BLOSC_NDMEAN     33 |
                                          ----------------------
                                         |         ...          |
                                          ----------------------
                                         |  urfilter1       160 |
                                          ----------------------
blosc2_register_filter(urfilter2)  --->  |  urfilter2       161 |  ---> cparams.filters[4] = 161; // can be used now
                                          ----------------------
                                         |         ...          |
                                          ----------------------

Global register for Blosc plugins

Blosc global registered plugins are Blosc plugins that have passed through a selection process and a review by the Blosc Development Team. These plugins will be available for everybody using the C-Blosc2 library.

You should consider this option if you think that your codec or filter could be useful for the community, or you just want being able to use them with upstream C-Blosc2 library. The steps for registering an official Blosc plugin can be seen at: https://github.com/Blosc/c-blosc2/blob/main/plugins/README.md

Some well documented examples of these kind of plugins are the codec ndlz and the filters ndcell and ndmean on the C-Blosc2 GitHub repository: https://github.com/Blosc/c-blosc2/tree/main/plugins

Compiling plugins examples using Blosc2 wheels

So as to easy the use of the registered filters, full-fledged C-Blosc2 binary libraries including plugins functionality can be installed from python-blosc2 (>= 0.1.8) wheels:

$ pip install blosc2
Collecting blosc2
  Downloading blosc2-0.1.8-cp37-cp37m-manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 4.7 MB/s
Installing collected packages: blosc2
Successfully installed blosc2-0.1.8

Once you have installed the C-Blosc2 libraries you can not only use the official Blosc filters and codecs, but you can also register and use them. You can find directions on how to compile C files using the Blosc2 libraries inside these wheels at: https://github.com/Blosc/c-blosc2/blob/main/COMPILING_WITH_WHEELS.rst

Using user plugins

To use your own plugins with the Blosc machinery you first have to register them through the function blosc2_register_codec() or blosc2_register_filter() with an ID between BLOSC2_USER_DEFINED_FILTERS_START and BLOSC2_USER_DEFINED_FILTERS_STOP. Then you can use this ID in the compression parameters (cparams.compcode, cparams.filters) and decompression parameters (dparams.compcode, dparams.filters). For any doubts you can see the whole process in the examples urcodecs.c and urfilters.c.

blosc2_codec urcodec;
udcodec.compcode = 244;
udcodec.compver = 1;
udcodec.complib = 1;
udcodec.compname = "urcodec";
udcodec.encoder = codec_encoder;
udcodec.decoder = codec_decoder;
blosc2_register_codec(&urcodec);

blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
cparams.compcode = 244;

Using Blosc official plugins

To use the Blosc official plugins it is mandatory to add the next lines in order to activate the plugins mechanism:

  • #include "blosc2/codecs-registery.h" or #include "blosc2/filters-registery.h" depending on the plugin type at the beginning of the file

  • #include "blosc2/blosc2.h" at the beginning of the file

  • Call blosc_init() at the beginning of main() function

  • Call blosc_destroy() at the end of main() function

Then you just have to use the ID of the plugin that you want to use in the compression parameters (cparams.compcode).

#include "blosc2.h"
#include "../codecs-registry.h"
int main(void) {
    blosc_init();
    ...
    blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
    cparams.compcode = BLOSC_CODEC_NDLZ;
    cparams.compcode_meta = 4;
    ...
    blosc_destroy();
}

In case of doubts, you can see how the whole process works in working tests like: test_ndlz.c, test_ndcell.c, test_ndmean_mean.c and test_ndmean_repart.c.

Final remarks

The plugin register functionality let use new codecs and filters within Blosc in an easy and quick way. To enhance the plugin experience, we have implemented a centralized plugin repository, so that users can propose their own plugins to be in the standard C-Blosc2 library for the benefit of all the Blosc community.

The Blosc Development Team kindly invites you to test the different plugins we already offer, but also to try with your own one. Besides, if you are willing to contribute it to the community, then apply to register it. This way everyone will be able to enjoy a variety of different and unique plugins. Hope you will enjoy this new and exciting feature!

Last but not least, a big thank you to the NumFOCUS foundation for providing a grant for implementing the register functionality.

Wrapping C-Blosc2 in Python (a beginner's view)

An initial release of the Python wrapper for C-Blosc2 is now available in: https://github.com/Blosc/python-blosc2. In this blog I will try to explain some of the most difficult aspects that I had to learn in doing this and how I solved them.

This work is being made thanks to a grant from the Python Software Foundation.

Python views

At university, the first programming language that I learned was Python. But because programming was new for the majority of the class the subject only covered the basics: basic statements and classes. And although these were easy to understand, the views were unknown to me (until now).

To explain what the views are, let’s suppose we have the following code in Python:

>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3

The reference count for the object is 3: a, b and the argument passed to sys.getrefcount().

Basically, to avoid making copies of a same variable, Python uses views. Every variable has its counter and until the counter is 0, the variable is not deleted. But that means that two threads cannot access the counter at the same time. Because having a lock for every variable would be inefficient and could produce deadlocks (which means that several threads are waiting for each other), the GIL was created. So GIL was my next thing to learn.

GIL and Cython

GIL stands for Global Interpreter Lock. With a single lock on the interpreter there are no deadlocks. But the execution of any Python program must acquire the interpreter lock, which prevents some programs to take advantage of the multi-threading execution.

When writing C extensions, this lock is very useful because it can be released. Thus, the program can be more efficient (i.e. threads can actually run in parallel). To write a function with the GIL I spent many time reading about it. Unfortunately, nothing seemed to expain what I wanted to do until I found this nice blog from Nicolas Hug in which he explains the 3 rules you have to follow to make Cython release the GIL.

First of all, Cython needs to know which C functions that were imported are thread-safe. This is done by using the nogil statement in the function declaration. Then, inside the function the with nogil statement lets Cython know that this block is going to be executed with the GIL released. But to make that code block safe, there cannot be any Python interaction inside that block.

To understand it better, an example is shown below:

cdef extern from "math_operation.h":
    int add(int a, int b)nogil

cpdef sum(src, dest):
    cdef int len_src = len(src)
    cdef int len_dest = len(dest)
    cdef int result
    with nogil:
        # Code with the GIL released
        result = add(len_src, len_dest)
    # Code with the GIL, any Python interaction can be done here

The function sum returns the result of adding the length of src and dest. As you can see, the function has been defined with the cpdef statement instead of the def. The c lets Cython know that this function can be called with C. So this is necessary when writing a function with the GIL released, otherwise you will be trying to execute a Python program without the GIL (which, as explained previously cannot be done). Notice that len_src and len_dest have also been defined as C integers with the cdef int statement. If not, it would not be possible to work with them with the GIL released (the with nogil block).

On the other hand, the p lets Cython know that this function can be called through Python. This does not have to be done always, only when you want to call that function from Python.

Cython typed memoryviews

One of the main differences between the python-blosc and python-blosc2 API, is that the functions compress_ptr and decompress_ptr are no longer supported. We decided to do so, because the Pickle protocol 5 already makes an optimization of the copies. That way, we could have a similar performance for compress_ptr and decompress_ptr but with the functions pack and unpack.

However, when timing the functions I realised that in the majority of the cases, although the compress function from python-blosc2 was faster than the compress_ptr, the decompress function was slower than the decompress_ptr. Thus I checked the code to see if the speed could somehow be increased.

Originally, the code used the Python Buffer Protocol. which is part of the Python/C API. The Python Buffer Protocol lets you (among other things) obtain a pointer to the raw data of an object. But because it wasn't clear for me wether it needed to do a copy or not we decided to work with Cython typed memoryviews.

Cython typed memoryviews are very similar to Python memory views, but with the main difference that the first ones are a C-level type and therefore they do not have much Python overhead. Because it is a C-level type you have to know the dimension of the buffer from which you want to obtain the typed memoryview as well as its data type.

The shape dimension of the buffer is expressed writing as many : between brackets as dimensions it has. If the memory is allocated contiguously, you can write ::1 instead in the corresponding dimension. On the other hand, the type is expressed as you would do it in Cython. In the following code, you can see an example for a one-dimensional numpy array:

import numpy as np
arr = np.ones((10**6,), dtype=np.double)
cdef double [:] typed_view = arr

However, if you want to define a function that receives an object whose type may be unknown, you will have to create a Python memoryview and then cast it into the type you wish as in the next example:

# Get a Python memoryview from an object
mem_view = memoryview(object)
# Cast that memory view into an unsigned char memoryview
cdef unsigned char[:]typed_view = mem_view.cast('B')

The 'B' indicates to cast the memoryview type into an unsigned char.

But if I run the latter code for a binary Python string, it produces a runtime error. It took me 10 minutes to fix the error adding the const statement to the definition of the Cython typed memoryview (as shown below), but I spent two days trying to understand the error and its solution.

# Get a Python memoryview from an object
mem_view = memoryview(object)
# Cast that memory view into an unsigned char memoryview
cdef const unsigned char[:]typed_view = mem_view.cast('B')

The reason why the const statement fixed it, is that a binary Python string is a read-only buffer. By declaring the typed memoryview to const, Cython is being told that the object from the memory view is a read-only buffer so that it cannot change it.

Conclusions

So far, my experience wrapping C-Blosc2 has had some ups and downs.

One method that I use whenever I learn something new is to write down a summary of what I read. Sometimes is almost a copy (therefore some people may find it useless), but it always works really well for me. It helps me connect the ideas better or to build a global idea of what I have or want to do.

Another aspect I realized when doing this wrapper is that because I am a stubborn person, I usually tend to force myself to try to understand something and get frustrated if I do not. However, I have to recognize that sometimes it is better to forget about it until the next day. Your brain will organize your ideas at night so that you can invest better your time the next morning.

But maybe the most difficult part for me was the beginning, and therefore I have to thank Francesc Alted and Aleix Alcacer for giving me a push into the not always easy world of Python extensions.

C-Blosc2 Ready for General Review

On behalf of the Blosc team, we are happy to announce the first C-Blosc2 release (Release Candidate 1) that is meant to be reviewed by users. As of now we are declaring both the API and the format frozen, and we are seeking for feedback from the community so as to better check the library and declare it apt for its use in production.

Some history

The next generation Blosc (aka Blosc2) started back in 2015 as a way to overcome some limitations of the Blosc compressor, mainly the limitation of 2 GB for the size of data to be compressed. But it turned out that I wanted to make thinks a bit more complete, and provide a native serialization too. During that process Google awarded my contributions to Blosc with the Open Source Peer Bonus Program in 2017. This award represented a big emotional push for me in persisting in the efforts towards producing a stable release.

Back in 2018, Zeeman Wang from Huawei invited me to go to their central headquarters in Shenzhen to meet a series of developers that were trying to use compression in a series of scenarios. During two weeks we had a series of productive meetings, and I got aware of the many possibilities that compression is opening in industry: since making phones with limited hardware to work faster to accelerate computations on high-end computers. That was also a great opportunity for me to better know a millennial culture; I was genuinely interested to see how people live, eat and socialize in China.

In 2020, Huawei graciously offered a grant to the Blosc project to complete the project. Since then, we have got donations from several other sources (like NumFOCUS, Python Software Foundation, ESRF among them). Lately ironArray is sponsoring two of us (Aleix Alcacer and myself) to work partial time on Blosc related projects.

Thanks to all this support, the Blosc development team has been able to grow quite a lot (we are currently 5 people in the core team) and we have been able to work hard at producing a series of improvements in different projects under the Blosc umbrella, in particular C-Blosc2, Python-Blosc2, Caterva and cat4py.

As you see, there is a lot of development going on around C-Blosc2 other than C-Blosc2 itself. In this installment I am going to focus just on the main features that C-Blosc2 is bringing, but hopefully all the other projects in the ecosystem will also complement its existing functionality. When all these projects would be ready, we hope that users will be able to use them to store big amounts of data in a way that is both efficient, easy-to-use and most importantly, adapted to their needs.

New features of C-Blosc2

Here it is the list of the main features that we are releasing today:

  • 64-bit containers: the first-class container in C-Blosc2 is the super-chunk or, for brevity, schunk, that is made by smaller chunks which are essentially C-Blosc1 32-bit containers. The super-chunk can be backed or not by another container which is called a frame (see later).

  • More filters: besides shuffle and bitshuffle already present in C-Blosc1, C-Blosc2 already implements:

    • delta: the stored blocks inside a chunk are diff'ed with respect to first block in the chunk. The idea is that, in some situations, the diff will have more zeros than the original data, leading to better compression.

    • trunc_prec: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the shuffle or bitshuffle filter, this leads to more contiguous zeros, which are compressed better.

  • A filter pipeline: the different filters can be pipelined so that the output of one can the input for the other. A possible example is a delta followed by shuffle, or as described above, trunc_prec followed by bitshuffle.

  • Prefilters: allows to apply user-defined C callbacks prior the filter pipeline during compression. See test_prefilter.c for an example of use.

  • Postfilters: allows to apply user-defined C callbacks after the filter pipeline during decompression. The combination of prefilters and postfilters could be interesting for supporting e.g. encryption (via prefilters) and decryption (via postfilters). Also, a postfilter alone can used to produce on-the-flight computation based on existing data (or other metadata, like e.g. coordinates). See test_postfilter.c for an example of use.

  • SIMD support for ARM (NEON): this allows for faster operation on ARM architectures. Only shuffle is supported right now, but the idea is to implement bitshuffle for NEON too. Thanks to Lucian Marc.

  • SIMD support for PowerPC (ALTIVEC): this allows for faster operation on PowerPC architectures. Both shuffle and bitshuffle are supported; however, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer and ESRF for sponsoring the Blosc team in helping him in this task.

  • Dictionaries: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, it is only supported in the zstd codec, but would be nice to extend it to lz4 and blosclz at least.

  • Contiguous frames: allow to store super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported.

  • Sparse frames (on-disk): each chunk in a super-chunk is stored in a separate file, as well as the metadata. This is the counterpart of in-memory super-chunk, and allows for more efficient updates than in frames (i.e. avoiding 'holes' in monolithic files).

  • Partial chunk reads: there is support for reading just part of chunks, so avoiding to read the whole thing and then discard the unnecessary data.

  • Parallel chunk reads: when several blocks of a chunk are to be read, this is done in parallel by the decompressing machinery. That means that every thread is responsible to read, post-filter and decompress a block by itself, leading to an efficient overlap of I/O and CPU usage that optimizes reads to a maximum.

  • Meta-layers: optionally, the user can add meta-data for different uses and in different layers. For example, one may think on providing a meta-layer for NumPy so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter for adding more high-level info if desired (e.g. geo-spatial, meteorological...).

  • Variable length meta-layers: the user may want to add variable-length meta information that can be potentially very large (up to 2 GB). The regular meta-layer described above is very quick to read, but meant to store fixed-length and relatively small meta information. Variable length metalayers are stored in the trailer of a frame, whereas regular meta-layers are in the header.

  • Efficient support for special values: large sequences of repeated values can be represented with an efficient, simple and fast run-length representation, without the need to use regular codecs. With that, chunks or super-chunks with values that are the same (zeros, NaNs or any value in general) can be built in constant time, regardless of the size. This can be useful in situations where a lot of zeros (or NaNs) need to be stored (e.g. sparse matrices).

  • Nice markup for documentation: we are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C-API. See https://c-blosc2.readthedocs.io. Thanks to Alberto Sabater and Aleix Alcacer for contributing the support for this.

  • Plugin capabilities for filters and codecs: we have a plugin register capability inplace so that the info about the new filters and codecs can be persisted and transmitted to different machines. Thanks to the NumFOCUS foundation for providing a grant for doing this.

  • Pluggable tuning capabilities: this will allow users with different needs to define an interface so as to better tune different parameters like the codec, the compression level, the filters to use, the blocksize or the shuffle size. Thanks to ironArray for sponsoring us in doing this.

  • Support for I/O plugins: so that users can extend the I/O capabilities beyond the current filesystem support. Things like use databases or S3 interfaces should be possible by implementing these interfaces. Thanks to ironArray for sponsoring us in doing this.

  • Python wrapper: we have a preliminary wrapper in the works. You can have a look at our ongoing efforts in the python-blosc2 repo. Thanks to the Python Software Foundation for providing a grant for doing this.

  • Security: we are actively using using the OSS-Fuzz and ClusterFuzz for uncovering programming errors in C-Blosc2. Thanks to Google for sponsoring us in doing this.

As you see, the list is long and hopefully you will find compelling enough features for your own needs. Blosc2 is not only about speed, but also about providing

Tasks to be done

Even if the list of features above is long, we still have things to do in Blosc2; and the plan is to continue the development, although always respecting the existing API and format. Here are some of the things in our TODO list:

  • Centralized plugin repository: we have got a grant from NumFOCUS for implementing a centralized repository so that people can send their plugins (using the existing machinery) to the Blosc2 team. If the plugins fulfill a series of requirements, they will be officially accepted, and distributed withing the library.

  • Improve the safety of the library: although this is always a work in progress, we did a long way in improving our safety, mainly thanks to the efforts of Nathan Moinvaziri.

  • Support for lossy compression codecs: although we already support the trunc_prec filter, this is only valid for floating point data; we should come with lossy codecs that are meant for any data type.

  • Checksums: the frame can benefit from having a checksum per every chunk/index/metalayer. This will provide more safety towards frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted. Candidates for checksums can be the xxhash32 or xxhash64, depending on the goals (to be decided).

  • Documentation: utterly important for attracting new users and making the life easier for existing ones. Important points to have in mind here:

    • Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?.

    • Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most extended and useful markup systems is Markdown, so this should also be the first candidate to use here.

  • Lock support for super-chunks: when different processes are accessing concurrently to super-chunks, make them to sync properly by using locks, either on-disk (frame-backed super-chunks), or in-memory. Such a lock support would be configured in build time, so it could be disabled with a cmake flag.

It would be nice that, in case some of this feature (or a new one) sounds useful for you, you can help us in providing either code or sponsorship.

Summary

Since 2015, it has been a long time to get C-Blosc2 so much featured and tested. But hopefully the journey will continue because as Kavafis said:

As you set out for Ithaka
hope your road is a long one,
full of adventure, full of discovery.

Let me thank again all the people and sponsors that we have had during the life of the Blosc project; without them we would not be where we are now. We do hope that C-Blosc2 will have a long life and we as a team will put our soul in making that trip to last as long as possible.

Now is your turn. We expect you to start testing the library as much as possible and report back. With your help we can get C-Blosc2 in production stage hopefully very soon. Thanks in advance!

Blosc metalayers, where the user metainformation is stored

The C-Blosc2 library has two different spaces to store user-defined information. In this post, we are going to describe what these spaces are and where they are stored inside a Blosc2 frame (a persistent super-chunk).

As its name suggests, a metalayer is a space that allows users to store custom information. For example, Caterva, a project based on C-Blosc2 that handles compressed and chunked arrays, uses these metalayers to store the dimensions and the shape, chunkshape and blockshape of the arrays.

Fixed-length metalayers

The first kind of metalayers in Blosc2 are the fixed-length metalayers. These metalayers are stored in the header of the frame. This decision allows adding chunks to the frame without the need to rewrite the whole meta information and data coming after it.

But this implementation has some drawbacks. The most important one is that fixed-length metalayers cannot be resized. Furthermore, once the first chunk of data is added to the super-chunk, no more fixed-length metalayers can be added either.

Let's see with an example the reason for these restrictions. Supose that we have a frame that stores 10 GB of data with a metalayer containing a "cat". If we update the meta information with a "dog" we can do that because they have exactly the same size.

However, if we were to update the meta information with a "giraffe", the metalayer would need to be resized and therefore we would have to rewrite the 10GB of data plus the trailer. This would obviously be very inefficient and hence, not allowed:

/images/metalayers/metalayers.png

Data that would need to be rewritten are ploted in red.

Variable-length metalayers

To fix the above issue, we have introduced variable-length metalayers. Unlike fixed-length metalayers, these are stored in the trailer section of the frame.

As their name suggests, these metalayers can be resized. Blosc can do that because, whenever the metalayers content are modified, Blosc rewrites the trailer completely, using more space if necessary. Furthermore, and since these metalayers are stored in the trailer, they will also be rewritten each time a chunk is added.

Another feature of variable-length metalayers is that their content is compressed by default (in contrast to fixed-length metalayers). This will minimize the size of the trailer, a very important feature because since the trailer is rewritten every time new data is added, we want to keep it as small as possible so as to optimize data written.

Let's continue with the previous example, but storing the meta information in a variable-length metalayer now:

/images/metalayers/metalayers-vl.png

In this case the trailer is rewritten each time that we update the metalayer, but it is a much more efficient operation than rewriting all the data (as a fixed-length metalayer would require). So the variable-length metalayers complement the fixed-length metalayers by bringing different capabilities on the table. Depending on her needs, it is up to the user to choose one or another metalayer storage.

Fixed-length vs variable-length metalayers comparsion

To summarize, and to better see what kind of metalayer is more suitable for each situation, the following table contains a comparison between fixed-length metalayers and variable-length metalayers:

Fixed-length metalayers

Variable-length metalayers

Where are stored?

Header

Trailer

Can be resized?

No

Yes

Can be added after adding chunks?

No

Yes

Are they rewritten when adding chunks?

No

Yes

Metalayers API

Currently, C-Blosc2 has the following functions implemented:

  • blosc2_meta_add() / blosc2_vlmeta_add(): Add a new metalayer.

  • blosc2_meta_get() / blosc2_vlmeta_get(): Get the metalayer content.

  • blosc2_meta_exists() / blosc2_vlmeta_exists(): Check if a metalayer exists or not.

  • blosc2_meta_update() / blosc2_vlmeta_update(): Update the metalayer content.

Conclusions

As we have seen, Blosc2 supports two different spaces where users can store their meta information. The user can choose one or another depending on her needs.

On the one hand, the fixed-length metalayers are meant to store user meta information that does not change size over time. They are stored in the header and can be updated without having to rewrite any other part of the frame, but they can no longer be added once the first chunk of data is added.

On the other hand, for users storing meta information that is going to change in size over time, they can store their meta information into variable-length metalayers. These are stored in the trailer section of a frame and are more flexible than its fixed-length counterparts. However, each time that a metalayer content is updated, the whole trailer has to be rewritten.