Blosc Main Blog Page (Posts by Francesc Alted)

Exploring lossy compression with Blosc2

Francesc Alted — Tue, 13 Feb 2024 01:32:20 GMT

In the realm of data compression, efficiency is key. Whether you're dealing with massive datasets or simply aiming to optimize storage space and transmission speeds, the choice of compression algorithm can make a significant difference. In this blog post, we'll delve into the world of lossy compression using Blosc2, exploring its capabilities, advantages, and potential applications.

Understanding lossy compression

Unlike lossless compression, where the original data can be perfectly reconstructed from the compressed version, lossy compression involves discarding some information to achieve higher compression ratios. While this inevitably results in a loss of fidelity, the trade-off is often justified by the significant reduction in storage size.

Lossy compression techniques are commonly employed in scenarios where minor degradation in quality is acceptable, such as multimedia applications (e.g., images, audio, and video) and scientific data analysis. By intelligently discarding less crucial information, lossy compression algorithms can achieve substantial compression ratios while maintaining perceptual quality within acceptable bounds.

Lossy codecs in Blosc2

In the context of Blosc2, lossy compression can be achieved either through a combination of traditional compression algorithms and filters that can selectively discard less critical data, or by using codecs specially meant for doing so.

Filters for truncating precision

Since its inception, Blosc2 has featured the TRUNC_PREC filter, which is meant to discard the least significant bits from floating-point values (be they float32 or float64). This filter operates by zeroing out the designated bits slated for removal, resulting in enhanced compression. To see the impact on compression ratio and speed, an illustrative example here.

A particularly useful use case of the TRUNC_PREC filter is to truncate precision of float32/float64 types to either 8 or 16 bit; this is a quick and dirty way to ‘fake’ float8 or float16 types, which are very much used in AI nowadays, and contain storage needs.

In that vein, we recently implemented the INT_TRUNC filter, which does the same as TRUNC_PREC, but for integers (int8, int16, int32 and int64, and their unsigned counterparts). With both TRUNC_PREC and INT_TRUNC, you can specify an acceptable precision for most numerical data types.

Codecs for NDim datasets

Blosc2 has support for ZFP, another codec that is very useful for compressing multidimensional datasets. Although ZFP itself supports both lossless and lossy compression, Blosc2 makes use of its lossy capabilities only (the lossless ones are supposed to be already covered by other codecs in Blosc2). See this blog post for more info on the kind of lossy compression that can be achieved with ZFP.

Codecs for images

In addition, we recently included support for a couple of codecs that support the JPEG 2000 standard. One is OpenJ2HK, and the other is grok. Both have good, high quality JPEG 2000 implementations, but grok is a bit more advanced and has support for 16-bit gray images; we have blogged about it.

Experimental filters

Finally, you may want to experiment with some filters and codecs that were mainly designed to be a learning tool for people wanting to implement their own ones. Among them you can find:

NDCELL: A filter that groups data in multidimensional cells, reordering them so that the codec can find better repetition patterns on a cell-by-cell basis.
NDMEAN: A multidimensional filter for lossy compression in multidimensional cells, replacing all elements in a cell by the mean of the cell. This allows for better compressions by the actual compression codec (e.g. NDLZ).
NDLZ: A compressor based on the Lempel-Ziv algorithm for 2-dim datasets. Although this is a lossless compressor, it is actually meant to be used in combination with the NDCELL and NDMEAN above, providing lossy compression for the latter case.

Again, the codecs in this section are not specially efficient, but can be used for learning about the compression pipeline in Blosc2. For more info on how to implement (and register) your own filters, see this blog post.

Applications and use cases

The versatility of Blosc2's lossy compression capabilities opens up a myriad of applications across different domains. In scientific computing, for example, where large volumes of data are generated and analyzed, lossy compression can significantly reduce storage requirements without significantly impacting the accuracy of results.

Similarly, in multimedia applications, such as image and video processing, lossy compression can help minimize bandwidth usage and storage costs while maintaining perceptual quality within acceptable limits.

Compressing images with JPEG 2000 and with INT_TRUNC

As an illustration, a recent study involved the compression of substantial volumes of 16-bit grayscale images sourced from different synchrotron facilities in Europe. While achieving efficient compression ratios necessitates the use of lossy compression techniques, it is essential to exercise caution to preserve key features for clear visual examination and accurate numerical analysis. Below, we provide an overview of how Blosc2 can employ various codecs and quality settings within filters to accomplish this task.

The SSIM index, derived from the Structural Similarity Measure, gauges the perceived quality of an image, with values closer to 1 indicating higher fidelity. You can appreciate the varying levels of fidelity achievable through the utilization of different filters and codecs.

In terms of performance, each of these compression methods also showcases significantly varied speeds (tested on a MacBook Air with an M1 processor):

A pivotal benefit of Blosc2's strategy for lossy compression lies in its adaptability and configurability. This enables tailoring to unique needs and limitations, guaranteeing optimal performance across various scenarios.

Using Blosc2 within HDF5

HDF5 is a widely used data format, and both major Python wrappers, h5py (via hdf5plugin) and PyTables, offer basic support for Blosc2. However, accessing the full capabilities of the Blosc2 compression pipeline is somewhat restricted because the current hdf5-blosc2 filter, available in PyTables (and used by hdf5plugin), is not yet equipped to transmit all the necessary parameters to the HDF5 data pipeline.

Thankfully, HDF5 includes support for the direct chunking mechanism, which enables the direct transmission of pre-compressed chunks to HDF5, bypassing its standard data pipeline. Since h5py also offers this functionality, it's entirely feasible to leverage all the advanced features of Blosc2, including lossy compression. Below are a couple of examples illustrating how this process operates:

Conclusion

Lossy compression is a powerful tool for optimizing storage space, reducing bandwidth usage, and improving overall efficiency in data handling. With Blosc2, developers have access to a robust and flexible compression library for both lossless and lossy compression modes.

With its advanced compression methodologies and adept memory management, Blosc2 empowers users to strike a harmonious balance between compression ratio, speed, and fidelity. This attribute renders it especially suitable for scenarios where resource limitations or performance considerations hold significant weight.

Finally, there are ongoing efforts towards integrating fidelity into our BTune AI tool. This enhancement will empower the tool to autonomously identify the most suitable codecs and filters, balancing compression level, precision, and fidelity according to user-defined preferences. Keep an eye out for updates!

Whether you're working with scientific data, multimedia content, or large-scale datasets, Blosc2 offers a comprehensive solution for efficient data compression and handling.

Special thanks to sponsors and developers

Gratitude goes out to our sponsors over the years, with special recognition to the LEAPS collaboration and NumFOCUS, whose support has been instrumental in advancing the lossy compression capabilities within Blosc2.

The Blosc2 project is the outcome of the work of many developers.

Bytedelta: Enhance Your Compression Toolset

Francesc Alted — Fri, 24 Mar 2023 11:32:20 GMT

Bytedelta is a new filter that calculates the difference between bytes in a data stream. Combined with the shuffle filter, it can improve compression for some datasets. Bytedelta is based on initial work by Aras Pranckevičius.

TL;DR: We have a brief introduction to bytedelta in the 3rd section of this presentation.

The basic concept is simple: after applying the shuffle filter,

then compute the difference for each byte in the byte streams (also called splits in Blosc terminology):

The key insight enabling the bytedelta algorithm lies in its implementation, especially the use of SIMD on Intel/AMD and ARM NEON CPUs, making the filter overhead minimal.

Although Aras's original code implemented shuffle and bytedelta together, it was limited to a specific item size (4 bytes). Making it more general would require significant effort. Instead, for Blosc2 we built on the existing shuffle filter and created a new one that just does bytedelta. When we insert both in the Blosc2 filter pipeline (it supports up to 6 chained filters), it leads to a completely general filter that works for any type size supported by existing shuffle filter.

With that said, the implementation of the bytedelta filter has been a breeze thanks to the plugin support in C-Blosc2. You can also implement your own filters and codecs on your own, or if you are too busy, we will be happy to assist you.

Compressing ERA5 datasets

The best approach to evaluate a new filter is to apply it to real data. For this, we will use some of the ERA5 datasets, representing different measurements and labeled as "wind", "snow", "flux", "pressure" and "precip". They all contain floating point data (float32) and we will use a full month of each one, accounting for 2.8 GB for each dataset.

The diverse datasets exhibit rather dissimilar complexity, which proves advantageous for testing diverse compression scenarios. For instance, the wind dataset appears as follows:

The image shows the intricate network of winds across the globe on October 1, 1987. The South American continent is visible on the right side of the map.

Another example is the snow dataset:

This time the image is quite flat. Here one can spot Antarctica, Greenland, North America and of course, Siberia, which was pretty full of snow by 1987-10-01 23:00:00 already.

Let's see how the new bytedelta filter performs when compressing these datasets. All the plots below have been made using a box with an Intel i13900k processor, 32 GB of RAM and using Clear Linux.

In the box plot above, we summarized the compression ratios for all datasets using different codecs (BLOSCLZ, LZ4, LZ4HC and ZSTD). The main takeaway is that using bytedelta yields the best median compression ratio: bytedelta achieves a median of 5.86x, compared to 5.62x for bitshuffle, 5.1x for shuffle, and 3.86x for codecs without filters. Overall, bytedelta seems to improve compression ratios here, which is good news.

While the compression ratio is a useful metric for evaluating the new bytedelta filter, there is more to consider. For instance, does the filter work better on some data sets than others? How does it impact the performance of different codecs? If you're interested in learning more, read on.

Effects on various datasets

Let's see how different filters behave on various datasets:

Here we see that, for datasets that compress easily (precip, snow), the behavior is quite different from those that are less compressible. For precip, bytedelta actually worsens results, whereas for snow, it slightly improves them. For less compressible datasets, the trend is more apparent, as can be seen in this zoomed in image:

In these cases, bytedelta clearly provides a better compression ratio, most specifically with the pressure dataset, where compression ratio by using bytedelta has increased by 25% compared to the second best, bitshuffle (5.0x vs 4.0x, using ZSTD clevel 9). Overall, only one dataset (precip) shows an actual decrease. This is good news for bytedelta indeed.

Furthermore, Blosc2 supports another compression parameter for splitting the compressed streams into bytes with the same significance. Normally, this leads to better speed but less compression ratio, so this is automatically activated for faster codecs, whereas it is disabled for slower ones. However, it turns out that, when we activate splitting for all the codecs, we find a welcome surprise: bytedelta enables ZSTD to find significantly better compression paths, resulting in higher compression ratios.

As can be seen, in general ZSTD + bytedelta can compress these datasets better. For the pressure dataset in particular, it goes up to 5.7x, 37% more than the second best, bitshuffle (5.7x vs 4.1x, using ZSTD clevel 9). Note also that this new highest is 14% more than without splitting (the default).

This shows that when compressing, you cannot just trust your intuition for setting compression parameters - there is no substitute for experimentation.

Effects on different codecs

Now, let's see how bytedelta affects performance for different codecs and compression levels.

Interestingly, on average bytedelta proves most useful for ZSTD and higher compression levels of ZLIB (Blosc2 comes with ZLIB-NG). On the other hand, the fastest codecs (LZ4, BLOSCLZ) seem to benefit more from bitshuffle instead.

Regarding compression speed, in general we can see that bytedelta has little effect on performance:

As we can see, compression algorithms like BLOSCLZ, LZ4 and ZSTD can achieve extremely high speeds. LZ4 reaches and surpasses speeds of 30 GB/s, even when using bytedelta. BLOSCLZ and ZSTD can also exceed 20 GB/s, which is quite impressive.

Let’s see the compression speed grouped by compression levels:

Here one can see that, to achieve the highest compression rates when combined with shuffle and bytedelta, the codecs require significant CPU resources; this is especially noticeable in the zoomed-in view:

where capable compressors like ZSTD do require up to 2x more time to compress when using bytedelta, especially for high compression levels (6 and 9).

Now, let us examine decompression speeds:

In general, decompression is faster than compression. BLOSCLZ, LZ4 and LZ4HC can achieve over 100 GB/s. BLOSCLZ reaches nearly 180 GB/s using no filters on the snow dataset (lowest complexity).

Let’s see the decompression speed grouped by compression levels:

The bytedelta filter noticeably reduces speed for most codecs, up to 20% or more. ZSTD performance is less impacted.

Achieving a balance between compression ratio and speed

Often, you want to achieve a good balance of compression and speed, rather than extreme values of either. We will conclude by showing plots depicting a combination of both metrics and how bytedelta influences them.

Let's first represent the compression ratio versus compression speed:

As we can see, the shuffle filter is typically found on the Pareto frontier (in this case, the point furthest to the right and top). Bytedelta comes next. In contrast, not using a filter at all is on the opposite side. This is typically the case for most real-world numerical datasets.

Let's now group filters and datasets and calculate the mean values of combining (in this case, multiplying) the compression ratio and compression speed for all codecs.

As can be seen, bytedelta works best with the wind dataset (which is quite complex), while bitshuffle does a good job in general for the others. The shuffle filter wins on the snow dataset (low complexity).

If we group by compression level:

We see that bytedelta works well with LZ4 here, and also with ZSTD at the lowest compression level (1).

Let's revise the compression ratio versus decompression speed comparison:

Let's group together the datasets and calculate the mean for all codecs:

In this case, shuffle generally prevails, with bitshuffle also doing reasonably well, winning on precip and pressure datasets.

Also, let’s group the data by compression level:

We find that bytedelta compression does not outperform shuffle compression in any scenario. This is unsurprising since decompression is typically fast, and bytedelta's extra processing can decrease performance more easily. We also see that LZ4HC (clevel 6 and 9) + shuffle strikes the best balance in this scenario.

Finally, let's consider the balance between compression ratio, compression speed, and decompression speed:

Here the winners are shuffle and bitshuffle, depending on the data set, but bytedelta never wins.

If we group by compression levels:

Overall, we see LZ4 as the clear winner at any level, especially when combined with shuffle. On the other hand, bytedelta did not win in any scenario here.

Benchmarks for other computers

We have run the benchmarks presented here in an assortment of different boxes:

Also, find here a couple of runs using the i9-13900K box above, but with the always split and never split settings:

Reproducing the benchmarks is straightforward. First, download the data; the downloaded files will be in the new era5_pds/ directory. Then perform the series of benchmarks; this is takes time, so grab coffee and wait 30 min (fast workstations) to 6 hours (slow laptops). Finally, run the plotting Jupyter notebook to explore your results. If you wish to share your results with the Blosc development team, we will appreciate hearing from you!

Conclusion

Bytedelta can achieve higher compression ratios in most datasets, specially in combination with capable codecs like ZSTD, with a maximum gain of 37% (pressure) over other codecs; only in one case (precip) compression ratio decreases. By compressing data more efficiently, bytedelta can reduce file sizes even more, accelerating transfer and storage.

On the other hand, while bytedelta excels at achieving high compression ratios, this requires more computing power. We have found that for striking a good balance between high compression and fast compression/decompression, other filters, particularly shuffle, are superior overall.

We've learned that no single codec/filter combination is best for all datasets:

ZSTD (clevel 9) + bytedelta can get better absolute compression ratio for most of the datasets.
LZ4 + shuffle is well-balanced for all metrics (compression ratio, speed, decompression speed).
LZ4 (clevel 6) and ZSTD (clevel 1) + shuffle strike a good balance of compression ratio and speed.
LZ4HC (clevel 6 and 9) + shuffle balances well compression ratio and decompression speed.
BLOSCLZ without filters achieves best decompression speed (at least in one instance).

In summary, the optimal choice depends on your priorities.

As a final note, the Blosc development team is working on BTune, a new deep learning tuner for Blosc2. BTune can be trained to automatically recognize different kinds of datasets and choose the optimal codec and filters to achieve the best balance, based on the user's needs. This would create a much more intelligent compressor that can adapt itself to your data faster, without requiring time-consuming manual tuning. If interested, contact us; we are looking for beta testers!

100 Trillion Rows Baby

Francesc Alted — Fri, 10 Feb 2023 10:32:20 GMT

In recently released PyTables 3.8.0 we gave support for an optimized path for writing and reading Table instances with Blosc2 cooperating with the HDF5 machinery. On the blog describing its implementation we have shown how it collaborates with the HDF5 library so as to get top-class I/O performance.

Since then, we have been aware (thanks to Mark Kittisopikul) of the introduction of the H5Dchunk_iter function in HDF5 1.14 series. This predates the functionality of H5Dget_chunk_info, and makes retrieving the offsets of the chunks in the HDF5 file way more efficiently, specially on files with a large number of chunks - H5Dchunk_iter cost is O(n), whereas H5Dget_chunk_info is O(n^2).

As we decided to implement support for H5Dchunk_iter in PyTables, we were curious on the sort of boost this could provide reading tables created from real data. Keep reading for the experiments we've conducted about this.

Effect on (relatively small) datasets

We start by reading a table with real data coming from our usual ERA5 database. We fetched one year (2000 to be specific) of data with five different ERA5 datasets with the same shape and the same coordinates (latitude, longitude and time). This data has been stored on a table with 8 columns with 32 bytes per row and with 9 millions rows (for a grand total of 270 GB); the number of chunks is about 8K.

When using compression, the size is typically reduced between a factor of 6x (LZ4 + shuffle) and 9x (Zstd + bitshuffle); in any case, the resulting file size is larger than the RAM available in our box (32 GB), so we can safely exclude OS filesystem caching effects here. Let's have a look at the results on reading this dataset inside PyTables (using shuffle only; for bitshuffle results are just a bit slower):

We see how the improvement when using HDF5 1.14 (and hence H5Dchunk_iter) for reading data sequentially (via a PyTables query) is not that noticeable, but for random queries, the speedup is way more apparent. For comparison purposes, we added the figures for Blosc1+LZ4; one can notice the great job of Blosc2, specially in terms of random reads due to the double partitioning and HDF5 pipeline replacement.

A trillion rows table

But 8K chunks is not such a large figure, and we are interested in using datasets with a larger amount. As it is very time consuming to download large amounts of real data for our benchmarks purposes, we have decided to use synthetic data (basically, a bunch of zeros) just to explore how the new H5Dchunk_iter function scales when handling extremely large datasets in HDF5.

Now we will be creating a large table with 1 trillion rows, with the same 8 fields than in the previous section, but whose values are zeros (remember, we are trying to push HDF5 / Blosc2 to their limits, so data content is not important here). With that, we are getting a table with 845K chunks, which is about 100x more than in the previous section.

With this, lets' have a look at the plots for the read speed:

As expected, we are getting significantly better results when using HDF5 1.14 (with H5Dchunk_iter) in both sequential and random cases. For comparison purposes, we have added Blosc1-Zstd which does not make use of the new functionality. In particular, note how Blosc1 gets better results for random reads than Blosc2 with HDF5 1.12; as this is somehow unexpected, if you have an explanation, please chime in.

It is worth noting that even though the data are made of zeros, Blosc2 still needs to compress/decompress the full 32 TB thing. And the same goes for numexpr, which is used internally to perform the computations for the query in the sequential read case. This is testimonial of the optimization efforts in the data flow (i.e. avoiding as much memory copies as possible) inside PyTables.

100 trillion rows baby

As a final exercise, we took the previous experiment to the limit, and made a table with 100 trillion (that’s a 1 followed with 14 zeros!) rows and measured different interesting aspects. It is worth noting that the total size for this case is 2.8 PB (petabyte), and the number of chunks is around 85 millions (finally, large enough to fully demonstrate the scalability of the new H5Dchunk_iter functionality).

Here it is the speed of random and sequential reads:

As we can see, despite the large amount of chunks, the sequential read speed actually improved up to more than 75 GB/s. Regarding the random read latency, it increased to 60 µs; this is not too bad actually, as in real life the latencies during random reads in such a large files are determined by the storage media, which is no less than 100 µs for the fastest SSDs nowadays.

The script that creates the table and reads it can be found at bench/100-trillion-rows-baby.py. For the curious, it took about 24 hours to run on a Linux box wearing an Intel 13900K CPU with 32 GB of RAM. The memory consumption during writing was about 110 MB, whereas for reading was 1.7 GB steadily (pretty good for a multi-petabyte table). The final size for the file has been 17 GB, for a compression ratio of more than 175000x.

Conclusion

As we have seen, the H5Dchunk_iter function recently introduced in HDF5 1.14 is confirmed to be of a big help in performing reads more efficiently. We have also demonstrated that scalability is excellent, reaching phenomenal sequential speeds (exceeding 75 GB/s with synthetic data) that cannot be easily achieved by the most modern I/O subsystems, and hence avoiding unnecessary bottlenecks.

Indeed, the combo HDF5 / Blosc2 is able to handle monster sized tables (on the petabyte ballpark) without becoming a significant bottleneck in performance. Not that you need to handle such a sheer amount of data anytime soon, but it is always reassuring to use a tool that is not going to take a step back in daunting scenarios like this.

If you regularly store and process large datasets and need advice to partition your data, or choosing the best combination of codec, filters, chunk and block sizes, or many other aspects of compression, do not hesitate to contact the Blosc team at contact (at) blosc.org. We have more than 30 years of cumulated experience in storage systems like HDF5, Blosc and efficient I/O in general; but most importantly, we have the ability to integrate these innovative technologies quickly into your products, enabling a faster access to these innovations.

20 years of PyTables

Francesc Alted — Sat, 31 Dec 2022 12:32:20 GMT

Back in October 2002 the first version of PyTables was released. It was an attempt to store a large amount of tabular data while being able to provide a hierarchical structure around it. Here it is the first public announcement by me:

Hi!,

PyTables is a Python package which allows dealing with HDF5 tables.
Such a table is defined as a collection of records whose values are
stored in fixed-length fields.  PyTables is intended to be easy-to-use,
and tried to be a high-performance interface to HDF5.  To achieve this,
the newest improvements in Python 2.2 (like generators or slots and
metaclasses in brand-new classes) has been used.  Python creation
extension tool has been chosen to access the HDF5 library.

This package should be platform independent, but until now I’ve tested
it only with Linux.  It’s the first public release (v 0.1), and it is
in alpha state.

As noted, PyTables was an early adopter of generators and metaclasses that were introduced in the new (by that time) Python 2.2. It turned out that generators demonstrated to be an excellent tool in many libraries related with data science. Also, Pyrex adoption (which was released just a few months ago) greatly simplified the wrapping of native C libraries like HDF5.

By that time there were not that much Python libraries for persisting tabular data with a format that allowed on-the-flight compression, and that gave PyTables a chance to be considered as a good option. Some months later, PyCon 2003 accepted our first talk about PyTables. Since then, we (mainly me, with the support from Scott Prater on the documentation part) gave several presentations in different international conferences, like SciPy or EuroSciPy and its popularity skyrocketed somehow.

Cárabos Coop. V.

In 2005, and after receiving some good inputs on PyTables by some customers (including The HDF Group), we decided to try to make a life out of PyTables development and together with Vicent Mas and Ivan Vilata, we set out to create a cooperative called Cárabos Coop V. Unfortunately, and after 3 years of enthusiastic (and hard) work, we did not succeed in making the project profitable, and we had to close by 2008.

During this period we managed to make a professional version of PyTables that was using out-of core indexes (aka OPSI) as well as a GUI called ViTables. After closing Cárabos we open sourced both technologies, and we are happy to say that they are still in good use, most specially OPSI indexes, that are meant to perform fast queries in very large datasets; OPSI can still be used straight from pandas.

Crew renewal

After Cárabos closure, I (Francesc Alted) continued to maintain PyTables for a while, but in 2010 I expressed my desire to handover the project, and shortly after, a new gang of people, including Anthony Scopatz and Antonio Valentino, with Andrea Bedini joining shortly after, stepped ahead and took the challenge. This is where open source is strong: whenever a project faces difficulties, there are always people eager to jump up to the wagon and continue providing traction for it.

Attempt to merge with h5py

Meanwhile, the h5py package was receiving a great adoption, specially from the community that valued more the multidimensional arrays than the tabular side of the things. There was a feeling that we were duplicating efforts and by 2016, Andrea Bedini, with the help of Anthony Scopatz, organized a HackFest in Perth, Australia where developers of the h5py and PyTables gathered to attempt a merge of the two projects. After the initial work there, we continued this effort with a grant from NumFOCUS.

Unfortunately, the effort demonstrated to be rather complex, and we could not finish it properly (for the sake of curiosity, the attempt is still available). At any rate, we are actively encouraging people using both packages depending on the need; see for example, the tutorial on h5py/PyTables that Tom Kooij taught at SciPy 2017.

Satellite Projects: Blosc and numexpr

As many other open sources libraries, PyTables stands in the shoulders of giants, and makes use of amazing libraries like HDF5 or NumPy for doing its magic. In addition to that, and in order to allow PyTables push against the hardware I/O and computational limits, it leverages two high-performance packages: Blosc and numexpr. Blosc is in charge of compressing data efficiently and at very high speeds to overcome the limits imposed by the I/O subsystem, while numexpr allows to get maximum performance from computations in CPU when querying large tables. Both projects have been substantially improved by the PyTables crew, and actually, they are quite popular by themselves.

Specifically, the Blosc compressor, although born out of the needs of PyTables, it spun off as a standalone compressor (or meta-compressor, as it can use several codecs internally) meant to accelerate not just disk I/O, but also memory access in general. In an unexpected twist, Blosc2, has developed its own multi-level data partitioning system, which goes beyond the single-level partitions in HDF5, and is currently helping PyTables to reach new performance heights. By teaming with the HDF5 library (and hence PyTables), Blosc2 is allowing PyTables to query 100 trillion rows in human timeframes.

Thank you!

It has been a long way since PyTables started 20 years ago. We are happy to have helped in providing a useful framework for data storage and querying needs for many people during the journey.

Many thanks to all maintainers and contributors (either with code or donations) to the project; they are too numerous to mention them all here, but if you are reading this and are among them, you should be proud to have contributed to PyTables. In hindsight, the road may have been certainly bumpy, but it somehow worked and many difficulties have been surpassed; such is the magic and grace of Open Source!

C-Blosc2 Ready for General Review

Francesc Alted — Thu, 06 May 2021 10:32:20 GMT

On behalf of the Blosc team, we are happy to announce the first C-Blosc2 release (Release Candidate 1) that is meant to be reviewed by users. As of now we are declaring both the API and the format frozen, and we are seeking for feedback from the community so as to better check the library and declare it apt for its use in production.

Some history

The next generation Blosc (aka Blosc2) started back in 2015 as a way to overcome some limitations of the Blosc compressor, mainly the limitation of 2 GB for the size of data to be compressed. But it turned out that I wanted to make thinks a bit more complete, and provide a native serialization too. During that process Google awarded my contributions to Blosc with the Open Source Peer Bonus Program in 2017. This award represented a big emotional push for me in persisting in the efforts towards producing a stable release.

Back in 2018, Zeeman Wang from Huawei invited me to go to their central headquarters in Shenzhen to meet a series of developers that were trying to use compression in a series of scenarios. During two weeks we had a series of productive meetings, and I got aware of the many possibilities that compression is opening in industry: since making phones with limited hardware to work faster to accelerate computations on high-end computers. That was also a great opportunity for me to better know a millennial culture; I was genuinely interested to see how people live, eat and socialize in China.

In 2020, Huawei graciously offered a grant to the Blosc project to complete the project. Since then, we have got donations from several other sources (like NumFOCUS, Python Software Foundation, ESRF among them). Lately ironArray is sponsoring two of us (Aleix Alcacer and myself) to work partial time on Blosc related projects.

Thanks to all this support, the Blosc development team has been able to grow quite a lot (we are currently 5 people in the core team) and we have been able to work hard at producing a series of improvements in different projects under the Blosc umbrella, in particular C-Blosc2, Python-Blosc2, Caterva and cat4py.

As you see, there is a lot of development going on around C-Blosc2 other than C-Blosc2 itself. In this installment I am going to focus just on the main features that C-Blosc2 is bringing, but hopefully all the other projects in the ecosystem will also complement its existing functionality. When all these projects would be ready, we hope that users will be able to use them to store big amounts of data in a way that is both efficient, easy-to-use and most importantly, adapted to their needs.

New features of C-Blosc2

Here it is the list of the main features that we are releasing today:

64-bit containers: the first-class container in C-Blosc2 is the super-chunk or, for brevity, schunk, that is made by smaller chunks which are essentially C-Blosc1 32-bit containers. The super-chunk can be backed or not by another container which is called a frame (see later).
More filters: besides shuffle and bitshuffle already present in C-Blosc1, C-Blosc2 already implements:
- delta: the stored blocks inside a chunk are diff'ed with respect to first block in the chunk. The idea is that, in some situations, the diff will have more zeros than the original data, leading to better compression.
- trunc_prec: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the shuffle or bitshuffle filter, this leads to more contiguous zeros, which are compressed better.
A filter pipeline: the different filters can be pipelined so that the output of one can the input for the other. A possible example is a delta followed by shuffle, or as described above, trunc_prec followed by bitshuffle.
Prefilters: allows to apply user-defined C callbacks prior the filter pipeline during compression. See test_prefilter.c for an example of use.
Postfilters: allows to apply user-defined C callbacks after the filter pipeline during decompression. The combination of prefilters and postfilters could be interesting for supporting e.g. encryption (via prefilters) and decryption (via postfilters). Also, a postfilter alone can used to produce on-the-flight computation based on existing data (or other metadata, like e.g. coordinates). See test_postfilter.c for an example of use.
SIMD support for ARM (NEON): this allows for faster operation on ARM architectures. Only shuffle is supported right now, but the idea is to implement bitshuffle for NEON too. Thanks to Lucian Marc.
SIMD support for PowerPC (ALTIVEC): this allows for faster operation on PowerPC architectures. Both shuffle and bitshuffle are supported; however, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer and ESRF for sponsoring the Blosc team in helping him in this task.
Dictionaries: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, it is only supported in the zstd codec, but would be nice to extend it to lz4 and blosclz at least.
Contiguous frames: allow to store super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported.
Sparse frames (on-disk): each chunk in a super-chunk is stored in a separate file, as well as the metadata. This is the counterpart of in-memory super-chunk, and allows for more efficient updates than in frames (i.e. avoiding 'holes' in monolithic files).
Partial chunk reads: there is support for reading just part of chunks, so avoiding to read the whole thing and then discard the unnecessary data.
Parallel chunk reads: when several blocks of a chunk are to be read, this is done in parallel by the decompressing machinery. That means that every thread is responsible to read, post-filter and decompress a block by itself, leading to an efficient overlap of I/O and CPU usage that optimizes reads to a maximum.
Meta-layers: optionally, the user can add meta-data for different uses and in different layers. For example, one may think on providing a meta-layer for NumPy so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter for adding more high-level info if desired (e.g. geo-spatial, meteorological...).
Variable length meta-layers: the user may want to add variable-length meta information that can be potentially very large (up to 2 GB). The regular meta-layer described above is very quick to read, but meant to store fixed-length and relatively small meta information. Variable length metalayers are stored in the trailer of a frame, whereas regular meta-layers are in the header.
Efficient support for special values: large sequences of repeated values can be represented with an efficient, simple and fast run-length representation, without the need to use regular codecs. With that, chunks or super-chunks with values that are the same (zeros, NaNs or any value in general) can be built in constant time, regardless of the size. This can be useful in situations where a lot of zeros (or NaNs) need to be stored (e.g. sparse matrices).
Nice markup for documentation: we are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C-API. See https://c-blosc2.readthedocs.io. Thanks to Alberto Sabater and Aleix Alcacer for contributing the support for this.
Plugin capabilities for filters and codecs: we have a plugin register capability inplace so that the info about the new filters and codecs can be persisted and transmitted to different machines. Thanks to the NumFOCUS foundation for providing a grant for doing this.
Pluggable tuning capabilities: this will allow users with different needs to define an interface so as to better tune different parameters like the codec, the compression level, the filters to use, the blocksize or the shuffle size. Thanks to ironArray for sponsoring us in doing this.
Support for I/O plugins: so that users can extend the I/O capabilities beyond the current filesystem support. Things like use databases or S3 interfaces should be possible by implementing these interfaces. Thanks to ironArray for sponsoring us in doing this.
Python wrapper: we have a preliminary wrapper in the works. You can have a look at our ongoing efforts in the python-blosc2 repo. Thanks to the Python Software Foundation for providing a grant for doing this.
Security: we are actively using using the OSS-Fuzz and ClusterFuzz for uncovering programming errors in C-Blosc2. Thanks to Google for sponsoring us in doing this.

As you see, the list is long and hopefully you will find compelling enough features for your own needs. Blosc2 is not only about speed, but also about providing

Tasks to be done

Even if the list of features above is long, we still have things to do in Blosc2; and the plan is to continue the development, although always respecting the existing API and format. Here are some of the things in our TODO list:

Centralized plugin repository: we have got a grant from NumFOCUS for implementing a centralized repository so that people can send their plugins (using the existing machinery) to the Blosc2 team. If the plugins fulfill a series of requirements, they will be officially accepted, and distributed withing the library.
Improve the safety of the library: although this is always a work in progress, we did a long way in improving our safety, mainly thanks to the efforts of Nathan Moinvaziri.
Support for lossy compression codecs: although we already support the trunc_prec filter, this is only valid for floating point data; we should come with lossy codecs that are meant for any data type.
Checksums: the frame can benefit from having a checksum per every chunk/index/metalayer. This will provide more safety towards frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted. Candidates for checksums can be the xxhash32 or xxhash64, depending on the goals (to be decided).
Documentation: utterly important for attracting new users and making the life easier for existing ones. Important points to have in mind here:
- Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?.
- Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most extended and useful markup systems is Markdown, so this should also be the first candidate to use here.
Lock support for super-chunks: when different processes are accessing concurrently to super-chunks, make them to sync properly by using locks, either on-disk (frame-backed super-chunks), or in-memory. Such a lock support would be configured in build time, so it could be disabled with a cmake flag.

It would be nice that, in case some of this feature (or a new one) sounds useful for you, you can help us in providing either code or sponsorship.

Summary

Since 2015, it has been a long time to get C-Blosc2 so much featured and tested. But hopefully the journey will continue because as Kavafis said:

As you set out for Ithaka
hope your road is a long one,
full of adventure, full of discovery.

Let me thank again all the people and sponsors that we have had during the life of the Blosc project; without them we would not be where we are now. We do hope that C-Blosc2 will have a long life and we as a team will put our soul in making that trip to last as long as possible.

Now is your turn. We expect you to start testing the library as much as possible and report back. With your help we can get C-Blosc2 in production stage hopefully very soon. Thanks in advance!

Mid 2020 Progress Report

Francesc Alted — Thu, 27 Aug 2020 12:32:20 GMT

2020 has been a year where the Blosc projects have received important donations, totalling an amount of $55,000 USD so far. In the present report we list the most important tasks that have been carried out during the period that goes from January 2020 to August 2020. Most of these tasks are related to the most fast-paced projects under development: C-Blosc2 and Caterva (including its cat4py wrapper). Having said that, the Blosc development team has been active in other projects too (C-Blosc, python-blosc), although mainly for maintenance purposes.

Besides, we also list the roadmap for the C-Blosc2, Caterva and cat4py projects that we plan to tackle during the next few months.

C-Blosc2

C-Blosc2 adds new data containers, called superchunks, that are essentially a set of compressed chunks in memory that can be accessed randomly and enlarged during its lifetime. Also, a new frame serialization layer has been added, so that superchunks can be persisted on disk, while keeping the same properties of superchunks in memory. Finally, a metalayer capability allow for higher level containers to be created on top of superchunks/frames.

Highligths

Maskout functionality. This allows for selectively choose the blocks of a chunk that are going to be decompressed. This paves the road for faster multidimensional slicing in Caterva (see below in the Caterva section).
Prefilters introduced and declared stable. Prefilters allow for the user to pass C functions for performing arbitrary computations on a chunk prior to the filter/codec pipeline. In addition, the C function can even have access to more chunks than just the one that is being compressed. This opens the door to a way to operate with different super-chunks and produce a new one very efficiently. See https://github.com/Blosc/c-blosc2/blob/master/tests/test_prefilter.c for some examples of use.
Support for PowerPC/Altivec. We added support for PowerPC SIMD (Altivec/VSX) instructions for faster operation of shuffle and bitshuffle filters. For details, see https://github.com/Blosc/c-blosc2/pull/98.
Improvements in compression ratio for LZ4/BloscLZ. New processors are continually increasing the amount of memory in their caches. In recent C-Blosc and C-Blosc2 releases we increased the size of the internal blocks so that LZ4/BloscLZ codecs have better opportunities for finding duplicates and hence, increasing their compression ratios. But due to the increased cache sizes, performance has kept close to the original, fast speeds. For some benchmarks, see https://blosc.org/posts/beast-release/.
New entropy probing method for BloscLZ. BloscLZ is a native codec for Blosc whose mission is to be able to compress synthetic data efficiently. Synthetic data can appear in multiple situations and having a codec that is meant to compress/decompress that with high compression ratios in a fast manner is important. The new entropy probing method included in recent BloscLZ 2.3 (introduced in both C-Blosc and C-Blosc2) allows for even better compression ratios for highly compressible data, while giving up early when blocks are going to be difficult to compress at all. For details see: https://blosc.org/posts/beast-release/ too.

Roadmap for C-Blosc2

During the next few months, we plan to tackle the next tasks:

Postfilters. The same way that prefilters allows to do user-defined computations prior to the compression pipeline, the postfilter would allow to do the same after the decompression pipeline. This could be useful in e.g. creating superchunks out of functions taking simple data as input (for example, a [min, max] range of values).
Finalize the frame implementation. Although the frame specification is almost complete (bar small modifications/additions), we still miss some features that are included in the specification, but not implemented yet. An example of this is the fingerprint support at the end of the frames.
Chunk insertion. Right now only chunk appends are supported. It should be possible to support chunk insertion in any position, and not only at the end of a superchunk.
Security. Although we already started actions to improve the safety of the package using tools like OSS-Fuzz, this is an always work in progress task, and we plan indeed continuing improving it in the future.
Wheels. We would like to deliver wheels on every release soon.

Caterva/cat4py

Caterva is a multidimensional container on top of C-Blosc2 containers. It uses the metalayer capabilities present in superchunks/frames in order to store the multidimensionality information necessary to define arrays up to 8 dimensions and up to 2^63 elements. Besides being able to create such arrays, Caterva provides functionality to get (multidimensional) slices of the arrays easyly and efficiently. cat4py is the Python wrapper for Caterva.

Highligths

Multidimensional blocks. Chunks inside superchunk containers are endowed with a multidimensional structure so as to enable efficient slicing. However, in many cases there is a tension between defining large chunks so as to reduce the amount of indexing to find chunks or smaller ones in order to avoid reading data that falls outside of a slice. In order to reduce such a tension, we endowed the blocks inside chunks with a multidimensional structure too, so that the user has two parameters (chunkshape and blockshape) to play with in order to optimize I/O for their use case. For an example of the kind of performance enhancements you can expect, see https://htmlpreview.github.io/?https://github.com/Blosc/cat4py/blob/269270695d7f6e27e6796541709e98e2f67434fd/notebooks/slicing-performance.html.
API refactoring. Caterva is a relatively young project, and its API grew up organically and hence, in a quite disorganized manner. We recognized that and proceeded with a big API refactoring, trying to put more sense in the naming schema of the functions, as well as in providing a minimal set of C structs that allows for a simpler and better API.
Improved documentation. A nice API is useless if it is not well documented, so we decided to put a significant amount of effort in creating high-quality documentation and examples so that the user can quickly figure out how to create and access Caterva containers with their own data. Although this is still a work in progress, we are pretty happy with how docs are shaping up. See https://caterva.readthedocs.io/ and https://cat4py.readthedocs.io/.
Better Python integration (cat4py). Python, specially thanks to the NumPy project, is a major player in handling multidimensional datasets, so have greatly bettered the integration of cat4py, our Python wrapper for Caterva, with NumPy. In particular, we implemented support for the NumPy array protocol in cat4py containers, as well as an improved NumPy-esque API in cat4py package.

Roadmap for Caterva / cat4py

During the next months, we plan to tackle the next tasks:

Append chunks in any order. This will make it easier for the user to create arrays, since they will not be forced to use a row-wise order.
Update array elements. With this, users will be able to update their arrays without having to make a copy.
Resize array dimensions. This feature will allow Caterva to increase or decrease in size any dimension of the arrays.
Wheels. Once Caterva/cat4py would be in beta stage, we plan to deliver wheels on every release.

Final thoughts

We are very grateful to our sponsors in 2020; they allowed us to implement what we think would be nice features for the whole Blosc ecosystem. However, and although we did a lot of progress towards making C-Blosc2 and Caterva as featured and stable as possible, we still need to finalize our efforts so as to see both projects stable enough to allow them to be used in production. Our expectation is to release a 2.0.0 (final) release for C-Blosc2 by the end of the year, whereas Caterva (and cat4py) should be declared stable during 2021.

Also, we are happy to have enrolled new members on Blosc crew: Óscar Griñón, who proved to be instrumental in implementing the multidimensional blocks in Caterva and Nathan Moinvaziri, who is making great strides in making C-Blosc and C-Blosc2 more secure. Thanks guys!

Hopefully 2021 will also be a good year for seeing the Blosc ecosystem to evolve. If you are interested on what we are building and want to help, we are open to any kind of contribution, including donations. Thank you for your interest!

C-Blosc Beast Release

Francesc Alted — Sat, 25 Jul 2020 14:32:20 GMT

TL;DR; The improvements in new CPUs allow for more cores and (much) larger caches. Latest C-Blosc release leverages these facts so as to allow better compression ratios, while keeping the speed on par with previous releases.

During the past two months we have been working hard at increasing the efficiency of Blosc for the new processors that are coming with more cores than ever before (8 can be considered quite normal, even for laptops, and 16 is not that unusual for rigs). Furthermore, their caches are increasing beyond limits that we thought unthinkable just a few years ago (for example, AMD is putting 64 MB in L3 for their mid-range Ryzen2 39x0 processors). This is mainly a consequence of the recent introduction of the 7nm process for both ARM and AMD64 architectures. It turns out that compression ratios are quite dependent on the sizes of the streams to compress, so having access to more cores and significantly larger caches, it was clear that Blosc was in a pressing need to catch-up and fine-tune its performance for such a new 'beasts'.

So, the version released today (C-Blosc 1.20.0) has been carefully fine-tuned to take the most of recent CPUs, specially for fast codecs, where even if speed is more important than compression ratio, the latter is still a very important parameter. With that, we decided to increase the amount of every compressed stream in a block from 64 KB to 256 KB (most of CPUs nowadays have this amount of private L2 cache or even larger). Also, it is important to allow a minimum of shared L3 cache to every thread so that they do not have to compete for resources, so a new restriction has been added so that no thread has to deal with streams larger than 1 MB (both old and modern CPUs seem to guarantee that they provide at least this amount of L3 per thread).

Below you will find the net effects of this new fine-tuning of fast codecs like LZ4 and BloscLZ on our AMD 3900X box (12 physical cores, 64 MB L3). Here we will be comparing results from C-Blosc 1.18.1 and C-Blosc 1.20.0 (we will skip the comparison against 1.19.x because this can be considered an intermediate release in our pursuit). Spoiler: you will be seeing an important boost of compression ratios, while the high speed of LZ4 and BloscLZ codecs is largely kept.

On the plots below, on the left is the performance of 1.18.1 release, whereas on the right is the performance of the new 1.20.0 release.

Effects in LZ4

Let's start by looking at how the new fine tuning affected compression performance:

Look at how much compression ratio has improved. This is mainly a consequence of using compression streams of up to 256 KB, instead of the previous 64 KB --incidentally, this is just for this synthetic data, but it is clear that real data is going to be benefited as well; besides, synthetic data is something that frequently appears in data science (e.g. a uniformly spaced array of values). One can also see that compression speed has not dropped in general which is great considering that we allow for much better compression ratios now.

Regarding decompression we can see a similar pattern:

So the decompression speed is generally the same, even for data that can be compressed with high compression ratios.

Effects in BloscLZ

Now it is the turn for BloscLZ. Similarly to LZ4, this codec is also meant for speed, but another reason for its existence is that it usually provides better compression ratios than LZ4 when using synthetic data. In that sense, BloscLZ complements well LZ4 because the latter can be used for real data, whereas BloscLZ is usually a better bet for highly repetitive synthetic data. In new C-Blosc we have introduced BloscLZ 2.3.0 which brings a brand new entropy detector which will disable compression early when entropy is high, allowing to selectively put CPU cycles where there are more low-hanging data compression opportunities.

Here it is how performance changes for compression:

In this case, the compression ratio has improved a lot too, and even if compression speed suffers a bit for small compression levels, it is still on par to the original speed for higher compression levels (compressing at more than 30 GB/s while reaching large compression ratios is a big achievement indeed).

Regarding decompression we have this:

As usual for the new release, the decompression speed is generally the same, and performance can still exceed 80 GB/s for the whole range of compression levels. Also noticeable is that fact that single-thread speed is pretty competitive with a regular memcpy(). Again, Ryzen2 architecture is showing its muscle here.

Final Thoughts

Due to technological reasons, CPUs are evolving towards having more cores and larger caches. Hence, compressors and specially Blosc, has to adapt to the new status quo. With the new parametrization and new algorithms (early entropy detector) introduced today, we can achieve much better results. In new Blosc you can expect a good bump in compression ratios with fast codecs (LZ4, BloscLZ) while keeping speed as good as always.

Appendix: Hardware and Software Used

For reference, here it is the software that has been used for this blog entry:

Hardware: AMD Ryzen2 3900X, 12 physical cores, 64 MB L3, 32 GB RAM.
OS: Ubuntu 20.04
Compiler: Clang 10.0.0
C-Blosc: 1.18.1 (2020-03-29) and 1.20.0 (2020-07-25)

** Enjoy Data!**

Blosc Received a $50,000 USD donation

Francesc Alted — Thu, 20 Feb 2020 01:32:20 GMT

I am happy to announce that the Blosc project recently received a donation of $50,000 USD from Huawei via NumFOCUS. Now that we have such an important amount available, our plan is to use it in order to continue making Blosc and its ecosystem more useful for the community. In order to do so, it is important to stress out that our priorities are going to be on the fundamentals of the stack: getting C-Blosc2 out of beta and pushing for making Caterva (the multi-dimensional container on top of C-Blosc2) actually usable.

Critical Tasks: Pushing C-Blosc2 and Caterva

C-Blosc2 has been kind of a laboratory that we used for testing new ideas, like new 64-bit containers, new filters, a new serialization system, the concept of pre-filters and others, for the past 5 years. Although the fork from C-Blosc happened such a long time ago, we tried hard to keep the API backwards compatible so that C-Blosc2 can be used as a drop-in replacement of C-Blosc1 --but beware, the C-Blosc2 format will not be forward-compatible with C-Blosc1, but will be backward-compatible, that is, it will be able to read C-Blosc1 compressed chunks.

On its hand, Caterva is our attempt to build a multidimensional container that is tightly built on top of C-Blosc2, so leveraging its unique features. Caterva is a C99 library (the same than C-Blosc2) that will allow an easy adoption by many different libraries that are about matrix manipulation. The fact that it supports on-the-flight compression and persistency will open new possibilities in that the size of matrices will not be limited to the available memory anymore: data may span through available memory or disk in compressed state.

Provided how fundamental C-Blosc2 and Caterva packages are meant to be, we think that the usefulness of the Blosc project as a whole will be largely benefited from putting most of our efforts here for the next months/years. For this, we already established a series of priorities for working in these projects, as specified in the roadmaps below

Roadmap for C-Blosc2

C-Blosc2 is already in beta stage, and in the next few months we should see it in production stage. Here are some of the more important the things that we want to tackle in order to make this happen:

Plugin capabilities for allowing users to add more filters and codecs. There should also be a plugin register capability so that the info about the new filters and codecs can be persistent and propagated to different machines.
Checksums: the frame can benefit from having a checksum per every chunk/index/metalayer. This will provide more safety towards frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted.
Documentation: utterly important for attracting new users and making the life easier for existing ones. Important points to have in mind here:
- Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?
- Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most extended and useful markup systems is MarkDown, so this should also be the first candidate to use here.
Wrappers for other languages: Python and Java are the most obvious candidates, but others like R or Julia would be nice to have. Still not sure if these should be produced and maintained by the Blosc development team, or leave them for third-party players that would be interested.

For a more detailed discussion see: https://github.com/Blosc/c-blosc2/blob/master/ROADMAP.md

Roadmap for Caterva

Caterva is a much more young project and as such, one may say that it is still in alpha stage, although the basic functionality like creating multidimensional containers, getting items or multidimensional slices or accessing persistent data without a previous load is already there. However, we still miss important things like:

A complete refactorization of the Caterva C code to facilitate its usability.
Adapt the Python interface to the refactorization done in C code.
Add examples into the Python wrapper documentation and create some jupyter notebooks.
Build wheels to make the Python wrapper easier for the user.
Implements a new level of multidimensionality in Caterva. After that, we will support three layers of multidimensionality in a Caterva container: the shape, the chunk shape and the block shape.

For a more detailed discussion see: https://github.com/Blosc/Caterva/blob/master/ROADMAP.md

How we are spending resources

Money is important, but not everything: you need people to work on a project. We are slowly starting to put consistent human resources in the Blosc project. To start with, I (Francesc Alted) and Aleix Alcacer will be putting 25% of our time in the project for the next months, and hopefully others will join too. We will also be using funds to invest in our main tool, that is laptops and desktop computers, but also some furniture like proper seats and tables; the office space is important for creating a happy team. Finally, our plan is to use a part of the donation in facilitating meeting among the Blosc development team.

Your input is important for us

Although during the next year or so, we plan to organize some meetings of the board of directors and the Blosc development team, we think that our ideas cannot grow isolated from the community of users. So in case you want to convey ideas or better, contribute with implementation of ideas, we will be happy to hear and discuss. You can get in touch with us via the Blosc mailing list (https://groups.google.com/forum/#!forum/blosc), and the @Blosc2 twitter account. We are thinking that having other tools like Discourse may help in driving discussions more to the point, but so far we have little experience with it; if you have other suggestions please tell us.

All in all, the Blosc development team is very excited about this new development, and we are putting all our enthusiasm for delivering a new set of tools that we sincerely hope will of of help for the data community out there.

Finally, let me thank our main sponsor for their generous donation, NumFOCUS for accepting our project inside its umbrella, and to all the users and contributors that made Blosc and its ecosystem to help people through the past years (a bit more than 10 since the first C-Blosc 1.0 release).

Enjoy Data!

Blosc2-Meets-Rome

Francesc Alted — Mon, 25 Nov 2019 18:32:20 GMT

On August 7, 2019, AMD released a new generation of its series of EPYC processors, the EPYC 7002, also known as Rome, which are based on the new Zen 2 micro-architecture. Zen 2 is a significant departure from the physical design paradigm of AMD's previous Zen architectures, mainly in that the I/O components of the CPU are laid out on a separate die, different from computing dies; this is quite different from Naples (aka EPYC 7001), its antecessor in the EPYC series:

Such a separation of dies for I/O and computing has quite large consequences in terms of scalability when accessing memory, which is critical for Blosc operation, and here we want to check how Blosc and AMD Rome couple behaves. As there is no replacement for experimentation, we are going to use the same benchmark that was introduced in our previous Breaking Down Memory Walls. This essentially boils down to compute an aggregation with a simple loop like:

#pragma omp parallel for reduction (+:sum)
for (i = 0; i < N; i++) {
  sum += udata[i];
}

As described in the original blog post, the different udata arrays are just chunks of the original dataset that are decompressed just in time for performing the partial aggregation operation; the final result is indeed the sum of all the partial aggregations. Also we have seen that the time to execute the aggregation is going to depend quite a lot on the kind of data that is decompressed: carefully chosen synthetic data can be decompressed much more quickly than real data. But syntehtic data is nevertheless interesting as it allows for a roof analysis of where the performance can grow up to.

In this blog, we are going to see how the AMD EPYC 7402 (Rome), a 24-core processor performs on both synthetic and real data.

Aggregating the Synthetic Dataset on AMD EPYC 7402 24-Core

The synthetic data chosen for this benchmark allows to be compressed/decompressed very easily with applying the shuffle filter before the actual compression codec. Interestingly, and as good example of how filters can benefit the compression process, if we would not apply the shuffle filter first, synthetic data was going to take much more time to compress/decompress (test it by yourself if you don't believe this).

After some experiments, and as usual for synthetic datasets, the codec inside Blosc2 that has shown the best speed while keeping a decent compression ratio (54.6x), has been BloscLZ with compression level 3. Here are the results:

As we can see, the uncompressed dataset scales pretty well until 8 threads, where it hits the memory wall for this machine (around 74 GB/s). On its hand, even if data compressed with Blosc2 (in combination with BloscLZ codec) shows less performance initially, it scales quite smoothly up to 12 threads, where it reaches a higher performance than its uncompressed counterpart (and reaching the 90 GB/s mark).

After that, the compressed dataset can perform aggregations at speeds that are typically faster than uncompressed ones, reaching important peaks at some magical number of threads (up to 210 GB/s at 48 threads). Why these peaks exist at all is probably related with the architecture of the AMD Rome processor, but provided that we are using a 24-core CPU there is little wonder that numbers like 12, 24 (28 is an exception here) and 48 are reaching the highest figures.

Aggregating the Precipitation Dataset on AMD EPYC 7402 24-Core

Now it is time to check the performance of the aggregation with the 100 million values dataset coming from a precipitation dataset from Central Europe. Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area. This time, the best codec inside Blosc2 was determined to be LZ4 with compression level 9:

As expected, the uncompressed aggregation scales pretty much the same than for the synthetic dataset (in the end, the Arithmetic and Logical Unit in the CPU is completely agnostic on what kind of data it operates with). But on its hand, the compressed dataset scales more slowly, but more steadily towards hitting a maximum at 48 threads, where it reaches almost the same speed than the uncompressed dataset, which is quite a feat, provided the high memory bandwidth of this machine (~74 GB/s).

Also, as Blosc2 recently gained support for the accelerated LZ4 codec inside Intel IPP, figures for it have been added to the plot above. There one can see that Intel's accelerated LZ4 can get an up to 10% boost in speed compared with regular LZ4; this additional 10% actually allows Blosc2/LZ4 to be clearly faster than the uncompressed dataset at 48 threads.

Final Thoughts

AMD EPYC Rome represents a significant leap forward in adding a high number of cores to CPUs in a way that scales really well, allowing to put more computational resources to our problems at hand. Here we have shown how nicely a 24-core AMD Rome CPU performs when performing tasks with in-memory compressed datasets; first, by allowing competitive speed when using compression with real data and second, allowing speeds of more than 200 GB/s (with synthetic datasets).

Finally, the 24-core CPU that we have exercised here is just for whetting your appetite, as CPUs of 32 or even 64 cores are going to happen more and more often in the next future. Although I should have better said in present times, as AMD announced today the availability of 32-core CPUs for the workstation market, with 64-core ones coming next year. Definitely, compression is going to play an increasingly important role in getting the most out of these beasts.

Appendix: Software used

For reference, here it is the software that has been used for this blog entry:

OS: Ubuntu 19.10
Compiler: Clang 8.0.0
C-Blosc2: 2.0.0b5.dev (2019-09-13)

Acknowledgments

Thanks to packet.com for kindly providing the hardware for the purposes of this benchmark. Packet guys have been really collaborative through the time in allowing me testing new, bare-metal hardware, and I must say that I am quite impressed on how easy is to start using their services with almost no effort on user's side.

C-Blosc2 Enters Beta Stage

Francesc Alted — Tue, 13 Aug 2019 01:32:20 GMT

The first beta version of C-Blosc2 has been released today. C-Blosc2 is the new iteration of C-Blosc 1.x series, adding more features and better documentation and is the outcome of more than 4 years of slow, but steady development. This blog entry describes the main features that you may see in next generation of C-Blosc, as well as an overview of what is in our roadmap.

Note 1: C-Blosc2 is currently in beta stage, so not ready to be used in production yet. Having said this, being in beta means that the API has been declared frozen, so there is guarantee that your programs will continue to work with future versions of the library. If you want to collaborate in this development, you are welcome: have a look at our roadmap below and contribute PR's or just go to the open issues and help us with them.

Note 2: the term C-Blosc1 will be used instead of the official C-Blosc name for referring to the 1.x series of the library. This is to make the distinction between the C-Blosc 2.x series and C-Blosc 1.x series more explicit.

Main features in C-Blosc2

New 64-bit containers

The main container in C-Blosc2 is the super-chunk or, for brevity, schunk, that is made by smaller containers which are essentially C-Blosc1 32-bit containers. The super-chunk can be backed (or not) by another container which is called a frame. If a schunk is not backed by a frame (the default), the different chunks will be stored sparsely in-memory.

The frame object allows to store super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported and that data can be accessed using the same API independently of where it is stored, memory or disk.

Finally, the user can add meta-data to frames for different uses and in different layers. For example, one may think on providing a meta-layer for NumPy so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter can add more high-level info (e.g. geo-spatial, meteorological...), if desired.

When taken together, these features represent a pretty powerful way to store and retrieve compressed data that goes well beyond of the previous contiguous compressed buffer, 32-bit limited, of C-Blosc1.

New filters and filters pipeline

Besides shuffle and bitshuffle already present in C-Blosc1, C-Blosc2 already implements:

delta: the stored blocks inside a chunk are diff'ed with respect to first block in the chunk. The basic idea here is that, in some situations, the diff will have more zeros than the original data, leading to better compression.
trunc_prec: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the shuffle or bitshuffle filter, this leads to more contiguous zeros, which are compressed better and faster.

Also, a new filter pipeline has been implemented. With it, the different filters can be pipelined so that the output of one filter can be the input for the next; this happens at the block level, so minimizing the size of temporary buffers, and hence, accelerating the process. Possible examples of pipelines are a delta filter followed by shuffle, or a trunc_prec followed by bitshuffle. Up to 6 filters can be pipelined, so there is plenty of space for upcoming new filters to collaborate among them.

More SIMD support for ARM and PowerPC

New SIMD support for ARM (NEON), allowing for faster operation on ARM architectures. Only shuffle is supported right now, but the idea is to implement bitshuffle for NEON too.

Also, SIMD support for PowerPC (ALTIVEC) is here, and both shuffle and bitshuffle are supported. However, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer.

New codecs

There is a new Lizard codec, which is an efficient compressor with very fast decompression. It achieves compression ratio that is comparable to zip/zlib and zstd/brotli (at low and medium compression levels) that is able to attain decompression speeds of 1 GB/s or more.

New dictionary support for better compression ratio

Dictionaries allow for better discovery of data duplicates among different blocks: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, this is only supported in the zstd codec, but would be nice to extend it to lz4 and blosclz at least.

Much improved documentation mark-up

We are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C API for C-Blosc2. This is a huge step further compared with the documentation of C-Blosc1, where the developer needed to go the blosc.h header for reading the docstrings there. Thanks to Alberto Sabater for contributing the support for this.

Support for Intel IPP (Integrated Performance Primitives)

Intel is producing a series of optimizations in their IPP library and among them, and accelerated version of the LZ4 codec. Due to its excellent compression capabilities and speed, LZ4 is probably the most used codec in Blosc, so enabling even a bit more of optimization on LZ4 is always a good news. And judging by the plots below, the Intel guys seem to have done an excellent job:

In the plots above we see a couple of things: 1) the IPP/LZ4 functions can compress more than regular LZ4, and 2) they are quite a bit faster than regular LZ4. As always, take these plots with a grain of salt, as actual datasets will see more similar compression ratios and speed (but still, the difference can be significant). Of course, IPP/LZ4 should generate LZ4 chunks that are completely compatible with the original LZ4 library (but in case you detect any incompatibility, please shout!).

C-Blosc2 beta.1 comes with support for LZ4/IPP out-of-the-box, that is, if IPP is detected in the system, its optimized LZ4 functions are automatically linked and used with the Blosc2 library. If, for portability or other reasons, you don't want to create a Blosc2 library that is linked with Intel IPP, you can disable support for it passing the -DDEACTIVATE_IPP=ON to cmake. In the future, we surely may give support for other optimized codecs in IPP too (Zstd would be an excellent candidate).

Roadmap

Of course, C-Blosc2 is not done yet, and there are many interesting enhancements that we would like to tackle sooner or later. Here it is a more or less comprehensive list of our roadmap:

Lock support for super-chunks: when different processes are accessing concurrently to super-chunks, make them to sync properly by using locks, either on-disk (frame-backed super-chunks), or in-memory.
Checksums: the frame can benefit from having a checksum per every chunk/index/metalayer. This will provide more safety towards frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted. Candidates for checksums can be the xxhash32 or xxhash64, depending on the gaols (to be decided).
Documentation: utterly important for attracting new users and making the life easier for existing ones. Important points to have in mind here:
- Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?
- Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most extended and useful markup systems is MarkDown, so this should also be the first candidate to use here.
Wrappers for other languages: Python and Java are the most obvious candidates, but others like R or Julia would be nice to have. Still not sure if these should be produced and maintained by the Blosc development team, or leave them for third-party players that would be interested.
It would be nice to use LGTM, a CI-friendly analyzer for security.
Add support for buildkite as another CI would be handy because it allows to use on-premise machines, potentially speeding-up the time to do the builds, but also to setup pipelines with more complex dependencies and analyzers.

The implementation of these features will require the help of people, either by contributing code (see our developing guidelines) or, as it turns out that Blosc is a project sponsored by NumFOCUS, you may want to make a donation to the project. If you plan to contribute in any way, thanks so much in the name of the community!

Addendum: Special thanks to developers

C-Blosc2 is the outcome of the work of many developers that worked not only on C-Blosc2 itself, but also on C-Blosc1, from which C-Blosc2 inherits a lot of features. I am very grateful to Jack Pappas, who contributed important portability enhancements, specially runtime and cross-platform detection of SSE2/AVX2 (with the help of Julian Taylor) as well as high precision timers (HPET) which are essential for benchmarking purposes. Lucian Marc also contributed the support for ARM/NEON for the shuffle filter. Jerome Kieffer contributed support for PowerPC/ALTIVEC. Alberto Sabater, for his great efforts on producing really nice Blosc2 docs, among other aspects. And last but not least, to Valentin Haenel for general support, bug fixes and other enhancements through the years.

** Enjoy Data!**