Blosc Main Blog Page (Posts about blosclz)

Fine Tuning the BloscLZ codec

Francesc Alted — Fri, 14 Jul 2017 06:32:20 GMT

Yesterday I was reading about the exciting new CPU architectures that both AMD and Intel are introducing, and I was wondering how the improved architecture of the new cores, and most especially their caches, could benefit Blosc. It turns out that I have access to a server with a relatively modern CPU (Xeon E3-1245 v5 @ 3.50GHz, with 4 physical cores), so I decided to have a go at fine-tuning the included BloscLZ codec (the one that I know best) inside C-Blosc2. Of course, I had already spent some time tuning BloscLZ, but that was some years ago, and given the fast pace at which CPUs are evolving I thought that this was excellent timing for another round of fine-tuning, most especially in preparation for users adopting the forthcoming Ryzen, Threadripper, EPYC and Skylake-SP architectures.

Frankly speaking, I was expecting to get very little improvement on this front, but the results have been unexpectedly good. Keep reading.

Where we come from

Just for reference, here is the performance of the BloscLZ codec on my server before the new tuning work:

That is the typical synthetic benchmark in Blosc, except that the plotting function in the C-Blosc2 project shows the actual size of each compressed buffer (and not the size of the whole dataset, as in C-Blosc1). In this case, the dataset (256 MB) is split into chunks of 4 MB, and given that our CPU has an LLC (Last Level Cache) of 8 MB, this is close to an optimal size for achieving maximum performance (the buffers meant for Blosc usually do not exceed 4 MB in most common uses).
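The chunking arithmetic behind this benchmark can be written down as a tiny sketch. The sizes come from the text above; the idea that a source chunk plus an equally sized destination buffer should fit in the LLC is only one plausible reading of why 4 MB works well, not something the benchmark itself states.

```python
MB = 1024 * 1024

dataset_size = 256 * MB  # whole synthetic dataset (from the post)
chunk_size = 4 * MB      # size of each buffer handed to the compressor (from the post)
llc_size = 8 * MB        # last-level cache of the CPU used here (from the post)

# Number of buffers that the benchmark compresses.
nchunks = dataset_size // chunk_size

# Assumption for illustration: the working set is one source chunk plus one
# destination buffer of the same size.
working_set = 2 * chunk_size
fits_in_llc = working_set <= llc_size

print(f"{nchunks} chunks, working set of {working_set // MB} MB, fits in LLC: {fits_in_llc}")
```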

As can be seen, performance is quite good, although compression ratios leave something to be desired. Furthermore, for the maximum compression level (9), the compression ratio regresses with respect to the previous level (8). This is not too bad, and can happen with any codec, but it would be nice to avoid it if possible.

The new BloscLZ after fine tuning

So, after a couple of hours playing with different parameters in BloscLZ and C-Blosc2, I started to realize that the new Intel CPU performed exceedingly well when asked to compress more, to the point that high compression settings were not that slow in comparison with low compression ones; quite the contrary: high compression settings were operating at almost the same speed as lower ones (which was a welcome surprise indeed). Hence I tried setting considerably more aggressive parameters in BloscLZ, while trying to keep the size of internal blocks in Blosc2 below 256 KB (the typical size of L2 caches in modern CPUs). This is the result:

So the compression ratios have increased quite a bit, especially for the higher compression levels (going from less than 10x to more than 20x for this benchmark). This is courtesy of the new, more aggressive compression parameters. Strikingly enough, performance has also increased in general, and especially for these high compression levels. I am not completely certain why this is the case, but probably this new CPU architecture is much better at out-of-order execution and at prefetching larger blocks of data, which helps it compress faster even with large buffers; similarly, I am pretty sure that improvements in compiler technology (I am using a recent GCC 6.3.0 here) are pretty important for getting faster binary code. We can also see that when using 4 threads (i.e. using all the physical cores available in the CPU at hand), BloscLZ can compress faster than a memcpy() call in most cases, and most especially at high compression levels, as mentioned before. Oh, and we can see that we also got rid of the regression in the compression ratio for compression level 9, which is cool.

Regarding decompression speed, we can see that the new tuning gave general speed-ups of between 10% and 20%, with no significant slowdowns in any case. All in all, quite good results indeed!

Room for more improvements? Enter PGO.

To temporarily end this quest for speed (optimization is a never-ending task), I am curious about the speed that we can buy by using the PGO (Profile Guided Optimization) capability that is present in most modern compilers. Here I am going to use the PGO of GCC in combination with our benchmark at hand so as to provide the profile for the compiler optimizer. Here are the results when PGO is applied to the new parametrization:
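For readers who have not used PGO with GCC, the workflow has three steps: build an instrumented binary, run it on a representative workload to collect a profile, then rebuild using that profile. The sketch below only assembles and prints the corresponding command lines. The flags `-fprofile-generate` and `-fprofile-use` are real GCC options, but the source file names, binary name and benchmark argument are hypothetical placeholders, and the actual C-Blosc2 build (which goes through CMake) is not reproduced here.

```python
import shlex


def pgo_commands(sources, binary, workload_args=()):
    """Return the three command lines of a basic GCC PGO cycle."""
    common = ["gcc", "-O3"]
    # Step 1: build an instrumented binary that records execution counts.
    instrumented = common + ["-fprofile-generate", *sources, "-o", binary]
    # Step 2: run a representative workload; this writes the profile data files.
    training_run = [f"./{binary}", *workload_args]
    # Step 3: rebuild, letting the optimizer use the recorded profile.
    optimized = common + ["-fprofile-use", *sources, "-o", binary]
    return [instrumented, training_run, optimized]


# Hypothetical file and benchmark names, for illustration only.
steps = pgo_commands(["bench.c", "blosclz.c"], "bench", ["blosclz"])
for step in steps:
    print(shlex.join(step))
```

The quality of the result depends on how representative the training run is, which is why the benchmark itself is used as the workload here.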

So, while the speed improvement for compression is not significant (albeit a bit better), the big improvement comes in decompression speed, where we see speeds almost reaching 50 GB/s and, perhaps more interestingly, more than 35 GB/s for the maximum compression level. For the first time in my life as a Blosc developer, I can see decompression with a single thread being faster than memcpy() for all the compression levels.

I wonder what the PGO technique can bring to other codecs in Blosc, but that is material for another blog post. At any rate, readers are encouraged to try PGO on their own setups. I am pretty sure that they will be pleased to see nice speed improvements.

Appendix: Hardware and software used

For reference, here is the configuration that I used for producing the plots in this blog entry.

CPU: Intel Xeon E3-1245 v5 @ 3.50GHz (4 physical cores with hyper-threading)
OS: Ubuntu 16.04
Compiler: GCC 6.3.0
C-Blosc2: 2.0.0a4.dev (2017-07-14)
BloscLZ: 1.0.6 (2017-07-14)

Seeking Sponsorship for Bcolz/Blosc

Valentin Haenel — Tue, 26 May 2015 08:41:20 GMT

Dear Everyone,

as you may or may not know, the Blosc compressor has become the basis for some novel, innovative technological experiments in the PyData space. The Bcolz and Bloscpack projects in particular, which provide a way to perform out-of-core computations on column-based datasets, have become interesting for the analysis of medium-sized time-series datasets.

In this post, we would like to convince you to give us some money to foster the project's development and accelerate the growth of our community. Historically, it has always been a difficult endeavour to monetize open-source development, so below is a non-exhaustive list of potential models that we are considering:

Direct sponsoring / Donations

This involves paying either a single lump sum or monthly installments to foster continued development and innovation. This type of sponsoring isn't bound to any specific goal or feature and would allow us, for example, to maintain and release the projects regularly.

Feature-driven sponsoring

Paying for specific features to be implemented, bugs to be fixed or paying to have a voice when it comes to prioritizing items in the issue-tracker(s).

Hiring us as freelancers for Blosc/Bcolz projects

This means that you hire one or both of us to implement a project that uses Bcolz inside your company. Any bugs we find or improvements that need to be made would flow back into the open source code-base.

Hiring us as part-time freelancers for general projects

This means you hire one or both of us as part-time freelancers for two to three days a week to work on general projects. These can be related to Python and data or open-source work on other projects. This would allow us to spend the remaining days on Blosc/Bcolz.

PhD positions

There are still a few interesting theoretical aspects to be unlocked, for example certain mathematical properties of the shuffle filter and a compressed extension of the external-memory-model (EMM) to analyse the runtime of Blosc style out-of-core algorithms and Bcolz operations in general.

We welcome any feedback regarding the above options and please do tell us about any additional models that may be interesting to us or for you.

With best wishes and looking forward to your input,

Francesc Alted and Valentin Haenel

Compress Me, Stupid!

Francesc Alted — Thu, 28 Aug 2014 17:01:20 GMT

How it all started

I think I began to become truly interested in compression when, back in 1992, I was installing C-News, a news server package meant to handle Usenet News articles at our university. For younger audiences, Usenet News was a very popular way to discuss all kinds of topics, but at the same time it was pretty difficult to cope with the huge number of articles, especially because spam practices started to appear around that time. As Gene Spafford put it in 1992:

"Usenet is like a herd of performing elephants with diarrhea. Massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it."

But one thing was clear: Usenet brought massive amounts of data that had to be transmitted through the typical low-bandwidth data lines of that time: 64 Kbps shared for everyone at our university.

My mission then was to bring in the Usenet News feed using as few resources as possible. Of course, one of the first things that I did was to start news transmission during the night, when everyone was warm in bed and nobody was going to complain about others stealing the precious and scarce Internet bandwidth. Another measure was to subscribe to just a selection of groups so that the transmission would end before the new day started. And of course, I started experimenting with compression to maximize the number of groups that we could bring to our community.

Compressing Usenet News

The most used compressor by 1992 was compress, a Unix program based on the LZW compression algorithm. But LZW had patent issues, so around that time Jean-Loup Gailly and Mark Adler started work on gzip. At the beginning of 1993 gzip 1.0 was ready for consumption, and I found it exciting not only because it was not patent-encumbered, but also because it compressed way better than the previous compress program, allowed different compression levels, and was pretty fast too (although compress still had an advantage here, IIRC).

So I talked with the university that was providing us with the News feed and we managed to start compressing it, first with compress and then with gzip. Shortly after that, while making measurements on the new gzip improvements, I discovered that the bottleneck was our News workstation (an HP 9000-730 with a speedy PA-7000 RISC microprocessor @ 66 MHz), which was unable to decompress the whole gzipped stream of subscribed news in time. The bottleneck had suddenly moved from the communication line to the CPU!

I remember spending long hours playing with different combinations of data chunk sizes and gzip compression levels, plotting the results (with the fine gnuplot) before finally coming up with a combination that struck a fair balance between available bandwidth and CPU speed, maximizing the number of news articles reaching our university. I think this was my first realization of how compression could help bring data faster to the system, making some processes more effective. In fact, that actually blew my mind and made me passionate about compression technologies for the years to come.
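That kind of exploration can be reproduced today in a few lines. The following is a minimal sketch, not the original 1992 setup: it uses Python's zlib module (the same DEFLATE algorithm that gzip uses), a made-up repetitive text as a stand-in for a news feed, and arbitrary chunk sizes and levels. For each combination it records the compression ratio and the time needed to decompress everything.

```python
import time
import zlib


def sweep(data, chunk_sizes, levels):
    """Measure ratio and decompression time per (chunk size, level) pair."""
    results = []
    for chunk_size in chunk_sizes:
        # Split the stream into independent chunks of at most chunk_size bytes.
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        for level in levels:
            compressed = [zlib.compress(chunk, level) for chunk in chunks]
            compressed_size = sum(len(c) for c in compressed)
            start = time.perf_counter()
            restored = b"".join(zlib.decompress(c) for c in compressed)
            elapsed = time.perf_counter() - start
            assert restored == data  # the round trip must be lossless
            results.append((chunk_size, level, len(data) / compressed_size, elapsed))
    return results


# Synthetic stand-in for a text feed (illustrative data only).
text = b"".join(
    b"Article %d: comp.compression is a busy newsgroup.\n" % i for i in range(20000)
)
results = sweep(text, chunk_sizes=[4096, 65536], levels=[1, 6, 9])
for chunk_size, level, ratio, elapsed in results:
    print(f"chunk={chunk_size:6d} level={level} ratio={ratio:5.1f}x decompress={elapsed * 1e3:.2f} ms")
```

Plotting ratio against decompression time for each combination gives the same kind of picture that was used to pick a balance between line bandwidth and CPU speed.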

LZO and the explosion of compression technology

During 1996, Markus F.X.J. Oberhumer started to announce the availability of his own set of LZO compressors. These consisted of many different compressors, all of them variations of his own compression algorithm (LZO), but tweaked to achieve either better compression ratios or better compression speed. The suite was claimed to be able to achieve speeds reaching 1/3 of the memory speed of the typical Pentium-class computers available at that time. An entire set of compressors able to approach memory speed? Boy, that was very exciting news for me.

LZO was in the back of my mind when I started my work on PyTables in August 2002 and shortly after, in May 2003, PyTables gained support for LZO. My goal was indeed to accelerate data transmission from disk to the CPU (and back), and these plots are testimony to how beneficial LZO was for achieving that goal. Again, compression was demonstrating that it could effectively increase the bandwidth of disks, and not only of slow internet lines.

However, although LZO was free of patent issues and fast as hell, it had a big limitation for a project like PyTables: the licensing. LZO was using the GPL license, and that prevented the inclusion of its sources in distributions without re-licensing PyTables itself as GPL, a thing that I was not willing to do (PyTables has a BSD license, as is usual in the NumPy ecosystem). Because of that, LZO was a nice compressor to include in GPL projects like the Linux kernel itself, but not a good fit for PyTables (although support for LZO still exists, as long as it is downloaded and installed separately).

Around that time (mid 2000's), a plethora of fast compressors started to appear, in the same spirit as LZO but with more permissive licenses (typically BSD/MIT), many of them being a nice fit for PyTables.

A new compressor for PyTables/HDF5

By 2008 it was clear that PyTables needed a compressor whose sources could be included in the PyTables tarball, thus minimizing the installation requirements. For this I started considering a series of libraries and immediately identified FastLZ as a nice candidate because of its simplicity and performance. Also, FastLZ had a permissive MIT license, which was what I was looking for.

But pure FastLZ was not completely satisfactory because it was not simple enough. It had 2 compression levels that complicated the implementation quite a bit, so I decided to keep just the highest level, and then optimize certain parts of it so that speed would be acceptable. These modifications gave birth to BloscLZ, which is still the default compressor in Blosc.

But I had more ideas on what other features the new Blosc compressor should have, namely, multi-threading and an integrated shuffle filter. Multi-threading made a lot of sense by 2008 because both Intel and AMD already had a wide range of multi-core processors by then, and it was clear that the race for throwing more and more cores into systems was going to intensify. A fast compressor had to be able to use all these cores dancing around, period.

Shuffle (see slide 71 of this presentation) was the other important component of the new compressor. This algorithm relies on neighboring elements of a dataset being highly correlated to improve data compression. A shuffle filter already came as part of the HDF5 library, but it was implemented in pure C and had a significant overhead in terms of computation, so I decided to write a SIMD version using the powerful SSE2 instructions present in all Intel and AMD processors since 2003. The result is that this new shuffle implementation adds almost zero overhead compared with the compression/decompression stages.
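To make the idea concrete, here is a minimal pure-Python sketch of a byte shuffle (nothing like the SSE2 version in speed, and not the code used in Blosc or HDF5). For elements of `typesize` bytes, it groups the first byte of every element together, then the second byte of every element, and so on. With correlated values, the high-order bytes form long runs of similar values that a codec can compress well. The function names and the test data are illustrative, and zlib stands in for the codec.

```python
import struct
import zlib


def shuffle(data: bytes, typesize: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, etc."""
    if len(data) % typesize:
        raise ValueError("data length must be a multiple of typesize")
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Inverse of shuffle(): put every byte back into its original element."""
    nelems = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * nelems:(i + 1) * nelems]
    return bytes(out)


# 100,000 slowly increasing 32-bit integers: neighbors are highly correlated.
values = list(range(0, 400000, 4))
raw = struct.pack("<%di" % len(values), *values)

shuffled = shuffle(raw, 4)
assert unshuffle(shuffled, 4) == raw  # the filter is lossless

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(f"without shuffle: {plain_size} bytes, with shuffle: {shuffled_size} bytes")
```

The filter only rearranges bytes, so it has to be undone with `unshuffle()` after decompression.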

Once all of these features were implemented, I designed a pretty comprehensive suite of tests and asked the PyTables community to help me test the new compressor on as many systems as possible. After some iterations, we were happy to see the new compressor working flawlessly, compressing and decompressing hundreds of terabytes on many different Windows and Unix boxes, both 32-bit and 64-bit. The new beast was ready to ship.

Blosc was born

I then grabbed BloscLZ, the multi-threading support and the SSE2-powered shuffle and put them all in the same package. That also became a standalone, pure C library, with no attachments to PyTables or HDF5, so any application could make use of it. I released the first stable version (1.0) of Blosc in July 2010. Before this, I had already introduced Blosc publicly in my EuroSciPy 2009 keynote and also made a small reference to it in an article about Starving CPUs where I stated:

"As the gap between CPU and memory speed continues to widen, I expect Blosc to improve memory-to-CPU data transmission rates over an increasing range of datasets."

And that is the thing. As CPUs get faster, compression can be used to advantage in more and more scenarios, to the point that improving the bandwidth of main memory (RAM) is becoming possible now. And surprisingly enough, the methodology for achieving that is the same as back in the C-News ages: strike a good balance between data block sizes and compression speed, and let compression make your applications handle data faster, and not only make it more compact.
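This balance can be captured with a very simple model, written here as a sketch under a strong assumption: transfer and decompression overlap perfectly, so the slower of the two stages sets the pace. The function name and all the numbers are illustrative, not measurements.

```python
def effective_bandwidth(link_bw, ratio, decompress_speed):
    """Uncompressed bytes per second that reach the consumer.

    link_bw: bytes/s that the slow medium (line, disk, memory bus) can carry
    ratio: compression ratio (uncompressed size / compressed size)
    decompress_speed: uncompressed bytes/s that the CPU can produce
    """
    # The medium delivers link_bw * ratio worth of uncompressed data per
    # second, but never more than the CPU is able to decompress.
    return min(link_bw * ratio, decompress_speed)


link = 8000  # a 64 Kbps line carries 8000 bytes/s

# Moderate compression: the line is still the bottleneck.
moderate = effective_bandwidth(link, ratio=2.0, decompress_speed=20000)
# Stronger but slower compression: the bottleneck moves to the CPU.
aggressive = effective_bandwidth(link, ratio=4.0, decompress_speed=12000)

print(moderate, aggressive)
```

In the second case the higher ratio is wasted because decompression cannot keep up, which is why the right choice depends on both the medium and the CPU.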

When seen in perspective, it has been a long quest over the last decades. During the 90's, compression was useful to improve the bandwidth of slow internet connections. In the 2000's, it made it possible to accelerate disk I/O operations. In the 2010's, Blosc's goal is making the memory subsystem faster, and whether it is able to achieve this or not will be the subject of future blog posts (hint: data arrangement is critical too). But one thing is clear: achieving this (by Blosc or any other compressor out there) is just a matter of time. Such is the fate of the ever-increasing gap in CPU versus memory speeds.