Breaking Down Memory Walls

Nowadays, CPUs struggle to fetch data fast enough to feed their cores. The reason is that memory speed is growing at a slower pace than the speed at which CPUs can crunch numbers. This gap between memory and CPU speed is generally known as the Memory Wall.

For example, let's suppose that we want to compute the aggregation of some large array; here is how to do that using OpenMP to leverage all cores in a CPU:

float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
  sum += udata[i];   /* every element has to travel from memory to the CPU */
}

With this, some server (an Intel Xeon E3-1245 v5 @ 3.50GHz, with 4 physical cores and hyperthreading) takes about 14 ms to aggregate an array of 100 million float32 values when using 8 OpenMP threads (the optimal number for this CPU). However, if instead of bringing the whole 100 million elements from memory to the CPU we generate the data inside the loop, we avoid the data transmission between memory and CPU, like in:

float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
  sum += (float)i;   /* data is generated in-core: no memory traffic */
}

This loop takes just 3.5 ms, that is, 4x less than the original one. That means that our CPU could compute the aggregation 4x faster than the speed at which the memory subsystem can feed data elements to the CPU; or, put in other words, the CPU is sitting idle 75% of the time, waiting for data to arrive (for this example; other cases can be even more extreme). Here we have the memory wall in action indeed.

That the memory wall exists is an excellent reason to think about ways to work around it. One of the most promising avenues is compression: what if we could store data compressed in-memory and use the spare clock cycles of the CPU to decompress it just when it is needed? In this blog entry we will see how to implement such a computational kernel on top of data structures that are cache- and compression-friendly, and we will examine how they perform on a range of modern CPU architectures. Some surprises are in store.

For demonstration purposes, I will run a simple task: summing up the same array of values as above, but using a compressed dataset instead. While computing sums of values seems trivial, it exposes a couple of properties that are important for our discussion:

  1. This is a memory-bound task.
  2. It is representative of many aggregation/reduction algorithms that are routinely used out in the wild.

Operating with Compressed Datasets

Now let's see how to run our aggregation efficiently when using compressed data. For this, we need:

  1. A data container that supports on-the-fly compression.
  2. A blocking algorithm that leverages the caches in CPUs.

As for the data container, we are going to use the super-chunk object that comes with the Blosc2 library. A super-chunk is a data structure that is meant to host many data chunks in a compressed form, and that has some interesting features; more specifically:

  • Compactness: everything in a super-chunk is designed to take as little space as possible, not only by using compression, but also by minimizing the amount of associated metadata (like indexes).
  • Small fragmentation: by splitting the data into large enough chunks that are contiguous, the resulting structure ends up stored in memory with a pretty small number of 'holes' in it, allowing more efficient memory management by both the hardware and the software.
  • Support for contexts: useful when we have different threads and we want to decompress data simultaneously. Assigning a context to each thread is enough to allow the simultaneous use of the different cores without them badly interfering with each other.
  • Easy access to chunks: an integer is assigned to each chunk, so requesting a specific chunk is just a matter of specifying its number; it then gets decompressed and returned in one shot. Pointer arithmetic is thus replaced by indexing operations, making the code less prone to severe errors (e.g. if a chunk does not exist, an error code is returned instead of causing a segmentation fault).

If you are curious about how the super-chunk can be created and used, just check the sources for the benchmark used for this blog.
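
For a rough idea of what this looks like, here is a minimal sketch that creates an in-memory super-chunk, appends a chunk and reads it back by index. The function names follow the current C-Blosc2 API; the alpha versions used for this post may differ slightly in the details:

#include <blosc2.h>

int main(void) {
  blosc2_init();
  // Compression/decompression parameters: float32 data, LZ4, 4 threads
  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = sizeof(float);
  cparams.compcode = BLOSC_LZ4;
  cparams.nthreads = 4;
  blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
  dparams.nthreads = 4;
  blosc2_storage storage = {.cparams = &cparams, .dparams = &dparams};

  // Create the super-chunk and append one 4,000-element chunk
  blosc2_schunk *schunk = blosc2_schunk_new(&storage);
  float chunk[4000];
  for (int i = 0; i < 4000; i++) chunk[i] = (float)i;
  blosc2_schunk_append_buffer(schunk, chunk, sizeof(chunk));

  // Chunks are addressed by an integer index, not by pointer arithmetic
  float out[4000];
  blosc2_schunk_decompress_chunk(schunk, 0, out, sizeof(out));

  blosc2_schunk_free(schunk);
  blosc2_destroy();
  return 0;
}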

Regarding the computing algorithm, I will use one that follows the principles of the blocking technique: for every chunk, bring it to the CPU, decompress it (so that it stays in cache), run all the necessary operations on it, and then proceed to the next chunk:

/images/breaking-down-memory-walls/blocking-technique.png

For implementation details, have a look at the benchmark sources.
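
In code, the kernel boils down to something like the sketch below (serial for clarity; the actual benchmark parallelizes it with one Blosc2 context per thread, as mentioned above, and checks sizes and error codes properly; CHUNKSIZE is a placeholder for the number of float elements per chunk, 4,000 in our setup):

// Blocking technique: decompress one chunk at a time into a
// cache-sized buffer and reduce it while it is still hot in cache
float sum = 0.0f;
for (int nchunk = 0; nchunk < schunk->nchunks; nchunk++) {
  float block[CHUNKSIZE];   // small enough to stay in L1/L2
  blosc2_schunk_decompress_chunk(schunk, nchunk, block, sizeof(block));
  for (int i = 0; i < CHUNKSIZE; i++) {
    sum += block[i];
  }
}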

Also, in order to allow maximum efficiency when performing multi-threaded operations, the size of each chunk in the super-chunk should fit in the non-shared caches (namely L1 and L2 in modern CPUs). This optimization minimizes contention for the shared caches and buses, thereby allowing dedicated access to the data caches in each core.

For our experiments below, we choose a chunksize of 4,000 elements because Blosc2 needs 2 internal buffers for performing the decompression besides the source and destination buffers. As we are using 32-bit (4 bytes) float values for our exercise, the final size used in caches will be 4,000 * (2 + 2) * 4 = 64,000 bytes, which should fit comfortably in the L2 caches of most modern CPU architectures (which normally sport 256 KB or more). Please note that finding an optimal value for this size might require some fine-tuning, not only for different architectures, but also for different datasets.

The Precipitation Dataset

There are plenty of datasets out there exhibiting different data distributions, so depending on your scenario, your mileage may vary. The dataset chosen here is the result of a regional reanalysis covering the European continent, and in particular, the precipitation data in a certain region of Europe. Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area.

Caveat: For the sake of easy reproducibility, for building the 100 million element dataset I have chosen a small geographical area with a size of 150x150 and reused it repeatedly so as to fill the final dataset completely. As the size of the chunks is smaller than this area, and the super-chunk (as configured here) does not exploit data redundancies across chunks, the results obtained here can be safely extrapolated to an actual dataset made entirely of real data (bar some small differences).

Choosing the Compression Codec

When determining the best codec to use inside Blosc2 (it has support for BloscLZ, LZ4, LZ4HC, Zstd, Zlib and Lizard), it turns out that they behave quite differently, both in terms of compression ratio and speed, depending on the dataset they have to compress and the CPU architecture on which they run. This is quite usual, and it is the reason why you should always try to find the best codec for your use case. Here is how the different codecs behave for our precipitation dataset in terms of decompression speed on our reference platform (Intel Xeon E3-1245):

[figures: i7server-codecs, rainfall-cr]

In this case LZ4HC is the codec that decompresses fastest for any number of threads, and hence the one selected for the benchmarks on the reference platform. A similar procedure has been followed to select the codec for the other CPUs; the selected codec for every CPU will be conveniently specified in the discussion of the results below.

For completeness, I am also showing the compression ratios achieved by the different codecs for the precipitation dataset. Although there are significant differences among them, these usually come at the cost of compression/decompression time. At any rate, even though compression ratio is important, in this blog we are mainly interested in the best decompression speed, so we will use the latter as the only important parameter for codec selection.

Results on Different CPUs

Now it is time to see how our compressed sum algorithm performs compared with the original uncompressed one. However, as not all CPUs are created equal, we are going to see how different CPUs perform when doing exactly the same computation.

Reference CPU: Intel Xeon E3-1245 v5 4-Core processor @ 3.50GHz

This is a mainstream, somewhat 'small' processor for servers that has an excellent price/performance ratio. Its main virtue is that, due to its small core count, the CPU can run at considerably high clock speeds which, combined with a high IPC (Instructions Per Clock) count, delivers considerable computational power. These results are a good baseline reference point for comparing other CPUs packing a larger number of cores (and hence running at lower clock speeds). Here is how it performs:

/images/breaking-down-memory-walls/i7server-rainfall-lz4hc-9.png

We see here that, even though the uncompressed dataset does not scale too well, the compressed dataset shows nice scalability even when using hyperthreading (> 4 threads); this is a remarkable fact for a feature (hyperthreading) that, despite marketing promises, does not always deliver 2x the performance of the physical cores. With that, the performance peak for the compressed precipitation dataset (22 GB/s, using LZ4HC) is really close to the uncompressed one (27 GB/s); quite an achievement for a CPU with just 4 physical cores.

AMD EPYC 7401P 24-Core Processor @ 2.0GHz

This CPU implements the EPYC platform, one of the most powerful architectures ever created by AMD. It packs 24 physical cores, although internally they are split into 2 blocks with 12 cores each. Here is how it behaves:

/images/breaking-down-memory-walls/epyc-rainfall-lz4-9.png

Stalling at 4/8 threads, the EPYC scalability for the uncompressed dataset is definitely not good. The compressed dataset, on the other hand, behaves quite differently: it shows nice scalability through the whole range of cores in the CPU (again, even when using hyperthreading), achieving the best performance (45 GB/s, using LZ4) at precisely 48 threads, well above the maximum performance reached by the uncompressed dataset (30 GB/s).

Intel Scalable Gold 5120 2x 14-Core Processor @ 2.2GHz

Here we have one of the latest and most powerful CPU architectures developed by Intel. We are testing it here in a machine with 2 CPUs, each containing 14 cores. Here is how it performed:

/images/breaking-down-memory-walls/scalable-rainfall-lz4-9.png

In this case, stalling at 24/28 threads, the Intel Scalable shows quite remarkable scalability for the uncompressed dataset (apparently, Intel has finally chosen a good name for an architecture; well done guys!). More importantly, it also reveals even nicer scalability on the compressed dataset, all the way up to 56 threads (which is expected given the 2x 14-core CPUs with hyperthreading); this is a remarkable feat for such a memory bandwidth beast. In absolute terms, the compressed dataset achieves a performance (68 GB/s, using LZ4) that is very close to the uncompressed one (72 GB/s).

Cavium ARMv8 2x 48-Core

We are used to seeing ARM architectures powering most of our phones and tablets, but seeing them performing computational duties is far less common. That does not mean there are no ARM implementations that can power big servers. Cavium, with its 48 cores in a single CPU, is an example of a server-grade chip. In this case we are looking at a machine with two of these CPUs:

/images/breaking-down-memory-walls/cavium-rainfall-blosclz-9.png

Again, we see nice scalability (albeit a bit bumpy) for the uncompressed dataset, reaching its maximum (35 GB/s) at 40 threads. The compressed dataset scales much more smoothly, and we see how the performance peaks at 64 threads (15 GB/s, using BloscLZ) and then drops significantly after that point (even if the CPU still has enough cores to continue scaling; I am not sure why that is). Incidentally, the BloscLZ codec being the best performer here is not a coincidence, as it recently received a lot of fine-tuning for ARM.

What We Learned

We have explored how to use compression in a nearly optimal way to perform a very simple task: computing an aggregation over a large dataset. With a basic understanding of the cache and memory subsystem, and by using appropriate compressed data structures (the super-chunk), we have seen how we can easily produce code that enables modern CPUs to perform operations on compressed data at a speed that approaches the speed of the same operations on uncompressed data (and sometimes exceeds it). More particularly:

  1. Performance for the compressed dataset scales very well with the number of threads for all the CPUs (even hyperthreading seems very beneficial here, which is a welcome surprise).
  2. The CPUs that benefit the most from compression are those with relatively low memory bandwidth and many cores. In particular, the EPYC architecture is a good example, and we have shown how the compressed dataset can operate 50% faster than the uncompressed one.
  3. Even when using CPUs with a low number of cores (e.g. our reference CPU, with only 4) we can achieve computational speeds on compressed data that are on par with traditional, uncompressed computations, while saving precious amounts of memory and disk space.
  4. The appropriate codec (and other parameters) to use within Blosc2 for maximum performance can vary depending on the dataset and the CPU used. Having a way to automatically discover the optimal compression parameters would be a nice addition to the Blosc2 library.

Final Thoughts

To conclude, it is interesting to recall here what Linus Torvalds said back in 2006 (talking about git, the system that he had created the year before):

[...] git actually has a simple design, with stable and reasonably well-documented data structures. In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful. [...] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

Of course, we all know how drastic Linus can be in his statements, but I cannot agree more on how important it is to adopt a data-driven view when designing our applications. I'd go further and say that, when trying to squeeze the last drop of performance out of modern CPUs, data containers need to be structured in a way that leverages the characteristics of the underlying CPU, and that facilitates the application of the blocking technique (thereby allowing compression to run efficiently). Hopefully, installments like this can help us explore new possibilities for breaking down the memory wall that bedevils modern computing.

Acknowledgements

Thanks to my friend Scott Prater for his great advice on improving my writing style, to Dirk Schwanenberg for pointing out the precipitation dataset and for providing the script for reading it, and to Robert McLeod, J. David Ibáñez and Javier Sancho for suggesting general improvements (even though some of their suggestions required such a big amount of work that they made me ponder our actual friendship :).

Appendix: Software used

For reference, here is the software that has been used for this blog entry:

  • OS: Ubuntu 18.04
  • Compiler: GCC 7.3.0
  • C-Blosc2: 2.0.0a6.dev (2018-05-18)

New Forward Compatibility Policy

The #215 issue

Recently, a C-Blosc user filed an issue describing how buffers created with a modern version of C-Blosc (starting from 1.11.0) could not be decompressed with an older version of the library (1.7.0); i.e., C-Blosc was effectively breaking so-called forward compatibility. After some investigation, it turned out that the culprit was an optimization introduced in 1.11.0 in order to allow better compression ratios and, in some cases, better speed too.

Not all the codecs inside C-Blosc are equally affected; the ones that experience the issue are LZ4, LZ4HC and Zlib (quite luckily, BloscLZ, the default codec, is not bitten by this and should be forward compatible probably all the way back to 1.0.0). That is, when a user is using a modern C-Blosc library (>= 1.11.0) with any of the affected codecs, there are situations (namely, when the shuffle or bitshuffle filters are active and the buffers to be compressed are larger than 32 KB) where the result cannot be decompressed with older versions (< 1.11.0).

Why did this occur?

Prior to 1.11.0, Blosc had traditionally split the internal blocks (the different pieces in which the buffer to be compressed is divided) into smaller pieces composed of the same significant bytes that the shuffle filter has put together. The rationale for doing this is that these pieces, in many cases, host values that are either zeros or very close byte values (this is why shuffle works well in general for binary data), and asking the codec to compress these split blocks separately took quite a bit less effort than compressing a complete block, hence providing more speed.

However, I realized that the so-called High Compression Ratio codecs (the HCR codecs inside Blosc are LZ4HC, Zlib and Zstd) generally benefited from this split not happening (the reason is clear: more data means more opportunities for finding more duplicates), and in some cases the speed was better too. So, in C-Blosc 1.11.0, I decided that the split was not going to happen by default for HCR codecs (at the last minute I decided to include LZ4 too, for which the experiments showed a noticeable performance bump as well; see below). Fortunately, the Zstd codec was introduced (in an out-of-beta way) in the very same 1.11.0 release as this split-block change, so in practice data compressed with the Zstd codec is not affected by this.

New forward compatibility enforcement policy

Although this change was deliberate and every measure was taken to make it backward compatible (i.e. new library versions could read buffers compressed with older versions), I was not really aware of the inconvenience that the change created for people producing data files with newer versions of the library and expecting them to be read with older versions.

So, in order to prevent something like this from happening again, I have decided that forward compatibility is going to be enforced for future releases of C-Blosc (just for the 1.x series; C-Blosc 2.x should just be backward compatible with 1.x). By the way, this new forward compatibility policy will require a significantly more costly release procedure, as different libraries for a specific set of versions have to be manually re-created; if you know of a more automatic way to test forward compatibility with old versions of a library, I'd love to hear your comments.

Measures taken and new split mode

In order to alleviate this forward incompatibility issue, I have decided to revert the split change introduced in 1.11.0 in the forthcoming 1.14.0 release. That means that, by default, compressed buffers created with C-Blosc 1.14.0 and later will be forward compatible with all previous C-Blosc libraries (back to 1.3.0, which is when support for different codecs was introduced). That is, the only buffers that will pose problems for decompression with old versions are those created with C-Blosc versions between 1.11.0 and 1.14.0 using the shuffle/bitshuffle filter in combination with the LZ4, LZ4HC or Zlib codecs.

For fine-tuning whether the block split happens or not, I have introduced a new API function, void blosc_set_splitmode(int mode), that allows selecting the split mode that is going to be used during compression. The split modes that the new function can take are:

  • BLOSC_FORWARD_COMPAT_SPLIT
  • BLOSC_AUTO_SPLIT
  • BLOSC_NEVER_SPLIT
  • BLOSC_ALWAYS_SPLIT

BLOSC_FORWARD_COMPAT_SPLIT offers reasonable forward compatibility (i.e. Zstd still will not split, but this is safe because it was introduced at the same time as the split change in 1.11.0), BLOSC_AUTO_SPLIT is for nearly optimal results (based on heuristics; this is the same approach as the one introduced in 1.11.0), and BLOSC_NEVER_SPLIT and BLOSC_ALWAYS_SPLIT are for users interested in experimenting to get the best compression ratios and/or speed. If blosc_set_splitmode() is not called, the default mode will be BLOSC_FORWARD_COMPAT_SPLIT.
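
As a minimal sketch of how this looks in practice (the buffer size and codec here are just illustrative):

#include <stdio.h>
#include <blosc.h>

#define NFLOATS 100000

int main(void) {
  static float src[NFLOATS];
  static char dest[NFLOATS * sizeof(float) + BLOSC_MAX_OVERHEAD];
  blosc_init();
  blosc_set_compressor("lz4hc");
  // Opt out of the forward-compatible default to experiment with ratios
  blosc_set_splitmode(BLOSC_NEVER_SPLIT);
  int csize = blosc_compress(9, BLOSC_SHUFFLE, sizeof(float),
                             sizeof(src), src, dest, sizeof(dest));
  printf("compressed %d -> %d bytes\n", (int)sizeof(src), csize);
  blosc_destroy();
  return 0;
}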

Also, the user is able to specify the split mode via the BLOSC_SPLITMODE environment variable. If that variable exists in the environment and has any value among:

  • 'BLOSC_FORWARD_COMPAT_SPLIT'
  • 'BLOSC_AUTO_SPLIT'
  • 'BLOSC_NEVER_SPLIT'
  • 'BLOSC_ALWAYS_SPLIT'

this will select the corresponding split mode.
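
For example, an existing binary (say, a hypothetical myapp) could be switched to always-split mode without recompiling:

$ BLOSC_SPLITMODE=BLOSC_ALWAYS_SPLIT ./myapp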

How this change affects performance

To visualize at a glance the differences in performance that the new release introduces, let's have a look at the impact on two of the most used codecs inside C-Blosc: LZ4 and LZ4HC. In the plots below, the left side is the pre-1.14.0 version (non-split blocks) and the right side is the forthcoming 1.14.0 (split blocks). Note that I am using here the typical synthetic benchmarks for C-Blosc, so expect a different outcome for your own data.

Let's start with LZ4HC, a High Compression Ratio codec (and the one that triggered the initial report of the forward compatibility issue). When compressing, we have this change in behavior:

[figures: lz4hc-c, lz4hc-compat-c]

For LZ4HC decompression:

[figures: lz4hc-d, lz4hc-compat-d]

For LZ4HC one can see that, when using non-split blocks, it can achieve better compression ratios (this is expected, as the block sizes are larger). Speed-wise the performance is quite similar, with maybe some advantage for split blocks (expected as well). As the raison d'être for HCR codecs is to maximize the compression ratio, that was the reason why I made the split change for LZ4HC in 1.11.0.

And now for LZ4, a codec meant for speed (although it normally gives pretty decent results in compression ratio too). When compressing, here is the change:

[figures: lz4-c, lz4-compat-c]

For LZ4 decompression:

[figures: lz4-d, lz4-compat-d]

So, here one can see that when using Blosc pre-1.14 (i.e. non-split blocks) we get a bit less compression ratio than with the forthcoming 1.14 which, even if counter-intuitive, matches my experience with non-HCR codecs. Speed-wise the difference is not that large during compression; however, decompression is significantly faster with non-split blocks. As LZ4 is meant for speed, this was probably the reason that pushed me towards making non-split blocks the default for LZ4, in addition to the HCR codecs, in 1.11.0.

Feedback

If you have suggestions on this forward compatibility issue or the solution that has been implemented, please shout!

Appendix: Hardware and software used

For reference, here is the configuration that I used for producing the plots in this blog entry.

  • CPU: Intel Xeon E3-1245 v5 @ 3.50GHz (4 physical cores with hyper-threading)
  • OS: Ubuntu 16.04
  • Compiler: GCC 6.3.0
  • C-Blosc: 1.13.7 and 1.14.0 (release candidate)
  • LZ4: 1.8.1

Blosc Has Won Google's Open Source Peer Bonus Program

Last month Google announced the winners of the second 2017 round of their Open Source Peer Bonus program, and I was among them for my commitment to the Blosc project. It took me a bit, but I wanted to express my thoughts on this nice event. Needless to say, I am proud of and honored by this recognition, most especially because this is the first completely disinterested donation that someone has made to me after 15 years of doing open source (on many occasions as full-time work), so thank you very much Google! The assumption is that people do open source because 1) they believe in the concept and 2) they can earn a public reputation that allows them to get contracts (so allowing many of us to have a life!). However, this time the unexpected happened, and the fact that an important corporation like Google decided to publicly recognize this work makes me very happy (could that pave the way for others to follow? :-).

Having said this, and as happens with any open source project that has seen some success, the contributions of other people have been instrumental in making Blosc such a featured and stable library. People like Jack Pappas, Valentin Hänel, Christopher Speller, Antonio Valentino and up to 30 contributors have made important contributions to the project. This award goes to them too.

This push is very timely because it gives me more stamina towards the release of Blosc2. Blosc2 is the next generation of Blosc and will add features like:

  • Full 64-bit support for chunks (i.e. no longer limited to 2 GB).
  • New filters, like delta and truncation of floating point precision.
  • A new filter pipeline that will allow running more than one filter before the compression step.
  • Support for variable length objects (i.e. not limited to fixed-length datasets).
  • Support for dictionaries between different blocks in the same chunk. That will be important for allowing smaller chunks (and hence improving decompression latency) while keeping compression ratio and performance mostly untouched.
  • Support for more codecs (Lizard support is already in).
  • New serialisation format which is meant to allow self-discovery via magic numbers and introspection.
  • New super-chunk object that will allow working seamlessly with arbitrarily large sets of chunks, both in-memory and on-disk.
  • Support for SIMD in ARM processors.

All in all, and after more than 2 years working on different aspects of these features, I am quite satisfied with the progress so far. My expectation was to do a beta release during this fall and, although the work is quite advanced, there are still some loose ends that require quite a bit of work. If you like where I am headed and are interested in seeing this work completed faster, a contribution to the project in the form of a pull request or, better yet, a donation suggesting which feature you would like the most will be greatly appreciated.

Finally, I'd like to take the opportunity to announce that Blosc has a logo (finally!). You can admire it at the header of this page. This is the work of Domènec Morera, who also made the PyTables logo for us. I really think he is a great artist and that he did an excellent job again; I hope the new logo will be beneficial for the Blosc project as a whole!

The Lizard Codec

This past weekend I put some time into integrating one of the codecs I have lately been most curious about (especially since the release of its 1.0 version some months ago). I am talking about Lizard, a direct derivative of the LZ4 codec whose author is Przemyslaw Skibinski. One should remark that Przemyslaw is not new to the compression arena, as he helped Yann Collet quite a lot during the Zstandard development, and he is also the author of lzbench, a nice and comprehensive in-memory benchmark of a series of open-source LZ77/LZSS/LZMA compressors.

The reason why I thought Lizard was an interesting codec for Blosc is that it mixes some interesting optimizations of the LZ4 codec and, optionally, can combine them with Huffman coding by selecting different compression levels (currently from 10 to 49).

After the initial support for Lizard in Blosc, it took me some time to determine a decent map between the compression levels in Blosc (1-9) and the ones in Lizard (10-49), mainly to allow fast compression and decompression (what Blosc is all about). Along the way, I discovered that the most interesting compression levels in Lizard are 10, 20 and 41. This was determined using the synthetic benchmark that comes with Blosc, but that is the usual path that gives me quite good estimates for a first calibration (we are working on a more complete tuner that can adapt to actual data in real time, but I'll blog about that on another occasion).
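
Just as an illustration of what such a map can look like (this is a hypothetical table, not the exact one that ended up in the sources):

// Hypothetical Blosc (1-9) to Lizard (10-49) compression level map;
// 10/20 are the fast LZ4-style modes, 41 enables LIZv1 + Huffman
static const int blosc_to_lizard[10] = {
  0,            // clevel 0: no compression
  10, 10, 10,   // clevels 1-3: fastest mode
  20, 20, 20,   // clevels 4-6: better ratio, still fast
  20, 41, 41    // clevels 7-9: 8 and 9 go for LIZv1 + Huffman
};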

A new star is born

After the calibration was done, the results for the new codec are really surprising:

[figures: lizard-c, lizard-d]

The interesting part of Lizard can be seen when high compression levels for Blosc are used, especially 8 and 9. Those are mapped to compression level 41 in Lizard, which means that the LIZv1 + Huffman compression method is used. According to the documentation, this matches the compression levels of Zlib and Zstd/Brotli, and it shows.

Just for reference, here is the performance of the LZ4 codec, from which Lizard inherits a good part of its code:

[figures: lz4-c, lz4-d]

And here is the performance of Zstd, which also uses Huffman coding:

[figures: zstd-c, zstd-d]

So, while Lizard (or at least the current mapping that I did for it inside Blosc) cannot beat the speed of LZ4 or the compression ratios of Zstd at low compression levels, at high compression levels it clearly beats both LZ4 and Zstd in speed, both for compression and decompression. Most especially, it works extremely well for achieving pretty reasonable compression ratios (typically better than Zlib, albeit not as good as Zstd) at very good decompression speed and exceptional compression speed (compressing at more than memcpy() speed, at very good ratios, oh really?).

Finally, for those wondering why I have not used the LIZv1 + Huffman compression method for the lower compression levels in Blosc as well, the answer is that I obviously tried that, but for some reason this method only performs well for large buffers, whereas for small buffers (like the ones created by low compression levels in Blosc) its performance is rather poor. I got similar behaviour with Zstd, where performance shines when decompressing large buffers (the difference being that Lizard can compress at tremendous speed compared with Zstd in this scenario), so I suppose this is typical when Huffman methods are used.

Finding its place among Blosc codecs

In my previous blog post, I said that Zstd has virtually no competitor in Write Once Read Multiple scenarios. However, I think there is still a niche for codecs that, without providing the extreme compression ratios of Zstd, still show big enough compression muscle without losing too much compression speed. IMO, this is a good description of how Lizard performs. However, in Blosc1 we only have slots for a couple more codecs (that will not be a problem for Blosc2, where many more codecs will be supported), and although I am pretty enthusiastic about adding Lizard, it would be nice to gather user feedback before doing so. So in case you are a Blosc user, please use the lizard branch of the C-Blosc repo (UPDATE: Lizard has been merged into the C-Blosc2 repo recently) and report back your results.

Appendix: Hardware and software used

For reference, here is the configuration that I used for producing the plots in this blog entry.

  • CPU: Intel Xeon E3-1245 v5 @ 3.50GHz (4 physical cores with hyper-threading)
  • OS: Ubuntu 16.04
  • Compiler: GCC 6.3.0
  • C-Blosc2: 2.0.0a4.dev (2017-07-29, lizard branch)
  • Lizard: 1.0.0

Testing PGO with LZ4 and Zstd codecs

In last week's post I showed how the PGO (Profile Guided Optimization) capability in modern compilers allowed a good increase in the performance of the BloscLZ codec. Today I'd like to test how PGO affects the speed of the same synthetic benchmark that comes with C-Blosc2 for two more of the most used codecs in Blosc: LZ4 and Zstd.

LZ4

First, for GCC without PGO:

[figures: lz4-old-c, lz4-old-d]

Now with PGO enabled:

[figures: lz4-pgo-c, lz4-pgo-d]

We can see here that, similarly to BloscLZ, although the compression speed has not improved significantly, decompression is now reaching up to 30 GB/s and, for high compression levels, up to 20 GB/s, which is pretty good.

Zstd

First, for GCC without PGO:

[figures: zstd-old-c, zstd-old-d]

Now with PGO enabled:

[figures: zstd-pgo-c, zstd-pgo-d]

Wow, in this case we really can see important speedups both when compressing and when decompressing. Especially interesting is the decompression case where, for the higher compression levels, Zstd can reach speeds exceeding 20 GB/s (whereas without PGO it could not exceed 12 GB/s), which seems a bit crazy given the wonderful compression ratios that Zstd is able to achieve. Beyond any doubt, for Write Once Read Multiple scenarios there is no competitor for Zstd, most especially when PGO is used.

This confirms that, once again, when performance is critical for your applications, PGO should be part of your daily weaponry.

Appendix: Hardware and software used

For reference, here is the configuration that I used for producing the plots in this blog entry.

  • CPU: Intel Xeon E3-1245 v5 @ 3.50GHz (4 physical cores with hyper-threading)
  • OS: Ubuntu 16.04
  • Compiler: GCC 6.3.0 (using PGO)
  • C-Blosc2: 2.0.0a4.dev (2017-07-11)
  • LZ4 codec: 1.7.5
  • Zstd codec: 1.3.0

Fine Tuning the BloscLZ codec

Yesterday I was reading about the exciting new CPU architectures that both AMD and Intel are introducing, and I was wondering how the improved architecture of the new cores and, most especially, their caches could benefit Blosc. It turns out that I have access to a server with a relatively modern CPU (Xeon E3-1245 v5 @ 3.50GHz, with 4 physical cores) and I decided to have a go at fine-tuning the included BloscLZ codec (the one that I know best) inside C-Blosc2. Of course, I already spent some time tuning BloscLZ years ago but, given the fast pace at which CPUs are evolving, I thought this was excellent timing for another round of fine-tuning, most especially in preparation for users adopting the forthcoming RYZEN, Threadripper, EPYC and Skylake-SP architectures.

Frankly speaking, I was expecting very little improvement on this front, but the results have been unexpectedly good. Keep reading.

Where we come from

Just for reference, here is the performance of the BloscLZ codec on my server before the new tuning work:

[figures: blosclz-old-c, blosclz-old-d]

That is the typical synthetic benchmark in Blosc but, with the plotting function in the C-Blosc2 project, the actual size of each compressed buffer is shown (and not the size of the whole dataset, as in C-Blosc1). In this case, the dataset (256 MB) is split into chunks of 4 MB and, given that our CPU has an LLC (Last Level Cache) of 8 MB, this is sort of an optimal size for achieving maximum performance (the buffers meant for Blosc usually do not exceed 4 MB in most of its common usages).

As can be seen, performance is quite good, although the compression ratios leave something to be desired. Furthermore, for the maximum compression level (9), the compression ratio shows a regression with respect to the previous level (8). This is not too bad, and it sometimes happens with any codec, but the nice thing would be to avoid it if possible.

The new BloscLZ after fine tuning

So, after a couple of hours playing with different parameters in BloscLZ and C-Blosc2, I started to realize that the new Intel CPU performed exceedingly well when asked to compress more, to the point that high compression settings were not performing that much slower in comparison with low compression ones; rather the contrary: high compression settings were operating at almost the same speed as lower ones (which was a welcome surprise indeed). Hence I set quite a bit more aggressive parameters in BloscLZ, while trying to keep the size of the internal blocks in Blosc2 below 256 KB (the typical size of L2 caches in modern CPUs). This is the result:

[figures: blosclz-new-c, blosclz-new-d]

So the compression ratios have increased quite a bit, especially for the larger compression levels (going from less than 10x to more than 20x for this benchmark). This is courtesy of the new, more aggressive compression parameters. Strikingly enough, performance has also increased in general, but especially at these large compression levels. I am not completely certain why this is the case, but probably this new CPU architecture is much better at out-of-order execution and at prefetching larger blocks of data, which benefits compression speed even for large buffers; similarly, I am pretty sure that improvements in compiler technology (I am using a recent GCC 6.3.0 here) are pretty important for getting faster binary code. We can also see that when using 4 threads (i.e. all the physical cores available in our CPU), BloscLZ can compress faster than a memcpy() call in most cases, and most especially at large compression levels, as mentioned before. Oh, and we also got rid of the regression in the compression ratio for compression level 9, which is cool.

Regarding decompression speed, we can see that the new tuning gave general speed-ups of between 10% and 20%, with no significant slowdowns in any case. All in all, quite good results indeed!

Room for more improvements? Enter PGO.

To temporarily end this quest for speed (optimization is a never-ending task), I was curious about the speed that we can buy by using the PGO (Profile Guided Optimization) capability present in most modern compilers. Here I am going to use the PGO of GCC in combination with our benchmark at hand so as to provide the profile for the compiler optimizer. Here are the results when PGO is applied to the new parametrization:

[figures: blosclz-pgo-c, blosclz-pgo-d]

So, while the speed improvement for compression is not significant (albeit a bit better), the big improvement comes in decompression speed, where we see speeds almost reaching 50 GB/s and, perhaps more interestingly, more than 35 GB/s for the maximum compression level. For the first time in my life as a Blosc developer, I can see single-threaded decompression being faster than memcpy() for all the compression levels.

I wonder what the PGO technique can bring to other codecs in Blosc, but that is stuff for another blog post. At any rate, the reader is encouraged to try PGO on their own setup. I am pretty sure that she will be pleased to see nice speed improvements.
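
In case you want to try this at home, a typical GCC PGO cycle looks roughly like the following sketch (-fprofile-generate and -fprofile-use are the standard GCC flags; the source paths and extra libraries are illustrative, as the real project builds with CMake):

$ gcc -O3 -fprofile-generate -o bench bench/bench.c blosc/*.c -lpthread   # instrumented build
$ ./bench blosclz shuffle single 4                                        # run a representative workload
$ gcc -O3 -fprofile-use -o bench bench/bench.c blosc/*.c -lpthread        # rebuild using the profile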

Appendix: Hardware and software used

For reference, here is the configuration that I used for producing the plots in this blog entry.

  • CPU: Intel Xeon E3-1245 v5 @ 3.50GHz (4 physical cores with hyper-threading)
  • OS: Ubuntu 16.04
  • Compiler: GCC 6.3.0
  • C-Blosc2: 2.0.0a4.dev (2017-07-14)
  • BloscLZ: 1.0.6 (2017-07-14)

Zstd has just landed in Blosc

Zstd, aka Zstandard, is a new breed of compression library that promises to achieve better compression ratios than Zlib, at better speeds too. The fact that Zstd has been geared towards fast compression/decompression from the beginning was an indication to me that it could be a good fit for Blosc. After some months of experimentation with Zstd in Blosc2, I am really happy to say that I am quite impressed with how the pair performs.

And now that the Zstd format has been declared stable and its API is maturing rapidly, it is a good time for inclusion in the Blosc1 project too. In Blosc1 there were still a couple of slots available for additional codecs and, after my positive experiences with Zstd, I decided that it would be an excellent candidate to take one of the free seats (we will see which codec takes the last one, if any).

Beware: the Zstd support in Blosc should still be considered beta, so it is not recommended to use this new codec in production yet. It is, however, recommended to start experimenting with it so as to see the kind of improvements it can bring to your scenario, and especially to report possible flaws back.
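
For those who want to experiment, selecting the new codec from C is a one-liner; a minimal sketch (it requires a C-Blosc build with Zstd support):

#include <stdio.h>
#include <blosc.h>

int main(void) {
  blosc_init();
  // blosc_set_compressor() returns a negative value if the codec
  // is not available in this build
  if (blosc_set_compressor("zstd") < 0) {
    fprintf(stderr, "this C-Blosc build lacks Zstd support\n");
    return 1;
  }
  // ... compress/decompress with blosc_compress()/blosc_decompress() ...
  blosc_destroy();
  return 0;
}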

A compression beast for Blosc operation

As said, Zstd is meant to achieve better compression ratios than Zlib, and this is indeed the case for many situations already. But it turns out that Zstd shines especially when faced with the kind of data that is left after the shuffle (or bitshuffle) filter passes.

For one, here is the typical benchmark plot for compressing with Zstd on my machine (Intel Xeon E3-1245-v5 @ 3.5GHz), side-by-side with Zlib, which was the codec with the best compression ratios among all those supported inside Blosc:

[figures: lap-zstd-c, lap-zlib-c]

As can be seen, Zstd achieves a maximum compression ratio of more than 300x for this specific dataset, which is quite a lot more than the 70x achieved by Zlib. But the coolest thing is that we are not paying a performance price for this increased compression ratio; rather the contrary, because Zstd is clearly superior (by up to 25%) to Zlib in compression speed.

But one of the most distinctive features of Blosc is its ability to decompress data very fast (sometimes faster than memcpy(), as I like to point out). And look at what Zstd is able to achieve in this case:

[figures: lap-zstd-d, lap-zlib-d]

With peak speeds higher than 10 GB/s, Zstd can decompress data more than 2x faster than Zlib at its peak (~4 GB/s). And more importantly, when it comes to decompressing data at the highest compression level, Zstd can do it about 6x faster than Zlib (~6 GB/s vs ~1 GB/s), which is a welcome surprise.

Not the fastest, but a nicely balanced one

Of course, Zstd is still far from being the fastest codec in Blosc. See for example how the internal BloscLZ codec can perform on this machine:

[figures: lap-blosclz-c, lap-blosclz-d]

Nevertheless, due to its impressive balance between compression ratio and speed, Zstd is set to become one of the most attractive codecs in Blosc for the near future.

As always, all the benchmarks here were made with the specific, synthetic dataset that I have been using for Blosc since the beginning (mainly for reproducibility purposes). But I am pretty sure that most of the capabilities shown here will be experienced with the large variety of datasets that Blosc is meant to tackle (in fact, it would be nice if you could share your experience by adding a comment below).

Finally, my special thanks to Yann Collet, the author of Zstd (as well as LZ4, also included in Blosc) for putting his genius at the service of the community by opening not only his code, but also his mind in his amazing series of blogs about compression: http://fastcompression.blogspot.com

Appendix: What can be expected in Blosc2

Blosc2 has support for Zstd contexts and a new way of splitting chunks into blocks that makes codecs go faster in general. Below you have a couple of plots showing how the Blosc2/Zstd couple behaves:

[figures: blosc2-zstd-c, blosc2-zstd-d]

As can be seen, in Blosc2 Zstd can reach peaks of more than 15 GB/s, almost reaching memcpy() speed on this machine (~17 GB/s). Also, decompression speed at the highest compression ratio can scale when throwing more threads at it (a thing that Blosc1 is not able to achieve), easily surpassing 10 GB/s. Notice that reaching such a high speed while decompressing a buffer with a really high compression ratio (~300x) is really impressive. For its part, compression speed is a bit lower (25%) than in Blosc1 but still quite competitive (and on par with Zlib).

This is really exciting news to be added on top of the new planned features for Blosc2.

ARM is becoming a first-class citizen for Blosc

We are happy to announce that Blosc is receiving official support for ARM processors. Blosc has always been meant to support all platforms where a C89 compliant C compiler can be found but, until now, the only hardware platform that we have been testing on a regular basis has been Intel (on top of Unix/Linux, Mac OSX and Windows).

We want this to change, and the ARM architecture has been our first candidate to become a fully supported platform besides Intel/AMD. You may be wondering why we did not choose some other architecture like MIPS or PowerPC; so why ARM?

ARM is eating the world

ARM is an increasingly popular architecture, and we can find exemplars of it not only in phones, tablets or Chromebooks, but also acting as embedded processors, providing computing power to the immensely popular Raspberry Pis and Arduinos, and even in environments apparently as alien to it as High Performance Computing.

Contrary to what has been traditional for other computer platforms, one of the most important design goals for ARM is to keep energy consumption under very strict limits. Nowadays, the ARM architecture can run decently powerful CPUs where each core consumes just 600 to 750 mW or less.

In my opinion, it is precisely this energy efficiency that makes ARM one of the platforms with the best prospects of gaining ground as a general computing platform in the near future. By now, we all know that ARM allows packing more cores into a single die (e.g. your phone having more cores than your laptop, anyone?). And more cores means more combined computing throughput (albeit a bit more difficult to program), but more importantly, more cores able to bring data from memory at the same time. Contrary to what one might think, having different threads transmitting data from RAM to the CPU caches provides better utilization of the memory buses and, hence, a much better global memory bandwidth. This can be seen, for example, in typical Blosc benchmarks by looking at how the bandwidth grows with the number of threads at all the dots, but especially where the compression ratio equals 1 (i.e. no compression is active, so Blosc is only doing a memory copy in this case).

Blosc is getting ready for ARM

So ARM is cool indeed, but what are we doing to make it a first-class citizen? For starters, we have created a new C-Blosc2 repository that is going to act as a playground for some time and where we are going to experiment with a new range of features (those will be discussed in a later post). And this is exactly the place where we have already started implementing a NEON version of the shuffle filter.

NEON is a SIMD extension in the same spirit as the SSE2 or AVX2 present in Intel/AMD offerings. The NEON extension was introduced with the ARMv7 architecture and is present in most current high-end devices (including most of the phones and tablets floating around, as well as the new Raspberry Pi 2). As many of you know, leveraging SIMD in modern CPUs is key to making Blosc one of the fastest compressors around and, if we wanted to be serious about ARM, NEON support had to be there, period.
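
To give a flavor of what NEON brings, here is a simplified sketch of the idea (not Lucian's actual implementation, described below): the core of a shuffle for a 4-byte type can be expressed with the deinterleaving loads that NEON provides:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Sketch: shuffle 'nitems' elements of a 4-byte type (nitems must be a
// multiple of 16). vld4q_u8() loads 64 interleaved bytes and splits them
// into 4 registers, one per byte-significance plane; storing each plane
// contiguously is exactly what the shuffle filter does.
static void shuffle4_neon(uint8_t *dest, const uint8_t *src, size_t nitems) {
  for (size_t i = 0; i < nitems; i += 16) {
    uint8x16x4_t planes = vld4q_u8(src + 4 * i);
    vst1q_u8(dest + 0 * nitems + i, planes.val[0]);
    vst1q_u8(dest + 1 * nitems + i, planes.val[1]);
    vst1q_u8(dest + 2 * nitems + i, planes.val[2]);
    vst1q_u8(dest + 3 * nitems + i, planes.val[3]);
  }
}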

The new NEON implementation of shuffle for Blosc has been entirely made by Lucian Marc, a summer student who joined the project at the beginning of July 2015. Lucian did a terrific job implementing the shuffle filter in NEON and, during the 2-month stint, he not only did that, but also had time to do a preliminary version of the bitshuffle filter (not completely functional yet, but as time allows, he plans to finish it).

Some hints on the measured increase in performance

So you might be asking, how fast can Blosc perform on an ARM with NEON? Well, let's start by showing how fast it works on a Raspberry Pi 2 (Broadcom BCM2836 ARMv7 quad-core processor) having NEON and running Raspbian (gcc 4.7.2). So as not to bore people, we are going to show just decompression speeds:

/images/blosclz-shuffle-neon-rpi2.png

It turns out that, when using the 4 cores and low compression levels, Blosc with NEON support already shows evidence that it can equal the performance of memcpy() on ARM. This is an important fact because I did not think that ARM performance was already enough to allow Blosc to do that. I was wrong.

Okay, so Blosc using NEON can be fast, but exactly how fast compared to a shuffle implementation in pure C? Here are the figures for the generic C shuffle:

/images/blosclz-shuffle-generic-rpi2.png

That means that NEON can accelerate the whole decompression process between 2x and 3x, which is pretty significant, and also speaks highly of the quality of Lucian's NEON implementation.

Does that mean that we can extrapolate these figures to all ARM processors out there? Not quite. In fact, the performance of a Raspberry Pi 2 is quite mild compared with other boards. So, let's see what the performance is on an ODROID-XU3 (although it has been replaced by the ODROID-XU4, the XU3 has the same processor, so we are testing a pretty powerful CPU model here). This board comes with a Samsung Exynos5422 Cortex-A15 2.0 GHz quad-core and Cortex-A7 quad-core CPUs, so it is a representative of the ARM Heterogeneous Multi-Processing solution (aka big.LITTLE). Here are its figures:

/images/blosclz-shuffle-neon-odroid.png

So, the first thing to note is the memcpy() speed which, at 1.6 GB/s, is considerably faster than on the RPi2 (< 0.9 GB/s). Yeah, this is a much more capable board from a computational point of view. The second thing is that the decompression speed almost doubles the memcpy() speed. Again, I was very impressed because I did not expect this range of speeds at all. ARM is definitely getting into a situation where compression can be used to advantage, computationally speaking.

The third thing to note is a bit disappointing though: why do only 3 threads appear in the plot? Well, it turns out that the benchmark suite fails miserably when using 4 threads or more. As the Raspberry setup does not suffer from this problem at all, I presume that this is more related to the board or the libraries that come with the operating system (Ubuntu 14.04). This is rather unfortunate, because I was really curious to see such an ARMv7 8-core beast running at full steam with all 8 threads. At any rate, time will tell whether the problem is in the board or in Blosc itself.

Just to make the benchmarks a bit more complete, let me finish this section by showing the performance using the generic C code for the shuffling algorithm:

/images/blosclz-shuffle-generic-odroid.png

If we compare with the NEON figures for the ODROID board, we can again see an increase in speed of between 2x and 4x, which is crazy amazing (sorry if I seem a bit over-enthusiastic but, again, I was not really prepared to see this). Once more, only figures for 2 threads are in this plot, because the benchmark crashes with 3 threads (another hint that points to the fault being outside Blosc itself, and not in its NEON implementation of the shuffle filter).

At decompression speeds of 3 GB/s and ~2 W of power consumption, the ARM platform has one of the best bandwidth/Watt ratios that you can find on the market, and this can have (and will have) profound implications on how computations will be made in the near future (as the Mont Blanc initiative is trying to demonstrate).

What to expect from ARM/Blosc in the forthcoming months

This work on supporting the ARM platform is just the beginning. As ARM processors become more widespread and, most especially, faster, we will need to refine the support for ARM in Blosc.

NEON support is only part of the game, and things like efficient handling of ARM heterogeneous architectures (big.LITTLE) or making specific tweaks for ARM cache sizes will be critical to making ARM a true first-class citizen of the Blosc ecosystem.

If you have ideas on what can be improved, and most specially how, we want to learn from you :) If you want to contribute code to the project, your pull requests are very welcome too! If you like what we are doing and want to see more of this, you can also sponsor us.

Hairy situation of Microsoft Windows compilers

Recently -- and at the request of a customer who recently sponsored us -- I struggled a lot trying to get the maximum performance out of Visual Studio compilers. Here are some quick benchmarks to give you an overview of the kind of performance that C-Blosc can reach on Windows.

First, let's use Visual Studio 2008 32-bit (an extremely common platform because Python 2 still requires this compiler) and see how C-Blosc performs at decompressing on my laptop with Windows 7 Pro (64-bit) and an Intel i5-3380M @ 2.90GHz:

/images/vs2008-32bit-decompress.png

Now, let us see how the same benchmark performs with Visual Studio 2013:

/images/vs2013-64bit-decompress.png

Well, there is an important boost in speed, not only because a native 64-bit compiler has been used, but also because of natural improvements in compiler technology.

At this point I wondered whether Visual Studio 2013 was doing just a decent job or if there was still some performance that could be squeezed out. So what kind of performance are other compilers for Windows reaching? To check this, I tested the excellent MinGW-w64 compiler (thanks to Jack Pappas for suggesting this!). Here is the result:

/images/mingw-w64-64bit-decompress.png

So, one can see that GCC 4.9 (included in the latest MinGW-w64) can reach a performance that is still far beyond what you can reach with modern Microsoft compilers (especially at lower compression levels, an important scenario when maximum speed is required), and very close to what I get on Linux.

Possibly the newest Visual Studio 2015 would allow more performance but, IMO, there is still some time until it becomes widespread, whereas GCC 4.9 (with GCC 5.1 starting to show up) is already shipping in many distributions for Linux, Windows and Mac OSX, which gives GCC a lot of advantage with respect to Visual Studio.

As for why GCC shows that much more performance for C-Blosc, it is probably a consequence of how the library has been developed. It turns out that C-Blosc's main development platform was (and still is) Linux/GCC and, after many profile/optimize cycles, this tends to favor that combination with respect to others.

Given this, and regarding the original request to reach optimal performance in Windows / Visual Studio 2013 64-bit environments, I ended up implementing an example where existing Visual Studio applications can dynamically link a C-Blosc DLL that is in the PATH. You can see how this technique works at: https://github.com/Blosc/c-blosc/blob/master/examples/win-dynamic-linking.c

This is quite interesting because at compilation time you don't need to make any reference to the C-Blosc DLL at all; i.e., the following is enough for compiling the example above:

cl /Ox /Fewin-dynamic-linking.exe /I..\blosc win-dynamic-linking.c

And that's all. After that, you only need to place the C-Blosc DLL anywhere in your PATH and it will be dynamically detected. I have tested this with different combinations of compilers (e.g. Visual Studio for the app and MinGW-w64 for the DLL) and it works beautifully. I think this is quite powerful, and I certainly don't know an equivalent technique for Unix (although one probably exists as well); it allows using top-performance DLLs in your apps built with different compilers in a quite easy way.
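
The gist of the technique (the linked example does essentially this, with more error checking) is to resolve the Blosc symbols at run time via the Win32 API; a minimal sketch:

#include <windows.h>
#include <stdio.h>

/* Function-pointer type mirroring the blosc_compress() prototype */
typedef int (*blosc_compress_t)(int clevel, int doshuffle, size_t typesize,
                                size_t nbytes, const void *src,
                                void *dest, size_t destsize);

int main(void) {
  /* LoadLibrary() searches the PATH for the DLL at run time, so the
     app does not reference blosc.dll at all during compilation */
  HINSTANCE lib = LoadLibrary(TEXT("blosc.dll"));
  if (lib == NULL) {
    fprintf(stderr, "cannot find blosc.dll in PATH\n");
    return 1;
  }
  blosc_compress_t my_compress =
      (blosc_compress_t)GetProcAddress(lib, "blosc_compress");
  /* ... call my_compress() just like the statically linked symbol ... */
  FreeLibrary(lib);
  return 0;
}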

In case you have more hints on how to get better performance on Windows, please tell us.

New 'bitshuffle' filter

Although Blosc was meant to host more than one filter since day 0, it has traditionally come with just a single filter, known as 'shuffle', meant for shuffling bytes in binary blocks. Today this has changed, and c-blosc has officially received a new filter called 'bitshuffle' (a backport of the one included in the bitshuffle project). And you guessed it: it works in a very similar way to 'shuffle', except that the shuffling happens at the bit level and not at the byte one.

Just for whetting your appetite, here are some small synthetic benchmarks showing what you can expect from the newcomer. I'll start with my own laptop (Intel i5-3380M @ 2.90GHz, GCC 4.9.1, Ubuntu 14.10), showing how the benchmark that comes with c-blosc performs with the LZ4 compressor and the regular 'shuffle' filter:

$ bench/bench lz4 shuffle single 2
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
Using compressor: lz4
Using shuffle type: shuffle
Running suite: single
--> 2, 2097152, 8, 19, lz4, shuffle
********************** Run info ******************************
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 2
********************** Running benchmarks *********************
memcpy(write):            529.1 us, 3780.3 MB/s
memcpy(read):             245.6 us, 8143.4 MB/s
Compression level: 0
comp(write):      267.3 us, 7483.3 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):     200.0 us, 9997.5 MB/s   OK
Compression level: 1
comp(write):      462.8 us, 4321.1 MB/s   Final bytes: 554512  Ratio: 3.78
decomp(read):     246.6 us, 8111.5 MB/s   OK
Compression level: 2
comp(write):      506.6 us, 3947.7 MB/s   Final bytes: 498960  Ratio: 4.20
decomp(read):     331.9 us, 6025.1 MB/s   OK
Compression level: 3
comp(write):      486.8 us, 4108.8 MB/s   Final bytes: 520824  Ratio: 4.03
decomp(read):     233.5 us, 8565.2 MB/s   OK
Compression level: 4
comp(write):      497.9 us, 4017.0 MB/s   Final bytes: 332112  Ratio: 6.31
decomp(read):     258.3 us, 7743.8 MB/s   OK
Compression level: 5
comp(write):      474.6 us, 4214.5 MB/s   Final bytes: 327112  Ratio: 6.41
decomp(read):     287.8 us, 6949.0 MB/s   OK
Compression level: 6
comp(write):      558.0 us, 3584.4 MB/s   Final bytes: 226308  Ratio: 9.27
decomp(read):     284.8 us, 7022.7 MB/s   OK
Compression level: 7
comp(write):      689.9 us, 2899.1 MB/s   Final bytes: 211880  Ratio: 9.90
decomp(read):     363.0 us, 5509.1 MB/s   OK
Compression level: 8
comp(write):      691.9 us, 2890.6 MB/s   Final bytes: 220464  Ratio: 9.51
decomp(read):     385.5 us, 5188.5 MB/s   OK
Compression level: 9
comp(write):      567.0 us, 3527.6 MB/s   Final bytes: 132154  Ratio: 15.87
decomp(read):     627.3 us, 3188.2 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:       3.6 s, 4755.3 MB/s

Now, look at what bitshuffle can do with the same dataset and compressor:

$ bench/bench lz4 bitshuffle single 2
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
Using compressor: lz4
Using shuffle type: bitshuffle
Running suite: single
--> 2, 2097152, 8, 19, lz4, bitshuffle
********************** Run info ******************************
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 2
********************** Running benchmarks *********************
memcpy(write):            518.5 us, 3857.3 MB/s
memcpy(read):             248.6 us, 8045.7 MB/s
Compression level: 0
comp(write):      259.5 us, 7706.1 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):     217.5 us, 9196.3 MB/s   OK
Compression level: 1
comp(write):     1099.0 us, 1819.9 MB/s   Final bytes: 72624  Ratio: 28.88
decomp(read):     824.2 us, 2426.5 MB/s   OK
Compression level: 2
comp(write):     1093.2 us, 1829.5 MB/s   Final bytes: 71376  Ratio: 29.38
decomp(read):    1293.2 us, 1546.5 MB/s   OK
Compression level: 3
comp(write):     1084.5 us, 1844.2 MB/s   Final bytes: 69200  Ratio: 30.31
decomp(read):    1331.2 us, 1502.4 MB/s   OK
Compression level: 4
comp(write):     1193.2 us, 1676.2 MB/s   Final bytes: 42480  Ratio: 49.37
decomp(read):     833.8 us, 2398.7 MB/s   OK
Compression level: 5
comp(write):     1190.9 us, 1679.4 MB/s   Final bytes: 42928  Ratio: 48.85
decomp(read):     880.2 us, 2272.2 MB/s   OK
Compression level: 6
comp(write):      969.7 us, 2062.5 MB/s   Final bytes: 32000  Ratio: 65.54
decomp(read):     854.8 us, 2339.8 MB/s   OK
Compression level: 7
comp(write):     1056.2 us, 1893.6 MB/s   Final bytes: 40474  Ratio: 51.81
decomp(read):     960.8 us, 2081.7 MB/s   OK
Compression level: 8
comp(write):     1018.5 us, 1963.8 MB/s   Final bytes: 28050  Ratio: 74.76
decomp(read):     966.8 us, 2068.7 MB/s   OK
Compression level: 9
comp(write):     1161.7 us, 1721.6 MB/s   Final bytes: 25188  Ratio: 83.26
decomp(read):    1245.5 us, 1605.8 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:       7.8 s, 2161.7 MB/s

Amazing! The compression ratios are much higher (up to 83x vs 16x), which is very exciting. The drawback is that with 'bitshuffle' the compression/decompression speed is between 2x and 4x slower than with the regular 'shuffle'. In fact, this slowdown is surprisingly small given that the amount of work is potentially much larger (a byte has 8 bits), so that's not too bad.
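
If you want to give the new filter a try from your own code, here is a hedged sketch of how one would select it through the C API; I am assuming the BLOSC_BITSHUFFLE constant that blosc.h exposes starting with the 1.7 series:

#include <stdio.h>
#include <stdlib.h>
#include "blosc.h"

int main(void) {
  size_t nbytes = 1000 * sizeof(float);
  float *src = (float *)malloc(nbytes);
  void *dest = malloc(nbytes + BLOSC_MAX_OVERHEAD);
  int i, csize;

  for (i = 0; i < 1000; i++) src[i] = (float)i;

  blosc_init();
  blosc_set_compressor("lz4");
  /* clevel=5; BLOSC_BITSHUFFLE requests the bit-level filter
     (BLOSC_SHUFFLE would request the classical byte-level one) */
  csize = blosc_compress(5, BLOSC_BITSHUFFLE, sizeof(float), nbytes,
                         src, dest, nbytes + BLOSC_MAX_OVERHEAD);
  printf("compressed %d bytes into %d bytes\n", (int)nbytes, csize);

  blosc_destroy();
  free(src);
  free(dest);
  return 0;
}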

But we have some good news: besides SSE2, 'bitshuffle' also supports AVX2 SIMD instructions (as does 'shuffle' itself), but unfortunately my laptop lacks them (it is pre-Haswell). So let's run the benchmark above on an AVX2-capable server (Intel Xeon E3-1240 v3 @ 3.40GHz, GCC 4.9.3, Gentoo 2.2):

$ bench/bench lz4 bitshuffle single 8
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
Using compressor: lz4
Using shuffle type: bitshuffle
Running suite: single
--> 8, 2097152, 8, 19, lz4, bitshuffle
********************** Run info ******************************
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):            264.9 us, 7551.1 MB/s
memcpy(read):             174.1 us, 11488.6 MB/s
Compression level: 0
comp(write):      173.1 us, 11551.7 MB/s          Final bytes: 2097168  Ratio: 1.00
decomp(read):     119.3 us, 16765.2 MB/s          OK
Compression level: 1
comp(write):      271.8 us, 7358.1 MB/s   Final bytes: 72624  Ratio: 28.88
decomp(read):     225.7 us, 8862.7 MB/s   OK
Compression level: 2
comp(write):      275.7 us, 7253.7 MB/s   Final bytes: 71376  Ratio: 29.38
decomp(read):     229.2 us, 8724.8 MB/s   OK
Compression level: 3
comp(write):      274.5 us, 7285.9 MB/s   Final bytes: 69200  Ratio: 30.31
decomp(read):     238.8 us, 8374.6 MB/s   OK
Compression level: 4
comp(write):      249.5 us, 8015.5 MB/s   Final bytes: 42480  Ratio: 49.37
decomp(read):     229.8 us, 8701.6 MB/s   OK
Compression level: 5
comp(write):      249.1 us, 8028.1 MB/s   Final bytes: 42928  Ratio: 48.85
decomp(read):     243.9 us, 8198.8 MB/s   OK
Compression level: 6
comp(write):      332.4 us, 6017.5 MB/s   Final bytes: 32000  Ratio: 65.54
decomp(read):     322.2 us, 6206.4 MB/s   OK
Compression level: 7
comp(write):      431.9 us, 4630.2 MB/s   Final bytes: 40474  Ratio: 51.81
decomp(read):     437.6 us, 4570.7 MB/s   OK
Compression level: 8
comp(write):      421.5 us, 4745.0 MB/s   Final bytes: 28050  Ratio: 74.76
decomp(read):     437.2 us, 4574.5 MB/s   OK
Compression level: 9
comp(write):      941.1 us, 2125.2 MB/s   Final bytes: 25188  Ratio: 83.26
decomp(read):     674.7 us, 2964.2 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:       2.8 s, 6047.8 MB/s

Wow, in this case we are seeing compression speed peaks even higher than a memcpy (8 GB/s vs 7.5 GB/s), and decompression speed is pretty good too (8.8 GB/s vs 11.5 GB/s memcpy). With AVX2 support, 'bitshuffle' does show pretty good performance. But this server run used 8 threads (the E3-1240 v3 has 4 physical cores plus hyperthreading) versus 2 on the laptop, so we are not actually comparing apples with apples. So let's re-run the benchmark with just 2 threads:

$ bench/bench lz4 bitshuffle single 2
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
Using compressor: lz4
Using shuffle type: bitshuffle
Running suite: single
--> 2, 2097152, 8, 19, lz4, bitshuffle
********************** Run info ******************************
Blosc version: 1.7.0.dev ($Date:: 2015-05-27 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 2
********************** Running benchmarks *********************
memcpy(write):            253.9 us, 7877.5 MB/s
memcpy(read):             174.1 us, 11488.8 MB/s
Compression level: 0
comp(write):      133.4 us, 14995.6 MB/s          Final bytes: 2097168  Ratio: 1.00
decomp(read):     117.5 us, 17026.6 MB/s          OK
Compression level: 1
comp(write):      604.1 us, 3310.7 MB/s   Final bytes: 72624  Ratio: 28.88
decomp(read):     431.2 us, 4638.3 MB/s   OK
Compression level: 2
comp(write):      624.3 us, 3203.5 MB/s   Final bytes: 71376  Ratio: 29.38
decomp(read):     452.3 us, 4421.5 MB/s   OK
Compression level: 3
comp(write):      623.7 us, 3206.8 MB/s   Final bytes: 69200  Ratio: 30.31
decomp(read):     442.3 us, 4521.9 MB/s   OK
Compression level: 4
comp(write):      585.2 us, 3417.6 MB/s   Final bytes: 42480  Ratio: 49.37
decomp(read):     395.3 us, 5058.9 MB/s   OK
Compression level: 5
comp(write):      530.0 us, 3773.4 MB/s   Final bytes: 42928  Ratio: 48.85
decomp(read):     400.5 us, 4994.0 MB/s   OK
Compression level: 6
comp(write):      542.6 us, 3686.0 MB/s   Final bytes: 32000  Ratio: 65.54
decomp(read):     426.7 us, 4687.2 MB/s   OK
Compression level: 7
comp(write):      605.6 us, 3302.4 MB/s   Final bytes: 40474  Ratio: 51.81
decomp(read):     494.5 us, 4044.5 MB/s   OK
Compression level: 8
comp(write):      588.1 us, 3400.7 MB/s   Final bytes: 28050  Ratio: 74.76
decomp(read):     487.3 us, 4104.6 MB/s   OK
Compression level: 9
comp(write):      692.5 us, 2888.2 MB/s   Final bytes: 25188  Ratio: 83.26
decomp(read):     591.4 us, 3381.8 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:       3.9 s, 4294.1 MB/s

Now, with 2 threads we are getting times that are about 2x slower than with 8 threads. But the interesting thing here is that the compression speed is still ~2x faster than on my laptop (peaks at 3.7 GB/s vs 1.8 GB/s), and the same goes for decompression (peaks at 5 GB/s vs 2.4 GB/s). Agreed, the server runs at 3.4 GHz vs my laptop's 2.9 GHz, but this alone cannot explain the difference in speed, so the main driver of the speedup must be the AVX2 support in 'bitshuffle'.

In summary, the new 'bitshuffle' filter is very good news for users of the Blosc ecosystem, because it adds yet another powerful resource in the fight to store datasets in less space while still keeping good performance. Of course, this is just a quick experiment with synthetic data, but I am pretty sure that the new 'bitshuffle' filter will find a good niche in real-world datasets. Anyone interested in contributing some real-data benchmarks?

I'd like to thank Kiyo Masui for his help in this 'bitshuffle' backport.