<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blosc Home Page  (Posts about NEON)</title><link>https://blosc.org/</link><description></description><atom:link href="https://blosc.org/categories/neon.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:blosc@blosc.org"&gt;The Blosc Developers&lt;/a&gt; </copyright><lastBuildDate>Thu, 04 Jun 2026 15:46:19 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>ARM is becoming a first-class citizen for Blosc</title><link>https://blosc.org/posts/arm-is-becoming-a-first-class-citizen-for-blosc/</link><dc:creator>Francesc Alted</dc:creator><description>&lt;p&gt;We are happy to announce that Blosc is receiving official support for
ARM processors.  Blosc has always been meant to support all platforms
where a C89 compliant C compiler can be found, but until now the only
hardware platforms that we were testing on a regular basis has been
Intel (on top of Unix/Linux, Mac OSX and Windows).&lt;/p&gt;
&lt;p&gt;We want this to change and the ARM architecture has been our first
candidate to become a fully supported platform besides Intel/AMD.  You
may be wondering that we could have chosen any other architecture like
MIPS or PowerPC, so why ARM?&lt;/p&gt;
&lt;section id="arm-is-eating-the-world"&gt;
&lt;h2&gt;ARM is eating the world&lt;/h2&gt;
&lt;p&gt;ARM is an increasingly popular architecture and we can find
implementation exemplars of it not only in the phones, tablets or
ChromeBooks, but also acting as embedded processors, as well as in
providing computing power to immensely popular Raspberry Pi's and
Arduinos and even environments so &lt;em&gt;apparently&lt;/em&gt; alien to it like &lt;a class="reference external" href="http://www.theplatform.net/2015/06/16/mont-blanc-sets-the-stage-for-arm-hpc/"&gt;High
Performance Computing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Contrarily to what has been traditional for other computer platforms,
one of the most important design features for ARM is to keep energy
consumption under very strict limits.  Nowadays, the ARM architecture
can run decently powerful CPUs where each core &lt;a class="reference external" href="http://www.androidauthority.com/arms-secret-recipe-for-power-efficient-processing-409850"&gt;consumes just 600 to
750 mWatt or less&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In my opinion, it is precisely this energy efficiency what makes of
ARM one of the platforms with more projection to gain ground as a
general computer platform in the short future.  By now, we all know
that ARM allows packing more cores into a single die (e.g. your phone
having more cores than your laptop, anyone?).  And more cores also
means more combined computing throughput (albeit a bit more difficult
to program), but more importantly, &lt;strong&gt;more cores being able to bring
data from memory at the same time&lt;/strong&gt;.  Contrarily to what one might
think, having different threads transmitting data from RAM to the CPU
caches provides a better utilization of memory buses, and hence, a
much better global memory bandwidth.  This can be seen, for example,
in &lt;a class="reference external" href="http://blosc.org/benchmarks-blosclz.html"&gt;typical Blosc benchmarks&lt;/a&gt; by looking at how the
bandwidth grows with the number of threads in all the dots, but
specially where compression ratio equals 1 (i.e. no compression is
active, so Blosc is only doing a &lt;em&gt;memory copy&lt;/em&gt; in this case).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="blosc-is-getting-ready-for-arm"&gt;
&lt;h2&gt;Blosc is getting ready for ARM&lt;/h2&gt;
&lt;p&gt;So ARM is cool indeed, but what we are doing for making it a
first-class citizen?  For starters, we have created a new &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2"&gt;C-Blosc2&lt;/a&gt; repository that is going to act
as a playground for some time and where we are going to experiment
with a new range of features (those will be discussed in a later
post).  And this is exactly the place where we have already started
implementing a NEON version of the shuffle filter.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABIIFHA.html"&gt;NEON&lt;/a&gt;
is an SIMD extension in the same spirit than SSE2 or AVX2 present in
Intel/AMD offerings.  NEON extension was introduced in ARMv7
architecture, and is present in most of the current high-end devices
(including most of the phones and tablets floating around, including
the new Raspberry Pi 2).  As many of you know, leveraging SIMD in
modern CPUs is key for allowing Blosc to be one of the fastest
compressors around, and if we wanted to be serious about ARM, NEON
support had to be here, &lt;strong&gt;period&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The new NEON implementation of shuffle for Blosc has been entirely
made by &lt;a class="reference external" href="https://github.com/LucianMarc"&gt;Lucian Marc&lt;/a&gt;, a summer
student that joined the project at the beginning of July 2015.  Lucian
did a terrific work on implementing the &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2/blob/master/blosc/shuffle-neon.c"&gt;shuffle filter NEON&lt;/a&gt;,
and during the 2-months stage he did not only that, but he also had
time to do a preliminary version of the bitshuffle filter as well (not
completely functional yet, but as time allows, he plans to finish that).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="some-hints-on-the-measured-increase-in-performance"&gt;
&lt;h2&gt;Some hints on the measured increase in performance&lt;/h2&gt;
&lt;p&gt;So you might be asking, how fast can perform Blosc on an ARM with
NEON?  Well, let's start first by showing how fast it works on a
Raspberry Pi 2 (Broadcom BCM2836 ARMv7 Quad Core Processor) having
NEON and running Raspbian (gcc 4.7.2).  To not bore people, we are
going to show just decompression speeds:&lt;/p&gt;
&lt;img alt="/images/blosclz-shuffle-neon-rpi2.png" src="https://blosc.org/images/blosclz-shuffle-neon-rpi2.png"&gt;
&lt;p&gt;It turns out that, when using the 4 cores and low compression levels,
Blosc with NEON support already shows evidence that it can equal the
performance of memcpy() on ARM.  This is an important fact because I
did not think that ARM performance was enough to allow Blosc doing
that already.  I was wrong.&lt;/p&gt;
&lt;p&gt;Okay, so Blosc using NEON can be fast, but exactly how much when
compared to a &lt;a class="reference external" href="https://github.com/Blosc/c-blosc/blob/master/blosc/shuffle-generic.h"&gt;shuffle implementation in pure C&lt;/a&gt;?
Here you have the figures for the generic C shuffle:&lt;/p&gt;
&lt;img alt="/images/blosclz-shuffle-generic-rpi2.png" src="https://blosc.org/images/blosclz-shuffle-generic-rpi2.png"&gt;
&lt;p&gt;That means that NEON can accelerate the whole decompression process
between 2x and 3x, which is pretty significant, and also speaks highly
about the quality of Lucian's NEON implementation.&lt;/p&gt;
&lt;p&gt;Does that mean that we can extrapolate these figures for all ARM
processors out there?  Not quite.  In fact, the performance of a
Raspberry Pi 2 is quite mild compared with other boards.  So, let's
see what is the performance on a &lt;a class="reference external" href="http://www.hardkernel.com/main/products/prdt_info.php?g_code=G140448267127"&gt;ODROID-XU3&lt;/a&gt;
(although it has been replaced by &lt;a class="reference external" href="http://www.hardkernel.com/main/products/prdt_info.php"&gt;ODROID-XU4&lt;/a&gt;, the XU3 has
the same processor, so we are testing a pretty powerful CPU model
here).  This board comes with a Samsung Exynos5422 Cortex-A15 2.0 GHz
quad core and Cortex™-A7 quad core CPUs, so it is a representative of
the ARM Heterogeneous Multi-Processing solution (aka big.LITTLE).
Here are its figures:&lt;/p&gt;
&lt;img alt="/images/blosclz-shuffle-neon-odroid.png" src="https://blosc.org/images/blosclz-shuffle-neon-odroid.png"&gt;
&lt;p&gt;So, the first thing to note is the memcpy() speed that at 1.6 GB/s,
is considerably faster than the RPi2 (&amp;lt; 0.9 GB/s).  Yeah, this is a
much more capable board from a computational point of view.  The
second thing is that decompression speed &lt;em&gt;almost doubles the memcpy()
speed&lt;/em&gt;.  Again, I was very impressed because I did not expect this
range of speeds &lt;em&gt;at all&lt;/em&gt;.  ARM definitely is getting in a situation
where compression can be used for an advantage, computationally
speaking.&lt;/p&gt;
&lt;p&gt;The third thing to note is a bit disappointing though: why only 3
threads appear in the plot?  Well, it turns out that the benchmark
suite fails miserably when using 4 threads or more.  As the Raspberry
setup does not suffer from this problem at all, I presume that this is
more related with the board or the libraries that come with the
operating system (Ubuntu 14.04).  This is rather unfortunate because I
was really curious to see such an ARMv7 8-core beast running at full
steam using the 8 threads.  At any rate, time will tell if the problem
is in the board or in Blosc itself.&lt;/p&gt;
&lt;p&gt;Just to make the benchmarks a bit more complete, let me finish this
benchmark section showing the performance using the generic C code for
the shuffling algorithm:&lt;/p&gt;
&lt;img alt="/images/blosclz-shuffle-generic-odroid.png" src="https://blosc.org/images/blosclz-shuffle-generic-odroid.png"&gt;
&lt;p&gt;If we compare with NEON figures for the ODROID board, we can see again
an increase in speed of between 2x and 4x, which is crazy amazing
(sorry if I seem a bit over-enthusiastic, but again, I was not really
prepared for seeing this).  Again, only figures for 2 threads are
in this plot because the benchmark crashes for 3 threads (this is
another hint that points to the fault being outside Blosc itself
and not in its NEON implementation of the shuffle filter).&lt;/p&gt;
&lt;p&gt;At decompression speeds of 3 GB/s and ~ 2 Watt of energy consumption,
the ARM platform has one of the best bandwidth/Watt ratios that you can find
in the market, and this can have (and will have) profound implications
on how computations will be made in the short future (as the &lt;a class="reference external" href="http://www.montblanc-project.eu/publications/energy-efficiency-high-performance-computing-mont-blanc-project"&gt;Mont
Blanc initiative is trying to demonstrate&lt;/a&gt;).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-to-expect-from-arm-blosc-in-the-forthcoming-months"&gt;
&lt;h2&gt;What to expect from ARM/Blosc in the forthcoming months&lt;/h2&gt;
&lt;p&gt;This work on supporting ARM platforms is just the beginning.  As ARM
processors get more spread, and most specially, &lt;a class="reference external" href="http://www.arm.com/products/processors/cortex-a/cortex-a72-processor.php"&gt;faster&lt;/a&gt;,
we will need to refine the support for ARM in Blosc.&lt;/p&gt;
&lt;p&gt;NEON support is only a part of the game, and things like efficient
handling of ARM heterogeneous architectures (&lt;a class="reference external" href="https://en.wikipedia.org/wiki/ARM_big.LITTLE"&gt;big.LITTLE&lt;/a&gt;) or making specific
tweaks for ARM cache sizes will be critical so as to make of ARM a
truly first-citizen for the Blosc ecosystem.&lt;/p&gt;
&lt;p&gt;If you have ideas on what can be improved, and most specially &lt;strong&gt;how&lt;/strong&gt;,
we want to learn from you :) If you want to contribute code to the
project, your pull requests are very welcome too!  If you like what we
are doing and want to see more of this, you can also &lt;a class="reference external" href="http://blosc.org/blog/seeking-sponsoship.html"&gt;sponsor us&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;</description><category>ARM</category><category>Blosc2</category><category>NEON</category><guid>https://blosc.org/posts/arm-is-becoming-a-first-class-citizen-for-blosc/</guid><pubDate>Wed, 09 Sep 2015 11:32:20 GMT</pubDate></item></channel></rss>