<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blosc Home Page  (Posts about memory wall)</title><link>https://blosc.org/</link><description></description><atom:link href="https://blosc.org/categories/memory-wall.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:blosc@blosc.org"&gt;The Blosc Developers&lt;/a&gt; </copyright><lastBuildDate>Wed, 10 Jun 2026 17:44:33 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>The Surprising Speed of Compressed Data: A Roofline Story</title><link>https://blosc.org/posts/roofline-analysis-blosc2/</link><dc:creator>Francesc Alted</dc:creator><description>&lt;p&gt;Can a library designed for computing with compressed data ever hope to outperform highly optimized numerical engines like NumPy and Numexpr? The answer is complex, and it hinges on the "memory wall" — a phenomenon which occurs when system memory limitations start to drag on CPU. This post uses Roofline analysis to explore this very question, dissecting the performance of Blosc2 and revealing the surprising scenarios where it can gain a competitive edge.&lt;/p&gt;
&lt;aside class="admonition note"&gt;
&lt;p class="admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update on 2026-02-06:&lt;/strong&gt; We have published a follow-up post, &lt;a class="reference external" href="https://ironarray.io/blog/miniexpr-powered-blosc2"&gt;Python-Blosc2 4.0: Unleashing Compute Speed with miniexpr&lt;/a&gt;, which revisits this topic. This new post explains how the integration of miniexpr into Blosc2's compute engine has significantly improved performance—especially for in-memory operations—updating the conclusions drawn in this original analysis. We highly recommend reading the new post for the latest insights.&lt;/p&gt;
&lt;/aside&gt;
&lt;section id="tl-dr"&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Before we dive in, here's what we discovered:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;For in-memory tasks, Blosc2's overhead can make it slower than Numexpr, especially on x86 CPUs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This changes on Apple Silicon, where Blosc2's performance is much more competitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For on-disk tasks, Blosc2 consistently outperforms NumPy/Numexpr on both platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The "memory wall" is real, and disk I/O is an even bigger one, which is where compression shines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="a-trip-down-memory-lane"&gt;
&lt;h2&gt;A Trip Down Memory Lane&lt;/h2&gt;
&lt;p&gt;Let's rewind to 2008. NumPy 1.0 was just a toddler, and the computing world was buzzing with the arrival of multi-core CPUs and their shiny new SIMD instructions. On the &lt;a class="reference external" href="https://mail.python.org/archives/list/numpy-discussion@python.org/thread/YPX5PGM5WZXQAMQ5AZLLEU67D5RZBOVH/#YFX3G2RYHTIYMFDPCHKHED5F7CT4OTVK"&gt;NumPy mailing list&lt;/a&gt;, a group of us were brainstorming how to harness this new power to make Python's number-crunching faster.&lt;/p&gt;
&lt;p&gt;The idea seemed simple: trust newer compilers to use SIMD (and, possibly, data alignment) to perform operations on multiple data points at once. To test this, a &lt;a class="reference external" href="https://mail.python.org/archives/list/numpy-discussion@python.org/message/S2IEJV7U7TXHQLEMORGME6KIGRZTG33L/"&gt;simple benchmark&lt;/a&gt; was shared: multiply two large vectors element-wise. Developers from around the community ran the code and shared their results. What came back was a revelation.&lt;/p&gt;
&lt;p&gt;For small arrays that fit snugly into the CPU's high-speed cache, SIMD was quite good at accelerating computations. But as soon as the arrays grew larger, the performance boost vanished. Some of us were already suspicious about the new "memory wall" that had been growing lately, seemingly due to the widening gap between CPU speeds and memory bandwidth.  However, a conclusive answer (and solution) was still lacking.&lt;/p&gt;
&lt;p&gt;But amidst the confusion, a curious anomaly emerged. One machine, belonging to NumPy legend Charles Harris, was consistently outperforming the rest—even those with faster processors. It made no sense. We checked our code, our compilers, everything. Yet, his machine remained inexplicably faster. The answer, when it finally came, wasn't in the software at all. Charles, a hardware wizard, had &lt;a class="reference external" href="https://mail.python.org/archives/list/numpy-discussion@python.org/message/YFX3G2RYHTIYMFDPCHKHED5F7CT4OTVK/"&gt;tinkered with his BIOS to overclock his RAM&lt;/a&gt; from 667 MHz to a whopping 800 MHz.&lt;/p&gt;
&lt;p&gt;That was my lightbulb moment: for data-intensive tasks, raw CPU clock speed was not the limiting factor; memory bandwidth was what truly mattered.&lt;/p&gt;
&lt;p&gt;This led me to a wild idea: what if we could make memory &lt;em&gt;effectively&lt;/em&gt; faster? What if we could compress data in memory and decompress it on-the-fly, just in time for the CPU? This would &lt;a class="reference external" href="https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf"&gt;slash the amount of data being moved&lt;/a&gt;, boosting our effective memory bandwidth. That idea became the seed for &lt;a class="reference external" href="https://www.blosc.org"&gt;Blosc&lt;/a&gt;, a project I started in 2010 that has been &lt;a class="reference external" href="https://github.com/Blosc/python-blosc2"&gt;my passion ever since&lt;/a&gt;. Now, 15 years later, it is time to revisit that idea and see how well it holds up in today's computing landscape.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="roofline-model-understanding-the-memory-wall"&gt;
&lt;h2&gt;Roofline Model: Understanding the Memory Wall&lt;/h2&gt;
&lt;p&gt;Not all computations are equally affected by the memory wall: in general, performance can be either CPU-bound or memory-bound. To diagnose which resource is the limiting factor, the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Roofline_model"&gt;Roofline model&lt;/a&gt; provides an insightful analytical framework. This model &lt;a class="reference external" href="https://docs.nersc.gov/tools/performance/roofline/"&gt;plots computational performance against arithmetic intensity&lt;/a&gt; (i.e. floating-point operations per second versus floating-point operations per byte of memory traffic) to visually determine whether a task is constrained by CPU speed or memory bandwidth.&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-intro.avif" src="https://blosc.org/images/roofline-surprising-story/roofline-intro.avif"&gt;
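&lt;p&gt;The bound that a Roofline plot draws can be written down in a couple of lines. The following is a minimal sketch, not taken from our benchmark scripts: the function names are ours, and the peak compute and bandwidth figures describe a hypothetical machine chosen only to make the arithmetic easy to follow.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine, chosen only for illustration.
PEAK = 400.0       # GFLOP/s
BANDWIDTH = 80.0   # GB/s

print("ridge point:", ridge_point(PEAK, BANDWIDTH), "FLOP/byte")
for ai in (0.0625, 1.0, 5.0, 50.0):
    bound = attainable_gflops(ai, PEAK, BANDWIDTH)
    regime = "CPU-bound" if bound == PEAK else "memory-bound"
    print(f"AI={ai:g} FLOP/byte: {bound:g} GFLOP/s ({regime})")
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Kernels whose intensity falls below the ridge point (5 FLOP/byte for these made-up numbers) sit on the sloped, memory-bound part of the roof; everything at or above it is capped by the CPU.&lt;/p&gt;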
&lt;p&gt;We will use Roofline plots to analyze Blosc2's performance, compared to that of NumPy and Numexpr. NumPy, with its highly optimized linear algebra backends, and Numexpr, with its efficient evaluation of element-wise expressions, together form a strong performance baseline for the full range of arithmetic intensities tested.&lt;/p&gt;
&lt;p&gt;To highlight the role of memory bandwidth, we will conduct our benchmarks on an AMD Ryzen 7800X3D CPU at two different memory speeds: the standard 4800 MT/s and an overclocked 6000 MT/s. This allows us to directly observe how memory speed impacts computational performance.&lt;/p&gt;
&lt;p&gt;To cover a range of computational scenarios, our benchmarks include five operations with varying arithmetic intensities:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Very Low&lt;/strong&gt;: A simple element-wise addition (a + b + c).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low&lt;/strong&gt;: A moderately complex element-wise expression (sqrt(a + 2 * b + (c / 2)) ^ 1.2).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medium&lt;/strong&gt;: A highly complex element-wise calculation involving trigonometric and exponential functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High&lt;/strong&gt;: Matrix multiplication on small matrices (labeled matmul0).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Very High&lt;/strong&gt;: Matrix multiplication on large matrices (labeled matmul1 and matmul2).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-mem-speed-AMD-7800X3D.png" src="https://blosc.org/images/roofline-surprising-story/roofline-mem-speed-AMD-7800X3D.png"&gt;
&lt;p&gt;The Roofline plot confirms that increasing memory speed only benefits memory-bound operations (low arithmetic intensity), while CPU-bound tasks (high arithmetic intensity) are unaffected, as expected. Although this might suggest the "memory wall" is not a major obstacle, low-intensity operations like element-wise calculations, reductions, and selections are extremely common and often create performance bottlenecks. Therefore, optimizing for memory performance remains crucial.&lt;/p&gt;
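&lt;p&gt;A back-of-the-envelope estimate shows why only the left side of the plot moves. The sketch below assumes a 128-bit (dual-channel) memory bus, float64 operands, no cache reuse and theoretical peak bandwidth; these are our simplifying assumptions, not measured values, so real figures will be lower.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_bytes=16):
    """Theoretical peak bandwidth in GB/s (16 bytes = assumed 128-bit bus)."""
    return mega_transfers_per_s * bus_bytes / 1000


def intensity_add3(itemsize=8):
    """FLOP/byte for a + b + c: 2 additions per element, 3 reads plus 1 write."""
    return 2 / (4 * itemsize)


ai = intensity_add3()
for speed in (4800, 6000):
    bw = peak_bandwidth_gbs(speed)
    # For a memory-bound kernel the ceiling is intensity times bandwidth.
    print(f"{speed} MT/s: {bw:.1f} GB/s, a + b + c ceiling {ai * bw:.1f} GFLOP/s")
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under these assumptions the ceiling for a + b + c rises from 4.8 to 6.0 GFLOP/s, the same 25% by which the memory speed increased, while a kernel already limited by the CPU would not move at all.&lt;/p&gt;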
&lt;/section&gt;
&lt;section id="the-in-memory-surprise-why-wasn-t-compression-faster"&gt;
&lt;h2&gt;The In-Memory Surprise: Why Wasn't Compression Faster?&lt;/h2&gt;
&lt;p&gt;We benchmarked Blosc2 (both compressed and uncompressed) against NumPy and Numexpr. For this test, Blosc2 was configured with the LZ4 codec and shuffle filter, a setup known for its balance of speed and compression ratio.  The benchmarks were executed on an AMD Ryzen 7800X3D CPU with memory speed set to 6000 MT/s, the faster of the two memory configurations tested above.&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-7800X3D-mem-def.png" src="https://blosc.org/images/roofline-surprising-story/roofline-7800X3D-mem-def.png"&gt;
&lt;p&gt;The analysis reveals a surprising outcome: for memory-bound operations, Blosc2 is up to five times slower than Numexpr. Although operating on compressed data provides a marginal improvement over uncompressed Blosc2, it is not enough to overcome this performance gap. This result is unexpected because Blosc2 leverages Numexpr internally, and the reduced memory traffic that comes with compression should theoretically lead to better performance in these scenarios.&lt;/p&gt;
&lt;p&gt;To understand this counter-intuitive result, we must examine Blosc2's core architecture. The key lies in its double partitioning scheme, which, while powerful, introduces an overhead that can negate the benefits of compression in memory-bound contexts.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="unpacking-the-overhead-a-look-inside-blosc2-s-architecture"&gt;
&lt;h2&gt;Unpacking the Overhead: A Look Inside Blosc2's Architecture&lt;/h2&gt;
&lt;p&gt;The performance characteristics of Blosc2 are rooted in its double partitioning architecture, which organizes data into chunks and blocks.&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/double-partition-b2nd.avif" src="https://blosc.org/images/roofline-surprising-story/double-partition-b2nd.avif"&gt;
&lt;p&gt;This design is crucial for both aligning with the CPU's memory hierarchy and enabling efficient multidimensional array representation (important for operations such as n-dimensional slicing). However, this structure introduces an inherent overhead from additional indexing logic. In memory-bound scenarios, this overhead counteracts the performance gains from reduced memory traffic, explaining why Blosc2 does not surpass Numexpr.&lt;/p&gt;
&lt;p&gt;Conversely, as arithmetic intensity increases, the computational demands begin to dominate the total execution time. In these CPU-bound regimes, the partitioning overhead is effectively amortized, allowing Blosc2 to close the performance gap and eventually match NumPy's performance in tasks like large matrix multiplications.&lt;/p&gt;
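&lt;p&gt;To give a feel for the extra indexing that two levels of partitioning imply, here is a deliberately simplified, one-dimensional sketch. The function name and the chunk and block sizes are hypothetical; the real n-dimensional logic inside Blosc2 is considerably more involved.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
def locate(index, chunk_items, block_items):
    """Map a flat element index to (chunk, block within chunk, offset within block)."""
    chunk, in_chunk = divmod(index, chunk_items)
    block, offset = divmod(in_chunk, block_items)
    return chunk, block, offset


# Hypothetical sizes: 1_000_000 items per chunk, 10_000 items per block.
print(locate(2_345_678, 1_000_000, 10_000))  # prints (2, 34, 5678)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Every access pays for this bookkeeping on top of the actual computation, which is cheap when there is a lot of arithmetic per element and comparatively expensive when there is very little.&lt;/p&gt;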
&lt;/section&gt;
&lt;section id="modern-arm-architectures"&gt;
&lt;h2&gt;Modern ARM Architectures&lt;/h2&gt;
&lt;p&gt;CPU architecture is a rapidly evolving field. To investigate how these changes impact performance, we extended our analysis to the Apple Silicon M4 Pro, a modern ARM-based processor.&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-m4pro-mem-def.png" src="https://blosc.org/images/roofline-surprising-story/roofline-m4pro-mem-def.png"&gt;
&lt;p&gt;The results show that Blosc2 performs significantly better on this platform, narrowing the performance gap with NumPy/NumExpr, especially for operations on compressed data. While compute engines optimized for uncompressed data still hold an edge, these findings suggest that compression will play an increasingly important role in improving computational performance in the future.&lt;/p&gt;
&lt;p&gt;However, while the in-memory results are revealing, they don't tell the whole story. Blosc2 was designed not just to fight the memory wall, but to conquer an even greater bottleneck: disk I/O. Although in-memory compression has the benefit of fitting more data into RAM (which is extremely interesting in its own right now that &lt;a class="reference external" href="https://arstechnica.com/gadgets/2025/11/spiking-memory-prices-mean-that-it-is-once-again-a-horrible-time-to-build-a-pc/"&gt;RAM prices have skyrocketed&lt;/a&gt;), its true power is unleashed when computations move off-motherboard. Now, let's shift the battlefield to the disk and see how Blosc2 performs in its native territory.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="a-different-battlefield-blosc2-shines-with-on-disk-data"&gt;
&lt;h2&gt;A Different Battlefield: Blosc2 Shines with On-Disk Data&lt;/h2&gt;
&lt;p&gt;Blosc2's architecture extends its computational engine to operate seamlessly on data stored on disk, a significant advantage for large-scale analysis.  This is particularly relevant in scenarios where datasets exceed available memory, necessitating out-of-core processing, as commonly encountered in data science, machine learning workflows or &lt;a class="reference external" href="https://ironarray.io/cat2cloud"&gt;cloud computing environments&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Our on-disk benchmarks were designed to use datasets larger than the system's available memory to prevent filesystem caching from influencing the results. To establish a baseline, we implemented an out-of-core solution for NumPy/Numexpr, leveraging memory-mapped files. Here Blosc2 has a performance edge, particularly for memory-bound operations on compressed data, as it can move data to and from disk faster than the memory-mapped NumPy arrays.&lt;/p&gt;
&lt;p&gt;In this case, we've used high-performance NVMe SSDs (NVMe 4.0) to minimize the impact of disk speed on the results.  We also switched to the ZSTD codec for Blosc2, as its superior compression ratio over LZ4 further minimizes data transfer to and from the disk.&lt;/p&gt;
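&lt;p&gt;For readers unfamiliar with the memory-mapped approach used for the baseline, the idea can be sketched with the standard library alone. This is not our benchmark code (which relies on NumPy memory maps); the function name, the file name and the step size are illustrative assumptions, and the demo file is tiny so that the example runs anywhere.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
import mmap
import os
import tempfile
from array import array


def mmap_sum(path, items_per_step=65_536):
    """Sum a file of float64 values through a read-only memory map, slice by slice."""
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm).cast("d")
            try:
                for start in range(0, len(view), items_per_step):
                    # Only the pages behind this slice need to be resident.
                    total += sum(view[start:start + items_per_step])
            finally:
                view.release()  # must happen before the map is closed
    return total


# Small demo file; a real out-of-core dataset would not fit in RAM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.f64")
    with open(path, "wb") as f:
        array("d", range(200_000)).tofile(f)
    result = mmap_sum(path)

print(result)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With this scheme the operating system pages uncompressed data in and out on demand, so every byte of the dataset has to cross the disk interface at full size.&lt;/p&gt;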
&lt;p&gt;First, let's see the results for the AMD Ryzen 7800X3D system:&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-7800X3D-disk-def.png" src="https://blosc.org/images/roofline-surprising-story/roofline-7800X3D-disk-def.png"&gt;
&lt;p&gt;The plots above show that Blosc2 outperforms both NumPy and Numexpr for all low-to-medium intensity operations. This is because the high latency of disk I/O amortizes the overhead of Blosc2's double partitioning scheme. Furthermore, the reduced bandwidth required for compressed data gives Blosc2 an additional performance advantage in this scenario.&lt;/p&gt;
&lt;p&gt;Now, let's see the results for the Apple Silicon M4 Pro system:&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-m4pro-disk-def.png" src="https://blosc.org/images/roofline-surprising-story/roofline-m4pro-disk-def.png"&gt;
&lt;p&gt;On the Apple Silicon M4 Pro system, Blosc2 again outperforms both NumPy and Numexpr for all on-disk operations, mirroring the results from the AMD system. However, the performance advantage is even more significant here, especially for memory-bound tasks. This is mainly because memory-mapped arrays are less efficient on Apple Silicon than on x86_64 systems, increasing the overhead for the NumPy/Numexpr baseline.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="roofline-plot-in-memory-vs-on-disk"&gt;
&lt;h2&gt;Roofline Plot: In-Memory vs On-Disk&lt;/h2&gt;
&lt;p&gt;To better understand the trade-offs between in-memory and on-disk processing with Blosc2, the following plot contrasts their performance characteristics for compressed data:&lt;/p&gt;
&lt;img alt="/images/roofline-surprising-story/roofline-mem-disk-def.png" src="https://blosc.org/images/roofline-surprising-story/roofline-mem-disk-def.png"&gt;
&lt;p&gt;A notable finding for the AMD system is that Blosc2's on-disk operations are noticeably faster than its in-memory operations, especially for memory-bound tasks (low arithmetic intensity). This is likely due to two factors: first, the larger datasets used for on-disk tests allow Blosc2 to use more efficient internal partitions (chunks and blocks), and second, parallel data reads from disk further reduce bandwidth requirements.&lt;/p&gt;
&lt;p&gt;In contrast, for CPU-bound tasks (high arithmetic intensity), on-disk performance is comparable to, albeit slightly slower than, in-memory performance. The analysis also reveals a specific weakness: small matrix multiplications (matmul0) are significantly slower on-disk, identifying a clear target for future optimization.&lt;/p&gt;
&lt;p&gt;In contrast to the AMD system, the Apple Silicon M4 Pro shows that Blosc2's on-disk operations are slower than in-memory, a difference that is most significant for memory-bound tasks. This performance disparity suggests that current on-disk optimizations may favor x86_64 architectures over ARM.&lt;/p&gt;
&lt;p&gt;As with the AMD platform, CPU-bound operations exhibit similar performance for both on-disk and in-memory contexts. The notable exception remains the small matrix multiplication (matmul0), which performs significantly worse on-disk. This recurring pattern pinpoints a clear opportunity for future optimization efforts.&lt;/p&gt;
&lt;p&gt;Finally, and in addition to its on-disk performance, Blosc2 offers a significant cost advantage. With the &lt;a class="reference external" href="https://arstechnica.com/gadgets/2025/11/spiking-memory-prices-mean-that-it-is-once-again-a-horrible-time-to-build-a-pc/"&gt;recent rise in SSD prices&lt;/a&gt;, compressing data on disk becomes an economically attractive strategy, allowing you to store more data in less space and thereby reduce hardware expenses.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="reproducibility"&gt;
&lt;h2&gt;Reproducibility&lt;/h2&gt;
&lt;p&gt;All the &lt;a class="reference external" href="https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/roofline-analysis.py"&gt;benchmarks&lt;/a&gt; and &lt;a class="reference external" href="https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/roofline-plot.py"&gt;plots&lt;/a&gt; presented in this blog post can be reproduced. You are invited to run the scripts on your own hardware to explore the performance characteristics of Blosc2 in different environments. In case you get interesting results, please consider sharing them with the community!&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusions"&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;In this blog post, we explored the Roofline model to analyze the performance of Blosc2, NumPy, and Numexpr. We've confirmed that memory-bound operations are significantly affected by the "memory wall", making data compression of interest when maximizing performance. However, for in-memory operations, the overhead of Blosc2's double partitioning scheme can be a limiting factor, especially on x86_64 architectures. Encouragingly, this performance gap narrows considerably on modern ARM platforms like Apple Silicon, suggesting a promising future.&lt;/p&gt;
&lt;p&gt;The situation changes dramatically for on-disk operations. Here, Blosc2 consistently outperforms NumPy and Numexpr, as the high latency of disk I/O (even if we used SSDs here) amortizes its internal overhead. This makes Blosc2 a compelling choice for out-of-core computations, one of its primary use cases.&lt;/p&gt;
&lt;p&gt;Overall, this analysis has provided valuable insights, highlighting the importance of the memory hierarchy. It has also exposed specific areas for improvement, such as the performance of small matrix multiplications. As Blosc2 continues to evolve, I am confident we can address these points and further enhance its performance, making it an even more powerful tool for numerical computations in Python.&lt;/p&gt;
&lt;hr class="docutils"&gt;
&lt;p&gt;Read more about &lt;a class="reference external" href="https://ironarray.io"&gt;ironArray SLU&lt;/a&gt; — the company behind Blosc2, Caterva2, Numexpr and other high-performance data processing libraries.&lt;/p&gt;
&lt;p&gt;Compress Better, Compute Bigger!&lt;/p&gt;
&lt;/section&gt;</description><category>Blosc2</category><category>memory wall</category><category>numexpr</category><category>numpy</category><category>performance</category><category>roofline</category><guid>https://blosc.org/posts/roofline-analysis-blosc2/</guid><pubDate>Thu, 27 Nov 2025 08:05:21 GMT</pubDate></item><item><title>Blosc2-Meets-Rome</title><link>https://blosc.org/posts/blosc2-meets-rome/</link><dc:creator>Francesc Alted</dc:creator><description>&lt;p&gt;On August 7, 2019, AMD released a new generation of its series of EPYC processors, the EPYC 7002, also known as Rome, which is based on the new &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Zen_2"&gt;Zen 2&lt;/a&gt; micro-architecture.  Zen 2 is a significant departure from the physical design paradigm of AMD's previous Zen architectures, mainly in that the I/O components of the CPU are laid out on a separate die, apart from the compute dies; this is quite different from Naples (aka EPYC 7001), its predecessor in the EPYC series:&lt;/p&gt;
&lt;img alt="/images/blosc2-meets-rome/amd-rome-arch-multi-die.png" class="align-center" src="https://blosc.org/images/blosc2-meets-rome/amd-rome-arch-multi-die.png" style="width: 33%;"&gt;
&lt;p&gt;Such a separation of dies for I/O and computing has quite &lt;a class="reference external" href="https://www.anandtech.com/show/15044/the-amd-ryzen-threadripper-3960x-and-3970x-review-24-and-32-cores-on-7nm/3"&gt;large consequences in terms of scalability when accessing memory&lt;/a&gt;, which is critical for Blosc operation, and here we want to check how the combination of Blosc and AMD Rome behaves.  As there is no replacement for experimentation, we are going to use the same benchmark that was introduced in our previous &lt;a class="reference external" href="https://blosc.org/posts/breaking-memory-walls/"&gt;Breaking Down Memory Walls&lt;/a&gt; post.  This essentially boils down to computing an aggregation with a simple loop like:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;&lt;a id="rest_code_249467fae76140e28f9feb1f09341c85-1" name="rest_code_249467fae76140e28f9feb1f09341c85-1" href="https://blosc.org/posts/blosc2-meets-rome/#rest_code_249467fae76140e28f9feb1f09341c85-1"&gt;&lt;/a&gt;&lt;span class="cp"&gt;#pragma omp parallel for reduction (+:sum)&lt;/span&gt;
&lt;a id="rest_code_249467fae76140e28f9feb1f09341c85-2" name="rest_code_249467fae76140e28f9feb1f09341c85-2" href="https://blosc.org/posts/blosc2-meets-rome/#rest_code_249467fae76140e28f9feb1f09341c85-2"&gt;&lt;/a&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_249467fae76140e28f9feb1f09341c85-3" name="rest_code_249467fae76140e28f9feb1f09341c85-3" href="https://blosc.org/posts/blosc2-meets-rome/#rest_code_249467fae76140e28f9feb1f09341c85-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;udata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;a id="rest_code_249467fae76140e28f9feb1f09341c85-4" name="rest_code_249467fae76140e28f9feb1f09341c85-4" href="https://blosc.org/posts/blosc2-meets-rome/#rest_code_249467fae76140e28f9feb1f09341c85-4"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As described in the original blog post, the different &lt;cite&gt;udata&lt;/cite&gt; arrays are just chunks of the original dataset that are decompressed just in time for performing the partial aggregation operation; the final result is the sum of all the partial aggregations.  We have also seen that the time to execute the aggregation depends quite a lot on the kind of data that is decompressed: carefully chosen synthetic data can be decompressed much more quickly than real data.  But synthetic data is nevertheless interesting, as it allows for an upper-bound analysis of how far performance can go.&lt;/p&gt;
&lt;p&gt;In this blog, we are going to see how the AMD EPYC 7402 (Rome), a 24-core processor, performs on both synthetic and real data.&lt;/p&gt;
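&lt;p&gt;The chunk-by-chunk scheme just described can be mimicked in a few lines of Python. This is only a serial sketch under our own assumptions: zlib stands in for Blosc2, the function names and chunk size are made up, and there is no OpenMP-style parallelism as in the C benchmark.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
import zlib
from array import array


def compress_chunks(values, chunk_items):
    """Split float64 values into fixed-size chunks and compress each one independently."""
    return [
        zlib.compress(array("d", values[i:i + chunk_items]).tobytes())
        for i in range(0, len(values), chunk_items)
    ]


def chunked_sum(chunks):
    """Decompress one chunk at a time and accumulate its partial sum."""
    total = 0.0
    for packed in chunks:
        udata = array("d")
        udata.frombytes(zlib.decompress(packed))  # just-in-time decompression
        total += sum(udata)                       # partial aggregation
    return total


values = [float(i % 1000) for i in range(100_000)]
chunks = compress_chunks(values, chunk_items=10_000)
print(chunked_sum(chunks), sum(values))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only one decompressed chunk is alive at any time, which is what keeps the working set small enough to stay in the CPU caches.&lt;/p&gt;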
&lt;section id="aggregating-the-synthetic-dataset-on-amd-epyc-7402-24-core"&gt;
&lt;h2&gt;Aggregating the Synthetic Dataset on AMD EPYC 7402 24-Core&lt;/h2&gt;
&lt;p&gt;The synthetic data chosen for this benchmark can be compressed/decompressed very easily by applying the shuffle filter before the actual compression codec.  Interestingly, and as a good example of how filters can benefit the compression process, if we did not apply the shuffle filter first, the synthetic data would take much more time to compress/decompress (test it yourself if you don't believe this).&lt;/p&gt;
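&lt;p&gt;If you want to get a feel for what the shuffle filter does, the following sketch implements a plain byte shuffle in pure Python and feeds both versions of a buffer to zlib. Everything here is an illustrative assumption: zlib replaces the Blosc2 codecs, the data is a hypothetical run of consecutive integers rather than the dataset used in the benchmark, and the comparison is about compressed size rather than time.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code python"&gt;
```python
import zlib
from array import array


def shuffle(buf, itemsize):
    """Byte shuffle: gather byte 0 of every item, then byte 1, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def unshuffle(buf, itemsize):
    """Inverse of shuffle: interleave the byte planes back into items."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical synthetic data: 100_000 consecutive 64-bit integers.
raw = array("q", range(100_000)).tobytes()
shuffled = shuffle(raw, 8)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(len(raw), plain_size, shuffled_size)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Grouping bytes of equal significance puts long runs of similar values next to each other, so on data like this the shuffled buffer should compress to a noticeably smaller size than the original one.&lt;/p&gt;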
&lt;p&gt;After some experiments, and as usual for synthetic datasets, the codec inside Blosc2 that has shown the best speed while keeping a decent compression ratio (54.6x), has been BloscLZ with compression level 3.  Here are the results:&lt;/p&gt;
&lt;img alt="/images/blosc2-meets-rome/sum_openmp_synthetic-blosclz-3.png" class="align-center" src="https://blosc.org/images/blosc2-meets-rome/sum_openmp_synthetic-blosclz-3.png" style="width: 50%;"&gt;
&lt;p&gt;As we can see, the uncompressed dataset scales pretty well until 8 threads, where it hits the memory wall for this machine (around 74 GB/s).  For its part, even if data compressed with Blosc2 (in combination with the BloscLZ codec) shows lower performance initially, it scales quite smoothly up to 12 threads, where it reaches a higher performance than its uncompressed counterpart (reaching the 90 GB/s mark).&lt;/p&gt;
&lt;p&gt;After that, the compressed dataset can perform aggregations at speeds that are typically faster than uncompressed ones, reaching important peaks at some magical numbers of threads (up to 210 GB/s at 48 threads).  Why these peaks exist at all is probably related to the architecture of the AMD Rome processor, but given that we are using a 24-core CPU, there is little wonder that numbers like 12, 24 (28 is an exception here) and 48 are reaching the highest figures.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="aggregating-the-precipitation-dataset-on-amd-epyc-7402-24-core"&gt;
&lt;h2&gt;Aggregating the Precipitation Dataset on AMD EPYC 7402 24-Core&lt;/h2&gt;
&lt;p&gt;Now it is time to check the performance of the aggregation with the 100 million values dataset coming from a &lt;a class="reference external" href="http://reanalysis.meteo.uni-bonn.de/"&gt;precipitation dataset from Central Europe&lt;/a&gt;.  Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area.  This time, the best codec inside Blosc2 was determined to be LZ4 with compression level 9:&lt;/p&gt;
&lt;img alt="/images/blosc2-meets-rome/sum_openmp_rainfall-lz4-9-lz4-9-ipp.png" class="align-center" src="https://blosc.org/images/blosc2-meets-rome/sum_openmp_rainfall-lz4-9-lz4-9-ipp.png" style="width: 50%;"&gt;
&lt;p&gt;As expected, the uncompressed aggregation scales pretty much the same as for the synthetic dataset (in the end, the Arithmetic Logic Unit in the CPU is completely agnostic about the kind of data it operates on).  The compressed dataset, for its part, scales more slowly but more steadily, hitting a maximum at 48 threads, where it reaches almost the same speed as the uncompressed dataset, which is quite a feat given the high memory bandwidth of this machine (~74 GB/s).&lt;/p&gt;
&lt;p&gt;Also, as Blosc2 recently gained support for the &lt;a class="reference external" href="https://blosc.org/posts/blosc2-first-beta/"&gt;accelerated LZ4 codec inside Intel IPP&lt;/a&gt;, figures for it have been added to the plot above.  There one can see that Intel's accelerated LZ4 can get up to a 10% boost in speed compared with regular LZ4; this additional 10% actually allows Blosc2/LZ4 to be clearly faster than the uncompressed dataset at 48 threads.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;AMD EPYC Rome represents a significant leap forward in adding a high number of cores to CPUs in a way that scales really well, allowing us to throw more computational resources at the problems at hand.  Here we have shown how nicely a 24-core AMD Rome CPU performs when working with in-memory compressed datasets: first, by allowing competitive speed when using compression with real data and second, by allowing speeds of more than 200 GB/s (with synthetic datasets).&lt;/p&gt;
&lt;p&gt;Finally, the 24-core CPU that we have exercised here is just for whetting your appetite, as CPUs of 32 or even 64 cores are going to become more and more common in the near future.  Or rather, I should say in &lt;em&gt;present times&lt;/em&gt;, as &lt;a class="reference external" href="https://www.anandtech.com/show/15044/the-amd-ryzen-threadripper-3960x-and-3970x-review-24-and-32-cores-on-7nm"&gt;AMD announced today the availability of 32-core CPUs for the workstation market&lt;/a&gt;, with 64-core ones coming next year.  Definitely, compression is going to play an increasingly important role in getting the most out of these beasts.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix-software-used"&gt;
&lt;h2&gt;Appendix: Software used&lt;/h2&gt;
&lt;p&gt;For reference, here is the software that has been used for this blog entry:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 19.10&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compiler&lt;/strong&gt;: Clang 8.0.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C-Blosc2&lt;/strong&gt;: 2.0.0b5.dev (2019-09-13)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="acknowledgments"&gt;
&lt;h2&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;Thanks to &lt;a class="reference external" href="https://www.packet.com"&gt;packet.com&lt;/a&gt; for kindly providing the hardware for the purposes of this benchmark.  The Packet team has been really collaborative over time in allowing me to test new, bare-metal hardware, and I must say that I am quite impressed by how easy it is to start using their services with almost no effort on the user's side.&lt;/p&gt;
&lt;/section&gt;</description><category>amd</category><category>memory wall</category><category>rome</category><guid>https://blosc.org/posts/blosc2-meets-rome/</guid><pubDate>Mon, 25 Nov 2019 18:32:20 GMT</pubDate></item><item><title>Is ARM Hungry Enough to Eat Intel's Favorite Pie?</title><link>https://blosc.org/posts/arm-memory-walls-followup/</link><dc:creator>Francesc Alted</dc:creator><description>&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This entry is a follow-up of the &lt;a class="reference external" href="http://blosc.org/posts/breaking-memory-walls/"&gt;Breaking Down Memory Walls&lt;/a&gt; blog.  Please make sure that you have read it if you want to fully understand all the benchmarks performed here.&lt;/p&gt;
&lt;p&gt;At the beginning of the 1990s the computing world was mainly using RISC (Reduced Instruction Set Computer) architectures, namely SPARC, Alpha, Power and MIPS CPUs, for performing serious calculations.  Intel CPUs were seen as appropriate just for running essentially personal applications on PCs, and almost nobody was thinking about them as a serious contender for server environments.  But Intel had an asset whose importance almost nobody was ready to recognize: with its dominance of the PC market it quickly rose to become the largest CPU maker in the world and, with such an enormous revenue, Intel played its cards well and, by the beginning of the 2000s, was able to turn its CISC (Complex Instruction Set Computer) architecture into the one with the best compute/price ratio, clearly beating the RISC offerings of that time.  That amazing achievement shut the mouths of CISC critics (to the point that nowadays almost everybody recognizes that performance has very little to do with using RISC or CISC) and cleared the path for Intel to dominate not only the PC world, but also the world of server computing for the next 20 years.&lt;/p&gt;
&lt;p&gt;Fast forward to the beginning of the 2010s, with Intel clearly dominating the market of CPUs for servers.  However, at the same time something potentially disruptive happened: the market for mobile and embedded systems exploded, making &lt;a class="reference external" href="https://cacm.acm.org/magazines/2011/5/107684-an-interview-with-steve-furber/fulltext"&gt;the ARM architecture the most widely used architecture in this area&lt;/a&gt;.  By 2017, with over 100 billion ARM processors produced, ARM was already the most widely used architecture in the world.  Now, the smart reader will have noted here a clear parallel between the situation of Intel at the end of the 1990s and ARM at the end of the 2010s: both companies were responsible for the design of the most used CPUs in the world.  There was an important difference though: while Intel was able to implement its own designs, ARM was leaving the implementation job to third party vendors.  Of course, this fact will have consequences on the way ARM will be competing with Intel (see below).&lt;/p&gt;
&lt;section id="arm-plans-for-improving-cpu-performance"&gt;
&lt;h2&gt;ARM Plans for Improving CPU Performance&lt;/h2&gt;
&lt;p&gt;So with ARM CPUs dominating the world of mobile and embedded, the question is whether ARM would be interested in taking a stab at the client market (laptops and PC desktops) and, by extension, at the server computing market during the 2020s, or whether they would give that up because they are comfortable enough with the current situation.  In 2018 ARM provided an important hint to answer this question: they really want to push hard for the client market with the &lt;a class="reference external" href="https://www.anandtech.com/show/13226/arm-unveils-client-cpu-performance-roadmap"&gt;introduction of the Cortex A76 CPU&lt;/a&gt;, which aspires to redefine the capability of ARM to compete with Intel at its own game:&lt;/p&gt;
&lt;img alt="/images/arm-memory-walls-followup/arm-compute-plans.png" class="align-center" src="https://blosc.org/images/arm-memory-walls-followup/arm-compute-plans.png" style="width: 75%;"&gt;
&lt;p&gt;On the other hand, ARM is not just providing licenses to use its IP cores; it also offers the possibility to buy an architectural licence that lets vendors design their own CPU cores using the ARM instruction sets.  This makes it possible for other players like Apple, AppliedMicro, Broadcom, Cavium (now Marvell), Nvidia, Qualcomm, and Samsung Electronics to produce ARM CPUs that can be adapted to different scenarios.  One example that is interesting for this discussion is Marvell who, with its ThunderX2 CPU, is already entering the server computing market --actually, a new super-computer with more than 100,000 ThunderX2 cores has recently entered into the &lt;a class="reference external" href="https://t.co/LM2wXQrXm8"&gt;TOP500 ranking&lt;/a&gt;; this is the first time that an ARM-based computer enters that list, overwhelmingly dominated by Intel architectures for almost two decades now.&lt;/p&gt;
&lt;p&gt;In the next sections we will try to provide more (experimentally tested) evidence on whether ARM (and its licensees) are fulfilling their promise, or whether their claims were just bare marketing.  To check this, I was able to use two recent (2018) implementations of the ARMv8-A architecture, one meant for the client market and the other for servers, replicated the benchmarks of my previous &lt;a class="reference external" href="http://blosc.org/posts/breaking-memory-walls/"&gt;Breaking Down Memory Walls&lt;/a&gt; blog entry and extracted some interesting results.  Keep reading.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="the-kirin-980-cpu"&gt;
&lt;h2&gt;The Kirin 980 CPU&lt;/h2&gt;
&lt;p&gt;Here we are going to analyze &lt;a class="reference external" href="https://www.anandtech.com/show/13503/the-mate-20-mate-20-pro-review"&gt;Huawei's Kirin 980 CPU&lt;/a&gt;, a SoC (System On a Chip) that uses the ARM A76 core internally.  This is a fine example of an internal IP core design of ARM that is licensed to be used in a CPU chipset (or SoC) by another vendor (Huawei in this case).  The Kirin 980 has 4 A76 cores plus 4 A55 cores, but the more powerful ones are the A76 (the A55 are geared towards light tasks with very little energy consumption, which is critical for phones).  The A76 core is designed to be implemented using a 7 nm technology (as is the case for the Kirin 980, the second SoC in the world to use a 7 nm node, after Apple A12), and supports ARM's DynamIQ technology which allows scalability to target the specific requirements of a SoC.  In our case the Kirin 980 is running in a phone (Huawei's Mate 20), and in this scenario the power dissipation (TDP) cannot exceed 4 W, so DynamIQ should try to be very conservative here and avoid keeping too many cores active at the same time.&lt;/p&gt;
&lt;p&gt;ARM is saying that they designed the &lt;a class="reference external" href="https://arstechnica.com/gadgets/2018/06/arm-promises-laptop-level-performance-in-2019/"&gt;A76 to be a competitor of the Intel Skylake Core i5&lt;/a&gt;, so this is what we are going to check here.  For this, we are going to compare a Kirin 980 in a Huawei Mate 20 phone against a Core i5 included in a MacBook Pro (late 2016).  Here is the side-by-side performance for the precipitation dataset that I used in the &lt;a class="reference external" href="http://blosc.org/posts/breaking-memory-walls/"&gt;previous blog&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;&lt;img alt="rainfall-kirin980" src="https://blosc.org/images/arm-memory-walls-followup/kirin980-rainfall-lz4-9.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;img alt="rainfall-i5laptop" src="https://blosc.org/images/arm-memory-walls-followup/i5laptop-lz4-9.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here we can already see a couple of things.  First, the speed of the calculation when there is no compression is similar for both CPUs.  This is interesting because, although the bottleneck for this benchmark is in the memory access, the fact that the Kirin 980 performance is almost the same as that of the Core i5 is a testament to how well ARM performed in the design of a memory prefetcher, clearly allowing for a good memory-level parallelism.&lt;/p&gt;
&lt;p&gt;Second, for the compressed case, the Core i5 is still 50% faster than the Kirin 980, but the performance scales similarly (up to 4 threads) for both CPUs.  The big news here is that the Core i5 has a TDP of 28 W, whereas for the Kirin 980 it is just 4 W (and probably less than that), so that means that ARM's DynamIQ works beautifully so as to allow 4 (powerful) cores to run simultaneously in such a restrictive scenario (remember that we are running this benchmark &lt;em&gt;inside a phone&lt;/em&gt;).  It is also true that we are comparing an Intel CPU from 2016 against an ARM CPU from 2018 and that nowadays we can probably find Intel CPUs showing a performance similar to this i5 for no more than about 10 W (e.g. an &lt;a class="reference external" href="https://ark.intel.com/products/149088/Intel-Core-i5-8265U-Processor-6M-Cache-up-to-3-90-GHz-"&gt;i5-8265U with configurable TDP-down&lt;/a&gt;), although I am not really sure how an Intel CPU will perform with such a strict power constraint.  At any rate, the Kirin 980 still consumes less than half the power of its Intel counterpart --and its price would probably be a fraction of it too.&lt;/p&gt;
&lt;p&gt;I believe that these facts are really a good testament to how serious ARM was about their claim that they were going to catch Intel on the performance side of things for client devices, and probably with an important advantage in consuming less energy too.  The fact that ARM CPUs are more energy efficient should not be surprising given the experience of ARM in that area for decades.  But another reason for that is the important shrink in manufacturing technology that ARM has achieved in their new designs (7 nm node for ARM vs 14 nm node for Intel); undoubtedly, ARM's advantage in power consumption is going to be important for their world-domination plans.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="the-thunderx2-cpu"&gt;
&lt;h2&gt;The ThunderX2 CPU&lt;/h2&gt;
&lt;p&gt;The second way in which ARM sells licenses is the so-called &lt;em&gt;architectural license&lt;/em&gt;, allowing companies to design their own CPU cores using the ARM instruction sets.  Cavium (now bought by Marvell) was one of these companies, and they produced different CPU designs that culminated with Vulcan, the micro-architecture that powers the ThunderX2 CPU, which was made available in May 2018.  &lt;a class="reference external" href="https://en.wikichip.org/wiki/cavium/microarchitectures/vulcan"&gt;Vulcan is a 16 nm high-performance 64-bit ARM micro-architecture&lt;/a&gt; that is specifically meant to compete in compute/data server facilities (think of it as a Xeon-class ARM-based server microprocessor).  ThunderX2 can pack up to 32 Vulcan cores, and as every Vulcan core supports up to 4 threads, the whole CPU can run up to 128 threads.  With its capability to handle so many threads simultaneously, I expected that its raw compute power should be nothing to sneeze at.&lt;/p&gt;
&lt;p&gt;So as to check how powerful a ThunderX2 can be, we are going to compare the &lt;a class="reference external" href="https://en.wikichip.org/wiki/cavium/thunderx2/cn9975"&gt;ThunderX2 CN9975&lt;/a&gt; (actually a box with 2 instances of it, each containing 28 cores) against one of its natural competitors, the Intel Scalable Gold 5120 (actually a box with 2 instances of it, each containing 14 cores):&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;&lt;img alt="rainfall-thunderx2" src="https://blosc.org/images/arm-memory-walls-followup/thunderx2-rainfall-lz4-9.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;img alt="rainfall-scalable" src="https://blosc.org/images/arm-memory-walls-followup/scalable-rainfall-lz4-9.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here we see that, when no compression is used, the Intel instance scales much better and more predictably; however, the ThunderX2 is able to reach a performance similar to the Intel one (almost 70 GB/s) when enough threads are thrown at the computing task.  This is a really interesting fact, because it is showing that, for the first time ever, an ARM CPU can match the memory bandwidth of a latest generation Intel CPU (which BTW, was pretty good at that already).&lt;/p&gt;
&lt;p&gt;Regarding the compressed scenario, Intel Scalable still performs more than 2x faster than the ThunderX2 and it continues to show a really nice scalability.  On the other hand, although the ThunderX2 represents a good step in improving the performance of the ARM architecture, it is still quite far from being able to reach Intel in terms of both raw computing performance and the capacity to scale smoothly.&lt;/p&gt;
&lt;p&gt;When we look at power consumption, although I was not able to find the exact figure for the ThunderX2 CN9975 model that has been used in the benchmarks above, it is probably in the range of 150 W per CPU, which is considerably higher than its Intel Scalable 5120 counterpart, at around 100 W per CPU.  That means that Intel is using considerably less power in their CPU, giving them a clear advantage in server computing at this time.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;From these results, it is quite evident that ARM is making large strides in catching up with Intel performance, especially on the client side of things (laptops and PC desktops), with an important reduction in power consumption, which is especially important for laptops.  Keep these facts in mind when you are going to buy your next laptop or desktop PC and do not blindly assume that Intel is the only reasonable option anymore ;-)&lt;/p&gt;
&lt;p&gt;On the server side, Intel still holds an important advantage though, and it will not be easy to take the performance crown away from them.  However, the fact that ARM is allowing different vendors to produce their own implementations means that the competition can be more specific and each vendor is free to tackle different aspects of server computing.  So it is not difficult to foresee that in the next few years we are going to see new ARM CPUs that are meant not only for crunching numbers, but that also specialize in different tasks, like storing and serving big data, routing data or performing artificial intelligence, to mention just a few cases (for example, &lt;a class="reference external" href="https://www.marvell.com/documents/8ru3g25b5f77f5pbjwl9/"&gt;Marvell is trying to position the ThunderX2 more specifically for the data server scenario&lt;/a&gt;).  These are going to make it difficult for Intel architectures to maintain their current dominance in the data centers.&lt;/p&gt;
&lt;p&gt;Finally, we should not forget the fact that software developers (including myself) have been building high performance libraries using exclusively Intel boxes for &lt;em&gt;decades&lt;/em&gt;, so making them extremely efficient for Intel architectures.  If, as we have seen here, ARM architectures are going to be an alternative in the performance client and server scenarios, then software developers will have to increasingly adopt ARM boxes as part of their tooling so as to continue being competitive in a world that, quite likely, won't necessarily be ruled by Intel anymore.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h2&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;I would like to thank &lt;a class="reference external" href="https://www.packet.com/"&gt;Packet&lt;/a&gt;, a provider of bare metal servers in the cloud (among other things), not only for allowing me to use their machines for free, but also for helping me with different questions about the configuration of the machines.  In particular, Ed Vielmetti has been instrumental in providing me early access to a ThunderX2 server, and making sure that everything was stable enough for the benchmark needs.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix-software-used"&gt;
&lt;h2&gt;Appendix: Software used&lt;/h2&gt;
&lt;p&gt;For reference, here is the software that has been used for this blog entry.&lt;/p&gt;
&lt;p&gt;For the Kirin 980:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: Android 9 - Linux Kernel 4.9.97&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compiler&lt;/strong&gt;: clang 7.0.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C-Blosc2&lt;/strong&gt;: 2.0.0a6.dev (2018-05-18)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the ThunderX2:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 18.04&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compiler&lt;/strong&gt;: GCC 7.3.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C-Blosc2&lt;/strong&gt;: 2.0.0a6.dev (2018-05-18)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;</description><category>ARM</category><category>memory wall</category><category>tuning</category><guid>https://blosc.org/posts/arm-memory-walls-followup/</guid><pubDate>Mon, 07 Jan 2019 10:12:20 GMT</pubDate></item><item><title>Breaking Down Memory Walls</title><link>https://blosc.org/posts/breaking-memory-walls/</link><dc:creator>Francesc Alted</dc:creator><description>&lt;p&gt;&lt;strong&gt;Update (2018-08-09)&lt;/strong&gt;: An extended version of this blog post can be found in this &lt;a class="reference external" href="http://www.blosc.org/docs/Breaking-Down-Memory-Walls.pdf"&gt;article&lt;/a&gt;.  On it, you will find a complementary study with synthetic data (mainly for finding ultimate performance limits), a more comprehensive set of CPUs has been used, as well as more discussion about the results.&lt;/p&gt;
&lt;p&gt;Nowadays CPUs struggle to get data fast enough to feed their cores.  The reason for this is that memory speed is &lt;a class="reference external" href="http://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf"&gt;growing at a slower pace than CPUs increase their speed at crunching numbers&lt;/a&gt;.   This memory slowness compared with CPUs is generally known as the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Random-access_memory#Memory_wall"&gt;Memory Wall&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For example, let's suppose that we want to compute the aggregation of some large array; here is how to do that using OpenMP for leveraging all cores in a CPU:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;&lt;a id="rest_code_ba063f1120384bb59ff21e2076505ff1-1" name="rest_code_ba063f1120384bb59ff21e2076505ff1-1" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ba063f1120384bb59ff21e2076505ff1-1"&gt;&lt;/a&gt;&lt;span class="cp"&gt;#pragma omp parallel for reduction (+:sum)&lt;/span&gt;
&lt;a id="rest_code_ba063f1120384bb59ff21e2076505ff1-2" name="rest_code_ba063f1120384bb59ff21e2076505ff1-2" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ba063f1120384bb59ff21e2076505ff1-2"&gt;&lt;/a&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_ba063f1120384bb59ff21e2076505ff1-3" name="rest_code_ba063f1120384bb59ff21e2076505ff1-3" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ba063f1120384bb59ff21e2076505ff1-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;udata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;a id="rest_code_ba063f1120384bb59ff21e2076505ff1-4" name="rest_code_ba063f1120384bb59ff21e2076505ff1-4" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ba063f1120384bb59ff21e2076505ff1-4"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With this, a server (an Intel Xeon E3-1245 v5 @ 3.50GHz, with 4 physical cores and hyperthreading) takes about 14 ms to aggregate an array with 100 million float32 values when using 8 OpenMP threads (the optimal number for this CPU).  However, if instead of bringing the whole 100 million elements from memory to the CPU we generate the data inside the loop, we avoid the data transmission between memory and CPU, like in:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;&lt;a id="rest_code_ce1ca609941140708a916cfeeaf3a3a8-1" name="rest_code_ce1ca609941140708a916cfeeaf3a3a8-1" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ce1ca609941140708a916cfeeaf3a3a8-1"&gt;&lt;/a&gt;&lt;span class="cp"&gt;#pragma omp parallel for reduction (+:sum)&lt;/span&gt;
&lt;a id="rest_code_ce1ca609941140708a916cfeeaf3a3a8-2" name="rest_code_ce1ca609941140708a916cfeeaf3a3a8-2" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ce1ca609941140708a916cfeeaf3a3a8-2"&gt;&lt;/a&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;a id="rest_code_ce1ca609941140708a916cfeeaf3a3a8-3" name="rest_code_ce1ca609941140708a916cfeeaf3a3a8-3" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ce1ca609941140708a916cfeeaf3a3a8-3"&gt;&lt;/a&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;a id="rest_code_ce1ca609941140708a916cfeeaf3a3a8-4" name="rest_code_ce1ca609941140708a916cfeeaf3a3a8-4" href="https://blosc.org/posts/breaking-memory-walls/#rest_code_ce1ca609941140708a916cfeeaf3a3a8-4"&gt;&lt;/a&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This loop takes just 3.5 ms, that is, 4x less time than the original one.  That means that our CPU could compute the aggregation at a speed that is 4x faster than the speed at which the memory subsystem can provide data elements to the CPU; or, put another way, the CPU is idle, doing nothing 75% of the time, waiting for data to arrive (for this example, but there could be other, more extreme cases).  Here we have the memory wall in action indeed.&lt;/p&gt;
&lt;p&gt;That the memory wall exists is an excellent reason to think about ways to work around it.  One of the most promising avenues is to use compression: what if we could store data in compressed state in-memory and use the spare clock cycles of the CPU for decompressing it just when it is needed?  In this blog entry we will see how to implement such a computational kernel on top of data structures that are cache- and compression-friendly and we will examine how they perform on a range of modern CPU architectures.  Some surprises are in store.&lt;/p&gt;
&lt;p&gt;For demonstration purposes, I will run a simple task: summing up the same array of values as above but using a &lt;em&gt;compressed&lt;/em&gt; dataset instead.  While computing sums of values seems trivial, it exposes a couple of properties that are important for our discussion:&lt;/p&gt;
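&lt;p&gt;As a rough cross-check of these figures, the sketch below turns the two timings into an effective memory bandwidth and an idle fraction.  The function names are made up for this illustration, and it assumes that 1 GB is 10^9 bytes and that the array is streamed from memory exactly once:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;#include &amp;lt;stdio.h&amp;gt;

/* Effective bandwidth in GB/s (assuming 1 GB = 1e9 bytes) when nelems items
   of itemsize bytes are streamed once from memory in elapsed_ms milliseconds. */
static double effective_bandwidth_gbs(double nelems, double itemsize, double elapsed_ms) {
    return (nelems * itemsize) / (elapsed_ms * 1e-3) / 1e9;
}

/* Fraction of the time the CPU spends waiting for data, given the time of the
   memory-bound loop and the time of the compute-only loop. */
static double idle_fraction(double memory_bound_ms, double compute_only_ms) {
    return 1.0 - compute_only_ms / memory_bound_ms;
}

int main(void) {
    /* Figures quoted in the text: 100 million float32 values, 14 ms vs 3.5 ms */
    printf("bandwidth: %.1f GB/s\n", effective_bandwidth_gbs(100e6, 4.0, 14.0));
    printf("idle: %.0f%%\n", 100.0 * idle_fraction(14.0, 3.5));
    return 0;
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With the numbers quoted above this gives about 28.6 GB/s of effective bandwidth and an idle fraction of 75%.&lt;/p&gt;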
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;This is a memory-bound task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is representative of many aggregation/reduction algorithms that are routinely used out in the wild.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;section id="operating-with-compressed-datasets"&gt;
&lt;h2&gt;Operating with Compressed Datasets&lt;/h2&gt;
&lt;p&gt;Now let's see how to run our aggregation efficiently when using compressed data.  For this, we need:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;A data container that supports on-the-fly compression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A blocking algorithm that leverages the caches in CPUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As for the data container, we are going to use the &lt;em&gt;super-chunk&lt;/em&gt; object that comes with the Blosc2 library.  A super-chunk is a data structure that is meant to host many data chunks in a compressed form, and that has some interesting features; more specifically:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compactness&lt;/strong&gt;: everything in a super-chunk is designed to take as little space as possible, not only by using compression, but also by minimizing the amount of associated metadata (like indexes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Small fragmentation&lt;/strong&gt;: by splitting the data in large enough chunks that are contiguous, the resulting structure ends up stored in memory with a pretty small amount of 'holes' in it, allowing a more efficient memory management by both the hardware and the software.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support for contexts&lt;/strong&gt;: useful when we have different threads and we want to decompress data simultaneously.  Assigning a context per each thread is enough to allow the simultaneous use of the different cores without badly interfering with each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy access to chunks&lt;/strong&gt;: an integer is assigned to each chunk, so requesting a specific chunk is just a matter of specifying its number, and then it gets decompressed and returned in one shot.  Pointer arithmetic is thus replaced by indexing operations, making the code less prone to severe errors (e.g. if a chunk does not exist, an error code is returned instead of causing a segmentation fault).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are curious about how the super-chunk can be created and used, just check the &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2/blob/master/bench/sum_openmp.c#L144-L157"&gt;sources for the benchmark&lt;/a&gt; used for this blog.&lt;/p&gt;
&lt;p&gt;Regarding the computing algorithm, I will use one that follows the principles of the blocking computing technique:  for every chunk, bring it to the CPU, decompress it (so that it stays in cache), run all the necessary operations on it, and then proceed to the next chunk:&lt;/p&gt;
&lt;img alt="/images/breaking-down-memory-walls/blocking-technique.png" class="align-center" src="https://blosc.org/images/breaking-down-memory-walls/blocking-technique.png" style="width: 25%;"&gt;
&lt;p&gt;For implementation details, have a look at the &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2/blob/master/bench/sum_openmp.c#L191-L209"&gt;benchmark sources&lt;/a&gt;.&lt;/p&gt;
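&lt;p&gt;To make the idea more concrete, here is a minimal, single-threaded sketch of such a blocked loop.  It does not use the Blosc2 API: the hypothetical fetch_chunk() helper is only a stand-in for the "decompress chunk number N into a small buffer" step and simply copies from a plain array, and the names blocked_sum and CHUNK_NELEMS are invented for this illustration.  The actual benchmark linked above also spreads the chunks over several OpenMP threads:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define CHUNK_NELEMS 4000  /* elements per chunk (illustrative value) */

/* Hypothetical stand-in for "decompress chunk nchunk into buf": it just copies
   from a plain array.  Returns the number of elements placed in buf. */
static size_t fetch_chunk(const float *src, size_t n, size_t nchunk, float *buf) {
    size_t start = nchunk * CHUNK_NELEMS;
    if (start &amp;gt;= n) return 0;
    size_t count = (n - start &amp;lt; CHUNK_NELEMS) ? n - start : CHUNK_NELEMS;
    memcpy(buf, src + start, count * sizeof(float));
    return count;
}

/* Blocked sum: fetch one chunk, consume it while it is hot in cache,
   then move on to the next chunk. */
static double blocked_sum(const float *src, size_t n) {
    float buf[CHUNK_NELEMS];  /* small buffer meant to stay in cache */
    double sum = 0.0;
    size_t nchunks = (n + CHUNK_NELEMS - 1) / CHUNK_NELEMS;
    for (size_t nchunk = 0; nchunk &amp;lt; nchunks; nchunk++) {
        size_t count = fetch_chunk(src, n, nchunk, buf);
        for (size_t i = 0; i &amp;lt; count; i++) {
            sum += buf[i];
        }
    }
    return sum;
}

int main(void) {
    size_t n = 10000;  /* deliberately not a multiple of CHUNK_NELEMS */
    float *data = malloc(n * sizeof(float));
    if (data == NULL) return 1;
    for (size_t i = 0; i &amp;lt; n; i++) data[i] = 1.0f;
    printf("sum = %.1f\n", blocked_sum(data, n));  /* prints "sum = 10000.0" */
    free(data);
    return 0;
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The inner loop only ever touches the small buffer, which is what lets the data stay in cache between the fetch and the computation.&lt;/p&gt;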
&lt;p&gt;Also, and in order to allow maximum efficiency when performing multi-threaded operations, the size of each chunk in the super-chunk should fit in non-shared caches (namely, L1 and L2 in modern CPUs).  This optimization avoids concurrent access to bus caches as much as possible, thereby allowing dedicated access to data caches in each core.&lt;/p&gt;
&lt;p&gt;For our experiments below, we are going to choose a chunksize of 4,000 elements.  Blosc2 needs 2 internal buffers for performing the decompression besides the source and destination buffers, and we are using 32-bit (4 bytes) float values for our exercise, so the final size used in caches will be 4,000 * (2 + 2) * 4 = 64,000 bytes, which should fit comfortably in the L2 caches of most modern CPU architectures (which normally sport 256 KB or more).  Please note that finding an optimal value for this size might require some fine-tuning, not only for different architectures, but also for different datasets.&lt;/p&gt;
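&lt;p&gt;The same arithmetic can be written as a small helper for trying out other chunk sizes.  This is just a sketch: the function name is invented here, and the 256 KB L2 size is an assumption that should be replaced with the value for your own CPU:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code c"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stddef.h&amp;gt;

/* Bytes kept in cache while one chunk is processed: nbuffers buffers,
   each holding nelems items of itemsize bytes. */
static size_t chunk_cache_footprint(size_t nelems, size_t nbuffers, size_t itemsize) {
    return nelems * nbuffers * itemsize;
}

int main(void) {
    const size_t l2_size = 256 * 1024;  /* assumed L2 size; check your own CPU */
    /* 2 internal buffers + source + destination, float32 items */
    size_t footprint = chunk_cache_footprint(4000, 2 + 2, sizeof(float));
    printf("footprint: %zu bytes, fits in L2: %s\n",
           footprint, footprint &amp;lt;= l2_size ? "yes" : "no");
    return 0;
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On a platform with 4-byte floats this reports 64,000 bytes, about a quarter of the assumed 256 KB.&lt;/p&gt;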
&lt;/section&gt;
&lt;section id="the-precipitation-dataset"&gt;
&lt;h2&gt;The Precipitation Dataset&lt;/h2&gt;
&lt;p&gt;There are plenty of datasets out there exposing different data distributions so, depending on your scenario, your mileage may vary.  The dataset chosen here is the result of a &lt;a class="reference external" href="http://reanalysis.meteo.uni-bonn.de"&gt;regional reanalysis covering the European continent&lt;/a&gt;, and in particular, the precipitation data in a certain region of Europe.  Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Caveat&lt;/em&gt;: For the sake of easy reproducibility, for building the 100 million element dataset I have chosen a small &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2/blob/master/bench/read-grid-150x150.py"&gt;geographical area with a size of 150x150&lt;/a&gt; and reused it repeatedly so as to fill the final dataset completely.  As the size of the chunks is smaller than this area, and the super-chunk (as configured here) does not use data redundancies from other chunks, the results obtained here can be safely extrapolated to the actual dataset made from real data (bar some small differences).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="choosing-the-compression-codec"&gt;
&lt;h2&gt;Choosing the Compression Codec&lt;/h2&gt;
&lt;p&gt;When determining the best codec to use inside Blosc2 (it has support for BloscLZ, LZ4, LZ4HC, Zstd, Zlib and Lizard), it turns out that they behave quite differently, both in terms of compression and speed, depending on the dataset they have to compress &lt;em&gt;and&lt;/em&gt; on the CPU architecture in which they run.  This is quite usual, and the reason why you should always try to find the best codec for your use case.  Here is how the different codecs behave for our precipitation dataset in terms of decompression speed for our reference platform (Intel Xeon E3-1245):&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;p&gt;&lt;img alt="i7server-codecs" src="https://blosc.org/images/breaking-down-memory-walls/i7server-rainfall-codecs.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;&lt;img alt="rainfall-cr" src="https://blosc.org/images/breaking-down-memory-walls/rainfall-cr.png" style="width: 70%;"&gt;&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In this case LZ4HC is the codec that decompresses fastest for any number of threads and hence, the one selected for the benchmarks on the reference platform.  A similar procedure has been followed to select the codec for the other CPUs.  The selected codec for every CPU will be conveniently specified in the discussion of the results below.&lt;/p&gt;
&lt;p&gt;For completeness, I am also showing the compression ratios achieved by the different codecs for the precipitation dataset.  Although there are significant differences for them, these usually come at the cost of compression/decompression time.  At any rate, even though compression ratio is important, in this blog we are mainly interested in the best decompression speed, so we will use this latter as the only important parameter for codec selection.&lt;/p&gt;
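&lt;p&gt;The selection procedure can be sketched in a few lines of Python.  This is only an illustration under loud assumptions: it uses the standard library codecs (zlib, bz2, lzma) as stand-ins for the Blosc2 ones, a synthetic buffer instead of the precipitation data, and hypothetical helper names (&lt;code&gt;benchmark_codecs&lt;/code&gt;, &lt;code&gt;fastest_decompressor&lt;/code&gt;).&lt;/p&gt;

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs from the Python standard library (not the Blosc2 codecs).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark_codecs(data, repeats=3):
    """Return {codec: (decompression speed in MB/s, compression ratio)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            decompress(packed)
            best = min(best, time.perf_counter() - start)
        speed = len(data) / max(best, 1e-9) / 1e6
        results[name] = (speed, len(data) / len(packed))
    return results


def fastest_decompressor(results):
    """Select the codec by decompression speed alone, ignoring the ratio."""
    return max(results, key=lambda name: results[name][0])


# Hypothetical sample buffer with some redundancy in it.
sample = bytes(i % 251 for i in range(1_000_000))
results = benchmark_codecs(sample)
best_codec = fastest_decompressor(results)
print(best_codec, results[best_codec])
```

&lt;p&gt;The winner depends on the machine and on the data, which is exactly why the measurement has to be repeated for every CPU.&lt;/p&gt;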
&lt;/section&gt;
&lt;section id="results-on-different-cpus"&gt;
&lt;h2&gt;Results on Different CPUs&lt;/h2&gt;
&lt;p&gt;Now it is time to see how our compressed sum algorithm performs compared with the original uncompressed one.  However, as not all the CPUs are created equal, we are going to see how different CPUs perform doing exactly the same computation.&lt;/p&gt;
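&lt;p&gt;To keep the idea concrete, here is a minimal, single-threaded sketch of a sum over independently compressed chunks.  It is not the C-Blosc2 benchmark: zlib stands in for the Blosc2 codecs, a plain list stands in for the super-chunk, integers replace the precipitation values so that the check is exact, and &lt;code&gt;CHUNK_ITEMS&lt;/code&gt; is a hypothetical chunk size.&lt;/p&gt;

```python
import zlib
from array import array

CHUNK_ITEMS = 64 * 1024  # hypothetical chunk size, meant to fit in CPU caches


def make_super_chunk(values):
    """Compress the values chunk by chunk, each chunk on its own."""
    chunks = []
    for start in range(0, len(values), CHUNK_ITEMS):
        chunk = values[start:start + CHUNK_ITEMS]
        chunks.append(zlib.compress(chunk.tobytes(), 1))
    return chunks


def compressed_sum(chunks):
    """Sum the dataset one decompressed chunk at a time."""
    total = 0
    for packed in chunks:
        chunk = array("i")
        chunk.frombytes(zlib.decompress(packed))
        total += sum(chunk)
    return total


# Synthetic data; the real benchmark uses the precipitation dataset.
values = array("i", (i % 100 for i in range(1_000_000)))
chunks = make_super_chunk(values)
assert compressed_sum(chunks) == sum(values)
```

&lt;p&gt;Only one decompressed chunk is alive at any time, so the working set stays small while the compressed data is what travels from main memory.&lt;/p&gt;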
&lt;section id="reference-cpu-intel-xeon-e3-1245-v5-4-core-processor-3-50ghz"&gt;
&lt;h3&gt;Reference CPU: Intel Xeon E3-1245 v5 4-Core processor @ 3.50GHz&lt;/h3&gt;
&lt;p&gt;This is a mainstream, somewhat 'small' processor for servers that has an excellent price/performance ratio.  Its main virtue is that, due to its small core count, the CPU can be run at considerably high clock speeds which, combined with a high IPC (Instructions Per Clock) count, delivers considerable computational power.  These results are a good baseline reference point for comparing other CPUs packing a larger number of cores (and hence, lower clock speeds).  Here is how it performs:&lt;/p&gt;
&lt;img alt="/images/breaking-down-memory-walls/i7server-rainfall-lz4hc-9.png" class="align-center" src="https://blosc.org/images/breaking-down-memory-walls/i7server-rainfall-lz4hc-9.png" style="width: 75%;"&gt;
&lt;p&gt;We see here that, even though the uncompressed dataset does not scale too well, the compressed dataset shows nice scalability even when using hyperthreading (&amp;gt; 4 threads); this is a remarkable fact for a feature (hyperthreading) that, despite marketing promises, does not always deliver 2x the performance of the physical cores.  With that, the performance peak for the compressed precipitation dataset (22 GB/s, using LZ4HC) is really close to the uncompressed one (27 GB/s); quite an achievement for a CPU with just 4 physical cores.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="amd-epyc-7401p-24-core-processor-2-0ghz"&gt;
&lt;h3&gt;AMD EPYC 7401P 24-Core Processor @ 2.0GHz&lt;/h3&gt;
&lt;p&gt;This CPU implements EPYC, one of the most powerful architectures ever created by AMD.  It packs 24 physical cores, although internally they are split into 2 blocks with 12 cores each.  Here is how it behaves:&lt;/p&gt;
&lt;img alt="/images/breaking-down-memory-walls/epyc-rainfall-lz4-9.png" class="align-center" src="https://blosc.org/images/breaking-down-memory-walls/epyc-rainfall-lz4-9.png" style="width: 75%;"&gt;
&lt;p&gt;Stalling at 4/8 threads, the EPYC scalability for the uncompressed dataset is definitely not good.  On the other hand, the compressed dataset behaves quite differently: it shows nice scalability through the whole range of cores in the CPU (again, even when using hyperthreading), achieving the best performance (45 GB/s, using LZ4) at precisely 48 threads, well above the maximum performance reached by the uncompressed dataset (30 GB/s).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="intel-scalable-gold-5120-2x-14-core-processor-2-2ghz"&gt;
&lt;h3&gt;Intel Scalable Gold 5120 2x 14-Core Processor @ 2.2GHz&lt;/h3&gt;
&lt;p&gt;Here we have one of the latest and most powerful CPU architectures developed by Intel.  We are testing it here within a machine with 2 CPUs, each containing 14 cores.  Here is how it performed:&lt;/p&gt;
&lt;img alt="/images/breaking-down-memory-walls/scalable-rainfall-lz4-9.png" class="align-center" src="https://blosc.org/images/breaking-down-memory-walls/scalable-rainfall-lz4-9.png" style="width: 75%;"&gt;
&lt;p&gt;In this case, and stalling at 24/28 threads, the Intel Scalable shows quite remarkable scalability for the uncompressed dataset (apparently, Intel has finally chosen a good name for an architecture; well done guys!).  More importantly, it also reveals even nicer scalability on the compressed dataset, all the way up to 56 threads (which is expected given the 2x 14-core CPUs with hyperthreading); this is a remarkable feat for such a memory bandwidth beast.  In absolute terms, the compressed dataset achieves a performance (68 GB/s, using LZ4) that is very close to the uncompressed one (72 GB/s).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="cavium-armv8-2x-48-core"&gt;
&lt;h3&gt;Cavium ARMv8 2x 48-Core&lt;/h3&gt;
&lt;p&gt;We are used to seeing ARM architectures powering most of our phones and tablets, but seeing them performing computational duties is far more uncommon.  This does not mean that there are no ARM implementations that can power big servers.  Cavium, with 48 cores in a single CPU, is an example of a server-grade chip.  In this case we are looking at a machine with two of these CPUs:&lt;/p&gt;
&lt;img alt="/images/breaking-down-memory-walls/cavium-rainfall-blosclz-9.png" class="align-center" src="https://blosc.org/images/breaking-down-memory-walls/cavium-rainfall-blosclz-9.png" style="width: 75%;"&gt;
&lt;p&gt;Again, we see nice scalability (albeit a bit bumpy) for the uncompressed dataset, reaching its maximum (35 GB/s) at 40 threads.  The compressed dataset scales much more smoothly: the performance peaks at 64 threads (15 GB/s, using BloscLZ) and then drops significantly after that point (even though the CPU still has enough cores to continue the scaling; I am not sure why that is).  Incidentally, the BloscLZ codec being the best performer here is not a coincidence, as it recently received a lot of fine-tuning for ARM.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="what-we-learned"&gt;
&lt;h2&gt;What We Learned&lt;/h2&gt;
&lt;p&gt;We have explored how to use compression in a nearly optimal way to perform a very simple task: compute an aggregation out of a large dataset.  With a basic understanding of the cache and memory subsystem, and by using appropriate compressed data structures (the super-chunk), we have seen how we can easily produce code that enables modern CPUs to perform operations on compressed data at a speed that approaches the speed of the same operations on uncompressed data (and sometimes exceeds it).  More specifically:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;p&gt;Performance for the compressed dataset scales very well with the number of threads for all the CPUs (even hyper-threading seems very beneficial here, which is a welcome surprise).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CPUs that benefit the most from compression are those with relatively low memory bandwidth and CPUs with many cores.  In particular, the EPYC architecture is a good example, and we have shown how the compressed dataset can operate 50% faster than the uncompressed one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even when using CPUs with a low number of cores (e.g. our reference CPU, with only 4) we can achieve computational speeds on compressed data that can be on par with traditional, uncompressed computations, while saving precious amounts of memory and disk space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The appropriate codec (and other parameters) to use within Blosc2 for maximum performance can vary depending on the dataset and the CPU used.  Having a way to automatically discover the optimal compression parameters would be a nice addition to the Blosc2 library.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
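&lt;p&gt;The peak figures quoted in the per-CPU sections can be put side by side with a few lines of Python.  The numbers are the ones reported above; the &lt;code&gt;relative_speed&lt;/code&gt; helper and the shortened CPU labels are just names chosen for this sketch.&lt;/p&gt;

```python
# Peak throughput in GB/s as quoted in the per-CPU sections:
# (uncompressed, compressed, codec)
PEAKS = {
    "Intel Xeon E3-1245 v5": (27, 22, "LZ4HC"),
    "AMD EPYC 7401P": (30, 45, "LZ4"),
    "Intel Scalable Gold 5120 (2x)": (72, 68, "LZ4"),
    "Cavium ARMv8 (2x)": (35, 15, "BloscLZ"),
}


def relative_speed(uncompressed, compressed):
    """Compressed throughput as a fraction of the uncompressed one."""
    return compressed / uncompressed


for cpu, (plain, packed, codec) in PEAKS.items():
    ratio = relative_speed(plain, packed)
    print(f"{cpu:32s} {codec:8s} {ratio:.2f}x")
```

&lt;p&gt;The EPYC comes out at 1.5x, which is the 50% advantage mentioned in the list, while the two Intel machines land at roughly 0.81x and 0.94x and the Cavium at roughly 0.43x.&lt;/p&gt;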
&lt;/section&gt;
&lt;section id="final-thoughts"&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;To conclude, it is interesting to remember here what Linus Torvalds said back in 2006 (talking about the git system that he created the year before):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] git actually has a simple  design, with stable and reasonably well-documented data structures.  In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful.
[...] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Of course, we all know how drastic Linus can be in his statements, but I cannot agree more on how important it is to adopt a data-driven view when designing our applications.  But I'd go further and say that, when trying to squeeze the last drop of performance out of modern CPUs, data containers need to be structured in a way that leverages the characteristics of the underlying CPU and facilitates the application of the blocking technique (thereby allowing compression to run efficiently).  Hopefully, installments like this can help us explore new possibilities to break down the memory wall that bedevils modern computing.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="acknowledgements"&gt;
&lt;h2&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;Thanks to my friend Scott Prater for his great advice on improving my writing style, Dirk Schwanenberg for pointing me to the precipitation dataset and for providing the script for reading it, and Robert McLeod, J. David Ibáñez and Javier Sancho for suggesting general improvements (even though some of their suggestions required so much work that they made me wonder about their actual friendship :).&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix-software-used"&gt;
&lt;h2&gt;Appendix: Software used&lt;/h2&gt;
&lt;p&gt;For reference, here is the software that has been used for this blog entry:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 18.04&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compiler&lt;/strong&gt;: GCC 7.3.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C-Blosc2&lt;/strong&gt;: 2.0.0a6.dev (2018-05-18)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;</description><category>caches</category><category>memory wall</category><category>tuning</category><guid>https://blosc.org/posts/breaking-memory-walls/</guid><pubDate>Mon, 25 Jun 2018 18:32:20 GMT</pubDate></item></channel></rss>