<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blosc Home Page  (Posts by Ricardo Sales Piquer)</title><link>https://blosc.org/</link><description></description><atom:link href="https://blosc.org/authors/ricardo-sales-piquer.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:blosc@blosc.org"&gt;The Blosc Developers&lt;/a&gt; </copyright><lastBuildDate>Wed, 04 Mar 2026 11:43:34 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Make NDArray Transposition Fast (and Compressed!) within Blosc 2 </title><link>https://blosc.org/posts/optimizing-chunks-transpose/</link><dc:creator>Ricardo Sales Piquer</dc:creator><description>&lt;p&gt;&lt;strong&gt;Update (2025-04-30):&lt;/strong&gt; The &lt;code class="docutils literal"&gt;transpose&lt;/code&gt; function is now officially deprecated and
replaced by the new &lt;code class="docutils literal"&gt;permute_dims&lt;/code&gt;. This transition follows the Python array
API standard v2022.12, aiming to make Blosc2 even more compatible with modern
Python libraries and workflows.&lt;/p&gt;
&lt;p&gt;In contrast with the previous &lt;code class="docutils literal"&gt;transpose&lt;/code&gt;, the new &lt;code class="docutils literal"&gt;permute_dims&lt;/code&gt; offers:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Support for arrays of any number of dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full handling of arbitrary axis permutations, including support for
negative indices.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
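&lt;p&gt;As a minimal sketch of these semantics (illustrated here with NumPy, whose &lt;code class="docutils literal"&gt;transpose&lt;/code&gt; accepts the same &lt;code class="docutils literal"&gt;axes&lt;/code&gt; argument, including negative indices, per the array API standard):&lt;/p&gt;

```python
import numpy as np

# A 3-D array: permute_dims follows the array API semantics of
# numpy.transpose(a, axes), including negative axis indices.
a = np.arange(24).reshape(2, 3, 4)

# Move the last axis to the front; -1 refers to the last dimension.
b = np.transpose(a, axes=(-1, 0, 1))

print(a.shape)  # (2, 3, 4)
print(b.shape)  # (4, 2, 3)
```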
&lt;p&gt;Moreover, I have found a more efficient way to transpose matrices in
Blosc2. This blog post contains updated plots and discussion.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Matrix transposition is more than a textbook exercise: it plays a key role in
memory-bound operations, where layout and access patterns can make or break
performance.&lt;/p&gt;
&lt;p&gt;When working with large datasets, efficient data transformation can significantly
improve both performance and compression ratios. In Blosc2, we recently implemented
a matrix transposition function, a fundamental operation that rearranges data by
swapping rows and columns. In this post, I'll share the design insights,
implementation details, and performance considerations that went into this feature,
as well as an unexpected NumPy behaviour.&lt;/p&gt;
&lt;section id="what-was-the-old-behavior"&gt;
&lt;h2&gt;What was the old behavior?&lt;/h2&gt;
&lt;p&gt;Previously, calling &lt;code class="docutils literal"&gt;blosc2.transpose(A)&lt;/code&gt; would &lt;strong&gt;transpose the data within
each chunk&lt;/strong&gt;, and a new chunk shape would be chosen for the output array.
However, this new chunk shape was not necessarily aligned with the new memory
access patterns induced by the transpose. As a result, even though the output
looked correct, accessing data along the new axes still incurred
significant overhead due to an increased number of I/O operations. This
led to performance bottlenecks, particularly in workloads that rely on
efficient memory access patterns.&lt;/p&gt;
&lt;img alt="Transposition explanation for old operation" class="align-center" src="https://blosc.org/images/blosc2-transpose/transpose2.png"&gt;
&lt;/section&gt;
&lt;section id="what-s-new"&gt;
&lt;h2&gt;What's new?&lt;/h2&gt;
&lt;p&gt;The &lt;code class="docutils literal"&gt;permute_dims&lt;/code&gt; function in Blosc2 has been redesigned to greatly improve
performance when working with compressed, multidimensional arrays. The main
improvement lies in &lt;strong&gt;transposing the chunk layout alongside the array data&lt;/strong&gt;,
which eliminates the overhead of cross-chunk access patterns.&lt;/p&gt;
&lt;p&gt;For example, an array with &lt;code class="docutils literal"&gt;&lt;span class="pre"&gt;chunks=(2,&lt;/span&gt; 5)&lt;/code&gt; that is transposed with
&lt;code class="docutils literal"&gt;&lt;span class="pre"&gt;axes=(1,&lt;/span&gt; 0)&lt;/code&gt; will result in an array with &lt;code class="docutils literal"&gt;&lt;span class="pre"&gt;chunks=(5,&lt;/span&gt; 2)&lt;/code&gt;. This ensures
that the output layout matches the new data order, making block access
contiguous and efficient.&lt;/p&gt;
&lt;p&gt;This logic generalizes to N-dimensional arrays and applies regardless of their
shape or chunk configuration.&lt;/p&gt;
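&lt;p&gt;The chunk-layout rule can be sketched in a few lines of plain Python (a simplified illustration, not the actual Blosc2 internals): the output chunk shape is the input chunk shape permuted by the same &lt;code class="docutils literal"&gt;axes&lt;/code&gt;, with negative indices normalized first.&lt;/p&gt;

```python
def permuted_chunks(chunks, axes):
    """Permute a chunk shape the same way permute_dims permutes the data.

    Simplified illustration only; negative axis indices are normalized
    as in the Python array API standard.
    """
    ndim = len(chunks)
    norm = [ax % ndim for ax in axes]  # -1 becomes ndim - 1, etc.
    assert sorted(norm) == list(range(ndim)), "axes must be a permutation"
    return tuple(chunks[ax] for ax in norm)

print(permuted_chunks((2, 5), (1, 0)))            # (5, 2)
print(permuted_chunks((10, 20, 30), (-1, 0, 1)))  # (30, 10, 20)
```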
&lt;img alt="Transposition explanation for new operation" class="align-center" src="https://blosc.org/images/blosc2-transpose/transpose3.png"&gt;
&lt;/section&gt;
&lt;section id="performance-benchmark-transposing-matrices-with-blosc2-vs-numpy"&gt;
&lt;h2&gt;Performance benchmark: Transposing matrices with Blosc2 vs NumPy&lt;/h2&gt;
&lt;p&gt;To evaluate the performance of the new matrix transposition implementation in
&lt;em&gt;Blosc2&lt;/em&gt;, I conducted a series of benchmarks comparing it to &lt;em&gt;NumPy&lt;/em&gt;, which
serves as the baseline due to its widespread use and high optimization level.
The goal was to observe how both approaches perform when handling matrices of
increasing size and to understand the impact of different chunk configurations
in Blosc2.&lt;/p&gt;
&lt;section id="benchmark-setup"&gt;
&lt;h3&gt;Benchmark setup&lt;/h3&gt;
&lt;p&gt;All tests were conducted using matrices filled with &lt;code class="docutils literal"&gt;float64&lt;/code&gt; values,
ranging from small &lt;code class="docutils literal"&gt;100×100&lt;/code&gt; matrices up to very large
&lt;code class="docutils literal"&gt;17000×17000&lt;/code&gt; ones, covering data sizes
from just a few megabytes to over 2 GB. Each matrix was transposed using the
Blosc2 API under different chunking strategies.&lt;/p&gt;
&lt;p&gt;In the case of NumPy, I used the &lt;code class="docutils literal"&gt;.transpose()&lt;/code&gt; function followed by a
&lt;code class="docutils literal"&gt;.copy()&lt;/code&gt; to ensure that the operation was comparable to that of Blosc2. This
is because, by default, NumPy's transposition is a view operation that only
modifies the array's metadata, without actually rearranging the data in memory.
Adding &lt;code class="docutils literal"&gt;.copy()&lt;/code&gt; forces NumPy to perform a real memory reordering, making the
comparison with Blosc2 fair and accurate.&lt;/p&gt;
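&lt;p&gt;The view-versus-copy distinction is easy to verify; a small sketch (with hypothetical sizes far smaller than those in the benchmark):&lt;/p&gt;

```python
import numpy as np

# NumPy's transpose is a view: only strides/metadata change, no data moves.
A = np.random.rand(500, 500)
T_view = A.transpose()
print(T_view.base is A)  # True: the view shares A's memory

# .copy() forces a real reordering into a fresh C-contiguous buffer,
# which is the operation the benchmark actually times.
T_copy = A.transpose().copy()
print(T_copy.base is A)              # False: freshly allocated data
print(T_copy.flags["C_CONTIGUOUS"])  # True
```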
&lt;p&gt;For Blosc2, I tested the transposition function across several chunk
configurations. Specifically, I included:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Automatic chunking, where Blosc2 decides the optimal chunk size
internally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fixed chunk sizes: &lt;code class="docutils literal"&gt;(150, 300)&lt;/code&gt;, &lt;code class="docutils literal"&gt;(1000, 1000)&lt;/code&gt; and
&lt;code class="docutils literal"&gt;(5000, 5000)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These chunk sizes were chosen to represent a mix of square and rectangular
blocks, allowing me to study how chunk geometry impacts performance, especially
for very large matrices.&lt;/p&gt;
&lt;p&gt;Each combination of library and configuration was tested across all matrix sizes,
and the time taken to perform the transposition was recorded in seconds. This
comprehensive setup makes it possible to compare not just raw performance, but
also how well each method scales with data size and structure.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="results-and-discussion"&gt;
&lt;h3&gt;Results and discussion&lt;/h3&gt;
&lt;p&gt;The chart below summarizes the benchmark results for matrix transposition using
NumPy and Blosc2, across various chunk shapes and matrix sizes.&lt;/p&gt;
&lt;img alt="Transposition performance for new method" class="align-center" src="https://blosc.org/images/blosc2-transpose/performance-new.png"&gt;
&lt;p&gt;While NumPy sets a strong performance baseline, the behaviour of Blosc2 becomes
particularly interesting when we dive into how different chunk configurations
affect transposition speed. The following observations highlight how crucial the
choice of chunk shape is to achieving optimal performance.&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;Large square chunks (e.g., &lt;code class="docutils literal"&gt;(4000, 4000)&lt;/code&gt;) showed the worst performance,
especially with large matrices. Despite having fewer chunks, their size
seems to hinder cache performance and introduces memory pressure that
degrades throughput. Execution times were consistently higher than other
configurations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Small rectangular chunks such as &lt;code class="docutils literal"&gt;(150, 300)&lt;/code&gt; also underperformed.
As matrix size grew, execution times increased significantly,
reaching nearly 3 seconds at around 2200 MB, likely due to poor cache
utilization and the overhead of managing many tiny chunks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mid-sized square chunks like &lt;code class="docutils literal"&gt;(1000, 1000)&lt;/code&gt; delivered consistently solid
results across all tested sizes. Their timings stay below ~1.2 s with
minimal variance, making them a reliable manual choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatically selected chunks consistently achieved the best performance.
By adapting chunk layout to the data shape and size, the internal
heuristics outpaced all fixed configurations, even rivaling plain NumPy
transpose times.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img alt="Blosc2 vs NumPy comparison" class="align-center" src="https://blosc.org/images/blosc2-transpose/Numpy-vs-Blosc2-new.png"&gt;
&lt;p&gt;The second plot provides a direct comparison between the standard NumPy
&lt;code class="docutils literal"&gt;transpose&lt;/code&gt; and the newly optimized Blosc2
version. It shows that Blosc2’s optimized implementation closely matches
NumPy's performance, even for larger matrices. The results confirm that with
good chunking strategies and proper memory handling, Blosc2 can achieve
performance on par with NumPy for transposition operations.&lt;/p&gt;
&lt;aside class="admonition note"&gt;
&lt;p class="admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;Across all chunk configurations, there is an anomalous latency spike around
the 1500–1600 MB range. This unexpected behavior suggests some low-level
effect (e.g., memory management thresholds, buffer alignment issues, or shifts
in cache access patterns) that is not directly tied to chunk size but rather to
the overall matrix magnitude in that specific region.&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The benchmarks highlight one key insight: Blosc2 is highly sensitive to chunk
shape, and its performance can range from excellent to poor depending on how it
is configured. With the right chunk size, Blosc2 can offer both high-speed
transpositions and advanced features like compression and out-of-core
processing. However, misconfigured chunks, especially those that are too big
or too small, can drastically reduce its effectiveness. This makes chunk tuning
an essential step for anyone seeking to get the most out of Blosc2 for
large-scale matrix operations.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="appendix-a-unexpected-numpy-behaviour"&gt;
&lt;h2&gt;Appendix A: Unexpected NumPy behaviour&lt;/h2&gt;
&lt;p&gt;While running the benchmarks, unusual spikes were consistently observed in
the performance of NumPy for matrices of approximately &lt;strong&gt;500 MB&lt;/strong&gt;, &lt;strong&gt;1100 MB&lt;/strong&gt;,
and &lt;strong&gt;2000 MB&lt;/strong&gt; in size. This can be clearly seen in the plot below:&lt;/p&gt;
&lt;img alt="NumPy transposition performance anomaly" class="align-center" src="https://blosc.org/images/blosc2-transpose/only-numpy.png"&gt;
&lt;p&gt;This sudden increase in transposition time is consistently reproducible and
does not follow the gradual growth expected from larger memory sizes. We have
also observed this behaviour on other machines, although at different
sizes.&lt;/p&gt;
&lt;p&gt;This observation reinforces the importance of testing under realistic and
varied conditions, as performance is not always linear or intuitive.&lt;/p&gt;
&lt;aside class="admonition note"&gt;
&lt;p class="admonition-title"&gt;Note&lt;/p&gt;
&lt;p&gt;See NumPy's issue &lt;a class="reference external" href="https://github.com/numpy/numpy/issues/28711"&gt;#28711&lt;/a&gt; for
more details.&lt;/p&gt;
&lt;/aside&gt;
&lt;/section&gt;</description><category>blosc2 optimization matrix transposition compression numpy</category><guid>https://blosc.org/posts/optimizing-chunks-transpose/</guid><pubDate>Tue, 08 Apr 2025 09:00:00 GMT</pubDate></item><item><title>Optimizing chunks for matrix multiplication in Blosc2</title><link>https://blosc.org/posts/optimizing-chunks-blosc2/</link><dc:creator>Ricardo Sales Piquer</dc:creator><description>&lt;p&gt;As data volumes continue to grow in fields like machine learning and scientific computing,
optimizing fundamental operations like matrix multiplication becomes increasingly critical.
Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.&lt;/p&gt;
&lt;section id="matrix-multiplication"&gt;
&lt;h2&gt;Matrix Multiplication&lt;/h2&gt;
&lt;p&gt;Matrix multiplication is a fundamental operation in many scientific and
engineering applications. With the introduction of matrix multiplication into
Blosc2, users can now perform this operation on compressed arrays efficiently.
The key advantages of having matrix multiplication in Blosc2 include:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compressed matrices in memory:&lt;/strong&gt;
Blosc2 enables matrices to be stored in a compressed format without sacrificing
the ability to perform operations directly on them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency with chunks&lt;/strong&gt;:
In computation-intensive applications, matrix multiplication can be executed
without fully decompressing the data, operating on small blocks of data independently,
saving both time and memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Out-of-core computation:&lt;/strong&gt;
When matrices are too large to fit in main memory, Blosc2 facilitates out-of-core
processing. Data stored on disk is read and processed in optimized chunks,
allowing matrix multiplication operations without loading the entire dataset into
memory.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features are especially valuable in big data environments and in scientific
or engineering applications where matrix sizes can be overwhelming, enabling
complex calculations efficiently.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="implementation"&gt;
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;The matrix multiplication functionality is implemented in the &lt;code class="docutils literal"&gt;matmul&lt;/code&gt;
function. It supports Blosc2 &lt;code class="docutils literal"&gt;NDArray&lt;/code&gt; objects and leverages chunked
operations to perform the multiplication efficiently.&lt;/p&gt;
&lt;img alt="How blocked matrix multiplication works" class="align-center" src="https://blosc.org/images/blosc2-matmul/blocked-gemm.png"&gt;
&lt;p&gt;The image illustrates a &lt;strong&gt;blocked matrix multiplication&lt;/strong&gt; approach. The key idea
is to divide matrices into smaller blocks (or chunks) to optimize memory
access and computational efficiency.&lt;/p&gt;
&lt;p&gt;In the image, matrix &lt;cite&gt;A (M x K)&lt;/cite&gt; and matrix &lt;cite&gt;B (K x N)&lt;/cite&gt;
are partitioned into chunks, which are in turn partitioned into blocks. The resulting
matrix &lt;cite&gt;C (M x N)&lt;/cite&gt; is computed as a sum of block-wise products.&lt;/p&gt;
&lt;p&gt;This method significantly improves cache utilization by ensuring that only the
necessary parts of the matrices are loaded into memory at any given time. In
Blosc2, storing matrix blocks as compressed chunks reduces memory footprint and
enhances performance by enabling on-the-fly decompression.&lt;/p&gt;
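&lt;p&gt;The blocked scheme can be sketched in a few lines of NumPy (an illustrative sketch only; Blosc2 additionally stores each chunk compressed and decompresses it on the fly):&lt;/p&gt;

```python
import numpy as np

def blocked_matmul(A, B, bs=4):
    """Blocked (tiled) matrix multiply, as in the figure.

    Illustrative sketch: bs is the block size; each C block accumulates
    the sum over k of block-wise products.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, bs):
        for j in range(0, N, bs):
            for k in range(0, K, bs):
                # Slicing clamps at the edges, so non-multiple sizes work too.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(10, 8)
B = np.random.rand(8, 6)
assert np.allclose(blocked_matmul(A, B), A @ B)
```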
&lt;p&gt;Also, Blosc2 supports a wide range of data types. In addition to standard Python
types such as &lt;cite&gt;int&lt;/cite&gt;, &lt;cite&gt;float&lt;/cite&gt;, and &lt;cite&gt;complex&lt;/cite&gt;, it also fully supports various NumPy
types. The currently supported types include:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.int8&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.int16&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.int32&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.int64&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.float32&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.float64&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.complex64&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;cite&gt;np.complex128&lt;/cite&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This versatility allows compression and subsequent processing to be
applied across diverse scenarios, tailored to the specific needs of each
application.&lt;/p&gt;
&lt;p&gt;Together, these features make Blosc2 a flexible and adaptable tool for various
scenarios, but especially suited for the handling of large datasets.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="benchmarks"&gt;
&lt;h2&gt;Benchmarks&lt;/h2&gt;
&lt;p&gt;The benchmarks have been designed to evaluate the performance of the &lt;code class="docutils literal"&gt;matmul&lt;/code&gt;
function under various conditions. Here are the key aspects of our
experimental setup and findings:&lt;/p&gt;
&lt;p&gt;Different matrix sizes were tested using both &lt;code class="docutils literal"&gt;float32&lt;/code&gt; and &lt;code class="docutils literal"&gt;float64&lt;/code&gt;
data types. All the matrices used for multiplication are square.
The variation in matrix sizes helps observe how the function scales and
how the overhead of chunk management impacts performance.&lt;/p&gt;
&lt;p&gt;The x-axis represents the size of the resulting matrix in megabytes (MB).
We used GFLOPS (Giga Floating-Point Operations per Second) to gauge the
computational throughput, allowing us to compare the efficiency of the
&lt;code class="docutils literal"&gt;matmul&lt;/code&gt; function relative to highly optimized libraries like NumPy.&lt;/p&gt;
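&lt;p&gt;For reference, the GFLOPS figure follows the usual convention that a square n×n matrix multiply performs 2·n³ floating-point operations (one multiply and one add per inner-loop step); a small helper, with a hypothetical timing, shows the arithmetic:&lt;/p&gt;

```python
def matmul_gflops(n, seconds):
    """GFLOPS for an n x n square matrix multiply: 2*n**3 flops in total."""
    return 2 * n**3 / seconds / 1e9

# E.g., a hypothetical 4096 x 4096 multiply finishing in 1.5 s:
print(round(matmul_gflops(4096, 1.5), 1))  # 91.6
```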
&lt;p&gt;Blosc2 also includes functionality to automatically select chunk shapes;
it appears in the benchmark as "Auto".&lt;/p&gt;
&lt;img alt="Benchmark float32" class="align-center" src="https://blosc.org/images/blosc2-matmul/float32.png"&gt;
&lt;img alt="Benchmark float64" class="align-center" src="https://blosc.org/images/blosc2-matmul/float64.png"&gt;
&lt;p&gt;For smaller matrices, the overhead of managing chunks in Blosc2 can result in
lower GFLOPS compared to NumPy. As the matrix size increases, Blosc2 scales
well, with its performance approaching that of NumPy.&lt;/p&gt;
&lt;p&gt;Each chunk shape exhibits peak performance when the matrix size matches, or
is a multiple of, the chunk shape.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The new matrix multiplication feature in Blosc2 introduces efficient, chunked
computation for compressed arrays. This allows users to handle large datasets
both in memory and on disk without sacrificing performance. The implementation
supports a wide range of data types, making it versatile for various numerical
applications.&lt;/p&gt;
&lt;p&gt;Real-world applications, such as neural network training, demonstrate the
potential benefits in scenarios where memory constraints and large data sizes
are common. While there are some limitations —such as support only for 2D arrays
and the overhead of blocking— the outlook is promising, including
potential integration with deep learning frameworks.&lt;/p&gt;
&lt;p&gt;Overall, Blosc2 offers a compelling alternative for applications where the
advantages of compression and out-of-core computation are critical, paving
the way for more efficient processing of massive datasets.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="getting-my-feet-wet-with-blosc2"&gt;
&lt;h2&gt;Getting my feet wet with Blosc2&lt;/h2&gt;
&lt;p&gt;In the initial phase of the project, my biggest challenge was understanding how
Blosc2 manages data internally. For matrix multiplication, it was critical to
grasp how to choose the right chunks, since the operation requires that the
chunk ranges of both matrices be aligned. After some consideration and a few insightful
conversations with Francesc, I finally understood the underlying mechanics.
This breakthrough allowed me to begin implementing the first versions of my
solution, adjusting the data fragmentation so that each block was properly
aligned for precise computation.&lt;/p&gt;
&lt;p&gt;Another important aspect was adapting to the professional workflow of using Git
for version control. Embracing Git —with its branch creation, regular commits,
and conflict resolution— represented a significant shift in my development
approach. This experience not only improved the organization of my code and
facilitated collaboration but also instilled a structured and disciplined
mindset in managing my projects. This tool has proven both valuable and
extremely helpful.&lt;/p&gt;
&lt;p&gt;Finally, the moment when the function finally returned the correct result was
really exciting. After multiple iterations, the rigorous debugging process paid
off as everything fell into place. This breakthrough validated the robustness
of the implementation and boosted my confidence to further optimize and tackle
new challenges in data processing.&lt;/p&gt;
&lt;/section&gt;</description><category>blosc2 optimization matrix multiplication matmul compression</category><guid>https://blosc.org/posts/optimizing-chunks-blosc2/</guid><pubDate>Wed, 12 Mar 2025 09:00:00 GMT</pubDate></item></channel></rss>