Persistent reductions and broadcast in Lazy Expressions
In this tutorial, we’ll explore Blosc2’s capabilities for lazy computation in Python. We’ll create arrays of various dimensions, combine them with operations such as reduction, addition and multiplication, and demonstrate how lazy expressions defer computation to improve performance.
Lazy expressions are efficient because they postpone evaluation until the result is actually needed, which removes the need for large temporaries and so reduces memory usage and processing time.
However, reductions are an exception: they are always computed eagerly when using regular Python expressions with Blosc2 operands. Fortunately, we can avoid the eager computation by passing a string version of the expression to the blosc2.lazyexpr function. We will show how to create and save a lazy expression, and then compute it to obtain the desired results.
We’ll also see how resizing operand arrays is reflected in the results, highlighting the flexibility of lazy expressions.
Without further ado, let’s dive into lazy computation, reductions and broadcasting with Blosc2!
Operands as arrays of different shape
We will now create the operands, giving each one a different shape to exercise the broadcasting capabilities of lazy expressions.
[6]:
import time
import blosc2
# Define dimensions of arrays
dim_a = (200, 300, 400) # 3D array
dim_b = (200, 400) # 2D array
dim_c = 400 # 1D array
# Create arrays with specific dimensions and values
a = blosc2.full(dim_a, 1, urlpath="a.b2nd", mode="w")
b = blosc2.full(dim_b, 2, urlpath="b.b2nd", mode="w")
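c = blosc2.full(dim_c, 3, urlpath="c.b2nd", mode="w")
# Print a small slice of each array
print("Array a slice:", a[:2, :2, :4])
print("Array b slice:", b[:2, :4])
print("Array c slice:", c[:4])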
Array a slice: [[[1 1 1 1]
[1 1 1 1]]
[[1 1 1 1]
[1 1 1 1]]]
Array b slice: [[2 2 2 2]
[2 2 2 2]]
Array c slice: [3 3 3 3]
Creating, saving and loading a lazy expression
First, let’s build a string expression that sums the contents of array a and adds the result to the product of b and c. In this context, creating a string version of the expression is critical; otherwise, the reduction would be computed eagerly.
Let’s see how this works.
[9]:
# Expression that sums all elements of 'a' and multiplies 'b' by 'c'
expression = "a.sum() + b * c"
# Define the operands for the expression
operands = {"a": a, "b": b, "c": c}
# Create a lazy expression
lazy_expression = blosc2.lazyexpr(expression, operands)
# Save the lazy expression to the specified path
url_path = "my_expr.b2nd"
lazy_expression.save(urlpath=url_path, mode="w")
In the code above, an expression combining the arrays a, b, and c is written in string form: a.sum() + b * c. We then build a lazy expression from it and save it for later. The chosen expression illustrates how operations automatically adapt to the dimensions of the operands through broadcasting.
Broadcasting allows arrays of different shapes (dimensions) to align for mathematical operations, such as addition or multiplication, without the need to enlarge operands by replicating data. The main idea is that smaller dimensions are “stretched” to larger dimensions in such a way that allows the operation to be performed consistently.
See NumPy docs on broadcasting for more information.
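As a small illustration of the "stretching" idea, the following sketch uses plain NumPy arrays with made-up values (the shapes are scaled-down stand-ins for b and c, and the scalar plays the role of a.sum()). It compares the broadcast result with one obtained by explicitly replicating the smaller operand.

```python
import numpy as np

# Scaled-down stand-ins for the operands (values chosen for illustration only)
b = np.arange(8).reshape(2, 4)      # 2D: [[0 1 2 3], [4 5 6 7]]
c = np.array([10, 20, 30, 40])      # 1D: one value per column of b
total = 100                         # scalar, like the result of a full reduction

# Broadcasting: c is treated as if it were repeated for every row of b,
# and the scalar as if it were repeated for every element
result = total + b * c

# The same computation with c explicitly replicated to the shape of b
c_tiled = np.tile(c, (2, 1))
explicit = total + b * c_tiled

print(result)
print(np.array_equal(result, explicit))
```

Both versions give the same values, but the broadcast one never materializes the replicated copy of c.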
Now that we have saved the expression, we can open and compute it to obtain the result. Let’s see how this is done.
[11]:
lazy_expression = blosc2.open(urlpath=url_path)
# Print the lazy expression and its shape
print(lazy_expression)
t1 = time.time()
print(lazy_expression.shape)
t2 = time.time()
print(f"Time to get shape:{t2-t1:.5f}")
t1 = time.time()
result1 = lazy_expression.compute()
t2 = time.time()
print(f"Time to compute:{t2-t1:.5f}")
print("Result of the operation (slice):")
print(result1[:2, :4]) # Print a small slice of the result for demonstration
a.sum() + b * c
(200, 400)
Time to get shape:0.00004
Time to compute:0.05476
Result of the operation (slice):
[[24000006 24000006 24000006 24000006]
[24000006 24000006 24000006 24000006]]
As the timings show, getting the shape of the lazy expression takes far less time than computing it. This is because lazy_expression.shape does not need to compute any element of the expression; it only reads the metadata of the operands, from which the shape and data type of the result are inferred.
Thanks to this metadata, when the dimensions of the operands are known (as in a.sum() + b * c), Blosc2 can quickly infer the resulting shape without performing intensive calculations. This allows fast access to structural information (like the shape and dtype) without operating on the actual data.
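The idea of inferring a shape from metadata alone can be sketched in a few lines of plain Python. This is not Blosc2's actual implementation; the broadcast_shape helper below is a hypothetical function written for illustration, following NumPy-style broadcasting rules and working only on shape tuples.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Return the NumPy-style broadcast shape of the given shape tuples."""
    result = []
    # Compare dimensions from the last axis backwards, padding missing axes with 1
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"shapes {shapes} are not broadcastable")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))


# Only shapes are needed; no array data is read
shape_a_sum = ()         # a full reduction yields a scalar
shape_b = (200, 400)
shape_c = (400,)

inferred = broadcast_shape(shape_a_sum, shape_b, shape_c)
print(inferred)  # (200, 400)
```

Because the work depends only on the number of dimensions, not on the number of elements, this kind of inference stays cheap however large the operands are.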
In contrast, when we call lazy_expression.compute(), all the operations needed to calculate the final result are executed. This is where the real computation takes place and, as the timings show, it takes significantly longer.
Resizing operands of persisted lazy expressions
In this section, we will see how persisted lazy expressions automatically adapt to changes in the dimensions and values of the original operands, such as arrays a and b.
[12]:
# Resizing arrays and updating values to see changes in the expression result
a.resize((300, 300, 400))
a[200:300] = 3
b.resize((300, 400))
b[200:300] = 5
# Open the saved file
lazy_expression = blosc2.open(urlpath=url_path)
t1 = time.time()
print(lazy_expression.shape)
t2 = time.time()
print(f"Time to get shape:{t2-t1:.5f}")
t1 = time.time()
result2 = lazy_expression.compute()
t2 = time.time()
print(f"Time to compute:{t2-t1:.5f}")
print("Result of the operation (slice):")
print(result2[:2, :4])
(300, 400)
Time to get shape:0.00010
Time to compute:0.06103
Result of the operation (slice):
[[60000006 60000006 60000006 60000006]
[60000006 60000006 60000006 60000006]]
After enlarging the original arrays a and b with resize() and filling the new regions with values, the lazy expression is reopened. This step is crucial, as it lets us observe how the computation adapts to the new dimensions. Upon reopening the expression, we can check that the results accurately reflect the changes in the operands. Moreover, note that obtaining the structural information (the shape) of the expression is quick, requiring only a fraction of the time it takes for the complete computation.
This behavior highlights the ability of lazy expressions to adjust to their operands using metadata: the saved expression does not need to be rebuilt or saved again when an operand changes. This gives notable flexibility and efficiency when handling arrays of various shapes and sizes.
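The values printed before and after the resize can be checked by hand from the shapes and fill values used in this tutorial. The following plain-Python arithmetic does exactly that; the variable names are introduced here for illustration only.

```python
# Before resizing: a has shape (200, 300, 400) and every element is 1
sum_before = 200 * 300 * 400 * 1
# In the printed slice, b holds 2 and c holds 3
value_before = sum_before + 2 * 3
print(value_before)  # 24000006

# After resizing: a gains 100 planes of shape (300, 400), all set to 3
sum_after = sum_before + 100 * 300 * 400 * 3
# Rows 0 and 1 of b (the printed slice) still hold 2
value_after = sum_after + 2 * 3
print(value_after)  # 60000006

# The new rows of b (200 to 299) hold 5, so they give a different value
value_new_rows = sum_after + 5 * 3
print(value_new_rows)  # 60000015
```

The first two values match the outputs shown above, confirming that the reopened expression used the updated contents of a.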
Conclusion
The dynamic adaptation of lazy expressions to changes in the dimensions of array operands illustrates the power of deferred computations in Blosc2. By deferring the computation of expressions until necessary, Blosc2 can quickly access structural information like the shape and dtype, even when operands change on disk, without performing intensive calculations.
Also, broadcasting support facilitates working with arrays of different sizes, making the process more powerful and intuitive.
Understanding how operations are managed in this context enables developers and data scientists to make the most of reduction and broadcasting capabilities, thereby enhancing the efficiency and effectiveness of their analyses and calculations. The beauty of lazy expressions lies in their ability to simplify the complex and empower our creativity!