Expressions containing NDArray objects#
Python-Blosc2 implements a powerful way to operate with NDArray (and other flavors) objects. In this section, we will see how to do computations with NDArray arrays in a simple way.
[1]:
import numpy as np
import blosc2
A simple example#
First, let’s create a couple of NDArrays.
[2]:
shape = (500, 1000)
a = blosc2.linspace(0, 1, np.prod(shape), dtype=np.float32, shape=shape, urlpath="a.b2nd", mode="w")
b = blosc2.linspace(1, 2, np.prod(shape), dtype=np.float64, shape=shape, urlpath="b.b2nd", mode="w")
Now, let’s create an expression that involves a
and b
.
[3]:
c = a**2 + b**2 + 2 * a * b + 1
print(c.info) # at this stage, the expression has not been computed yet
type : LazyExpr
expression : ((((o0 ** 2) + (o1 ** 2)) + ((2 * o0) * o1)) + 1)
operands : {'o0': 'a.b2nd', 'o1': 'b.b2nd'}
shape : (500, 1000)
dtype : float64
We see that the outcome of the expression is a LazyExpr
object. This object is a placeholder for the actual computation that will be done when we compute it. This is a very powerful feature because it allows us to build complex expressions without actually computing them until we really need the result.
Now, let’s compute it. LazyExpr
objects follow the LazyArray interface, and this provides several ways for performing the computation, depending on the object we want as the desired output.
First, let’s use the compute
method. The result will be another NDArray array:
[4]:
d = c.compute() # compute the expression
print(f"Class: {type(d)}")
print(f"Compression ratio: {d.schunk.cratio:.2f}x")
Class: <class 'blosc2.ndarray.NDArray'>
Compression ratio: 1.88x
We can specify different compression parameters for the result. For example, we can change the codec to ZLIB
, use the bitshuffle filter, and the compression level set to 9:
[5]:
cparams = blosc2.CParams(codec=blosc2.Codec.ZLIB, filters=[blosc2.Filter.BITSHUFFLE], clevel=9)
d = c.compute(cparams=cparams)
print(f"Compression ratio: {d.schunk.cratio:.2f}x")
Compression ratio: 1.97x
Or, we can store the result in a file:
[6]:
d = c.compute(urlpath="result.b2nd", mode="w")
!ls -lh result.b2nd
Detected ARM ...
-rw-r--r--@ 1 francesc staff 2.0M Nov 30 12:59 result.b2nd
Note that all the output is stored in the file as computation proceeds; this is an efficient way to store large results on disk. Incidentally, both operands and results are stored on disk here, so you can operate with very large arrays in a very small memory footprint.
Now, let’s compute the expression and store the result in a NumPy array. For this, we will use the __getitem__
method:
[7]:
npd = d[:]
print(f"Class: {type(npd)}")
Class: <class 'numpy.ndarray'>
As you can see, the result is a NumPy array now.
Depending on your needs, you can choose to get the result as a NDArray array or as a NumPy array. The former is more storage efficient, but the latter is more flexible when interacting with other libraries that do not support NDArray arrays.
Saving expressions to disk#
You can save literal expressions to disk (and only computed results). For this, use the save
method of LazyArray
objects. For example, let’s save the expression c
to disk:
[8]:
c = a**2 + b**2 + 2 * a * b + 1
c.save(urlpath="expr.b2nd")
And you can load it back with the open
function:
[9]:
c2 = blosc2.open("expr.b2nd")
print(c2.info)
type : LazyExpr
expression : o0 ** 2 + o1 ** 2 + 2 * o0 * o1 + 1
operands : {'o0': 'a.b2nd', 'o1': 'b.b2nd'}
shape : (500, 1000)
dtype : float64
Now, you can compute it as before:
[10]:
d2 = c2.compute()
print(f"Compression ratio: {d2.schunk.cratio:.2f}x")
Compression ratio: 1.88x
Reductions#
We can also perform reductions as part of expressions. Let’s see an example:
[11]:
c = (a + b).sum()
c
[11]:
np.float64(999999.9999999473)
As we can see, the result is a scalar. That means that reductions in expressions always perform the computation immediately.
We can also specify the axis for the reduction:
[12]:
c = (a + b).sum(axis=1)
print(f"Shape of c: {c.shape}")
# Show the first 4 elements of the result
c[:4]
Shape of c: (500,)
[12]:
array([1001.998004 , 1005.998012 , 1009.99802 , 1013.99802799])
Reductions can also be part of more complex expressions:
[13]:
c = (a + b).sum(axis=0) + 2 * a + 1
print(f"Shape of c: {c.shape}")
# Show the first 4 elements of the result
c[0, 0:4]
Shape of c: (500, 1000)
[13]:
array([1000.0010009 , 1000.00300336, 1000.00500598, 1000.00700854])
In particular, note that the result of the reduction above has a different shape than a
, but the expression is still computed correctly. This is because the shape of the reduction is compatible with the shape of the operands. See below for more information on broadcasting.
Broadcasting#
NumPy arrays support broadcasting, and so do NDArray arrays. Let’s see an example:
[14]:
b2 = b[0] # take the first row of b
print(f"Shape of a: {a.shape}, shape of b2: {b2.shape}")
Shape of a: (500, 1000), shape of b2: (1000,)
We see that the shapes of a
and b2
are different. However, as the shapes are compatible, we can still operate with them and the broadcasting will be done automatically (à la NumPy):
[15]:
c2 = a + b2
d2 = c2.compute()
print(f"Compression ratio: {d2.schunk.cratio:.2f}x, shape: {d2.shape}")
Compression ratio: 17.89x, shape: (500, 1000)
The broadcasting is done in efficiently, and only the necessary chunks are computed. This is a powerful feature that allows you to operate with arrays of different shapes in a very simple way.
Querying NDArray arrays#
A powerful feature of Blosc2 compute engine is its ability to do queries on NDArray arrays with structured types. Let’s see an example.
[16]:
N = 1000_000
rng = np.random.default_rng(seed=1)
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(
it, dtype=[("A", "i4"), ("B", "f4"), ("C", "f8")], shape=(N,), urlpath="sa-1M.b2nd", mode="w"
)
print("First 3 rows:\n", sa[:3])
First 3 rows:
[( 1, -2., 0.34558419) ( 0, -1., 0.82161814) (-1, 0., 0.33043708)]
Now, we can select rows depending on the value of different fields:
[17]:
A = sa["A"]
B = sa["B"]
C = sa["C"]
expr = sa[A > B]
expr[:]
[17]:
array([(1, -2., 0.34558419), (0, -1., 0.82161814)],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])
We can do the same on a more compact way by using an expression in string form inside the brackets:
[18]:
expr = sa["A > B"]
expr[:]
[18]:
array([(1, -2., 0.34558419), (0, -1., 0.82161814)],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])
The expression can also be a complex one:
[19]:
expr = sa["(A > B) & (sin(C) > .5)"]
expr[:]
[19]:
array([(0, -1., 0.82161814)],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])
We can also query and extract a single field:
[20]:
C["A > B"][:]
[20]:
array([0.34558419, 0.82161814])
And perform reductions on queries on a single field:
[21]:
C[((C > 0) & (B < 0))].sum()
[21]:
np.float64(1.1672023355659444)
Finally, more complex queries can be done using the where()
function. For example, let’s sum all the rows with the maximum of field A
or field B
:
[22]:
blosc2.where(A > B, A, B).sum()
[22]:
np.float64(499997527552.0)
Combining all this weaponry allows to query your data on a simple and efficient way. As the computation is lazy, all the operations are grouped and executed together for maximum performance. The only exception is that, when a reduction is found, it is computed eagerly, but it can still be part of more general expressions, as well as being able to be saved and loaded from disk.
Summary#
In this section, we have seen how to perform computations with NDArray arrays, and more in particular, how to create expressions, compute them, and save them to disk. Also, we have looked at performing reductions, broadcasting, selections and combinations of both. Lazy expressions allow you to build and compute complex computations from operands that can be in-memory, on-disk or remote (C2Array
) in a simple and effective way.