NDArray: A NDim, Compressed Data Container#

NDArray objects support common array operations such as setting, copying, and slicing. In this section, we are going to see how to create and manipulate an NDArray in a simple way.

[1]:
import numpy as np

import blosc2

Creating an array#

Let’s start by creating a 2-D array with 100M elements filled with arange.

[2]:
array = blosc2.arange(10_000 * 10_000, shape=(10_000, 10_000))
print(array.info)
type    : NDArray
shape   : (10000, 10000)
chunks  : (40, 10000)
blocks  : (1, 10000)
dtype   : int64
cratio  : 173.20
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=7, blocksize=80000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=7)

Note that all the compression and decompression parameters, as well as the chunk and block shapes, are set to their defaults.
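The chunk and block shapes reported by info determine how the array is partitioned. As a rough sanity check, the following sketch recomputes a few of those numbers with plain Python. It only assumes the shapes and the 8-byte item size printed above, and it does not call Blosc2 at all.

```python
import math

shape = (10_000, 10_000)   # array shape reported by info
chunks = (40, 10_000)      # chunk shape reported by info
blocks = (1, 10_000)       # block shape reported by info
itemsize = 8               # bytes per int64 element

# Number of chunks along each dimension (rounding up for partial chunks).
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
nchunks = math.prod(chunks_per_dim)

# Uncompressed size of one chunk and one block, in bytes.
chunk_nbytes = math.prod(chunks) * itemsize
block_nbytes = math.prod(blocks) * itemsize

print(chunks_per_dim, nchunks)   # (250, 1) 250
print(chunk_nbytes)              # 3200000
print(block_nbytes)              # 80000, the blocksize shown in cparams
```

So each chunk holds 40 full rows, and each block holds a single row.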

Reading and modifying data#

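We can read and modify NDArray arrays with NumPy-style indexing. Reads return NumPy arrays, and the values assigned can come from another array (below, NDArrays built with blosc2.zeros and blosc2.ones).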

[3]:
array[0]
[3]:
array([   0,    1,    2, ..., 9997, 9998, 9999])
[4]:
array[0, :] = blosc2.zeros(10000, dtype=array.dtype)
array[:, 0] = blosc2.ones(10000, dtype=array.dtype)
[5]:
array[0, 0]
[5]:
array(1)
[6]:
array[0, :]
[6]:
array([1, 0, 0, ..., 0, 0, 0])
[7]:
array[:, 0]
[7]:
array([1, 1, 1, ..., 1, 1, 1])

Enlarging the array#

Existing arrays can be enlarged.

[8]:
array.resize((10_001, 10_000))
print(array.shape)
array[10_000, :] = 1
array[10_000, :]
(10001, 10000)
[8]:
array([1, 1, 1, ..., 1, 1, 1])

Enlarging is a fast operation because the data is chunked: we only have to add more chunks to the array, so there is no need to copy all the data to a new location (as is generally the case for a NumPy array, which requires a full copy of the data).

You can also shrink the array.

[9]:
array.resize((9_000, 10_000))
print(array.shape)
print(array[8_999])  # This works
# array[9_000]  # This will raise an exception
(9000, 10000)
[       1 89990001 89990002 ... 89999997 89999998 89999999]

Persistent data#

We can use the save() method to store the array on disk. This is very useful when you have a large array that you want to keep around but do not need to access all the time.

[10]:
array.save("array_tutorial.b2nd", mode="w")
!ls -lh array_tutorial.b2nd
Detected ARM ...
-rw-r--r--@ 1 francesc  staff   4.2M Nov 30 09:27 array_tutorial.b2nd

For arrays, it is usual to use the .b2nd extension.

Also, when we create an NDArray, we can specify where it will be stored (via urlpath); the data is then written to disk instead of being kept in memory. We can also specify all the compression/decompression and other storage parameters.

[11]:
array1 = blosc2.full(
    (1000, 1000),
    fill_value=b"pepe",
    chunks=(100, 100),
    blocks=(50, 50),
    urlpath="array1_tutorial.b2nd",
    mode="w",
)
!ls -lh array1_tutorial.b2nd
Detected ARM ...
-rw-r--r--@ 1 francesc  staff   3.9K Nov 30 09:27 array1_tutorial.b2nd

This time we have set the chunks and blocks shapes.

Now, let’s reopen our original array.

[12]:
array2 = blosc2.open("array_tutorial.b2nd")
print(array2.info)
type    : NDArray
shape   : (9000, 10000)
chunks  : (40, 10000)
blocks  : (1, 10000)
dtype   : int64
cratio  : 162.65
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=1, blocksize=80000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=1)

And make sure that they are the same.

[13]:
np.all(array2[:] == array[:])
[13]:
np.True_

Compression params#

Let’s see how to make a copy of an NDArray while changing its compression parameters in an easy way.

[14]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.LZ4,
    clevel=9,
    filters=[blosc2.Filter.BITSHUFFLE],
    filters_meta=[0],
)

array2 = array.copy(chunks=(500, 10_000), blocks=(50, 10_000), cparams=cparams)
print(array2.info)
type    : NDArray
shape   : (9000, 10000)
chunks  : (500, 10000)
blocks  : (50, 10000)
dtype   : int64
cratio  : 760.55
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=7, blocksize=4000000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=7)

Metalayers and variable length metalayers#

We have seen that you can pass any compression or decompression parameters you want to the NDArray constructors. Now we will add metalayers as well. Metalayers are small pieces of metadata that describe the properties of the data stored in a container. As explained in the SChunk tutorial, there are two kinds of metalayers. The first kind (meta) is the system one: it must be added at construction time, it cannot be deleted, and it can only be updated with values that have the same size in bytes as the old value. Metalayers are easy for users to access and edit:

[15]:
meta = {"dtype": "i8", "coords": [5.14, 23.0]}
array = blosc2.zeros((1000, 1000), dtype=np.int16, chunks=(100, 100), blocks=(50, 50), meta=meta)

You can work with them as if they were a dictionary. The metalayers live in the underlying SChunk, and the NDArray exposes them through its meta attribute.

[16]:
array.meta
[16]:
<blosc2.schunk.Meta at 0x103a64b10>
[17]:
array.meta.keys()
[17]:
['b2nd', 'dtype', 'coords']

As you can see, Blosc2 internally uses such metalayers to store the shape, ndim, dtype, etc., and retrieves this data when needed. For example, the b2nd metalayer holds this info.

[18]:
array.meta["b2nd"]
[18]:
[0, 2, [1000, 1000], [100, 100], [50, 50], 0, '<i2']

And we can look at our own user meta:

[19]:
array.meta["coords"]
[19]:
[5.14, 23.0]

To add a metalayer after creation, or to store a variable-length metalayer, use the vlmeta accessor, which the NDArray exposes from its underlying SChunk. Like meta, it works as a dictionary.

[20]:
print(array.vlmeta[:])
array.vlmeta["info1"] = "This is an example"
array.vlmeta["info2"] = "of user meta handling"
array.vlmeta[:]  # this returns all the metadata as a dictionary
{}
[20]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

You can update them with a value larger than the original one:

[21]:
array.vlmeta["info1"] = "This is a larger example"
array.vlmeta
[21]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling'}

You can store any kind of data in the vlmeta metalayer, as long as it is serializable with msgpack. This is a very flexible way to store metadata in a Blosc2 container.

[22]:
array.vlmeta["info3"] = {"a": 1, "b": 2}
array.vlmeta
[22]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

Variable length metadata can be deleted:

[23]:
del array.vlmeta["info1"]
array.vlmeta
[23]:
{b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

This is very useful to store metadata that is not known at the time of creation of the container, or that can be updated or deleted at any time.

Creating an NDArray from a NumPy array#

Let’s create an NDArray from a NumPy array using the asarray constructor:

[24]:
shape = (100, 100, 100)
dtype = np.float64
nparray = np.linspace(0, 100, np.prod(shape), dtype=dtype).reshape(shape)
b2array = blosc2.asarray(nparray)
print(b2array.info)
b2array[0, 0, :4]
type    : NDArray
shape   : (100, 100, 100)
chunks  : (50, 100, 100)
blocks  : (1, 100, 100)
dtype   : float64
cratio  : 13.73
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=7, blocksize=80000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=7)

[24]:
array([0.    , 0.0001, 0.0002, 0.0003])

Building an NDArray from an iterator#

Finally, let’s see how you can create an NDArray filled with data from an iterator, store it in a file, and reopen it. Let’s create a structured array with 3 fields and 1 million elements.

[25]:
N = 1_000_000
rng = np.random.default_rng()
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
%time sa = blosc2.fromiter(it, dtype='i4,f4,f8', shape=(N,), urlpath="sa-1M.b2nd", mode="w")
!ls -lh sa-1M.b2nd
sa2 = blosc2.open("sa-1M.b2nd")
sa2.info
CPU times: user 499 ms, sys: 30.4 ms, total: 530 ms
Wall time: 504 ms
Detected ARM ...
-rw-r--r--@ 1 francesc  staff   7.0M Nov 30 09:27 sa-1M.b2nd
[25]:
type    : NDArray
shape   : (1000000,)
chunks  : (125000,)
blocks  : (4000,)
dtype   : [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')]
cratio  : 2.24
cparams : CParams(codec=, codec_meta=0, clevel=1, use_dict=False, typesize=16, nthreads=1, blocksize=64000, splitmode=, filters=[, , , , , ], filters_meta=[0, 0, 0, 0, 0, 0], tuner=)
dparams : DParams(nthreads=1)

That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!