NDArray: multidimensional SChunk#
NDArray functions let users perform different operations on NDArray arrays, such as setting, copying or slicing them. In this section, we will see how to create and manipulate an NDArray array in a simple way.
[26]:
import numpy as np
import blosc2
Creating an array#
First, we create an array, with zeros being used as the default value for uninitialized portions of the array.
[27]:
array = blosc2.zeros((10000, 10000), dtype=np.int32)
print(array.info)
type : NDArray
shape : (10000, 10000)
chunks : (25, 10000)
blocks : (2, 10000)
dtype : int32
cratio : 32500.00
cparams : {'blocksize': 80000,
'clevel': 1,
'codec': <Codec.ZSTD: 5>,
'codec_meta': 0,
'filters': [<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.SHUFFLE: 1>],
'filters_meta': [0, 0, 0, 0, 0, 0],
'nthreads': 4,
'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
'typesize': 4,
'use_dict': 0}
dparams : {'nthreads': 4}
Note that all the compression and decompression parameters, as well as the chunk and block shapes, are set to their defaults.
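The shapes reported above can be cross-checked with a little arithmetic. The following sketch uses only the values printed by `info` (it makes no blosc2 calls) and shows that the reported `blocksize` equals the number of items in a block times `typesize`. It also derives the uncompressed chunk size and the number of chunks from the same values. The variable names are illustrative.

```python
import math

# Values copied from the `info` output above
shape = (10000, 10000)
chunks = (25, 10000)
blocks = (2, 10000)
typesize = 4  # bytes per int32 item

# Uncompressed size of one block and one chunk, in bytes
block_nbytes = math.prod(blocks) * typesize
chunk_nbytes = math.prod(chunks) * typesize

# Number of chunks needed to cover the whole array
nchunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(block_nbytes)  # 80000, matching 'blocksize' in cparams
print(chunk_nbytes)  # 1000000
print(nchunks)       # 400
```

The same check works for the other arrays in this section: multiply the items in a block by `typesize` and compare the result with the `blocksize` that `info` reports.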
Reading and writing data#
We can read and modify NDArray arrays using NumPy-style indexing; reads return NumPy arrays.
[28]:
array[0, :] = np.arange(10000, dtype=array.dtype)
array[:, 0] = np.arange(10000, dtype=array.dtype)
[29]:
array[0, 0]
[29]:
array(0, dtype=int32)
[30]:
array[0, :]
[30]:
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
[31]:
array[:, 0]
[31]:
array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32)
Persistent data#
As with the SChunk, when we create an NDArray array we can specify where it will be stored. In fact, we can pass all the compression/decompression parameters that a SChunk accepts. So, as with the SChunk, to store an array on disk we only have to specify a urlpath where the new array will be stored.
[32]:
array = blosc2.full(
(1000, 1000),
fill_value=b"pepe",
chunks=(100, 100),
blocks=(50, 50),
urlpath="ndarray_tutorial.b2nd",
mode="w",
)
print(array.info)
type : NDArray
shape : (1000, 1000)
chunks : (100, 100)
blocks : (50, 50)
dtype : |S4
cratio : 1111.11
cparams : {'blocksize': 10000,
'clevel': 1,
'codec': <Codec.ZSTD: 5>,
'codec_meta': 0,
'filters': [<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.SHUFFLE: 1>],
'filters_meta': [0, 0, 0, 0, 0, 0],
'nthreads': 4,
'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
'typesize': 4,
'use_dict': 0}
dparams : {'nthreads': 4}
This time we even set the chunk and block shapes. You can now open it with modes w, a or r.
[33]:
array2 = blosc2.open("ndarray_tutorial.b2nd")
print(array2.info)
type : NDArray
shape : (1000, 1000)
chunks : (100, 100)
blocks : (50, 50)
dtype : |S4
cratio : 1111.11
cparams : {'blocksize': 10000,
'clevel': 1,
'codec': <Codec.ZSTD: 5>,
'codec_meta': 0,
'filters': [<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.SHUFFLE: 1>],
'filters_meta': [0, 0, 0, 0, 0, 0],
'nthreads': 1,
'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
'typesize': 4,
'use_dict': 0}
dparams : {'nthreads': 1}
Compression params#
Here we can see how, when making a copy of an NDArray array, we can easily change its compression parameters.
[38]:
b = np.arange(1000000, dtype=np.int64).tobytes()
array1 = blosc2.frombuffer(b, shape=(1000, 1000), dtype=np.int64, chunks=(500, 10), blocks=(50, 10))
print(array1.info)
type : NDArray
shape : (1000, 1000)
chunks : (500, 10)
blocks : (50, 10)
dtype : int64
cratio : 7.45
cparams : {'blocksize': 4000,
'clevel': 1,
'codec': <Codec.ZSTD: 5>,
'codec_meta': 0,
'filters': [<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.NOFILTER: 0>,
<Filter.SHUFFLE: 1>],
'filters_meta': [0, 0, 0, 0, 0, 0],
'nthreads': 4,
'splitmode': <SplitMode.ALWAYS_SPLIT: 1>,
'typesize': 8,
'use_dict': 0}
dparams : {'nthreads': 4}
[39]:
# Pass the new compression parameters as a plain dict, which NDArray.copy() accepts
cparams = {
    "codec": blosc2.Codec.ZSTD,
    "clevel": 9,
    "filters": [blosc2.Filter.BITSHUFFLE],
    "filters_meta": [0],
}
array2 = array1.copy(chunks=(500, 10), blocks=(50, 10), cparams=cparams)
print(array2.info)
Metalayers and variable length metalayers#
We have seen that you can pass to the NDArray constructor any compression or decompression parameters that you may pass to a SChunk. You can also pass a metalayer dict. Metalayers are small pieces of metadata that describe properties of the data stored in a container. As explained in the SChunk basics, there are two kinds. The first kind (meta) cannot be deleted, must be added at construction time, and can only be updated with values that have the same size in bytes as the old value. They are easy to access and edit by users:
[ ]:
meta = {"dtype": "i8", "coords": [5.14, 23.0]}
array = blosc2.zeros((1000, 1000), dtype=np.int16, chunks=(100, 100), blocks=(50, 50), meta=meta)
You can work with them as if you were working with a dictionary. To access this dictionary, use the schunk attribute of the NDArray.
[ ]:
array.schunk.meta
[23]:
array.schunk.meta.keys()
[23]:
['b2nd']
As you can see, Blosc2 uses the b2nd metalayer internally to store the shape, ndim, dtype, etc., and retrieves this data when needed.
[24]:
array.schunk.meta["b2nd"]
[24]:
[0, 2, [1000, 1000], [100, 100], [50, 50], 0, '|S4']
[25]:
array.schunk.meta["coords"]
To add a metalayer after creation, or to add a variable-length metalayer, use the vlmeta accessor of the SChunk. Like meta, it works similarly to a dictionary.
[ ]:
print(array.schunk.vlmeta.getall())
array.schunk.vlmeta["info1"] = "This is an example"
array.schunk.vlmeta["info2"] = "of user meta handling"
array.schunk.vlmeta.getall()
You can update them with a value larger than the original one:
[ ]:
array.schunk.vlmeta["info1"] = "This is a larger example"
array.schunk.vlmeta.getall()
Creating an NDArray from a NumPy array#
Let’s create an NDArray from a NumPy array using the asarray constructor:
[ ]:
shape = (100, 100, 100)
dtype = np.float64
nparray = np.linspace(0, 100, np.prod(shape), dtype=dtype).reshape(shape)
b2ndarray = blosc2.asarray(nparray)
print(b2ndarray.info)
Building an NDArray from a buffer#
Furthermore, you can create an NDArray filled with data from a buffer:
[ ]:
rng = np.random.default_rng()
# Raw bytes of normally distributed float64 values (scaled by 8)
buffer = (rng.normal(size=np.prod(shape)) * 8).tobytes()
b2ndarray = blosc2.frombuffer(buffer, shape, dtype=dtype)
print("Compression ratio:", b2ndarray.schunk.cratio)
b2ndarray[:5, :5, :5]
That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!