Slicing, extending and serializing#
The newest and coolest way to store data in python-blosc2 is through a SChunk (super-chunk) object, in which the data is split into chunks of the same size. In the past, the only way of working with it was chunk by chunk (see the SChunk basics tutorial), but python-blosc2 can now retrieve, update or append data at item level, so you no longer have to handle individual chunks yourself. To see how this works, let’s first create our SChunk.
[11]:
import blosc2
import numpy as np
nchunks = 10
data = np.arange(200 * 1000 * nchunks, dtype=np.int32)
cparams = {"typesize": 4}
schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)
It is important to set the typesize correctly, because the item-level methods used in this tutorial work with items and not with bytes.
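To make the role of the typesize more concrete, here is a minimal sketch of the arithmetic that relates an item index to a chunk, using the chunk size from the cell above. The helper `locate_item` is hypothetical (it is not part of python-blosc2) and only illustrates why item-level indexing depends on the typesize.

```python
def locate_item(item, chunksize_bytes, typesize):
    """Return (chunk index, item offset in that chunk, byte offset in that chunk).

    Hypothetical helper for illustration only; not a python-blosc2 function.
    """
    items_per_chunk = chunksize_bytes // typesize
    nchunk, offset_items = divmod(item, items_per_chunk)
    return nchunk, offset_items, offset_items * typesize


chunksize_bytes = 200 * 1000 * 4  # same chunksize as in the cell above

# With the correct typesize (4 bytes per int32), item 250_000 falls in chunk 1.
right = locate_item(250_000, chunksize_bytes, typesize=4)

# With a wrong typesize of 1, the same index is treated as a byte position
# and lands in chunk 0 instead.
wrong = locate_item(250_000, chunksize_bytes, typesize=1)

print(right)
print(wrong)
```

The two results differ, which is why a wrong typesize would make item-level slices point at the wrong data.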
Getting data from a SChunk#
Let’s begin by retrieving the data from the whole SChunk. We could use the decompress_chunk method:
[12]:
out = np.empty(200 * 1000 * nchunks, dtype=np.int32)
for i in range(nchunks):
    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])
But instead of the code above, we can simply use the __getitem__ or the get_slice methods. Let’s begin with __getitem__:
[13]:
out_slice = schunk[:]
type(out_slice)
[13]:
bytes
As you can see, the data is returned as a bytes object. If we want a more meaningful container instead, we can use get_slice, where you can pass any Python object supporting the Buffer Protocol as the out parameter to fill it with the data. In this case we will use a NumPy array container.
[14]:
out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)
schunk.get_slice(out=out_slice)
assert np.array_equal(out, out_slice)
print(out_slice[:4])
[0 1 2 3]
That’s the expected data indeed!
Setting data in a SChunk#
We can also set the data of an area of a SChunk from any Python object supporting the Buffer Protocol. Let’s see a quick example:
[15]:
start = 34
stop = 1000 * 200 * 4
new_value = np.ones(stop - start, dtype=np.int32)
schunk[start:stop] = new_value
We have seen how to get or set data. But what if we would like to add data? Well, you can still do that with __setitem__.
[16]:
schunk_nelems = 1000 * 200 * nchunks
new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)
start = schunk_nelems - 123
new_nitems = start + new_value.size
schunk[start:new_nitems] = new_value
Here, start is less than the number of elements in the SChunk and new_nitems is larger than it; that means that __setitem__ can update and append data at the same time, and you don’t have to worry about whether you are exceeding the limits of the SChunk.
Building a SChunk from/as a contiguous buffer#
Furthermore, you can convert a SChunk to a contiguous, serialized buffer and vice-versa. Let’s get that buffer (aka cframe) first:
[17]:
buf = schunk.to_cframe()
And now the other way around:
[18]:
schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)
In this case we set the copy parameter to True. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it for as long as you use the SChunk.
Serializing NumPy arrays#
If what you want is to create a serialized, compressed version of a NumPy array, you can use the newer (and faster) functions to store it either in-memory or on-disk. The specification of such a contiguous compressed representation, aka cframe can be seen here.
In-memory#
To obtain an in-memory representation, you can use pack_tensor. In comparison with its former version (pack_array), it is way faster and does not have the 2 GB size limitation:
[19]:
np_array = np.arange(2**30, dtype=np.int32) # 4 GB array
packed_arr2 = blosc2.pack_tensor(np_array)
unpacked_arr2 = blosc2.unpack_tensor(packed_arr2)
On-disk#
To store the serialized buffer on-disk, use save_tensor and load_tensor:
[20]:
blosc2.save_tensor(np_array, urlpath="ondisk_array.b2frame", mode="w")
np_array2 = blosc2.load_tensor("ondisk_array.b2frame")
np.array_equal(np_array, np_array2)
[20]:
True
Conclusions#
python-blosc2 now offers an easy yet fast way of creating, getting, setting and expanding data via the SChunk class. Moreover, you can get a contiguous compressed representation (aka cframe) of it and re-create it later with no sweat.