Basics: compressing data with the SChunk class#
Python-Blosc2 is a thin wrapper for the C-Blosc2 format and compression library. It lets you quickly and easily create, append, insert, update and delete data and metadata in a super-chunk container (the SChunk class).
[1]:
import numpy as np
import blosc2
Create a new SChunk instance#
Let’s configure the parameters that differ from the defaults:
[18]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.BLOSCLZ,
    typesize=4,
    nthreads=8,
)
dparams = blosc2.DParams(
    nthreads=16,
)
storage = blosc2.Storage(
    contiguous=True,
    urlpath="myfile.b2frame",
    mode="w",  # create a new file
)
Now we can create an SChunk instance:
[19]:
schunk = blosc2.SChunk(chunksize=10_000_000, cparams=cparams, dparams=dparams, storage=storage)
schunk
[19]:
<blosc2.schunk.SChunk at 0x110ace9e0>
Great! You have created your first super-chunk with your chosen compression codec and typesize, and it will be persisted on disk.
Append and read data#
We are going to add some data. First, let’s create the dataset (100 arrays of 10 MB each, about 1 GB in total):
[20]:
buffer = [i * np.arange(2_500_000, dtype="int32") for i in range(100)]
[21]:
%%time
for i in range(100):
    nchunks = schunk.append_data(buffer[i])
    assert nchunks == (i + 1)
CPU times: user 774 ms, sys: 289 ms, total: 1.06 s
Wall time: 639 ms
[22]:
!ls -lh myfile.b2frame
-rw-r--r-- 1 oma staff 54M Oct 8 09:51 myfile.b2frame
So, while we have added 100 chunks of 10 MB each (about 1 GB in total), the frame on disk takes only about 54 MB, which is roughly 18 times smaller. This is how compression helps you use fewer resources.
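The sizes involved can be worked out by hand. The sketch below redoes the arithmetic in plain Python, taking the 54M figure from the `ls -lh` listing and assuming it is expressed in binary megabytes (2**20 bytes). Since that figure is rounded, the resulting ratio is only approximate.

```python
ITEMS_PER_CHUNK = 2_500_000
ITEM_SIZE = 4   # bytes per int32 item
N_CHUNKS = 100

chunk_bytes = ITEMS_PER_CHUNK * ITEM_SIZE   # 10_000_000 bytes per chunk
total_bytes = chunk_bytes * N_CHUNKS        # 1_000_000_000 bytes uncompressed

# Assumption: the "54M" shown by `ls -lh` means 54 * 2**20 bytes.
on_disk_bytes = 54 * 2**20

ratio = total_bytes / on_disk_bytes
print(f"uncompressed: {total_bytes / 1e6:.0f} MB")
print(f"approximate compression ratio: {ratio:.1f}x")
```

Under that assumption, the ratio comes out a little under 18.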
Now, let’s read the chunks from disk:
[23]:
dest = np.empty(2_500_000, dtype="int32")
[24]:
%%time
for i in range(100):
    chunk = schunk.decompress_chunk(i, dest)
CPU times: user 379 ms, sys: 333 ms, total: 711 ms
Wall time: 282 ms
[25]:
check = 99 * np.arange(2_500_000, dtype="int32")
np.testing.assert_equal(dest, check)
Updating and inserting#
First, let’s update the first chunk:
[26]:
data_up = np.arange(2_500_000, dtype="int32")
chunk = blosc2.compress2(data_up)
[27]:
%%time
schunk.update_chunk(nchunk=0, chunk=chunk)
CPU times: user 305 µs, sys: 1.13 ms, total: 1.43 ms
Wall time: 1.62 ms
[27]:
100
And then, insert another one at position 4:
[28]:
%%time
schunk.insert_chunk(nchunk=4, chunk=chunk)
CPU times: user 269 µs, sys: 1.05 ms, total: 1.32 ms
Wall time: 2.48 ms
[28]:
101
In this case the return value is the new number of chunks in the super-chunk.
Add user meta info#
In Blosc2 there are two kinds of meta information that you can add to an SChunk. The first must be added when the SChunk is created, cannot be deleted, and must always keep the same size in bytes. It is known as meta, and works like a dictionary.
[29]:
schunk = blosc2.SChunk(meta={"meta1": 234})
schunk.meta.keys()
[29]:
['meta1']
[30]:
schunk.meta["meta1"]
[30]:
234
[31]:
schunk.meta["meta1"] = 235
schunk.meta["meta1"]
[31]:
235
The other one is known as vlmeta, which stands for “variable length metadata”. As the name suggests, it is meant to store general, variable-length data (incidentally, this is more flexible than what you can store as regular data, which always has the same typesize). You can add an entry after the SChunk has been created, update it with a value of a different size in bytes, or delete it.
vlmeta follows the dictionary interface, so adding info is as easy as:
[32]:
schunk.vlmeta["info1"] = "This is an example"
schunk.vlmeta["info2"] = "of user meta handling"
schunk.vlmeta.getall()
[32]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}
You can also delete an entry as you would do with a dictionary:
[33]:
del schunk.vlmeta["info1"]
schunk.vlmeta.getall()
[33]:
{b'info2': 'of user meta handling'}
That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!