Compressing data with the SChunk class#

Python-Blosc2 is a thin wrapper for the C-Blosc2 format and compression library. It allows you to easily and quickly create, append, insert, update and delete data and metadata in a super-chunk container: the SChunk class. All other containers, like NDArray, C2Array, ProxySource, etc., are built on top of SChunk, so it is worth knowing how to use it.

[1]:
import numpy as np

import blosc2

Create a new SChunk instance#

Let’s configure the parameters that differ from the defaults:

[2]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.BLOSCLZ,
    typesize=4,
    nthreads=8,
)

dparams = blosc2.DParams(
    nthreads=16,
)

storage = blosc2.Storage(
    contiguous=True,
    urlpath="myfile.b2frame",
    mode="w",  # create a new file
)

Now we can create the SChunk instance:

[3]:
schunk = blosc2.SChunk(chunksize=10_000_000, cparams=cparams, dparams=dparams, storage=storage)
schunk
[3]:
<blosc2.schunk.SChunk at 0x10f95f4d0>

Great! You have created your first super-chunk with your desired compression codec and typesize, and it will be persisted on-disk.
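If you want to double-check the setup, the SChunk instance exposes its configuration as attributes. A minimal sketch (the chunksize, typesize and urlpath attributes are assumed from the SChunk API; adjust if your version differs):

# Quick sanity check of the super-chunk configuration
print(schunk.chunksize)  # 10_000_000, as requested above
print(schunk.typesize)   # 4, i.e. the int32 itemsize
print(schunk.urlpath)    # "myfile.b2frame"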

Append and read data#

We are going to add some data. First, let’s create the dataset: 100 arrays of 10 MB each (1 GB in total):

[4]:
buffer = [i * np.arange(2_500_000, dtype="int32") for i in range(100)]
[5]:
%%time
for i in range(100):
    nchunks = schunk.append_data(buffer[i])
    assert nchunks == (i + 1)
CPU times: user 847 ms, sys: 167 ms, total: 1.01 s
Wall time: 349 ms
[6]:
!ls -lh myfile.b2frame
-rw-r--r--  1 faltet  staff    54M Nov 27 13:55 myfile.b2frame

So, while we have added 1 GB of data (100 chunks of 10 MB each), the size of the frame on-disk is only around 54 MB. This is how compression helps you use fewer resources.
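You can also query the compression ratio directly from the super-chunk. A short sketch, assuming the nbytes, cbytes and cratio attributes of SChunk:

# Uncompressed size, compressed size and the resulting ratio
print(schunk.nbytes)  # bytes of the original (uncompressed) data
print(schunk.cbytes)  # bytes actually stored (compressed)
print(schunk.cratio)  # nbytes / cbytes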

Now, let’s read the chunks from disk:

[7]:
dest = np.empty(2_500_000, dtype="int32")
[8]:
%%time
for i in range(100):
    chunk = schunk.decompress_chunk(i, dest)
CPU times: user 790 ms, sys: 471 ms, total: 1.26 s
Wall time: 235 ms
[9]:
check = 99 * np.arange(2_500_000, dtype="int32")
np.testing.assert_equal(dest, check)
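Besides decompressing whole chunks, you can also read a range of items independently of the chunk boundaries. A hedged sketch using SChunk.get_slice (assumed API; start and stop are expressed in items of typesize, and a bytes object is returned when no output buffer is passed):

# Read the first 10 items of the super-chunk and view them as int32
raw = schunk.get_slice(start=0, stop=10)
first_items = np.frombuffer(raw, dtype="int32")
# Chunk 0 holds buffer[0] = 0 * arange(...), so these items should all be zeros
print(first_items)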

Updating and inserting#

First, let’s update the first chunk:

[10]:
data_up = np.arange(2_500_000, dtype="int32")
chunk = blosc2.compress2(data_up)
[11]:
%%time
schunk.update_chunk(nchunk=0, chunk=chunk)
CPU times: user 183 µs, sys: 566 µs, total: 749 µs
Wall time: 673 µs
[11]:
100

And then, insert another one at position 4:

[12]:
%%time
schunk.insert_chunk(nchunk=4, chunk=chunk)
CPU times: user 208 µs, sys: 997 µs, total: 1.21 ms
Wall time: 1.14 ms
[12]:
101

As with update_chunk, the return value is the number of chunks in the super-chunk, which has now grown to 101.
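Chunks can be removed as well. A minimal sketch using SChunk.delete_chunk (assumed from the SChunk API) to drop the chunk we just inserted at position 4:

# Remove the chunk previously inserted at position 4
nchunks = schunk.delete_chunk(nchunk=4)
print(nchunks)  # expected to be back to 100, mirroring how insert/update report the chunk count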

Add user meta info#

In Blosc2 there are two kinds of meta information that you can add to an SChunk. The first kind must be added when the SChunk is created, cannot be deleted, and must always keep the same byte size. It is known as meta, and it works like a dictionary.

[13]:
schunk = blosc2.SChunk(meta={"meta1": 234})
schunk.meta.keys()
[13]:
['meta1']
[14]:
schunk.meta["meta1"]
[14]:
234
[15]:
schunk.meta["meta1"] = 235
schunk.meta["meta1"]
[15]:
235

The other kind is known as vlmeta, which stands for “variable length metadata”; as the name suggests, it is meant to store general, variable-length data (incidentally, this is more flexible than what you can store as regular data, which always has the same typesize). You can add an entry after the creation of the SChunk, update it with a value of a different byte size, or delete it.

vlmeta follows the dictionary interface, so adding info is as easy as:

[16]:
schunk.vlmeta["info1"] = "This is an example"
schunk.vlmeta["info2"] = "of user meta handling"
schunk.vlmeta.getall()
[16]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

You can also delete an entry as you would do with a dictionary:

[17]:
del schunk.vlmeta["info1"]
schunk.vlmeta.getall()
[17]:
{b'info2': 'of user meta handling'}
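Because entries are variable length, you can also replace an existing value with one of a different size, or even store richer Python objects (vlmeta serializes its values, so plain numbers, lists or dicts should work; this is a sketch, so check the behavior in your version):

# Overwrite an entry with a longer value and add a structured one
schunk.vlmeta["info2"] = "a considerably longer description of this dataset"
schunk.vlmeta["params"] = {"codec": "BLOSCLZ", "typesize": 4}
schunk.vlmeta.getall()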

That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!