Blosc/Blosc2 Chunk Format

A regular chunk is composed of a header and a blocks section:

+---------+--------+
|  header | blocks |
+---------+--------+

There are also so-called lazy chunks, which do not contain the compressed data itself, only the meta-information needed to read it. Lazy chunks typically appear when reading data from persistent media. A lazy chunk has the header and bstarts sections in place, plus a trailer that allows the data blocks to be read:

+---------+---------+---------+
|  header | bstarts | trailer |
+---------+---------+---------+

All these sections are described below. Note that the bstarts section is described as part of the blocks section.

Note: All integer types in this document are stored in little-endian byte order.

Blocks

The blocks section is composed of a list of offsets to the start of each block, an optional dictionary to aid in compression, and finally a list of compressed data streams:

+=========+======+=========+
| bstarts | dict | streams |
+=========+======+=========+

All blocks have the same size, given by the blocksize header field, except the last block, which may be shorter.

Block starts

The block starts section contains a list of offsets (int32_t bstarts) that indicate where each block starts in the chunk. These offsets are relative to the start of the chunk and point to the start of one or more compressed data streams containing the contents of the block:

+=========+=========+========+=========+
| bstart0 | bstart1 |   ...  | bstartN |
+=========+=========+========+=========+
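
As an illustration, the offsets can be read with a few lines of Python. This is only a minimal sketch: the function name read_bstarts is hypothetical, and both bstarts_offset (where the section begins, that is, the header length) and nblocks are assumed to be known from the header, whose layout is not covered in this section.

```python
import struct


def read_bstarts(chunk: bytes, bstarts_offset: int, nblocks: int) -> list[int]:
    """Return the nblocks block offsets stored at bstarts_offset.

    bstarts_offset and nblocks are assumptions supplied by the caller;
    this sketch does not parse the header itself.
    """
    end = bstarts_offset + 4 * nblocks
    if bstarts_offset < 0 or end > len(chunk):
        raise ValueError("chunk too short for the bstarts section")
    # Each offset is a little-endian int32, relative to the chunk start.
    return list(struct.unpack_from(f"<{nblocks}i", chunk, bstarts_offset))
```

Since the offsets are relative to the start of the chunk, a value returned here can be used directly as an index into chunk.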

Dictionary (optional)

Only for C-Blosc2

Dictionaries are small datasets that are known to be repeated often and can help compress the data in blocks better. The dictionary section contains the size of the dictionary (int32_t dsize) followed by the dictionary data:

+=======+=================+
| dsize | dictionary data |
+=======+=================+
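
A length-prefixed section like this one can be read as sketched below. The function name read_dictionary and the dict_offset parameter are hypothetical. The caller is assumed to know that a dictionary is present and where it starts, which this section does not specify.

```python
import struct


def read_dictionary(chunk: bytes, dict_offset: int) -> tuple[bytes, int]:
    """Return (dictionary_data, offset_just_past_the_dictionary).

    dict_offset is an assumption supplied by the caller.
    """
    # dsize is a little-endian int32 that precedes the dictionary bytes.
    (dsize,) = struct.unpack_from("<i", chunk, dict_offset)
    start = dict_offset + 4
    end = start + dsize
    if dsize < 0 or end > len(chunk):
        raise ValueError("invalid dictionary size")
    return chunk[start:end], end
```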

Compressed Data Streams

Compressed data streams are the compressed set of bytes that are passed to codecs for decompression. Each compressed data stream (uint8_t* cdata) is stored with the size of the stream (int32_t csize) preceding it:

+=======+=======+
| csize | cdata |
+=======+=======+

There are two special cases for int32_t csize. If it is zero, the stream consists entirely of zeros and there is no cdata section; the actual size of the stream is inferred from blocksize and whether or not the block is split. If it is negative, the stream is stored like this:

+=======+=======+=======+
| csize | token | cdata |
+=======+=======+=======+

where token is a byte whose bits determine how int32_t csize is interpreted:

bit 0:

Repeated byte (stream is a run-length of bytes). This byte, representing the repeated value in the stream, is encoded in the LSB of the int32_t csize. In this case there is not a cdata section. Note that repeated zeros cannot happen here (already handled by the csize == 0 case above).

bits 1 and 2:

Reserved for two-codecs in a row. TODO: complete description

bits 3, 4 and 5:

Reserved for secondary codec. TODO: complete description

bits 6 and 7:

Reserved for future use.
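
The three cases for csize described above can be told apart as in the following sketch. The function name classify_stream and the kind labels are hypothetical. The sketch only classifies a stream and reports where the next one would begin. It does not decompress cdata or decode the repeated byte value, and it rejects the token bits that are documented as reserved.

```python
import struct


def classify_stream(buf: bytes, pos: int) -> tuple[str, int]:
    """Return (kind, next_pos) for the stream that starts at pos."""
    (csize,) = struct.unpack_from("<i", buf, pos)  # little-endian int32
    pos += 4
    if csize == 0:
        # Stream made entirely of zeros: no cdata section follows.
        return "zeros", pos
    if csize > 0:
        # Regular stream: csize bytes of cdata follow.
        if pos + csize > len(buf):
            raise ValueError("stream runs past the end of the buffer")
        return "compressed", pos + csize
    # Negative csize: a one-byte token follows.
    if pos >= len(buf):
        raise ValueError("missing token byte")
    token = buf[pos]
    pos += 1
    if token & 0x01:
        # Bit 0 set: repeated-byte run, no cdata section follows.
        return "repeated-byte", pos
    raise ValueError(f"reserved or unsupported token bits: {token:#04x}")
```

Calling the function repeatedly, feeding next_pos back in as pos, walks the consecutive streams of a block.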

If bit 4 of the flags header field is set, each block is stored in a single data stream:

+=========+
| stream0 |
+=========+
| block0  |
+=========+

If bit 4 of the flags header field is not set, each block can be stored using multiple data streams:

+=========+=========+=========+=========+
| stream0 | stream1 |    ...  | streamN |
+=========+=========+=========+=========+
| block0                                |
+=========+=========+=========+=========+

The uncompressed size of each block equals the blocksize field in the header, except for the last block, which may be smaller than blocksize.
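
The block-size rule amounts to simple arithmetic, sketched here. The function name block_sizes is hypothetical, and nbytes is assumed to be the total uncompressed size of the chunk, a value that is not defined in this section.

```python
def block_sizes(nbytes: int, blocksize: int) -> list[int]:
    """Return the uncompressed size of every block in a chunk.

    nbytes (total uncompressed chunk size) is an assumed input.
    """
    if nbytes < 0 or blocksize <= 0:
        raise ValueError("nbytes must be >= 0 and blocksize must be > 0")
    full_blocks, leftover = divmod(nbytes, blocksize)
    sizes = [blocksize] * full_blocks
    if leftover:
        # The last block holds whatever does not fill a whole block.
        sizes.append(leftover)
    return sizes
```

For example, 10 bytes with a blocksize of 4 gives two full blocks and a final block of 2 bytes.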

Trailer

This is an optional section, used mainly by lazy chunks. A lazy chunk is similar to a regular one, except that only the meta-information has been loaded. The actual block data is loaded lazily, that is, only on demand. This improves selectivity, and hence reduces input bandwidth, during partial chunk reads (e.g. blosc1_getitem) of data that is on disk.

It is arranged like this:

+=========+=========+========+========+=========+
| nchunk  | offset  | bsize0 |   ...  | bsizeN |
+=========+=========+========+========+=========+

nchunk:

(int32_t) The number of the chunk in the super-chunk.

offset:

(int64_t) The offset of the chunk in the frame (contiguous super-chunk).

bsize0 .. bsizeN:

(int32_t) The size in bytes of each block.
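
The trailer fields listed above can be unpacked as in the following sketch. The function name read_trailer and its return shape are hypothetical, and both trailer_offset and nblocks are assumed to be known to the caller. The fields are assumed to be packed back to back with no padding, as the diagram suggests.

```python
import struct


def read_trailer(buf: bytes, trailer_offset: int, nblocks: int) -> dict:
    """Return the trailer fields as a dict.

    trailer_offset and nblocks are assumptions supplied by the caller.
    """
    fixed = struct.calcsize("<iq")  # int32 nchunk + int64 offset = 12 bytes
    end = trailer_offset + fixed + 4 * nblocks
    if trailer_offset < 0 or end > len(buf):
        raise ValueError("buffer too short for the trailer")
    # "<" selects little-endian with no alignment padding.
    nchunk, offset = struct.unpack_from("<iq", buf, trailer_offset)
    bsizes = struct.unpack_from(f"<{nblocks}i", buf, trailer_offset + fixed)
    return {"nchunk": nchunk, "offset": offset, "bsizes": list(bsizes)}
```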