Parquet to Blosc2 Walkthrough¶
The parquet-to-blosc2 CLI converts Parquet files to Blosc2 columnar
table stores (.b2z compact or .b2d sparse) and can export them
back to Parquet.
Prerequisites¶
pyarrow is required for all Parquet operations. Install it alongside
the optional parquet extras:
pip install "blosc2[parquet]"
Step 1 — Create a sample Parquet file¶
Run the snippet below once to produce sample.parquet with three columns
(id, name, score):
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table(
{
"id": pa.array([1, 2, 3, 4], type=pa.int64()),
"name": pa.array(["Alice", "Bob", "Charlie", "David"], type=pa.string()),
"score": pa.array([85.5, 90.0, 78.2, 95.4], type=pa.float64()),
}
)
pq.write_table(table, "sample.parquet")
Step 2 — Import to a compact .b2z store¶
The default output format is .b2z — a single-file zip-backed store:
parquet-to-blosc2 sample.parquet sample.b2z --overwrite
Step 3 — Import to a sparse .b2d store¶
Use the .b2d extension to produce a directory-backed (sparse) store:
parquet-to-blosc2 sample.parquet sample.b2d --overwrite
Step 4 — Fixed-width string import¶
By default, string columns are stored as variable-length strings
(vlstring). Pass --fixed-str-maxlen to pre-scan strings and store
columns whose maximum character length fits within the given limit as
fixed-width, indexable strings:
parquet-to-blosc2 sample.parquet sample_fixed.b2z --fixed-str-maxlen 16 --overwrite
Step 5 — Custom chunk and block layout¶
Override the automatic chunk and block sizes (in rows) chosen by
blosc2.compute_chunks_blocks(). Smaller blocks improve cache locality;
larger chunks reduce per-chunk overhead:
parquet-to-blosc2 sample.parquet sample_layout.b2z --chunks 1000 --blocks 100 --overwrite
Step 6 — Disable the summary index¶
By default the tool builds a SUMMARY index for eligible scalar columns on
close. The index costs less than 0.1 % of column size and accelerates
WHERE queries. Disable it with --no-summary-index when you do not
need indexed queries:
parquet-to-blosc2 sample.parquet sample_no_index.b2z --no-summary-index --overwrite
Step 7 — Export back to Parquet¶
Use --export to convert a Blosc2 store back to a Parquet file:
parquet-to-blosc2 --export sample.b2z exported.parquet --overwrite
Step 8 — Spot-check the exported file¶
Verify the round-trip with a quick Python comparison:
import pyarrow.parquet as pq
original = pq.read_table("sample.parquet")
exported = pq.read_table("exported.parquet")
# Compare row counts and column names
assert original.num_rows == exported.num_rows, "row count mismatch"
assert original.column_names == exported.column_names, "column name mismatch"
# Compare values column by column
for col in original.column_names:
assert original[col].equals(exported[col]), f"value mismatch in column '{col}'"
print("Round-trip check passed — all columns match.")