CTable and .b2z: Querying Tabular Data, the Blosc Way

Francesc Alted — Thu, 11 Jun 2026 10:00:00 GMT

Here is a question we have been chasing, in one form or another, for more than fifteen years: how much work can you avoid doing if your data is stored the right way?

In this post we put that question to a concrete test: one selective query against 24.3 million Chicago taxi trips, stored on disk in two formats — Parquet and the new Blosc2 .b2z — and answered by five different tools: DuckDB, PyArrow, pandas, polars, and Blosc2's own CTable. But the numbers will make more sense if we first tell you how we got here, because CTable did not appear out of thin air: it is the fourth floor of a building whose foundations were laid in 2009.

From a turbo-charged compressor...

Blosc was born inside PyTables with a single, then-heretical idea: that compression could make data access faster, not slower. CPUs were (and are) starving — they can crunch numbers far faster than memory can feed them — so if you split data into blocks that fit in CPU caches, shuffle the bytes so that similar ones sit together, and decompress with all your cores, the time spent decompressing can be smaller than the time saved moving fewer bytes. "Compress faster than memcpy" was the provocative benchmark slogan of the time.

That first Blosc was deliberately humble: a blocked, multithreaded meta-compressor for binary buffers. No containers, no files, no types. Just speed.
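
To make the shuffle idea concrete, here is a minimal sketch that uses only the Python standard library. It is not Blosc's implementation: zlib stands in for Blosc's codecs, there is no blocking or multithreading, and the `shuffle` helper is a hypothetical illustration that regroups bytes so the first byte of every item comes first, then the second, and so on.

```python
import zlib
from array import array


def shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1, and so on (illustrative only)."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


# 100,000 slowly increasing integers, like a counter or a timestamp column.
values = array("i", range(100_000))
raw = values.tobytes()

plain = zlib.compress(raw)
shuffled = zlib.compress(shuffle(raw, values.itemsize))

print(f"raw: {len(raw)} B, zlib: {len(plain)} B, shuffle+zlib: {len(shuffled)} B")
```

On data like this, the high-order bytes of neighboring items are nearly identical, so grouping them produces long repetitive runs that a general-purpose codec compresses far better.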

...to containers, arrays, and a compute engine

The next decade taught us that a fast compressor alone is not enough; data needs a home. C-Blosc2 (2.0 released in 2021) gave it one: 64-bit super-chunks, persistent frames, a richer filter pipeline, modern codecs like Zstd, and a plugin system. On the Python side, this matured into python-blosc2.

Then came NDArray (2023): a compressed, n-dimensional array for Python, with a two-level partitioning scheme — chunks, divided into blocks — where the block is the unit of decompression, sized to fit comfortably in CPU caches. Slicing an NDArray decompresses only the blocks that the slice touches. Keep that sentence in mind; it is the seed of everything below.
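
The "only the blocks that the slice touches" rule is simple index arithmetic. The sketch below is a simplified one-dimensional model with a hypothetical helper, `blocks_touched`; it ignores the chunk level and is not taken from the NDArray code.

```python
def blocks_touched(start: int, stop: int, block_len: int) -> range:
    """Indices of the blocks overlapped by the half-open slice [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // block_len, (stop - 1) // block_len + 1)


# Hypothetical 1-D layout: 1,000,000 items stored in blocks of 10,000 items.
touched = blocks_touched(25_000, 47_000, block_len=10_000)
print(list(touched))  # blocks 2, 3 and 4
```

In this model, the other 97 of the 100 blocks are never decompressed for that slice.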

On top of that, python-blosc2 3.0 (early 2025) added a compute engine: lazy expressions like a + b * 2 that evaluate block by block, straight over compressed (possibly larger-than-RAM) operands, and return NumPy arrays. While evaluating, the engine never materializes whole operands or intermediate arrays; it streams cache-sized blocks through the CPU. At this point we had fast compressed storage and fast compute over it. What we were missing was a way to talk about tables.
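
The streaming pattern can be sketched in plain Python. This is only a model of the idea, with hypothetical helper names: lists stand in for compressed operands, and a real engine would decompress each block before computing on it.

```python
from typing import Iterator, List


def iter_blocks(values: List[float], block_len: int) -> Iterator[List[float]]:
    """Yield consecutive blocks of at most block_len items."""
    for start in range(0, len(values), block_len):
        yield values[start:start + block_len]


def eval_blockwise(a: List[float], b: List[float], block_len: int) -> Iterator[List[float]]:
    """Evaluate a + b * 2 one block at a time, never building a full-size temporary."""
    for block_a, block_b in zip(iter_blocks(a, block_len), iter_blocks(b, block_len)):
        yield [x + y * 2 for x, y in zip(block_a, block_b)]


a = [float(i) for i in range(10)]
b = [1.0] * 10

# Reduce over the stream: only one block of the result exists at any moment.
total = sum(sum(block) for block in eval_blockwise(a, b, block_len=4))
print(total)
```

Because the result is consumed block by block, peak memory depends on the block size and not on the length of the operands.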

CTable: a columnar table on Blosc2 foundations

CTable (introduced in May 2026) is exactly that: a columnar table where each column is an NDArray (or a ListArray for variable-length data), with typed schemas, nullable columns, and a where() method that accepts plain Python expressions and is executed by the compute engine.

Because columns are NDArrays, every column inherits the block structure, and this is where the design clicks together. CTable can build a small SUMMARY index per column: min/max statistics kept at block granularity. When a query like t.payment.tips > 100 arrives, blocks whose maximum tip is at or below 100 are never read and never decompressed. The index granularity is exactly aligned with the unit of work it avoids.
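
A block-level min/max index is small enough to sketch in a few lines. The code below is a toy model of the technique, not CTable's implementation: `build_summary` and `where_greater` are hypothetical names, and a plain list stands in for a compressed column.

```python
from typing import List, Tuple


def build_summary(values: List[float], block_len: int) -> List[Tuple[float, float]]:
    """Return one (min, max) pair per block of block_len values."""
    return [
        (min(values[i:i + block_len]), max(values[i:i + block_len]))
        for i in range(0, len(values), block_len)
    ]


def where_greater(values: List[float], summary: List[Tuple[float, float]],
                  block_len: int, threshold: float) -> Tuple[List[int], int]:
    """Row indices where value > threshold, plus how many blocks were skipped."""
    hits, skipped = [], 0
    for block_id, (_lo, hi) in enumerate(summary):
        if hi <= threshold:
            skipped += 1  # no row in this block can match, so never touch it
            continue
        start = block_id * block_len
        for offset, value in enumerate(values[start:start + block_len]):
            if value > threshold:
                hits.append(start + offset)
    return hits, skipped


# Toy "tips" column: mostly small values, with two large outliers.
tips = [float(i % 7) for i in range(40)]
tips[13] = 250.0
tips[31] = 120.0

summary = build_summary(tips, block_len=8)
hits, skipped = where_greater(tips, summary, block_len=8, threshold=100.0)
print(hits, f"skipped {skipped} of {len(summary)} blocks")
```

The index can only rule blocks out. A block that survives the min/max test still has to be scanned row by row.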

A CTable persists inside a .b2z file: the single-file, zip-based flavor of TreeStore that holds all columns, indexes and metadata in one compact, openable-anywhere container. Like Parquet, the data stays compressed on disk; unlike Parquet, you can open it and immediately get NumPy-addressable columns, no engine in between.

So: does the fourth floor hold the weight? Time to measure.

The contest: one selective query, five tools

The dataset is the classic Chicago Taxi trips table: 24.3 million rows, 14 columns (floats, timestamps, dictionary-encoded strings, and even a variable-length GPS path per trip). The query is a needle-in-a-haystack filter with projection and sort:

SELECT payment.tips, payment.total, trip.sec, trip.km, company
WHERE  payment.tips > 100 AND trip.km > 0 AND trip.begin.lon < 0
ORDER BY trip.sec

Only 67 of 24.3 million rows match. This is a highly selective query, precisely the regime where storage-level pruning can shine (we return to this caveat later).
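
To see the shape of the query on something runnable, here is a sketch against a five-row in-memory table. Everything in it is an assumption made for illustration: the table name `trips`, the flattened column names (`payment_tips` and so on), the sample rows, and the use of sqlite3, which is chosen only because it ships with Python and is not one of the benchmarked engines.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE trips (payment_tips REAL, payment_total REAL, trip_sec REAL, "
    "trip_km REAL, trip_begin_lon REAL, company TEXT)"
)

# Made-up rows: two should match, three should be filtered out.
rows = [
    (120.0, 150.0, 900.0, 5.2, -87.6, "A"),  # matches
    (2.5, 12.0, 600.0, 3.1, -87.7, "B"),     # tip too small
    (300.0, 320.0, 300.0, 1.0, -87.6, "C"),  # matches
    (150.0, 170.0, 400.0, 0.0, -87.6, "A"),  # trip_km is not > 0
    (200.0, 230.0, 500.0, 2.0, 0.0, "B"),    # longitude is not < 0
]
con.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?, ?)", rows)

result = con.execute(
    """
    SELECT payment_tips, payment_total, trip_sec, trip_km, company
    FROM trips
    WHERE payment_tips > 100 AND trip_km > 0 AND trip_begin_lon < 0
    ORDER BY trip_sec
    """
).fetchall()

for row in result:
    print(row)
```

Note that the filter uses a column (the pickup longitude) that the projection then drops, so an engine has to read it without returning it.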

The contenders: DuckDB, PyArrow, pandas and polars querying the Parquet file, and Blosc2's CTable querying the .b2z. Every tool reads from disk on demand; nothing is preloaded. Each engine runs in a fresh subprocess under /usr/bin/time, and we report the query time each script measures internally (open + compute + print), which excludes interpreter and import overhead. Cold-cache runs happen right after flushing the OS file cache (sudo purge); warm runs are best-of-7. The machine is a Mac mini (Apple M4 Pro). The full, reproducible notebook is in the python-blosc2 repository.
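
The warm-run protocol (repeat, keep the fastest) can be sketched as below. This is not the benchmark driver from the repository: `best_of` and `toy_query` are hypothetical names, and the sketch times a function in-process, whereas the real runs launch each engine in a fresh subprocess.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 7) -> float:
    """Run fn repeats times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best


def toy_query() -> int:
    # Stand-in workload; a real script would open the file and run the filter.
    return sum(1 for i in range(200_000) if i % 1000 == 0)


elapsed = best_of(toy_query)
print(f"best of 7: {elapsed:.4f} s")
```

Taking the minimum instead of the mean filters out interference from other processes, which can only make a run slower, never faster.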

First, the storage footprint — because a fair query race starts with files of comparable size:

The .b2z lands at 670 MB versus Parquet's 654 MB — a 2% premium. Those extra bytes are mostly the block-level indexes; remember them, they are about to earn their keep.

Cold cache: reading less wins

The cold run is the scenario we care most about: you have a large file on disk, it is not in the OS cache, and you want one answer, now.

CTable answers in 0.056 s — about 1.9x faster than DuckDB (0.107 s), 2.4x faster than PyArrow (0.137 s), 5x faster than polars (0.298 s) and 9.5x faster than pandas (0.534 s).
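
The speedup factors follow directly from the timings quoted above; a few lines of arithmetic reproduce them (the dictionary layout is just for illustration).

```python
# Cold-cache query times in seconds, as reported in the text.
cold_seconds = {
    "CTable": 0.056,
    "DuckDB": 0.107,
    "PyArrow": 0.137,
    "polars": 0.298,
    "pandas": 0.534,
}

baseline = cold_seconds["CTable"]
speedups = {
    tool: round(seconds / baseline, 1)
    for tool, seconds in cold_seconds.items()
    if tool != "CTable"
}

for tool, factor in speedups.items():
    print(f"CTable is {factor}x faster than {tool}")
```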

Let us be clear about why, because it is not magic and it is not a faster CPU loop. On a cold cache, the dominant cost is bytes coming off the disk. The SUMMARY indexes let CTable prune roughly 89% of the blocks for this query: those blocks are neither read nor decompressed. Pruning pays twice — less I/O and less CPU — and on a first-touch query the I/O half is the whole ballgame.

Warm cache: a dead heat with a real database

Once the file is fully cached in RAM, I/O is nearly free and raw engine throughput takes over. This is DuckDB's home turf — a vectorized, multithreaded analytical SQL engine with filter pushdown and late materialization.

CTable finishes in 0.031 s, DuckDB in 0.034 s — a dead heat (the two trade places within run-to-run noise), with both about 2.6x ahead of PyArrow, 7x ahead of polars, and 16x ahead of pandas. We find this result remarkable not because CTable "beats" anything here (it does not), but because of what is absent: there is no SQL engine in the Blosc2 process. A storage container holding the tie with a purpose-built database, purely on the strength of skipping work, tells us the layout is doing the heavy lifting.

Memory tells a similar story:

DuckDB (~60 MB) and CTable (~85 MB) are the two leanest by a wide margin — an order of magnitude below pandas (~1.6 GB), which materializes full columns before filtering. CTable never holds more than the blocks it could not prune, plus the 67 matching rows.

Why pruning wins: granularity

Parquet also carries min/max statistics — at row-group granularity, here ~970,000 rows per group. CTable keeps them at block granularity, ~27,000 rows per block: roughly 36x finer. For this query the difference is binary: every one of Parquet's 25 row groups contains some trip with tips > 100, so row-group statistics prune nothing, and every Parquet reader must stream most of the file. The block-level SUMMARY index prunes ~809 of 906 blocks.
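
The granularity argument can be checked with a little arithmetic. The row, group, block and match counts below come from the text; the even spread of matching rows is an assumption made for this sketch and is not how the real data is distributed.

```python
import math

N_ROWS = 24_300_000   # rows in the table (from the text)
N_GROUPS = 25         # Parquet row groups (from the text)
N_BLOCKS = 906        # Blosc2 blocks (from the text)
N_MATCHES = 67        # matching rows (from the text)

group_len = math.ceil(N_ROWS / N_GROUPS)
block_len = math.ceil(N_ROWS / N_BLOCKS)
print(f"{group_len:,} rows/group, {block_len:,} rows/block, "
      f"{N_BLOCKS / N_GROUPS:.1f}x finer")

# ASSUMPTION: matches are spread evenly over the table (the real data is not).
positions = [k * N_ROWS // N_MATCHES for k in range(N_MATCHES)]

groups_hit = {p // group_len for p in positions}
blocks_hit = {p // block_len for p in positions}

print(f"row groups that cannot be pruned: {len(groups_hit)} of {N_GROUPS}")
print(f"blocks that cannot be pruned:     {len(blocks_hit)} of {N_BLOCKS}")
```

In this idealized model every row group holds at least one match, while at most 67 blocks do. The measured result keeps 97 blocks (906 minus 809), somewhat more than the idealized 67, presumably because min/max statistics on a single column can rule blocks out but cannot confirm that a row satisfies all three predicates.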

The deeper point is architectural. A Blosc2 block is the unit of decompression — the same cache-sized block the compute engine streams. An index at that granularity skips exactly the work the query would otherwise do. An index at a coarser granularity than the I/O unit can only skip work in big, lucky lumps.

And the honest caveat: this advantage rides on selectivity, not on any general superiority. tips > 100 is rare enough that most 27,000-row blocks contain no match. A predicate that matches everywhere prunes nothing at any granularity, and on data sorted or clustered by the filter column, even Parquet's coarse row groups would start pruning effectively. Benchmarks are stories with a point of view; this one is about selective, first-touch queries.

Seasoned conclusions

What do we think these numbers support — and not support?

For selective cold queries on large tabular files, CTable/.b2z is genuinely fast — the fastest of the five tools here, on a query and dataset it was not specially tuned for. If your workload looks like "open a big file, fetch a small subset, move on", the block-level indexing earns its 2% of disk many times over.
Warm, it ties — it does not dethrone. DuckDB remains an excellent engine, and on cached data it matches CTable while speaking full SQL with joins and aggregations that CTable does not attempt. If your problems are relational, use a relational engine.
The result is arrays, not a result set. t.where(...) hands back NumPy-addressable columns with their original dtypes — no .to_numpy() hop, no DataFrame conversion tax. For NumPy-centric pipelines, that removes a whole impedance layer. And since columns are NDArrays, a CTable column can even be n-dimensional, or hold variable-length data (this dataset stores a GPS trace per row).
Parquet is not going anywhere. It is slightly smaller here, and it remains the lingua franca of the data ecosystem, readable by everything. .b2z is young and its natural habitat is the Python/NumPy world. What this experiment shows is that the trade is real and the price is modest: a couple percent of disk for first-touch queries that run in a fraction of the time.

Sixteen years after asking whether compression could be faster than memcpy, the question has scaled up but kept its shape: the fastest byte is still the one you never touch. Blocks sized for caches made decompression cheap; the compute engine made math over blocks cheap; and CTable's block-level indexes now make not touching most of a table cheap, too. The fourth floor stands on the first.

Reproduce it yourself

Everything in this post lives in bench/chicago-taxi in the python-blosc2 repository: the notebook, the driver, the five per-engine query scripts, and a README with the details. The notebook downloads the dataset on first run and builds the .b2z from it, so the whole thing is two commands away:

pip install "blosc2>=4.4.3" pyarrow duckdb polars pandas matplotlib jupyter
jupyter lab compare-query-methods.ipynb   # then: Run All

One practical tip if you chase the cold-cache numbers: flushing the OS file cache is necessary but not sufficient. After a flush and a few idle seconds, the first disk read also pays the drive's idle-state exit latency (tens of ms on power-managed NVMe), and it lands on whichever process touches the disk first — we learned this the hard way while preparing this post. The driver's --purge flag handles both the flush and the disk wake-up for you; the README explains the manual route.

More info

Introducing CTable — the design and feature tour
Getting started with CTable and Indexing CTables
The benchmark directory — notebook, driver, per-engine scripts and README
CTable API reference

Enjoy data!
