<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blosc Home Page  (Posts about ctable columnar table compression)</title><link>https://blosc.org/</link><description></description><atom:link href="https://blosc.org/categories/ctable-columnar-table-compression.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:blosc@blosc.org"&gt;The Blosc Developers&lt;/a&gt; </copyright><lastBuildDate>Wed, 06 May 2026 12:07:55 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Introducing CTable: a Blosc2-based columnar table</title><link>https://blosc.org/posts/ctable-blosc2-columnar-table/</link><dc:creator>Jorge Albiol, Francesc Alted</dc:creator><description>&lt;p&gt;Working with large structured datasets in Python often means choosing between speed and simplicity. The new &lt;a class="reference external" href="https://blosc.org/python-blosc2/reference/ctable.html#"&gt;CTable&lt;/a&gt; object was born out of the need for a columnar store that compresses data on the fly, stays close to NumPy, and does not require an external database engine. It is the logical extension of current compressed data storage and computation in Python-Blosc2, brought to tabular datasets.&lt;/p&gt;
&lt;p&gt;As compression is paramount in the Blosc2 ecosystem, we have chosen a columnar approach: by placing similar data together (values in the same column), it allows for better compression ratios. Column storage also makes some data-management operations easier, like adding, deleting, accessing or replacing entire columns; admittedly, it also has its own drawbacks, like more costly access along the row axis. Nevertheless, columnar storage is quite common in modern libraries.&lt;/p&gt;
&lt;p&gt;Another important piece of CTable is that it leverages the extremely efficient compute engine, which can operate on compressed data without sacrificing much performance (and in some cases, even improving it). This lays the foundation for powerful analytics machinery on top of the CTable object, without the need to decompress entire columns (small excerpts of them, fitting in CPU caches, are enough).&lt;/p&gt;
&lt;p&gt;Last but not least, the CTable object inherits the storage-media independence of the underlying structures (NDArray, ObjectArray, ListArray...), so data can be stored and used straight from memory, disk or the network (coming soon). That means you can open a data file containing a big CTable and immediately start doing analytics with it, without loading or parsing everything in memory. Of course, for maximum speed, you may also load everything in memory; but as the format is the same, loading and saving is just a matter of copying data from one medium to another, with no parsing or conversion.&lt;/p&gt;
&lt;p&gt;Keep reading to learn more about CTable, its features and how to use it in your projects.&lt;/p&gt;
&lt;section id="how-it-works"&gt;
&lt;h2&gt;How it works&lt;/h2&gt;
&lt;p&gt;CTable stores each column as an independent &lt;code class="docutils literal"&gt;blosc2.NDArray&lt;/code&gt;, compressed in chunks. Column types are defined through a &lt;a class="reference external" href="https://blosc.org/python-blosc2/reference/ctable.html#schema-specs"&gt;schema&lt;/a&gt; — a plain Python dataclass where each field is annotated with a Blosc2 type spec such as &lt;code class="docutils literal"&gt;b2.int64()&lt;/code&gt;, &lt;code class="docutils literal"&gt;b2.float32(ge=0)&lt;/code&gt;, or &lt;code class="docutils literal"&gt;b2.string()&lt;/code&gt;. Specs can carry constraints (e.g. &lt;code class="docutils literal"&gt;ge=0&lt;/code&gt; for non-negative values) and are compiled into a schema that validates every row on insert, either one at a time via Pydantic or in bulk via vectorized NumPy checks.&lt;/p&gt;
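&lt;p&gt;As a rough illustration of that idea (a toy sketch, not the actual python-blosc2 API; the &lt;code class="docutils literal"&gt;Spec&lt;/code&gt; class and &lt;code class="docutils literal"&gt;validate_bulk&lt;/code&gt; function below are hypothetical names), a schema of typed field specs carrying constraints can be compiled into a single vectorized NumPy check over a whole batch:&lt;/p&gt;

```python
# Toy sketch of the schema concept: typed field specs with optional
# constraints, checked in one vectorized NumPy pass per column.
# This mimics the idea only; it is NOT the python-blosc2 API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Spec:
    dtype: str
    ge: float = None   # optional "greater or equal" constraint

# One spec per column, analogous to b2.int64(), b2.int8(ge=0), ...
schema = {"id": Spec("int64"), "age": Spec("int8", ge=0), "score": Spec("float32")}

def validate_bulk(batch, schema):
    """Validate a dict of column arrays against the schema, column by column."""
    for name, spec in schema.items():
        col = np.asarray(batch[name], dtype=spec.dtype)   # type check via cast
        if spec.ge is not None and not np.all(col >= spec.ge):
            raise ValueError(f"column {name!r} violates ge={spec.ge}")
    return True

print(validate_bulk({"id": [1, 2], "age": [30, 0], "score": [0.5, 1.5]}, schema))  # True
```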
&lt;p&gt;Rows are tracked with a boolean tombstone mask: deleting a row simply flips its entry in the mask to &lt;code class="docutils literal"&gt;False&lt;/code&gt;, with no data movement at all. The actual space is reclaimed lazily when you call &lt;code class="docutils literal"&gt;compact()&lt;/code&gt;. Appending is also efficient because the underlying arrays are pre-allocated up front — they only grow when the pre-allocated capacity is exhausted, so there is no resize on every single insert.&lt;/p&gt;
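&lt;p&gt;A minimal NumPy sketch of the tombstone-mask mechanism (illustrative only; &lt;code class="docutils literal"&gt;delete_row&lt;/code&gt; is a made-up helper, not the CTable API):&lt;/p&gt;

```python
import numpy as np

# Toy model of tombstone-mask deletion: rows are never moved on delete;
# a boolean mask marks which rows are alive, and compact() reclaims space.
data = np.array([10, 20, 30, 40, 50])
alive = np.ones(len(data), dtype=bool)

def delete_row(i):
    alive[i] = False            # O(1): just flip the mask, no data movement

def compact():
    global data, alive
    data = data[alive]          # physical rewrite happens only here, lazily
    alive = np.ones(len(data), dtype=bool)

delete_row(1)
delete_row(3)
print(data[alive])              # logical view: [10 30 50]
compact()
print(len(data))                # physical size shrinks to 3 only now
```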
&lt;p&gt;Because the data lives in Blosc2 chunks, many queries can skip full chunks entirely. When a chunk's stored metadata (min/max) rules out any match, it is never decompressed. This is where a lot of the query speed comes from, and it is also why explicit indexes (described in Features below) can push performance even further.&lt;/p&gt;
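&lt;p&gt;The chunk-skipping idea can be sketched in a few lines of NumPy (a toy model: real Blosc2 chunks are compressed, and the min/max metadata lives with the container):&lt;/p&gt;

```python
import numpy as np

# Toy chunk skipping: keep per-chunk (min, max) metadata; a range query
# only "decompresses" (here: touches) chunks whose range overlaps it.
CHUNK = 1000
values = np.arange(10_000)      # sorted data makes skipping dramatic
chunks = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]
meta = [(int(c.min()), int(c.max())) for c in chunks]   # stored per chunk

def range_query(lo, hi):
    hits, touched = [], 0
    for (cmin, cmax), chunk in zip(meta, chunks):
        if cmax >= lo and hi >= cmin:          # overlap test on metadata only
            touched += 1
            sel = np.logical_and(chunk >= lo, hi >= chunk)
            hits.append(chunk[sel])
    return np.concatenate(hits), touched

res, touched = range_query(2500, 3499)
print(touched)                  # only 2 of the 10 chunks were touched
```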
&lt;p&gt;Since &lt;code class="docutils literal"&gt;blosc2.NDArray&lt;/code&gt; stores fixed-width binary data, &lt;code class="docutils literal"&gt;null&lt;/code&gt; has no natural representation for integers, floats, or booleans. CTable solves this by letting you declare a column as nullable, and a sentinel value as the null marker is chosen automatically, although you can always use your own. For example, if you are storing ages and sometimes the value is unknown, you can set &lt;code class="docutils literal"&gt;&lt;span class="pre"&gt;-1&lt;/span&gt;&lt;/code&gt; as the null value since ages are never negative. Aggregates such as &lt;code class="docutils literal"&gt;.mean()&lt;/code&gt; or &lt;code class="docutils literal"&gt;.std()&lt;/code&gt; skip those rows automatically, and helper methods like &lt;code class="docutils literal"&gt;.is_null()&lt;/code&gt; and &lt;code class="docutils literal"&gt;.null_count()&lt;/code&gt; make it easy to work with them.&lt;/p&gt;
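&lt;p&gt;The sentinel-null mechanics, modeled on the ages example above with plain NumPy (the helper functions are illustrative stand-ins for the CTable methods of the same name):&lt;/p&gt;

```python
import numpy as np

# Toy sentinel-null column: ages are never negative, so -1 can mark null.
NULL = -1
ages = np.array([25, NULL, 40, NULL, 31])

def is_null(col):
    return col == NULL

def null_count(col):
    return int(is_null(col).sum())

def mean_skipping_nulls(col):
    valid = col[~is_null(col)]          # aggregates ignore null rows
    return valid.mean()

print(null_count(ages))                 # 2
print(mean_skipping_nulls(ages))        # 32.0, i.e. (25 + 40 + 31) / 3
```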
&lt;p&gt;Not all data fits neatly into fixed-width columns though. Think of a column storing the tags of an article, the purchase history of a user, or the list of measurements taken by a sensor on a given day: each row may have a different number of items. For these cases CTable supports list columns, declared as &lt;code class="docutils literal"&gt;blosc2.list(item_spec)&lt;/code&gt; (e.g. &lt;code class="docutils literal"&gt;&lt;span class="pre"&gt;blosc2.list(blosc2.float32())&lt;/span&gt;&lt;/code&gt;), structured objects via &lt;code class="docutils literal"&gt;blosc2.struct(item_spec)&lt;/code&gt; or completely general objects via &lt;code class="docutils literal"&gt;b2.object()&lt;/code&gt;. These columns are backed by a different storage class internally, one that keeps a compressed stream of items alongside an offsets array to know where each row starts and ends. From the user's perspective they behave like any other column, but each cell holds a Python list instead of a scalar, and individual lists can also be &lt;code class="docutils literal"&gt;None&lt;/code&gt;. Internally, the underlying C-Blosc2 has been improved (and &lt;a class="reference external" href="https://github.com/Blosc/c-blosc2/releases"&gt;released as 3.0.0&lt;/a&gt;) to allow variable-length data in super-chunks in a very efficient (and backward-compatible) way.&lt;/p&gt;
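&lt;p&gt;The values-plus-offsets layout behind list columns can be sketched with two plain NumPy arrays (a toy model of the internal storage, not the actual implementation):&lt;/p&gt;

```python
import numpy as np

# Toy list column: a single flat values array (stand-in for the
# compressed item stream) plus an offsets array telling where each
# row's list starts and ends.
rows = [[1.0, 2.0], [], [3.5], [4.0, 5.0, 6.0]]

values = np.concatenate([np.asarray(r, dtype="float32") for r in rows])
offsets = np.zeros(len(rows) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(r) for r in rows])

def get_row(i):
    # Row i is the slice between consecutive offsets: no per-row metadata.
    return values[offsets[i]:offsets[i + 1]]

print(get_row(3))               # [4. 5. 6.]
print(len(get_row(1)))          # 0, an empty list round-trips fine
```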
&lt;/section&gt;
&lt;section id="main-features"&gt;
&lt;h2&gt;Main features&lt;/h2&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creation&lt;/strong&gt;: A CTable can be created in several ways. The most direct is declaring a typed schema as a dataclass and passing it to the constructor. You can also build a CTable from existing data: &lt;code class="docutils literal"&gt;from_arrow()&lt;/code&gt; and &lt;code class="docutils literal"&gt;from_csv()&lt;/code&gt; import Arrow tables and CSV files respectively, inferring or mapping types automatically. Finally, &lt;code class="docutils literal"&gt;copy()&lt;/code&gt; produces a new independent CTable from an existing one, already compacted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modification&lt;/strong&gt;: Appending a single row uses &lt;code class="docutils literal"&gt;append()&lt;/code&gt;, while bulk insertion uses &lt;code class="docutils literal"&gt;extend()&lt;/code&gt;. Deleting rows sets their mask entry to &lt;code class="docutils literal"&gt;False&lt;/code&gt; and is essentially free. Columns can be added with a default value or dropped and renamed at any time. Beyond stored columns, CTable also supports two kinds of virtual columns: &lt;em&gt;computed columns&lt;/em&gt; are evaluated on-the-fly from an expression over other columns and never touch storage; &lt;em&gt;materialized columns&lt;/em&gt; look like stored columns but are filled automatically during every &lt;code class="docutils literal"&gt;extend()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Querying&lt;/strong&gt;: &lt;code class="docutils literal"&gt;where(expr)&lt;/code&gt; filters rows and returns a &lt;em&gt;view&lt;/em&gt; — a new CTable object that shares the same column arrays as the parent but carries its own mask. No data is copied; only the mask is computed. Views block structural changes (adding/dropping columns, deleting rows) but do allow writing values to existing cells. &lt;code class="docutils literal"&gt;select(cols)&lt;/code&gt; gives a column-projection view in the same spirit. Both can be made into a fully independent mutable table with &lt;code class="docutils literal"&gt;copy()&lt;/code&gt;. Aggregates (&lt;code class="docutils literal"&gt;sum()&lt;/code&gt;, &lt;code class="docutils literal"&gt;mean()&lt;/code&gt;, &lt;code class="docutils literal"&gt;std()&lt;/code&gt;, &lt;code class="docutils literal"&gt;min()&lt;/code&gt;, &lt;code class="docutils literal"&gt;max()&lt;/code&gt;, ...) and &lt;code class="docutils literal"&gt;sort_by()&lt;/code&gt; also work on views and respect the mask.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing&lt;/strong&gt;: For workloads that repeatedly query the same column, CTable supports three index flavors: &lt;code class="docutils literal"&gt;FULL&lt;/code&gt; (sorted positions array, best for range and comparison queries), &lt;code class="docutils literal"&gt;BUCKET&lt;/code&gt; (hash-based, best for equality lookups), and &lt;code class="docutils literal"&gt;PARTIAL&lt;/code&gt; (a lighter-weight sorted structure). Once an index is created, &lt;code class="docutils literal"&gt;where()&lt;/code&gt; uses it automatically when the query can benefit from it. Indexes are persisted alongside the table and survive &lt;code class="docutils literal"&gt;.b2z&lt;/code&gt; round-trips.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistence&lt;/strong&gt;: Tables can live fully in memory or be backed by files on disk. &lt;code class="docutils literal"&gt;save()&lt;/code&gt; writes an in-memory table to a directory or &lt;code class="docutils literal"&gt;.b2z&lt;/code&gt; archive. &lt;code class="docutils literal"&gt;CTable.open()&lt;/code&gt; attaches directly to an on-disk table for reading or writing without loading everything into RAM. &lt;code class="docutils literal"&gt;CTable.load()&lt;/code&gt; copies the on-disk table fully into memory for faster subsequent access. Both &lt;code class="docutils literal"&gt;.b2d&lt;/code&gt; directories and &lt;code class="docutils literal"&gt;.b2z&lt;/code&gt; zip archives are supported transparently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
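&lt;p&gt;The view semantics described under Querying can be sketched with plain NumPy (a toy model: the hypothetical &lt;code class="docutils literal"&gt;View&lt;/code&gt; class and &lt;code class="docutils literal"&gt;where_gt&lt;/code&gt; helper below only illustrate the shared-storage idea, they are not the CTable API):&lt;/p&gt;

```python
import numpy as np

# Toy view semantics: filtering produces a view that shares the parent's
# column buffer and carries only its own boolean mask; no data is copied.
col = np.array([5, 12, 7, 30, 1])

class View:
    def __init__(self, data, mask):
        self.data, self.mask = data, mask          # same buffer, new mask
    def to_numpy(self):
        return self.data[self.mask]
    def copy(self):
        # An independent table: materialize the masked rows into a new buffer.
        out = self.data[self.mask].copy()
        return View(out, np.ones(len(out), dtype=bool))

def where_gt(data, threshold):
    return View(data, data > threshold)            # only a mask is computed

v = where_gt(col, 10)
print(np.shares_memory(v.data, col))               # True: storage is shared
print(v.to_numpy())                                # [12 30]
col[1] = 99                                        # parent writes show through
print(v.to_numpy())                                # [99 30]
```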
&lt;/section&gt;
&lt;section id="mini-benchmarks"&gt;
&lt;h2&gt;Mini benchmarks&lt;/h2&gt;
&lt;p&gt;All numbers below are from a single machine, 1 million rows, using the benchmark scripts in the repository. They are meant to give a feel for the performance characteristics, not as absolute guarantees.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bulk loading speed&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;How you feed data into a CTable matters a lot. Loading 1M rows from a Python list of dicts takes around 0.66 s. Switching to a NumPy structured array brings that down to 0.03 s — a &lt;strong&gt;22x speedup&lt;/strong&gt;. Loading from an existing CTable is even faster at &lt;strong&gt;28x&lt;/strong&gt;. The takeaway is simple: if you have NumPy data, hand it directly to &lt;code class="docutils literal"&gt;extend()&lt;/code&gt; and it will be ingested at close to raw array speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Filtering vs pandas&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Filtering 1M rows with a range query (&lt;code class="docutils literal"&gt;id&lt;/code&gt; between 250k and 750k, so 50% of the table) takes around 13 ms in CTable vs 31 ms in pandas — &lt;strong&gt;2.4x faster&lt;/strong&gt;. On top of that, the CTable occupies 20 MB compressed versus 31 MB for the equivalent pandas DataFrame, a &lt;strong&gt;1.6x reduction in memory&lt;/strong&gt; essentially for free thanks to Blosc2's compression pipeline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code class="docutils literal"&gt;where()&lt;/code&gt; is nearly free regardless of selectivity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One property of the lazy mask approach is that &lt;code class="docutils literal"&gt;where()&lt;/code&gt; costs roughly the same whether the result contains 10 rows or 999,990 rows out of 1M. In practice the time stays between 12 ms and 18 ms across all selectivity levels. You are not paying to materialise the matching rows — you are only computing a mask. The data is only read when you actually access it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code class="docutils literal"&gt;extend()&lt;/code&gt; vs &lt;code class="docutils literal"&gt;append()&lt;/code&gt; — always batch if you can&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;CTable has two ways to insert data: &lt;code class="docutils literal"&gt;append()&lt;/code&gt; adds one row at a time and goes through a full Pydantic validation cycle per row; &lt;code class="docutils literal"&gt;extend()&lt;/code&gt; takes a batch and validates it in one vectorized NumPy pass. At 100k rows the difference is &lt;strong&gt;2000x in favour of &lt;code class="docutils literal"&gt;extend()&lt;/code&gt;&lt;/strong&gt;. Even at 10k rows it is already 700x. The message is simple: if you have more than a handful of rows to insert, always batch them into a single &lt;code class="docutils literal"&gt;extend()&lt;/code&gt; call.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Combining filters is 4x faster than chaining them&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It is tempting to filter a CTable step by step — first narrow by one condition, then filter the result by another. But each &lt;code class="docutils literal"&gt;where()&lt;/code&gt; call creates a new view with its own mask computation. A single &lt;code class="docutils literal"&gt;where()&lt;/code&gt; with all conditions joined by &lt;code class="docutils literal"&gt;&amp;amp;&lt;/code&gt; does the same work in one pass and is &lt;strong&gt;4.4x faster&lt;/strong&gt; than five chained calls returning the same final result.&lt;/p&gt;
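&lt;p&gt;The chained-vs-combined distinction can be demonstrated with plain NumPy masks (a model of the behaviour, not a benchmark of CTable itself):&lt;/p&gt;

```python
import numpy as np

# Chained filtering computes a mask per step over intermediate results;
# combining all conditions does the same selection in one pass.
rng = np.random.default_rng(0)
a = rng.integers(0, 100, 100_000)
b = rng.integers(0, 100, 100_000)

# Chained: first view's positions, then a second mask on the intermediate.
idx = np.flatnonzero(a > 20)
idx = idx[b[idx] > 50]

# Combined: a single mask joining both conditions (CTable's single where()).
one_pass = np.flatnonzero(np.logical_and(a > 20, b > 50))

print(np.array_equal(idx, one_pass))   # True: identical rows selected
```

Both routes select the same rows; the combined form just avoids the per-step mask computations, which is where the reported 4.4x comes from.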
&lt;p&gt;&lt;strong&gt;Schema validation has near-zero cost at scale&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every CTable has a typed schema with optional constraints (ranges, string lengths, etc.). When inserting data with &lt;code class="docutils literal"&gt;extend()&lt;/code&gt;, these constraints are checked via a vectorized NumPy path rather than row by row. At 1M rows with a NumPy structured array the validation overhead is essentially &lt;strong&gt;1.00x, indistinguishable from skipping validation entirely&lt;/strong&gt;. Even with Python list input it only adds 1.31x. You get correctness guarantees without paying for them at scale.&lt;/p&gt;
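&lt;p&gt;Why the vectorized path is so cheap is easy to see in miniature: a single array comparison yields the same verdicts as a per-row loop (illustrative NumPy only, not the CTable validation code):&lt;/p&gt;

```python
import numpy as np

# A ge=0 constraint checked per row (append()-style) and in one
# vectorized comparison over the whole batch (extend()-style).
ages = np.array([30, -1, 52, 7, -3])

row_ok = [age >= 0 for age in ages]    # per-row loop, Python-level cost
bulk_ok = ages >= 0                    # one NumPy pass over the batch

print(list(bulk_ok) == row_ok)         # True: identical verdicts
print(np.flatnonzero(~bulk_ok))        # [1 4]: the rows violating ge=0
```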
&lt;/section&gt;
&lt;section id="ai-role"&gt;
&lt;h2&gt;AI role&lt;/h2&gt;
&lt;p&gt;During the development of CTable, we have been using AI tools to help us in the design and implementation of the API, as well as in the documentation. Tools like Perplexity, and agents like Pi, Codex or Claude have been instrumental in helping us throughout the process, which allowed us to be much more ambitious in the features we wanted to implement. &lt;em&gt;Note from Francesc&lt;/em&gt;: I especially liked the combination of the Pi agent and GPT 5.5 model (essentially GPT &amp;gt;= 5.3); that worked really well!&lt;/p&gt;
&lt;p&gt;Of course, we borrowed some ideas from other libraries, like Apache Arrow, Pandas, Polars, DuckDB or PyTables, but we also wanted to bring some unique features to CTable, like the ability to operate on compressed data without decompressing it, or the rich schema specs for expressing complex data types; AI was key in allowing us to do this.&lt;/p&gt;
&lt;p&gt;Powerful as it is, AI always needs supervision and guidance to be used effectively, and we have spent lots of time bringing our accumulated decades of experience to reviewing code, designing tests and benchmarks, and fine-tuning the internal knobs for the best performance and user experience. We must say that we are very happy with the results: combining our experience with the power of AI has allowed us to create a powerful and flexible tabular data container that is well tested and documented.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="more-info"&gt;
&lt;h2&gt;More info&lt;/h2&gt;
&lt;p&gt;We have set up a couple of tutorials and a complete API reference to get you started with CTable:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blosc.org/python-blosc2/getting_started/tutorials/13.ctable-basics.html"&gt;Getting started with CTable&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blosc.org/python-blosc2/getting_started/tutorials/15.indexing-ctables.html"&gt;Indexing CTables&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://github.com/Blosc/python-blosc2/tree/main/examples/ctable"&gt;More CTable examples&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="reference external" href="https://blosc.org/python-blosc2/reference/ctable.html#blosc2.CTable"&gt;CTable API reference&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/section&gt;
&lt;section id="conclusions"&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;CTable brings together compression, schema validation, and query acceleration in a self-contained Python package. It is still young, but the architecture is solid and the feature set already covers most common analytical workflows; we hope it will be useful for many users in the Python ecosystem. We are looking forward to seeing how the community uses and contributes to Python-Blosc2 in general, and CTable in particular, and to continuing to improve it based on feedback and contributions from users.&lt;/p&gt;
&lt;p&gt;Enjoy data!&lt;/p&gt;
&lt;/section&gt;</description><category>ctable columnar table compression</category><guid>https://blosc.org/posts/ctable-blosc2-columnar-table/</guid><pubDate>Wed, 06 May 2026 09:00:00 GMT</pubDate></item></channel></rss>