CTable Tutorial

CTable is a columnar compressed table built on top of blosc2.NDArray. It stores each column independently as a compressed array, giving you:

  • Compression — data lives compressed in RAM and on disk.

  • Schema — every column has a declared type and optional constraints.

  • Speed — bulk operations stay in NumPy; no row-by-row Python overhead.

  • Persistence — tables can be saved to and loaded from disk transparently.

This notebook walks through the full API, starting from the very basics and finishing with a real-world analysis of climate data across ten world cities.

[1]:
from dataclasses import dataclass

import matplotlib.pyplot as plt
import numpy as np

import blosc2
from blosc2 import CTable

Part 1 — The Basics

1.1 Defining a schema

Every CTable is typed. You define the schema with a plain Python @dataclass. Each field gets a spec — a blosc2 type that carries the NumPy dtype and optional constraints.

[2]:
@dataclass
class Sensor:
    id: int = blosc2.field(blosc2.int32(ge=0))
    location: str = blosc2.field(blosc2.string(max_length=16), default="")
    temperature: float = blosc2.field(blosc2.float64(ge=-80, le=60), default=20.0)
    active: bool = blosc2.field(blosc2.bool(), default=True)


# Create an empty in-memory table
t = CTable(Sensor, expected_size=50)
print(f"Empty table: {len(t)} rows, columns: {t.col_names}")
Empty table: 0 rows, columns: ['id', 'location', 'temperature', 'active']

1.2 Appending rows

append() adds one row at a time. The row is validated against the schema before writing.

[3]:
t.append(Sensor(id=0, location="roof", temperature=22.5, active=True))
t.append(Sensor(id=1, location="basement", temperature=18.1, active=True))
t.append(Sensor(id=2, location="outdoor", temperature=-3.2, active=False))
print(t)
 id            location                       temperature        active
 int32         U16 (Unicode, max 16 chars)    float64            bool
────────────  ─────────────────────────────  ─────────────────  ────────
 0             roof                           22.5               True
 1             basement                       18.1               True
 2             outdoor                        -3.2               False
────────────  ─────────────────────────────  ─────────────────  ────────
3 rows × 4 columns

Constraints are enforced — trying to insert a temperature above 60 °C raises an error:

[4]:
try:
    t.append(Sensor(id=99, location="sun", temperature=9999.0, active=True))
except Exception as e:
    print(f"Validation error: {e}")
Validation error: 1 validation error for _Validator_Sensor
temperature
  Input should be less than or equal to 60 [type=less_than_equal, input_value=9999.0, input_type=float]
    For further information visit https://errors.pydantic.dev/2.13/v/less_than_equal

1.3 Bulk loading with extend()

extend() accepts a list of tuples or a structured NumPy array. It is much faster than calling append() in a loop.

[5]:
bulk = [
    (3, "lab-A", 20.0, True),
    (4, "lab-B", 21.5, True),
    (5, "server", 35.8, True),
    (6, "garden", -1.0, False),
]
t.extend(bulk)
print(f"After extend: {len(t)} rows")
print(t)
After extend: 7 rows
 id            location                       temperature        active
 int32         U16 (Unicode, max 16 chars)    float64            bool
────────────  ─────────────────────────────  ─────────────────  ────────
 0             roof                           22.5               True
 1             basement                       18.1               True
 2             outdoor                        -3.2               False
 3             lab-A                          20.0               True
 4             lab-B                          21.5               True
 5             server                         35.8               True
 6             garden                         -1.0               False
────────────  ─────────────────────────────  ─────────────────  ────────
7 rows × 4 columns

1.4 Navigating the table

head(), tail(), and slicing give you quick views without materialising everything.

[6]:
print("--- head(3) ---")
print(t.head(3))

print("--- tail(2) ---")
print(t.tail(2))

print(f"\nCompression: {t.cbytes:,} B compressed / {t.nbytes:,} B uncompressed")
--- head(3) ---
 id            location                       temperature        active
 int32         U16 (Unicode, max 16 chars)    float64            bool
────────────  ─────────────────────────────  ─────────────────  ────────
 0             roof                           22.5               True
 1             basement                       18.1               True
 2             outdoor                        -3.2               False
────────────  ─────────────────────────────  ─────────────────  ────────
3 rows × 4 columns
--- tail(2) ---
 id            location                       temperature        active
 int32         U16 (Unicode, max 16 chars)    float64            bool
────────────  ─────────────────────────────  ─────────────────  ────────
 5             server                         35.8               True
 6             garden                         -1.0               False
────────────  ─────────────────────────────  ─────────────────  ────────
2 rows × 4 columns

Compression: 747 B compressed / 3,900 B uncompressed

1.5 Columns as first-class objects

Access a column with table["name"] or table.name. Columns are lazy — they only decompress data when you ask for it.

[7]:
temps = t["temperature"]
print(f"dtype  : {temps.dtype}")
print(f"min    : {temps.min():.1f} °C")
print(f"max    : {temps.max():.1f} °C")
print(f"mean   : {temps.mean():.1f} °C")
print(f"as numpy: {temps[:]}")
dtype  : float64
min    : -3.2 °C
max    : 35.8 °C
mean   : 16.2 °C
as numpy: [22.5 18.1 -3.2 20.  21.5 35.8 -1. ]

1.6 Computed columns

A CTable can also expose computed columns: read-only columns backed by a lazy expression over stored columns. They use no extra storage, update automatically after appends/deletes, and participate in display, filtering, sorting, and aggregates.

[8]:
t.add_computed_column("temperature_f", "temperature * 9 / 5 + 32")

print(t.select(["location", "temperature", "temperature_f"]))
print(f"\nMean temperature in °F: {t['temperature_f'].mean():.1f} °F")

# Use the computed column in a query
warm_f = t.where(t.temperature_f > 70)
print("\nRows above 70 °F:")
print(warm_f.select(["location", "temperature", "temperature_f"]))
 location                       temperature        temperature_f
 U16 (Unicode, max 16 chars)    float64            float64
─────────────────────────────  ─────────────────  ─────────────────
 roof                           22.5               72.5
 basement                       18.1               64.58
 outdoor                        -3.2               26.24000000000…
 lab-A                          20.0               68.0
 lab-B                          21.5               70.7
 server                         35.8               96.44
 garden                         -1.0               30.2
─────────────────────────────  ─────────────────  ─────────────────
7 rows × 3 columns

Mean temperature in °F: 61.2 °F

Rows above 70 °F:
 location                       temperature        temperature_f
 U16 (Unicode, max 16 chars)    float64            float64
─────────────────────────────  ─────────────────  ─────────────────
 roof                           22.5               72.5
 lab-B                          21.5               70.7
 server                         35.8               96.44
─────────────────────────────  ─────────────────  ─────────────────
3 rows × 3 columns

Part 2 — The Climate Dataset

Enough warm-up. Let’s do something real.

We will simulate one full year of daily weather readings for 10 world cities. Each row is one day at one city: temperature, humidity, wind speed, atmospheric pressure.

City

Climate

Twist

Madrid

Mediterranean

Scorching summers, mild winters

London

Temperate oceanic

Famously grey and damp

Beijing

Continental

Brutal winters, hot summers

New York

Humid continental

Four very distinct seasons

Tokyo

Humid subtropical

Warm and very humid summers

Sydney

Oceanic (S. hemisphere)

Seasons are flipped!

Cairo

Hot desert

Basically always hot

Moscow

Subarctic

Coldest city in the dataset

Mumbai

Tropical

Hot and humid all year

São Paulo

Tropical highland

Warm, rainy, south hemisphere

[9]:
@dataclass
class WeatherReading:
    city: str = blosc2.field(blosc2.string(max_length=16))
    day: int = blosc2.field(blosc2.int16(ge=1, le=365), default=1)
    temperature: float = blosc2.field(blosc2.float32(ge=-80.0, le=60.0), default=20.0)
    humidity: float = blosc2.field(blosc2.float32(ge=0.0, le=100.0), default=50.0)
    wind_speed: float = blosc2.field(blosc2.float32(ge=0.0, le=200.0), default=0.0)
    pressure: float = blosc2.field(blosc2.float32(ge=800.0, le=1100.0), default=1013.0)
[10]:
# Climate profile for each city:
#   mean_temp  : annual mean temperature (°C)
#   amplitude  : half the annual temperature swing (°C)
#   peak_day   : day of year with the highest temperature
#                (196 ≈ July 15 for N. hemisphere, 15 ≈ Jan 15 for S. hemisphere)
#   humidity   : annual mean relative humidity (%)
#   wind       : mean wind speed (km/h)
#   pressure   : mean atmospheric pressure (hPa)

CITY_PROFILES = {
    "Madrid": {
        "mean_temp": 15.0,
        "amplitude": 13.0,
        "peak_day": 196,
        "humidity": 45,
        "wind": 12,
        "pressure": 1010,
    },
    "London": {
        "mean_temp": 11.0,
        "amplitude": 7.0,
        "peak_day": 196,
        "humidity": 75,
        "wind": 15,
        "pressure": 1013,
    },
    "Beijing": {
        "mean_temp": 12.0,
        "amplitude": 16.0,
        "peak_day": 196,
        "humidity": 55,
        "wind": 10,
        "pressure": 1012,
    },
    "New York": {
        "mean_temp": 13.0,
        "amplitude": 14.0,
        "peak_day": 196,
        "humidity": 65,
        "wind": 14,
        "pressure": 1013,
    },
    "Tokyo": {
        "mean_temp": 15.0,
        "amplitude": 12.0,
        "peak_day": 196,
        "humidity": 72,
        "wind": 11,
        "pressure": 1014,
    },
    "Sydney": {
        "mean_temp": 18.0,
        "amplitude": 8.0,
        "peak_day": 15,
        "humidity": 65,
        "wind": 16,
        "pressure": 1012,
    },
    "Cairo": {
        "mean_temp": 22.0,
        "amplitude": 14.0,
        "peak_day": 196,
        "humidity": 35,
        "wind": 8,
        "pressure": 1014,
    },
    "Moscow": {
        "mean_temp": 5.0,
        "amplitude": 18.0,
        "peak_day": 196,
        "humidity": 70,
        "wind": 10,
        "pressure": 1015,
    },
    "Mumbai": {
        "mean_temp": 28.0,
        "amplitude": 4.0,
        "peak_day": 196,
        "humidity": 80,
        "wind": 12,
        "pressure": 1011,
    },
    "Sao Paulo": {
        "mean_temp": 22.0,
        "amplitude": 5.0,
        "peak_day": 15,
        "humidity": 75,
        "wind": 8,
        "pressure": 1016,
    },
}

rng = np.random.default_rng(42)
days = np.arange(1, 366, dtype=np.int16)

all_rows = []
for city, p in CITY_PROFILES.items():
    seasonal = p["amplitude"] * np.cos(2 * np.pi * (days - p["peak_day"]) / 365)
    temps = (p["mean_temp"] + seasonal + rng.normal(0, 2.0, 365)).clip(-80, 60).astype(np.float32)
    humidity = (p["humidity"] + rng.normal(0, 8.0, 365)).clip(0, 100).astype(np.float32)
    wind = (p["wind"] + rng.exponential(4.0, 365)).clip(0, 200).astype(np.float32)
    pressure = (p["pressure"] + rng.normal(0, 5.0, 365)).clip(800, 1100).astype(np.float32)
    for i, d in enumerate(days):
        all_rows.append(
            (city, int(d), float(temps[i]), float(humidity[i]), float(wind[i]), float(pressure[i]))
        )

climate = CTable(WeatherReading, new_data=all_rows, validate=False, expected_size=len(all_rows))
print(f"Climate table: {len(climate):,} rows × {climate.ncols} columns")
print(f"Compressed: {climate.cbytes / 1024:.1f} KB  (uncompressed: {climate.nbytes / 1024:.1f} KB)")
print(climate)
Climate table: 3,650 rows × 6 columns
Compressed: 42.1 KB  (uncompressed: 295.8 KB)
 city                           day         temperature     humidity        wind_speed      pressure
 U16 (Unicode, max 16 chars)    int16       float32         float32         float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
 Madrid                         1           2.909233808…    56.85164260…    14.76450634…    1014.201110…
 Madrid                         2           0.173932701…    39.05129241…    12.57167434…    1017.437011…
 Madrid                         3           3.712681531…    38.42200088…    12.35102844…    1008.641601…
 Madrid                         4           4.054576396…    46.61845016…    14.67175292…    1004.238769…
 Madrid                         5           -1.76315498…    51.75508117…    18.27811622…    1008.797790…
 Madrid                         6           -0.49616417…    45.09140777…    12.12416076…    1010.510131…
 Madrid                         7           2.336975336…    55.63168334…    12.29123497…    1010.394958…
 Madrid                         8           1.425737023…    51.85435104…    12.44658184…    1015.668395…
 Madrid                         9           2.005382537…    51.73456192…    12.67978572…    1008.193664…
 Madrid                         10          0.317500203…    49.43293380…    15.99008655…    1011.760192…
 ...                            ...         ...             ...             ...             ...
 Sao Paulo                      356         24.10465049…    79.09764862…    11.10457992…    1014.604797…
 Sao Paulo                      357         28.06665992…    75.45437622…    8.651307106…    1021.078308…
 Sao Paulo                      358         25.48140907…    70.06164550…    9.325889587…    1010.540771…
 Sao Paulo                      359         25.73676872…    85.70543670…    8.712495803…    1011.278442…
 Sao Paulo                      360         25.67271423…    76.63825988…    10.50052738…    1019.007934…
 Sao Paulo                      361         26.86038208…    80.04640960…    11.57989692…    1013.804260…
 Sao Paulo                      362         24.26036834…    86.33271026…    11.86164474…    1008.734680…
 Sao Paulo                      363         22.30093574…    86.20143127…    13.39677143…    1022.754089…
 Sao Paulo                      364         27.72695350…    79.63828277…    17.80615234…    1010.502014…
 Sao Paulo                      365         22.60712432…    72.57873535…    9.853503227…    1015.002868…
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
3,650 rows × 6 columns  (3,630 rows hidden)

Part 3 — Querying

3.1 Filtering rows with where()

where() takes a boolean expression built from column comparisons and returns a view — a lightweight object that shares the underlying data without copying it.

[11]:
# All days where temperature exceeded 35 °C (any city)
very_hot = climate.where(climate.temperature > 35)
print(f"Days above 35 °C: {len(very_hot)} ({len(very_hot) / len(climate) * 100:.1f}% of all readings)")
print(very_hot.head(8))
Days above 35 °C: 49 (1.3% of all readings)
 city                           day         temperature     humidity        wind_speed      pressure
 U16 (Unicode, max 16 chars)    int16       float32         float32         float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
 Cairo                          154         35.72507095…    39.59734344…    10.80750942…    1010.431884…
 Cairo                          157         35.66441726…    38.08246231…    9.141173362…    1016.997558…
 Cairo                          158         35.80884170…    34.47190475…    12.61370849…    1016.754882…
 Cairo                          162         35.91463088…    33.77049636…    20.62659454…    1008.747253…
 Cairo                          163         36.98370361…    31.69925498…    15.52884197…    1010.481750…
 Cairo                          165         37.55741119…    35.59812164…    9.190578460…    1014.468322…
 Cairo                          169         36.81953430…    40.87221145…    15.42489051…    1024.706420…
 Cairo                          170         37.21762847…    36.48490905…    12.23543548…    1012.218139…
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
8 rows × 6 columns
[12]:
# Moscow in winter (below freezing)
moscow_frozen = climate.where((climate.city == "Moscow") & (climate.temperature < 0))
print(f"Moscow below freezing: {len(moscow_frozen)} days out of 365")
print(moscow_frozen.head())
Moscow below freezing: 148 days out of 365
 city                           day         temperature     humidity        wind_speed      pressure
 U16 (Unicode, max 16 chars)    int16       float32         float32         float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
 Moscow                         1           -13.5099849…    76.91526794…    10.18380069…    1006.785095…
 Moscow                         2           -13.0531520…    77.78500366…    20.35687637…    1010.101074…
 Moscow                         3           -12.9442214…    81.18754577…    16.77510261…    1020.993286…
 Moscow                         4           -12.8625192…    73.40472412…    13.44744586…    1013.957031…
 Moscow                         5           -10.4717388…    69.11986541…    10.80644416…    1016.391967…
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
5 rows × 6 columns

3.2 Column projection with select()

select() returns a view with only the columns you need — no data is copied.

[13]:
# Just city, day, and temperature — useful before exporting or computing stats
slim = climate.select(["city", "day", "temperature"])
print(slim.head(6))
 city                           day         temperature
 U16 (Unicode, max 16 chars)    int16       float32
─────────────────────────────  ──────────  ──────────────
 Madrid                         1           2.909233808…
 Madrid                         2           0.173932701…
 Madrid                         3           3.712681531…
 Madrid                         4           4.054576396…
 Madrid                         5           -1.76315498…
 Madrid                         6           -0.49616417…
─────────────────────────────  ──────────  ──────────────
6 rows × 3 columns

3.3 Sorting

sort_by() returns a sorted copy by default (or sorts in-place with inplace=True). Multi-column sorting is supported — primary key first.

[14]:
# Which were the 10 hottest days across all cities?
hottest = climate.sort_by("temperature", ascending=False)
print("Top 10 hottest days (any city):")
print(hottest.head(10))
Top 10 hottest days (any city):
 city                           day         temperature     humidity        wind_speed      pressure
 U16 (Unicode, max 16 chars)    int16       float32         float32         float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
 Cairo                          225         39.74710083…    40.93272018…    11.85465049…    1011.177551…
 Cairo                          184         39.47347640…    36.25406646…    8.386101722…    1009.952026…
 Cairo                          195         39.28937149…    26.06153678…    9.814800262…    1025.788452…
 Cairo                          205         38.39984130…    47.71873855…    21.52345466…    1014.675598…
 Cairo                          213         38.25638198…    38.09883117…    17.99407577…    1013.649597…
 Cairo                          218         37.88246536…    29.69907951…    11.12999248…    1015.036682…
 Cairo                          185         37.75819396…    30.60622787…    9.116196632…    1015.961669…
 Cairo                          165         37.55741119…    35.59812164…    9.190578460…    1014.468322…
 Cairo                          177         37.33380889…    23.67673683…    8.815520286…    1013.045227…
 Cairo                          170         37.21762847…    36.48490905…    12.23543548…    1012.218139…
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────
10 rows × 6 columns
[15]:
# Multi-column sort: primary key = city (A→Z), secondary = temperature (hottest first)
# This lets you see each city's hottest day at a glance
by_city_temp = climate.sort_by(["city", "temperature"], ascending=[True, False])
print("Sorted by city (asc) then temperature (desc):")
print(by_city_temp.select(["city", "day", "temperature", "humidity"]).head(20))
Sorted by city (asc) then temperature (desc):
 city                           day         temperature     humidity
 U16 (Unicode, max 16 chars)    int16       float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────
 Beijing                        207         33.77753448…    53.55015182…
 Beijing                        222         31.76755332…    55.32269287…
 Beijing                        205         31.75275421…    56.29219818…
 Beijing                        214         30.79148101…    63.60578536…
 Beijing                        178         30.47000694…    53.21742248…
 Beijing                        202         30.44628334…    65.57708740…
 Beijing                        203         30.01535034…    60.08674621…
 Beijing                        164         29.65973281…    43.95068740…
 Beijing                        177         29.58276939…    37.89999389…
 Beijing                        196         29.33683586…    54.66562271…
 Beijing                        188         29.22382354…    57.12422943…
 Beijing                        212         29.06817436…    43.84931945…
 Beijing                        191         29.05400466…    45.30105209…
 Beijing                        189         28.97620391…    55.71607208…
 Beijing                        181         28.96719360…    51.00849914…
 Beijing                        204         28.92338752…    45.47613143…
 Beijing                        179         28.87882232…    64.10365295…
 Beijing                        193         28.80445098…    52.74388504…
 Beijing                        195         28.73704147…    66.52650451…
 Beijing                        160         28.67715644…    61.17769241…
─────────────────────────────  ──────────  ──────────────  ──────────────
20 rows × 4 columns

Part 4 — Aggregates and Statistics

4.1 Per-city mean temperature

[16]:
print(f"{'City':<12}  {'Mean temp':>10}  {'Min':>7}  {'Max':>7}  {'Std':>7}")
print("-" * 50)
for city in CITY_PROFILES:
    v = climate.where(climate.city == city)
    col = v["temperature"]
    print(f"{city:<12}  {col.mean():>9.1f}°  {col.min():>6.1f}°  {col.max():>6.1f}°  {col.std():>6.1f}°")
City           Mean temp      Min      Max      Std
--------------------------------------------------
Madrid             15.0°    -1.8°    31.4°     9.3°
London             10.8°    -0.3°    22.7°     5.3°
Beijing            12.1°    -9.1°    33.8°    11.5°
New York           13.0°    -4.4°    30.9°    10.2°
Tokyo              15.1°    -0.2°    31.0°     8.5°
Sydney             17.8°     4.7°    30.9°     5.9°
Cairo              21.9°     2.8°    39.7°    10.1°
Moscow              5.0°   -17.5°    26.3°    12.9°
Mumbai             27.9°    18.4°    36.6°     3.5°
Sao Paulo          21.9°    12.7°    30.9°     4.1°

4.2 describe() — full summary in one call

[17]:
# describe() on a select() view — only numeric columns
climate.select(["temperature", "humidity", "wind_speed", "pressure"]).describe()
CTable  3,650 rows × 4 cols

  temperature  [float32]
    count : 3,650
    mean  : 16.04
    std   : 10.72
    min   : -17.54
    max   : 39.75

  humidity  [float32]
    count : 3,650
    mean  : 63.48
    std   : 16.02
    min   : 8.894
    max   : 99.81

  wind_speed  [float32]
    count : 3,650
    mean  : 15.63
    std   : 4.874
    min   : 8.005
    max   : 47.48

  pressure  [float32]
    count : 3,650
    mean  : 1013
    std   : 5.328
    min   : 991.1
    max   : 1036

4.3 Covariance matrix

cov() requires all columns to be numeric (int, float, or bool). It returns a standard numpy.ndarray.

[18]:
numeric = climate.select(["temperature", "humidity", "wind_speed", "pressure"])
cov = numeric.cov()

labels = ["temp", "humidity", "wind", "pressure"]
col_w = 12
print("Covariance matrix (all cities, full year):")
print(" " * 10 + "".join(f"{lbl:>{col_w}}" for lbl in labels))
for i, lbl in enumerate(labels):
    print(f"{lbl:<10}" + "".join(f"{cov[i, j]:>{col_w}.3f}" for j in range(4)))

# And the correlation matrix for easier interpretation
corr = np.corrcoef(np.stack([numeric[c][:] for c in ["temperature", "humidity", "wind_speed", "pressure"]]))
print("\nCorrelation matrix:")
print(" " * 10 + "".join(f"{lbl:>{col_w}}" for lbl in labels))
for i, lbl in enumerate(labels):
    print(f"{lbl:<10}" + "".join(f"{corr[i, j]:>{col_w}.3f}" for j in range(4)))
Covariance matrix (all cities, full year):
                  temp    humidity        wind    pressure
temp           114.963       0.018      -3.523      -0.207
humidity         0.018     256.861      10.773       6.652
wind            -3.523      10.773      23.760      -2.650
pressure        -0.207       6.652      -2.650      28.394

Correlation matrix:
                  temp    humidity        wind    pressure
temp             1.000       0.000      -0.067      -0.004
humidity         0.000       1.000       0.138       0.078
wind            -0.067       0.138       1.000      -0.102
pressure        -0.004       0.078      -0.102       1.000

Part 5 — Analysis: Summer in Madrid

Summer in the northern hemisphere runs roughly from the summer solstice (day 172, June 21) to the autumnal equinox (day 264, September 22).

Let’s zoom in on Madrid during those months and compare it with a few other cities.

[19]:
SUMMER_START = 172  # June 21
SUMMER_END = 264  # September 22

madrid = climate.where(climate.city == "Madrid")
madrid_summer = madrid.where((madrid.day >= SUMMER_START) & (madrid.day <= SUMMER_END))

print(f"Madrid summer readings : {len(madrid_summer)} days")
print(f"  mean temperature : {madrid_summer.temperature.mean():.1f} °C")
print(f"  max  temperature : {madrid_summer.temperature.max():.1f} °C")
print(f"  mean humidity    : {madrid_summer['humidity'].mean():.1f} %")
print(f"  mean wind speed  : {madrid_summer['wind_speed'].mean():.1f} km/h")
Madrid summer readings : 93 days
  mean temperature : 25.8 °C
  max  temperature : 31.4 °C
  mean humidity    : 43.8 %
  mean wind speed  : 15.8 km/h
[20]:
# Compare summer stats across several cities
compare_cities = ["Madrid", "London", "Cairo", "Moscow", "Tokyo", "Sydney"]

print(f"{'City':<12}  {'Summer mean':>12}  {'Summer max':>11}  {'Summer humidity':>16}")
print("-" * 58)
for city in compare_cities:
    v = climate.where(climate.city == city)
    # For Sydney (S. hemisphere) 'summer' is Jan-Mar, i.e. days 1-80 or 355-365
    if city == "Sydney":
        s = v.where((v.day <= 80) | (v.day >= 355))
        label = "(S. summer)"
    else:
        s = v.where((v.day >= SUMMER_START) & (v.day <= SUMMER_END))
        label = ""
    mean_t = s["temperature"].mean()
    max_t = s["temperature"].max()
    mean_h = s["humidity"].mean()
    print(f"{city:<12}  {mean_t:>10.1f}°C  {max_t:>9.1f}°C  {mean_h:>14.1f}%  {label}")
City           Summer mean   Summer max   Summer humidity
----------------------------------------------------------
Madrid              25.8°C       31.4°C            43.8%
London              16.5°C       22.7°C            74.6%
Cairo               33.5°C       39.7°C            34.4%
Moscow              20.1°C       26.3°C            69.3%
Tokyo               25.1°C       31.0°C            73.0%
Sydney              24.6°C       30.9°C            63.8%  (S. summer)
[21]:
# Top 10 hottest days in Madrid across the whole year
# Sort the full table, then filter — views cannot be sorted directly
hottest_all = climate.sort_by("temperature", ascending=False)
madrid_sorted = hottest_all.where(hottest_all.city == "Madrid")
print("10 hottest days in Madrid:")
print(madrid_sorted.select(["city", "day", "temperature", "humidity"]).head(10))
10 hottest days in Madrid:
 city                           day         temperature     humidity
 U16 (Unicode, max 16 chars)    int16       float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────
 Madrid                         191         31.39920806…    42.54333496…
 Madrid                         190         31.23257637…    44.30324554…
 Madrid                         227         31.22744178…    46.99229049…
 Madrid                         194         30.91518402…    35.04422760…
 Madrid                         186         30.87937355…    48.08030319…
 Madrid                         202         30.74568367…    43.72281265…
 Madrid                         177         30.46902275…    38.39016342…
 Madrid                         163         30.21517944…    46.05188751…
 Madrid                         181         30.18102455…    43.72652053…
 Madrid                         184         29.93619918…    50.65479660…
─────────────────────────────  ──────────  ──────────────  ──────────────
10 rows × 4 columns

5.1 Plotting: temperature over the year

Let’s visualise the full annual temperature cycle for a few contrasting cities.

[22]:
plot_cities = {
    "Madrid": "#e63946",
    "London": "#457b9d",
    "Moscow": "#2d6a4f",
    "Cairo": "#f4a261",
    "Sydney": "#a8dadc",
}

fig, ax = plt.subplots(figsize=(12, 5))

for city, color in plot_cities.items():
    v = climate.where(climate.city == city)
    d = v.day[:].astype(int)
    t = v["temperature"][:]
    order = np.argsort(d)
    ax.plot(d[order], t[order], label=city, color=color, linewidth=1.5, alpha=0.85)

ax.axvspan(SUMMER_START, SUMMER_END, alpha=0.10, color="gold", label="N. summer")
ax.set_xlabel("Day of year")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Daily temperature — selected cities")
ax.legend(loc="upper left")
ax.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
../../_images/getting_started_tutorials_13.ctable-basics_38_0.png

5.2 Summer temperature distribution — Madrid vs London

A simple histogram comparison of how often each city exceeds different temperature thresholds.

[23]:
madrid_s = climate.where(
    (climate.city == "Madrid") & (climate.day >= SUMMER_START) & (climate.day <= SUMMER_END)
)["temperature"][:]

london_s = climate.where(
    (climate.city == "London") & (climate.day >= SUMMER_START) & (climate.day <= SUMMER_END)
)["temperature"][:]

fig, ax = plt.subplots(figsize=(9, 4))
bins = np.linspace(0, 45, 30)
ax.hist(madrid_s, bins=bins, alpha=0.7, color="#e63946", label="Madrid")
ax.hist(london_s, bins=bins, alpha=0.7, color="#457b9d", label="London")
ax.axvline(
    madrid_s.mean(),
    color="#e63946",
    linestyle="--",
    linewidth=1.5,
    label=f"Madrid mean {madrid_s.mean():.1f}°C",
)
ax.axvline(
    london_s.mean(),
    color="#457b9d",
    linestyle="--",
    linewidth=1.5,
    label=f"London mean {london_s.mean():.1f}°C",
)
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Days")
ax.set_title("Summer temperature distribution — Madrid vs London")
ax.legend()
ax.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
../../_images/getting_started_tutorials_13.ctable-basics_40_0.png

5.3 Mean summer temperature — all cities ranked

[24]:
city_summer_means = {}
for city in CITY_PROFILES:
    v = climate.where(climate.city == city)
    if city == "Sydney" or city == "Sao Paulo":
        s = v.where((v.day <= 80) | (v.day >= 355))
    else:
        s = v.where((v.day >= SUMMER_START) & (v.day <= SUMMER_END))
    city_summer_means[city] = s["temperature"].mean()

sorted_cities = sorted(city_summer_means.items(), key=lambda x: x[1], reverse=True)
names = [c for c, _ in sorted_cities]
means = [m for _, m in sorted_cities]
colors = ["#e63946" if m > 30 else "#f4a261" if m > 20 else "#457b9d" for m in means]

fig, ax = plt.subplots(figsize=(10, 4))
bars = ax.barh(names, means, color=colors, edgecolor="white")
ax.bar_label(bars, fmt="%.1f °C", padding=4)
ax.set_xlabel("Mean summer temperature (°C)")
ax.set_title("Cities ranked by summer heat")
ax.set_xlim(0, max(means) * 1.15)
ax.grid(True, axis="x", linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
../../_images/getting_started_tutorials_13.ctable-basics_42_0.png

Part 6 — Mutations

CTable supports structural and value mutations: adding/dropping columns, deleting rows, sorting in place.

[25]:
# Add a 'feels_like' column: temperature adjusted for wind chill (simplified)
climate.add_column("feels_like", blosc2.float32(), default=0.0)

temp = climate.temperature[:]
wind = climate["wind_speed"][:]
# Simple wind-chill approximation (only meaningful below 10°C)
feels = np.where(temp < 10, temp - wind * 0.15, temp).astype(np.float32)
climate["feels_like"].assign(feels)

print("Table with feels_like column:")
print(climate.head(5))
print(f"\nColdest 'feels like' day: {climate['feels_like'].min():.1f} °C")
Table with feels_like column:
 city                           day         temperature     humidity        wind_speed      pressure        feels_like
 U16 (Unicode, max 16 chars)    int16       float32         float32         float32         float32         float32
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────  ──────────────
 Madrid                         1           2.909233808…    56.85164260…    14.76450634…    1014.201110…    0.694557666…
 Madrid                         2           0.173932701…    39.05129241…    12.57167434…    1017.437011…    -1.71181857…
 Madrid                         3           3.712681531…    38.42200088…    12.35102844…    1008.641601…    1.860027194…
 Madrid                         4           4.054576396…    46.61845016…    14.67175292…    1004.238769…    1.853813409…
 Madrid                         5           -1.76315498…    51.75508117…    18.27811622…    1008.797790…    -4.50487232…
─────────────────────────────  ──────────  ──────────────  ──────────────  ──────────────  ──────────────  ──────────────
5 rows × 7 columns

Coldest 'feels like' day: -19.2 °C

Part 7 — Persistence

CTable can live on disk in two Blosc2 container formats:

  • .b2d: a directory-backed store. This is the best default for local read/write workflows.

  • .b2z: a single zip-backed store. This is compact and convenient for sharing or archiving, and is typically opened read-only.

Both formats keep the table columns compressed. Use open() for persistent/on-disk access and load() when you want an in-memory copy.

[26]:
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp(prefix="blosc2_climate_")
disk_path = f"{tmpdir}/climate.b2d"
zip_path = f"{tmpdir}/climate.b2z"
unpacked_path = f"{tmpdir}/climate-unpacked.b2d"
compact_zip_path = f"{tmpdir}/climate-compact.b2z"

# Save the in-memory table to a directory-backed .b2d store
climate.save(disk_path)
print(f"Saved to '{disk_path}'")

total_kb = sum(os.path.getsize(os.path.join(r, f)) for r, _, fs in os.walk(disk_path) for f in fs) / 1024
print(f"On-disk size: {total_kb:.1f} KB")
print(f"In-memory compressed: {climate.cbytes / 1024:.1f} KB")
Saved to '/var/folders/tb/7hwq2y354bb_68xwxjwjwwlr0000gn/T/blosc2_climate_pnnzna2z/climate'
On-disk size: 55.9 KB
In-memory compressed: 53.4 KB

Fast conversion between .b2d and .b2z

to_b2z() and to_b2d() use fast physical pack/unpack paths when possible: already-compressed leaves are copied as-is, without recompressing columns. This preserves the physical layout, including deleted rows and spare capacity.

Use compact=True when you want a logical compacted copy containing only visible/live rows. That path may rewrite columns and is slower.

[ ]:
# Fast-pack .b2d -> .b2z
ro = CTable.open(disk_path, mode="r")
ro.to_b2z(zip_path, overwrite=True)
ro.close()
print(f"Packed into '{zip_path}'")

# Fast-unpack .b2z -> .b2d
zipped = CTable.open(zip_path, mode="r")
zipped.to_b2d(unpacked_path, overwrite=True)
zipped.close()
print(f"Unpacked into '{unpacked_path}'")

# Logical compacted copy
ro = CTable.open(disk_path, mode="r")
ro.to_b2z(compact_zip_path, overwrite=True, compact=True)
ro.close()
print(f"Compacted copy: '{compact_zip_path}'")
[27]:
# Open read-only — fast, no data is copied until you access a column
ro = CTable.open(disk_path, mode="r")
print(f"Opened read-only: {len(ro):,} rows")
print(f"Cairo annual mean: {ro.where(ro.city == 'Cairo').temperature.mean():.1f} °C")
ro.close()

# Load fully into RAM (useful when you need repeated random access)
ram = CTable.load(disk_path)
print(f"Loaded into RAM : {len(ram):,} rows")

shutil.rmtree(tmpdir)
print("Temporary files removed.")
Opened read-only: 3,650 rows
Cairo annual mean: 21.9 °C
Loaded into RAM : 3,650 rows
Temporary files removed.

Part 8 — Arrow & CSV interop

CTable speaks Arrow and CSV, so it fits naturally into data pipelines.

[28]:
# CTable → Arrow
arrow_table = climate.select(["city", "day", "temperature"]).to_arrow()
print("Arrow table schema:", arrow_table.schema)
print("First 3 rows:", arrow_table.slice(0, 3).to_pydict())
Arrow table schema: city: string
day: int16
temperature: float
First 3 rows: {'city': ['Madrid', 'Madrid', 'Madrid'], 'day': [1, 2, 3], 'temperature': [2.909233808517456, 0.17393270134925842, 3.712681531906128]}
[29]:
import os
import tempfile

# CTable → CSV → CTable round-trip
tmp_csv = tempfile.mktemp(suffix=".csv")
climate.select(["city", "day", "temperature", "humidity"]).to_csv(tmp_csv)

print(f"CSV size: {os.path.getsize(tmp_csv) / 1024:.1f} KB")


@dataclass
class SlimReading:
    city: str = blosc2.field(blosc2.string(max_length=16))
    day: int = blosc2.field(blosc2.int16(ge=1, le=365), default=1)
    temperature: float = blosc2.field(blosc2.float32(), default=0.0)
    humidity: float = blosc2.field(blosc2.float32(), default=0.0)


t_from_csv = CTable.from_csv(tmp_csv, SlimReading)
print(f"Loaded from CSV: {len(t_from_csv):,} rows")
os.remove(tmp_csv)
CSV size: 111.5 KB
Loaded from CSV: 3,650 rows

Part 9 — Nullable columns

Real-world data is often incomplete. CTable handles missing values through a null sentinel approach: you declare a specific value (e.g. -1, "", or float("nan")) that represents “no data” for a column. The library treats it transparently in aggregates, sorting, unique(), value_counts(), and Arrow export.

This pattern is most useful for integer and string columns, which have no natural missing-value representation (unlike floats, which can use NaN).

[30]:
from dataclasses import dataclass


# Sensor log where some readings may be missing
@dataclass
class SensorLog:
    sensor_id: int = blosc2.field(blosc2.int32(ge=0))
    # -999 means "offline" — bypasses the ge/le constraint when stored
    temperature: float = blosc2.field(blosc2.float64(ge=-50.0, le=60.0, null_value=-999.0), default=-999.0)
    # "" means "location unknown"
    location: str = blosc2.field(blosc2.string(max_length=16, null_value=""), default="")


log_data = [
    (0, 22.3, "roof"),
    (1, -999.0, "cellar"),  # temperature: offline
    (2, 18.7, ""),  # location: unknown
    (3, 31.5, "garage"),
    (4, -999.0, ""),  # both missing
    (5, 15.1, "roof"),
]

log = blosc2.CTable(SensorLog, new_data=log_data)
print(log)
 sensor_id     temperature        location
 int32         float64            U16 (Unicode, max 16 chars)
────────────  ─────────────────  ─────────────────────────────
 0             22.3               roof
 1             -999.0             cellar
 2             18.7
 3             31.5               garage
 4             -999.0
 5             15.1               roof
────────────  ─────────────────  ─────────────────────────────
6 rows × 3 columns

9.1 Detecting nulls

[31]:
# is_null() → boolean array aligned to live rows
print("temperature is_null:", log["temperature"].is_null().tolist())
print("location is_null   :", log["location"].is_null().tolist())
print()
print(f"Offline sensors (null temperature): {log.temperature.null_count()}")
print(f"Unknown locations                 : {log['location'].null_count()}")

# Use notnull() as a mask to select only valid readings
valid_temps = log["temperature"][:][log["temperature"].notnull()]
print(f"Valid temperature readings: {valid_temps}")
temperature is_null: [False, True, False, False, True, False]
location is_null   : [False, False, True, False, True, False]

Offline sensors (null temperature): 2
Unknown locations                 : 2
Valid temperature readings: [22.3 18.7 31.5 15.1]

9.2 Null-aware aggregates

[32]:
# All aggregates automatically skip the null sentinel
temp = log["temperature"]
print(f"mean  = {temp.mean():.2f}  (3 valid readings only)")
print(f"min   = {temp.min():.2f}")
print(f"max   = {temp.max():.2f}")
print()
# unique() and value_counts() also exclude the sentinel
print("location unique:", log["location"].unique().tolist())
mean  = 21.90  (3 valid readings only)
min   = 15.10
max   = 31.50

location unique: ['cellar', 'garage', 'roof']

9.3 Validation bypass

[33]:
# The sentinel bypasses ge/le constraints — you can store it freely
# even though -999.0 is below ge=-50.0
log.append((6, -999.0, "attic"))  # succeeds: -999 is the null sentinel
print(f"Rows after append: {len(log)}")

# A genuine constraint violation still raises:
try:
    log.append((7, -100.0, "lab"))  # -100 is NOT the sentinel → rejected
except ValueError:
    print("Rejected: temperature -100 violates ge=-50")
Rows after append: 7
Rejected: temperature -100 violates ge=-50

9.4 Sort: nulls always go last

[34]:
# Regardless of ascending / descending, null rows are placed at the end
s = log.sort_by("temperature")
print("Ascending (nulls last):")
print([round(v, 1) for v in s["temperature"][:].tolist()])

s_desc = log.sort_by("temperature", ascending=False)
print("Descending (nulls still last):")
print([round(v, 1) for v in s_desc["temperature"][:].tolist()])
Ascending (nulls last):
[15.1, 18.7, 22.3, 31.5, -999.0, -999.0, -999.0]
Descending (nulls still last):
[31.5, 22.3, 18.7, 15.1, -999.0, -999.0, -999.0]

9.5 Arrow export: sentinels become Arrow nulls

[35]:
try:
    arrow = log.to_arrow()
    tc = arrow.column("temperature")
    lc = arrow.column("location")
    print(f"Arrow temperature null_count: {tc.null_count}")
    print(f"Arrow location null_count   : {lc.null_count}")
    print("Arrow temperature values:", tc.to_pylist())
except Exception as e:
    print(f"(pyarrow not available: {e})")
Arrow temperature null_count: 3
Arrow location null_count   : 2
Arrow temperature values: [22.3, None, 18.7, 31.5, None, 15.1, None]

9.6 CSV: empty cells become the null sentinel

[36]:
import os
import tempfile

csv_content = """sensor_id,temperature,location
10,25.1,lab
11,,office
12,18.3,
"""

with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
    f.write(csv_content)
    csv_path = f.name

log2 = blosc2.CTable.from_csv(csv_path, SensorLog)
print(log2)
print(f"temperature null_count: {log2.temperature.null_count()}")
print(f"location null_count   : {log2['location'].null_count()}")
os.unlink(csv_path)
 sensor_id     temperature        location
 int32         float64            U16 (Unicode, max 16 chars)
────────────  ─────────────────  ─────────────────────────────
 10            25.1               lab
 11            -999.0             office
 12            18.3
────────────  ─────────────────  ─────────────────────────────
3 rows × 3 columns
temperature null_count: 1
location null_count   : 1

Summary

Here’s everything we covered:

Feature

API

Create

CTable(Schema), CTable(Schema, new_data=...)

Insert

append(row), extend(list_or_array)

View

head(), tail(), print(t), t.info()

Filter

where(expr) → view

Project

select([cols]) → view

Sort

sort_by(cols), sort_by(cols, inplace=True)

Aggregates

col.sum(), .mean(), .std(), .min(), .max()

Stats

describe(), cov()

Mutate

delete(), compact(), add_column(), drop_column(), assign()

Persist

save(path), to_b2z(), to_b2d(), CTable.open(path), CTable.load(path)

Interop

to_arrow(), from_arrow(), to_csv(), from_csv()

Nullable

null_value= on spec, is_null(), notnull(), null_count()

CTable is designed for compressed analytical workloads — large tables that need to stay small in RAM while still being fast to query and easy to persist.