Storage architecture, not database choice, determines cost and query performance for AI inference pipelines running on time-series data — a finding with direct consequences for engineering teams who lock in schemas before workloads scale.

In an InfoQ article, infrastructure engineer Nirmesh Khandelwal benchmarked storage design tradeoffs using PostgreSQL 16 and Apache Parquet. He compared two schema approaches: a flat layout where every row repeats dimension strings verbatim, and a normalized layout where stable identifiers are stored once in a separate metadata table and referenced by a compact integer key. Across a benchmark of 1,000 series and 2.8 million rows, normalization cut total storage by 42 percent — 289 MB eliminated from the baseline. The efficiency gain derives from the cost model: flat schemas multiply full dimension strings by row count, while normalized schemas multiply only a small series ID by row count and pay the string cost once per unique series.
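The two layouts can be sketched as follows. This is an illustrative reconstruction, not Khandelwal's published DDL; table and column names are our assumptions.

```sql
-- Flat layout: every row repeats the full dimension strings.
CREATE TABLE readings_flat (
    ts          timestamptz NOT NULL,
    host        text        NOT NULL,  -- repeated verbatim on every row
    region      text        NOT NULL,
    metric_name text        NOT NULL,
    value       double precision
);

-- Normalized layout: dimension strings are stored once per unique series.
CREATE TABLE series (
    series_id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    host        text NOT NULL,
    region      text NOT NULL,
    metric_name text NOT NULL,
    UNIQUE (host, region, metric_name)
);

CREATE TABLE readings (
    series_id integer     NOT NULL REFERENCES series (series_id),
    ts        timestamptz NOT NULL,
    value     double precision
);
```

Per row, the dimension cost drops from the combined length of the strings to a 4-byte integer; the string cost is paid once per unique series in the registry table.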

FIG. 02 Normalized schema reduced storage by 289 MB (~42%) in PostgreSQL 16 with 1,000 series and 2.8M rows. — InfoQ: Time-Series Storage Design

The gains are not universal. High-cardinality fields — request IDs, session tokens, trace identifiers — collapse the advantage entirely. When the number of unique dimension combinations approaches the number of rows, the series registry grows linearly, and both storage and indexing costs track row count without the per-row compression benefit. The practical guidance: treat stable, low-cardinality attributes as dimensions that go into series identity, and treat changing measurements as metrics that stay in the readings table. Mixing those categories negates normalization savings.
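A quick way to check whether normalization is still paying off, assuming a normalized layout with a `readings` table keyed by `series_id` (an illustrative name):

```sql
-- Rows per unique series. A large ratio means many readings amortize each
-- registry entry; a ratio near 1 means dimension cardinality tracks row
-- count and the per-row savings have evaporated.
SELECT count(*)::numeric
     / NULLIF(count(DISTINCT series_id), 0) AS rows_per_series
FROM readings;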

For teams unwilling to commit to a rigid schema upfront, Khandelwal details a third option: storing dimensions as PostgreSQL jsonb with targeted GIN or B-tree indexes on specific keys. This avoids schema migrations as tag shapes evolve, at the cost of index sprawl and type drift if teams lack indexing discipline. A jsonb column without a deliberate policy accumulates indexes on rarely queried paths and degrades write throughput at scale.
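The jsonb variant might look like the following sketch; table and key names are assumptions, but both index types shown are standard PostgreSQL features.

```sql
CREATE TABLE readings_jsonb (
    ts    timestamptz NOT NULL,
    dims  jsonb       NOT NULL,
    value double precision
);

-- GIN index for containment queries such as:
--   WHERE dims @> '{"region": "us-east-1"}'
-- The jsonb_path_ops operator class yields smaller, faster indexes
-- than the default when only @> lookups are needed.
CREATE INDEX readings_dims_gin
    ON readings_jsonb USING GIN (dims jsonb_path_ops);

-- Targeted B-tree expression index on one hot key,
-- supporting equality and range predicates on that key alone.
CREATE INDEX readings_dims_host
    ON readings_jsonb ((dims ->> 'host'));
```

The discipline the article calls for amounts to indexing only the handful of keys that queries actually filter on, rather than reflexively indexing every path that appears in the data.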

Time-based partitioning enables O(1) data expiration — dropping a partition is a metadata operation rather than a row-level delete — and allows the query planner to prune irrelevant windows during range scans. The hazard is a write hotspot: all current writes land in the active partition. Adding a second partition axis on series identity distributes that write load and narrows read scans simultaneously. For inference pipelines with bursty ingest from many model endpoints, two-dimensional partitioning controls p99 write latency.
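A minimal sketch of the two-axis scheme, using PostgreSQL declarative partitioning; the names, date bounds, and hash modulus are illustrative choices, not the article's:

```sql
CREATE TABLE readings (
    series_id integer     NOT NULL,
    ts        timestamptz NOT NULL,
    value     double precision
) PARTITION BY RANGE (ts);

-- Each time slice is itself hash-partitioned on series identity,
-- spreading the active window's writes across physical tables.
CREATE TABLE readings_2025_07
    PARTITION OF readings
    FOR VALUES FROM ('2025-07-01') TO ('2025-08-01')
    PARTITION BY HASH (series_id);

CREATE TABLE readings_2025_07_h0 PARTITION OF readings_2025_07
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ... three more leaves with REMAINDER 1, 2, 3 ...

-- Expiration is a metadata operation, not a row-level delete:
DROP TABLE readings_2025_07;
```

Range predicates on `ts` prune to one time slice, and predicates on `series_id` further prune to one hash leaf, so both write fan-out and read scans narrow at once.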

Rolling up from five-second raw resolution to one-hour pre-aggregated rollups reduces row count by a factor of 720. Khandelwal's pattern retains full resolution only within a recent window where anomaly detection or model retraining requires granular data, and serves historical queries from the rollup tables. For a pipeline ingesting sensor or telemetry data at high frequency, this design decision reduces storage costs by more than two orders of magnitude versus retaining all raw rows indefinitely.
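One way to materialize such rollups is a periodic aggregation job over the hour that just closed; the table and column names here are assumptions, not the article's DDL.

```sql
CREATE TABLE readings_1h (
    series_id integer     NOT NULL,
    bucket    timestamptz NOT NULL,  -- start of the hour
    n         bigint,
    avg_value double precision,
    min_value double precision,
    max_value double precision,
    PRIMARY KEY (series_id, bucket)
);

-- Roll up the most recently completed hour from the raw table.
INSERT INTO readings_1h
SELECT series_id,
       date_trunc('hour', ts) AS bucket,
       count(*), avg(value), min(value), max(value)
FROM readings
WHERE ts >= date_trunc('hour', now()) - interval '1 hour'
  AND ts <  date_trunc('hour', now())
GROUP BY series_id, date_trunc('hour', ts)
ON CONFLICT (series_id, bucket) DO NOTHING;
```

Storing count alongside avg, min, and max lets later queries re-weight or merge buckets without returning to the raw rows.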

FIG. 03 Downsampling from 5-second to 1-hour resolution reduces row count by a factor of 720. — InfoQ: Time-Series Storage Design

Every time-series database — whether InfluxDB, TimescaleDB, or a homegrown PostgreSQL setup — encodes these tradeoffs in its internals. Teams that understand the primitives can tune any system; teams that skip schema design and rely on defaults will pay compounding costs in storage bills, slow range queries, and difficult-to-reverse migrations. For AI inference pipelines where features are derived from rolling windows of historical measurements, schema design is not an implementation detail — it is a direct input to inference latency and retraining cost.

Written and edited by AI agents