Hardwood 1.0, a pure-Java Apache Parquet reader built from scratch by Gunnar Morling at Confluent, shipped to Maven Central in late June. The library carries zero mandatory dependencies for uncompressed or gzip-compressed files, requires only a single JAR for Snappy/Zstd/LZ4/Brotli codecs, and parallelizes page decoding across all available CPU cores by default. On an AWS m7i.2xlarge (8 vCPU / 32 GB RAM) scanning 48.7 million NYC taxi rows across 830 MB of compressed Parquet, Hardwood's column reader achieves 16.5M rows/sec. On a single core, the same workload yields 3.9M rows/sec—a figure worth noting because it isolates parallelism gains from algorithmic improvements.
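
As a rough check on those figures, the sketch below derives the implied parallel speedup and scan times from the numbers quoted above. It uses only the reported values; the class and method names are illustrative, and the results are back-of-the-envelope estimates, not additional measurements.

```java
import java.util.Locale;

// Back-of-the-envelope arithmetic on the benchmark figures quoted in the article.
final class ScanMath {
    static final double ROWS = 48.7e6;                    // rows scanned
    static final double ALL_CORES_ROWS_PER_SEC = 16.5e6;  // column reader, all cores
    static final double SINGLE_CORE_ROWS_PER_SEC = 3.9e6; // same workload, one core
    static final int VCPUS = 8;                           // m7i.2xlarge

    // How many times faster the all-core run is than the single-core run.
    static double speedup() {
        return ALL_CORES_ROWS_PER_SEC / SINGLE_CORE_ROWS_PER_SEC;
    }

    // Speedup divided by the number of vCPUs (1.0 would be perfect scaling).
    static double efficiency() {
        return speedup() / VCPUS;
    }

    // Wall-clock seconds for a full scan at the given throughput.
    static double scanSeconds(double rowsPerSec) {
        return ROWS / rowsPerSec;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "speedup: %.2fx%n", speedup());
        System.out.printf(Locale.ROOT, "efficiency: %.1f%%%n", efficiency() * 100);
        System.out.printf(Locale.ROOT, "scan, all cores: %.2f s%n",
                scanSeconds(ALL_CORES_ROWS_PER_SEC));
        System.out.printf(Locale.ROOT, "scan, one core: %.2f s%n",
                scanSeconds(SINGLE_CORE_ROWS_PER_SEC));
    }
}
```

On these figures the eight-vCPU run is about 4.2 times faster than the single-core run, or roughly 53% parallel efficiency, and the full scan takes about 3 seconds instead of about 12.5.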

The baseline is parquet-java 1.17.1, the standard Apache implementation. That library is single-threaded at its core and brings a heavyweight dependency tree. Teams building feature pipelines or training-data loaders on Spark, Flink, or bare JVM have absorbed that cost or shifted to C++ DuckDB/Arrow bindings to escape it. Hardwood offers a third path: stay on the JVM, keep the classpath lean, and use the JDK's concurrency model to scale.
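
The general idea of scaling with the JDK's own concurrency primitives can be sketched as follows. This is not Hardwood's code: decodePage is a hypothetical stand-in for real page decoding, and a fixed thread pool sized to availableProcessors() is only an assumption about how such fan-out might be arranged. It shows independent pages being decoded on separate threads and collected in their original order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPageDecode {
    // Hypothetical "decoder": widens each int to a long. A real decoder would
    // decompress the page and undo its encoding instead.
    static long[] decodePage(int[] encodedPage) {
        long[] out = new long[encodedPage.length];
        for (int i = 0; i < encodedPage.length; i++) {
            out[i] = encodedPage[i];
        }
        return out;
    }

    // Decodes every page on a pool with one thread per available processor
    // and returns the results in the same order as the input pages.
    static List<long[]> decodeAll(List<int[]> pages) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int[] page : pages) {
                futures.add(pool.submit(() -> decodePage(page)));
            }
            List<long[]> decoded = new ArrayList<>();
            for (Future<long[]> future : futures) {
                decoded.add(future.get()); // blocks until that page is done
            }
            return decoded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while decoding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page decoding failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Pages are a natural unit for this kind of fan-out because each one can be decoded without reference to its neighbors, so the only coordination needed is collecting the results.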

FIG. 02 Hardwood achieves 16.5M rows/sec by parallelizing page decoding across available CPU cores; parquet-java remains single-threaded. — Gunnar Morling, Hardwood 1.0 benchmark on AWS m7i.2xlarge

Morling ships two APIs he deliberately keeps separate. The row reader API delivers structured record access via typed calls—getLong, getString, getDate, getTimestamp on a familiar cursor—and handles nested and repeatable columns. The column reader API exposes batches of primitive arrays for hand-off to worker pools or vectorized loops, with minimal per-value overhead and caller-controlled allocation. Predicate evaluation runs branchless and batch-at-a-time to reduce CPU branch mispredictions during filtered scans. Both APIs support column projection and predicate push-down against remote object storage. S3 access uses Java's built-in HTTP client with a custom SigV4 implementation and no AWS SDK.
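
To make the branchless, batch-at-a-time idea concrete, here is a minimal sketch of a "greater than" filter over a batch of long values. It illustrates the general technique only and is not taken from Hardwood: the class, method names, and selection-vector layout are assumptions made for the example. The comparison is computed with bit arithmetic instead of an if statement, and every loop iteration performs the same store, so the work per value does not depend on the data.

```java
final class BranchlessFilter {
    // Returns 1 if v > threshold, else 0, using bit arithmetic only.
    // The expression leaves the sign bit set exactly when threshold < v,
    // and stays correct even if the subtraction overflows.
    static int greaterThan(long v, long threshold) {
        long diff = threshold - v;
        long lessThan = diff ^ ((threshold ^ v) & (diff ^ threshold));
        return (int) (lessThan >>> 63);
    }

    // Writes the indices of matching values into selection and returns how
    // many matched. selection must have room for at least `length` entries.
    static int selectGreaterThan(long[] batch, int length, long threshold, int[] selection) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            selection[count] = i;                      // unconditional store
            count += greaterThan(batch[i], threshold); // advance only on a match
        }
        return count;
    }

    public static void main(String[] args) {
        long[] batch = {5, -3, 10, 7};
        int[] selection = new int[batch.length];
        int matches = selectGreaterThan(batch, batch.length, 6, selection);
        for (int i = 0; i < matches; i++) {
            System.out.println("match at index " + selection[i]);
        }
    }
}
```

A non-matching index is simply overwritten by the next iteration, which is what lets the loop avoid a data-dependent branch. Whether the JIT ultimately emits branch-free machine code still depends on the JVM and hardware.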

Supply-chain hygiene drives the design. Hardwood logs through System.Logger, the JDK's built-in logging facade available since Java 9, instead of an external framework, which removes a common source of classpath conflicts in multi-tenant deployments. The zero-dependency baseline also shortens the SBOM and, with it, the supply-chain attack surface, a concern that has moved from security teams to ML platform leads now that training pipelines process customer data at scale. The project's contribution policy on GitHub states: "LLM-assisted contributions are welcome, but vibe coding—accepting AI-generated changes without understanding them—is not." In practice that sets a review standard: contributors are expected to understand and be able to explain the changes they submit.
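
For readers who have not used it, System.Logger needs nothing beyond the JDK. The sketch below shows the standard calls; the logger name and the logScan helper are made up for illustration and do not reflect how Hardwood structures its logging.

```java
final class LoggerDemo {
    // System.getLogger is part of java.base (Java 9+); the name is illustrative.
    static final System.Logger LOG = System.getLogger("demo.parquet.reader");

    // Logs one line at INFO level. The message uses java.text.MessageFormat
    // placeholders ({0}, {1}), which is what System.Logger expects.
    static void logScan(String file, long rows) {
        LOG.log(System.Logger.Level.INFO, "read {0} rows from {1}", rows, file);
    }

    public static void main(String[] args) {
        logScan("trips.parquet", 1000);
    }
}
```

An application that already uses a logging framework can route these calls to it by supplying a System.LoggerFinder implementation; without one, the JDK falls back to its default backend.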

Write support is the most notable omission: 1.0 ships a reader only. Writing is the headline feature planned for 1.1, and its design is still under discussion in the public issue tracker. Teams evaluating Hardwood for a feature store or lakehouse pipeline need a separate write path, such as parquet-java or a DuckDB sidecar, until 1.1 lands. The CLI, a GraalVM-native binary with a TUI for schema and metadata inspection, is functional and useful for validating file integrity without spinning up a data processing framework.

The roadmap beyond 1.0 also includes Bloom filter support, String reuse for dictionary-encoded columns, and an Apache Flink integration. Compatibility is a hard invariant: any file parseable by parquet-java must parse with Hardwood, and deviations are tracked as bugs. Twenty contributors have joined the project. API-change reports are published alongside the Javadoc on each release.

If your JVM data pipeline's Parquet read throughput is CPU-bound and you're carrying parquet-java's dependency weight, benchmark Hardwood 1.0 against your workload—just hold on write support until 1.1.

Written and edited by AI agents