Hardwood 1.0 Replaces parquet-java Without Dependencies

Hardwood 1.0, a pure-Java Apache Parquet reader built from scratch by Gunnar Morling at Confluent, shipped to Maven Central in late June. The library has no mandatory dependencies: uncompressed and gzip-compressed files need no third-party libraries at all, and each of the Snappy, Zstd, LZ4, and Brotli codecs typically adds a single JAR. It parallelizes page decoding across all available CPU cores by default. On an AWS m7i.2xlarge (8 vCPU / 32 GB RAM) scanning 48.7 million NYC taxi rows across 830 MB of compressed Parquet, Hardwood's column reader achieves 16.5M rows/sec. Restricted to a single core, the same workload yields 3.9M rows/sec, a figure worth noting because it separates the gain from parallelism from the gain in single-threaded efficiency.
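The benchmark figures above imply some back-of-the-envelope numbers. The sketch below only does arithmetic on values quoted in this article (48.7 million rows, 16.5M and 3.9M rows/sec, 8 vCPUs); the class and method names are illustrative and are not part of Hardwood.

```java
import java.util.Locale;

// Illustrative arithmetic only; nothing here calls Hardwood.
final class ThroughputMath {

    /** Seconds needed to scan the given number of rows at the given rate. */
    static double scanSeconds(double rows, double rowsPerSec) {
        return rows / rowsPerSec;
    }

    /** Ratio of the multi-threaded rate to the single-core rate. */
    static double speedup(double parallelRate, double singleCoreRate) {
        return parallelRate / singleCoreRate;
    }

    public static void main(String[] args) {
        double rows = 48.7e6;      // rows scanned in the benchmark
        double allVcpus = 16.5e6;  // rows/sec using all 8 vCPUs
        double oneCore = 3.9e6;    // rows/sec on a single core
        int vcpus = 8;

        double s = speedup(allVcpus, oneCore);
        System.out.println(String.format(Locale.ROOT,
                "all vCPUs: %.2f s, one core: %.2f s, speedup: %.2fx, per-vCPU efficiency: %.2f",
                scanSeconds(rows, allVcpus), scanSeconds(rows, oneCore), s, s / vcpus));
    }
}
```

Under these figures the full scan takes roughly 3 seconds on all vCPUs versus roughly 12.5 seconds on one core, a speedup of about 4.2x.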

The comparison baseline is parquet-java 1.17.1, the standard Apache implementation. That library is single-threaded at its core and brings a heavyweight dependency tree. Teams building feature pipelines or training-data loaders on Spark, Flink, or a bare JVM have either absorbed that cost or shifted to C++ DuckDB/Arrow bindings to escape it. Hardwood offers a third path: stay on the JVM, keep the classpath lean, and use the JDK's concurrency model to scale.
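To make the idea of fanning page decoding out over JDK threads concrete, here is a minimal sketch. It is not Hardwood's implementation and uses no Hardwood classes: the page layout (a plain run of little-endian 64-bit integers), the class name, and the helper methods are all assumptions chosen to keep the example self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one decode task per page, run on a pool sized to the CPU count.
final class ParallelPageDecodeSketch {

    /** Decodes one page, assumed here to hold little-endian 64-bit integers. */
    static long[] decodePage(byte[] page) {
        ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        long[] out = new long[page.length / Long.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getLong();
        }
        return out;
    }

    /** Decodes all pages in parallel and returns the results in page order. */
    static List<long[]> decodeAll(List<byte[]> pages) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<long[]>> tasks = new ArrayList<>();
            for (byte[] page : pages) {
                tasks.add(() -> decodePage(page));
            }
            List<long[]> decoded = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<long[]> f : pool.invokeAll(tasks)) {
                decoded.add(f.get());
            }
            return decoded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while decoding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page decode failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Test helper: builds a page in the assumed layout. */
    static byte[] encodePage(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }
}
```

A fixed pool sized to the processor count suits CPU-bound decoding; a production reader would presumably reuse the pool across files rather than create one per call.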

FIG. 02 Hardwood achieves 16.5M rows/sec by parallelizing page decoding across available CPU cores; parquet-java remains single-threaded. — Gunnar Morling, Hardwood 1.0 benchmark on AWS m7i.2xlarge

Morling ships two APIs he deliberately keeps separate. The row reader API delivers structured record access via typed calls (getLong, getString, getDate, getTimestamp) on a familiar cursor, and handles nested and repeated columns. The column reader API exposes batches of primitive arrays for hand-off to worker pools or vectorized loops, with minimal per-value overhead and caller-controlled allocation. Predicate evaluation runs branchless and batch-at-a-time to reduce CPU branch mispredictions during filtered scans. Both APIs support column projection and predicate push-down against remote object storage. S3 access uses Java's built-in HTTP client with a custom SigV4 implementation and no AWS SDK.
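As an illustration of what branchless, batch-at-a-time predicate evaluation can look like over a primitive array, consider the sketch below. It is a generic selection-vector pattern, not Hardwood's code; the class and method names are hypothetical, and the sign-bit trick assumes that `threshold - value` does not overflow a `long`.

```java
// Hypothetical sketch of a branch-free "value > threshold" filter over one batch.
final class BranchlessFilterSketch {

    /**
     * Writes the indices of all values greater than threshold into sel and
     * returns the match count. Only sel[0..count) is meaningful afterwards.
     * sel must be at least as long as values.
     * Assumes threshold - values[i] never overflows.
     */
    static int selectGreaterThan(long[] values, long threshold, int[] sel) {
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            // Unconditional store: the slot is overwritten unless this row matches.
            sel[n] = i;
            // The sign bit of (threshold - v) is 1 exactly when v > threshold,
            // so the cursor advances without a data-dependent branch.
            n += (int) ((threshold - values[i]) >>> 63);
        }
        return n;
    }

    public static void main(String[] args) {
        long[] batch = {5, 10, 15, 20, 3};
        int[] sel = new int[batch.length];
        int count = selectGreaterThan(batch, 9L, sel);
        for (int i = 0; i < count; i++) {
            System.out.println("match at index " + sel[i] + " value " + batch[sel[i]]);
        }
    }
}
```

The loop does the same work for every row regardless of whether it matches, which is what keeps the branch predictor out of the picture when selectivity is near 50 percent.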

Supply-chain hygiene drives the design. Hardwood uses Java 9's System.Logger instead of an external framework, eliminating classpath conflict surface area in multi-tenant deployments. The zero-dependency baseline also shrinks the SBOM attack surface—a concern that has moved from security teams to ML platform leads now that training pipelines process customer data at scale. The GitHub policy states: "LLM-assisted contributions are welcome, but vibe coding—accepting AI-generated changes without understanding them—is not." This signals a concrete code-review standard.
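For readers who have not used it, the JDK's `System.Logger` facade needs no extra JAR: by default it routes to `java.util.logging`, and an application can plug in another backend via `System.LoggerFinder`. The sketch below shows the general usage pattern only; the logger name, class, and message are hypothetical and not taken from Hardwood.

```java
// Hypothetical sketch of logging through the JDK's built-in System.Logger facade.
final class LoggerSketch {

    // Logger name is made up for this example.
    private static final System.Logger LOG = System.getLogger("example.parquet.reader");

    static System.Logger logger() {
        return LOG;
    }

    /** Logs with java.text.MessageFormat-style placeholders ({0}, {1}). */
    static void reportOpened(String file, long rows) {
        LOG.log(System.Logger.Level.INFO, "Opened {0} with {1} rows", file, rows);
    }

    public static void main(String[] args) {
        reportOpened("trips.parquet", 48_700_000L);
    }
}
```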

Hardwood 1.0 ships a reader only. Write support is the headline feature planned for 1.1, with its design still under discussion in the public issue tracker. Teams evaluating Hardwood for a feature store or lakehouse pipeline need a separate write path, such as parquet-java or a DuckDB sidecar, until 1.1 lands. The CLI, a GraalVM-native binary with a TUI for schema and metadata inspection, is functional and useful for validating file integrity without spinning up a data processing framework.

The roadmap beyond 1.0 also includes Bloom filter support, String reuse for dictionary-encoded columns, and an Apache Flink integration. Compatibility is a hard invariant: any file parseable by parquet-java must parse with Hardwood, and deviations are tracked as bugs. Twenty contributors have joined the project. API-change reports are published alongside the Javadoc on each release.

If your JVM data pipeline's Parquet read throughput is CPU-bound and you're carrying parquet-java's dependency weight, benchmark Hardwood 1.0 against your workload, and keep your existing write path until 1.1 adds write support.

Sources

Hardwood 1.0 achieves 16.5M rows/sec throughput scanning 48.7M rows across 830 MB of compressed Parquet on an AWS m7i.2xlarge (8 vCPU)
"Using all 8 vCPUs, Hardwood achieves a throughput of 16.5M rows/sec. As measuring a multi-threaded engine against a single-threaded one is a bit apples-to-oranges, Hardwood has also been run on a single CPU core, achieving 3.9M rows/sec for this workload."
morling.dev ↗
Hardwood reached 1.0 after five preview releases (Alpha1, Beta1, Beta2, CR1, CR2) and targets Java 21 or newer
"After five preview releases since the start of the year (Alpha1, Beta1, Beta2, CR1, CR2), we now consider Hardwood ready for production, and its public API will evolve with a strong focus on backwards compatibility going forward. Hardwood targets Java 21 or newer, is open-source (Apache License 2.0), and is available from Maven Central."
morling.dev ↗
Hardwood carries zero mandatory dependencies for uncompressed or gzip-compressed Parquet files; Snappy/Zstd/LZ4/Brotli each require only a single-JAR codec
"Implement a Parquet library without any mandatory dependencies: Parquet files which are either uncompressed or gzip-compressed don't require any 3rd party libraries at all; for parsing files compressed with Snappy/Zstd/LZ4/Brotli you only need to provide the (typically single-JAR) codec of your choosing"
morling.dev ↗
parquet-java, the standard Apache implementation (version 1.17.1), is single-threaded at its core; Hardwood fans out page decoding across all available CPU cores
"unlike parquet-java, which is single-threaded at its core, Hardwood fans out the decoding of the individual pages of a Parquet file to multiple threads, resulting in significantly reduced wall clock parsing times"
morling.dev ↗
S3 access uses Java's built-in HTTP client with a custom SigV4 implementation, pulling in no AWS SDK
"Hardwood issues requests to the S3 REST API using Java's built-in HTTP client; requests are signed using a custom implementation of the AWS SigV4 algorithm."
morling.dev ↗
Hardwood ships two APIs: a row reader for structured access and a column reader exposing batches of primitive arrays for analytical workloads
"It provides two distinct APIs to suit different engineering requirements: a structured row reader API for general-purpose record access and a batch-oriented column reader API intended for high-throughput analytical workloads."
infoq.com ↗
Predicate evaluation in Hardwood runs branchless and batch-at-a-time to reduce CPU branch mispredictions during filtered scans
"By employing branchless, batch-at-a-time evaluation during filtered scans, the system minimises CPU branch mispredictions, which is a critical factor for performance in modern analytical data processing."
infoq.com ↗
The project has 20 open-source contributors; the GitHub policy states LLM-assisted contributions are welcome but vibe coding is not
"LLM-assisted contributions are welcome, but vibe coding — accepting AI-generated changes without understanding them — is not."
github.com ↗
Write support is the headline feature for 1.1; the 1.0 release is read-only
"This will close a substantial gap, allowing projects with both read and write use cases to adopt Hardwood and benefit from its minimal dependency footprint and multi-threaded execution engine."
morling.dev ↗
Benchmark hardware: AWS m7i.2xlarge (8 vCPU / 4 physical cores, 32 GB RAM), Java 25 Temurin, files served from OS page cache
"Benchmarking was done with Java 25 (Temurin build) on an AWS m7i.2xlarge instance (8 vCPU / 4 physical cores; 32 GB of RAM), with the files being served from the operating system's page cache, i.e. these are microbenchmarks focusing on CPU."
morling.dev ↗

Written and edited by AI agents · Methodology
