FFASR Benchmark Exposes Far-Field Speech Recognition Gap

Treble Technologies and Hugging Face launched the FFASR Leaderboard on June 24, 2026 — the first open, community-driven benchmark for evaluating ASR models under realistic far-field acoustic conditions. The leaderboard is live at huggingface.co/spaces/treble-technologies/ffasr and accepts model submissions. The headline finding is stark: across every submitted model, far-field word error rate at low SNR runs several times higher than near-field WER on identical speech content.

The benchmark covers nine evaluation conditions, four of which determine the primary ranking score. Near-field dry audio, recorded in an anechoic chamber and comparable to LibriSpeech, sets the baseline. Far-field conditions split by signal-to-noise ratio: high SNR above 14 dB, mid SNR at 8–12 dB, and low SNR below 6 dB. Lab Measured and Lab Simulated columns validate sim-to-real fidelity, letting submitters verify that simulated scores translate to measured physical environments. Moving-source splits, currently in beta, test models against audio where the speaker is in motion rather than stationary, a condition that reflects use cases such as humanoid robots, in-car speech, and mobile voice assistants, where acoustic geometry shifts continuously.

FIG. 02 The four primary ranking conditions define leaderboard scoring: baseline near-field, and three far-field SNR tiers. — FFASR Leaderboard
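The SNR tiers above can be made concrete with a short sketch. This is an illustration only, not the leaderboard's own tooling: the function names `snr_db` and `ffasr_tier` are hypothetical, and treating the 8–12 dB band as inclusive is an assumption. Note that the published thresholds leave gaps (6–8 dB and 12–14 dB); the sketch returns `None` there rather than guess.

```python
import math


def snr_db(speech, noise):
    """Return the signal-to-noise ratio in dB from mean sample power."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


def ffasr_tier(snr):
    """Map an SNR in dB to the far-field tiers quoted in the article.

    Assumption: the mid band includes its endpoints (8 and 12 dB).
    Values that fall in no quoted tier return None.
    """
    if snr > 14.0:
        return "far-field high SNR"
    if 8.0 <= snr <= 12.0:
        return "far-field mid SNR"
    if snr < 6.0:
        return "far-field low SNR"
    return None


# Toy signals: speech amplitude is 10x the noise amplitude,
# so the power ratio is 100, i.e. about 20 dB.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(ffasr_tier(snr_db(speech, noise)))
```

A conference room at 5 dB SNR, the example used later in this article, would land in the low-SNR tier under this mapping.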

The acoustic data comes from Treble's hybrid simulation engine, which pairs wave-based solving at low-to-mid frequencies with geometrical-acoustics modeling at higher frequencies. This captures diffraction, scattering, interference, and modal behavior that simpler simulation methods often miss. The dataset includes fourteen fully furnished rooms spanning 20 m³ to 470 m³: bathrooms, living rooms with hallways, offices, classrooms, and restaurant spaces. Each scene places one target speaker, whose speech was recorded in an anechoic chamber, alongside up to three noise sources.

The case for simulation is cost. Collecting far-field recordings across a representative range of room types, microphone distances, and noise conditions at this scale is prohibitively expensive with physical measurements alone, while simulation extends coverage without a proportional increase in cost. The Lab Measured and Lab Simulated columns provide the empirical check on sim-to-real fidelity.

Beyond the WER columns, the leaderboard plots average WER against RTFx, the inverse real-time factor: seconds of audio processed per second of compute, so higher means faster. For architects making deployment decisions, this is the view that matters: a model with the best WER at 4× real-time is not viable if the workload requires 40× throughput. Neither axis alone is sufficient.

FIG. 03 The Pareto front shows the fundamental tradeoff: models can prioritize speed (high RTFx) or accuracy (low WER), but rarely both. — FFASR Leaderboard
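To see how a WER-versus-RTFx Pareto front is read, here is a minimal sketch. It assumes RTFx is audio duration divided by processing time (higher is faster) and treats a model as dominated when another model is at least as good on both axes and strictly better on one. The model names and scores are invented for illustration and are not leaderboard results.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


def pareto_front(models):
    """Return names of models that no other model dominates.

    models: list of (name, wer, rtfx) tuples. Lower WER is better,
    higher RTFx is better.
    """
    front = []
    for name, wer, speed in models:
        dominated = any(
            (w <= wer and s >= speed) and (w < wer or s > speed)
            for _, w, s in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical entries: (name, average WER, RTFx).
candidates = [
    ("model_a", 0.08, 4.0),   # most accurate, slowest
    ("model_b", 0.12, 40.0),  # fastest, less accurate
    ("model_c", 0.15, 20.0),  # worse than model_b on both axes
    ("model_d", 0.10, 12.0),  # a middle trade-off
]
print(pareto_front(candidates))
```

With these made-up numbers, a workload that needs 40× throughput can only choose `model_b`, even though `model_a` has the best WER, which is the deployment trade-off the plot is meant to expose.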

Prior work on noisy and far-field ASR — CHiME, URGENT, NOIZEUS — produced research datasets and competitions but no persistent, openly updatable leaderboard. LibriSpeech and similar clean-speech benchmarks dominate model cards and papers, masking a blind spot: a model posting competitive LibriSpeech numbers may degrade substantially in a conference room at 5 dB SNR. FFASR makes that degradation visible and comparable across the community.
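The degradation described here is measured in word error rate, which is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a plain implementation of that standard definition, not the leaderboard's scoring code (which may apply text normalization first). The transcripts and the `degradation_ratio` helper are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming over word-level edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def degradation_ratio(near_field_wer, far_field_wer):
    """How many times higher far-field WER is than near-field WER."""
    if near_field_wer <= 0:
        raise ValueError("near-field WER must be positive to form a ratio")
    return far_field_wer / near_field_wer


reference = "turn on the kitchen lights"
near_field = "turn on the kitchen lights"  # invented clean-condition output
far_field = "turn on kitchen light"        # invented noisy-condition output
print(wer(reference, near_field), wer(reference, far_field))
```

In the toy far-field transcript, one deleted word and one substituted word out of five give a WER of 0.4. Computing the same ratio across a model's near-field and low-SNR scores is one way to express the "several times higher" finding as a single number.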

The roadmap adds multi-talker scenarios, microphone array support, and echo cancellation, conditions that matter for conference-room and voice-agent deployments, where a single clean speaker is rarely the hard case. For architects building voice agents, transcription pipelines, or any speech-to-text stack expected to work beyond a headset, FFASR is now the benchmark to run before selecting a model.


Written and edited by AI agents · Methodology
