Treble Technologies and Hugging Face launched the FFASR Leaderboard on June 24, 2026 — the first open, community-driven benchmark for evaluating ASR models under realistic far-field acoustic conditions. The leaderboard is live at huggingface.co/spaces/treble-technologies/ffasr and accepts model submissions. The headline finding is stark: across every submitted model, far-field word error rate at low SNR runs several times higher than near-field WER on identical speech content.
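To make the headline comparison concrete, here is a minimal sketch of how WER and a near-field versus far-field degradation factor can be computed. It is not the leaderboard's scoring code: the text normalization (lowercasing and whitespace splitting), the function names, and the example transcripts are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def degradation_factor(near_field_wer: float, far_field_wer: float) -> float:
    """How many times higher the far-field WER is than the near-field WER."""
    if near_field_wer <= 0:
        raise ValueError("near-field WER must be positive to form a ratio")
    return far_field_wer / near_field_wer


# Hypothetical transcripts of the same utterance under two conditions.
reference = "turn on the kitchen lights"
near_wer = word_error_rate(reference, "turn on the kitchen light")
far_wer = word_error_rate(reference, "turn the chicken lights")
print(near_wer, far_wer, degradation_factor(near_wer, far_wer))
```

Because both hypotheses are scored against the same reference, the ratio isolates the effect of the acoustic condition, which is the comparison the leaderboard reports.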
The benchmark covers nine evaluation conditions, four of which determine the primary ranking score. Near-field dry audio, recorded in an anechoic chamber and comparable to LibriSpeech, sets the baseline. Far-field conditions split by signal-to-noise ratio: high SNR above 14 dB, mid SNR at 8–12 dB, and low SNR below 6 dB. Lab Measured and Lab Simulated columns validate sim-to-real fidelity, letting submitters verify that simulated scores translate to measured physical environments. Moving-source splits, currently in beta, test models against audio where the speaker is in motion. That case matters for humanoid robots, in-car speech, and mobile voice assistants, where acoustic geometry shifts continuously.
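The far-field SNR splits can be expressed as a small lookup. This is a sketch under stated assumptions: the thresholds are the ones quoted above, but the function name is hypothetical, the 8 and 12 dB endpoints are assumed to be inclusive, and the article does not say how values between the stated ranges (6 to 8 dB and 12 to 14 dB) are handled, so the sketch returns `None` for them.

```python
from typing import Optional


def snr_bucket(snr_db: float) -> Optional[str]:
    """Map an SNR in dB to the far-field split it would fall into."""
    if snr_db > 14.0:
        return "high"
    if 8.0 <= snr_db <= 12.0:  # assumed inclusive endpoints
        return "mid"
    if snr_db < 6.0:
        return "low"
    # Not covered by the ranges quoted in the article.
    return None


for value in (20.0, 10.0, 5.0, 7.0):
    print(value, snr_bucket(value))
```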
The acoustic data comes from Treble's hybrid simulation engine, which pairs wave-based solving at low-to-mid frequencies with geometrical-acoustics modeling at higher frequencies. This captures diffraction, scattering, interference, and modal behavior that simpler image-source or ray-tracing-only methods miss. The dataset includes fourteen fully furnished rooms spanning 20 m³ to 470 m³: bathrooms, offices, classrooms, living rooms, hallways, and restaurant spaces. Each scene positions one target speaker recorded in an anechoic chamber alongside up to three noise sources.
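To illustrate what a target SNR means when a speech signal is combined with noise, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a requested value. This is not Treble's pipeline: the benchmark places noise sources inside simulated rooms, which this example does not model. The function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average of the squared samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def scale_noise_for_snr(
    speech: Sequence[float], noise: Sequence[float], target_snr_db: float
) -> List[float]:
    """Return the noise rescaled so that snr_db(speech, result) == target."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = target for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return [gain * n for n in noise]


def mix(speech: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Sample-wise sum of two equal-length signals."""
    if len(speech) != len(noise):
        raise ValueError("signals must have the same length")
    return [s + n for s, n in zip(speech, noise)]


# Toy signals standing in for real audio.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.2, -0.1, -0.2]
scaled = scale_noise_for_snr(speech, noise, target_snr_db=5.0)
mixture = mix(speech, scaled)
```

With several noise sources, the same idea would apply to their summed power at the microphone, though the article does not describe how the benchmark balances individual sources.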
Simulation is used because physical recording does not scale. Collecting real far-field data across a representative range of room types, microphone distances, and SNR conditions at this scale is prohibitively expensive, while simulation extends coverage without a proportional increase in cost. The Lab Measured and Lab Simulated columns provide the empirical check on sim-to-real fidelity.
Beyond the WER columns, the leaderboard plots average WER against RTFx, the inverse real-time factor: seconds of audio processed per second of compute, so a higher value means a faster model. For architects making deployment decisions, this trade-off is what matters: the model with the best WER at 4× real time is not an option if your workload requires 40× throughput. Neither axis alone is sufficient.
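The accuracy versus speed trade-off can be sketched as a filter over candidate models. The example assumes RTFx is audio duration divided by processing time, so that higher is faster. The model names and scores are made up for illustration and are not leaderboard results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    avg_wer: float  # lower is better
    rtfx: float     # higher is faster


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep models that no other model beats on both WER and speed."""
    front = []
    for c in candidates:
        dominated = any(
            o.avg_wer <= c.avg_wer
            and o.rtfx >= c.rtfx
            and (o.avg_wer < c.avg_wer or o.rtfx > c.rtfx)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.avg_wer)


def best_under_throughput(
    candidates: List[Candidate], min_rtfx: float
) -> Optional[Candidate]:
    """Lowest-WER model that still meets the required throughput."""
    eligible = [c for c in candidates if c.rtfx >= min_rtfx]
    return min(eligible, key=lambda c: c.avg_wer) if eligible else None


# Hypothetical candidates.
models = [
    Candidate("model_a", avg_wer=0.10, rtfx=4.0),
    Candidate("model_b", avg_wer=0.14, rtfx=45.0),
    Candidate("model_c", avg_wer=0.18, rtfx=30.0),
]
print([m.name for m in pareto_front(models)])
print(best_under_throughput(models, min_rtfx=40.0))
```

In this toy data the most accurate model fails a 40× throughput requirement, so the selection falls to the next model on the front, which is the situation the paragraph above describes.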
Prior work on noisy and far-field ASR — CHiME, URGENT, NOIZEUS — produced research datasets and competitions but no persistent, openly updatable leaderboard. LibriSpeech and similar clean-speech benchmarks dominate model cards and papers, masking a blind spot: a model posting competitive LibriSpeech numbers may degrade substantially in a conference room at 5 dB SNR. FFASR makes that degradation visible and comparable across the community.
The roadmap adds multi-talker scenarios, microphone array support, and echo cancellation — conditions that matter for conference-room and voice-agent deployments where a single speaker in an anechoic chamber is not the problem. For architects building voice agents, transcription pipelines, or any speech-to-text stack expected to work beyond a headset, FFASR is now the benchmark to run before selecting a model.
Written and edited by AI agents