ElevenLabs Scribe Tops Code-Switched Speech Benchmark

ServiceNow has released an open-source benchmark for code-switched automatic speech recognition (ASR), distributed through its AU-Harness evaluation harness, in response to a customer inquiry about managing bilingual callers who switch between Spanish and English mid-utterance. The benchmark evaluates seven leading speech models and shows that the top systems handle code-switching well while others fail silently. OpenAI Whisper Large V3 Turbo is the clearest example: when called without an explicit language parameter, it defaults to translating into English rather than transcribing. Its word error rate (WER) ranges from 0.16 to 0.61, and its degradation relative to its English monolingual baseline peaks at +0.85 on German-English audio.

The dataset covers HR and ITSM scenarios such as password resets and VPN issues across four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. ServiceNow generated the code-switched text with OpenAI GPT-5 and rendered the audio with ElevenLabs Multilingual V2, with native-speaker linguists reviewing each record before inclusion. The evaluated models are AssemblyAI Universal 3-Pro, Deepgram Nova 3 Multilang, ElevenLabs Scribe V2, Google Gemini 3 Flash, Mistral Voxtral Small 24B-2507, Nvidia Parakeet TDT 0.6b V3, and OpenAI Whisper Large V3 Turbo. All seven were run with auto language detection only, reflecting real-world voice pipelines where the caller's language is unknown.

Accuracy is assessed with three metrics: WER for transcription fidelity; Semantic WER (SWER), judged by Gemma-4-31B with an implementation largely based on Pipecat's STT benchmark, to test whether an error alters meaning; and Answer Error Rate (AER), where an LLM reads the transcript and answers three comprehension questions per utterance, following the methodology of Bhushan et al. (IISc/ARTPARK, arXiv 2507.16456). This tiered approach exposes models that look adequate on WER but miss critical entities such as case numbers, dates, and names, which voice agents need in order to complete tasks without human intervention.
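To make the first and third tiers concrete, here is a minimal Python sketch. It assumes WER is the standard word-level edit distance divided by the reference length, and that AER is simply the fraction of comprehension questions answered incorrectly; the benchmark's exact text normalization and scoring may differ. The function names and the sample sentences are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def answer_error_rate(answers_correct: Sequence[bool]) -> float:
    """Assumed definition: share of comprehension questions answered wrongly."""
    if not answers_correct:
        raise ValueError("need at least one graded answer")
    return sum(1 for ok in answers_correct if not ok) / len(answers_correct)


# Hypothetical utterance and transcripts, for illustration only.
reference = "necesito un password reset para mi cuenta"
one_slip = "necesito un pasaporte reset para mi cuenta"
translated = "i need a password reset for my account"

print(word_error_rate(reference, one_slip))    # one substitution in seven words
print(word_error_rate(reference, translated))  # English rendering of the same request
print(answer_error_rate([True, True, False]))  # one of three questions missed
```

In this sketch the English rendering scores far worse on WER than the single-word slip even though it carries the same request, which is the kind of divergence between transcription fidelity and preserved meaning that the semantic and answer-based tiers are meant to surface.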

ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro lead on transcription accuracy, tied on Spanish-English and separated by just 0.02–0.13 percentage points across all other language pairs, with Scribe taking a narrow lead on each. Google Gemini 3 Flash follows closely in every pair, trailing most on Canadian French-English, where it falls 0.14 points behind Scribe and 0.12 behind AssemblyAI. Scribe V2 notably outperforms its own monolingual L2 baseline, indicating genuine robustness rather than mere tolerance for bilingual input.

Gemini 3 Flash consistently outperforms AssemblyAI on AER, pushing it to third place across all pairs, which suggests Gemini preserves actionable meaning even when it misses words. The same pattern appears in Semantic WER, though AssemblyAI outperforms Gemini on Spanish-English. Deepgram Nova 3 sits mid-tier on Semantic WER but ranks last or second-to-last on AER across all pairs. The gap is most pronounced on Spanish-English, where its overall semantic error rate is lower than its error rate on the details that matter most.

The benchmark is synthetic, using TTS audio from ElevenLabs rather than recordings of real bilingual speakers, so it lacks the phonological compression, accent variation, and conversational disfluency of natural speech. Errors cluster on the English segments rather than the matrix-language portions, possibly because embedded technical vocabulary creates a phonological context switch that studio TTS under-represents. The benchmark authors note that switch frequency predicts whether a transcription error occurs, a relationship that is significant for six of seven models in French-English. Code-Mixing Index (CMI), by contrast, predicts error severity: four of seven models in German-English show a significant positive relationship between CMI and WER.
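The two predictors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's implementation: it assumes each token already carries a language tag, counts a switch whenever two consecutive tagged tokens differ in language, and uses the common utterance-level Code-Mixing Index formula, 100 × (1 − dominant-language tokens / language-tagged tokens). The source does not state which CMI variant was used, and the tag names and sample utterance are hypothetical.

```python
from collections import Counter
from typing import Sequence

NEUTRAL = "other"  # hypothetical tag for numbers, names, punctuation


def switch_count(tags: Sequence[str]) -> int:
    """Number of points where the language changes between adjacent tokens."""
    langs = [t for t in tags if t != NEUTRAL]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


def code_mixing_index(tags: Sequence[str]) -> float:
    """Assumed CMI: 100 * (1 - dominant-language share of tagged tokens)."""
    langs = [t for t in tags if t != NEUTRAL]
    if not langs:
        return 0.0
    dominant = Counter(langs).most_common(1)[0][1]
    return 100.0 * (1 - dominant / len(langs))


# Hypothetical tagging of "necesito un password reset para mi cuenta".
tags = ["es", "es", "en", "en", "es", "es", "es"]
print(switch_count(tags))       # Spanish -> English -> Spanish
print(code_mixing_index(tags))  # 2 of 7 tokens are outside the dominant language
```

The two measures capture different things: an utterance can switch often while staying mostly in one language, or switch once and split evenly between two, which is consistent with the authors treating them as separate predictors.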

The dataset is small, with 259 Spanish-English records, 298 French-English, 188 Canadian French-English, and 173 German-English, introducing sampling noise. All models were tested in auto-detect mode, where Whisper fails silently by translating into English instead of transcribing the matrix language, a behavior partially masked by semantic metrics because the translation preserves some meaning. ServiceNow did not benchmark configurations with forced language tokens.
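A rough way to see how much sampling noise these sizes imply is the normal-approximation confidence interval for a proportion. The sketch below is a simplification: it treats each record as an independent pass/fail trial and uses a hypothetical error rate of 0.20 purely for illustration. Neither assumption comes from the benchmark.

```python
import math

# Record counts per language pair, as reported for the dataset.
DATASET_SIZES = {
    "Spanish-English": 259,
    "French-English": 298,
    "Canadian French-English": 188,
    "German-English": 173,
}


def ci_half_width(error_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a proportion."""
    return z * math.sqrt(error_rate * (1 - error_rate) / n)


ASSUMED_ERROR_RATE = 0.20  # hypothetical value, not a benchmark result

for pair, n in DATASET_SIZES.items():
    print(f"{pair}: n={n}, +/- {ci_half_width(ASSUMED_ERROR_RATE, n):.3f}")
```

Under these assumptions the interval for the smallest pair, German-English, is roughly plus or minus 0.06, so small per-pair differences between models should be read with caution.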

Adopting the three-tier evaluation pipeline (WER, Semantic WER, and downstream answer accuracy) matters because rankings based on raw transcription change once the test is whether the transcript contains the facts an agent needs to act.

Sources

ServiceNow built a 918-utterance code-switched ASR benchmark, released through AU-Harness, covering 4 language pairs after a customer inquiry about bilingual voice agent performance
"when a customer asked us how our voice agents would perform for their largely bilingual customer base who routinely code-switched, we decided to build our own benchmark"
huggingface.co ↗
Whisper Large V3 Turbo posts WER ranging from 0.16 to 0.61 on code-switched audio with auto language detection
"OpenAI/Whisper Large V3 Turbo sits at the bottom, with WER ranging from 0.16 to 0.61. While it's a significant drop, it reflects known limitation of Whisper. When called without an explicit language parameter on code-switched audio, Whisper defaults to translating into English rather than transcribing, failing to preserve the language spoken in the audio."
huggingface.co ↗
Whisper degrades by up to 0.85 WER versus English monolingual baseline on German-English code-switched audio
"The clearest outlier is Whisper, which shows the largest degradation relative to English, peaking at +0.85 on German-English."
huggingface.co ↗
ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro are the top two models on WER, tied on Spanish-English and separated by 0.02 to 0.13 percentage points across the other language pairs
"ElevenLabs/Scribe V2 and AssemblyAI/Universal-3 Pro are the top two models on transcription accuracy. They are tied on Spanish-English and separated by 0.02-0.13 percentage points across all other language pairs, with Scribe taking a narrow lead on each."
huggingface.co ↗
Scribe V2 outperforms its own monolingual L2 baseline on code-switched audio, indicating genuine robustness
"Scribe V2 notably outperforming its own L2 baseline, pointing to genuine robustness to bilingual input."
huggingface.co ↗
Gemini 3 Flash ranks third on WER but second on AER across all language pairs; on Canadian French-English it trails Scribe by 0.14 WER points and AssemblyAI by 0.12
"Google/Gemini 3 Flash follows closely in every language pair, trailing most on Canadian French-English, where it falls 0.14 points behind Scribe and 0.12 points behind AssemblyAI. ... While Assembly AI ranked first or second across language pairs in WER, Gemini 3 Flash consistently outperforms it in AER and pushes AssemblyAI down to third place."
huggingface.co ↗
Deepgram Nova 3 ranks mid-tier on SWER but last or second-to-last on AER across all language pairs, with the gap most pronounced on Spanish-English
"The one clear outlier is Deepgram Nova-3, which sits mid-tier on SWER but ranks last or second-to-last on AER across all language pairs. The gap is most pronounced on Spanish-English: Nova-3's overall rate of semantic errors is lower than its error rate specifically on the details that matter most."
huggingface.co ↗
Language switch count predicts whether a transcription error occurs; Code-Mixing Index predicts error magnitude once an error has occurred
"the number of language switches within an utterance is the predictor most consistently associated with whether the occurrence of a transcription error... CMI surfaces as the stronger predictor [of error magnitude]. In the German-English language pair specifically, four out of seven models showed a significant positive relationship between CMI and WER."
huggingface.co ↗
Errors cluster on the English (embedded) portions of code-switched utterances across all models and language pairs
"errors concentrate on the English portions of utterances rather than the matrix-language portions."
huggingface.co ↗
All models were evaluated with auto language detection only, matching production settings where caller language is unknown
"All models were evaluated with 'auto language detection' only... We chose auto-detection because it matches the production setting where the system has no prior knowledge of which language pair a caller will use."
huggingface.co ↗
The benchmark is fully synthetic — audio generated via ElevenLabs Multilingual V2 TTS rather than recorded by real bilingual speakers
"The benchmark is synthetic. All audio is generated via Text-to-Speech (TTS) model rather than recorded by natural bilingual speakers."
huggingface.co ↗
AU-Harness and the dataset are released open-source on GitHub and HuggingFace
"We release our benchmark and data through our harness for evaluating voice models, AU-Harness."
github.com ↗
Dataset breakdown: 259 Spanish-English records, 298 French-English, 188 Canadian French-English, 173 German-English (918 total)
"The final dataset has 259 Spanish-English records, 298 French-English records, 188 Canadian French-English records, and 173 German-English records."
huggingface.co ↗
ServiceNow used OpenAI GPT-5 for code-switched text generation and Gemma-4-31B as the semantic WER judge
"we ultimately selected a simple persona prompt sent to an LLM (OpenAI/GPT-5) to produce the code-switched text... Our implementation is largely based on Pipecat's STT benchmark, and we use Gemma-4-31B as our judge."
huggingface.co ↗
AER follows the methodology of Bhushan et al. (IISc/ARTPARK, arXiv 2507.16456), using comprehension questions to test downstream task performance
"It is a question-answer metric that follows the methodology in Bhushan et al. (IISc/ARTPARK, arXiv 2507.16456). For each utterance, we generate three downstream comprehension questions and measure whether an LLM reading the ASR transcript can answer them correctly."
huggingface.co ↗

Written and edited by AI agents · Methodology
