ServiceNow has introduced AU-Harness, an open-source benchmark for code-switched automatic speech recognition (ASR), in response to a customer inquiry about managing bilingual callers who switch between Spanish and English mid-utterance. The benchmark evaluates seven leading speech models, revealing that top systems can handle code-switching while others fail silently. For example, OpenAI Whisper Large V3 Turbo, when run in auto-detect mode without an explicit language parameter, defaults to translating into English rather than transcribing, resulting in a word error rate (WER) ranging from 0.16 to 0.61 and a degradation of up to 0.85 points compared to its English monolingual baseline on German-English audio.

The dataset covers HR and ITSM scenarios such as password resets and VPN issues, across four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. ServiceNow synthesized utterances using OpenAI GPT-5 for code-switched text generation and ElevenLabs Multilingual V2 for audio rendering, with native-speaker linguists reviewing each record before inclusion. The evaluated models are AssemblyAI Universal 3-Pro, Deepgram Nova 3 Multilang, ElevenLabs Scribe V2, Google Gemini 3 Flash, Mistral Voxtral Small 24B-2507, Nvidia Parakeet TDT 0.6b V3, and OpenAI Whisper Large V3 Turbo. All were run with automatic language detection only, reflecting real-world voice pipelines where the caller's language is unknown.

Accuracy is assessed through three metrics: WER for transcription fidelity; Semantic WER, scored by Gemma-4-31B following Pipecat's STT benchmark methodology, to test whether an error alters meaning; and Answer Error Rate (AER), where an LLM reads the transcript and answers comprehension questions per utterance following Bhushan et al.'s IISc/ARTPARK protocol. This tiered approach identifies models that appear adequate on WER but fail to capture critical entities like case numbers, dates, and names that voice agents need to perform tasks without human intervention.
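For readers unfamiliar with the first of these metrics, a minimal sketch of word-level WER follows. It assumes the simplest possible normalization (lowercasing and whitespace tokenization), which may differ from the benchmark's own text normalization, and the example utterance is hypothetical rather than taken from the dataset. It also shows why a model that translates instead of transcribing scores so poorly on WER even when the meaning survives.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical Spanish-English utterance, for illustration only.
reference = "necesito un password reset para mi cuenta"
transcribed = "necesito un password reset para mi cuenta"
translated = "i need a password reset for my account"

wer_transcribed = word_error_rate(reference, transcribed)
wer_translated = word_error_rate(reference, translated)
print(f"transcribed: {wer_transcribed:.2f}, translated: {wer_translated:.2f}")
```

In this toy case the translated output keeps the request intelligible but matches only two of the seven reference words, which is the kind of gap between WER and meaning-based metrics the tiered design is meant to expose.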

ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro lead on transcription accuracy, tied on Spanish-English and separated by just 0.02–0.13 percentage points across all other language pairs, with Scribe taking a narrow lead on each. Google Gemini 3 Flash follows closely in every pair, trailing most on Canadian French-English, where it falls 0.14 points behind Scribe and 0.12 behind AssemblyAI. Scribe V2 notably outperforms its own monolingual L2 baseline, indicating genuine robustness rather than mere tolerance for bilingual input.

Gemini 3 Flash consistently outperforms AssemblyAI in AER and pushes it to third place across all pairs, preserving actionable meaning despite missed words. The same pattern appears in Semantic WER, though AssemblyAI outperforms Gemini on Spanish-English. Deepgram Nova 3 sits mid-tier on Semantic WER but ranks last or second-to-last on AER across all pairs; on Spanish-English its overall semantic error rate is lower than its error rate specifically on the details that matter most.

The benchmark is synthetic, using TTS audio from ElevenLabs rather than real bilingual speakers, so it lacks phonological compression, accent, and conversational disfluency. Errors cluster on English segments rather than the matrix-language portions, possibly because embedded technical vocabulary creates a phonological context switch that studio TTS under-represents. The benchmark authors note that switch frequency predicts whether a transcription error occurs, a relationship that is significant for six of seven models in French-English. Code-Mixing Index (CMI), by contrast, predicts error severity: four of seven models in German-English show a significant positive relationship between CMI and WER.
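The two predictors mentioned above can be illustrated with a short sketch. The article does not say exactly how the benchmark computes them, so this assumes one commonly cited utterance-level formulation of Code-Mixing Index, 100 * (1 - max_lang_tokens / (n - u)), where n is the token count and u the number of language-neutral tokens, and it counts a switch whenever two adjacent language-tagged tokens differ. The tag names and the example tagging are hypothetical.

```python
from collections import Counter


def code_mixing_index(lang_tags: list[str], neutral: str = "other") -> float:
    """Utterance-level CMI under the assumed formulation described above."""
    # Drop language-neutral tokens (numbers, names, punctuation).
    tagged = [tag for tag in lang_tags if tag != neutral]
    if not tagged:
        return 0.0
    dominant = max(Counter(tagged).values())
    return 100.0 * (1.0 - dominant / len(tagged))


def switch_count(lang_tags: list[str], neutral: str = "other") -> int:
    """Number of language changes between adjacent tagged tokens."""
    tagged = [tag for tag in lang_tags if tag != neutral]
    return sum(1 for prev, curr in zip(tagged, tagged[1:]) if prev != curr)


# Hypothetical per-token tags for "necesito un password reset para mi cuenta".
tags = ["es", "es", "en", "en", "es", "es", "es"]

cmi = code_mixing_index(tags)
switches = switch_count(tags)
print(f"CMI = {cmi:.1f}, switches = {switches}")
```

The two measures capture different things: an utterance with one long English insertion can have a high CMI but only two switches, while rapid alternation raises the switch count without necessarily changing the language proportions.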

The dataset is small, with 259 Spanish-English records, 298 French-English, 188 Canadian French-English, and 173 German-English, introducing sampling noise. All models were tested in auto-detect mode, where Whisper fails silently by translating into English instead of transcribing the matrix language, a behavior partially masked by semantic metrics because the translation preserves some meaning. ServiceNow did not benchmark configurations with forced language tokens.
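To give a rough sense of how much noise these sample sizes imply, the sketch below computes an approximate 95% margin of error for a per-record error proportion at each dataset size. This is a back-of-the-envelope illustration, not the benchmark's own analysis: it assumes each record is an independent pass/fail trial (which WER, a per-word ratio, is not), uses the normal approximation, and takes a hypothetical observed error rate of 10%.

```python
import math

# Record counts per language pair, as reported above.
RECORDS = {
    "Spanish-English": 259,
    "French-English": 298,
    "Canadian French-English": 188,
    "German-English": 173,
}


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width for a proportion p observed on n records."""
    return z * math.sqrt(p * (1.0 - p) / n)


ASSUMED_ERROR_RATE = 0.10  # hypothetical value, for illustration only

margins = {pair: margin_of_error(ASSUMED_ERROR_RATE, n) for pair, n in RECORDS.items()}
for pair, half_width in margins.items():
    print(f"{pair}: +/- {half_width:.3f}")
```

Under these assumptions the margin is several hundredths in every pair, which is a useful yardstick when reading differences between closely ranked models.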

The practical takeaway is to adopt the three-tier eval pipeline (WER, semantic WER, and downstream answer accuracy), because rankings based on raw transcription change once the test becomes whether the transcript contains the facts an agent needs to act.

Written and edited by AI agents