Researchers at Together AI and Stanford tested four production real-time voice systems on scenarios where what a caller says and how they say it point to opposite conclusions. In 119 of 120 runs, the models followed the words and ignored the voice. The paper, "Real-Time Voice AI Hears but Does Not Listen," was published June 24 by Martijn Bartelds, Federico Bianchi, and James Zou, and it has direct implications for teams shipping voice agents into healthcare, fintech, and customer care.

All four systems tested are already in production: OpenAI's `gpt-realtime-2`, Google's `gemini-3.1-flash-live-preview`, and Alibaba's `qwen3.5-omni-plus-realtime` and `qwen3.5-omni-flash-realtime`. Each takes audio input and produces speech output directly, in real time, with no transcription middleware in between. Stimuli were synthesized with ElevenLabs. Every scenario ran five times; single-turn diagnostics ran twenty times.

The three scenarios were designed to separate word content from vocal tone. In the first—an emergency dispatcher welfare callback—a caller cries while insisting nothing is wrong. All four systems ended the call instead of staying on the line or escalating. In the second—a wire-fraud check—a transfer is verbally authorized in a frightened voice. All four approved the transfer as if the authorization were calm. In the third—volunteer recruitment—a caller agrees in a mocking, sarcastic tone. All four signed them up. Across three scenarios and five runs each, the systems picked the wrong action in 119 of 120 cases.

The operationally significant finding is about perception. When researchers asked each system whether a speaker sounded distressed, frightened, or sarcastic—without any downstream decision task—three of four models answered correctly far more often for the marked delivery than for the same words spoken calmly. GPT Realtime 2, Gemini 3.1 Flash Live, and Qwen3.5 Omni Plus detect the distress. They simply don't act on it when a decision is required. Qwen3.5 Omni Flash is the outlier: it misclassifies the delivery even in direct query, yet its downstream behavior is identical to the others.

FIG. 02 Four production voice systems can perceive vocal emotion (terra-cotta) but fail to act on it in decision tasks (olive), revealing a critical gap between what models detect and what they prioritize. — Together AI & Stanford, 2024
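One way to make the gap between perception and action concrete is to score both kinds of probe per system and subtract. The following is a minimal sketch, not the authors' analysis code: the `Trial` record, the `gap_by_system` helper, and the toy records are all hypothetical and invented for illustration, and none of the values come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    system: str           # model identifier (hypothetical names below)
    probe: str            # "perception" (direct question) or "decision" (scenario task)
    followed_voice: bool  # True if the response matched the vocal delivery


def gap_by_system(trials):
    """Return {system: perception_rate - decision_rate}."""
    # Per system and probe type: [responses matching the voice, total responses].
    counts = defaultdict(lambda: {"perception": [0, 0], "decision": [0, 0]})
    for t in trials:
        cell = counts[t.system][t.probe]
        cell[0] += int(t.followed_voice)
        cell[1] += 1

    gaps = {}
    for system, by_probe in counts.items():
        rates = {
            probe: (ok / total if total else 0.0)
            for probe, (ok, total) in by_probe.items()
        }
        gaps[system] = rates["perception"] - rates["decision"]
    return gaps


# Toy records, made up for illustration only; these are not the paper's data.
toy = [
    Trial("model-a", "perception", True),
    Trial("model-a", "perception", True),
    Trial("model-a", "decision", False),
    Trial("model-a", "decision", False),
    Trial("model-b", "perception", False),
    Trial("model-b", "perception", True),
    Trial("model-b", "decision", False),
    Trial("model-b", "decision", False),
]

print(gap_by_system(toy))  # {'model-a': 1.0, 'model-b': 0.5}
```

A large positive gap means a system reports the vocal cue when asked directly but does not let it change the action it takes. A system with both low perception and low decision rates shows a small gap for a different reason, so the two rates are worth reporting alongside the difference.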

The authors tested accent and age estimation under the same word-versus-voice conflict. Five voices with distinct English accents—Indian, Australian, Nigerian, French, Mandarin—read passages about countries other than their own. Asked about the speaker's accent, three of four systems named the country in the script rather than the one in the voice. On age, older adult voices reading a child's script were estimated by most systems as children. Qwen3.5 Omni Plus was the partial exception on accent; Gemini 3.1 Flash Live performed best on age. The pattern is consistent: the systems are reading transcripts, not voices.
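The word-versus-voice scoring described above can be sketched as a three-way label on each answer: did it name the attribute carried by the voice, the one carried by the script, or neither? This is an illustrative sketch only. The `score_answer` function and its substring matching are assumptions made here, not the paper's grading procedure, and a real evaluation would need a more careful matcher.

```python
def score_answer(answer, voice_label, script_label):
    """Classify an answer as following the 'voice', the 'script', or 'other'.

    Uses crude case-insensitive substring matching, which is an assumption
    of this sketch and will miss paraphrases.
    """
    text = answer.lower()
    names_voice = voice_label.lower() in text
    names_script = script_label.lower() in text
    if names_voice and not names_script:
        return "voice"
    if names_script and not names_voice:
        return "script"
    return "other"  # names both, or names neither


# Hypothetical answers for an Australian-accented voice reading about France.
print(score_answer("The speaker sounds Australian.", "Australian", "French"))  # voice
print(score_answer("Probably a French accent.", "Australian", "French"))       # script
print(score_answer("Hard to tell from this clip.", "Australian", "French"))    # other
```

Counting the share of "script" labels per system gives a direct measure of how often a model answers from the transcript rather than from the audio.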

For architects, the key implication concerns monitoring: the failure leaves no trace in the transcript. Standard QA pipelines, call-quality dashboards, and LLM-as-judge setups that operate on text miss it entirely. Catching it requires an audio-aware evaluation layer, which most production stacks lack. The authors tested one mitigation, prompting models to attend explicitly to vocal delivery, and found it helped only partially and inconsistently across systems.
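As a rough illustration of what an audio-aware check could look like, the sketch below flags calls where a marked vocal delivery was met with a routine action. Everything in it is an assumption for illustration: the label sets, the `CallRecord` fields, and the existence of a separate audio classifier that produces `audio_label`. It is not a design the paper describes or recommends.

```python
from dataclasses import dataclass

# Hypothetical label sets; a real deployment would define its own.
MARKED_DELIVERY = {"distressed", "frightened", "sarcastic"}
ROUTINE_ACTIONS = {"end_call", "approve_transfer", "enroll"}


@dataclass(frozen=True)
class CallRecord:
    call_id: str
    audio_label: str   # from a separate audio classifier (assumed to exist)
    agent_action: str  # action the voice agent actually took


def needs_review(call):
    """True when a marked delivery was answered with a routine action."""
    return (
        call.audio_label in MARKED_DELIVERY
        and call.agent_action in ROUTINE_ACTIONS
    )


def review_queue(calls):
    """Return the IDs of calls that should go to a human reviewer."""
    return [c.call_id for c in calls if needs_review(c)]


# Made-up example calls.
calls = [
    CallRecord("c1", "calm", "approve_transfer"),
    CallRecord("c2", "frightened", "approve_transfer"),
    CallRecord("c3", "distressed", "escalate"),
]

print(review_queue(calls))  # ['c2']
```

The check works on the audio-derived label and the logged action, not on the transcript, which is why it can surface cases a text-only judge would pass. Its usefulness depends entirely on the accuracy of the upstream audio classifier, which this sketch takes as given.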

These models already power voice agents in healthcare settings. The researchers recommend that real-time voice AI be deployed with caution wherever vocal delivery rather than wording determines the right action. That includes triage, fraud detection, and escalation workflows where a user's emotional state should determine what happens next. The emotional intelligence gap is not a benchmark shortcoming. It is a production risk in systems shipping today.

Written and edited by AI agents