Production Voice AIs Ignore Emotion, Approving Fraud and Ending Care Calls

Researchers at Together AI and Stanford tested four production real-time voice systems on scenarios where what a caller says and how they say it point to opposite conclusions. In 119 of 120 runs, the systems followed the words and ignored the voice. The paper, "Real-Time Voice AI Hears but Does Not Listen," by Martijn Bartelds, Federico Bianchi, and James Zou, was published June 24. Its findings bear directly on teams shipping voice agents into healthcare, fintech, and customer care.

The four systems tested are already in production: OpenAI's `gpt-realtime-2`, Google's `gemini-3.1-flash-live-preview`, and Alibaba's `qwen3.5-omni-plus-realtime` and `qwen3.5-omni-flash-realtime`. Each takes audio in and produces speech out in real time, with no transcription middleware in between. The caller audio was synthesized with ElevenLabs (`eleven_v3`), and the systems produced their own spoken replies. Each scenario condition ran five times; single-turn diagnostics ran twenty times.

The three scenarios were designed to separate word content from vocal tone. In the first, an emergency dispatcher's welfare callback, a caller cries while insisting nothing is wrong. All four systems ended the call instead of staying on the line or escalating. In the second, a wire-fraud check, a caller verbally authorizes a transfer in a frightened voice. All four approved the transfer as if the authorization had been calm. In the third, a volunteer recruitment call, a caller agrees in a mocking, sarcastic tone. All four signed them up. Across all four systems and three scenarios, 119 of 120 runs followed the words rather than the voice.
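The scoring behind a count like this can be sketched in a few lines of Python. This is a minimal illustration, not the authors' harness: the `Run` record, the action labels, the system name, and the example outcomes are all hypothetical. It assumes each scenario is built so that the words and the voice imply different actions, which is what lets an observed action be attributed to one or the other.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One scenario run (hypothetical schema, not from the paper)."""
    system: str           # placeholder system label
    scenario: str         # e.g. "welfare_callback"
    words_action: str     # action implied by the literal words
    voice_action: str     # action implied by the vocal delivery
    observed_action: str  # what the voice agent actually did

    def __post_init__(self) -> None:
        # A conflict scenario only works if the two cues disagree.
        if self.words_action == self.voice_action:
            raise ValueError("words and voice must imply different actions")


def classify(run: Run) -> str:
    """Attribute the observed action to the words, the voice, or neither."""
    if run.observed_action == run.words_action:
        return "followed_words"
    if run.observed_action == run.voice_action:
        return "followed_voice"
    return "other"


def tally(runs: list[Run]) -> Counter:
    """Count how many runs fall into each category."""
    return Counter(classify(run) for run in runs)


# Made-up outcomes for illustration only; these are not results from the paper.
example_runs = [
    Run("system_a", "welfare_callback", "end_call", "stay_on_line", "end_call"),
    Run("system_a", "wire_fraud_check", "approve", "hold_transfer", "approve"),
    Run("system_a", "volunteer_recruitment", "enroll", "do_not_enroll", "do_not_enroll"),
]
print(tally(example_runs))
```

Keeping an explicit "other" bucket avoids silently counting an unexpected action as a success for either cue.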

The operationally significant finding is about perception. When researchers asked each system directly whether a speaker sounded distressed, frightened, or sarcastic, with no downstream decision attached, three of the four systems reliably identified it, flagging the marked delivery far more often than the same words spoken calmly. GPT Realtime 2, Gemini 3.1 Flash Live, and Qwen3.5 Omni Plus detect the distress. They simply don't act on it when a decision is required. Qwen3.5 Omni Flash is the outlier: it misjudges the delivery even when asked directly, yet its downstream behavior is the same as the others'.
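One way to put a number on this split between hearing and acting is to compare two rates per system: how often the delivery is identified when asked directly, and how often the decision follows the voice. The sketch below assumes that framing; the function names are hypothetical, and the counts in the example are placeholders rather than figures from the paper (only the trial sizes of twenty diagnostics and five scenario runs come from the reported setup).

```python
def rate(hits: int, trials: int) -> float:
    """Fraction of trials that were hits."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= hits <= trials:
        raise ValueError("hits must be between 0 and trials")
    return hits / trials


def perception_action_gap(
    detected: int, diagnostic_trials: int, acted: int, decision_trials: int
) -> float:
    """Detection rate under direct query minus the rate of acting on the voice.

    A value near 1.0 means the system hears the delivery but does not use it;
    a value near 0.0 means perception and action agree.
    """
    return rate(detected, diagnostic_trials) - rate(acted, decision_trials)


# Placeholder counts for illustration only, not results from the paper:
# 18 of 20 direct queries identify the delivery, 0 of 5 decisions follow it.
gap = perception_action_gap(detected=18, diagnostic_trials=20, acted=0, decision_trials=5)
print(f"perception-action gap: {gap:.2f}")
```

A system that misjudges the delivery under direct query would show a small gap for a different reason, so the two rates are worth reporting alongside the difference.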

FIG. 02 Four production voice systems can perceive vocal emotion (terra-cotta) but fail to act on it in decision tasks (olive), revealing a critical gap between what models detect and what they prioritize. — Together AI & Stanford, 2024

The authors also tested accent and age estimation under the same word-versus-voice conflict. Five voices speaking English with distinct accents (Indian, Australian, Nigerian, French, Mandarin) read passages about countries other than their own. Asked about the speaker's accent, three of four systems mostly named the country in the script rather than the one in the voice. On age, older adult voices read a child's script, and most systems returned a child's age. Qwen3.5 Omni Plus was the partial exception on accent; Gemini 3.1 Flash Live followed the script least often on age. The pattern is consistent: the systems are reading transcripts, not voices.

For architects, the critical implication concerns monitoring. The failure leaves no trace in the transcript, so standard QA pipelines, call-quality dashboards, and LLM-as-judge setups that operate on text miss it entirely. Catching it requires an evaluation layer that works on the audio itself, which transcript-based stacks do not have. The authors tested one mitigation, prompting models to explicitly attend to vocal delivery, and found it helped only partially and inconsistently across systems.
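The reason a text-only evaluator cannot catch this can be made concrete: if two test cases share a transcript but call for different actions, any judge that sees only the transcript must give both the same verdict. The sketch below finds such cases in an evaluation set. The `Case` schema, the delivery labels, and the example transcripts are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Case(NamedTuple):
    """One evaluation case (hypothetical schema)."""
    transcript: str       # the words spoken
    delivery: str         # how they were spoken, e.g. "calm" or "crying"
    expected_action: str  # the action a careful human would take


def transcript_blind_spots(cases: list[Case]) -> dict[str, list[str]]:
    """Return transcripts that map to more than one expected action.

    A function of the transcript alone cannot separate these cases,
    so they need an evaluator that has access to the audio.
    """
    actions_by_transcript: dict[str, set[str]] = defaultdict(set)
    for case in cases:
        actions_by_transcript[case.transcript].add(case.expected_action)
    return {
        transcript: sorted(actions)
        for transcript, actions in actions_by_transcript.items()
        if len(actions) > 1
    }


# Illustrative cases only; the wording is invented for this example.
example_cases = [
    Case("I'm fine, nothing is wrong.", "calm", "end_call"),
    Case("I'm fine, nothing is wrong.", "crying", "stay_on_line"),
    Case("Yes, please send the transfer.", "calm", "approve"),
]
print(transcript_blind_spots(example_cases))
```

Any transcript returned here marks a spot where text-based QA is blind by construction, regardless of how capable the text judge is.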

These models already power deployed voice agents, including in regulated settings such as healthcare. The researchers recommend that real-time voice AI be deployed with caution until the emotional intelligence gap is closed. That caution applies wherever vocal delivery rather than wording determines the right action, including triage, fraud detection, and escalation workflows where a user's emotional state should decide what happens next. The gap is not just a benchmark shortcoming. It is a production risk in systems shipping today.

Sources

119 of 120 runs across all four systems followed the words and ignored the voice across three consequential scenarios
"all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic"
arxiv.org ↗
Three of four systems (GPT Realtime 2, Gemini 3.1 Flash Live, Qwen3.5 Omni Plus) can perceive the marked delivery when asked directly but do not act on it in decision tasks
"When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions"
arxiv.org ↗
Qwen3.5 Omni Flash misjudges delivery even when asked directly, yet acts on words just the same as the other three systems
"The fourth system, Qwen3.5 Omni Flash, misjudges delivery even when asked directly, yet acts on the words just the same"
arxiv.org ↗
Prompting models to explicitly attend to vocal delivery improves performance only partially and inconsistently
"Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently"
arxiv.org ↗
The failure leaves no trace in the transcript, so transcript-only evaluation misses it entirely
"the conflict leaves no trace in the transcript, so a transcript-only evaluation would miss the failure"
real-time-voice.github.io ↗
Five English accents tested (Indian, Australian, Nigerian, French, Mandarin); three of four systems named the country in the script, not the speaker's accent
"Asked the accent, three of the four systems mostly name the country in the script. Qwen3.5 Omni Plus is the partial exception."
real-time-voice.github.io ↗
Older adult voices reading a child's script were estimated by most systems to be children; Gemini 3.1 Flash Live performed best
"Older adult voices read a child's script. Asked the speaker's age, most systems return a child's age, following the script rather than the voice. Gemini 3.1 Flash Live does so least often."
real-time-voice.github.io ↗
Stimulus audio synthesized with ElevenLabs eleven_v3; each scenario condition run 5 times; single-turn diagnostics run 20 times
"The caller and the diagnostic stimuli are synthesized with ElevenLabs (eleven_v3). The systems produce their own spoken replies. Each condition is run five times."
real-time-voice.github.io ↗
These models already power deployed voice agents including in regulated healthcare settings
"the leading production systems, which already power deployed voice agents, including in regulated settings such as healthcare"
arxiv.org ↗
Authors recommend real-time voice AI be deployed with caution until the emotional intelligence gap is closed
"We recommend that real-time voice AI be deployed with caution until they can close the emotional intelligence gap."
real-time-voice.github.io ↗

Written and edited by AI agents · Methodology
