GRPO Cuts Pause-Handling Errors in Full-Duplex Agents Without Semantic Loss

Full-duplex voice agents have largely solved acoustic latency, yet many still struggle with conversational timing: knowing when to pause, interrupt, or yield. A post-training alignment method from Kyutai and Gradium applies Group Relative Policy Optimization (GRPO) to interaction dynamics, refining Moshi and PersonaPlex models to improve turn-taking, backchanneling, pause handling, and the handling of user interruptions without semantic degradation.

The stack is concrete. Moshi, a 7B-parameter full-duplex model, uses Mimi, a streaming neural audio codec that compresses 24 kHz audio to a 12.5 Hz representation at 1.1 kbps. Mimi's 80 ms frame size plus 80 ms of acoustic delay gives a theoretical latency of 160 ms, with practical latency as low as about 200 ms on an L4 GPU, and the model runs two parallel audio streams, one for the user and one for the agent. NVIDIA's PersonaPlex-7B, built on the same architecture with the Helium LLM backbone, was trained on the Fisher English corpus plus synthetic dialogues and reports latency of 205–265 ms. The new work applies GRPO to short segments extracted from human conversation corpora, optimizing four interactivity axes: pause handling, turn-taking, backchanneling, and user interruption. Each axis has its own rule-based reward function, and an LLM-judge reward guards semantic quality. That guard avoids the degradation reported by the concurrent work ASPIRin, whose GPT-4o semantic quality score fell from the base Moshi model's 3.89 to 3.73 (5 is the top of the scale). The resulting checkpoints, moshika-rl-seamless and personaplex-rl-seamless, are available on Hugging Face under open licenses.
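The codec figures above fit together arithmetically, and a quick check makes the latency budget easier to read. The sketch below only recombines numbers quoted in the article and its Moshi source (24 kHz, 12.5 Hz, 1.1 kbps, 80 ms of acoustic delay); the function name `mimi_budget` is made up for illustration and is not part of any Moshi or Mimi API.

```python
# Figures quoted in the article and its Moshi source; nothing here is measured.
SAMPLE_RATE_HZ = 24_000     # Mimi input audio
FRAME_RATE_HZ = 12.5        # Mimi latent frame rate
BITRATE_BPS = 1_100         # 1.1 kbps
ACOUSTIC_DELAY_MS = 80.0    # acoustic delay cited alongside the frame size


def mimi_budget() -> dict:
    """Derive per-frame quantities from the quoted codec figures."""
    frame_ms = 1000.0 / FRAME_RATE_HZ
    return {
        "frame_ms": frame_ms,
        "samples_per_frame": SAMPLE_RATE_HZ / FRAME_RATE_HZ,
        "bits_per_frame": BITRATE_BPS / FRAME_RATE_HZ,
        "theoretical_latency_ms": frame_ms + ACOUSTIC_DELAY_MS,
    }


if __name__ == "__main__":
    for name, value in mimi_budget().items():
        print(f"{name}: {value}")
```

Each 80 ms frame covers 1,920 input samples and carries 88 bits, and the frame size plus the acoustic delay gives the 160 ms theoretical figure. The gap up to roughly 200 ms in practice is compute and I/O, which this arithmetic does not model.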

Gains on Full-Duplex-Bench v1 and v2 are consistent across both model families. The tuned models better distinguish mid-utterance pauses from genuine turn yields, which substantially lowers the pause-handling takeover rate (TOR) and addresses the failure mode of prior systems like dGSLM that treated every silence as a handoff. Turn-taking latency and takeover rate improve together, sidestepping the usual trade-off between responsiveness and patience. Backchanneling performance holds across frequency, latency, and appropriateness, and user-interruption latency improves while semantic scores rise above the base model's. Although training uses only short extracted clips, the improvements generalize to real-time multi-turn dialogues, as tested against GPT-Realtime in Full-Duplex-Bench v2.
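To make the pause-handling metric concrete, here is a minimal sketch of a takeover-rate computation. It is not the Full-Duplex-Bench definition: the per-frame boolean activity input, the 400 ms continuous-speech threshold, and the function names `took_over` and `takeover_rate` are all assumptions chosen for illustration. The idea is only that a brief backchannel during a user's pause should not count as seizing the turn, while a sustained run of agent speech should.

```python
from typing import Sequence

FRAME_MS = 80  # Mimi frame size quoted in the article


def took_over(agent_speaking: Sequence[bool], min_speech_ms: int = 400) -> bool:
    """Return True if the agent speaks continuously for at least min_speech_ms.

    agent_speaking holds one flag per frame covering the user's pause.
    The 400 ms threshold is a hypothetical value, not a benchmark constant.
    """
    run = 0
    for speaking in agent_speaking:
        run = run + 1 if speaking else 0
        if run * FRAME_MS >= min_speech_ms:
            return True
    return False


def takeover_rate(clips: Sequence[Sequence[bool]]) -> float:
    """Fraction of pause clips in which the agent took the turn."""
    if not clips:
        raise ValueError("need at least one clip")
    return sum(took_over(clip) for clip in clips) / len(clips)


if __name__ == "__main__":
    clips = [
        [False] * 10,               # agent stays silent through the pause
        [True] * 2 + [False] * 8,   # 160 ms of speech: treated as a backchannel
        [True] * 6 + [False] * 4,   # 480 ms of speech: counted as a takeover
    ]
    print(takeover_rate(clips))
```

Under these assumptions only the third clip counts as a takeover, so the rate is one in three. A lower value on pause clips is better, whereas on genuine turn-yield clips the agent is expected to respond.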

The main operational challenge lies in the reward machinery. Designing a rule-based reward for each axis requires manual engineering, and the approach becomes harder to scale as more interaction behaviors are added. The LLM-judge semantic guardrail prevents quality regression but adds inference overhead and another serving dependency in the training loop. Prior RL approaches like ORISE covered turn-taking, user barge-in, and backchanneling with customized automated annotations; expanding coverage to four axes already stretches the hand-crafted approach. There is also no discussion of how these tuned policies perform under background noise, overlapping speech, or codec artifacts that push Mimi outside its training distribution, which remains an open question for production deployment at scale.
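The reward machinery described above can be sketched in a few lines. GRPO scores a group of sampled responses to the same prompt and normalizes each reward against the group's mean and standard deviation, so no learned value model is needed. The code below is a sketch under stated assumptions, not the authors' implementation: the additive mix in `combined_reward`, the `judge_weight` of 0.5, the judge ceiling of 5, and the use of the population standard deviation are all illustrative choices.

```python
from statistics import mean, pstdev
from typing import Sequence


def combined_reward(
    rule_reward: float,
    judge_score: float,
    judge_weight: float = 0.5,  # hypothetical weighting
    judge_max: float = 5.0,     # assumed ceiling of the judge's scale
) -> float:
    """Mix an axis-specific rule-based reward with a normalized judge score."""
    return rule_reward + judge_weight * (judge_score / judge_max)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled continuations of one pause clip: (rule reward, judge score).
    samples = [(1.0, 4.0), (0.0, 4.0), (1.0, 2.0), (0.0, 2.0)]
    rewards = [combined_reward(rule, judge) for rule, judge in samples]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.2f}")
```

Because advantages are relative to the group, a sample that waits through the pause and keeps a high judge score is pushed up, while one that interrupts with weak content is pushed down. When all samples score identically the advantages collapse to zero and that group contributes no gradient signal.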

FIG. 02 GRPO improves semantic quality and latency vs. baseline and prior RL approaches on Moshi and PersonaPlex models. — arXiv:2606.11167, GitHub Kyutai Labs

For architects, the takeaway is clear: token-level supervised loss does not directly optimize conversation timing, but a multi-reward GRPO loop with a semantic guardrail can sharpen latency, turn-taking, and naturalness without compromising language quality.

Sources

Post-training alignment using GRPO addresses four canonical interactivity axes: pause handling, turn-taking, backchanneling, and user interruption
"We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions."
arxiv.org ↗
Token-level supervised learning causes excessive silence and ill-timed turn-taking in full-duplex models
"current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking."
arxiv.org ↗
ASPIRin (concurrent work) degraded GPT-4o semantic quality scores from 3.89 to 3.73 when applying GRPO to Moshi; this method improves quality above the baseline
"ASPIRin (Hsiao et al., 2026) reports that its GPT-4o score decreases from the base Moshi model's 3.89 to 3.73 (on a ... 5 scale). In contrast, our method improves this score, demonstrating the effectiveness of incorporating an LLM-based reward."
arxiv.org ↗
RL training yields consistent improvements in both Moshi and PersonaPlex families; pause-handling TOR decreases and turn-taking latency and TOR simultaneously improve
"Within both the Moshi and PersonaPlex families, RL training yields consistent improvements over the respective base models. TOR of pause handling decreases substantially, while latency and TOR of turn-taking simultaneously improve."
arxiv.org ↗
Training on short extracted segments generalizes to real-time multi-turn dialogues on Full-Duplex-Bench v2
"although training is performed on short, extracted segments, we also demonstrate that the improvements generalize to real-time multi-turn dialogues through the evaluation on Full-Duplex-Bench v2"
arxiv.org ↗
Rule-based reward design requires manual engineering effort per axis and becomes difficult to scale
"the rule-based reward design for each interactivity axis requires manual engineering effort and may overlook other aspects of conversational dynamics. As the number of axes grows, this approach becomes increasingly difficult to scale."
arxiv.org ↗
Moshi achieves theoretical latency of 160 ms and ~200 ms practical latency on an L4 GPU, using the Mimi codec at 12.5 Hz and 1.1 kbps
"Moshi achieves a theoretical latency of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an L4 GPU. Mimi is a neural audio codec that processes 24 kHz audio, down to a 12.5 Hz representation with a bandwidth of 1.1 kbps."
github.com ↗
PersonaPlex-7B-v1 is built on Moshi architecture with the Helium backbone, trained on Fisher English corpus (7,303 conversations, 1,217 hours) plus ~410 hours of synthetic dialogues, with 205–265 ms latency
"NVIDIA used a combination of: Fisher English corpus — 7,303 real telephone conversations (up to 10 minutes each), totaling about 1,217 hours. Synthetic dialogues — approximately 410 hours of generated conversations."
collabnix.com ↗
PersonaPlex reports sub-second latency of 0.205–0.265 seconds and outperforms Gemini Live, Qwen 2.5 Omni, and Moshi on conversational dynamics and task adherence benchmarks
"Full-duplex design eliminates the pause-talk-pause cycle of traditional voice assistants with sub-second latency (0.205-0.265s). Outperforms Gemini Live, Qwen 2.5 Omni, and Moshi on conversational dynamics and task adherence benchmarks."
genmedialab.com ↗
ORISE prior work covered turn-taking, user barge-in, and backchanneling, not all four interactivity axes
"ORISE effectively improves robustness and accuracy in handling conversational dynamics, including turn-taking, user barge-in, and backchanneling."
openreview.net ↗

Written and edited by AI agents · Methodology
