Full-duplex voice agents have addressed acoustic latency, yet many still struggle with conversational dynamics: knowing when to pause, when to interrupt, and when to yield. A post-training alignment method from Kyutai and Gradium applies Group Relative Policy Optimization (GRPO) to interaction dynamics, fine-tuning Moshi and PersonaPlex models to improve turn-taking, backchanneling, pause handling, and responses to user interruption without degrading semantic quality.

The stack is specified in detail. Moshi is a 7B-parameter full-duplex model built on Mimi, a streaming neural audio codec that encodes 24 kHz audio at a 12.5 Hz frame rate (80 ms frames) and 1.1 kbps. That yields a theoretical latency of 160 ms and roughly 200 ms in practice on an L4 GPU, with two parallel audio streams, one for the user and one for the agent. NVIDIA's PersonaPlex-7B shares the same architecture and Helium LLM backbone, is trained on the Fisher English corpus plus synthetic dialogues, and reports latency between 205 and 265 ms.

The new work applies GRPO to short segments drawn from real human conversation corpora and optimizes four interactivity axes: pause handling, turn-taking, backchanneling, and user interruption. Each axis has its own rule-based reward function, while an LLM-judge reward preserves semantic quality. This guards against the degradation seen in the concurrent work ASPIRin, where GPT-4o semantic quality scores fell from 3.89 to 3.73 on a 0–5 scale. The resulting checkpoints, moshika-rl-seamless and personaplex-rl-seamless, are available on Hugging Face under open licenses.
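The Mimi figures quoted above are mutually consistent, which a few lines of arithmetic confirm. The sketch below derives frame duration, samples per frame, and bits per frame from the stated sample rate, frame rate, and bitrate. The function name `mimi_frame_budget` is hypothetical, and reading the 160 ms theoretical latency as two frame durations is an assumption based only on the numbers given here.

```python
# Back-of-the-envelope check of the codec figures quoted in the text.
# mimi_frame_budget is a hypothetical helper, not part of any Mimi/Moshi API.

def mimi_frame_budget(sample_rate_hz: float = 24_000,
                      frame_rate_hz: float = 12.5,
                      bitrate_bps: float = 1_100) -> dict:
    frame_ms = 1000 / frame_rate_hz                    # duration of one codec frame
    samples_per_frame = sample_rate_hz / frame_rate_hz  # audio samples per frame
    bits_per_frame = bitrate_bps / frame_rate_hz         # bit budget per frame
    return {
        "frame_ms": frame_ms,
        "samples_per_frame": samples_per_frame,
        "bits_per_frame": bits_per_frame,
        # Assumption: the quoted 160 ms equals two frame durations.
        "two_frames_ms": 2 * frame_ms,
    }


budget = mimi_frame_budget()
print(budget)
```

At 12.5 Hz each frame spans 80 ms and 1,920 samples, and 1.1 kbps leaves 88 bits per frame, so any turn-taking decision is made at 80 ms granularity at best.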

Gains on Full-Duplex-Bench v1 and v2 are consistent across both model families. The tuned models distinguish mid-utterance pauses from genuine turn yields, which significantly reduces the pause-handling takeover rate and addresses the failure mode of prior systems such as dGSLM, which treated every silence as a handoff. Turn-taking latency and takeover rate improve together, overcoming the usual trade-off between responsiveness and patience. Backchanneling performance holds across frequency, latency, and appropriateness, and user-interruption latency improves while semantic scores exceed those of the base model. Although training uses short extracted clips, the behavior generalizes to real-time multi-turn dialogue, as tested against GPT-Realtime in Full-Duplex-Bench v2.
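To make the pause-handling takeover rate concrete, here is a minimal sketch of one way such a metric could be computed from timestamps. It is a simplified proxy, not the benchmark's actual definition: the interval representation, the `takeover_rate` function, and the rule that any agent speech overlapping a user pause counts as a takeover are all assumptions for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s), hypothetical representation


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def takeover_rate(user_pauses: List[Interval],
                  agent_speech: List[Interval]) -> float:
    """Fraction of mid-utterance user pauses during which the agent spoke.

    Assumption: any agent speech overlapping a pause counts as a takeover.
    """
    if not user_pauses:
        return 0.0
    taken = sum(
        any(overlaps(pause, seg) for seg in agent_speech)
        for pause in user_pauses
    )
    return taken / len(user_pauses)


# Four mid-utterance pauses; the agent talks over only the second one.
pauses = [(1.0, 1.6), (3.0, 3.5), (5.0, 5.4), (7.0, 7.8)]
agent = [(3.2, 4.0), (9.0, 11.0)]
print(takeover_rate(pauses, agent))  # 0.25
```

A system that treats every silence as a handoff would score close to 1.0 on this proxy; lower is better for pause handling.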

The main operational challenge lies in the reward machinery. Designing a rule-based reward for each axis requires manual engineering, and scaling to more complex interaction behaviors becomes increasingly difficult. The LLM-judge semantic guardrail prevents quality regression but adds inference overhead and another serving dependency to the training loop. Prior RL approaches such as ORISE covered only barge-in and backchanneling, using customized automated annotations; expanding coverage to four axes already stretches the hand-crafted approach. There is also no discussion of how the tuned policies perform under background noise, overlapping speech, or codec artifacts that push Mimi outside its training distribution. That remains an open question for production deployment at scale.
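The general shape of such a reward loop can be sketched in a few lines. The example below combines a rule-based interaction reward with a normalized LLM-judge score and then applies the standard GRPO step of standardizing rewards within a group of sampled responses. The function names, the additive combination, the `judge_weight` value, and the 0–5 judge scale are assumptions for illustration; the source does not specify how the rewards are actually mixed.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(rule_reward: float,
                    judge_score: float,
                    judge_weight: float = 0.5) -> float:
    """Rule-based interaction reward plus a semantic guardrail term.

    Assumptions: judge_score is on a 0-5 scale and is mixed in additively.
    """
    return rule_reward + judge_weight * (judge_score / 5.0)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same clip: (rule reward, judge score).
samples = [(1.0, 4.0), (0.0, 4.5), (1.0, 2.0), (0.0, 3.0)]
rewards = [combined_reward(rule, judge) for rule, judge in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, no learned value model is needed, but every sampled response must be scored by each reward, which is where the judge's inference overhead accumulates.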

FIG. 02 GRPO improves semantic quality and latency vs. baseline and prior RL approaches on Moshi and PersonaPlex models. — arXiv:2606.11167, GitHub Kyutai Labs

For architects, the takeaway is clear: a token-level supervised loss does not directly optimize conversation timing, but a multi-reward GRPO loop with a semantic guardrail can sharpen latency, turn-taking, and naturalness without compromising language quality.

Written and edited by AI agents · Methodology