Hugging Face + Cerebras unlock real-time voice AI for robots; Gemma 4 at 1,800 TPS enables low-latency speech-to-speech on 7.5K+ Reachy Mini units
Hugging Face and Cerebras published a modular speech-to-speech pipeline on July 1 that pairs Cerebras Inference (running Gemma 4 31B at 1,851 tokens/sec) with open-source audio components: NVIDIA Parakeet for speech recognition, Alibaba Qwen3-TTS for speech synthesis, and Silero VAD for voice activity detection. The stack is deployed in production on Reachy Mini, Pollen Robotics' $300 desktop robot, which has more than 7,500 units in the field. Where earlier embodied AI approaches depended on monolithic cloud APIs, this pipeline combines open-source audio components with hosted LLM inference fast enough for real-time conversation, at latencies that were previously impractical for edge devices.
Gemma 4 31B on Cerebras reaches 1,851 tokens/sec. It is the first multimodal model the company has brought to its wafer-scale hardware, and it is reportedly 18x faster than Claude Haiku at equivalent quality. At that speed, agentic loops with multiple tool calls and vision reasoning can complete in real time instead of after multi-second waits. Cerebras claims the lower latency unlocks new product experiences: screenshot-to-patch, dense document analysis, and conversational editing with tight human-in-the-loop feedback cycles.
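To make the throughput figure concrete, the following is a rough back-of-the-envelope sketch in Python. It assumes a steady decode rate and ignores time-to-first-token, network overhead, and tool execution time. The loop shape (5 calls of 300 output tokens each) and the 18x-slower baseline are illustrative assumptions, not measurements.

```python
def decode_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream num_tokens at a steady decode rate.

    Ignores time-to-first-token, network latency, and tool execution.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


CEREBRAS_TPS = 1851.0              # throughput figure quoted in the article
BASELINE_TPS = CEREBRAS_TPS / 18   # hypothetical provider that is 18x slower

# Hypothetical agentic loop: 5 model calls, 300 output tokens each.
calls, tokens_per_call = 5, 300
total_tokens = calls * tokens_per_call

fast = decode_seconds(total_tokens, CEREBRAS_TPS)
slow = decode_seconds(total_tokens, BASELINE_TPS)

print(f"{total_tokens} tokens: {fast:.2f}s at {CEREBRAS_TPS:.0f} TPS, "
      f"{slow:.2f}s at {BASELINE_TPS:.0f} TPS")
```

Under these assumptions the loop's decode time is roughly 0.81 seconds at the quoted rate versus about 14.59 seconds at the slower one, which is the difference between a conversational pause and a visible wait.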
The Reachy Mini deployment shows the stack is shipping rather than just demoed: more than 7,500 units can now hold responsive voice conversations using open-source tooling. Hugging Face optimized the TTS bottleneck (Qwen3-TTS) with CUDA graphs and static KV caches, cutting time-to-first-audio from seconds to under 200 ms. Each component is modular and replaceable, so developers can swap the ASR, LLM, or TTS layer independently. The architecture reflects a shift away from monolithic cloud APIs toward composable, open inference stacks.
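One way to picture that modularity is a pipeline whose stages are defined only by small interfaces. The Python sketch below is illustrative and is not the Hugging Face implementation: the class and method names (`ASR`, `LLM`, `TTS`, `SpeechToSpeech`, and the stub components) are hypothetical, the stubs stand in for real models, and voice activity detection is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stage interfaces; any object with a matching method fits.
class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class SpeechToSpeech:
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)    # speech -> text
        answer = self.llm.reply(text)        # text -> text
        return self.tts.synthesize(answer)   # text -> speech


# Stub components standing in for real models.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    def reply(self, prompt: str) -> str:
        return prompt[::-1]


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = SpeechToSpeech(EchoASR(), UpperLLM(), BytesTTS())
print(pipeline.respond(b"hello robot"))   # b'HELLO ROBOT'

# Swap only the LLM layer; the ASR and TTS stages are reused unchanged.
swapped = SpeechToSpeech(pipeline.asr, ReverseLLM(), pipeline.tts)
print(swapped.respond(b"hello robot"))    # b'tobor olleh'
```

Because the pipeline depends only on the method each stage exposes, replacing one layer does not require touching the others.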
For infrastructure builders, this signals that real-time embodied AI is now feasible on open-weight models without proprietary vendor lock-in. Architects deploying voice-first robots or agents can benchmark Cerebras' Gemma 4 speeds against proprietary API vendors and local deployment alternatives. The modular stack also reduces operational risk: a component that falls behind can be replaced without rewriting the rest, and any stage that gets faster (e.g., better ASR) lowers end-to-end latency for the whole pipeline. Watch whether Cerebras' wafer-scale hardware becomes the default inference layer for multi-turn agentic loops or remains a premium option for latency-critical applications.
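A simple latency budget illustrates why a gain in one stage carries through to the whole pipeline. The Python sketch below assumes the stages run strictly in sequence, which overstates the total for a real pipeline that streams and overlaps stages. Every per-stage number is a placeholder chosen for illustration, not a measurement of the published stack.

```python
def end_to_end_ms(stage_ms: dict[str, float]) -> float:
    """Total latency when stages run one after another (no overlap)."""
    return sum(stage_ms.values())


# Placeholder per-stage latencies in milliseconds (illustrative only).
budget = {"vad": 30.0, "asr": 120.0, "llm": 150.0, "tts": 200.0}
baseline = end_to_end_ms(budget)

# Swap in a hypothetical faster ASR component; other stages are unchanged.
faster_asr = {**budget, "asr": 60.0}
improved = end_to_end_ms(faster_asr)

print(f"baseline: {baseline:.0f} ms, with faster ASR: {improved:.0f} ms "
      f"(saved {baseline - improved:.0f} ms)")
```

The same table can serve as a benchmarking template: replace the placeholders with measured values for each candidate ASR, LLM, or TTS provider and compare totals.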
Sources
- Primary source
- huggingface.co
“speech-to-speech experience that feels dramatically more natural. Instead of waiting for an AI to respond, conversations flow with the responsiveness users expect from human interaction”
- cerebras.ai
“Cerebras speed also translates into world class latency— Gemma 4 on Cerebras returns its first answer token inclusive of reasoning in 1.5 seconds, making Cerebras the only provider that lets Gemma 4 be used in real-time settings”
- sean-weldon.com
“With 7,500 units deployed and fully open-source software, the platform demonstrates that community-driven development can establish human-robot interaction paradigms”