Hugging Face and Cerebras published a joint voice AI demo on July 1 pairing Gemma 4 31B with Cerebras's wafer-scale inference for a fully open, cascaded speech-to-speech pipeline. The stack chains Nvidia's Parakeet for ASR, Gemma 4 31B on Cerebras for language inference, and Alibaba's Qwen3TTS for synthesis. Every layer is modular, open, and replaceable. The same pipeline runs on more than 9,000 Reachy Mini robots in production.
Production voice systems today achieve acceptable median latency but deliver multi-second delays at P95. Those tail delays, common when tool calls or multimodal steps compound, make AI voice feel unreliable even though it works on the happy path. Cerebras targets the LLM step, typically the dominant bottleneck. At 1,851 output tokens per second on Gemma 4 31B, generating a 150-token LLM response takes roughly 80ms. That leaves about 120ms for ASR and TTS while staying under 200ms, the threshold voice researchers treat as the boundary between natural and delayed conversation.
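The arithmetic behind that budget can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above (150 tokens, 1,851 tokens per second, a 200ms threshold). The function names are illustrative, and the calculation covers steady-state generation only, not time to first token or network overhead.

```python
def generation_ms(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady decode rate, in milliseconds."""
    return tokens / tokens_per_second * 1000.0


def remaining_budget_ms(total_budget_ms: float, llm_ms: float) -> float:
    """Budget left for ASR, TTS, and everything else after the LLM step."""
    return total_budget_ms - llm_ms


llm_ms = generation_ms(150, 1851)           # figures quoted in the article
left_ms = remaining_budget_ms(200, llm_ms)  # 200 ms conversational threshold
print(f"LLM: {llm_ms:.0f} ms, left for ASR + TTS: {left_ms:.0f} ms")
# prints: LLM: 81 ms, left for ASR + TTS: 119 ms
```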
Gemma 4 31B is a 31-billion-parameter model released by Google DeepMind under Apache 2.0. It scores 29 on the Artificial Analysis Intelligence Index, comparable to Claude Haiku 4.5 at 30. On Cerebras, it runs 18x faster than Haiku. Time to first token sits at 1.5 seconds per Cerebras's benchmarks. Current Artificial Analysis measurements put throughput at 2,106 tokens per second, up from 1,851 at launch. Blended pricing on Cerebras runs $1.04 per million tokens with a 131K context window.
Modularity is the engineering bet. Nothing in the HF stack is proprietary. Swap Parakeet for Whisper or a domain-specific ASR model and the rest of the pipeline is unaffected. Replace Qwen3TTS with a different synthesis layer and the language-model step remains unchanged. The Reachy Mini deployment illustrates this: the same code runs conversational assistants and embodied robotics, letting teams tune individual components for latency, quality, and cost without rearchitecting the entire system.
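One way to picture that modularity is a pipeline that depends only on three small interfaces. The sketch below is hypothetical: the class and method names (`VoicePipeline`, `transcribe`, `reply`, `synthesize`) are invented for illustration and are not the API of the huggingface/speech-to-speech repo, and the stages are trivial stand-ins rather than wrappers around Parakeet, Gemma 4 31B, or Qwen3TTS.

```python
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Cascaded speech-to-speech: audio in, text in the middle, audio out."""
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stand-in stages (hypothetical); real ones would wrap actual models.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").upper()


class PrefixLLM:
    def reply(self, text: str) -> str:
        return f"You said: {text}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(EchoASR(), PrefixLLM(), BytesTTS())
# Swap only the ASR stage; the LLM and TTS stages are reused untouched.
swapped = VoicePipeline(UpperASR(), pipeline.llm, pipeline.tts)
print(pipeline.respond(b"hello"), swapped.respond(b"hello"))
```

Because `VoicePipeline` only relies on the method signatures, any stage that satisfies the matching protocol can be dropped in without changes to the other two.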
Logan Kilpatrick of Google DeepMind: "If every model was doing 2,000 tokens per second, you would probably build different products. You wouldn't build the same product and just have it be faster." The point applies directly to voice. GPU-speed inference at 100–150 TPS forces product teams to add filler audio, stream sentence-by-sentence with audible lag, or restrict the system prompt to reduce generation length. At 1,800+ TPS those compensations become unnecessary.
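A rough comparison shows why those workarounds exist. The sketch below uses the throughput figures from the text (100, 150, and 1,851 TPS) and a 150-token response. The 20-token first sentence is an assumption chosen for illustration, and the calculation ignores time to first token.

```python
def generation_ms(tokens: int, tokens_per_second: float) -> float:
    """Steady-state generation time in milliseconds."""
    return tokens / tokens_per_second * 1000.0


FIRST_SENTENCE_TOKENS = 20   # assumed length of an opening sentence
FULL_RESPONSE_TOKENS = 150   # response length used in the article

table = {
    tps: (
        generation_ms(FIRST_SENTENCE_TOKENS, tps),
        generation_ms(FULL_RESPONSE_TOKENS, tps),
    )
    for tps in (100, 150, 1851)
}

for tps, (first_ms, full_ms) in table.items():
    print(f"{tps:>5} TPS: first sentence {first_ms:.0f} ms, "
          f"full response {full_ms:.0f} ms")
# prints:
#   100 TPS: first sentence 200 ms, full response 1500 ms
#   150 TPS: first sentence 133 ms, full response 1000 ms
#  1851 TPS: first sentence 11 ms, full response 81 ms
```

Under these assumptions, at 100 TPS even the first sentence alone uses up a 200ms budget, which is why sentence-level streaming and filler audio are common at GPU speeds.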
The current demo is speech-in, text-in-the-middle, speech-out: a cascaded pipeline, not end-to-end audio. Each stage adds its own latency floor; each boundary is a failure point. Parakeet's transcription accuracy on noisy audio, Qwen3TTS prosody quality, and interruption handling are outside the scope of what Cerebras's inference speed addresses. The LLM bottleneck is solved for this model at this size. ASR and TTS are still where latency variance accumulates in real deployments.
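Since total latency in a cascade is the sum of its stages, per-stage timing is the natural way to see where variance comes from. The following is a minimal sketch with hypothetical stand-in stages (plain lambdas, not real models); `run_timed` is an illustrative helper, not part of any library.

```python
import time
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


def run_timed(stages: list[Stage], payload: Any) -> tuple[Any, dict[str, float]]:
    """Run stages in order, recording wall-clock milliseconds for each."""
    timings: dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


# Stand-in stages (hypothetical); real ones would call ASR, LLM, and TTS models.
stages: list[Stage] = [
    ("asr", lambda audio: audio.decode("utf-8")),
    ("llm", lambda text: text[::-1]),
    ("tts", lambda text: text.encode("utf-8")),
]

output, timings = run_timed(stages, b"hello")
total_ms = sum(timings.values())
for name, ms in timings.items():
    print(f"{name}: {ms:.3f} ms")
print(f"total: {total_ms:.3f} ms")
```

Collecting these per-stage timings across many requests and comparing percentiles per stage, rather than only end to end, shows which component drives the tail.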
The repo is public at huggingface/speech-to-speech. For teams evaluating real-time voice stacks, the architecture is a usable baseline: Apache 2.0 throughout, three well-documented components, and a reference deployment at scale. Cerebras Inference Cloud access for Gemma 4 31B is in public preview.
Written and edited by AI agents