Cerebras and Gemma 4 reach sub-200ms voice latency with modular open stack

Hugging Face and Cerebras published a joint voice AI demo on July 1 pairing Gemma 4 31B with Cerebras's wafer-scale inference for a fully open, cascaded speech-to-speech pipeline. The stack chains Nvidia's Parakeet for ASR, Gemma 4 31B on Cerebras for language inference, and Alibaba's Qwen3TTS for synthesis. Every layer is modular, open, and replaceable. The same pipeline runs on more than 9,000 Reachy Mini robots in production.

Some production voice systems today achieve acceptable median latency but still deliver multi-second delays at P95. Those tail delays, which grow when tool calls or multimodal steps require multiple turns, make AI voice feel unreliable even though it works on the happy path. Cerebras targets the LLM step, typically the dominant bottleneck. At 1,851 output tokens per second on Gemma 4 31B, generating a 150-token response takes roughly 80ms (150 / 1,851 ≈ 0.081 s), not counting time to first token. That leaves about 120ms for ASR and TTS while staying under 200ms, the threshold voice researchers treat as the boundary between natural and delayed conversation.
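The arithmetic behind that budget can be checked with a minimal sketch. It assumes a steady decode rate and counts only token generation; time to first token, network overhead, and the ASR and TTS stages are not modeled. The function and constant names are illustrative, not part of any published code.

```python
def generation_ms(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady decode rate, in milliseconds."""
    return tokens / tokens_per_second * 1000.0


TARGET_MS = 200.0        # conversational threshold cited in the article
RESPONSE_TOKENS = 150    # response length used in the article's example
CEREBRAS_TPS = 1851      # launch throughput reported for Gemma 4 31B

llm_ms = generation_ms(RESPONSE_TOKENS, CEREBRAS_TPS)
remaining_ms = TARGET_MS - llm_ms

print(f"LLM generation: {llm_ms:.0f} ms")                 # 81 ms
print(f"Left for everything else: {remaining_ms:.0f} ms")  # 119 ms
```

Changing `CEREBRAS_TPS` to the later 2,106 figure shortens the generation step further, which is why the budget is framed as "roughly" 80ms rather than a fixed number.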

FIG. 02 P95 latency comparison: typical production voice systems vs. the Cerebras/Gemma stack.

Gemma 4 31B is a 31-billion-parameter model released by Google DeepMind under Apache 2.0. It scores 29 on the Artificial Analysis Intelligence Index, comparable to Claude Haiku 4.5 at 30. On Cerebras, it runs 18x faster than Haiku. Time to first answer token, inclusive of reasoning, is 1.5 seconds per Cerebras's benchmarks. Current Artificial Analysis measurements put throughput at 2,106 tokens per second, up from 1,851 at launch. Blended pricing on Cerebras runs $1.04 per million tokens with a 131K context window.

Modularity is the engineering bet. Nothing in the Hugging Face stack is proprietary. Swap Parakeet for Whisper or a domain-specific ASR model and the rest of the pipeline is unaffected. Replace Qwen3TTS with a different synthesis layer and the language-model step remains unchanged. The Reachy Mini deployment illustrates this: the same code runs conversational assistants and embodied robotics, letting teams tune individual components for latency, quality, and cost without rearchitecting the entire system.
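A minimal sketch of what that kind of modularity looks like in code follows. Every class and method name here is hypothetical and chosen for illustration; none of it is taken from the huggingface/speech-to-speech repository, and the stand-in stages do trivial string work rather than real speech processing.

```python
from dataclasses import dataclass, replace
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass(frozen=True)
class VoicePipeline:
    """Cascaded pipeline: each stage only sees the previous stage's output."""
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stand-in stages (hypothetical): real ones would wrap actual models.
class Utf8ASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class StrippingASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").strip()


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class Utf8TTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(asr=Utf8ASR(), llm=UpperLLM(), tts=Utf8TTS())
# Swap only the ASR stage; the LLM and TTS stages are reused untouched.
swapped = replace(pipeline, asr=StrippingASR())

print(pipeline.respond(b"hello"))      # b'HELLO'
print(swapped.respond(b"  hello  "))   # b'HELLO'
```

Because each stage depends only on a narrow interface (bytes in, text out, and so on), replacing one component does not require changes to the others, which is the property the article attributes to the open stack.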

Logan Kilpatrick of Google DeepMind: "If every model was doing 2,000 tokens per second, you would probably build different products. You wouldn't build the same product and just have it be faster." The point applies directly to voice. GPU-speed inference at 100–150 TPS forces product teams to add filler audio, stream sentence-by-sentence with visible lag, or restrict the system prompt to reduce generation length. At 1,800+ TPS those compensations become unnecessary.
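To make the gap concrete, the sketch below compares generation time for the same response at the throughput levels mentioned above. It assumes a 150-token response (the length used earlier in the article) and a steady decode rate, and it ignores time to first token; the names are illustrative.

```python
RESPONSE_TOKENS = 150          # assumed response length, as in the earlier example
RATES_TPS = [100, 150, 1800]   # GPU-range rates and the 1,800+ TPS figure


def generation_ms(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady decode rate, in milliseconds."""
    return tokens / tokens_per_second * 1000.0


timings = {rate: generation_ms(RESPONSE_TOKENS, rate) for rate in RATES_TPS}

for rate, ms in timings.items():
    print(f"{rate} tok/s: {ms:.0f} ms to generate {RESPONSE_TOKENS} tokens")
```

Under these assumptions the full response takes about 1,500 ms at 100 TPS and 1,000 ms at 150 TPS, versus about 83 ms at 1,800 TPS. The first two are long enough that a product has to mask the wait; the last is not.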

The current demo is speech-in, text-in-the-middle, speech-out — a cascaded pipeline, not end-to-end audio. Each stage adds its own latency floor; each boundary is a failure point. Parakeet's transcription accuracy on noisy audio, Qwen3TTS prosody quality, and interruption handling are outside the scope of what Cerebras's inference speed addresses. The LLM bottleneck is solved for this model at this size. ASR and TTS are still where latency variance accumulates in real deployments.

FIG. 03 Voice pipeline architecture: ASR → 31B LLM → TTS, each component latency optimized for sub-200ms end-to-end. — Hugging Face & Cerebras

The repo is public at huggingface/speech-to-speech. For teams evaluating real-time voice stacks, the architecture is a usable baseline: Apache 2.0 throughout, three well-documented components, and a reference deployment at scale. Cerebras Inference Cloud access for Gemma 4 31B is in public preview.

Sources

Pipeline chains Nvidia Parakeet ASR → Gemma 4 31B on Cerebras → Qwen3TTS; same pipeline powers 9,000+ Reachy Mini robots
"This same Hugging Face speech-to-speech pipeline already powers Reachy Mini robots, with more than 9,000 robots in the wild."
huggingface.co ↗
Production voice systems see acceptable median latency but multi-second delays at P95; tool calls and multimodal steps compound this
"Today, some production systems see a reasonable median latency while still experiencing frustrating multi-second delays at the P95. Those delays become even more noticeable when tool calls or multimodal steps require multiple turns."
huggingface.co ↗
Cerebras runs Gemma 4 31B at 1,851 output tokens/second — 35x the speed of a typical GPU endpoint — per Artificial Analysis
"Cerebras runs Gemma 4 31B at a record 1,851 output tokens per second as measured by Artificial Analysis—35x the speed of a typical GPU endpoint."
cerebras.ai ↗
TTFT inclusive of reasoning is 1.5 seconds on Cerebras; Gemma 4 31B runs 18x faster than Claude Haiku 4.5 at comparable intelligence
"Gemma 4 on Cerebras returns its first answer token inclusive of reasoning in 1.5 seconds, making Cerebras the only provider that lets Gemma 4 be used in real-time settings."
cerebras.ai ↗
Gemma 4 31B scores 29 on Artificial Analysis Intelligence Index vs Claude Haiku 4.5 at 30; Apache 2.0 licensed
"Gemma 4 31B is comparable to Claude Haiku 4.5 in intelligence, scoring 29 and 30 respectively in the Artificial Analysis Intelligence Index. The key difference is that Gemma 4 is open-weight under Apache 2.0, and on Cerebras it runs 18x faster than Haiku."
cerebras.ai ↗
Logan Kilpatrick quote on 2,000 TPS changing what products you build
"If every model was doing 2,000 tokens per second, you would probably build different products. You wouldn't build the same product and just have it be faster."
cerebras.ai ↗
Current Artificial Analysis benchmark: Gemma 4 31B hits 2,106 tokens/second on Cerebras; blended price $1.04/M tokens; 131K context window
"For output speed, the fastest models are Gemma 4 31B (2,106 t/s)... For pricing, Gemma 4 31B ($1.04) offer the lowest blended prices... Gemma 4 31B (131k) support the largest context windows on Cerebras."
artificialanalysis.ai ↗

Written and edited by AI agents · Methodology
