Research Tuesday, April 28, 2026, 00:00

Microsoft's MIT-licensed VibeVoice transcribes 1 hour of audio in ~9 minutes on an M5 MacBook, with built-in speaker diarization

Microsoft's VibeVoice, a Whisper-style speech-to-text model with speaker diarization built directly into the architecture, is getting renewed attention after developer Simon Willison published hands-on benchmarks of the 4-bit MLX-quantized version (5.71 GB) running on a 128 GB M5 Max MacBook Pro. The model transcribed one hour of podcast audio in 8 minutes 45 seconds, roughly 6.9 times faster than real time, peaking at ~61.5 GB of RAM during the prefill stage and dropping to ~18 GB during generation. The model is MIT-licensed, and the community-converted mlx-community/VibeVoice-ASR-4bit checkpoint is available on Hugging Face.

Unlike Whisper, VibeVoice outputs structured JSON with per-segment speaker_id fields, timestamps, and durations, so diarization is available downstream without a separate pipeline step. Willison noted that the model correctly distinguished the two conversation participants plus a distinct "sponsor read" voice, surfacing three speaker IDs for a two-person podcast. A hard 1-hour audio limit per inference call means longer recordings must be split into overlapping windows, with speaker IDs reconciled across segments. The original model (17.3 GB, fp16) was released by Microsoft on January 21, 2026; Willison's writeup is the first detailed Apple Silicon performance profile to gain wide circulation.
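
The windowing workaround for the 1-hour limit could look roughly like the sketch below. It is an illustration only, not VibeVoice's or Willison's method: the function names, the 60-second overlap, and the "start" and "end" segment keys are assumptions (the text confirms only a speaker_id field, timestamps, and a duration), and the matching rule, which pairs the speakers that share the most talk time inside the overlap, is one simple heuristic among several.

```python
from collections import defaultdict

MAX_WINDOW_S = 3600.0  # the 1-hour per-call limit described in the text


def plan_windows(total_s, window_s=MAX_WINDOW_S, overlap_s=60.0):
    """Return (start, end) pairs in seconds that cover the whole recording.

    Each window begins overlap_s seconds before the previous one ends.
    The 60 s default overlap is an assumed value, not a documented one.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be in [0, window_s)")
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            return windows
        start = end - overlap_s


def match_speakers(prev_segments, next_segments):
    """Map speaker IDs of the next window onto those of the previous window.

    Both lists hold dicts with absolute "start" and "end" times in seconds
    (hypothetical key names) plus "speaker_id". Pairs are ranked by how many
    seconds they talk at the same time, then assigned greedily one-to-one.
    """
    shared = defaultdict(float)
    for p in prev_segments:
        for n in next_segments:
            common = min(p["end"], n["end"]) - max(p["start"], n["start"])
            if common > 0:
                shared[(n["speaker_id"], p["speaker_id"])] += common
    mapping = {}
    for (next_id, prev_id), _ in sorted(shared.items(), key=lambda kv: -kv[1]):
        if next_id not in mapping and prev_id not in mapping.values():
            mapping[next_id] = prev_id
    return mapping


if __name__ == "__main__":
    # A 2.5-hour recording needs three calls with the assumed 60 s overlap.
    for start, end in plan_windows(9000.0):
        print(f"{start:7.0f} s -> {end:7.0f} s")
```

Speakers that first appear in a later window (a sponsor read, for example) get no entry in the mapping and would need a fresh global ID; that bookkeeping is left out to keep the sketch short.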

Sources

Primary source
microsoft/VibeVoice on GitHub
mlx-community/VibeVoice-ASR-4bit on Hugging Face