D-Matrix, backed by Microsoft's M12 venture arm, has put its Corsair inference accelerator into full production. The company claims up to 10 times faster inference and 5 times better energy efficiency than standalone NVIDIA GPUs, provided the model weights fit into 2 GB of on-chip SRAM. Corsair is built on TSMC's N6 node and avoids HBM and CoWoS packaging. Each accelerator card carries four chiplets, 2 GB of SRAM, and 150 TB/s of memory bandwidth, approximately 20 times that of a high-end GPU, and fits a standard PCIe slot. The full SquadRack system, built with Arista, Broadcom, and Super Micro, pools up to 128 GB of SRAM per server. D-Matrix targets interactive workloads such as voice agents, chatbots, and tools like Claude Code or OpenClaw, where latency is critical.
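The 2 GB condition is easy to check with back-of-the-envelope arithmetic. The sketch below is an illustration, not D-Matrix tooling: it counts weights only (no KV cache, activations, or runtime overhead), treats 1 GB as 10^9 bytes, and the function names and the bit widths passed in are hypothetical choices made for the example.

```python
import math

GB = 1e9  # assumption: decimal gigabytes

def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed for the weights alone (ignores KV cache and activations)."""
    return num_params * bits_per_weight / 8

def fits_on_chip(num_params: float, bits_per_weight: int,
                 sram_bytes: float = 2 * GB) -> bool:
    """True if the weights fit in the given SRAM budget (default: the 2 GB cited above)."""
    return weight_bytes(num_params, bits_per_weight) <= sram_bytes

def budgets_needed(num_params: float, bits_per_weight: int,
                   sram_bytes: float = 2 * GB) -> int:
    """Minimum number of SRAM budgets of this size required to hold the weights."""
    return math.ceil(weight_bytes(num_params, bits_per_weight) / sram_bytes)

# A 1.6-billion-parameter model at an assumed 8 bits per weight needs 1.6 GB.
print(fits_on_chip(1.6e9, 8))    # fits in 2 GB
# The same model at 16 bits per weight needs 3.2 GB.
print(fits_on_chip(1.6e9, 16))   # does not fit in 2 GB
# An 8-billion-parameter model at an assumed 4 bits per weight needs 4 GB.
print(budgets_needed(8e9, 4))    # two 2 GB budgets, weights only
```

Because the estimate leaves out everything except weights, it is a lower bound on the memory actually required.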
Independent benchmarks from Gimlet Labs show where Corsair's advantage lies: running a 1.6-billion-parameter speculative draft model for a 120-billion-parameter GPT-OSS target. Paired with a Blackwell GPU, Corsair cut end-to-end response time from 24 seconds to under 2 seconds, an improvement of at least 12 times over the GPU-only baseline. D-Matrix itself reports 10 times speedups and 3 times cost savings in this hybrid configuration, with up to 5 times better energy efficiency. The advantage is structural: speculative decoding is memory-bandwidth-bound, and Corsair's SRAM feeds the draft model fast enough to keep the main GPU saturated.
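A rough model shows why draft speed matters in this setup. The sketch below is a simplification under stated assumptions, not a description of how Corsair or the Gimlet Labs benchmark works: it uses the standard expected-length expression from the speculative decoding literature (which assumes each drafted token is accepted independently with the same probability), and it bounds one draft step by the time needed to read every weight once. The acceptance rate, draft length, and timings in the usage lines are hypothetical placeholders.

```python
def expected_tokens_per_round(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the target always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def min_step_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized lower bound on one decode step: read all weights once."""
    return weight_bytes / bandwidth_bytes_per_s

def tokens_per_second(accept_rate: float, draft_len: int,
                      draft_step_s: float, verify_s: float) -> float:
    """Throughput when each round is draft_len draft steps plus one verify pass."""
    round_s = draft_len * draft_step_s + verify_s
    return expected_tokens_per_round(accept_rate, draft_len) / round_s

# Hypothetical inputs: 1.6 GB of draft weights (8 bits per weight assumed),
# 150 TB/s versus one twentieth of that, per the ratio quoted in the article.
fast = min_step_seconds(1.6e9, 150e12)
slow = min_step_seconds(1.6e9, 150e12 / 20)
print(slow / fast)  # ratio of the two lower bounds

# Hypothetical timings: shorter draft steps raise end-to-end throughput.
print(tokens_per_second(0.8, 4, draft_step_s=0.001, verify_s=0.02))
print(tokens_per_second(0.8, 4, draft_step_s=0.005, verify_s=0.02))
```

Real step times sit well above the bandwidth bound, so the model is useful for comparing configurations rather than predicting absolute latency.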
However, the capacity limit is clear. A single server can run a quantized Llama 3.1 8B, but large reasoning models cannot fit into an SRAM-based design. D-Matrix is addressing this with Pavehawk, a follow-on chip that uses 3D-stacked DRAM to expand capacity. Until then, Corsair serves as an inference sidecar, not a replacement. Bernstein's Stacy Rasgon confirms that real customers are deploying Corsair "in conjunction with Nvidia." With the card priced in the tens of thousands of dollars, it is positioned as a premium latency layer rather than a bulk-throughput cost leader.
D-Matrix, valued at around $2 billion after raising approximately $500 million, sells mainly to unnamed hyperscalers, neoclouds, and frontier labs, about 90 percent of them U.S.-based, with delivery slated for June 2026. Its realistic near-term role is accelerating specific stages within existing GPU clusters, not displacing them.
For architects, the practical move is to pair a narrow, ultra-high-bandwidth SRAM accelerator with the existing GPU fleet and use it for memory-bound inference stages such as speculative decoding, rather than attempt a rip-and-replace. Once a workload moves beyond draft-sized models, it hits the capacity wall.
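That advice amounts to a simple placement rule, sketched here for illustration. Everything in the block is hypothetical: the `Stage` type, the `place` function, the device labels, and the stage sizes are assumptions made for the example, not part of any D-Matrix or GPU vendor API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    """One stage of an inference pipeline (hypothetical description)."""
    name: str
    weight_bytes: float
    memory_bound: bool

def place(stage: Stage, sram_capacity_bytes: float) -> str:
    """Send memory-bound stages that fit to the SRAM accelerator, the rest to the GPU."""
    if stage.memory_bound and stage.weight_bytes <= sram_capacity_bytes:
        return "sram"
    return "gpu"

# Assumed sizes: a 1.6 GB draft model and a 60 GB target model.
draft = Stage("draft", weight_bytes=1.6e9, memory_bound=True)
target = Stage("target", weight_bytes=60e9, memory_bound=True)

for stage in (draft, target):
    print(stage.name, place(stage, sram_capacity_bytes=2e9))
```

The rule keeps the GPU fleet as the default and treats the SRAM accelerator as an opt-in fast path, which mirrors the sidecar role described above.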
Written and edited by AI agents