Corsair inference accelerator cuts response time 12× in GPU hybrid setup

D-Matrix's Corsair inference accelerator has entered full production, backed by Microsoft's M12 venture arm. The company claims up to 10 times faster inference and 5 times lower energy use than standalone NVIDIA GPUs, provided the workload is small enough for its weights to fit in on-chip SRAM. Corsair is built on TSMC's N6 node and avoids HBM and CoWoS packaging. Each card fits a standard PCIe slot and carries four chiplets with a total of 2 GB of SRAM and 150 TB/s of memory bandwidth, approximately 20 times that of a high-end GPU. The full SquadRack system, built with Arista, Broadcom, and Super Micro, pools up to 128 GB of SRAM per rack. D-Matrix targets interactive workloads such as voice agents, chatbots, and tools like Claude Code or OpenClaw, where latency is critical.
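A rough way to read the bandwidth figure: when decoding is memory-bound, each generated token has to stream the model weights from memory once, so bandwidth divided by weight size caps the single-stream token rate. The Python sketch below applies that bound to the article's numbers. The GPU bandwidth is only what the "20 times" ratio implies, not a quoted spec, and the 2 GB weight size is an assumption chosen to fill one card. Compute time, KV-cache traffic, and interconnect overhead are ignored, so real throughput would be lower.

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode rate in the bandwidth-bound regime,
    where every token requires reading all weights from memory once."""
    return bandwidth_bytes_per_s / weight_bytes


TB = 1e12  # decimal terabyte
GB = 1e9   # decimal gigabyte

corsair_bw = 150 * TB     # per-card bandwidth quoted in the article
gpu_bw = corsair_bw / 20  # implied by the "~20x" ratio; not a quoted GPU spec
weights = 2 * GB          # assumption: weights exactly fill one card's SRAM

corsair_bound = max_decode_tokens_per_s(corsair_bw, weights)
gpu_bound = max_decode_tokens_per_s(gpu_bw, weights)

print(f"Corsair bound: {corsair_bound:,.0f} tokens/s")
print(f"GPU bound:     {gpu_bound:,.0f} tokens/s")
print(f"Ratio:         {corsair_bound / gpu_bound:.0f}x")
```

The ratio of the two bounds equals the bandwidth ratio by construction. The estimate therefore says nothing about measured end-to-end speedups, which depend on the rest of the serving stack.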

Independent benchmarks from Gimlet Labs ran a 1.6-billion-parameter speculative draft model on Corsair, paired with a Blackwell GPU hosting the 120-billion-parameter GPT-OSS target. End-to-end response time fell from 24 seconds to under 2 seconds, roughly a 12 times improvement over the GPU-only baseline. D-Matrix reports 10 times speedups and 3 times cost savings in this hybrid configuration, with up to 5 times better energy efficiency. The advantage is structural: generating draft tokens is memory-bandwidth-bound, because each token requires reading the draft model's weights, and Corsair's SRAM feeds the draft model fast enough to keep the main GPU saturated with tokens to verify.
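The division of labor in speculative decoding can be sketched in a few lines of Python. This is a toy greedy variant, not Gimlet's or D-Matrix's implementation. `target_model` and `draft_model` are hypothetical stand-ins that map a token context to the next token, and the draft is deliberately wrong at every fifth position. The small draft model proposes `k` tokens in sequence. The large target then checks them in what a real system would run as one batched pass, keeping tokens up to the first mismatch plus its own correction. The bonus token that full implementations emit when all `k` guesses match is omitted for brevity.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # token context -> next token (greedy)


def target_model(ctx: List[int]) -> int:
    """Hypothetical 'large' model: a fixed deterministic rule."""
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when the
    context length is a multiple of 5."""
    guess = target_model(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 17


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, target verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. Draft proposes up to k tokens one at a time (the bandwidth-bound stage).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. Target verifies the proposal; counted here as one batched pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # the target's token is always the one kept
            if guess != correct:
                break  # first mismatch: discard the remaining guesses
        tokens.extend(accepted)
    return tokens, target_passes


if __name__ == "__main__":
    out, passes = speculative_decode(target_model, draft_model, [1], 20, k=4)
    print(out)
    print(f"{passes} target passes for 20 new tokens")
```

Because every kept token comes from the target, the output matches plain greedy decoding with the target alone. The saving is in the number of target passes, which is why a faster draft stage translates into lower end-to-end latency.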

However, the capacity limit is clear. A single server can run a quantized Llama 3.1 8B, but large reasoning models do not fit into an SRAM-only design. D-Matrix is addressing this with Pavehawk, a follow-on chip that adds 3D-stacked DRAM to expand capacity. Until then, Corsair serves as an inference sidecar, not a GPU replacement. Bernstein's Stacy Rasgon confirms real customers are deploying Corsair "in conjunction with Nvidia," with the card priced in the tens of thousands of dollars, positioning it as a premium latency layer rather than a bulk-throughput cost leader.
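The capacity constraint comes down to simple arithmetic: weight footprint is parameter count times bytes per parameter, and each card holds 2 GB. The Python sketch below estimates the minimum number of cards needed for weights alone. The precisions are assumptions, since the article does not say how each model is quantized, and KV cache, activations, and runtime overhead are excluded, so real deployments would need more.

```python
CARD_SRAM_BYTES = 2 * 10**9  # 2 GB per card (article figure; decimal GB assumed)


def weight_bytes(params: int, bits_per_param: int) -> int:
    """Weights-only footprint in bytes."""
    return params * bits_per_param // 8


def cards_needed(params: int, bits_per_param: int) -> int:
    """Minimum cards whose pooled SRAM holds the weights (ceiling division)."""
    need = weight_bytes(params, bits_per_param)
    return -(-need // CARD_SRAM_BYTES)


# Model sizes come from the article; bit widths are illustrative assumptions.
examples = [
    ("1.6B draft model", 1_600_000_000, 16),
    ("Llama 3.1 8B", 8_000_000_000, 8),
    ("Llama 3.1 8B", 8_000_000_000, 4),
    ("GPT-OSS 120B target", 120_000_000_000, 8),
]

for name, params, bits in examples:
    gb = weight_bytes(params, bits) / 1e9
    print(f"{name} @ {bits}-bit: {gb:.1f} GB weights, >= {cards_needed(params, bits)} cards")
```

Under these assumptions the draft model needs two cards, which matches the Gimlet figure quoted in the sources, while the card count for the 120-billion-parameter target is more than an order of magnitude higher.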

D-Matrix, valued at around $2 billion after raising approximately $500 million, sells mainly to unnamed hyperscalers, neoclouds, and frontier labs (about 90 percent U.S.-based) for June 2026 delivery. Its realistic near-term role is accelerating specific stages within existing GPU clusters, not displacing them.

For architects, the advisable approach is to pair a narrow, ultra-high-bandwidth SRAM accelerator with the existing GPU fleet and use it for memory-bound inference stages such as speculative decoding, rather than attempting a rip-and-replace. The capacity wall arrives as soon as the workload grows beyond draft-sized models.

Sources

Corsair claims 10x faster inference and 5x less energy than standalone NVIDIA GPU for small workloads; production started June 2026 with Microsoft M12 backing; ~$500M raised, ~$2B valuation; cards cost tens of thousands of dollars
"D-Matrix says its chips can run inference workloads 10 times faster and using five times less energy than a standalone graphics processing unit from Nvidia — as long as the workloads are small."
cnbc.com ↗
Corsair platform enters full production June 9 2026; baseline 24-second response reduced to under 2 seconds pairing Corsair with GPUs; built on TSMC N6 process; organic substrate avoids HBM CoWoS packaging
"Independent testing by Gimlet Labs demonstrated that a baseline 24-second response time was reduced to less than two seconds when pairing Corsair accelerators with GPUs, as opposed to using GPUs only."
prnewswire.com ↗
Corsair card: 2 GB on-chip SRAM, 150 TB/s memory bandwidth (~20x high-end GPU); 1.6B speculative decoder fits on 2 cards; 2-5x interactivity speedup, up to 10x energy-optimized speedup vs GPU-only speculative decode on GPT-OSS-120B
"Each card has 2 GB of on-chip SRAM with 150 TB/s of memory bandwidth (~20X the memory bandwidth of high-end GPUs)... the Corsair-based solution delivers 2-5X end-to-end request speedup on configurations optimized for interactivity, and up to 10X end-to-end speedup for energy-optimized configurations."
gimletlabs.ai ↗
D-Matrix + Gimlet partnership delivers 10x latency and throughput-per-Watt vs GPU-only; Gimlet Cloud integrates Corsair alongside Blackwell GPUs for speculative decode offload
"d-Matrix and Gimlet's combined solution can deliver order-of-magnitude performance increases on both inference latency and throughput per Watt compared to traditional GPU-only deployments."
prnewswire.com ↗
Corsair scales to 128 GB SRAM in a rack; single server runs Llama 3.1 8B; Pavehawk next-gen chip adds 3D-stacked DRAM to support larger models
"Corsair was the world's first accelerator that offered a whopping 2GB of available SRAM per card, with the ability to scale up to 128 GB in a rack. A single server is capable of hosting and running a Llama 3.1 8B model that can handle specific tasks in agent pipelines."
d-matrix.ai ↗

Written and edited by AI agents · Methodology
