NVIDIA Inference Stack Reduces Token Costs by Up to 5x on Blackwell in One Month
NVIDIA's full-stack inference software on the Blackwell GPU platform has cut token costs by up to 5x for the DeepSeek V4 model within a single month, according to benchmark data released June 30. The gains come from layered optimizations across production serving (disaggregated inference, autoscaling), runtime acceleration (kernel fusion, multi-token prediction), and hardware exposure (NVLink bandwidth, NVFP4 precision). Combined, these optimizations yield up to 20x throughput per GPU, but realizing that gain requires coordination across all layers of the stack.
Real-world adoption is already underway: Baseten deployed DeepSeek V4 Pro on Blackwell with 50% higher token throughput; Deep Infra and Together AI are serving frontier open models at scale; Cognition uses NVIDIA's Dynamo framework to manage inference GPUs for reinforcement-learning workloads without building custom infrastructure. NVIDIA's ecosystem reach shortens the path from research to production: PyTorch natively supports Tensor Cores and NVFP4, and open projects like vLLM and SGLang integrate CUDA optimizations at release, so new research breakthroughs (DFlash speculative decode, FastVideo) translate to production performance in weeks, not months.
For infrastructure architects, this signals that inference is maturing as a commodity: raw tokens per dollar is no longer a competitive moat; the game is now vertical integration and software-hardware co-design. Teams running large inference fleets can no longer justify generic GPU utilization targets. They need to instrument full-stack cost per token and measure ROI on software stack updates. Expect rapid deprecation of older Hopper deployments as Blackwell benchmarks spread; renewal cycles are compressing.
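As a rough illustration of what instrumenting cost per token and update ROI can look like, here is a minimal Python sketch. It assumes a fixed all-in price per GPU-hour and a measured per-GPU token throughput; the names `StackSnapshot`, `cost_per_million_tokens`, and `update_roi` are hypothetical, and the dollar and throughput figures are placeholders, not numbers from the NVIDIA benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackSnapshot:
    """Measured serving stats for one software-stack version (hypothetical fields)."""
    gpu_hourly_usd: float          # assumed all-in cost of one GPU-hour
    tokens_per_sec_per_gpu: float  # measured generated tokens per second per GPU


def cost_per_million_tokens(s: StackSnapshot) -> float:
    """USD per one million generated tokens at the measured throughput."""
    if s.tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_gpu_hour = s.tokens_per_sec_per_gpu * 3600
    return s.gpu_hourly_usd / tokens_per_gpu_hour * 1_000_000


def update_roi(before: StackSnapshot, after: StackSnapshot,
               monthly_tokens: float, update_cost_usd: float) -> float:
    """Monthly savings from a stack update divided by its one-time rollout cost."""
    if update_cost_usd <= 0:
        raise ValueError("update cost must be positive")
    saving_per_million = cost_per_million_tokens(before) - cost_per_million_tokens(after)
    monthly_saving = saving_per_million * monthly_tokens / 1_000_000
    return monthly_saving / update_cost_usd


if __name__ == "__main__":
    # Placeholder inputs: same GPU price, 5x higher throughput after the update.
    old = StackSnapshot(gpu_hourly_usd=4.0, tokens_per_sec_per_gpu=1000.0)
    new = StackSnapshot(gpu_hourly_usd=4.0, tokens_per_sec_per_gpu=5000.0)
    print(f"before: ${cost_per_million_tokens(old):.4f} per 1M tokens")
    print(f"after:  ${cost_per_million_tokens(new):.4f} per 1M tokens")
    print(f"ROI:    {update_roi(old, new, 1e9, 100.0):.2f}x monthly savings per dollar spent")
```

At a fixed GPU-hour price, cost per token is inversely proportional to throughput, so a 5x throughput gain from software alone shows up as a 5x lower cost per token. That is why this metric, rather than utilization, captures the value of a stack update.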
Sources
- Primary source
- NVIDIA Blog: How NVIDIA's Inference Software Stack Powers the Lowest Token Cost
“On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month. Combined, they increase throughput by up to 20x”