NVIDIA's Blackwell Software Stack Cuts Inference Costs 5x

Two separate benchmarks quantify what NVIDIA's inference software stack delivers on Blackwell. According to SemiAnalysis InferenceX data as of April 2026, the cost of running GPT-OSS-120B on a Blackwell B200 dropped from $0.11 at launch to $0.02 per million tokens within two months, a roughly 5x reduction with no hardware change. NVIDIA reports the same pattern: its Blackwell software stack cut DeepSeek V4 token costs by up to 5x in a single month. Software, not new silicon, is now the primary lever in inference economics.
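The arithmetic behind the headline figure is easy to check. The sketch below uses only the two per-million-token prices quoted from SemiAnalysis InferenceX. The function names are illustrative and do not come from any benchmark tooling. The exact ratio is 5.5, which the source rounds to 5x.

```python
def reduction_factor(before: float, after: float) -> float:
    """How many times cheaper `after` is than `before`."""
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Share of the original cost that was eliminated, in percent."""
    return (before - after) / before * 100.0


# Figures quoted from SemiAnalysis InferenceX (USD per million tokens).
launch_cost = 0.11
current_cost = 0.02

print(f"reduction: {reduction_factor(launch_cost, current_cost):.1f}x")
print(f"saved:     {percent_saved(launch_cost, current_cost):.0f}%")
```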

FIG. 02 Blackwell inference stack delivers 5x cost reduction and up to 5x throughput gains via software optimization alone. — NVIDIA, SemiAnalysis InferenceX (April 2026)

The stack has three layers. Production Operation handles distributed serving, autoscaling, and memory management. Application Acceleration optimizes compute-communication overlap and kernel fusion. Infrastructure Access exposes direct GPU, NVLink, and memory control. According to NVIDIA, disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction each deliver gains on their own and combine to increase throughput by up to 20x. Per SemiAnalysis InferenceX benchmarks as of April 2026, Blackwell B200 reaches up to 60,000 tokens per second per GPU on GPT-OSS-120B with the latest TensorRT-LLM stack, roughly 4x the throughput of H200 with TensorRT-LLM. The GB300 NVL72 delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than Hopper for low-latency agentic workloads, per SemiAnalysis InferenceX Q1 2026 benchmarks.
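Throughput gains matter because they translate directly into cost per token: at a fixed GPU price, doubling tokens per second halves the cost of each token. The sketch below shows that relationship. The GPU hourly price is a hypothetical placeholder, not a quoted rate, and the lower throughput value is simply a quarter of the quoted 60,000 figure chosen for illustration.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per million output tokens for one GPU running at a steady rate."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: hypothetical all-in GPU price; substitute your own rate.
HYPOTHETICAL_GPU_HOURLY_USD = 4.00

for tps in (15_000, 60_000):
    cost = cost_per_million_tokens(HYPOTHETICAL_GPU_HOURLY_USD, tps)
    print(f"{tps:>6} tok/s -> ${cost:.4f} per million tokens")
```

Because cost is inversely proportional to throughput in this model, a 4x throughput gain at the same hourly price yields a 4x lower cost per token. Real deployments also depend on utilization and batch shape, which this sketch ignores.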

FIG. 03 NVIDIA's Blackwell inference stack layers: production serving, runtime optimization, and community kernel improvements stack to yield 5x gains.

A single framework update illustrates the impact. According to NVIDIA, Eagle3-v2 speculative decoding tripled throughput at the 100 tokens-per-second-per-user operating point, lifting per-GPU output from 6,000 to 30,000 tokens per second without new hardware. On H100 with FP8 quantization, TensorRT-LLM reaches more than 10,000 output tokens per second at peak throughput with time-to-first-token below 100 ms. Production deployments report up to 4x throughput over native PyTorch and 2.72x better time-per-output-token than vLLM on long sequences.
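One way to read the per-GPU figures is as concurrency: at a fixed 100 tokens per second per user, aggregate throughput divided by the per-user rate gives the number of users one GPU can serve at once. The sketch below is that division only, with an illustrative function name. Note that the ratio of the two quoted per-GPU figures is 5x, which differs from the "tripled" wording; the source does not explain the difference.

```python
def concurrent_users(per_gpu_tps: float, per_user_tps: float) -> int:
    """Users one GPU can serve simultaneously at a fixed per-user rate."""
    return int(per_gpu_tps // per_user_tps)


PER_USER_TPS = 100  # operating point quoted in the article

before = concurrent_users(6_000, PER_USER_TPS)
after = concurrent_users(30_000, PER_USER_TPS)

print(f"before update: {before} users per GPU")
print(f"after update:  {after} users per GPU")
print(f"ratio:         {after / before:.1f}x")
```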

Real-world deployments confirm the gains. Baseten serves DeepSeek V4 Pro on Blackwell with TensorRT-LLM and delivers up to 50% more tokens per second by layering proprietary runtime optimizations on the open-source library. Hippocratic AI, running via DigitalOcean on Blackwell, achieved 30% higher inference throughput while holding time to first response under half a second across 10 million patient calls, where response latency is a hard constraint. Cognition adopted Dynamo to avoid building custom autoscaling infrastructure for reinforcement learning. Together AI used TensorRT-LLM on Blackwell to accelerate Cursor from checkpoint to live production.

The tradeoffs are real. TensorRT-LLM setup takes weeks, versus hours for vLLM. Guidance from Introl: organizations spending more than $50,000 a month on inference should evaluate TensorRT-LLM seriously, while smaller deployments may find vLLM adequate at far lower integration cost. TensorRT-LLM earns its complexity at scale, where a 10% throughput gain can translate to six figures annually.
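A rough way to size that tradeoff is sketched below. It assumes, as a simplification, that inference spend scales inversely with throughput, so a gain of g lets the same traffic run on 1/(1+g) of the GPU-hours. The function names are illustrative, and the model ignores engineering and integration costs.

```python
def fraction_saved(throughput_gain: float) -> float:
    """Share of spend removed when the same load needs 1/(1+gain) of the GPUs."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)


def annual_savings(monthly_spend_usd: float, throughput_gain: float) -> float:
    """Yearly savings in USD under the inverse-scaling assumption."""
    return monthly_spend_usd * 12 * fraction_saved(throughput_gain)


def monthly_spend_for_savings(target_annual_usd: float, throughput_gain: float) -> float:
    """Monthly spend at which the gain yields the target annual savings."""
    return target_annual_usd / (12 * fraction_saved(throughput_gain))


gain = 0.10  # the 10% throughput gain mentioned in the article
print(f"savings at $50,000/month: ${annual_savings(50_000, gain):,.0f} per year")
print(f"spend needed for $100,000/year: ${monthly_spend_for_savings(100_000, gain):,.0f} per month")
```

Under these assumptions, a 10% gain at the $50,000 monthly threshold saves roughly $55,000 a year, and six-figure annual savings begin at roughly $92,000 in monthly spend.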

NVIDIA co-develops directly with SGLang and vLLM communities. Kernel improvements for attention prefill and decode, GEMM, MLA, and MoE routing land in open-source projects simultaneously. Every Blackwell deployment picks them up without custom engineering. When DeepSeek V4 shipped, vLLM and SGLang had optimized Blackwell support ready immediately.

For architects choosing an inference stack today, the cost curve on Blackwell moves fast enough that deployment decisions made six months ago deserve a second look. The drop from $0.11 to $0.02 per million tokens on B200 happened in two months through software alone. Teams locked into per-token pricing set against older benchmarks leave margin on the table.

Sources

NVIDIA Blackwell software stack cut token costs by up to 5x on DeepSeek V4 model in just one month
"On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month."
blogs.nvidia.com ↗
Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction combine to increase throughput by up to 20x
"Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x."
blogs.nvidia.com ↗
Baseten used TensorRT-LLM to serve DeepSeek V4 Pro on Blackwell GPUs, delivering up to 50% more tokens per second
"Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second."
blogs.nvidia.com ↗
Hippocratic AI via DigitalOcean increased inference throughput by 30% while maintaining sub-half-second time to first response across 10 million patient calls
"DigitalOcean helped Hippocratic AI use NVIDIA inference software on Blackwell GPUs to serve healthcare AI faster and more efficiently, increasing inference throughput by 30% while maintaining a sub-half-second time to first response across 10 million patient calls."
blogs.nvidia.com ↗
Eagle3-v2 speculative decoding boosted per-GPU speeds from 6,000 to 30,000 tokens per second as a software-only update
"Speculative decoding through Eagle3-v2 tripled throughput at 100 tokens per second per user, boosting per-GPU speeds from 6,000 to 30,000 tokens per second, arriving as a framework update rather than a hardware upgrade."
perspectives.nvidia.com ↗
A GPU at 90% utilization generates 2.25x the token revenue of the same GPU at 40% utilization at identical cost
"a GPU operating at 90% utilization generates 2.25 times the token revenue of the same GPU at 40% utilization at identical cost"
perspectives.nvidia.com ↗
Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B in two months — a 5x software-only improvement (SemiAnalysis InferenceX, April 2026)
"NVIDIA Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B within two months, according to SemiAnalysis InferenceX benchmarks as of April 2026—a 5x improvement from software alone."
developer.nvidia.com ↗
NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B — roughly 4x throughput improvement over H200 with TensorRT-LLM
"NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B with the latest TensorRT-LLM stack, according to SemiAnalysis InferenceX benchmarks as of April 2026—representing a roughly 4x throughput improvement over H200 with TensorRT-LLM."
developer.nvidia.com ↗
GB300 NVL72 delivers up to 50x higher throughput per megawatt and 35x lower cost per token vs Hopper for low-latency agentic workloads
"NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper for low-latency agentic workloads, through hardware–software codesign, according to SemiAnalysis InferenceX benchmarks (Q1 2026)"
developer.nvidia.com ↗
TensorRT-LLM on H100 with FP8 achieves 10,000+ output tokens/sec with sub-100ms TTFT; production deployments report 4x throughput vs native PyTorch and 2.72x better TPOT vs vLLM on long sequences
"On H100 GPUs with FP8 precision, the framework achieves over 10,000 output tokens per second at peak throughput with time-to-first-token latencies below 100 milliseconds. Production deployments report up to 4x throughput improvements over native PyTorch inference."
introl.com ↗
TensorRT-LLM setup takes weeks vs hours for vLLM; organizations spending under $50K/month on inference may find vLLM adequate
"Organizations running inference workloads exceeding $50,000 monthly should evaluate TensorRT-LLM seriously, as even modest percentage improvements yield substantial dollar savings. Smaller deployments may find vLLM or similar frameworks provide adequate performance with dramatically lower integration costs."
introl.com ↗

Written and edited by AI agents · Methodology
