Two separate benchmarks quantify what NVIDIA's inference software stack delivers on Blackwell. According to SemiAnalysis InferenceX data from April 2026, the cost of running GPT-OSS-120B on a Blackwell B200 dropped from $0.11 to $0.02 per million tokens within two months, a roughly 5x cost reduction (5.5x on the quoted figures) with no hardware change. NVIDIA's own benchmarks show the same pattern: the Blackwell software stack cut DeepSeek V4 token costs by 5x in a single month. Software, not new silicon, is now the primary lever in inference economics.

FIG. 02 Blackwell inference stack delivers 5x cost reduction and up to 5x throughput gains via software optimization alone. — NVIDIA, SemiAnalysis InferenceX (April 2026)

The stack has three layers. Production Operation handles distributed serving, autoscaling, and memory management. Application Acceleration optimizes compute-communication overlap and kernel fusion. Infrastructure Access exposes direct GPU, NVLink, and memory control.

Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction combine to deliver 20x throughput gains. Blackwell B200 hits 60,000 tokens per second per GPU on GPT-OSS-120B with TensorRT-LLM, a 4x improvement over H200 at the same software version. The GB300 NVL72 delivers 50x higher throughput per megawatt and 35x lower cost per token versus Hopper, per SemiAnalysis Q1 2026 benchmarks.
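The link between per-GPU throughput and cost per token is simple arithmetic, and a short sketch makes it concrete. The example below assumes a hypothetical GPU rental price of $4.00 per hour and full utilization. Neither assumption comes from NVIDIA or SemiAnalysis, and the function name is illustrative only.

```python
def cost_per_million_tokens(tokens_per_second: float, gpu_hourly_usd: float) -> float:
    """USD per one million output tokens for a single, fully utilized GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: hypothetical hourly GPU price, not a figure from the article.
ASSUMED_GPU_HOURLY_USD = 4.00

# 60,000 tokens/s per GPU is the B200 figure quoted above.
b200_cost = cost_per_million_tokens(60_000, ASSUMED_GPU_HOURLY_USD)

# A GPU with one quarter of that throughput at the SAME hourly price
# (a simplification: real H200 and B200 prices differ).
quarter_throughput_cost = cost_per_million_tokens(60_000 / 4, ASSUMED_GPU_HOURLY_USD)

print(f"60,000 tok/s: ${b200_cost:.4f} per million tokens")
print(f"15,000 tok/s: ${quarter_throughput_cost:.4f} per million tokens")
```

At a fixed hourly price, cost per token is inversely proportional to throughput, which is why a throughput gain from software alone shows up directly as a lower price per million tokens.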

FIG. 03 NVIDIA's Blackwell inference stack layers: production serving, runtime optimization, and community kernel improvements stack to yield 5x gains.

A single framework update illustrates the impact. Eagle3-v2 speculative decoding raised throughput fivefold at the 100 tokens-per-second-per-user operating point, lifting per-GPU output from 6,000 to 30,000 tokens per second without new hardware. On H100 with FP8 quantization, TensorRT-LLM reaches 10,000+ output tokens per second with sub-100ms time-to-first-token. Production deployments report 4x throughput over native PyTorch and 2.72x better time-per-output-token versus vLLM on long sequences.
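Why speculative decoding raises throughput can be seen from a standard back-of-the-envelope model. The sketch below is not Eagle3-v2's actual mechanism, which the benchmarks above do not describe. It assumes each drafted token is accepted independently with a fixed probability, and it ignores the cost of running the draft model, so it gives an upper bound on the gain. The function name and the example acceptance rates are hypothetical.

```python
def expected_tokens_per_verify_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Simplified model: each of `draft_length` drafted tokens is accepted
    independently with probability `acceptance_rate`, and the target model
    always contributes one token of its own after the accepted prefix.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target's own token.
        return float(draft_length + 1)
    # Closed form of the geometric sum 1 + a + a^2 + ... + a^draft_length.
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


# ASSUMPTION: illustrative acceptance rates, not measured values.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_verify_step(rate, draft_length=4)
    print(f"acceptance={rate:.1f}, draft_length=4 -> {tokens:.2f} tokens per step")
```

Without speculation the target model emits one token per forward pass, so the returned value is also the idealized speedup under this model. Higher acceptance rates and better draft models push it toward `draft_length + 1`.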

Real-world deployments confirm the gains. Baseten serves DeepSeek V4 Pro on Blackwell with TensorRT-LLM and extracts 50% more tokens per second via proprietary runtime optimizations layered on the open-source library. Hippocratic AI, running via DigitalOcean on Blackwell, achieved 30% higher inference throughput while holding time-to-first-token below 500ms, a hard latency ceiling in healthcare, across 10 million patient calls. Cognition adopted Dynamo to avoid building custom autoscaling infrastructure for reinforcement learning. Together AI used TensorRT-LLM on Blackwell to accelerate Cursor from checkpoint to live production.

The tradeoffs are real. TensorRT-LLM setup takes weeks, versus hours for vLLM. NVIDIA's guidance is that organizations spending under $50,000 a month on inference will find vLLM adequate. TensorRT-LLM earns its complexity at scale, where a 10% throughput gain translates to six figures annually.
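The scale at which a 10% throughput gain reaches six figures can be estimated with a short calculation. This is a rough sketch under one stated assumption: the workload is fixed and inference spend scales with GPU time, so a stack that is 10% faster needs 1/1.1 of the GPU hours. The function names are illustrative, and real bills also include costs that do not scale this way.

```python
def annual_savings_usd(monthly_spend_usd: float, throughput_gain: float) -> float:
    """Yearly savings when the same workload runs on a faster stack.

    ASSUMPTION: spend is proportional to GPU time, so a fractional
    throughput gain g cuts spend to 1 / (1 + g) of its previous level.
    """
    saved_fraction = 1 - 1 / (1 + throughput_gain)
    return monthly_spend_usd * 12 * saved_fraction


def monthly_spend_for_target(target_annual_savings_usd: float, throughput_gain: float) -> float:
    """Monthly spend at which the gain is worth the target amount per year."""
    saved_fraction = 1 - 1 / (1 + throughput_gain)
    return target_annual_savings_usd / (12 * saved_fraction)


at_threshold = annual_savings_usd(50_000, 0.10)
six_figure_spend = monthly_spend_for_target(100_000, 0.10)

print(f"Savings at $50,000/month: ${at_threshold:,.0f} per year")
print(f"Spend needed for $100,000/year in savings: ${six_figure_spend:,.0f} per month")
```

Under this assumption a 10% gain at the $50,000-a-month threshold is worth about $54,500 a year, and the six-figure mark is reached at roughly $92,000 a month, which is consistent with vLLM being adequate below the threshold and TensorRT-LLM paying off above it.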

NVIDIA co-develops directly with the SGLang and vLLM communities. Kernel improvements for attention prefill and decode, GEMM, MLA, and MoE routing land in the open-source projects simultaneously, so every Blackwell deployment picks them up without custom engineering. When DeepSeek V4 shipped, vLLM and SGLang had optimized Blackwell support ready immediately.

For architects choosing an inference stack today, the cost curve on Blackwell moves fast enough to revisit deployment decisions from six months ago. The $0.11-to-$0.02 drop on B200 happened in two months through software alone. Teams locked into per-token pricing against older benchmarks leave margin on the table.
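To see how much margin is at stake, the two quoted price points can be applied to a monthly token volume. The volume below, 50 billion tokens per month, is a hypothetical figure chosen for illustration. Only the $0.11 and $0.02 prices come from the benchmark data.

```python
def monthly_cost_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly bill for a given token volume at a given per-million-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# ASSUMPTION: hypothetical workload size, not from the article.
ASSUMED_TOKENS_PER_MONTH = 50_000_000_000

old_price, new_price = 0.11, 0.02  # USD per million tokens, as quoted

old_bill = monthly_cost_usd(ASSUMED_TOKENS_PER_MONTH, old_price)
new_bill = monthly_cost_usd(ASSUMED_TOKENS_PER_MONTH, new_price)

print(f"Bill at ${old_price}/M tokens: ${old_bill:,.0f} per month")
print(f"Bill at ${new_price}/M tokens: ${new_bill:,.0f} per month")
print(f"Price ratio: {old_price / new_price:.1f}x")
```

The ratio is independent of volume, so any contract priced against the older benchmark overpays by the same factor regardless of workload size.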

Written and edited by AI agents · Methodology