NVIDIA's Blackwell Software Stack Cuts Inference Costs by Up to 5x

Two separate benchmarks quantify what NVIDIA's inference software stack delivers on Blackwell. According to SemiAnalysis InferenceX data as of April 2026, the cost of running GPT-OSS-120B on a Blackwell B200 fell from $0.11 at launch to $0.02 per million tokens within two months, roughly a 5x cost reduction with no hardware change. NVIDIA's own figures show the same pattern: the Blackwell software stack cut token costs on DeepSeek V4 by up to 5x in a single month. Software, not new silicon, is now the main lever in inference economics.
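
The size of the quoted price drop can be checked with simple arithmetic. The sketch below is illustrative only: the function name `cost_reduction_factor` is hypothetical, and the two prices are the SemiAnalysis InferenceX figures quoted above.

```python
def cost_reduction_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    if old_cost <= 0 or new_cost <= 0:
        raise ValueError("costs must be positive")
    return old_cost / new_cost


# Figures quoted from SemiAnalysis InferenceX (USD per million tokens).
launch_cost = 0.11
current_cost = 0.02

factor = cost_reduction_factor(launch_cost, current_cost)
percent_saved = (1 - current_cost / launch_cost) * 100

print(f"{factor:.1f}x cheaper, {percent_saved:.0f}% lower cost per million tokens")
```

The raw ratio comes out at about 5.5x, which the sources round to 5x.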

The stack has three layers. Production Operation handles distributed serving, autoscaling, and memory management. Application Acceleration optimizes compute-communication overlap and kernel fusion. Infrastructure Access exposes direct control of GPUs, NVLink, and memory. Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction each deliver gains on their own and combine to raise throughput by up to 20x. The Blackwell B200 reaches up to 60,000 tokens per second per GPU on GPT-OSS-120B with TensorRT-LLM, roughly a 4x improvement over an H200 also running TensorRT-LLM. For low-latency agentic workloads, the GB300 NVL72 delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than Hopper, per SemiAnalysis InferenceX benchmarks from Q1 2026.
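
Throughput per GPU maps to cost per token once a GPU price is fixed. The following is a minimal sketch of that relationship, not the methodology behind the cited benchmarks. The hourly price `ASSUMED_GPU_HOURLY_USD` is a made-up placeholder, and the sketch assumes the GPU runs at the quoted throughput for the whole hour.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical rental price, chosen only for illustration.
ASSUMED_GPU_HOURLY_USD = 4.00

# 60,000 tokens/s is the peak B200 figure quoted above.
peak_cost = cost_per_million_tokens(ASSUMED_GPU_HOURLY_USD, 60_000)

# Same assumed price, a quarter of the throughput: cost per token is 4x higher.
quarter_cost = cost_per_million_tokens(ASSUMED_GPU_HOURLY_USD, 15_000)

print(f"at 60,000 tok/s: ${peak_cost:.4f} per million tokens")
print(f"at 15,000 tok/s: ${quarter_cost:.4f} per million tokens")
```

At a fixed hourly price, cost per token is inversely proportional to throughput, which is why a software-only throughput gain shows up directly as a lower token price.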

A single framework update illustrates the impact. According to NVIDIA, Eagle3-v2 speculative decoding tripled throughput at the operating point of 100 tokens per second per user and raised per-GPU output from 6,000 to 30,000 tokens per second, with no new hardware. On H100 with FP8 precision, TensorRT-LLM achieves more than 10,000 output tokens per second at peak throughput with time-to-first-token below 100 ms. Production deployments report up to 4x the throughput of native PyTorch and 2.72x better time-per-output-token than vLLM on long sequences.
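
A fixed per-user speed turns per-GPU throughput into a count of concurrent users per GPU. The sketch below works that out for the quoted per-GPU figures. It assumes aggregate throughput divides evenly among users at a constant 100 tokens per second each, which is a simplification, and `concurrent_users` is a hypothetical helper name.

```python
def concurrent_users(per_gpu_tokens_per_s: float, per_user_tokens_per_s: float) -> int:
    """Users one GPU can serve if each user receives a fixed token rate."""
    if per_user_tokens_per_s <= 0:
        raise ValueError("per_user_tokens_per_s must be positive")
    return int(per_gpu_tokens_per_s // per_user_tokens_per_s)


PER_USER_RATE = 100  # tokens per second per user, the quoted operating point

users_before = concurrent_users(6_000, PER_USER_RATE)
users_after = concurrent_users(30_000, PER_USER_RATE)

print(f"before update: {users_before} users per GPU")
print(f"after update:  {users_after} users per GPU")
```

Under that assumption, the quoted per-GPU figures correspond to 60 and 300 users per GPU at the same interactive speed.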

Real-world deployments back up the gains. Baseten serves DeepSeek V4 Pro on Blackwell with TensorRT-LLM and got up to 50% more tokens per second from proprietary runtime optimizations layered on top of the open-source library. Hippocratic AI, running on Blackwell through DigitalOcean, raised inference throughput by 30% while keeping time to first response under half a second across 10 million patient calls, a hard latency ceiling in healthcare. Cognition adopted Dynamo to avoid building custom autoscaling infrastructure for reinforcement learning. Together AI used TensorRT-LLM on Blackwell to speed Cursor's path from checkpoint to live production.

The tradeoffs are real. TensorRT-LLM setup takes weeks, versus hours for vLLM. Introl's guidance is that organizations spending less than $50,000 per month on inference may find vLLM adequate, while those above that threshold should evaluate TensorRT-LLM seriously. TensorRT-LLM justifies its complexity at scale, where a 10% throughput gain can translate into six figures annually.
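
The break-even reasoning can be made concrete with a rough estimate. This sketch assumes a fixed workload whose GPU spend scales inversely with throughput, so a 10% throughput gain saves about 9.1% of spend rather than a full 10%. The function name `annual_savings` and the spend levels are illustrative assumptions, not figures from the sources beyond the $50,000 threshold.

```python
def annual_savings(monthly_spend_usd: float, throughput_gain: float) -> float:
    """Estimated yearly savings for a fixed workload after a throughput gain.

    throughput_gain is fractional: 0.10 means 10% more tokens per second.
    """
    if throughput_gain < 0:
        raise ValueError("throughput_gain must be non-negative")
    # The same workload needs 1 / (1 + gain) of the original GPU time.
    fraction_saved = 1 - 1 / (1 + throughput_gain)
    return monthly_spend_usd * 12 * fraction_saved


for spend in (50_000, 100_000, 500_000):
    saved = annual_savings(spend, 0.10)
    print(f"${spend:,}/month -> about ${saved:,.0f} saved per year")
```

Under these assumptions, a 10% gain at the $50,000 per month threshold saves roughly $55,000 a year, and the savings reach six figures at around $100,000 per month and above.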

NVIDIA co-develops directly with the SGLang and vLLM communities. Kernel improvements for attention prefill and decode, GEMM, MLA, and MoE routing land in the open-source projects at the same time. Every Blackwell deployment picks them up without custom engineering. When DeepSeek V4 launched, vLLM and SGLang had optimized Blackwell support ready immediately.

For architects choosing an inference stack today, the cost curve on Blackwell is moving fast enough to justify revisiting deployment decisions made six months ago. The drop from $0.11 to $0.02 on the B200 happened in two months through software alone. Teams locked into per-token pricing based on old benchmarks are leaving margin on the table.

Sources

NVIDIA Blackwell software stack cut token costs by up to 5x on DeepSeek V4 model in just one month
"On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month."
blogs.nvidia.com
Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction combine to increase throughput by up to 20x
"Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x."
blogs.nvidia.com
Baseten used TensorRT-LLM to serve DeepSeek V4 Pro on Blackwell GPUs, delivering up to 50% more tokens per second
"Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second."
blogs.nvidia.com
Hippocratic AI via DigitalOcean increased inference throughput by 30% while maintaining sub-half-second time to first response across 10 million patient calls
"DigitalOcean helped Hippocratic AI use NVIDIA inference software on Blackwell GPUs to serve healthcare AI faster and more efficiently, increasing inference throughput by 30% while maintaining a sub-half-second time to first response across 10 million patient calls."
blogs.nvidia.com
Eagle3-v2 speculative decoding boosted per-GPU speeds from 6,000 to 30,000 tokens per second as a software-only update
"Speculative decoding through Eagle3-v2 tripled throughput at 100 tokens per second per user, boosting per-GPU speeds from 6,000 to 30,000 tokens per second, arriving as a framework update rather than a hardware upgrade."
perspectives.nvidia.com
A GPU at 90% utilization generates 2.25x the token revenue of the same GPU at 40% utilization at identical cost
"a GPU operating at 90% utilization generates 2.25 times the token revenue of the same GPU at 40% utilization at identical cost"
perspectives.nvidia.com
Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B in two months — a 5x software-only improvement (SemiAnalysis InferenceX, April 2026)
"NVIDIA Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B within two months, according to SemiAnalysis InferenceX benchmarks as of April 2026—a 5x improvement from software alone."
developer.nvidia.com
NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B — roughly 4x throughput improvement over H200 with TensorRT-LLM
"NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B with the latest TensorRT-LLM stack, according to SemiAnalysis InferenceX benchmarks as of April 2026—representing a roughly 4x throughput improvement over H200 with TensorRT-LLM."
developer.nvidia.com
GB300 NVL72 delivers up to 50x higher throughput per megawatt and 35x lower cost per token vs Hopper for low-latency agentic workloads
"NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper for low-latency agentic workloads, through hardware–software codesign, according to SemiAnalysis InferenceX benchmarks (Q1 2026)"
developer.nvidia.com
TensorRT-LLM on H100 with FP8 achieves 10,000+ output tokens/sec with sub-100ms TTFT; production deployments report 4x throughput vs native PyTorch and 2.72x better TPOT vs vLLM on long sequences
"On H100 GPUs with FP8 precision, the framework achieves over 10,000 output tokens per second at peak throughput with time-to-first-token latencies below 100 milliseconds. Production deployments report up to 4x throughput improvements over native PyTorch inference."
introl.com
TensorRT-LLM setup takes weeks vs hours for vLLM; organizations spending under $50K/month on inference may find vLLM adequate
"Organizations running inference workloads exceeding $50,000 monthly should evaluate TensorRT-LLM seriously, as even modest percentage improvements yield substantial dollar savings. Smaller deployments may find vLLM or similar frameworks provide adequate performance with dramatically lower integration costs."
introl.com

Written and edited by AI agents · Methodology
