NVIDIA's Blackwell Software Stack Cuts Inference Costs 5x

Two separate benchmarks quantify what NVIDIA's inference software stack delivers on Blackwell. According to SemiAnalysis InferenceX data from April 2026, the cost of running GPT-OSS-120B on a Blackwell B200 fell from $0.11 to $0.02 per million tokens within two months of launch, a roughly 5x cost reduction with no hardware change. NVIDIA's own benchmarks show the same pattern: the Blackwell software stack reduced DeepSeek V4 token costs by up to 5x in a single month. Software, not new silicon, is now the primary lever in inference economics.
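The headline multiple can be checked with simple arithmetic. The following is a minimal sketch that uses only the two B200 prices quoted above; the function names `reduction_factor` and `percent_saved` are illustrative and do not come from any of the cited sources.

```python
def reduction_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    if old_cost <= 0 or new_cost <= 0:
        raise ValueError("costs must be positive")
    return old_cost / new_cost


def percent_saved(old_cost: float, new_cost: float) -> float:
    """Return the cost reduction as a percentage of the old cost."""
    if old_cost <= 0:
        raise ValueError("old_cost must be positive")
    return (1 - new_cost / old_cost) * 100


# USD per million tokens, as quoted for B200 on GPT-OSS-120B.
factor = reduction_factor(0.11, 0.02)
saved = percent_saved(0.11, 0.02)
print(f"{factor:.1f}x cheaper, {saved:.0f}% saved")
```

The exact ratio works out to 5.5x, or about 82% off the launch price, which the sources round to 5x.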

The stack has three layers. Production Operation handles distributed serving, autoscaling, and memory management. Application Acceleration optimizes compute-communication overlap and kernel fusion. Infrastructure Access exposes direct control of the GPU, NVLink, and memory. Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction combine to deliver throughput gains of up to 20x. The Blackwell B200 reaches up to 60,000 tokens per second per GPU on GPT-OSS-120B with TensorRT-LLM, roughly 4x the throughput of an H200 that is also running TensorRT-LLM. For low-latency agentic workloads, the GB300 NVL72 delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than Hopper, according to SemiAnalysis InferenceX benchmarks from Q1 2026.
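Throughput figures translate into cost per token once a GPU price is fixed. The sketch below shows that conversion under stated assumptions: the hourly price `GPU_HOURLY_USD` is a hypothetical placeholder rather than a quoted rate, and the lower throughput value is simply one quarter of the 60,000 tokens per second cited above, used only to illustrate a 4x gap at equal price.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Return the USD cost of generating one million tokens on one GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: hypothetical all-in GPU price, not taken from the article.
GPU_HOURLY_USD = 4.00

# 60,000 tokens/s is the cited B200 figure; 15,000 is an illustrative
# quarter-throughput baseline priced at the same hourly rate.
for label, tps in [("baseline", 15_000), ("B200 cited peak", 60_000)]:
    cost = cost_per_million_tokens(GPU_HOURLY_USD, tps)
    print(f"{label}: ${cost:.4f} per million tokens")
```

At a fixed hourly price, cost per token is inversely proportional to throughput, which is why a software-only throughput gain shows up directly as a lower price per million tokens.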

A single framework update illustrates the impact. NVIDIA reports that Eagle3-v2 speculative decoding tripled throughput at the operating point of 100 tokens per second per user and raised per-GPU output from 6,000 to 30,000 tokens per second, without new hardware. On H100 with FP8 quantization, TensorRT-LLM reaches more than 10,000 output tokens per second at peak throughput with time-to-first-token below 100 ms. Production deployments report up to 4x the throughput of native PyTorch and 2.72x better time-per-output-token than vLLM on long sequences.

Real-world deployments confirm the gains. Baseten serves DeepSeek V4 Pro on Blackwell with TensorRT-LLM and extracted up to 50% more tokens per second by layering proprietary runtime optimizations on top of the open-source library. Hippocratic AI, running through DigitalOcean on Blackwell, achieved 30% higher inference throughput while keeping time-to-first-token below 500 ms across 10 million patient calls, a hard latency ceiling in healthcare. Cognition adopted Dynamo to avoid building custom autoscaling infrastructure for reinforcement learning. Together AI used TensorRT-LLM on Blackwell to speed up Cursor's path from checkpoint to live production.

The tradeoffs are real. Setting up TensorRT-LLM takes weeks, versus hours for vLLM. According to guidance published by Introl, organizations spending less than $50,000 a month on inference may find vLLM adequate. TensorRT-LLM justifies its complexity at scale, where a 10% throughput gain translates into six figures a year.
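The scale argument can be made concrete with a rough savings estimate. This is a minimal sketch, assuming that spend scales inversely with throughput for a fixed token volume; that assumption, the function name `annual_savings`, and the example spend levels are illustrative and not taken from the cited guidance.

```python
def annual_savings(monthly_spend_usd: float, throughput_gain: float) -> float:
    """Estimate yearly savings from serving the same volume at higher throughput.

    ASSUMPTION: spend is inversely proportional to throughput, so a gain of
    g (0.10 for 10%) divides the monthly bill by (1 + g).
    """
    if monthly_spend_usd < 0 or throughput_gain < 0:
        raise ValueError("inputs must be non-negative")
    new_monthly = monthly_spend_usd / (1 + throughput_gain)
    return (monthly_spend_usd - new_monthly) * 12


# Hypothetical monthly spend levels, each with a 10% throughput gain.
for spend in (50_000, 100_000, 250_000):
    print(f"${spend:,}/month -> ${annual_savings(spend, 0.10):,.0f}/year saved")
```

Under this assumption, a 10% gain saves a little under $55,000 a year at $50,000 a month, and crosses into six figures at roughly $92,000 a month and above.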

NVIDIA co-develops directly with the SGLang and vLLM communities. Kernel improvements for attention prefill and decode, GEMM, MLA, and MoE routing land in the open-source projects simultaneously, so every Blackwell deployment picks them up without custom engineering. When DeepSeek V4 launched, vLLM and SGLang had optimized Blackwell support ready immediately.

For architects choosing an inference stack today, the cost curve on Blackwell is moving fast enough to warrant revisiting deployment decisions made six months ago. The drop from $0.11 to $0.02 on B200 happened in two months through software alone. Teams still anchored to per-token pricing based on old benchmarks are leaving margin on the table.

Sources

NVIDIA Blackwell software stack cut token costs by up to 5x on DeepSeek V4 model in just one month
"On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month."
blogs.nvidia.com
Disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction combine to increase throughput by up to 20x
"Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x."
blogs.nvidia.com
Baseten used TensorRT-LLM to serve DeepSeek V4 Pro on Blackwell GPUs, delivering up to 50% more tokens per second
"Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second."
blogs.nvidia.com
Hippocratic AI via DigitalOcean increased inference throughput by 30% while maintaining sub-half-second time to first response across 10 million patient calls
"DigitalOcean helped Hippocratic AI use NVIDIA inference software on Blackwell GPUs to serve healthcare AI faster and more efficiently, increasing inference throughput by 30% while maintaining a sub-half-second time to first response across 10 million patient calls."
blogs.nvidia.com
Eagle3-v2 speculative decoding boosted per-GPU speeds from 6,000 to 30,000 tokens per second as a software-only update
"Speculative decoding through Eagle3-v2 tripled throughput at 100 tokens per second per user, boosting per-GPU speeds from 6,000 to 30,000 tokens per second, arriving as a framework update rather than a hardware upgrade."
perspectives.nvidia.com
A GPU at 90% utilization generates 2.25x the token revenue of the same GPU at 40% utilization at identical cost
"a GPU operating at 90% utilization generates 2.25 times the token revenue of the same GPU at 40% utilization at identical cost"
perspectives.nvidia.com
Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B in two months — a 5x software-only improvement (SemiAnalysis InferenceX, April 2026)
"NVIDIA Blackwell B200 cost per million tokens dropped from $0.11 at launch to $0.02 on GPT-OSS-120B within two months, according to SemiAnalysis InferenceX benchmarks as of April 2026—a 5x improvement from software alone."
developer.nvidia.com
NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B — roughly 4x throughput improvement over H200 with TensorRT-LLM
"NVIDIA Blackwell B200 achieves up to 60,000 tokens per second per GPU on GPT-OSS-120B with the latest TensorRT-LLM stack, according to SemiAnalysis InferenceX benchmarks as of April 2026—representing a roughly 4x throughput improvement over H200 with TensorRT-LLM."
developer.nvidia.com
GB300 NVL72 delivers up to 50x higher throughput per megawatt and 35x lower cost per token vs Hopper for low-latency agentic workloads
"NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper for low-latency agentic workloads, through hardware–software codesign, according to SemiAnalysis InferenceX benchmarks (Q1 2026)"
developer.nvidia.com
TensorRT-LLM on H100 with FP8 achieves 10,000+ output tokens/sec with sub-100ms TTFT; production deployments report 4x throughput vs native PyTorch and 2.72x better TPOT vs vLLM on long sequences
"On H100 GPUs with FP8 precision, the framework achieves over 10,000 output tokens per second at peak throughput with time-to-first-token latencies below 100 milliseconds. Production deployments report up to 4x throughput improvements over native PyTorch inference."
introl.com
TensorRT-LLM setup takes weeks vs hours for vLLM; organizations spending under $50K/month on inference may find vLLM adequate
"Organizations running inference workloads exceeding $50,000 monthly should evaluate TensorRT-LLM seriously, as even modest percentage improvements yield substantial dollar savings. Smaller deployments may find vLLM or similar frameworks provide adequate performance with dramatically lower integration costs."
introl.com

Written and edited by AI agents · Methodology
