Chips Monday, June 29, 2026 at 11:03 AM

MLPerf Training v6.0: NVIDIA Blackwell sweeps, AMD within 5-6% on dense LLM training

The MLPerf Training v6.0 benchmark suite, released by MLCommons on June 16, 2026, shows NVIDIA Blackwell achieving the fastest time-to-train on every workload tested. NVIDIA submitted results on all seven benchmarks and was the only vendor to do so. Its GB300 NVL72 (Blackwell Ultra) system led in both per-accelerator and full-scale performance on the existing dense LLM workloads and on the mixture-of-experts (MoE) models added this round: the 671-billion-parameter DeepSeek-V3 and GPT-OSS-20B. CoreWeave, submitting on its cloud infrastructure, posted the fastest DeepSeek-V3 time: 2.02 minutes on 8,192 GPUs.

AMD's MI355X came within 5% of NVIDIA's B200 on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training, using comparable FP4 precision recipes (MXFP4 for AMD, NVFP4 for NVIDIA). However, AMD did not submit results on the new MoE benchmarks, and every DeepSeek-V3 entry ran on NVIDIA hardware, so the competitive picture for sparse-model training at scale remains incomplete. Microsoft Azure ran the dense Llama 3.1 405B benchmark on 8,192 Blackwell GPUs and completed it in 7.07 minutes, a record-scale training job.
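For readers who want to check a "within N%" claim against published times, here is a minimal Python sketch. It assumes the gap is measured as the challenger's extra time-to-train relative to the reference system, which is one common reading but is not confirmed by the sources. The function name and both time values are hypothetical placeholders, not MLPerf results.

```python
def relative_gap(challenger_minutes: float, reference_minutes: float) -> float:
    """Fraction by which the challenger's time-to-train exceeds the reference's.

    Assumption: "within N%" compares time-to-train, where lower is better.
    """
    if reference_minutes <= 0:
        raise ValueError("reference time must be positive")
    return (challenger_minutes - reference_minutes) / reference_minutes


# Placeholder values for illustration only; these are NOT published results.
reference_minutes = 10.0
challenger_minutes = 10.5

gap = relative_gap(challenger_minutes, reference_minutes)
print(f"Challenger is {gap:.1%} slower than the reference")
```

Substituting the actual times from the MLCommons results table would show whether a given pair of submissions falls inside the quoted 5% or 6% margin under this definition.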

For practitioners, the results matter at two layers: hardware and software. At the hardware level, NVIDIA's sweep, together with its being the only vendor to submit on every test, points to platform maturity for production large-scale training. At the software level, NVIDIA reports that GB300 delivered a 1.3x throughput gain on DeepSeek-V3 versus GB200 over six months, which it attributes purely to software optimization (CUDA graphs, kernel fusions, MoE router improvements) rather than to any hardware change. That suggests enterprises running current NVIDIA GPUs can expect performance gains between hardware-generation cycles. Cloud submissions doubled versus the prior round (v5.1), suggesting a shift toward training-as-a-service rather than on-premises GPU procurement. For chip procurement teams and inference-provider planning, AMD's 5-6% gap on dense models makes it a node-level alternative, but the absence of MoE results leaves its competitiveness uncertain on the sparse-architecture workloads that are becoming the industry standard.
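As a quick aid for reading the 1.3x figure, the sketch below converts a throughput multiplier into the share of wall-clock time saved for a fixed amount of work. It assumes throughput and time are simply inversely proportional, which ignores startup and evaluation overheads. The function name is illustrative.

```python
def time_fraction_saved(speedup: float) -> float:
    """Share of wall-clock time saved on a fixed workload at a given speedup.

    Assumption: time scales as 1 / throughput, with no fixed overheads.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


saved = time_fraction_saved(1.3)
print(f"A 1.3x throughput gain saves about {saved:.1%} of training time")
```

Under that assumption, a 1.3x throughput gain shortens a fixed training job by roughly 23%, not 30%.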

Sources

Primary source
digitalapplied.com
“NVIDIA Blackwell tops every workload it entered, AMD lands within a handful of percent, and cloud submissions double”
developer.nvidia.com
“NVIDIA achieved leading results in MLPerf Training v6.0 by winning every benchmark, setting records in both overall and per-accelerator performance, and uniquely submitting across all new and existing tests”
amd.com
“AMD Instinct MI355X GPUs also demonstrated competitive performance against NVIDIA B200 platforms on two important MLPerf Training 6.0 workloads: Llama 2-70B fine-tuning and Llama 3.1-8B pre-training, coming within 5% on Llama 2-70B fine-tuning and within 6% on Llama 3.1-8B pre-training”
