AMD has launched the Instinct MI350P, a PCIe-slot AI accelerator carrying 144GB of HBM3E memory and 4 TB/s of memory bandwidth. By AMD's own figures, the card delivers 43% higher FP16 and 39% higher FP8 peak theoretical compute than NVIDIA's H200 NVL, making it the fastest enterprise AI accelerator that fits in a standard PCIe slot.
The MI350P is built on AMD's CDNA4 architecture using TSMC's 3nm and 6nm FinFET processes. The die packs 8,192 shader cores across 128 compute units, 512 Matrix Cores, and a 2.2 GHz peak clock. Peak theoretical throughput lands at 2.3 PFLOPS FP16 and 4.6 PFLOPS FP8, with a 128MB last-level cache backing the HBM3E stack. The card occupies a 10.5-inch dual-slot form factor with a fanless cooler, relying on chassis airflow in rack-mounted servers. TDP sits at 600W but can be dialed down to 450W for thermally constrained enclosures, a practical nod to operators running mixed workloads in older racks.
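Those capacity and bandwidth figures set a hard ceiling on single-stream decode throughput, since autoregressive inference must stream the full weight set for every generated token. The back-of-envelope sketch below illustrates that ceiling; the 70B-parameter model and FP8 quantization are illustrative assumptions, not AMD-published numbers:

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound LLM.
# Assumptions (illustrative, not from AMD): 70B-parameter model,
# FP8 weights at 1 byte each, batch size 1, KV-cache traffic ignored.

HBM_BANDWIDTH_TBS = 4.0   # MI350P spec: 4 TB/s HBM3E
HBM_CAPACITY_GB = 144     # MI350P spec: 144 GB

params_billion = 70
bytes_per_weight = 1      # FP8
weights_gb = params_billion * bytes_per_weight  # ~70 GB, fits in 144 GB

# In the bandwidth-bound regime, each decoded token reads every weight once.
tokens_per_sec = (HBM_BANDWIDTH_TBS * 1000) / weights_gb

print(f"Weights: {weights_gb} GB of {HBM_CAPACITY_GB} GB HBM3E")
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_sec:.0f} tokens/s at batch 1")
# -> ~57 tokens/s; real-world throughput lands below this ceiling.
```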
Against the H200 NVL, all figures peak theoretical: 20% better FP64, 43% better FP16, and 39% better FP8. AMD also touts native support for the lower-precision MXFP6 and MXFP4 formats, which reach 18.45 PFLOPS at FP6 on the full MI350X. The MI350P's specs are exactly half those of the OAM-module MI350X: AMD claims 2,299 TFLOPS in standard precision and a 4,600 TFLOPS peak using MXFP4.
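Taking AMD's percentages at face value, the implied H200 NVL baselines fall out of simple arithmetic on the figures quoted above; this is a sanity check on the claims, not measured benchmark data:

```python
# Derive the implied H200 NVL peak figures from AMD's claimed leads.
# Pure arithmetic on the numbers quoted in this article, not benchmarks.

mi350p_tflops = {"FP16": 2300, "FP8": 4600}   # MI350P peak theoretical
claimed_lead = {"FP16": 0.43, "FP8": 0.39}    # AMD's claimed advantage

for fmt, tflops in mi350p_tflops.items():
    implied_h200 = tflops / (1 + claimed_lead[fmt])
    print(f"{fmt}: MI350P {tflops} TFLOPS -> implied H200 NVL ~{implied_h200:.0f} TFLOPS")

# FP16: ~1608 TFLOPS implied; FP8: ~3309 TFLOPS implied.
```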
For enterprise architects, the MI350P's PCIe form factor is the operative detail. The card slots into existing air-cooled servers without custom racks, liquid cooling contracts, or NVLink switch fabric. Up to eight cards fit in a single system, letting data centers scale inference capacity incrementally rather than committing to an eight-GPU fabric purchase at once. AMD is positioning the card for inference and retrieval-augmented generation pipelines—workloads where token-per-second-per-watt economics dominate procurement decisions.
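To make that framing concrete, here is a minimal perf-per-watt sketch. The throughput figures are hypothetical placeholders; only the 600W and 450W TDP settings come from the spec sheet above:

```python
# Tokens-per-second-per-watt comparison for inference procurement.
# Throughput numbers are hypothetical placeholders; only the TDP
# settings (600 W standard, 450 W reduced) come from AMD's spec.

def tokens_per_watt(tokens_per_sec: float, watts: float) -> float:
    """Serving efficiency: sustained tokens/s divided by board power."""
    return tokens_per_sec / watts

configs = [
    ("MI350P @ 600 W", 5000.0, 600.0),  # hypothetical fleet throughput
    ("MI350P @ 450 W", 4200.0, 450.0),  # hypothetical: lower clocks, better perf/W
]

for name, tps, watts in configs:
    print(f"{name}: {tokens_per_watt(tps, watts):.2f} tokens/s/W")

# If the 450 W limit costs 16% throughput but saves 25% power, perf/W
# improves (~9.33 vs ~8.33 tokens/s/W here), which is why the
# configurable TDP matters for dense air-cooled deployments.
```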
The competitive window is real but bounded. NVIDIA has not announced a PCIe version of its Blackwell B200 with HBM memory, leaving the H200 NVL as its PCIe flagship. If a B200 PCIe card surfaces, AMD's throughput lead narrows or disappears. For procurement teams evaluating 2025–2026 inference infrastructure, the MI350P offers a concrete alternative to NVIDIA-only sourcing, with the caveat that the competitive landscape for next-generation PCIe accelerators remains unsettled.
The persistent friction point is software. NVIDIA's CUDA ecosystem retains overwhelming adoption among inference serving frameworks, fine-tuning toolchains, and model developers. AMD has acknowledged the gap and stated it is actively improving ROCm—but ROCm's compatibility coverage, operator support, and out-of-box performance parity with CUDA remain incomplete across major workloads. Enterprises evaluating the MI350P must budget for integration and validation cycles that NVIDIA deployments typically skip.
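One concrete, low-cost first step in that validation cycle: ROCm builds of PyTorch expose the familiar torch.cuda API over HIP, so a smoke test can confirm the stack sees the card before deeper workload qualification begins. A minimal sketch, assuming a ROCm build of PyTorch; it is not an AMD-published test:

```python
# Smoke test for an AMD GPU under a ROCm build of PyTorch.
# ROCm builds reuse the torch.cuda namespace via HIP, so CUDA-style
# calls work; torch.version.hip is set only on ROCm builds.
import torch

def check_rocm_device() -> None:
    if torch.version.hip is None:
        print("Not a ROCm build of PyTorch; install the ROCm wheel.")
        return
    if not torch.cuda.is_available():
        print("ROCm build present, but no supported GPU was detected.")
        return
    name = torch.cuda.get_device_name(0)
    print(f"Detected device: {name} (ROCm {torch.version.hip})")
    # Tiny matmul to confirm kernels actually launch on the card.
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    y = (x @ x).sum()
    print(f"FP16 matmul ran; checksum {y.item():.2f}")

if __name__ == "__main__":
    check_rocm_device()
```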
Pricing has not been disclosed, and AMD did not announce general availability timing at launch. The MI350P's window closes if NVIDIA answers with a Blackwell PCIe card before AMD captures meaningful deployment share, but for now, on paper and in the slot, AMD holds the PCIe inference lead.