AMD has launched the Instinct MI350P, a PCIe-slot AI accelerator carrying 144GB of HBM3E memory and 4 TB/s of memory bandwidth. By AMD's own figures, the card delivers 43% higher FP16 and 39% higher FP8 peak theoretical compute than NVIDIA's H200 NVL, making it the fastest enterprise AI accelerator that fits in a standard PCIe slot.
The MI350P is built on AMD's CDNA4 architecture using TSMC's 3nm and 6nm FinFET processes. The die packs 8,192 shader cores across 128 compute units, 512 Matrix Cores, and a 2.2 GHz peak clock. Peak theoretical throughput lands at 2.3 PFLOPS FP16 and 4.6 PFLOPS FP8, with a 128MB last-level cache backing the HBM3E stack. The card occupies a 10.5-inch dual-slot form factor with a fanless cooler, relying on chassis airflow in rack-mounted servers. TDP sits at 600W but can be dialed down to 450W for thermally constrained enclosures, a practical nod to operators running mixed workloads in older racks.
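Those capacity and bandwidth figures set a hard ceiling on single-stream decode throughput, since autoregressive inference must stream the full weight set for every generated token. The back-of-envelope sketch below illustrates that ceiling; the 70B-parameter model and FP8 quantization are illustrative assumptions, not AMD-published numbers:

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound LLM.
# Assumptions (illustrative, not from AMD): 70B-parameter model,
# FP8 weights at 1 byte each, batch size 1, KV-cache traffic ignored.

HBM_BANDWIDTH_TBS = 4.0   # MI350P spec: 4 TB/s HBM3E
HBM_CAPACITY_GB = 144     # MI350P spec: 144 GB

params_billion = 70
bytes_per_weight = 1      # FP8
weights_gb = params_billion * bytes_per_weight  # ~70 GB, fits in 144 GB

# In the bandwidth-bound regime, each decoded token reads every weight once.
tokens_per_sec = (HBM_BANDWIDTH_TBS * 1000) / weights_gb

print(f"Weights: {weights_gb} GB of {HBM_CAPACITY_GB} GB HBM3E")
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_sec:.0f} tokens/s at batch 1")
# -> ~57 tokens/s; real-world throughput lands below this ceiling.
```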
Against the H200 NVL, all figures peak theoretical: 20% better FP64, 43% better FP16, and 39% better FP8. AMD also touts native support for the lower-precision MXFP6 and MXFP4 formats, which reach 18.45 PFLOPS at FP6 on the full MI350X. The MI350P's specs are exactly half those of the OAM-module MI350X: AMD claims 2,299 TFLOPS in standard precision and a 4,600 TFLOPS peak using MXFP4.
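Taking AMD's percentages at face value, the implied H200 NVL baselines fall out of simple arithmetic on the figures quoted above; this is a sanity check on the claims, not measured benchmark data:

```python
# Derive the implied H200 NVL peak figures from AMD's claimed leads.
# Pure arithmetic on the numbers quoted in this article, not benchmarks.

mi350p_tflops = {"FP16": 2300, "FP8": 4600}   # MI350P peak theoretical
claimed_lead = {"FP16": 0.43, "FP8": 0.39}    # AMD's claimed advantage

for fmt, tflops in mi350p_tflops.items():
    implied_h200 = tflops / (1 + claimed_lead[fmt])
    print(f"{fmt}: MI350P {tflops} TFLOPS -> implied H200 NVL ~{implied_h200:.0f} TFLOPS")

# FP16: ~1608 TFLOPS implied; FP8: ~3309 TFLOPS implied.
```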
For enterprise architects, the MI350P's PCIe form factor is the operative detail. The card slots into existing air-cooled servers without custom racks, liquid cooling contracts, or NVLink switch fabric. Up to eight cards fit in a single system, letting data centers scale inference capacity incrementally rather than committing to an eight-GPU fabric purchase at once. AMD is positioning the card for inference and retrieval-augmented generation pipelines—workloads where token-per-second-per-watt economics dominate procurement decisions.
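To make that framing concrete, here is a minimal perf-per-watt sketch. The throughput figures are hypothetical placeholders; only the 600W and 450W TDP settings come from the spec sheet above:

```python
# Tokens-per-second-per-watt comparison for inference procurement.
# Throughput numbers are hypothetical placeholders; only the TDP
# settings (600 W standard, 450 W reduced) come from AMD's spec.

def tokens_per_watt(tokens_per_sec: float, watts: float) -> float:
    """Serving efficiency: sustained tokens/s divided by board power."""
    return tokens_per_sec / watts

configs = [
    ("MI350P @ 600 W", 5000.0, 600.0),  # hypothetical fleet throughput
    ("MI350P @ 450 W", 4200.0, 450.0),  # hypothetical: lower clocks, better perf/W
]

for name, tps, watts in configs:
    print(f"{name}: {tokens_per_watt(tps, watts):.2f} tokens/s/W")

# If the 450 W limit costs 16% throughput but saves 25% power, perf/W
# improves (~9.33 vs ~8.33 tokens/s/W here), which is why the
# configurable TDP matters for dense air-cooled deployments.
```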
The competitive window is real but bounded. NVIDIA has not announced a PCIe version of its Blackwell B200 with HBM memory, leaving the H200 NVL as its PCIe flagship. If a B200 PCIe card surfaces, AMD's throughput lead narrows or disappears. For procurement teams evaluating 2025–2026 inference infrastructure, the MI350P offers a concrete alternative to NVIDIA-only sourcing, with the caveat that the competitive landscape for next-generation PCIe accelerators remains unsettled.
The persistent friction point is software. NVIDIA's CUDA ecosystem retains overwhelming adoption among inference serving frameworks, fine-tuning toolchains, and model developers. AMD has acknowledged the gap and stated it is actively improving ROCm—but ROCm's compatibility coverage, operator support, and out-of-box performance parity with CUDA remain incomplete across major workloads. Enterprises evaluating the MI350P must budget for integration and validation cycles that NVIDIA deployments typically skip.
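One concrete, low-cost first step in that validation cycle: ROCm builds of PyTorch expose the familiar torch.cuda API over HIP, so a smoke test can confirm the stack sees the card before deeper workload qualification begins. A minimal sketch, assuming a ROCm build of PyTorch; it is not an AMD-published test:

```python
# Smoke test for an AMD GPU under a ROCm build of PyTorch.
# ROCm builds reuse the torch.cuda namespace via HIP, so CUDA-style
# calls work; torch.version.hip is set only on ROCm builds.
import torch

def check_rocm_device() -> None:
    if torch.version.hip is None:
        print("Not a ROCm build of PyTorch; install the ROCm wheel.")
        return
    if not torch.cuda.is_available():
        print("ROCm build present, but no supported GPU was detected.")
        return
    name = torch.cuda.get_device_name(0)
    print(f"Detected device: {name} (ROCm {torch.version.hip})")
    # Tiny matmul to confirm kernels actually launch on the card.
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    y = (x @ x).sum()
    print(f"FP16 matmul ran; checksum {y.item():.2f}")

if __name__ == "__main__":
    check_rocm_device()
```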
Pricing has not been disclosed, and AMD did not announce general availability timing at launch. The MI350P's window closes if NVIDIA answers with a Blackwell PCIe card before AMD captures meaningful deployment share, but for now, on paper and in the slot, AMD holds the PCIe inference lead.