Microsoft AKS on Bare Metal Raises InfiniBand Message Rate 12–18% at Build 2026

Microsoft shipped four AKS updates at Build 2026: AKS on Bare Metal (public preview), Azure Kubernetes Fleet Manager for Arc-enabled clusters (GA), Anyscale on Azure managed Ray (public preview), and ModelServingRuntime for Kubernetes-native inference. Each targets a different layer of GPU infrastructure cost: hypervisor overhead, multi-cluster operations, Ray management, and serving-framework integration.

AKS on Bare Metal removes the hypervisor, giving workloads direct access to NVLink and RDMA for distributed training and low-latency inference. Microsoft's own benchmarks, shared in a Build session, showed a 12–18% improvement in InfiniBand message rate and a measurable drop in tail latency for NCCL all-reduce on bare-metal A100 nodes versus the same GPUs on Azure dedicated hosts. The control plane manages both physical and virtual nodes, enabling hybrid deployments. At launch the option is limited to specific Dell and HPE server models validated through the Azure Stack HCI hardware list; a broader certification program is promised by the end of the calendar year. There is no additional per-cluster fee for bare-metal provisioning.

FIG. 02 AKS on Bare Metal: 12–18% InfiniBand message-rate gain offset by reduced scheduling flexibility and longer hardware replacement cycles. — Microsoft Build 2026 / windowsnews.ai

Fleet Manager GA for Arc-enabled clusters extends centralized policy, workload placement, staged rollouts, and RBAC across Azure, on-premises, and other clouds. For teams split across regions or using on-premises clusters for data residency, this consolidation matters more than any single feature.

Anyscale on Azure brings managed Ray to AKS, so teams do not have to operate Ray clusters themselves. The service handles heterogeneous and fractional GPU allocation and scales per job. It runs in the customer's subscription, integrates with Entra ID, and bills per vCPU-second, with a free tier of 200 vCPU-hours per month during preview. Wayve uses AKS, Ray, and increasingly Anyscale on Azure to connect thousands of GPUs for its autonomous-driving work. CEO Alex Kendall described taking a new Nissan vehicle into Japan, "a country where we had never driven," and within four months demonstrating autonomous driving throughout Tokyo. He tied the milestone directly to Azure's elastic GPU capacity.
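To make the preview billing terms concrete, here is a minimal sketch of how per-vCPU-second metering interacts with the 200 vCPU-hour monthly free tier. Only the free-tier size and the per-vCPU-second unit come from the announcement. The function names, the job list, and the rate argument are hypothetical, and the sketch assumes the free tier is simply subtracted from total monthly usage, which Microsoft has not spelled out.

```python
FREE_TIER_VCPU_HOURS = 200  # monthly preview free tier, per the announcement
SECONDS_PER_HOUR = 3600


def billable_vcpu_seconds(jobs, free_hours=FREE_TIER_VCPU_HOURS):
    """Return the vCPU-seconds left to bill after the monthly free tier.

    jobs: iterable of (vcpus, runtime_seconds) pairs for one calendar month.
    Assumption: the free tier is deducted from total usage, never below zero.
    """
    used = sum(vcpus * seconds for vcpus, seconds in jobs)
    free = free_hours * SECONDS_PER_HOUR
    return max(0, used - free)


def estimated_charge(jobs, rate_per_vcpu_second):
    """Multiply billable usage by a caller-supplied rate (hypothetical value)."""
    return billable_vcpu_seconds(jobs) * rate_per_vcpu_second


# Hypothetical month: a 64-vCPU job for 4 hours and a 16-vCPU job for 30 minutes.
month = [(64, 4 * 3600), (16, 30 * 60)]
print(billable_vcpu_seconds(month))
```

The free tier works out to 720,000 vCPU-seconds, so the hypothetical month above uses 950,400 vCPU-seconds and leaves 230,400 billable. A single 64-vCPU job exhausts the free tier in just over three hours.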

ModelServingRuntime exposes vLLM, KServe, and similar runtimes as native Kubernetes objects instead of separate stacks. A ModelServingRuntime workload gets automatic HTTPS, Entra ID auth, OpenTelemetry traces, and a sidecar for versioning, canary routing, and queuing. KAITO provisions resources and launches optimized runtimes under AI Runway, integrating with KEDA for autoscaling and Gateway API for traffic management. Teams move from model selection to production endpoint without writing Kubernetes serving boilerplate. Royal Bank of Canada runs KAITO for production model serving, with model images hosted in its private container registry. Its development teams provision GPU resources and deploy through their CI/CD pipeline, while private endpoints, Entra ID, Key Vault, and a private ACR keep models and data inside the bank's Azure boundary.

Bare metal trades away scheduling flexibility: failed hardware takes longer to replace, and the Dell/HPE list limits placement options at launch. The Fleet Manager GA covers Arc-enabled clusters only; Azure-only multi-cluster work uses separate mechanisms. Anyscale on Azure remains in preview, and its pricing at production scale is unvalidated.

For architects: if p99 latency is bottlenecked by hypervisor overhead or NVLink utilization is low on virtualized nodes, bare metal offers a path that keeps the Kubernetes control plane. ModelServingRuntime/KAITO reduces operational surface but adds an indirection layer. Before replacing hand-rolled serving stacks, validate that vLLM version pinning and custom runtime configs remain reachable.
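One way to check whether tail latency is really the problem is to compare p99 against the median on existing virtualized nodes, then against a bare-metal trial. The sketch below uses the nearest-rank percentile method, chosen here for simplicity. It is not the method behind Microsoft's benchmarks, and the helper names and sample values are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample with at least pct% of samples at or below it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tail_ratio(samples):
    """p99 divided by p50; a high ratio points to tail effects, not a slow median."""
    return percentile(samples, 99) / percentile(samples, 50)


def relative_change(baseline, candidate, pct=99):
    """Fractional change in a percentile from baseline to candidate (negative = faster)."""
    base = percentile(baseline, pct)
    return (percentile(candidate, pct) - base) / base


# Hypothetical latency samples in milliseconds, for illustration only.
virtualized = [10] * 98 + [40, 50]
bare_metal = [10] * 98 + [20, 30]
print(tail_ratio(virtualized), relative_change(virtualized, bare_metal))
```

If the median is already acceptable and only the ratio is high, the tail is the part worth investigating. Real measurements also need enough samples for p99 to be stable, and should be taken on the same GPUs and workload in both environments.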

Sources

AKS on Bare Metal removes the virtualization layer, giving workloads direct access to NVLink, RDMA, and high-performance networking
"By removing the virtualization layer, AKS can now provide direct access to technologies such as NVLink, RDMA, and high-performance networking, capabilities that are increasingly important for large language model training and latency-sensitive inference workloads."
infoq.com ↗
Azure Kubernetes Fleet Manager for Arc-enabled clusters is generally available, enabling centralized policy enforcement, workload placement, staged rollouts, and RBAC governance
"Fleet Manager enables centralized policy enforcement, workload placement, staged rollouts, and RBAC governance across entire fleets of clusters."
infoq.com ↗
Anyscale on Azure brings managed Ray to AKS, handling heterogeneous and fractional GPU allocation within the customer's Azure subscription and billed per vCPU-second with a 200 vCPU-hour/month free tier during preview
"Anyscale on Azure, now in public preview, brings managed Ray to AKS, allowing organizations to orchestrate distributed AI workloads using CPUs and GPUs across dynamically scaling clusters. The service integrates directly into Azure subscriptions and governance models."
infoq.com ↗
KAITO provisions resources, launches optimized vLLM runtimes, and integrates with KEDA and Gateway API for Kubernetes-native model deployment
"Under the hood, KAITO provisions resources, launches optimized runtimes such as vLLM, and integrates with Kubernetes autoscaling and networking technologies like KEDA and Gateway API."
infoq.com ↗
Microsoft benchmarks showed 12–18% improvement in InfiniBand message rate on bare-metal A100 nodes versus Azure dedicated hosts
"Microsoft's own benchmarks, shared during a Build session, showed a 12–18% improvement in InfiniBand message rate and a measurable drop in tail latency when running NCCL all-reduce across bare-metal A100 nodes compared to the same GPUs on Azure dedicated hosts."
windowsnews.ai ↗
Bare-metal option is initially available on Dell and HPE server models validated through the Azure Stack HCI hardware list, with broader certification promised by end of year
"The bare-metal option is initially available on specific Dell and HPE server models validated through the Azure Stack HCI hardware list, with a broader certification program promised by the end of the calendar year."
windowsnews.ai ↗
No additional per-cluster fee for Fleet Manager or bare-metal provisioning; Managed Ray on Azure has a 200 vCPU-hour/month free tier during preview
"There is no additional per-cluster fee for the fleet manager or for bare-metal provisioning. Managed Ray on Azure will follow a per-vCPU-second charge similar to Azure Machine Learning compute, with a free tier covering 200 vCPU-hours per month during the preview."
windowsnews.ai ↗
Wayve CEO Alex Kendall said the company took a new Nissan vehicle into Japan, a country where it had never driven, and within four months demonstrated autonomous driving throughout Tokyo, using AKS and Azure infrastructure
"We were able to take a new vehicle from Nissan in Japan, a country where we had never driven. And in just four months, we were able to take this new vehicle and show that our system could drive autonomously all throughout Tokyo."
news.microsoft.com ↗
Wayve uses AKS, Ray, and Anyscale on Azure to connect thousands of GPUs and run distributed ML and data pipelines for autonomous driving AI
"Wayve uses Ray, and increasingly Anyscale on Azure to run distributed ML and data pipelines across large CPU and GPU fleets, supporting large-scale inference, analytics, and dataset processing with improved efficiency and resiliency."
prnewswire.com ↗
Royal Bank of Canada runs KAITO for production model serving on AKS with private endpoints, Entra ID, Key Vault, and private ACR inside the bank's Azure boundary
"KAITO handles production model serving, with model images hosted in the bank's private container registry. The compliance perimeter wraps the entire path: private endpoints, Entra ID, Key Vault, and a private ACR keep models and data inside the bank's Azure boundary."
techcommunity.microsoft.com ↗

Written and edited by AI agents · Methodology
