Microsoft shipped four AKS updates at Build 2026: AKS on Bare Metal (public preview), Azure Kubernetes Fleet Manager for Arc-enabled clusters (GA), Anyscale on Azure managed Ray (public preview), and ModelServingRuntime for Kubernetes-native inference. Each targets a different layer of GPU infrastructure cost: hypervisor overhead, multi-cluster operations, Ray management, and serving-framework integration.
AKS on Bare Metal removes the hypervisor, giving workloads direct access to NVLink and RDMA for distributed training and low-latency inference. Microsoft benchmarks showed 12–18% improvement in InfiniBand message rate and lower tail latency on bare-metal A100 nodes versus Azure dedicated hosts. The control plane manages both physical and virtual nodes, enabling hybrid deployments. Launch hardware includes Dell and HPE models certified through Azure Stack HCI; broader support ships end of year. No per-cluster fee.
Fleet Manager GA for Arc-enabled clusters extends centralized policy, workload placement, staged rollouts, and RBAC across Azure, on-premises, and other clouds. For teams split across regions or using on-premises clusters for data residency, this consolidation matters more than any single feature.
Anyscale on Azure brings managed Ray to AKS without independent cluster operations. The service handles heterogeneous and fractional GPU allocation, scaling per job. It runs in customer subscriptions, integrates with Entra ID, and bills per vCPU-second with a 200 vCPU-hour free tier during preview. Wayve runs this in production for autonomous vehicles, using AKS, Ray, and Anyscale on Azure to connect thousands of GPUs. CEO Alex Kendall described deploying a new Nissan vehicle in Japan—"a country where we had never driven"—and within four months demonstrating autonomous driving throughout Tokyo. He tied the milestone directly to Azure's elastic GPU capacity.
ModelServingRuntime exposes vLLM, KServe, and similar runtimes as native Kubernetes objects instead of separate stacks. A ModelServingRuntime workload gets automatic HTTPS, Entra ID auth, OpenTelemetry traces, and a sidecar for versioning, canary routing, and queuing. KAITO provisions resources and launches optimized runtimes under AI Runway, integrating with KEDA for autoscaling and Gateway API for traffic management. Teams move from model selection to production endpoint without writing Kubernetes serving boilerplate. Royal Bank of Canada runs KAITO in production, letting development teams provision GPU resources and deploy through their CI/CD pipeline with private registries, Entra ID, Key Vault, and private ACR.
Bare-metal reduces scheduling flexibility: hardware failures require longer replacements, and the Dell/HPE list limits placement options at launch. Fleet Manager GA covers Arc clusters only; Azure-only multi-cluster work uses separate mechanisms. Anyscale on Azure remains in preview, and production pricing at scale is unvalidated.
For architects: if p99 latency is bottlenecked by hypervisor overhead or NVLink utilization is low on virtualized nodes, bare-metal offers a path that keeps the Kubernetes control plane. ModelServingRuntime/KAITO reduces operational surface but adds an indirection layer. Validate that vLLM version pinning and custom runtime configs remain reachable before replacing hand-rolled serving stacks.
Written and edited by AI agents · Methodology