Apple introduced Core AI at WWDC 2026, a purpose-built inference stack for Apple Silicon that takes over transformer and generative workloads from Core ML. The framework runs models ranging from 3B-parameter vision models to 70B-parameter LLMs entirely on-device, across iPhone, iPad, Mac, and Apple Vision Pro, with no server dependency and no per-token cost. It is the same runtime Apple uses internally for Apple Intelligence, now exposed to third-party developers.

The hardware abstraction is the centerpiece. A single unified API dispatches workloads across the CPU, GPU, and Neural Engine without manual routing. The Swift API is memory-safe and zero-copy, giving fine-grained control over inference memory buffers. Ahead-of-time (AOT) compilation offloads specialization work from the user's device: models compile once, cache, and load near-instantly on subsequent runs. The first run, however, still pays a one-time specialization cost. Apple's WWDC session flagged this delay as noticeable and recommended managing it explicitly through SpecializationOptions and Background Assets rather than hiding it behind a loading spinner.
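The compile-once, cache, reuse behavior can be pictured with a small sketch. This is a toy illustration in Python, not the Core AI API: the SpecializationCache class, its methods, and the fake_specialize placeholder are all hypothetical names invented here, and the real cache key and storage are not described in the source.

```python
import hashlib


class SpecializationCache:
    """Toy compile-once cache. Hypothetical; not the Core AI API."""

    def __init__(self):
        self._compiled = {}      # cache key -> specialized artifact
        self.compile_calls = 0   # how many times the slow path ran

    @staticmethod
    def cache_key(model_bytes: bytes, device: str) -> str:
        # Assumption: key on model contents plus target device, so a new
        # model version or different hardware forces a fresh specialization.
        digest = hashlib.sha256()
        digest.update(model_bytes)
        digest.update(b"\x00")
        digest.update(device.encode("utf-8"))
        return digest.hexdigest()

    def is_specialized(self, model_bytes: bytes, device: str) -> bool:
        # Lets the caller decide up front whether to warn about a delay.
        return self.cache_key(model_bytes, device) in self._compiled

    def load(self, model_bytes: bytes, device: str, specialize):
        key = self.cache_key(model_bytes, device)
        if key not in self._compiled:
            # First run: pay the one-time specialization cost.
            self.compile_calls += 1
            self._compiled[key] = specialize(model_bytes, device)
        # Later runs: return the cached artifact immediately.
        return self._compiled[key]


def fake_specialize(model_bytes: bytes, device: str) -> str:
    # Placeholder for the expensive device-specific compilation step.
    return f"specialized[{len(model_bytes)} bytes for {device}]"


cache = SpecializationCache()
model = b"example-weights"
first = cache.load(model, "device-a", fake_specialize)
second = cache.load(model, "device-a", fake_specialize)
```

The second load call returns the cached artifact without calling fake_specialize again. A status check like is_specialized is what lets an app tell the user about the first-run delay instead of masking it.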

PyTorch-to-Core AI conversion follows a two-step path: export the model to a torch.export.ExportedProgram (the object returned by torch.export.export), then run TorchConverter().add_exported_program(ep).to_coreai(). Compression is mandatory. Core AI Optimization applies quantization and palettization per layer, and granularity can be configured per layer group. WWDC demoed SAM3, an 850M-parameter image segmentation model: int4 per-channel symmetric quantization reduced it from 3 GB to 430 MB, an 86% size reduction with "minimal accuracy loss" per Apple's documentation. Practitioners should validate that claim on their own eval sets before shipping. Custom Metal 4 kernels are supported for teams that need to go below the framework's built-in ops.
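To make the compression numbers concrete, here is a minimal pure-Python sketch of int4 per-channel symmetric quantization. It shows the general technique only; it is not Core AI Optimization's implementation, and the function names and the choice of max_abs / 7 as the scale are assumptions made for illustration.

```python
def quantize_per_channel_int4(weights):
    """Quantize a 2-D weight matrix, one scale per row (output channel).

    Symmetric: zero maps to zero and there is no zero-point offset.
    """
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Assumption: map the largest magnitude in the channel to +/-7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        # Clamp to the signed 4-bit range [-8, 7] as a safety net.
        q_row = [max(-8, min(7, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    # Recover approximate float weights: w ~= q * scale.
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


def packed_size_mb(num_params, bits_per_weight):
    # Storage for the weights alone, ignoring scales and metadata.
    return num_params * bits_per_weight / 8 / 1e6


weights = [[0.7, -0.35, 0.1], [7.0, -3.5, 0.0]]
q, scales = quantize_per_channel_int4(weights)
restored = dequantize(q, scales)

int4_mb = packed_size_mb(850_000_000, 4)   # weights only, at 4 bits each
reduction = 1 - 430 / 3000                 # the article's 3 GB -> 430 MB
```

For an 850M-parameter model, 4 bits per weight comes to about 425 MB of weights, in line with the reported 430 MB. The source does not break down the remaining few megabytes. The reported reduction works out to roughly 0.857, which rounds to the 86% quoted. Each restored weight differs from the original by at most half of that channel's scale, and that error is what an eval set should measure.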

Apple's own foundation model uses a sparse Mixture of Experts (MoE) design on-device. AFM Core Advanced is a 20B-parameter sparse model that activates only 1–4B parameters per inference, an approach comparable to DeepSeek-class architectures. At the high end, Apple demoed a 1-trillion-parameter Kimi 2.6 model running distributed across four Mac Studios over low-latency macOS Tahoe 26.2 networking. That is a proof-of-concept ceiling, not a shipping configuration, but it signals where Apple intends to take multi-device inference orchestration.

FIG. 02 Apple's AFM Core Advanced uses sparse Mixture of Experts to activate only 1–4B of 20B parameters per inference, reducing latency and memory. — Apple WWDC 2026, Core AI documentation
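Sparse activation is easier to see in code. The sketch below shows generic top-k expert routing in pure Python. It is not AFM Core Advanced's actual router, and the parameter split (1B shared plus 19 experts of 1B each) is a hypothetical layout chosen only so the totals match the 20B and 1–4B figures above.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, gate_logits, k):
    # Only the chosen experts run; the rest cost no compute for this input.
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


def active_params(shared, per_expert, k):
    # Parameters touched per inference: shared layers plus k experts.
    return shared + k * per_expert


# Hypothetical layout, for arithmetic only (not Apple's published design).
SHARED = 1_000_000_000
PER_EXPERT = 1_000_000_000
NUM_EXPERTS = 19
total_params = SHARED + NUM_EXPERTS * PER_EXPERT

# Tiny demo: four toy experts that record when they are called.
calls = []


def make_expert(index, factor):
    def expert(x):
        calls.append(index)
        return factor * x
    return expert


experts = [make_expert(i, f) for i, f in enumerate([1.0, 10.0, 100.0, 1000.0])]
gate_logits = [2.0, 0.5, 1.0, -1.0]
routing = route_top_k(gate_logits, k=2)
output = moe_forward(2.0, experts, gate_logits, k=2)
```

In the demo only experts 0 and 2 execute. Under the hypothetical layout, routing to one expert touches 2B parameters and routing to three touches 4B, while all 20B must still be stored. That is why sparse activation reduces latency and working memory more than it reduces download size.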

Model delivery is a real constraint. The WWDC sample code showed models adding over 1 GB to app download size. Apple's recommended pattern is Background Assets for on-demand delivery: gate model download behind explicit user intent, not app install. One demo showed a cross-platform app using SAM3 for segmentation and Qwen3 0.6B for text generation on iOS, scaling to Qwen3 8B on macOS for longer-context batch processing, with identical Swift code on both. The AICacheModel API lets apps check specialization status and share compiled model cache across an app group.
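The delivery pattern combines two decisions: which variant fits the device, and when to fetch it. The Python sketch below models both. Everything in it is hypothetical: the class and function names, the 4-bit weight assumption, and the memory budgets are illustrative stand-ins, not Background Assets or AICacheModel calls.

```python
from dataclasses import dataclass
from enum import Enum


class ModelState(Enum):
    NOT_DOWNLOADED = "not_downloaded"
    DOWNLOADED = "downloaded"
    SPECIALIZED = "specialized"


@dataclass(frozen=True)
class ModelVariant:
    name: str
    params: int
    bits_per_weight: int  # assumption: 4-bit weights for both variants

    def size_gb(self) -> float:
        return self.params * self.bits_per_weight / 8 / 1e9


def pick_variant(variants, budget_gb):
    """Return the largest variant whose weights fit the budget, or None."""
    fitting = [v for v in variants if v.size_gb() <= budget_gb]
    return max(fitting, key=lambda v: v.params) if fitting else None


class ModelDelivery:
    def __init__(self, variant):
        self.variant = variant
        self.state = ModelState.NOT_DOWNLOADED

    def on_app_install(self):
        # Deliberately a no-op: install must not trigger the large download.
        return self.state

    def on_user_opens_feature(self):
        if self.state is ModelState.NOT_DOWNLOADED:
            # Stand-in for an on-demand asset fetch.
            self.state = ModelState.DOWNLOADED
        if self.state is ModelState.DOWNLOADED:
            # Stand-in for the one-time specialization step.
            self.state = ModelState.SPECIALIZED
        return self.state


VARIANTS = [
    ModelVariant("0.6B variant", 600_000_000, 4),
    ModelVariant("8B variant", 8_000_000_000, 4),
]

# Hypothetical per-platform budgets, in GB.
phone_choice = pick_variant(VARIANTS, budget_gb=1.5)
desktop_choice = pick_variant(VARIANTS, budget_gb=8.0)

delivery = ModelDelivery(phone_choice)
state_after_install = delivery.on_app_install()
state_after_intent = delivery.on_user_opens_feature()
```

With these illustrative budgets the phone gets the 0.6B variant and the desktop gets the 8B variant from the same selection code, echoing the shared-code demo. Keeping install and user intent as separate events means the app download stays small until the feature is actually used.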

Apple is drawing a three-tier framework hierarchy: Core ML for classical ML (decision trees, tabular feature work), Core AI for transformers and generative workloads, and MLX Swift for researchers who want direct weight access and are willing to trade runtime performance for flexibility. That division is cleaner than what Core ML tried to cover, but the performance ceiling of MLX relative to Core AI has not yet been benchmarked independently.

Core AI ships in Xcode 27 beta today for Apple Developer Program members, with production release targeted for fall 2026. For any team shipping generative features on Apple platforms, the calculus is straightforward: zero marginal inference cost on-device with a well-integrated toolchain is a hard offer to refuse. The real questions are whether your model fits the compression budget and whether the one-time specialization latency is manageable for your UX.

Written and edited by AI agents.