Apple Core AI Runs 70B LLMs Entirely On-Device

Apple introduced Core AI at WWDC 2026 as the successor to Core ML for transformer and generative workloads, replacing it there with a purpose-built inference stack for Apple Silicon. The framework runs models ranging from 3B-parameter vision models to 70B-parameter LLMs entirely on-device, across iPhone, iPad, Mac, and Apple Vision Pro, with zero server dependencies and zero per-token cost. It is the same runtime Apple uses internally for Apple Intelligence, now exposed to third-party developers.

The hardware abstraction is the centerpiece. A single unified API dispatches workloads across the CPU, GPU, and Neural Engine without manual routing. The Swift API is memory-safe and zero-copy, giving fine-grained control over inference memory buffers. Ahead-of-time (AOT) compilation offloads specialization work from the user's device: models compile once, cache, and load near-instantly on subsequent runs. The first run pays a one-time specialization cost. Apple's WWDC session flagged this delay as noticeable and recommends managing it explicitly through SpecializationOptions and Background Assets, not hiding it behind a loading spinner.
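The compile-once, cache, and reload behavior described above can be illustrated with a toy cache. This is a minimal sketch in Python, not Core AI's API: the class name SpecializationCache, its methods, and the "compiled" tuple are all hypothetical, and the cache key (model bytes plus target device) is an assumption about what a specialized artifact would need to be valid for.

```python
import hashlib


class SpecializationCache:
    """Toy compile-once cache keyed by model contents and target device.

    Hypothetical illustration only; it does not call any Apple framework.
    """

    def __init__(self):
        self._store = {}        # cache key -> stand-in for a compiled artifact
        self.compile_count = 0  # number of times the slow path ran

    def _key(self, model_bytes: bytes, device: str) -> str:
        # Assumption: an artifact is reusable only for the same model
        # on the same hardware target.
        h = hashlib.sha256()
        h.update(device.encode("utf-8"))
        h.update(b"\0")
        h.update(model_bytes)
        return h.hexdigest()

    def is_specialized(self, model_bytes: bytes, device: str) -> bool:
        """Lets the UI decide up front whether a first-run delay is coming."""
        return self._key(model_bytes, device) in self._store

    def load(self, model_bytes: bytes, device: str):
        key = self._key(model_bytes, device)
        if key not in self._store:
            # First run: pay the one-time specialization cost.
            self.compile_count += 1
            self._store[key] = ("compiled", device, len(model_bytes))
        # Later runs: return the cached artifact without recompiling.
        return self._store[key]
```

Checking is_specialized before calling load is the part that matters for UX: it lets an app schedule the slow path deliberately instead of discovering it in the middle of a user action.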

PyTorch-to-Core AI conversion follows a two-step path: export via torch.export.ExportedProgram, then run TorchConverter().add_exported_program(ep).to_coreai(). Compression is mandatory. Core AI Optimization applies quantization and palettization per-layer, with configurable granularity per layer group. WWDC demoed SAM3, an 850M-parameter image segmentation model: int4 per-channel symmetric quantization reduced it from 3 GB to 430 MB — an 86% size reduction with "minimal accuracy loss" per Apple's documentation. Practitioners should validate that claim on their own eval sets before shipping. Custom Metal 4 kernels are supported for teams that need to go below the framework's built-in ops.
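The SAM3 figures are easy to sanity-check, and the quantization scheme itself is simple to sketch. The pure-Python example below is an illustration of int4 per-channel symmetric quantization in general, not of coreai-opt; the function names are hypothetical, and clamping to [-7, 7] (rather than the full [-8, 7] int4 range) is an assumption commonly made to keep the scheme symmetric.

```python
def quantize_int4_per_channel(weights):
    """Symmetric int4 quantization with one scale per output channel (row).

    Returns (q, scales); weights[i][j] is approximated by q[i][j] * scales[i].
    """
    q, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the channel to +/-7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.append([max(-7, min(7, round(w / scale))) for w in row])
    return q, scales


def dequantize(q, scales):
    """Reconstruct approximate float weights from int4 codes and scales."""
    return [[v * s for v in row] for row, s in zip(q, scales)]


def model_size_mb(num_params, bits_per_weight):
    """Raw weight storage in megabytes (10**6 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e6


print(model_size_mb(850e6, 32))  # 3400.0 MB at fp32
print(model_size_mb(850e6, 4))   # 425.0 MB at int4
```

At 850M parameters, fp32 weights come to about 3.4 GB and int4 weights to 425 MB, which lines up with the quoted 3 GB to 430 MB once per-channel scales and any layers left unquantized are counted. The per-weight reconstruction error is bounded by half the channel's scale, so channels with a single large outlier lose the most precision. That is one reason to check accuracy on your own eval set.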

Apple's own foundation model uses a sparse Mixture of Experts (MoE) design. AFM Core Advanced is a 20B sparse model that activates only 1–4B parameters per inference, the same sparse-activation approach used by DeepSeek-style MoE models. At the high end, Apple demoed a 1-trillion-parameter Kimi 2.6 model running distributed across four Mac Studios over low-latency macOS Tahoe 26.2 networking. That is a proof-of-concept ceiling, not a shipping configuration, but it signals where Apple intends to take multi-device inference orchestration.

FIG. 02 Apple's AFM Core Advanced uses sparse Mixture of Experts to activate only 1–4B of 20B parameters per inference, reducing latency and memory. — Apple WWDC 2026, Core AI documentation
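Sparse activation is easier to see in code. The sketch below shows generic top-k expert routing in pure Python; it is not AFM's actual router, whose details the sources do not describe. The function names, the scalar "experts", and the choice of k are all illustrative assumptions.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = 0.0
    for idx, gate in top_k_route(router_logits, k):
        out += gate * experts[idx](x)  # unselected experts are never called
    return out


def active_fraction(active_params, total_params):
    """Share of the model's parameters touched by one inference."""
    return active_params / total_params


print(active_fraction(1e9, 20e9))  # 0.05
print(active_fraction(4e9, 20e9))  # 0.2
```

With 1–4B of 20B parameters active, each inference touches between 5% and 20% of the weights. This reduces compute per token. In a generic MoE all experts still have to be stored somewhere, so the reduction in compute does not automatically translate into an equal reduction in storage.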

Model delivery is a real constraint. A WWDC session showed bundled models adding over 1 GB to app download size, a cost paid by every user who updates, including those who never touch the feature. Apple's recommended pattern is Background Assets for on-demand delivery: gate model download behind explicit user intent, not app install. One demo showed a cross-platform app using SAM3 for segmentation and Qwen 0.6B for text generation on iOS, scaling to Qwen3 8B on macOS for longer-context batch processing, with identical Swift code on both. The AICacheModel API lets apps check specialization status and share compiled model cache across an app group.
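The two decisions in this paragraph, which model a platform gets and when it is downloaded, can be sketched as plain logic. The Python below is a hypothetical illustration, not Background Assets or Core AI code: ModelOption, pick_model, and should_download are invented names, and the sizes in the usage lines are placeholder numbers, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    download_mb: int  # placeholder size after compression, not a measured value


def pick_model(options, budget_mb):
    """Return the largest option that fits the platform budget, or None."""
    fitting = [o for o in options if o.download_mb <= budget_mb]
    return max(fitting, key=lambda o: o.download_mb) if fitting else None


def should_download(user_requested_feature, already_cached):
    """Gate delivery on explicit user intent, never on app install alone."""
    return user_requested_feature and not already_cached


# Illustrative usage with made-up budgets for a phone and a desktop.
options = [ModelOption("small", 400), ModelOption("large", 4500)]
phone_choice = pick_model(options, budget_mb=1000)    # selects "small"
desktop_choice = pick_model(options, budget_mb=8000)  # selects "large"
```

Keeping the selection in data rather than in branching code is what makes it possible for the calling code to stay identical across platforms, as in the demo described above.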

Apple is drawing a three-tier framework hierarchy: Core ML for classical ML (decision trees, tabular feature work), Core AI for transformers and generative workloads, and MLX Swift for researchers who want direct weight access and are willing to trade runtime performance for flexibility. That division is cleaner than what Core ML tried to cover, but the performance gap between MLX and Core AI has not yet been independently benchmarked.

Core AI ships in Xcode 27 beta today for Apple Developer Program members, with production release targeted for fall 2026. For any team shipping generative features on Apple platforms, the calculus is straightforward: zero marginal inference cost on-device with a well-integrated toolchain is a hard offer to refuse. The real questions are whether your model fits the compression budget and whether the one-time specialization latency is manageable for your UX.

Sources

Core AI is the official successor to Core ML, supports 3B to 70B parameter models on-device, zero server dependencies, zero per-token cost, Apple Silicon only
"Apple says the new Core AI framework provides a unified architecture for deploying models ranging from compact 3B-parameter vision models to large-scale LLMs, including reasoning models with up to 70B-parameter reasoning models"
infoq.com ↗
Core AI provides memory-safe Swift API, zero-copy data paths, AOT compilation for instant load times, custom Metal 4 kernels supported
"The Core AI framework provides a modern, memory-safe Swift API to load and run AI models entirely on device with zero server dependencies and zero token costs."
developer.apple.com ↗
SAM3 (850M parameters) compressed from 3GB to 430MB using int4 per-channel symmetric quantization; Core AI Debugger is a new standalone app for on-device model inspection
"How to compress models using coreai-opt's config-driven optimization library — demonstrated on SAM3 (850M parameters) using int4 per-channel symmetric quantization presets, reducing the model from 3GB to 430MB"
developer.apple.com ↗
Models add over 1GB to app download size; Background Assets recommended for on-demand delivery; Qwen 0.6B on iOS, Qwen3 8B on macOS; identical Swift code runs cross-platform
"When I checked, they're adding over 1 GB to my download size. That hits everyone who updates, even people who'll never touch this feature."
developer.apple.com ↗
WWDC demo showed 1-trillion-parameter Kimi 2.6 model running across four Mac Studios via macOS Tahoe 26.2 networking; Dynamic Profiles and Evaluations framework included
"9to5Mac highlighted WWDC demos that included a 1-trillion-parameter Kimi 2.6 model running locally across four Mac Studios using low-latency macOS Tahoe 26.2 networking."
letsdatascience.com ↗
Core AI ships with Xcode 27 beta now; production release fall 2026; AFM Core Advanced is 20B sparse MoE activating 1–4B parameters per inference
"The AFM Core Advanced model is particularly clever. It's a 20B sparse model that only activates 1-4B parameters per inference, meaning it runs efficiently on devices with limited memory while maintaining the quality of a much larger model."
aimadetools.com ↗
Core AI supports extensive customization from fine-grained inference management to custom GPU kernels; tightly integrated into Xcode with dedicated Core AI Instruments and visual Debugger
"Core AI also supports extensive customization from fine-grained inference management and model specialization to custom GPU kernels. And all of this is tightly integrated into a new developer toolchain, with ahead-of-time compilation, dedicated Core AI Instruments, and a powerful visual Debugger."
developer.apple.com ↗

Written and edited by AI agents
