Meta Shrinks Mixture-of-Experts to Smartphones Without Cloud Offloading

Meta AI researchers have published a paper on MobileMoE, a family of Mixture-of-Experts (MoE) models with fewer than one billion active parameters, which they claim bridges the gap between cloud-scale sparsity and on-device inference. The smallest variant, MobileMoE-S, activates only 0.3 billion parameters per token out of a total of 1.3 billion. The family comes in three scales (S, M, and L), each with an INT4 weight footprint under 3 GB, so that it fits within the DRAM of modern smartphones such as the iPhone 17 and Samsung Galaxy S25 Ultra without datacenter offloading.
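The sub-3 GB figure can be sanity-checked with simple arithmetic. The sketch below assumes every weight is stored at exactly 4 bits and ignores quantization scales, zero points, and any layers kept at higher precision, so it gives a lower bound rather than the paper's measured footprint. The function and variable names are illustrative, not taken from the paper.

```python
# Back-of-the-envelope INT4 weight footprint for the three variants.
# Assumption: every weight takes exactly 4 bits; scales, zero points, and
# any higher-precision layers are ignored, so real footprints are larger.

BITS_PER_WEIGHT = 4

# Total parameter counts as reported in the article.
TOTAL_PARAMS = {
    "MobileMoE-S": 1.3e9,
    "MobileMoE-M": 2.8e9,
    "MobileMoE-L": 5.3e9,
}


def int4_weight_gb(total_params: float, bits: int = BITS_PER_WEIGHT) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits / 8 / 1e9


footprints = {name: int4_weight_gb(n) for name, n in TOTAL_PARAMS.items()}

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.2f} GB of INT4 weights")
```

Under these assumptions even the largest variant needs about 2.65 GB for weights, which is consistent with the under-3 GB claim once quantization metadata is added.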

The architecture is tuned for mobile constraints rather than server farms. While cloud MoE models aim for hundreds of billions of parameters, the paper identifies an on-device sweet spot, combining moderate sparsity, fine-grained experts, and shared experts, that is both memory-optimal and compute-optimal. This challenges the conventional wisdom that MoE only pays off at massive scale. The two larger variants reach 0.5 billion active / 2.8 billion total parameters (MobileMoE-M) and 0.9 billion active / 5.3 billion total (MobileMoE-L). According to the paper, experts specialize across knowledge, code, and math domains within a single model's weight footprint.
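To make the terms concrete, here is a minimal sketch of one MoE layer with a shared expert that always runs and a small number of routed experts chosen per token. It is a generic illustration, not the paper's implementation: the toy experts, the dimensions, the expert count, the top-k value, and the renormalization of the selected gates are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_expert(scale):
    """Hypothetical stand-in for a small feed-forward expert: scales its input."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, router_rows, routed_experts, shared_expert, top_k):
    """Route one token vector through a shared expert plus its top-k routed experts."""
    # One router logit per routed expert (dot product with the token vector).
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_rows]
    probs = softmax(logits)
    # Keep only the top-k experts; the rest are never evaluated (sparsity).
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    gate_sum = sum(probs[i] for i in chosen)
    # The shared expert runs for every token, regardless of routing.
    out = shared_expert(x)
    for i in chosen:
        gate = probs[i] / gate_sum  # assumption: renormalize the selected gates
        expert_out = routed_experts[i](x)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


# Illustrative sizes only; none of these values come from the paper.
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2
rng = random.Random(0)
token = [rng.uniform(-1, 1) for _ in range(DIM)]
router_rows = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
routed_experts = [make_toy_expert(0.1 * (i + 1)) for i in range(NUM_EXPERTS)]
shared_expert = make_toy_expert(1.0)

output, chosen = moe_forward(token, router_rows, routed_experts, shared_expert, TOP_K)
print(f"ran the shared expert plus routed experts {chosen} out of {NUM_EXPERTS}")
```

Only the shared expert and the chosen routed experts contribute compute for a given token, which is why active parameters can stay far below total parameters. "Fine-grained" in this picture means many small routed experts rather than a few large ones.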

All models are trained through a four-stage pipeline—pre-training, mid-training, instruction fine-tuning, and 4-bit quantization-aware training—using only open-source data. Pre-training consumes roughly 6 trillion tokens, less than the 9 trillion used for Llama 3.2 1B or the 11 trillion for SmolLM2 1.7B, yet the paper reports matching or exceeding those dense baselines across 14 benchmarks covering commonsense, science, and reasoning. The 4-bit QAT stage is essential for achieving the sub-3 GB mobile DRAM target.

In on-device measurements, the paper reports that MobileMoE outperforms both dense and sparse baselines. At comparable INT4 weight memory, MobileMoE-S achieves prefill speeds 1.8 to 3.8 times faster and decode speeds 2.2 to 3.4 times faster than the dense MobileLLM-Pro on commodity smartphones. Against the sparse baseline OLMoE-1B-7B, MobileMoE-M matches its accuracy with roughly 60 percent fewer active and total parameters, while MobileMoE-L exceeds its accuracy with 30 percent fewer active parameters and a 23 percent smaller memory footprint. These gains are set against a backdrop where flagship phone DRAM has increased from 4–8 GB a few generations ago to 12–16 GB today.
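The percentage reductions against OLMoE-1B-7B can be reproduced from parameter counts. The article does not state OLMoE's exact sizes. The sketch below assumes roughly 1.3 billion active and 6.9 billion total parameters, as given in the OLMoE release, and treats the memory footprint as proportional to total parameters at equal bit-width. Both are assumptions, not figures from the MobileMoE paper.

```python
# Assumption: OLMoE-1B-7B sizes (about 1.3B active, 6.9B total) come from the
# OLMoE release, not from the article. MobileMoE sizes are from the article.
OLMOE = {"active": 1.3e9, "total": 6.9e9}

MOBILE_MOE = {
    "MobileMoE-M": {"active": 0.5e9, "total": 2.8e9},
    "MobileMoE-L": {"active": 0.9e9, "total": 5.3e9},
}


def reduction_pct(new: float, baseline: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    return (1 - new / baseline) * 100


reductions = {
    name: {kind: reduction_pct(sizes[kind], OLMOE[kind]) for kind in ("active", "total")}
    for name, sizes in MOBILE_MOE.items()
}

for name, r in reductions.items():
    print(f"{name}: {r['active']:.1f}% fewer active, {r['total']:.1f}% fewer total")
```

Under these assumptions the results land close to the reported figures: about 60 percent on both counts for MobileMoE-M, and about 31 percent (active) and 23 percent (total) for MobileMoE-L.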

FIG. 02 MobileMoE models achieve accuracy parity with baselines using significantly fewer active and total parameters across three size tiers. — Meta MobileMoE paper, arxiv.org/abs/2605.27358v1

However, this is a research publication with no production deployment evidence yet. The speedups come from controlled on-device profiling, not from sustained user workloads subject to thermal throttling, background process contention, or battery-aware scheduling. Training on 6 trillion tokens for models with fewer than one billion active parameters represents a high data-to-parameter ratio, implying a high upfront training cost. The reliance on 4-bit QAT means teams cannot simply quantize existing FP16 checkpoints as an afterthought. The paper also omits per-request latency in milliseconds, dollar-per-inference economics, and the specific kernel-level routing logic required to execute conditional expert loads efficiently on mobile NPUs and GPUs. Any platform team would need to address these gaps before substituting a dense on-device model.
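The data-to-parameter ratio mentioned above can be worked out from the article's own numbers. The sketch assumes all three variants see the same roughly 6 trillion pre-training tokens, which the article implies but does not break down per variant. The variable names are illustrative.

```python
# Tokens seen per parameter during pre-training, using the article's figures.
# Assumption: each variant is pre-trained on the same ~6T-token budget.
PRETRAIN_TOKENS = 6e12

VARIANTS = {
    "MobileMoE-S": {"active": 0.3e9, "total": 1.3e9},
    "MobileMoE-M": {"active": 0.5e9, "total": 2.8e9},
    "MobileMoE-L": {"active": 0.9e9, "total": 5.3e9},
}

ratios = {
    name: {
        "per_active": PRETRAIN_TOKENS / sizes["active"],
        "per_total": PRETRAIN_TOKENS / sizes["total"],
    }
    for name, sizes in VARIANTS.items()
}

for name, r in ratios.items():
    print(
        f"{name}: ~{r['per_active']:,.0f} tokens per active parameter, "
        f"~{r['per_total']:,.0f} per total parameter"
    )
```

For MobileMoE-S this works out to about 20,000 tokens per active parameter, which is the sense in which the training budget is large relative to the model's per-token compute.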

The paper's findings suggest that MoE efficiency gains can hold below one billion active parameters when sparsity is co-designed with aggressive quantization and a fixed mobile memory ceiling, rather than extrapolated downward from hundred-billion-parameter cloud recipes.

Sources

MobileMoE-S activates 0.3B parameters with 1.3B total and <3 GB INT4 weight footprint
"sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs"
arxiv.org ↗
Model family spans three sizes: S (0.3B/1.3B), M (0.5B/2.8B), L (0.9B/5.3B)
"0.3B/0.5B/0.9B active parameters (1.3B/2.8B/5.3B total) with <3 GB INT4 weight footprints to fit in mobile DRAM"
arxiv.org ↗
On-device scaling law identifies sweet spot of moderate sparsity with fine-grained and shared experts as simultaneously memory and compute optimal
"identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal"
arxiv.org ↗
Four-stage training pipeline: pre-training, mid-training, instruction fine-tuning, and 4-bit quantization-aware training
"four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets"
arxiv.org ↗
MobileMoE pre-trains on ~6 trillion tokens, vs 9T for Llama 3.2 1B and 11T for SmolLM2 1.7B
"With only ~6T pre-training tokens, MobileMoE matches or surpasses dense baselines trained on 1.5-2× more tokens (e.g., 9T for Llama 3.2 1B, 11T for SmolLM2 1.7B)"
arxiv.org ↗
MobileMoE-S/M match or exceed dense on-device LLMs with 2-4× fewer inference FLOPs across 14 benchmarks
"MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs"
arxiv.org ↗
MobileMoE-M matches OLMoE-1B-7B accuracy with ~60% fewer active and total parameters
"MobileMoE-M matches its accuracy with ~60% fewer active and total parameters"
arxiv.org ↗
MobileMoE-L exceeds OLMoE accuracy with 30% fewer active parameters and 23% smaller memory footprint
"MobileMoE-L achieves much higher accuracy with 30% fewer active parameters and 23% smaller model memory footprint"
arxiv.org ↗
MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than dense baseline MobileLLM-Pro at comparable INT4 weight memory
"MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline MobileLLM-Pro"
arxiv.org ↗
iPhone DRAM grew from 4 GB (iPhone 13) to 12 GB (iPhone 17); Samsung S25 has 12–16 GB
"from 4 GB on iPhone 13 to 12 GB on iPhone 17, from 8 GB on Samsung Galaxy S21 to 12 GB, 16 GB on S25 and S25 Ultra"
arxiv.org ↗

Written and edited by AI agents · Methodology
