Meta AI researchers have published a paper on MobileMoE, a family of Mixture-of-Experts (MoE) models with fewer than one billion active parameters, which they claim can bridge the gap between cloud-scale sparsity and on-device inference. The smallest variant, MobileMoE-S, activates only 0.3 billion of its 1.3 billion total parameters and stays within a sub-3 GB INT4 footprint. The models come in three scales (S, M, and L) designed to fit within the DRAM of modern smartphones such as the iPhone 17 and Samsung Galaxy S25 Ultra without datacenter offloading.
The architecture is tuned for mobile constraints rather than server farms. While cloud MoE models aim for hundreds of billions of parameters, the paper identifies a sweet spot that combines moderate sparsity, fine-grained experts, and shared expert layers, and that it describes as both memory-optimal and compute-optimal. This challenges the conventional wisdom that MoE only pays off at massive scale. The larger variants reach 0.5 billion active / 2.8 billion total parameters (MobileMoE-M) and 0.9 billion active / 5.3 billion total (MobileMoE-L). The paper reports expert specialization emerging across knowledge, code, and math domains within a single shared set of weights.
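To make the three design knobs concrete, the sketch below counts expert parameters for a hypothetical MoE configuration. It is not the paper's architecture: the `MoEConfig` fields, the layer sizes, and the three-matrix gated feed-forward block are all illustrative assumptions, and attention, embedding, and router weights are left out. It only shows how routing each token to a few small experts, plus an always-on shared expert, separates active from total parameters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE feed-forward configuration (not taken from the paper)."""
    d_model: int    # hidden size of the transformer
    d_expert: int   # inner width of one fine-grained expert
    n_routed: int   # routed experts per layer
    n_shared: int   # shared experts that every token passes through
    top_k: int      # routed experts selected per token
    n_layers: int   # number of MoE layers


def expert_ffn_params(d_model: int, d_expert: int) -> int:
    # Assumes a gated FFN with three weight matrices (gate, up, down), no biases.
    return 3 * d_model * d_expert


def expert_param_counts(cfg: MoEConfig) -> Tuple[int, int]:
    """Return (active, total) expert parameters; attention and embeddings are ignored."""
    per_expert = expert_ffn_params(cfg.d_model, cfg.d_expert)
    total = cfg.n_layers * (cfg.n_routed + cfg.n_shared) * per_expert
    active = cfg.n_layers * (cfg.top_k + cfg.n_shared) * per_expert
    return active, total


# Illustrative numbers only.
cfg = MoEConfig(d_model=1024, d_expert=512, n_routed=16, n_shared=1, top_k=2, n_layers=12)
active, total = expert_param_counts(cfg)
print(f"active expert params: {active:,}")
print(f"total expert params:  {total:,}")
print(f"active fraction:      {active / total:.2%}")
```

In this sketch, raising `n_routed` while holding `top_k` fixed grows total capacity without changing per-token compute, so memory rather than compute becomes the binding limit.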
All models are trained through a four-stage pipeline (pre-training, mid-training, instruction fine-tuning, and 4-bit quantization-aware training, or QAT) using only open-source data. Pre-training consumes roughly 6 trillion tokens, fewer than the 9 trillion used for Llama 3.2 1B or the 11 trillion for SmolLM2 1.7B, yet the paper reports matching or exceeding those dense baselines across 14 benchmarks covering commonsense, science, and reasoning. The 4-bit QAT stage is essential for achieving the sub-3 GB mobile DRAM target.
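A back-of-the-envelope check shows why 4-bit weights matter for that target. The sketch below is a weights-only estimate using the total parameter counts quoted in this article. It assumes every weight is stored at exactly 4 bits and ignores quantization scales, any layers kept at higher precision, the KV cache, and activations, so real footprints would be somewhat larger. The helper name `weight_footprint_gb` is made up for illustration.

```python
# Total parameter counts as quoted in the article.
TOTAL_PARAMS = {
    "MobileMoE-S": 1_300_000_000,
    "MobileMoE-M": 2_800_000_000,
    "MobileMoE-L": 5_300_000_000,
}


def weight_footprint_gb(n_params: int, bits_per_weight: int = 4) -> float:
    """Weights-only size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in TOTAL_PARAMS.items():
    int4 = weight_footprint_gb(n_params, bits_per_weight=4)
    fp16 = weight_footprint_gb(n_params, bits_per_weight=16)
    print(f"{name}: ~{int4:.2f} GB at 4 bits, ~{fp16:.2f} GB at 16 bits")
```

Under these assumptions the largest variant comes to about 2.65 GB of weights at 4 bits, against about 10.6 GB at 16 bits, which is consistent with the sub-3 GB figure.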
In the paper's measurements, MobileMoE outperforms both dense and sparse baselines. At comparable INT4 memory, MobileMoE-S runs prefill 1.8 to 3.8 times faster and decode 2.2 to 3.4 times faster than the dense MobileLLM-Pro on commodity smartphones. Against the sparse OLMoE-1B-7B, MobileMoE-M matches its accuracy with roughly 60 percent fewer active and total parameters, while MobileMoE-L exceeds it with 30 percent fewer active parameters and a 23 percent smaller memory footprint. These gains arrive as flagship phone DRAM has grown from 4–8 GB a few generations ago to 12–16 GB today.
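The OLMoE percentages can be sanity-checked from parameter counts alone. The sketch below assumes OLMoE-1B-7B has about 1.3 billion active and 6.9 billion total parameters. Those figures come from OLMoE's own published description, not from this article, so treat them as an outside assumption. It also assumes both models are stored at the same precision, so that memory footprint scales with total parameters.

```python
def percent_fewer(candidate: float, baseline: float) -> float:
    """Percentage by which candidate is smaller than baseline."""
    return (1 - candidate / baseline) * 100


# Sizes in billions of parameters.
# ASSUMPTION: OLMoE-1B-7B figures are not stated in the article.
OLMOE = {"active": 1.3, "total": 6.9}
MOBILE_MOE = {
    "MobileMoE-M": {"active": 0.5, "total": 2.8},
    "MobileMoE-L": {"active": 0.9, "total": 5.3},
}

reductions = {
    name: {kind: percent_fewer(sizes[kind], OLMOE[kind]) for kind in ("active", "total")}
    for name, sizes in MOBILE_MOE.items()
}

for name, by_kind in reductions.items():
    print(f"{name}: {by_kind['active']:.0f}% fewer active, {by_kind['total']:.0f}% fewer total")
```

With these inputs the results land close to the quoted figures: roughly 60 percent on both counts for MobileMoE-M, and about 31 and 23 percent for MobileMoE-L.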
However, this is a research publication with no production deployment evidence yet. The speedups come from controlled on-device profiling, not from sustained user workloads subject to thermal throttling, background process contention, or battery-aware scheduling. Training on 6 trillion tokens for models with fewer than one billion active parameters represents a high data-to-parameter ratio, implying a high upfront training cost. The reliance on 4-bit QAT means teams cannot simply quantize existing FP16 checkpoints as an afterthought. The paper also omits per-request latency in milliseconds, dollar-per-inference economics, and the specific kernel-level routing logic required to execute conditional expert loads efficiently on mobile NPUs and GPUs. These are gaps any platform team would need to address before replacing a dense on-device model with MobileMoE.
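The data-to-parameter ratio can be put in numbers. The sketch below divides the roughly 6 trillion pre-training tokens by each variant's active parameter count, assuming the same token budget applies to all three variants. For scale it compares against the often-cited rule of thumb of about 20 tokens per parameter from the Chinchilla scaling-law work, which is an outside reference point and not something this article attributes to the paper.

```python
PRETRAIN_TOKENS = 6_000_000_000_000  # ~6 trillion, as quoted in the article

# Active parameter counts as quoted in the article.
ACTIVE_PARAMS = {
    "MobileMoE-S": 300_000_000,
    "MobileMoE-M": 500_000_000,
    "MobileMoE-L": 900_000_000,
}

# ASSUMPTION: outside rule of thumb, used only as a yardstick.
REFERENCE_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: int, params: int) -> float:
    """Training tokens seen per parameter."""
    return tokens / params


for name, params in ACTIVE_PARAMS.items():
    ratio = tokens_per_param(PRETRAIN_TOKENS, params)
    multiple = ratio / REFERENCE_TOKENS_PER_PARAM
    print(f"{name}: {ratio:,.0f} tokens per active parameter (~{multiple:,.0f}x the yardstick)")
```

Measured against total rather than active parameters the ratios are lower, but they remain far above the yardstick, which is the sense in which the upfront training cost is high.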
The paper's findings suggest that MoE efficiency gains can hold below one billion active parameters when sparsity is co-designed with aggressive quantization and a fixed mobile memory ceiling, rather than extrapolated downward from hundred-billion-parameter cloud recipes.
Written and edited by AI agents