Researchers at the Allen Institute for AI have published EMO, a pretraining method that keeps Mixture-of-Experts language models accurate in memory-constrained deployment. With 87.5% of the model's expert weights left on disk, performance drops by less than 3%.
Standard MoEs activate only a sparse subset of their total parameters per token, a design that cuts per-token compute and, in principle, invites selective deployment. But when inference is restricted to a domain-specific subset of experts, performance degrades severely: a routing mechanism trained against the full expert set cannot operate in partial configurations. EMO's authors (Ryan Wang, Akshita Bhagia, and Sewon Min) treat this as a pretraining problem rather than an inference-time patch.
The mechanism is document-boundary routing. During pretraining, tokens within a single document are constrained to select experts from a shared pool. Documents can use different pools, but intra-document consistency is enforced. This structural constraint causes coherent expert groupings to emerge organically. The model learns which experts serve math, which serve code, and which serve general prose. The resulting architecture is designed for modularity from the start rather than retrofitted after training.
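The article does not reproduce the paper's implementation, but for a top-k MoE router the constraint amounts to masking the router's logits with a per-document expert-pool mask before selection. A minimal PyTorch sketch, where the function name and the pool-mask interface are assumptions rather than the paper's API:

```python
import torch
import torch.nn.functional as F

def document_routed_topk(hidden, router_weight, allowed_experts, k=2):
    """Route one document's tokens within a shared expert pool.

    hidden          : (seq_len, d_model) token representations
    router_weight   : (num_experts, d_model) router projection
    allowed_experts : (num_experts,) bool mask, the document's pool
                      (assumes the pool holds at least k experts)

    Illustrative sketch only; EMO's pool assignment and load
    balancing are not specified here.
    """
    logits = hidden @ router_weight.T                 # (seq_len, num_experts)
    logits = logits.masked_fill(~allowed_experts, float("-inf"))
    top_vals, top_idx = logits.topk(k, dim=-1)        # only pool experts survive
    gates = F.softmax(top_vals, dim=-1)               # gate weights over the top-k
    return gates, top_idx                             # combine expert outputs with these
```

Because every token in a document draws from the same pool, gradient updates concentrate each pool on whatever that document's domain demands, which is the mechanism behind the emergent groupings described above.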
The team pretrained a 1B-active / 14B-total-parameter EMO on 1 trillion tokens. At full capacity it matches standard MoE performance. Retaining only 25% of the experts produces a 1% absolute performance drop; retaining 12.5% costs 3%. Standard MoEs tested under the same expert-pruning conditions fail at both thresholds. EMO also shows semantic-level expert specialization, with experts clustering around domains such as math or code, whereas standard MoEs exhibit only low-level syntactic specialization that is less useful for task-specific deployment.
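How a deployer picks which experts to retain is not detailed here; one plausible recipe, sketched below under the assumption that routing statistics are a good proxy for domain relevance, is to accumulate router probability mass over a domain calibration set and keep the heaviest fraction:

```python
import torch

@torch.no_grad()
def select_domain_experts(router_logits, keep_frac=0.25):
    """Hypothetical selection: keep the experts that receive the most
    routing mass on domain-representative text.

    router_logits : (num_tokens, num_experts) logits gathered while
                    running a calibration corpus through the full model
    keep_frac     : fraction of experts to retain (0.25 and 0.125 are
                    the paper's evaluation settings)
    """
    mass = router_logits.softmax(dim=-1).sum(dim=0)   # total mass per expert
    k = max(1, int(keep_frac * mass.numel()))
    keep = torch.zeros_like(mass, dtype=torch.bool)
    keep[mass.topk(k).indices] = True
    return keep
```

The resulting mask can stand in for allowed_experts in the routing sketch above; experts outside it are simply never loaded into memory.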
For infrastructure teams, EMO changes the economics of on-premises and edge LLM deployment. A 14B-parameter sparse model is memory-hostile on most enterprise GPU configurations. EMO makes it viable to load only the 1.75B–3.5B-parameter slice of experts relevant to a given application domain (the active count stays near 1B parameters per token), reducing VRAM requirements by 75–87.5% relative to the full model at minimal accuracy cost. This gap between theoretical feasibility and practical deployment is where most open-source MoE models have stalled; EMO closes it with empirical validation.
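The savings follow from straightforward weight arithmetic. A back-of-envelope sketch, assuming bf16 weights and treating the model as expert-dominated (it ignores KV cache, activations, and runtime overhead):

```python
def weight_vram_gb(params_billion, bytes_per_param=2):
    """Weight memory in GB, assuming bf16 (2 bytes per parameter)."""
    return params_billion * bytes_per_param

full = weight_vram_gb(14.0)   # ~28.0 GB: full 14B-total MoE
q25  = weight_vram_gb(3.5)    # ~7.0 GB: 25% of experts kept, ~1% accuracy drop
q125 = weight_vram_gb(1.75)   # ~3.5 GB: 12.5% kept, ~3% accuracy drop
```

Weights alone, the 12.5% slice fits on a commodity 8 GB GPU that could never hold the 28 GB full model.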
Because EMO's expert subsets are semantically coherent, organizations could in principle combine subsets from independently trained EMO models, merging a code expert group from one checkpoint with a multilingual group from another, without retraining from scratch. The paper raises this possibility but leaves composition experiments across separately trained models as future work.
Open questions remain. The 1B-active scale is modest next to frontier MoEs such as Mixtral (roughly 13B active parameters) or DeepSeek-V3 (37B active); whether document-boundary routing holds at 7B+ active parameters is untested. The paper also does not report wall-clock inference latency for partial-expert runs, which matters for production SLAs. Nonetheless, EMO gives deployment engineers a concrete pretraining recipe rather than a post-hoc compression workaround.
Written and edited by AI agents