Researchers at Tsinghua University's THUNLP lab have published DECO, a sparse Mixture-of-Experts architecture that matches dense Transformer performance under identical total-parameter and training-token budgets while activating only 20% of its experts per forward pass and delivering a 3.00× inference speedup on real hardware.
The key constraint DECO addresses is storage: standard sparse MoE designs reduce active computation per token, but their total parameter count remains large, and on edge and on-device deployments that is a hard limit. DECO fits within the same parameter envelope as a comparable dense Transformer, closing the storage gap that has made MoE impractical for resource-constrained infrastructure.
Three technical decisions drive DECO's efficiency. First, it replaces standard top-K gating with differentiable ReLU-based routing augmented by learnable per-expert scaling factors, letting the model adaptively weight contributions from routed experts against a shared expert pool instead of applying a hard selection step that discards gradient signal. Second, the team introduces NormSiLU, an activation function that normalizes inputs before applying the SiLU nonlinearity; the normalization stabilizes the fraction of routed experts that activate across training (the "routed-expert activation ratio") and pushes intrinsic sparsity higher without external load-balancing losses. Third, experiments confirm a simplification: non-gated MLP experts paired with ReLU-based routing outperform gated variants, removing the gating parameters that are standard in published MoE designs.
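In rough PyTorch terms, those three choices compose into a layer along the following lines. This is a minimal sketch reconstructed from the description above, not the released implementation: the module names, the exact form of NormSiLU's normalization, and its placement in the routing path are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSiLU(nn.Module):
    """Normalize inputs, then apply SiLU (assumed: LayerNorm over the last dim)."""
    def forward(self, x):
        return F.silu(F.layer_norm(x, x.shape[-1:]))

class MLPExpert(nn.Module):
    """Non-gated two-layer MLP expert: no SwiGLU-style gating branch."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class DecoStyleMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.router_act = NormSiLU()                              # placement here is an assumption
        self.expert_scale = nn.Parameter(torch.ones(n_experts))   # learnable per-expert scaling
        self.experts = nn.ModuleList(MLPExpert(d_model, d_ff) for _ in range(n_experts))
        self.shared_expert = MLPExpert(d_model, d_ff)

    def forward(self, x):  # x: (tokens, d_model)
        # Differentiable ReLU routing: no top-K cutoff, so gradients reach the
        # router; experts whose score lands at exactly zero are simply skipped.
        scores = F.relu(self.router_act(self.router(x))) * self.expert_scale
        out = self.shared_expert(x)
        for i, expert in enumerate(self.experts):
            routed = scores[:, i] > 0   # compute expert i only for its routed tokens
            if routed.any():
                out[routed] = out[routed] + scores[routed, i].unsqueeze(1) * expert(x[routed])
        return out
```

In this formulation the routing weights are ordinary differentiable activations, so sparsity emerges from how many scores land at exactly zero rather than from a fixed top-K budget, and the per-token skip pattern is what a sparsity-aware inference kernel can exploit.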
Benchmarking covered four model scales (0.1B, 0.2B, 0.5B, and 1.2B parameters) and compared DECO against dense baselines and established MoE architectures including BlockFFN, ReMoE, and DeepSeek-V3-style configurations. DECO matched dense performance at every scale while outperforming the MoE baselines. A specialized CUDA inference kernel tuned for the ReLU sparsity pattern produced the 3.00× wall-clock speedup over dense inference.
For teams evaluating on-device AI, whether edge inference for manufacturing quality control, retail computer vision, autonomous vehicle perception, or on-premise LLM serving, DECO's parameter parity with dense models is the decisive property: it slots into existing dense-model deployment pipelines at the same memory budget. The 1.2B-parameter configuration falls within the range of models already deployed on high-end mobile SoCs and mid-range server-class NPUs, making the architecture immediately practical.
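As a rough illustration of what parameter parity buys, the weight footprint of a 1.2B-parameter model at common precisions works out as follows (our arithmetic, not a figure from the paper):

```python
# Back-of-the-envelope weight memory for a 1.2B-parameter model at common
# precisions. Illustrative arithmetic only; activation memory, KV cache, and
# runtime overhead are excluded.
params = 1.2e9
for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision:>9}: ~{gib:.1f} GiB of weights")
# fp16/bf16: ~2.2 GiB, int8: ~1.1 GiB, int4: ~0.6 GiB. Because total parameters
# match a dense model, none of this is multiplied by a large expert count.
```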
Published benchmarks cover pre-training loss and standard downstream tasks at modest scales; there is no evaluation in the 7B–70B range where most enterprise foundation-model decisions are made. Training requires Megatron-LM multi-GPU cluster infrastructure, which is accessible to large enterprises but not to teams that only fine-tune existing models. ReLU-based routing also produces a different expert-load distribution than top-K methods do, and behavior under fine-tuning on narrow domain data remains uncharacterized.
Code, training scripts, and pretrained checkpoints are publicly available on GitHub under the THUNLP organization. For continuous inference workloads at the edge, the 3× hardware speedup would represent a meaningful cost reduction and a reason to revisit MoE architectures previously dismissed as storage-impractical.