Researchers at Tsinghua University's THUNLP lab have published DECO, a sparse Mixture-of-Experts architecture that matches dense Transformer performance under identical total-parameter and training-token budgets while activating only 20% of its experts per forward pass and delivering a 3.00× inference speedup on real hardware.
The key constraint DECO addresses is storage: standard sparse MoE designs reduce active computation per token, but their total parameter count remains large, and on edge and on-device deployments that is a hard limit. DECO fits within the same parameter envelope as a comparable dense Transformer, closing the storage gap that has made MoE impractical for resource-constrained infrastructure.
Three technical decisions drive DECO's efficiency. First, it replaces standard top-K gating with differentiable ReLU-based routing augmented by learnable per-expert scaling factors, letting the model adaptively weight contributions from routed experts against a shared expert pool instead of applying a hard selection step that discards gradient signal. Second, the team introduces NormSiLU, an activation function that normalizes inputs before applying the SiLU nonlinearity; the normalization stabilizes the fraction of routed experts that activate across training (the "routed-expert activation ratio") and pushes intrinsic sparsity higher without external load-balancing losses. Third, experiments confirm a simplification: non-gated MLP experts paired with ReLU-based routing outperform gated variants, removing the gating parameters that are standard in published MoE designs.
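In rough PyTorch terms, those three choices compose into a layer along the following lines. This is a minimal sketch reconstructed from the description above, not the released implementation: the module names, the exact form of NormSiLU's normalization, and its placement in the routing path are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSiLU(nn.Module):
    """Normalize inputs, then apply SiLU (assumed: LayerNorm over the last dim)."""
    def forward(self, x):
        return F.silu(F.layer_norm(x, x.shape[-1:]))

class MLPExpert(nn.Module):
    """Non-gated two-layer MLP expert: no SwiGLU-style gating branch."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class DecoStyleMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.router_act = NormSiLU()                              # placement here is an assumption
        self.expert_scale = nn.Parameter(torch.ones(n_experts))   # learnable per-expert scaling
        self.experts = nn.ModuleList(MLPExpert(d_model, d_ff) for _ in range(n_experts))
        self.shared_expert = MLPExpert(d_model, d_ff)

    def forward(self, x):  # x: (tokens, d_model)
        # Differentiable ReLU routing: no top-K cutoff, so gradients reach the
        # router; experts whose score lands at exactly zero are simply skipped.
        scores = F.relu(self.router_act(self.router(x))) * self.expert_scale
        out = self.shared_expert(x)
        for i, expert in enumerate(self.experts):
            routed = scores[:, i] > 0   # compute expert i only for its routed tokens
            if routed.any():
                out[routed] = out[routed] + scores[routed, i].unsqueeze(1) * expert(x[routed])
        return out
```

In this formulation the routing weights are ordinary differentiable activations, so sparsity emerges from how many scores land at exactly zero rather than from a fixed top-K budget, and the per-token skip pattern is what a sparsity-aware inference kernel can exploit.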
Benchmarking covered four model scales (0.1B, 0.2B, 0.5B, and 1.2B parameters) and compared DECO against dense baselines and established MoE architectures including BlockFFN, ReMoE, and DeepSeek-V3-style configurations. DECO matched dense performance at every scale while outperforming the MoE baselines. A specialized CUDA inference kernel tuned for the ReLU sparsity pattern produced the 3.00× wall-clock speedup over dense inference.
For teams evaluating on-device AI, whether edge inference for manufacturing quality control, retail computer vision, autonomous vehicle perception, or on-premise LLM serving, DECO's parameter parity with dense models is the decisive property: it slots into existing dense-model deployment pipelines at the same memory budget. The 1.2B-parameter configuration falls within the range of models already deployed on high-end mobile SoCs and mid-range server-class NPUs, making the architecture immediately practical.
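As a rough illustration of what parameter parity buys, the weight footprint of a 1.2B-parameter model at common precisions works out as follows (our arithmetic, not a figure from the paper):

```python
# Back-of-the-envelope weight memory for a 1.2B-parameter model at common
# precisions. Illustrative arithmetic only; activation memory, KV cache, and
# runtime overhead are excluded.
params = 1.2e9
for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision:>9}: ~{gib:.1f} GiB of weights")
# fp16/bf16: ~2.2 GiB, int8: ~1.1 GiB, int4: ~0.6 GiB. Because total parameters
# match a dense model, none of this is multiplied by a large expert count.
```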
Published benchmarks cover pre-training loss and standard downstream tasks at modest scales; there is no evaluation in the 7B–70B range where most enterprise foundation-model decisions are made. Training requires Megatron-LM multi-GPU cluster infrastructure, which is accessible to large enterprises but not to teams that only fine-tune existing models. ReLU-based routing also produces a different expert-load distribution than top-K methods do, and behavior under fine-tuning on narrow domain data remains uncharacterized.
Code, training scripts, and pretrained checkpoints are publicly available on GitHub under the THUNLP organization. For continuous inference workloads at the edge, the 3× hardware speedup would represent a meaningful cost reduction and a reason to revisit MoE architectures previously dismissed as storage-impractical.