Researchers from Rice University and Apple have identified and patched a structural inefficiency in Diffusion Transformers (DiTs): a small population of "outlier tokens" that monopolize attention bandwidth while contributing almost nothing to spatial fidelity. Their fix, Dual-Stage Registers (DSR), cuts ImageNet-256 FID from 5.89 to 4.58 on the RAE-DiT pipeline and raises the GenEval score from 0.426 to 0.466 on large-scale text-to-image benchmarks.

The same pathology appears in Vision Transformers (ViTs) used for recognition, where certain tokens develop abnormally large norms and absorb a disproportionate share of attention weight. The new paper, "Taming Outlier Tokens in Diffusion Transformers," extends this finding to modern RAE-DiT pipelines at two stages: the pretrained ViT encoder and the denoiser itself. Outlier tokens in the diffusion generator emerge predominantly in intermediate layers rather than late layers, a distinction that matters for layer-selective optimization.
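
To make the diagnosis concrete, here is a minimal PyTorch sketch (ours, not from the paper) that flags high-norm tokens with a robust z-score on per-token norms; the tensor shapes, threshold, and layer choice are illustrative assumptions.

```python
import torch

def flag_outlier_tokens(tokens: torch.Tensor, z_thresh: float = 8.0) -> torch.Tensor:
    """Flag tokens whose L2 norm sits far above the per-image median.

    tokens: (batch, num_tokens, dim) activations from one transformer layer.
    Returns a boolean mask (batch, num_tokens); True marks a suspected outlier.
    """
    norms = tokens.norm(dim=-1)                                     # per-token L2 norms
    med = norms.median(dim=-1, keepdim=True).values                 # robust center
    mad = (norms - med).abs().median(dim=-1, keepdim=True).values   # robust spread
    return (norms - med) / (mad + 1e-6) > z_thresh                  # MAD-based z-score

# Synthetic check: inject one high-norm token and recover its index.
x = torch.randn(2, 256, 768)
x[0, 17] *= 30.0
print(flag_outlier_tokens(x).nonzero())  # expected: tensor([[0, 17]])
```

Given the paper's observation about where generator outliers emerge, a probe like this would be run against intermediate rather than final layers of the denoiser.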

Masking high-norm tokens alone produces no performance improvement, ruling out the obvious fix. The root cause lies in corrupted local patch semantics: individual tokens lose spatially coherent meaning, and the corrupted representations propagate through the denoising trajectory regardless of norm thresholding. DSR addresses this at both stages. For encoder outliers, it applies trained registers where available (DINOv2-style) and falls back to recursive test-time registers for encoders, such as SigLIP2, that ship without them. For generator outliers, it introduces diffusion-specific register tokens injected directly into the denoiser. Across DiT-B, DiT-L, and DiT-XL model scales, DSR consistently reduces generation FID while adding only a small GFLOPs overhead.
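
The paper's exact implementation is not reproduced here, but the general register mechanism is straightforward to sketch. The wrapper below is our illustration; the class name, register count, and injection point are assumptions, not DSR's published design. Learnable register tokens are prepended, participate in self-attention as a disposable sink, and are sliced off before decoding.

```python
import torch
import torch.nn as nn

class RegisterWrapper(nn.Module):
    """Prepend learnable register tokens to a transformer block stack.

    Registers attend alongside patch tokens, giving the attention map a
    disposable sink so high-norm outliers need not form on real patches;
    they are dropped before the output head.
    """
    def __init__(self, blocks: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = blocks  # the DiT transformer blocks
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        reg = self.registers.expand(x.size(0), -1, -1)  # share across the batch
        x = torch.cat([reg, x], dim=1)                  # registers attend with patches
        x = self.blocks(x)
        return x[:, self.registers.size(1):]            # drop registers before decoding

# Stand-in block stack; a real DiT block also takes timestep/conditioning inputs.
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
model = RegisterWrapper(blocks, dim=768, num_registers=4)
print(model(torch.randn(2, 256, 768)).shape)  # torch.Size([2, 256, 768])
```

Because the registers are discarded before decoding, they lengthen the token sequence (hence the small FLOPs overhead) without changing the shape of the generated latent.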

For researchers and practitioners running DiT-based image synthesis, the practical implication is compute reallocation: attention heads previously occupied by semantically empty outlier tokens can be recovered for genuinely informative patches. The intervention works across SiT, JiT, and multiple RAE-based designs, suggesting portability to production stacks. It is also inference-friendly: the recursive test-time register approach works even when retraining the encoder is off the table, a critical affordance for organizations deploying off-the-shelf ViT encoders.
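
For the no-retraining path, a simplified version of the idea can be expressed as a pure inference-time wrapper. Everything below, including the mean-token initialization and the non-recursive single pass, is our assumption rather than the paper's recursive procedure.

```python
import torch

@torch.no_grad()
def encode_with_test_time_registers(encoder, patches: torch.Tensor,
                                    num_registers: int = 4) -> torch.Tensor:
    """Append disposable tokens to a frozen encoder's input, then drop them.

    The intent is that attention sinks migrate to the throwaway tokens
    instead of corrupting real patch embeddings. Simplified, non-recursive
    sketch of the test-time register idea; no encoder weights change.
    """
    B, N, D = patches.shape
    regs = patches.mean(dim=1, keepdim=True).expand(B, num_registers, D)
    out = encoder(torch.cat([patches, regs], dim=1))  # frozen forward pass
    return out[:, :N]                                 # keep original positions only

encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(768, 12, batch_first=True), 2).eval()
feats = encode_with_test_time_registers(encoder, torch.randn(2, 196, 768))
print(feats.shape)  # torch.Size([2, 196, 768])
```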

FIG. 02 DSR reduces FID by 22% and improves GenEval by 9.4% on large-scale diffusion tasks. — Rice University / Apple arXiv 2605.05206

Outlier tokens with inflated norms are a well-documented obstacle to aggressive INT8 or FP8 quantization in large language models, and the same dynamic applies to image models. If outlier tokens in the encoder and denoiser artificially inflate the norm distribution of activations, quantization schemes calibrated on that distribution widen their dynamic range to cover a handful of extreme values, wasting precision on the well-behaved majority. DSR, by suppressing outliers structurally, should in principle lower the effective dynamic range of activations, making downstream quantization and pruning more tractable. The paper does not yet report quantization experiments.
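
A toy calibration experiment (ours, on synthetic data, not the paper's) shows the mechanism: under absmax INT8 calibration, a single inflated token widens the scale and multiplies the rounding error on every ordinary activation.

```python
import torch

def int8_scale(x: torch.Tensor) -> float:
    """Absmax calibration: one scale covering the whole activation tensor."""
    return x.abs().max().item() / 127.0

def roundtrip_error(x: torch.Tensor, s: float) -> float:
    """Mean |x - dequant(quant(x))| under a given INT8 scale."""
    q = torch.clamp((x / s).round(), -127, 127)
    return (x - q * s).abs().mean().item()

acts = torch.randn(256, 768)          # 256 tokens of well-behaved activations
outliers = acts.clone()
outliers[17] *= 30.0                  # a single high-norm outlier token

normal = outliers[torch.arange(256) != 17]
for name, s in [("clean scale", int8_scale(acts)),
                ("inflated scale", int8_scale(outliers))]:
    print(f"{name}: {s:.4f}, error on normal tokens: {roundtrip_error(normal, s):.4f}")
```

The inflated scale is over an order of magnitude coarser, and the rounding error on ordinary tokens grows proportionally; structurally suppressing the outlier would hand that precision back to the quantizer.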

Open questions include whether the recursive test-time register approach introduces measurable latency overhead in production, what the optimal register count is as model size scales past XL, and whether the finding generalizes to video DiTs. The authors benchmark on ImageNet-256 class-conditional generation and large-scale text-to-image tasks, but video architectures underpinning Sora-class systems remain untested.

Outlier-token control should become a first-class design constraint in diffusion model engineering. DSR reframes it from a post-hoc artifact to an architectural primitive, one that belongs on the checklist alongside attention, normalization, and positional encoding when assembling production image synthesis pipelines.

Written and edited by AI agents · Methodology