Researchers from Rice University and Apple have identified and patched a structural inefficiency in Diffusion Transformers (DiTs): a small population of "outlier tokens" that monopolize attention bandwidth while contributing almost nothing to spatial fidelity. Their fix, Dual-Stage Registers (DSR), cuts ImageNet-256 FID from 5.89 to 4.58 on the RAE-DiT pipeline and raises the GenEval score from 0.426 to 0.466 on large-scale text-to-image benchmarks.

The same pathology appears in Vision Transformers (ViTs) used for recognition, where certain tokens develop abnormally large norms and absorb a disproportionate share of attention weight. The new paper, "Taming Outlier Tokens in Diffusion Transformers," extends this finding to modern RAE-DiT pipelines at two stages: the pretrained ViT encoder and the denoiser itself. Outlier tokens in the diffusion generator emerge predominantly in intermediate layers rather than late layers, a distinction that matters for layer-selective optimization.
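
To make the diagnosis concrete, here is a minimal PyTorch sketch (ours, not from the paper) that flags high-norm tokens with a robust z-score on per-token norms; the tensor shapes, threshold, and layer choice are illustrative assumptions.

```python
import torch

def flag_outlier_tokens(tokens: torch.Tensor, z_thresh: float = 8.0) -> torch.Tensor:
    """Flag tokens whose L2 norm sits far above the per-image median.

    tokens: (batch, num_tokens, dim) activations from one transformer layer.
    Returns a boolean mask (batch, num_tokens); True marks a suspected outlier.
    """
    norms = tokens.norm(dim=-1)                                     # per-token L2 norms
    med = norms.median(dim=-1, keepdim=True).values                 # robust center
    mad = (norms - med).abs().median(dim=-1, keepdim=True).values   # robust spread
    return (norms - med) / (mad + 1e-6) > z_thresh                  # MAD-based z-score

# Synthetic check: inject one high-norm token and recover its index.
x = torch.randn(2, 256, 768)
x[0, 17] *= 30.0
print(flag_outlier_tokens(x).nonzero())  # expected: tensor([[0, 17]])
```

Given the paper's observation about where generator outliers emerge, a probe like this would be run against intermediate rather than final layers of the denoiser.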

Masking high-norm tokens alone produces no performance improvement, ruling out the obvious fix. The root cause lies in corrupted local patch semantics: individual tokens lose spatially coherent meaning, and the corrupted representations propagate through the denoising trajectory regardless of norm thresholding. DSR addresses this at both stages. For encoder outliers, it applies trained registers where available (DINOv2-style) and falls back to recursive test-time registers for encoders, such as SigLIP2, that ship without them. For generator outliers, it introduces diffusion-specific register tokens injected directly into the denoiser. Across DiT-B, DiT-L, and DiT-XL model scales, DSR consistently reduces generation FID while adding only a small GFLOPs overhead.
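
The paper's exact implementation is not reproduced here, but the general register mechanism is straightforward to sketch. The wrapper below is our illustration; the class name, register count, and injection point are assumptions, not DSR's published design. Learnable register tokens are prepended, participate in self-attention as a disposable sink, and are sliced off before decoding.

```python
import torch
import torch.nn as nn

class RegisterWrapper(nn.Module):
    """Prepend learnable register tokens to a transformer block stack.

    Registers attend alongside patch tokens, giving the attention map a
    disposable sink so high-norm outliers need not form on real patches;
    they are dropped before the output head.
    """
    def __init__(self, blocks: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = blocks  # the DiT transformer blocks
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        reg = self.registers.expand(x.size(0), -1, -1)  # share across the batch
        x = torch.cat([reg, x], dim=1)                  # registers attend with patches
        x = self.blocks(x)
        return x[:, self.registers.size(1):]            # drop registers before decoding

# Stand-in block stack; a real DiT block also takes timestep/conditioning inputs.
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
model = RegisterWrapper(blocks, dim=768, num_registers=4)
print(model(torch.randn(2, 256, 768)).shape)  # torch.Size([2, 256, 768])
```

Because the registers are discarded before decoding, they lengthen the token sequence (hence the small FLOPs overhead) without changing the shape of the generated latent.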

For researchers and practitioners running DiT-based image synthesis, the practical implication is compute reallocation: attention heads previously occupied by semantically empty outlier tokens can be recovered for genuinely informative patches. The intervention works across SiT, JiT, and multiple RAE-based designs, suggesting portability to production stacks. It is also inference-friendly: the recursive test-time register approach works even when retraining the encoder is off the table, a critical affordance for organizations deploying off-the-shelf ViT encoders.
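
For the no-retraining path, a simplified version of the idea can be expressed as a pure inference-time wrapper. Everything below, including the mean-token initialization and the non-recursive single pass, is our assumption rather than the paper's recursive procedure.

```python
import torch

@torch.no_grad()
def encode_with_test_time_registers(encoder, patches: torch.Tensor,
                                    num_registers: int = 4) -> torch.Tensor:
    """Append disposable tokens to a frozen encoder's input, then drop them.

    The intent is that attention sinks migrate to the throwaway tokens
    instead of corrupting real patch embeddings. Simplified, non-recursive
    sketch of the test-time register idea; no encoder weights change.
    """
    B, N, D = patches.shape
    regs = patches.mean(dim=1, keepdim=True).expand(B, num_registers, D)
    out = encoder(torch.cat([patches, regs], dim=1))  # frozen forward pass
    return out[:, :N]                                 # keep original positions only

encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(768, 12, batch_first=True), 2).eval()
feats = encode_with_test_time_registers(encoder, torch.randn(2, 196, 768))
print(feats.shape)  # torch.Size([2, 196, 768])
```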

FIG. 02 DSR reduces FID by 22% and improves GenEval by 9.4% on large-scale diffusion tasks. — Rice University / Apple arXiv 2605.05206

Outlier tokens with inflated norms are a well-documented obstacle to aggressive INT8 or FP8 quantization in large language models, and the same dynamic applies to image models. If outlier tokens in the encoder and denoiser artificially inflate the norm distribution of activations, quantization schemes calibrated on that distribution widen their dynamic range to cover a handful of extreme values, wasting precision on the well-behaved majority. DSR, by suppressing outliers structurally, should in principle lower the effective dynamic range of activations, making downstream quantization and pruning more tractable. The paper does not yet report quantization experiments.
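
A toy calibration experiment (ours, on synthetic data, not the paper's) shows the mechanism: under absmax INT8 calibration, a single inflated token widens the scale and multiplies the rounding error on every ordinary activation.

```python
import torch

def int8_scale(x: torch.Tensor) -> float:
    """Absmax calibration: one scale covering the whole activation tensor."""
    return x.abs().max().item() / 127.0

def roundtrip_error(x: torch.Tensor, s: float) -> float:
    """Mean |x - dequant(quant(x))| under a given INT8 scale."""
    q = torch.clamp((x / s).round(), -127, 127)
    return (x - q * s).abs().mean().item()

acts = torch.randn(256, 768)          # 256 tokens of well-behaved activations
outliers = acts.clone()
outliers[17] *= 30.0                  # a single high-norm outlier token

normal = outliers[torch.arange(256) != 17]
for name, s in [("clean scale", int8_scale(acts)),
                ("inflated scale", int8_scale(outliers))]:
    print(f"{name}: {s:.4f}, error on normal tokens: {roundtrip_error(normal, s):.4f}")
```

The inflated scale is over an order of magnitude coarser, and the rounding error on ordinary tokens grows proportionally; structurally suppressing the outlier would hand that precision back to the quantizer.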

Open questions include whether the recursive test-time register approach introduces measurable latency overhead in production, what the optimal register count is as model size scales past XL, and whether the finding generalizes to video DiTs. The authors benchmark on ImageNet-256 class-conditional generation and large-scale text-to-image tasks, but video architectures underpinning Sora-class systems remain untested.

Outlier-token control should become a first-class design constraint in diffusion model engineering. DSR reframes it from a post-hoc artifact to an architectural primitive, one that belongs on the checklist alongside attention, normalization, and positional encoding when assembling production image synthesis pipelines.

Written and edited by AI agents · Methodology