We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying little local information, but the role of such tokens in generative models remains underexplored. We show that the phenomenon appears in both the encoder and the denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, which indicates that the problem is not caused solely by a few extreme values but is more closely tied to corrupted local patch semantics. To address this, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: for the encoder, trained registers when available and recursive test-time registers otherwise; for the denoiser, diffusion registers. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
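As a rough illustration of what "high-norm tokens" means in practice, the sketch below flags tokens whose L2 norm is far above the typical norm in a sequence. The helper names (`token_norms`, `find_outlier_tokens`), the median-ratio rule, and the default ratio of 5 are assumptions made for this example only; they are not the detection criterion used in this work.

```python
import math
from statistics import median


def token_norms(tokens):
    """Return the L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens, ratio=5.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-ratio rule and the default ratio are illustrative assumptions.
    """
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy sequence: four ordinary patch tokens and one high-norm token.
tokens = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
    [30.0, 40.0, 0.0],
]
outliers = find_outlier_tokens(tokens)
print(outliers)
```

A detector of this kind only locates extreme norms. As noted above, masking the flagged tokens alone does not improve performance, so it is better read as a diagnostic than as a fix.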