Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally sized set of low-statistic channels has only a marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases, text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
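
To make the disruption probe and the transport operation concrete, the following is a minimal NumPy sketch on synthetic data, not the implementation used in this work. It assumes that hidden states are stored as a (tokens, channels) array and that channels are ranked by their mean absolute activation; the ranking statistic, the function names, the toy dimensions, and the planted channel indices are all illustrative assumptions.

```python
import numpy as np


def channel_statistic(hidden):
    """Mean absolute activation per channel (an assumed ranking statistic).

    hidden: array of shape (tokens, channels).
    """
    return np.abs(hidden).mean(axis=0)


def massive_channels(hidden, k):
    """Indices of the k channels with the largest statistic."""
    return np.argsort(channel_statistic(hidden))[-k:]


def low_channels(hidden, k):
    """Indices of the k channels with the smallest statistic (control set)."""
    return np.argsort(channel_statistic(hidden))[:k]


def zero_channels(hidden, idx):
    """Disruption probe: return a copy with the selected channels zeroed."""
    out = hidden.copy()
    out[:, idx] = 0.0
    return out


def transport(target, source, idx):
    """Copy the selected channels from source into a copy of target."""
    out = target.copy()
    out[:, idx] = source[:, idx]
    return out


# Synthetic hidden states with a few planted high-magnitude channels.
# The sizes and the planted indices are hypothetical, chosen for the demo.
TOKENS, CHANNELS = 16, 64
PLANTED = [3, 17, 42]


def toy_hidden(rng):
    h = rng.normal(size=(TOKENS, CHANNELS))
    h[:, PLANTED] *= 100.0
    return h


rng = np.random.default_rng(0)
source = toy_hidden(rng)
target = toy_hidden(rng)

idx_massive = massive_channels(target, k=3)
idx_low = low_channels(target, k=3)

# Zeroing the massive channels removes most of the activation norm,
# while zeroing an equally sized low-statistic set barely changes it.
norm_full = np.linalg.norm(target)
norm_no_massive = np.linalg.norm(zero_channels(target, idx_massive))
norm_no_low = np.linalg.norm(zero_channels(target, idx_low))

# Transport: massive channels come from the source, the rest from the target.
mixed = transport(target, source, idx_massive)
```

In an actual DiT, these operations would be applied to intermediate hidden states during sampling (for example through forward hooks), and the effect would be judged on the generated images rather than on activation norms; the norm comparison here only illustrates why an equally sized low-statistic set is a natural control.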
