The global self-attention mechanism in diffusion transformers involves redundant computation because visual information is sparse and redundant, and the attention maps of tokens within a spatial window are highly similar. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative-token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an average token over each spatio-temporal window to serve as a proxy token for that region. Global semantics are captured through self-attention among these proxy tokens and then injected into all latent tokens via cross-attention. In parallel, we introduce window and shifted-window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing computational complexity in both image and video generation tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to PixArt-$\alpha$). Visual results and the source code of Qihoo-T2X are available at https://360cvgroup.github.io/Qihoo-T2X/.
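The proxy-token pathway described above can be sketched as follows. This is a minimal, single-head NumPy illustration of the core idea only (window-average proxies, self-attention among proxies, cross-attention injection into all tokens); the function names are ours, and it omits the learned query/key/value projections, multi-head structure, text conditioning, and the window/shifted-window detail branches of the actual PT-DiT block.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention (no learned projections here)
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def proxy_token_attention(x, window):
    """x: (N, C) latent tokens; window: tokens per spatio-temporal window.

    Global attention cost drops from O(N^2) to roughly
    O((N/window)^2 + N * N/window), since full self-attention
    is only computed among the N/window proxy tokens.
    """
    n, c = x.shape
    assert n % window == 0, "token count must be divisible by window size"
    # 1) one proxy token per window: the average of that window's tokens
    proxies = x.reshape(n // window, window, c).mean(axis=1)  # (N/window, C)
    # 2) capture global semantics via self-attention among proxy tokens
    proxies = attention(proxies, proxies, proxies)
    # 3) inject global context into all latent tokens via cross-attention
    return attention(x, proxies, proxies)  # (N, C)

tokens = np.random.randn(16, 8)          # 16 tokens, 8 channels
out = proxy_token_attention(tokens, 4)   # 4 windows -> 4 proxy tokens
```

With 16 tokens and a window of 4, full self-attention would score 16×16 token pairs, whereas the proxy path scores only 4×4 pairs globally plus 16×4 pairs for the injection, which is where the reported complexity reduction comes from.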