The Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In video DiT, 3D spatio-temporal attention increases the token sequence length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, a software-hardware (SW-HW) co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures the global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, so that it occupies only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about a 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.
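To make the idea of similarity-guided token reduction concrete, the following is a minimal sketch, not the DATM algorithm itself, whose details the abstract does not give. It assumes a simple greedy bipartite heuristic (tokens split alternately into source and destination sets) and measures cosine similarity on the previous timestep's output activations rather than on the current tokens. The names `match_and_merge`, `prev_out`, and `num_merge` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def match_and_merge(tokens, prev_out, num_merge):
    """Merge `num_merge` source tokens into their most similar destination token.

    tokens:    current-timestep token vectors (list of lists of floats)
    prev_out:  previous-timestep output activations, in the same token order
               (assumption: these are the similarity features)
    Returns (reduced_tokens, kept_indices, merged_map), where merged_map maps
    each merged source index to the destination index it was folded into.
    """
    n = len(tokens)
    src = list(range(0, n, 2))  # even positions: candidates to be merged away
    dst = list(range(1, n, 2))  # odd positions: tokens that are always kept
    if not dst:
        return [list(t) for t in tokens], list(range(n)), {}

    # For every source token, find its best destination using prev_out features.
    best = []
    for s in src:
        score, d = max((cosine(prev_out[s], prev_out[d]), d) for d in dst)
        best.append((score, s, d))

    # Merge only the most confident pairs.
    best.sort(reverse=True)
    merged_map = {s: d for _, s, d in best[:max(0, num_merge)]}

    # Average each destination with the sources folded into it.
    sums = {i: list(tokens[i]) for i in range(n) if i not in merged_map}
    counts = {i: 1 for i in sums}
    for s, d in merged_map.items():
        sums[d] = [a + b for a, b in zip(sums[d], tokens[s])]
        counts[d] += 1

    kept = sorted(sums)
    reduced = [[v / counts[i] for v in sums[i]] for i in kept]
    return reduced, kept, merged_map


# Toy example: 4 tokens of dimension 2.
tokens = [[2.0, 0.0], [4.0, 0.0], [0.0, 6.0], [0.0, 8.0]]
prev_out = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
reduced, kept, merged_map = match_and_merge(tokens, prev_out, num_merge=1)
```

In this toy case, token 2 is folded into token 3 because their previous-timestep activations are identical, leaving three tokens. Since self-attention cost grows quadratically with the number of tokens, keeping a fraction k of the tokens reduces the attention-score computation to roughly k² of the original, which is why a higher reduction ratio at acceptable matching quality matters.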