Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerating video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens according to their redundancy, improving both acceleration and generation quality. We further propose a matching cache to enable faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising quality.
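To make the reduction-and-restoration idea concrete, below is a minimal NumPy sketch of similarity-based token reduction and restoration, in the spirit of the bipartite soft matching used by token-merging methods. The function name, the alternating source/destination split, and the simple averaging merge rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def reduce_and_restore(tokens, r):
    """Illustrative token reduction/restoration by similarity matching.

    tokens: (N, D) array with N even; r: number of source tokens to merge.
    Returns (reduced, restored): the shortened sequence and a full-length
    sequence where merged tokens are approximated by their matched partners.
    """
    # Split the sequence into alternating source/destination sets.
    src, dst = tokens[0::2], tokens[1::2]

    # Cosine similarity between every source and destination token.
    a = src / np.linalg.norm(src, axis=1, keepdims=True)
    b = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    sim = a @ b.T

    best_dst = sim.argmax(axis=1)           # most similar dst for each src
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]   # the r most redundant src tokens
    keep_idx = np.setdiff1d(np.arange(src.shape[0]), merge_idx)

    # Reduction: fold each merged src token into its matched dst token.
    reduced_dst = dst.copy()
    for i in merge_idx:
        j = best_dst[i]
        reduced_dst[j] = (reduced_dst[j] + src[i]) / 2
    reduced = np.concatenate([src[keep_idx], reduced_dst], axis=0)

    # Restoration: rebuild the full-length sequence, filling merged slots
    # with their matched destination tokens.
    restored = tokens.copy()
    restored[1::2] = reduced_dst
    restored[0::2][merge_idx] = reduced_dst[best_dst[merge_idx]]
    return reduced, restored
```

Because redundancy varies across features, blocks, and timesteps, an asymmetric schedule would choose a different `r` per component rather than a single global value.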