Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough, driven by training transformers or diffusion models on large-scale datasets. Nevertheless, running inference with such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining, we explore the inference process of two mainstream T2V models, one transformer-based and one diffusion-based. The exploration reveals redundancy in the temporal attention modules of both models, which are commonly used to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values rank below a certain ratio, the corresponding weights are pruned. Extensive experiments on three datasets using a classic transformer-based model, CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning in inference acceleration, quality assurance, and broad applicability.
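The rank-based pruning rule described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how temporal attention values are aggregated or at what granularity weights are pruned. The sketch assumes that aggregation means summing the attention each key frame receives over all heads and query frames, and that pruning means zeroing the lowest-ranked key frames and renormalising. The function names `prune_mask_from_attention` and `apply_prune` and the parameter `ratio` are hypothetical.

```python
import numpy as np


def prune_mask_from_attention(attn, ratio):
    """Return a boolean keep-mask over key frames.

    attn  : array of shape (heads, frames, frames) holding softmax-normalised
            temporal attention (rows = query frames, columns = key frames).
    ratio : fraction of key frames to prune, in [0, 1).
    """
    # Assumed aggregation: total attention received by each key frame.
    scores = attn.sum(axis=(0, 1))  # shape: (frames,)
    n_prune = int(np.floor(ratio * scores.size))
    keep = np.ones(scores.size, dtype=bool)
    if n_prune > 0:
        # Frames whose aggregate score ranks in the lowest `ratio` are pruned.
        lowest = np.argsort(scores, kind="stable")[:n_prune]
        keep[lowest] = False
    return keep


def apply_prune(attn, keep):
    """Zero the pruned key-frame columns and renormalise each attention row."""
    pruned = attn * keep[None, None, :]
    denom = pruned.sum(axis=-1, keepdims=True)
    # Guard against division by zero if every key frame in a row was pruned.
    return pruned / np.clip(denom, 1e-12, None)


if __name__ == "__main__":
    # One head, four frames; every query frame attends with the same weights.
    row = np.array([0.1, 0.2, 0.3, 0.4])
    attn = np.tile(row, (1, 4, 1))
    keep = prune_mask_from_attention(attn, ratio=0.5)
    print(keep)
    print(apply_prune(attn, keep)[0, 0])
```

In a real model, the saving would come from skipping the computation for pruned entries rather than masking them afterwards; the mask here only makes the ranking step concrete.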