Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work. Motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth. Gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood, so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. Our method, FastVMT, achieves a 3.43x speedup on average without degrading the visual fidelity or the temporal consistency of the generated videos.
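To make the idea of local-neighborhood masking concrete, the following is a minimal sketch and not the FastVMT implementation. It assumes tokens lie on a `height x width` grid in row-major order and that "local" means a square window of a hypothetical `radius` around each query position. The function names and the window shape are illustrative assumptions.

```python
from itertools import product


def local_attention_mask(height, width, radius):
    """Build a boolean mask over query-key pairs on a height x width token grid.

    mask[q][k] is True when key position k lies within `radius` (Chebyshev
    distance) of query position q. Queries and keys are indexed in row-major
    order. `radius` is an illustrative parameter, not a value from the paper.
    """
    positions = list(product(range(height), range(width)))
    return [
        [max(abs(qy - ky), abs(qx - kx)) <= radius for (ky, kx) in positions]
        for (qy, qx) in positions
    ]


def kept_fraction(mask):
    """Fraction of query-key pairs for which a weight is still computed."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


mask = local_attention_mask(height=8, width=8, radius=1)
print(f"fraction of pairs kept: {kept_fraction(mask):.3f}")
```

In this toy setting, most query-key pairs fall outside the window, which is the source of the savings: when motion between frames is small, the matching region for a token is expected to be nearby, so distant pairs contribute little.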
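Gradient reuse can be illustrated in a similarly hedged way. The sketch below assumes a fixed refresh schedule (recompute every `refresh_every` steps, reuse the cached gradient otherwise); the abstract does not state the criterion FastVMT uses to decide when a gradient computation can be skipped, so the schedule, the toy quadratic objective, and all names here are assumptions for illustration only.

```python
def guided_updates(latent, grad_fn, num_steps, refresh_every, step_size):
    """Apply `num_steps` gradient updates to `latent`.

    The gradient is recomputed only every `refresh_every` steps; in between,
    the cached gradient from the last refresh is reused. Returns the final
    latent and the number of gradient evaluations performed.
    """
    cached_grad = None
    evaluations = 0
    for step in range(num_steps):
        if cached_grad is None or step % refresh_every == 0:
            cached_grad = grad_fn(latent)  # the expensive call in a real model
            evaluations += 1
        latent = [x - step_size * g for x, g in zip(latent, cached_grad)]
    return latent, evaluations


# Toy objective 0.5 * ||latent - target||^2, whose gradient is latent - target.
target = [1.0, -2.0, 0.5]


def toy_grad(latent):
    return [x - t for x, t in zip(latent, target)]


start = [0.0, 0.0, 0.0]
full, full_evals = guided_updates(start, toy_grad, 20, 1, 0.1)
reused, reused_evals = guided_updates(start, toy_grad, 20, 4, 0.1)
print(f"gradient evaluations: {full_evals} (every step) vs {reused_evals} (reused)")
```

The toy objective only shows the bookkeeping: the number of gradient evaluations drops while the iterate still moves toward the target. Whether reuse preserves quality in an actual diffusion model depends on how slowly the gradients change, which is the empirical observation the abstract relies on.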