Since videos record objects moving coherently, adjacent video frames share common signals (similar object appearances) while retaining unique signals (slightly changed postures). To avoid redundantly modeling the common video signals, we propose a novel diffusion-based framework, named COMUNI, which decomposes the COMmon and UNIque video signals to enable efficient video generation. Our approach separates the decomposition of video signals from the task of video generation, thus reducing the computational complexity of the generative models. In particular, we introduce CU-VAE to decompose video signals and encode them into latent features. To train CU-VAE in a self-supervised manner, we employ a cascading merge module to reconstitute video signals and a time-agnostic video decoder to reconstruct video frames. We then propose CU-LDM to model the latent features for video generation, adopting two dedicated diffusion streams that simultaneously model the common and unique latent features. We further utilize additional joint modules for cross-modeling of the common and unique latent features, and a novel position embedding method to ensure the content consistency and motion coherence of generated videos. The position embedding method incorporates spatial and temporal absolute position information into the joint modules. Extensive experiments demonstrate the necessity of decomposing common and unique video signals for video generation, as well as the effectiveness and efficiency of our proposed method.
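To make the common/unique decomposition concrete, here is a minimal toy sketch. It is not the paper's CU-VAE (which learns the split with an encoder and a cascading merge module); it merely illustrates the information being separated, using the temporal mean of a clip as the "common" signal and the per-frame residuals as the "unique" signals. All function names here are hypothetical.

```python
import numpy as np

def decompose(frames):
    """Toy stand-in for CU-VAE's split (not the learned model).

    frames: (T, H, W) array of a short clip.
    Returns (common, unique) where common is shared across frames
    and unique holds each frame's deviation from it.
    """
    common = frames.mean(axis=0, keepdims=True)  # (1, H, W): shared content
    unique = frames - common                     # (T, H, W): per-frame changes
    return common, unique

def reconstitute(common, unique):
    """Toy analogue of the cascading merge: recombine the two signals."""
    return common + unique

# In this linear toy case the split is lossless, so reconstruction is exact.
T, H, W = 4, 8, 8
rng = np.random.default_rng(0)
frames = rng.standard_normal((T, H, W))
common, unique = decompose(frames)
assert common.shape == (1, H, W)
assert np.allclose(reconstitute(common, unique), frames)
```

The point of the split is efficiency: the common signal is stored and modeled once per clip rather than once per frame, so a generative model only spends per-frame capacity on the small unique residuals.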