Video composition aims to integrate a specified foreground and background from different videos into a harmonious composite. Current approaches, trained predominantly on videos whose foreground color and lighting have been adjusted, handle such superficial differences but struggle with deeper semantic disparities, such as domain gaps. We therefore propose a training-free pipeline that employs a pre-trained diffusion model, whose semantic prior knowledge allows it to process composite videos with broader semantic disparities. Specifically, we process the video frames in a cascading manner and handle each frame with the diffusion model in two processes. In the inversion process, we propose Balanced Partial Inversion to obtain initial points for generation that balance reversibility and modifiability. In the generation process, we further propose Inter-Frame Augmented attention to strengthen foreground continuity across frames. Experimental results show that our pipeline ensures the visual harmony and inter-frame coherence of the outputs, demonstrating its efficacy in handling broader semantic disparities.
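As a rough illustration of the inter-frame idea described in the abstract, the sketch below lets the current frame's queries attend to its own keys and values plus those of the previous frame's foreground tokens. It is an assumed, simplified form only (single head, plain NumPy, no diffusion model). The function name `inter_frame_attention`, the foreground mask `fg_prev`, and the concatenation scheme are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_frame_attention(q_cur, k_cur, v_cur, k_prev, v_prev, fg_prev):
    """Hypothetical sketch of attention augmented with previous-frame tokens.

    q_cur, k_cur, v_cur: (n, d) arrays for the current frame's tokens.
    k_prev, v_prev:      (m, d) arrays for the previous frame's tokens.
    fg_prev:             (m,) boolean mask marking previous-frame foreground tokens.

    Assumption: the augmentation is a plain concatenation of the previous
    frame's foreground keys/values onto the current frame's keys/values.
    """
    # Extend the key/value set with the previous frame's foreground tokens.
    k = np.concatenate([k_cur, k_prev[fg_prev]], axis=0)
    v = np.concatenate([v_cur, v_prev[fg_prev]], axis=0)

    # Standard scaled dot-product attention over the extended set.
    scores = q_cur @ k.T / np.sqrt(q_cur.shape[-1])
    return softmax(scores, axis=-1) @ v
```

In a cascading setup, the keys and values stored for frame t-1 would be passed in when frame t is generated. With an empty foreground mask, the function reduces to ordinary self-attention on the current frame.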