Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward, and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet it remains challenging to generate videos of the same scene from multiple camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that uses an epipolar attention mechanism to promote consistency between corresponding frames of the same video rendered from different camera poses. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. Project page: https://collaborativevideodiffusion.github.io/.
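To illustrate the geometric idea behind epipolar attention, the following is a minimal sketch and not the CVD implementation. It assumes a single pair of corresponding frames, shared pinhole intrinsics `K`, a relative pose convention `X_b = R @ X_a + t`, and a hard distance threshold; the function names `fundamental_matrix`, `epipolar_mask`, and `masked_attention` are hypothetical. Each pixel in one view is restricted to attend only to pixels of the other view that lie near its epipolar line.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_matrix(K, R, t):
    """Fundamental matrix F with x_b^T F x_a = 0.

    Assumes both views share intrinsics K and that a point X_a in the
    camera frame of view A maps to X_b = R @ X_a + t in view B.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv


def epipolar_mask(F, height, width, threshold):
    """Boolean (H*W, H*W) mask between pixels of view A and view B.

    Entry [i, j] is True when pixel j of view B lies within `threshold`
    pixels of the epipolar line induced by pixel i of view A.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    ones = np.ones(height * width)
    pix = np.stack([xs.ravel(), ys.ravel(), ones], axis=1)  # homogeneous, (N, 3)
    lines = pix @ F.T  # row i is the epipolar line F @ x_i in view B
    norm = np.maximum(np.linalg.norm(lines[:, :2], axis=1, keepdims=True), 1e-12)
    dist = np.abs(lines @ pix.T) / norm  # point-to-line distances, (N, N)
    return dist <= threshold


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out keys get zero weight.

    Queries with no permitted key produce an all-zero output row.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    row_max = scores.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)
    weights = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)
    return weights @ v
```

In this sketch, a purely horizontal camera translation with `R = np.eye(3)` yields horizontal epipolar lines, so each pixel attends only to pixels in the same image row of the other view. A learned module would more likely operate on latent feature maps and may use a soft weighting rather than a hard mask; those details are not specified here.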