Modeling scenes with video generation models has attracted growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, which compromises completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representations, enabling long-horizon scene wandering. Our framework begins with a preview stage, in which a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.