This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation. The code and pretrained models are open-sourced at https://github.com/AIDC-AI/Omni-View.