This paper introduces MVDiffusion, a simple yet effective multi-view image generation method for scenarios where pixel-to-pixel correspondences are available, such as perspective crops from a panorama or multi-view images with known geometry (depth maps and poses). Unlike prior models that rely on iterative image warping and inpainting, MVDiffusion generates all images concurrently with global awareness, producing high-resolution, content-rich outputs and effectively addressing the error accumulation prevalent in those models. MVDiffusion incorporates a correspondence-aware attention mechanism that enables effective cross-view interaction. This mechanism underpins three pivotal modules: 1) a generation module that produces low-resolution images while maintaining global correspondence, 2) an interpolation module that densifies spatial coverage between images, and 3) a super-resolution module that upscales them into high-resolution outputs. For panoramic imagery, MVDiffusion can generate high-resolution photorealistic images of up to 1024$\times$1024 pixels. For geometry-conditioned multi-view image generation, MVDiffusion is the first method capable of generating a textured map of a scene mesh. The project page is at https://mvdiffusion.github.io.
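To illustrate why pixel-to-pixel correspondences are available for perspective crops of a panorama, the following minimal sketch maps a pixel from one crop to another. It is not the MVDiffusion implementation. It assumes both crops share a camera center and intrinsics and differ only by a yaw rotation, in which case the mapping is the standard homography $K R_{dst}^{\top} R_{src} K^{-1}$. The function names (`intrinsics`, `yaw_rotation`, `warp_pixel`), the 90-degree field of view, the 512-pixel crop size, and the 45-degree yaw step are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def intrinsics(fov_deg, size):
    """Pinhole intrinsics for a square crop with the given horizontal FOV (hypothetical helper)."""
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)
    c = size / 2
    return np.array([[f, 0.0, c], [0.0, f, c], [0.0, 0.0, 1.0]])

def yaw_rotation(deg):
    """Camera-to-world rotation about the vertical (y) axis."""
    t = np.radians(deg)
    return np.array([[np.cos(t), 0.0, np.sin(t)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(t), 0.0, np.cos(t)]])

def warp_pixel(uv, K, R_src, R_dst):
    """Map pixel uv in the source crop to the destination crop.

    Returns None when the viewing ray points behind the destination camera.
    """
    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # back-project to a ray
    ray_world = R_src @ ray_cam                                  # source camera -> world
    ray_dst = R_dst.T @ ray_world                                # world -> destination camera
    if ray_dst[2] <= 0:
        return None
    p = K @ ray_dst                                              # project into destination
    return p[:2] / p[2]

# Assumed setup: 512x512 crops with a 90-degree FOV, neighbouring views 45 degrees apart.
K = intrinsics(90, 512)
src, dst = yaw_rotation(0), yaw_rotation(45)
print(warp_pixel((256, 256), K, src, dst))  # approximately (0, 256): the left edge of the next view
```

Under these assumptions, the center of one crop lands on the left edge of its neighbour, so overlapping regions of adjacent crops are related by a closed-form mapping. A correspondence-aware mechanism can use such a mapping to decide which pixels in other views a given pixel should attend to.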