We introduce a diffusion-based framework that performs aligned novel-view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images, or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to estimate partial geometry as seen from the reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach yields synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating the partial point cloud into a mesh and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis for both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
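The core of cross-modal attention distillation is that the geometry branch does not compute its own attention map: it reuses the attention probabilities of the image branch, so both modalities attend to the same spatial locations. The sketch below illustrates this idea on a single toy attention layer; all shapes and names (`q_img`, `v_geo`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of cross-modal attention distillation (hypothetical names/shapes):
# the image branch computes self-attention probabilities once, and the geometry
# branch reuses those probabilities with its own value projections.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(q, k):
    # Scaled dot-product attention weights, shape (tokens, tokens).
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def distilled_attention(q_img, k_img, v_img, v_geo):
    """Image branch attends normally; the geometry branch reuses the image
    branch's attention map (the distillation step) with geometry values."""
    probs = attention_probs(q_img, k_img)  # computed once, in the image branch
    out_img = probs @ v_img                # standard image attention output
    out_geo = probs @ v_geo                # injected into the geometry branch
    return out_img, out_geo

rng = np.random.default_rng(0)
t, d = 6, 8  # tokens, channels (toy sizes)
q, k, v_img, v_geo = (rng.standard_normal((t, d)) for _ in range(4))
out_img, out_geo = distilled_attention(q, k, v_img, v_geo)
print(out_img.shape, out_geo.shape)  # both (6, 8): spatially aligned outputs
```

Because the two outputs are mixed with identical weights, features at the same token positions are aggregated identically across modalities, which is what encourages pixel-level alignment between the generated image and geometry.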