A Single Image and Multimodality Is All You Need for Novel View Synthesis

Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
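
To make the localized Gaussian Process idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a squared-exponential kernel over (azimuth, elevation) angles in radians, a constant local mean, and a k-nearest-neighbour rule for localization. The function names (`rbf_kernel`, `local_gp_depth`) and all hyperparameter values are hypothetical choices for illustration. It shows how a handful of range returns can yield both a depth estimate and a per-query variance that grows where observations are absent.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between angle sets a (n, 2) and b (m, 2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)


def local_gp_depth(obs_angles, obs_depth, query_angles, k=8,
                   length_scale=0.05, signal_var=25.0, noise_var=0.01):
    """Predict depth mean and variance per query angle from its k nearest returns.

    All hyperparameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    k = min(k, len(obs_angles))
    means = np.empty(len(query_angles))
    variances = np.empty(len(query_angles))
    for i, q in enumerate(query_angles):
        # Localization: keep only the k angularly closest range measurements,
        # so each solve is k x k instead of n x n.
        idx = np.argsort(((obs_angles - q) ** 2).sum(axis=1))[:k]
        X, y = obs_angles[idx], obs_depth[idx]
        mu = y.mean()  # constant local mean (assumption of this sketch)

        K = rbf_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(k)
        k_star = rbf_kernel(q[None, :], X, length_scale, signal_var)[0]

        # Standard GP posterior via a Cholesky factorization of K.
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
        v = np.linalg.solve(L, k_star)

        means[i] = mu + k_star @ alpha
        # Clip tiny negative values caused by floating-point round-off.
        variances[i] = max(signal_var - v @ v, 0.0)
    return means, variances


# Synthetic example: a 5 x 5 grid of range returns on a tilted plane.
grid = np.linspace(-0.2, 0.2, 5)
obs_angles = np.array([[az, el] for az in grid for el in grid])
obs_depth = 10.0 + 20.0 * obs_angles[:, 0]

# One query on top of a measurement, one far from every measurement.
query_angles = np.array([[0.0, 0.0], [3.0, 3.0]])
mean, var = local_gp_depth(obs_angles, obs_depth, query_angles)
```

In this sketch the first query coincides with a measurement, so its variance is close to the noise level, while the second lies far from all measurements and its variance reverts to the prior `signal_var`. A per-pixel variance map of this kind is the sort of uncertainty signal the abstract describes passing to the rendering pipeline alongside the depth.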
