We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition is a considerable challenge in computer vision because of the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate a probabilistic formulation: instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and greatly improves generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by $1.5$ dB in PSNR and by $45\%$ in FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.
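To put the reported PSNR gain in perspective, the following is a minimal sketch of the standard PSNR definition, assuming albedo maps stored as floating-point arrays in $[0, 1]$. The function names `psnr` and `mse_ratio_for_gain` are illustrative and not taken from the paper, and the evaluation protocol actually used (color space, masking, scale alignment) is not specified here.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (assumes values in [0, max_val])."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # Hypothetical albedo maps with a uniform error of 0.1 (illustration only).
    target = np.full((4, 4, 3), 0.5)
    pred = np.full((4, 4, 3), 0.6)
    print(f"PSNR: {psnr(pred, target):.2f} dB")
    print(f"MSE ratio for +1.5 dB: {mse_ratio_for_gain(1.5):.3f}")
```

Under this definition, a $1.5$ dB improvement corresponds to a mean squared error that is roughly $29\%$ lower, since $10^{-0.15} \approx 0.71$.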