Recent advances in generative models such as Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether a network models a number of physical 'properties' of the 3D scene by probing for explicit features that represent these properties. The probes are applied to datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures. (iii) We find that Stable Diffusion is good at a number of properties, including scene geometry, support relations, shadows, and depth, but performs less well on occlusion. (iv) We also apply the probes to other models trained at large scale, including DINO and CLIP, and find their performance inferior to that of Stable Diffusion.
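The probing idea in contribution (i) can be illustrated with a minimal sketch. It assumes the probe is a linear classifier trained on frozen features, which the abstract does not specify, and it uses synthetic vectors as a hypothetical stand-in for network features paired with a binary property label (for example, "region A supports region B"). All function names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def make_synthetic_features(n, dim, rng):
    """Hypothetical stand-in for frozen network features with a binary label.

    Each class is a Gaussian cloud shifted along a fixed direction, so the
    property is linearly decodable by construction.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        feats.append([sign * 0.5 * d + rng.gauss(0.0, 1.0) for d in direction])
        labels.append(y)
    return feats, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(feats, labels, epochs=200, lr=0.1):
    """Fit logistic regression by full-batch gradient descent.

    Only the probe weights are learned; the features are treated as fixed.
    """
    dim = len(feats[0])
    n = len(feats)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, feats, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(feats, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_synthetic_features(400, 16, rng)
test_x, test_y = make_synthetic_features(200, 16, rng)
w, b = train_linear_probe(train_x, train_y)
acc = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {acc:.3f}")
```

In an actual evaluation, the synthetic vectors would be replaced by features extracted from the network under study on annotated real images, and held-out accuracy would then indicate how readily the property can be read out from those features.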