Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects in the 3D scene that produced it, requires a comprehensive understanding of the environment. This requirement limits the ability of existing, carefully engineered approaches to generalize across domains. Inspired by the zero-shot ability of large language models (LLMs) to generalize to novel contexts, we investigate the possibility of leveraging the broad world knowledge encoded in such models to solve inverse-graphics problems. To this end, we propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM that autoregressively decodes a visual embedding into a structured, compositional 3D-scene representation. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training. Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the use of image-space supervision. Our analysis opens up new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. To ensure the reproducibility of our investigation and to facilitate future research, we will release our code and data at https://ig-llm.is.tue.mpg.de/
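As a rough illustration of how a continuous numeric head might sit alongside next-token prediction, the following minimal sketch decodes a short scene description in which numeric values come from a regression head rather than from discrete digit tokens. It is not the authors' implementation: the `[NUM]` placeholder token, the linear form of the head, the `Step` container, and the `add(shape=..., x=...)` scene syntax are all assumptions made for this example, and the hidden states are hand-written stand-ins for what a real LLM would produce.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical placeholder token: when the language model emits it, the
# value is read from the numeric head instead of from the vocabulary.
NUM_TOKEN = "[NUM]"


@dataclass
class Step:
    """One decoding step: the emitted token and a stand-in hidden state."""
    token: str
    hidden: Sequence[float]


def linear_numeric_head(weights: Sequence[float], bias: float) -> Callable[[Sequence[float]], float]:
    """Build a toy linear head mapping a hidden state to one scalar.

    Assumption: a real head would be a learned module trained jointly
    with the language model; fixed weights are used here for clarity.
    """
    def head(hidden: Sequence[float]) -> float:
        return sum(w * h for w, h in zip(weights, hidden)) + bias
    return head


def decode_scene(steps: List[Step], numeric_head: Callable[[Sequence[float]], float]) -> str:
    """Assemble a structured scene string from mixed discrete/continuous steps."""
    pieces = []
    for step in steps:
        if step.token == NUM_TOKEN:
            # Continuous value regressed from the hidden state.
            pieces.append(f"{numeric_head(step.hidden):.3f}")
        else:
            # Ordinary discrete token, copied through unchanged.
            pieces.append(step.token)
    return "".join(pieces)


# Toy usage with hand-written tokens and hidden states.
head = linear_numeric_head(weights=[0.5, 0.25], bias=0.0)
steps = [
    Step("add(shape='cube', x=", [0.0, 0.0]),
    Step(NUM_TOKEN, [1.0, 2.0]),
    Step(")", [0.0, 0.0]),
]
scene = decode_scene(steps, head)
print(scene)  # add(shape='cube', x=1.000)
```

The point of the sketch is the split of responsibilities: discrete tokens carry the compositional structure of the scene, while continuous quantities are regressed directly, which avoids forcing a metric value through a categorical vocabulary.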