Vision as inverse graphics, the reconstruction of an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong vision-language models (VLMs) cannot achieve this in one shot, as they lack fine-grained spatial and physical grounding. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Building on this insight, we present VIGA (Vision-as-Inverse-Graphics Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates between generator and verifier roles and (ii) an evolving context memory that records plans, code diffs, and render history. VIGA is task-agnostic: it requires no auxiliary modules and covers a wide range of tasks, including 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, VIGA substantially improves over one-shot baselines on BlenderGym (by 35.32%) and SlideBench (by 117.17%). VIGA is also model-agnostic: it requires no finetuning, enabling a unified protocol for evaluating heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, on which VIGA improves by 124.70%.
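The closed write-run-render-compare-revise loop can be sketched as follows. This is a toy, self-contained illustration in the spirit of the abstract, not the authors' actual implementation: every name here (`Memory`, `run_loop`, `render_scene`, the toy generator and verifier) is a hypothetical stand-in, and the real system would invoke a graphics engine such as Blender in place of the stub renderer.

```python
class Memory:
    """Evolving context: plans, code diffs, and render history."""
    def __init__(self):
        self.plans, self.diffs, self.renders = [], [], []

def render_scene(scene_code):
    # Stand-in for running the graphics program in an engine and rendering it.
    return sorted(scene_code)

def run_loop(target_render, generator, verifier, max_steps=10):
    scene_code = []                # start from an empty world
    memory = Memory()
    for _ in range(max_steps):
        diff = generator(scene_code, target_render, memory)   # generator role: write
        scene_code = scene_code + diff
        memory.diffs.append(diff)
        render = render_scene(scene_code)                     # run + render
        memory.renders.append(render)
        ok, feedback = verifier(render, target_render)        # verifier role: compare
        if ok:
            break
        memory.plans.append(feedback)                         # revise on the next turn
    return scene_code, memory

# Toy roles: the generator adds one missing object per turn; the verifier
# accepts when the render exactly matches the target.
target = ["add_cube", "add_light", "add_camera"]
gen = lambda code, tgt, mem: [o for o in tgt if o not in code][:1]
ver = lambda render, tgt: (render == sorted(tgt), "objects still missing")

final_code, mem = run_loop(target, gen, ver)
```

The alternation of roles is the point: the same loop body plays generator when it writes a diff and verifier when it compares the render against the target, and the memory carries plans, diffs, and renders across iterations to support long-horizon revision.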