Vision-as-inverse-graphics, the task of reconstructing images as editable programs, remains challenging for Vision-Language Models (VLMs), which lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework in which symbolic logic and visual perception cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: it synthesizes a symbolic program, renders it into a visual state, and inspects the discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. The framework is training-free and task-agnostic, and it supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy over one-shot baselines on BlenderGym (by 35.32%), SlideBench (by 117.17%), and our proposed BlenderBench (by 124.70%).
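To make the code-render-inspect loop concrete, the following is a minimal sketch under heavy simplifying assumptions; it is not VIGA's implementation. The "program" is a dictionary describing one square on an 8x8 canvas, the renderer is a few lines of Python rather than Blender, and a rule-based bounding-box comparison stands in for the VLM inspector. All names (`render`, `inspect`, `edit`, `run_loop`) and the memory format are hypothetical and chosen only for illustration.

```python
SIZE = 8  # toy canvas size (assumption for this sketch)


def render(program):
    """Project a symbolic program (square at x, y with a side length) onto a binary grid."""
    x, y, side = program["x"], program["y"], program["side"]
    return [[1 if x <= c < x + side and y <= r < y + side else 0
             for c in range(SIZE)] for r in range(SIZE)]


def bbox(grid):
    """Recover (x, y, side) from pixels; assumes one non-empty, unclipped square."""
    cols = [c for row in grid for c, v in enumerate(row) if v]
    rows = [r for r, row in enumerate(grid) if any(row)]
    return min(cols), min(rows), max(cols) - min(cols) + 1


def inspect(rendered, target):
    """Stand-in for the visual inspector: report pixel mismatch and per-parameter offsets."""
    mismatch = sum(a != b for ra, rb in zip(rendered, target) for a, b in zip(ra, rb))
    (rx, ry, rs), (tx, ty, ts) = bbox(rendered), bbox(target)
    return {"mismatch": mismatch, "x": tx - rx, "y": ty - ry, "side": ts - rs}


def edit(program, feedback):
    """Apply one small, evidence-based edit per parameter (a step of -1, 0, or +1)."""
    sign = lambda d: (d > 0) - (d < 0)
    return {k: program[k] + sign(feedback[k]) for k in program}


def run_loop(target, program, max_steps=20):
    """Code-render-inspect loop with a simple memory of (step, program, mismatch)."""
    memory = []
    for step in range(max_steps):
        image = render(program)            # render the current program
        feedback = inspect(image, target)  # compare it with the target image
        memory.append((step, dict(program), feedback["mismatch"]))
        if feedback["mismatch"] == 0:      # stop once the render matches the target
            break
        program = edit(program, feedback)  # otherwise revise the program
    return program, memory


# Usage: the target is given only as an image; the loop recovers its parameters.
target_image = render({"x": 4, "y": 3, "side": 3})
final_program, memory = run_loop(target_image, {"x": 0, "y": 0, "side": 2})
print(final_program, "after", len(memory), "iterations")
```

The point of the sketch is the control flow: the program is never corrected in one shot, and each edit is driven by a discrepancy measured on the rendered output. In a real system, the inspector's feedback would be far less structured than these exact offsets.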