Program code bridges vision and logic: through geometric operations such as auxiliary line construction and perspective transformation, it offers a feasible form of supervision for enhancing the multimodal reasoning capability of large models. However, current inverse graphics methods struggle to reconstruct complex geometric details accurately, and often lose key geometric constraints or distort the structure. To address this bottleneck, we propose Geo-coder, the first inverse programming framework for geometric images based on a multi-agent system. Our method decouples the process into two stages, geometric modeling via pixel-wise anchoring and metric-driven code evolution. Stage 1 combines the complementary strengths of visual operators and large models to capture pixel coordinates and visual attributes precisely. Stage 2 introduces a synthesis-rendering-validation closed loop in which bidirectional visual feedback drives the self-correction of the code. Extensive experiments show that Geo-coder achieves a substantial lead in both geometric reconstruction accuracy and visual consistency. Notably, because the core geometric semantics are preserved, images reconstructed with our method perform on par with the originals in multimodal reasoning tasks, which supports the robustness of the framework. Finally, to reduce research costs, we have open-sourced the Geo-coder dataset, constructed on the GeoCode framework and containing more than 1,500 samples. On this basis, we have also open-sourced the GeocodeLM model, providing a data and model foundation for subsequent research in this field.
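To make the synthesis-rendering-validation loop of Stage 2 concrete, the following is a minimal Python sketch of the general idea. It is not the Geo-coder implementation. Everything in it is an assumption made for illustration: the "program" is a list of line segments, the renderer is a toy rasterizer, the metric is intersection-over-union (IoU), "bidirectional" feedback is interpreted as the two pixel sets `missing` (in the target but not rendered) and `extra` (rendered but not in the target), and a seeded random hill climb stands in for the multi-agent code revision. The names `render`, `validate`, and `refine` are hypothetical.

```python
import random

# Hypothetical program representation: one (x0, y0, x1, y1) tuple per segment.
Segment = tuple[int, int, int, int]
Pixels = set[tuple[int, int]]


def render(program: list[Segment], size: int) -> Pixels:
    """Toy renderer: rasterize each segment into a set of lit pixels."""
    pixels: Pixels = set()
    for x0, y0, x1, y1 in program:
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = round(x0 + (x1 - x0) * i / steps)
            y = round(y0 + (y1 - y0) * i / steps)
            if 0 <= x < size and 0 <= y < size:
                pixels.add((x, y))
    return pixels


def validate(rendered: Pixels, target: Pixels) -> tuple[float, Pixels, Pixels]:
    """Return IoU plus feedback in both directions (assumed interpretation)."""
    missing = target - rendered   # target pixels the program fails to draw
    extra = rendered - target     # drawn pixels that are absent from the target
    union = rendered | target
    iou = len(rendered & target) / len(union) if union else 1.0
    return iou, missing, extra


def refine(program: list[Segment], target: Pixels, size: int,
           max_iters: int = 2000, seed: int = 0) -> tuple[list[Segment], float]:
    """Metric-driven loop: propose an edit, render, validate, keep if not worse.

    A random one-pixel nudge replaces the agent-driven code edits here.
    """
    best = list(program)
    best_score, _, _ = validate(render(best, size), target)
    if not best:
        return best, best_score
    rng = random.Random(seed)
    for _ in range(max_iters):
        if best_score == 1.0:
            break
        candidate = list(best)
        idx = rng.randrange(len(candidate))
        seg = list(candidate[idx])
        coord = rng.randrange(4)
        seg[coord] = min(size - 1, max(0, seg[coord] + rng.choice((-1, 1))))
        candidate[idx] = (seg[0], seg[1], seg[2], seg[3])
        score, _, _ = validate(render(candidate, size), target)
        if score >= best_score:  # accept ties so the search can cross plateaus
            best, best_score = candidate, score
    return best, best_score


# Usage: a triangle as the target, and a slightly misplaced initial program.
SIZE = 16
target = render([(2, 2, 12, 2), (12, 2, 7, 11), (7, 11, 2, 2)], SIZE)
initial = [(3, 3, 11, 2), (12, 3, 7, 10), (6, 11, 2, 3)]
initial_iou, missing, extra = validate(render(initial, SIZE), target)
refined, final_iou = refine(initial, target, SIZE)
print(f"IoU before: {initial_iou:.2f}, after: {final_iou:.2f}")
```

Because a candidate is accepted only when its score is at least as high as the current best, the metric never decreases across iterations. A real system would presumably use the `missing` and `extra` sets to guide targeted edits rather than random nudges; the sketch only returns them from `validate`.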