Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality geometric image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to whatever images are available or rely on inefficient random exploration with post-hoc filtering, decoupling data generation from the model's learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts under reflective feedback (Reflect for solvability, RePI for visual validity), ensuring the purity of image-text pairs. The Solver agent optimizes its reasoning through preference learning, and its failure paths guide the Teacher's targeted augmentation. Independently, the Generator learns image generation from the accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves an average score of 49.11 across six benchmarks using one-quarter of the baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, setting a new state of the art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
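To make the "image-code-instruction" triplet concrete, the following is a minimal, hypothetical sketch of a Teacher-style parameterized drawing script. The function name, the SVG rendering target, and the built-in answer check (a stand-in for the Reflect solvability test) are all illustrative assumptions, not the paper's actual implementation:

```python
import random

def make_triangle_item(seed: int) -> dict:
    """Sample triangle parameters, render a figure programmatically, and
    emit one (image, code-parameters, instruction) training triplet."""
    rng = random.Random(seed)
    # Parameterized geometry: legs of a right triangle, sampled per seed.
    a, b = rng.randint(3, 9), rng.randint(3, 9)
    # Programmatic drawing (stdlib-only SVG stands in for a plotting library).
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120">'
        f'<polygon points="10,110 {10 + 10 * a},110 10,{110 - 10 * b}" '
        'fill="none" stroke="black"/></svg>'
    )
    instruction = (
        f"A right triangle has legs of length {a} and {b}. "
        "Find the square of the hypotenuse."
    )
    # Solvability check: the answer must be derivable from the parameters,
    # so the image and the text can never disagree.
    answer = a * a + b * b
    return {"svg": svg, "params": {"a": a, "b": b},
            "instruction": instruction, "answer": answer}

item = make_triangle_item(0)
```

Because the image, the question, and the answer are all deterministic functions of the sampled parameters, the pair is pure by construction, and re-sampling the seed yields unlimited targeted variants for augmentation.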