Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.
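As a rough illustration of the verification-guided reflection loop described above, the following minimal Python sketch runs a draft problem through three ordered checks, attempts a repair on the first failure, and rejects the draft when no repair is available. It is a toy under stated assumptions, not the VeriGeo implementation: the problem representation (a dict describing a triangle perimeter question), the three check functions, `toy_repair`, and `generate_verified` are all hypothetical names introduced here, and the checks only loosely mirror the numerical, realizability, and global consistency stages.

```python
from typing import Callable, Optional, Tuple

# Hypothetical toy representation: a triangle perimeter problem stored as a dict.
Problem = dict
Failure = Tuple[str, str]  # (stage name, diagnostic message)


def check_numeric(p: Problem) -> Optional[str]:
    """Toy stand-in for numerical consistency: the stated answer must match the sides."""
    if abs(sum(p["sides"]) - p["answer"]) > 1e-9:
        return "stated perimeter does not match the side lengths"
    return None


def check_realizable(p: Problem) -> Optional[str]:
    """Toy stand-in for realizability: the sides must form a non-degenerate triangle."""
    a, b, c = sorted(p["sides"])
    if a + b <= c:
        return "side lengths violate the triangle inequality"
    return None


def check_global(p: Problem) -> Optional[str]:
    """Toy stand-in for global consistency: the statement must mention the target concept."""
    if p["concept"] not in p["statement"]:
        return "statement does not mention the target concept"
    return None


STAGES = [
    ("numerical", check_numeric),
    ("realizability", check_realizable),
    ("global", check_global),
]


def first_failure(p: Problem) -> Optional[Failure]:
    """Run the stages in order and return the first failure, or None if all pass."""
    for name, check in STAGES:
        message = check(p)
        if message is not None:
            return name, message
    return None


def toy_repair(p: Problem, failure: Failure) -> Optional[Problem]:
    """Assumed repair policy: only a wrong stated answer is treated as recoverable."""
    stage, _ = failure
    if stage == "numerical":
        return {**p, "answer": sum(p["sides"])}
    return None  # anything else is treated as unrecoverable in this toy


def generate_verified(
    draft: Problem,
    repair: Callable[[Problem, Failure], Optional[Problem]],
    max_rounds: int = 3,
) -> Optional[Problem]:
    """Verify a draft, repairing on failure; return None if it must be rejected."""
    p = draft
    for _ in range(max_rounds):
        failure = first_failure(p)
        if failure is None:
            return p
        p = repair(p, failure)
        if p is None:
            return None
    return p if first_failure(p) is None else None


recoverable = generate_verified(
    {"statement": "Find the perimeter of triangle ABC.", "concept": "perimeter",
     "sides": (3, 4, 5), "answer": 13},
    toy_repair,
)
unrecoverable = generate_verified(
    {"statement": "Find the perimeter of triangle ABC.", "concept": "perimeter",
     "sides": (1, 1, 5), "answer": 7},
    toy_repair,
)
```

In this sketch the first draft fails only the numerical stage and is repaired, while the second fails the realizability stage and is rejected. In a real system the repair step would presumably feed the diagnostic back to an LLM agent rather than patch a field directly.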