Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Recent attempts construct auxiliary lines via code-driven rendering, a strategy that relies on accurate, executable code generation to produce visual renderings of the auxiliary lines for subsequent reasoning. In complex solid-geometry settings, however, this strong dependence on precise specifications substantially limits the strategy's robustness. We instead turn to a simpler and more stable solution: representing auxiliary-line constructions as structured textual descriptions. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. Its core is a cross-modal reward model that evaluates how well a generated auxiliary-line description matches the ground-truth auxiliary-line diagram. This reward signal drives a GRPO-based RL stage that yields informative auxiliary-line descriptions for downstream reasoning. To support training and evaluation, we develop a scalable data pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. Based on this framework, we derive GeoVLMath, an LVLM for solving complex solid-geometry problems.
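The GRPO-based RL stage mentioned above converts reward-model scores into a policy-update signal by normalizing rewards within each sampled group. A minimal sketch of that group-normalized advantage computation (hypothetical illustration; the function name, the epsilon term, and the example scores are assumptions, not the paper's code):

```python
# Sketch of a GRPO-style group-normalized advantage: for each group of
# sampled auxiliary-line descriptions, the cross-modal reward model scores
# each one, and advantages are rewards standardized within the group.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standardize per-sample rewards within one sampled group.

    rewards: scores from the reward model for G sampled descriptions.
    eps: small constant (an assumed choice) to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for 4 sampled descriptions of one problem.
scores = [0.9, 0.4, 0.6, 0.1]
advantages = grpo_advantages(scores)
```

Descriptions scoring above the group mean receive positive advantages and are reinforced; those below the mean are suppressed, without needing a separate learned value function.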