Geometry problem solving (GPS) is a challenging mathematical reasoning task that requires multimodal understanding, fusion, and reasoning. Existing neural solvers treat GPS as a vision-language task but fall short in representing geometry diagrams, which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, which integrates two new modules: a multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to model global relationships and point-match pre-training (PMP) to align visual points with textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion, further boosting the layout awareness of LANS. Extensive experiments on the Geometry3K and PGPS9K datasets validate the effectiveness of the layout-aware modules and the superior problem-solving performance of LANS over existing symbolic and neural solvers. The code will be made publicly available soon.
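The abstract does not specify how the layout-aware attention mask is constructed, so the following is only a minimal sketch of the general idea of point-guided masking, not the LANS implementation. It assumes, purely for illustration, that each textual point token carries a normalized diagram coordinate and may attend only to image patches whose centers lie within a fixed radius of that coordinate, while all other text tokens attend to every patch. The function names (`patch_centers`, `layout_mask`, `masked_softmax`), the `radius` parameter, and the masking rule itself are hypothetical.

```python
import math

def patch_centers(grid_h, grid_w):
    """Centers of grid_h x grid_w image patches in normalized [0, 1] coordinates (row-major)."""
    return [((c + 0.5) / grid_w, (r + 0.5) / grid_h)
            for r in range(grid_h) for c in range(grid_w)]

def layout_mask(token_points, centers, radius):
    """mask[i][j] is True when text token i may attend to image patch j.

    Assumed rule (not taken from the paper): a point token sees only patches
    whose center lies within `radius` of the point; a token with no associated
    point (None) sees every patch.
    """
    mask = []
    for pt in token_points:
        if pt is None:
            mask.append([True] * len(centers))
        else:
            mask.append([math.dist(pt, c) <= radius for c in centers])
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax over scores; masked-out entries receive zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, a in zip(row, allowed) if a]
        if not kept:
            # No visible patch for this token: return an all-zero row.
            out.append([0.0] * len(row))
            continue
        m = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy example: a 2x2 patch grid, one ordinary token and one point token near the top-left corner.
centers = patch_centers(2, 2)
token_points = [None, (0.2, 0.2)]
mask = layout_mask(token_points, centers, radius=0.3)
scores = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
weights = masked_softmax(scores, mask)
```

In this toy setting the ordinary token spreads its attention uniformly over all four patches, whereas the point token is restricted to the single patch that contains its coordinate, regardless of the raw scores. In a real model the same Boolean mask would typically be applied to the attention logits of a cross-modal attention layer before the softmax.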