This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and human-aligned evaluation. To address these challenges, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical and syntactic similarity but struggle with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting the challenges of using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.